A Novel ST-YOLO Network for Steel-Surface-Defect Detection

Deep-learning-based methods have recently advanced defect detection, but formidable obstacles remain. Defect images have rich semantic levels and diverse morphological features, and the model changes dynamically as learning proceeds. In response to these issues, this article proposes a shunt feature fusion model (ST-YOLO) for steel-defect detection, which uses a split feature network structure and a self-correcting transmission allocation method for training. The network structure is designed to specialize the computation of the classification and localization tasks according to their different computing needs. By using the self-correction criteria of adaptive sampling and dynamic label allocation, a larger number of high-quality samples are utilized to adjust the data distribution and optimize the training process. Our model outperformed the baseline on the NEU-DET and GC10-DET datasets.


Introduction
Steel sheets are a fundamental component of many manufactured goods and hence must be of the highest quality. Imperfections in the steel sheet lower its aesthetic value and shorten its lifespan [1]. Steel-defect detection is conducted to make sure the final product is of high quality and to find out what went wrong. Metal sheets play a very important role in the field of mechanical processing and are indispensable raw materials. The surface quality of metal sheets is an important standard that determines their price. Due to limitations in current equipment and process conditions, different forms and types of defects are inevitably generated on the surface of metal plates, and the resulting defects vary greatly in size and quantity. Common defects include mesh patterns, inclusions, markings, surface pitting, iron oxide skin indentation, scratches, etc. Surface defects often become the starting point of metal corrosion. The presence of surface defects on steel plates can greatly reduce the fatigue strength of parts, affect the performance and lifespan of machines and instruments, and ultimately affect the performance and quality of products. Therefore, timely detection of small defects on the surface of steel plates and evaluation of their severity are of great significance for improving the surface quality and direct economic benefits of steel plates. With the application of computer vision in automated optical inspection (AOI), the problems of high labor costs, strong subjective interference, and low efficiency of manual inspection have been addressed, and the accuracy and speed of inspection have been improved [2].
Defect detection is typically broken down into two stages: locating the problem and quantifying its severity. Both are analogous to computer vision classification and localization tasks [3]. Metal surface flaws can be discovered with the help of image processing. The recognition of defects and their type is critical to the analysis of process problems, and the precise position and area percentage of defects are required for quantitative evaluation of product surface quality. Traditional methods based on manually designed features commonly utilize a pipeline consisting of a series of components to perform defect detection. The process generally includes image pre-processing, region of interest (RoI) searching, feature extraction, feature selection, and pattern recognition. In most of this work, feature extraction relies on hand-crafted features, including statistical methods [4,5], spectral methods [6,7], and model-based methods [8,9]. However, these approaches rely on a great deal of prior knowledge and design experience from experts [10]. In addition, the components of traditional methods are relatively independent of each other, and it is difficult to achieve global optimization in an end-to-end manner.
Convolutional-neural-network architectures are widely used in deep learning methods for defect identification because of their ability to automatically learn features that interpret images. They optimize feature extraction through gradient updates during training, which allows for more precise feature calculation. Because they are data-driven, they can be used in a variety of contexts and are resilient to a wide range of material surfaces and defect types. These methods have come a long way in terms of speed and accuracy of flaw detection in recent years [11].
Object detection algorithms are widely employed in defect detection [12] due to their low cost of annotation and high discrimination precision. In most cases, these approaches use deep learning models to pinpoint problem areas and classify defects. But current methods are not without their own set of problems. First, most general object detectors are designed for the detection task of natural scenes; unlike natural objects with clear outlines, some discrete defects do not have closed overall outlines in the image, as compared in Figure 1a,b. Second, the image features of defects are rich in hierarchical levels, and the low-level features, which are not well used in most general detectors, also play a critical role [13]. Therefore, the model structure of general object detectors is not well suited to processing industrial images. Defect detectors also need better use of semantic features across several layers and a model structure optimized for learning classification and localization. Third, as shown in Figure 1c, defects occur in a wide range of sizes and shapes, so a single sampling technique cannot account for all of them. Finally, the complexity of feature description varies between defects, and the number and quality of the corresponding adaptable samples generated from the model also vary. Moreover, as training progresses, the degree to which the model fits the data gradually increases; thus, the selection criteria of the samples also need to change accordingly. Therefore, a dynamic training algorithm with high flexibility is required to supervise the training of the model and adapt to the changeable data distribution of defects.
This work proposes a model for steel-surface-defect identification called ST-YOLO to address the aforementioned issues. Our model employs a shunt feature fusion network and a self-tuning training strategy for label assignment. The main contributions are as follows:
(1) To accomplish classification and localization, a shunt feature fusion network has been developed. The network's fitting capability is optimized by tailoring the calculation methods for each task to the varying semantic levels of the two tasks.
(2) An adaptive core prior is proposed to suit defects of different shapes and to extract sample points flexibly.
(3) A self-tuning label assignment strategy is established. Based on the adaptive core prior, the strategy selects high-quality training samples using dynamic criteria involving category and location fitness, and then assigns appropriate target labels to the selected samples.
The rest of this paper proceeds as follows. The newest developments in object detection methods and the related field of defect detection are presented in Section 2. The proposed model framework and method are presented in Section 3. In Section 4, we conduct experiments to evaluate ST-YOLO and compare the results with those obtained using alternative models. In Section 5, we show how the network structure and training mechanism actually work. Section 6 provides the final analysis.

Deep-Learning-Based Defect Detection
Detection methods based on deep learning process images using learned features instead of designed features, contributing to outstanding performance in automatic defect detection.
Among the inspection methods based on deep learning, the research on anchor-based object detectors is relatively well developed. In most cases, the defect region is treated as the detection object, and anchor boxes are used to indicate where defects are located. Cascade heads were used in Faster R-CNN by Liu et al. [15] to detect surface flaws in metal. Cheng et al. [16], building on RetinaNet, fused features using a channel attention mechanism and ASFF to identify flaws in steel surfaces. Ma et al. [17] improved the speed of YOLOv4 by adding an attention module.
However, anchor-based object detectors are limited by the prior anchor configuration; thus, more and more studies prefer anchor-free detectors. The anchor-free mechanism does not require prior boxes to be manually designed in advance, which helps to achieve end-to-end applications. Moreover, because a great number of dense candidate boxes need not be generated, the detection speed can be improved. Kou et al. [18] transformed YOLOv3 into an anchor-free model, which increased the detection speed for strip defects. To better detect steel-surface flaws, Tian et al. [19] presented DCC-CenterNet, which uses dilated convolution to enhance features and center weights to enhance heatmaps. To reduce the loss of defect features during fusion, Yu et al. [20] proposed a bidirectional feature fusion network based on FCOS.
Unfortunately, there are still shortcomings in the existing approaches. First, some defects (such as cracks and spots) with ambiguous boundaries make it difficult for the model to determine the defect region, but little research has been undertaken on the model structure to improve the feature learning process for such defects. Second, the diversity of defects leads to large variations in the data distribution, which is still not solved by end-to-end learning in most of the works. Third, the training samples of the model change with the training process, but most algorithms adopt static training rules, which limit the learning ability of the model.

Advances in Object Detection
Object detection methods perform region-level recognition: they recognize the objects in an image and simultaneously detect their category, location (including position and size), and other information [21]. Classification and localization are the two tasks of object detection, and their fitting processes are similar but differ in some aspects. Early object detectors such as Faster R-CNN [22], DCN [23], YOLOv3 [24], etc., shared the computation in the feature extraction and feature fusion stages, and the classification and localization results were not calculated separately until the last layer of the detection head. In recent years, some works [25-27] have refocused on the divergence between the two tasks, replacing sibling heads with parallel heads based on shared features, such as FCOS [28], FoveaBox [29], TSD [25], etc. Because the features that the two tasks are sensitive to are quite different, the features need to be processed differently. A network design that computes the two tasks separately allows each task to optimize its own sub-network during training, which reduces the interference of coupled optimization. Furthermore, the features of interest can be processed individually in the network structure during inference.
With the emergence of anchor-free detectors, the pattern of training-label assignment is starting to attract attention. The widely used dense prediction method usually divides candidate samples into positive and negative samples and then matches target labels to the positive samples to complete the training. Therefore, the training strategy determines the upper limit of the model performance. To automatically separate training samples into positive and negative subsamples based on statistical features, Zhang et al. [30] suggested an adaptive training sample selection (ATSS) technique. Different from traditional detectors such as RetinaNet [31], which adopt a predetermined fixed threshold of IoU (Intersection over Union) to filter positives, ATSS determines a dynamic threshold from the overall IoU statistics. In addition, some algorithms are designed for co-optimization of classification and localization. Kim et al. [32] proposed a probabilistic anchor assignment method (PAA) that fits the distribution of the training loss of candidates, using a Gaussian mixture model to determine a threshold that distinguishes positive and negative anchors. Ge et al. [33] proposed an optimal-transport-based assignment method (OTA), which calculates the summed cost of classification and regression and then treats the assignment as an optimal transport problem. By adjusting the threshold, these methods help the model adapt to the appropriate matching strategy for different data distributions and different training stages. At the same time, the limitation imposed by hyperparameters on training is greatly reduced.
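The dynamic threshold used by ATSS can be illustrated with a minimal sketch. It assumes the commonly described rule in which the threshold for one ground-truth box is the mean plus the standard deviation of the IoUs of its candidate anchors. The function names are hypothetical, the population standard deviation (NumPy's default) is an assumption, and the additional ATSS requirement that a positive anchor's center lie inside the ground-truth box is omitted.

```python
import numpy as np

def atss_threshold(candidate_ious):
    """Dynamic IoU threshold for one GT: mean plus standard deviation."""
    ious = np.asarray(candidate_ious, dtype=float)
    # Assumption: population standard deviation (NumPy default, ddof=0).
    return float(ious.mean() + ious.std())

def select_positives(candidate_ious):
    """Boolean mask of candidates whose IoU reaches the dynamic threshold."""
    ious = np.asarray(candidate_ious, dtype=float)
    return ious >= atss_threshold(ious)

# Hypothetical IoUs between one GT box and five candidate anchors.
ious = [0.10, 0.20, 0.30, 0.70, 0.80]
mask = select_positives(ious)
```

Because the threshold follows the IoU statistics of each ground-truth box, a box whose candidates all overlap poorly still receives some positives, which a fixed threshold would not guarantee.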
This paper takes YOLOX as the baseline, which adopts a decoupled head and training methods such as multiple positive sampling and a simplified OTA method (SimOTA), and achieves a high level of accuracy and speed. Based on YOLOX, we further propose a shunt network architecture for the dual tasks and a self-tuning training strategy for label assignment, which are applied to surface-defect detection.

Methodology
Figure 2 provides an overview of the ST-YOLO architecture. First, the image features are extracted via the CSPDarkNet53 [34] backbone network, and feature maps are taken from its various stages. The shunt fusion network (SFN) then uses separate fusion flows to produce localization and classification feature pyramids. Finally, we aggregate the results of classification, regression, and objectness to derive the predictions for each location. To train the network, we first apply self-tuning transport assignment (STTA) to pair predicted samples with their correct labels, then compute the loss of the positive samples, and finally backpropagate the gradient to update the network. During inference, defect categories and confidence scores are predicted after post-processing such as non-maximum suppression (NMS).
Compared with the baseline, our model improves the neck of the network. Additionally, STTA, an enhancement of SimOTA, allows the sampling criteria to be adjusted synchronously according to the geometry of the GT and the degree of difficulty in fitting.

Shunt Fusion Network
Categories and locations have distinct semantic levels, which creates the problem of incompatible computing needs. Although the default YOLOX computes features separately in each head, the same feature map is fed into both heads. Since the features provided to the dual tasks are identical at the semantic level, the baseline does not perform a full differential computation. To address this shortcoming, we generate distinct feature maps for the classification and localization heads of the model.
The semantic level of the final stage of fusion in the feature fusion network is higher than that of the initial stage. As can be seen in Figure 2, the fusion uses a top-down approach in the localization sub-branch. The localization head receives features fused from the backbone network and from the deeper levels. These are primitive features such as texture, edge, and shape, all of which rely on low-level detail. The feature map M_l is derived using Equations (1) and (2). Bottom-up information flow characterizes the fusion mode in the classification sub-branch. Deep features are created by applying a channel transformation to the features and fusing them with the preceding feature maps. These features, also known as logical features, include category attributes and other forms of semantic data. The feature map N_l is calculated by Equations (3) and (4).

CSP(•) denotes computation by the CSP layer [35], whose pipeline is depicted in Figure 3. Downsample(•) is a Basic conv (3 × 3/2) pipeline, while Upsample(•) is nearest-neighbor interpolation. Transit(•) uses Basic conv (3 × 3/1) to compress the channels of features. To obtain M_l, F_l is sent to the CSP layer directly to preserve shallow information, and the information of the deeper M_{l+1} is refined by the transition as a supplement. To obtain N_l, Transit(M_l), which has already been refined with deep information, is concatenated with the deepened feature extracted from the shallower level.
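Equations (1)-(4) themselves are not reproduced in this text. The block below is a hedged reconstruction inferred only from the verbal description of the two fusion flows, not the authors' exact formulas. The concatenation operator Concat, the boundary cases at the deepest level L and the shallowest level 1, and the order of the concatenated terms are all assumptions.

```latex
\begin{align}
M_L &= \mathrm{CSP}(F_L) \\
M_l &= \mathrm{CSP}\bigl(\mathrm{Concat}\bigl(F_l,\ \mathrm{Upsample}(\mathrm{Transit}(M_{l+1}))\bigr)\bigr) \\
N_1 &= M_1 \\
N_l &= \mathrm{CSP}\bigl(\mathrm{Concat}\bigl(\mathrm{Downsample}(N_{l-1}),\ \mathrm{Transit}(M_l)\bigr)\bigr)
\end{align}
```

Under this reading, the localization pyramid {M_l} is built top-down, and the classification pyramid {N_l} is built bottom-up on top of it, so N_l has passed through more convolution stages than M_l at the same level.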
In forward computing of the network, detailed shallow features close to the backbone network are provided for localization, and abstract deep information after multiple convolution calculations is provided for classification. Meanwhile, the goals of the two sub-tasks can focus on training certain components of the network independently, thereby increasing the network's computational capacity.

Adaptive Core Prior
For the resulting map O_l from the network at level l, with a size of W_l × H_l × (C + 4 + 1), an anchor point on it can be denoted as a_l = (x, y), where x ∈ [1, W_l] and y ∈ [1, H_l]. W_l and H_l are the width and height of the map. The information at each anchor point includes C classification scores, four regression parameters, and an objectness score. The regression parameters are denoted as t_x, t_y, t_w, and t_h. In the anchor-free mechanism, the predicted box p_box(x,y) of the anchor point is obtained by regression and is represented by the center coordinates x_c and y_c together with the width w(x,y) and height h(x,y) of the prediction box. s_l indicates the scaling ratio of the map O_l relative to the original image.
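The regression equations are not reproduced in this text. As a hedged illustration, the sketch below uses the decoding commonly used by YOLOX-style anchor-free heads: the center is the grid position plus a predicted offset, scaled by the stride, and the size is the exponential of the predicted value, scaled by the stride. Whether the paper's equations take exactly this form, and whether grid indices start at 0 or 1, are assumptions; `decode_box` is a hypothetical helper name.

```python
import math

def decode_box(x, y, t_x, t_y, t_w, t_h, stride):
    """Decode regression outputs at anchor point (x, y) into (x_c, y_c, w, h).

    stride plays the role of s_l, the scaling ratio of the map O_l
    relative to the original image; the result is in image pixels.
    """
    x_c = (x + t_x) * stride      # center x: grid position plus offset
    y_c = (y + t_y) * stride      # center y
    w = math.exp(t_w) * stride    # width is always positive
    h = math.exp(t_h) * stride    # height is always positive
    return x_c, y_c, w, h

# Hypothetical example: anchor point (3, 5) on a map with stride 8.
box = decode_box(x=3, y=5, t_x=0.5, t_y=0.25, t_w=0.0, t_h=math.log(2.0), stride=8)
```

The exponential keeps the predicted width and height positive without constraining the raw network outputs.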
In order to alleviate the imbalance between positive and negative samples and enhance the stability of early training, the adaptive core prior is developed to perform the initial screening of anchor points. After the core area is framed in the ground-truth (GT) box, the anchor points inside both the core area and the GT box are marked as potential positive points, which are assigned lower training costs. In contrast, the anchor points outside the core area or the GT box are assigned higher training costs. The details of the cost calculation are elaborated in Section 3.3.
Unlike objects in natural scenes, defects frequently appear in extreme shapes. Figure 4 depicts the aspect-ratio distribution of the GT boxes in the NEU-DET training set. When the aspect ratio is in the interval [0.667, 1.5), the bounding box of the GT can be considered close to 1:1, and such GTs account for only 30.9% of the total. A large proportion of bounding boxes are thin or flat. Therefore, instead of applying the fixed square center prior of the original YOLOX, we set the core area with a corresponding ratio by calculating the aspect ratio of the GT boxes, as shown in Equations (10) and (11). The width and height of the GT box are denoted as w_g and h_g. The area of the core, A_core, is set to 24, a value determined empirically.
Sensors 2023, 23, x FOR PEER REVIEW
Figure 4. Aspect-ratio distribution of the GT boxes in the NEU-DET training set (vertical axis: proportion of defect bounding boxes, %).
This method adapts the core area to GTs of different shapes, thus covering more potential positive anchor points than the original multi-positives method. As shown in the example in Figure 5, for the anchor points {a} generated from the result map, both methods select the points inside both the core area and the GT. With the multi-positives method, fewer anchor points are selected, because the yellow points that lie within the square center area but outside the GT are filtered out. The adaptive core prior, however, preserves more points that lie both inside the core area and inside the GT. Plentiful positives make the training more adequate. The method is integrated into the training process, which enables the model to learn to accommodate defects of different shapes in end-to-end training.
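Equations (10) and (11) are not reproduced in this text. The sketch below is one plausible reading of the adaptive core prior, under stated assumptions: the core is a rectangle centered on the GT box with area A_core and the same aspect ratio as the GT, and A_core = 24 is measured in grid cells, so that the core is scaled by the stride of the feature level. The function names are hypothetical.

```python
import math

A_CORE = 24.0  # core area from the text; assumed to be measured in grid cells

def core_size(w_g, h_g, area=A_CORE):
    """Return (w_core, h_core) with the GT aspect ratio and the given area."""
    ratio = w_g / h_g
    w_core = math.sqrt(area * ratio)
    h_core = math.sqrt(area / ratio)
    return w_core, h_core

def potential_positives(points, gt_box, stride):
    """Keep anchor centers that lie inside both the GT box and the core area.

    points: iterable of (px, py) anchor centers in image pixels.
    gt_box: (cx, cy, w_g, h_g) in image pixels.
    """
    cx, cy, w_g, h_g = gt_box
    w_core, h_core = core_size(w_g, h_g)
    # Both rectangles share the GT center, so their intersection is the
    # rectangle with the smaller half-extent along each axis.
    half_w = min(w_g, w_core * stride) / 2.0
    half_h = min(h_g, h_core * stride) / 2.0
    return [(px, py) for px, py in points
            if abs(px - cx) <= half_w and abs(py - cy) <= half_h]

# Hypothetical flat defect: 160 x 20 pixels, centered at (100, 100), stride 8.
kept = potential_positives([(60, 100), (100, 108), (160, 100)],
                           (100, 100, 160, 20), stride=8)
```

In this example the point (60, 100) is kept even though it lies 40 pixels from the center, whereas a square prior with a radius of 2.5 strides (20 pixels) would reject it; this is the behavior the text describes for thin or flat GT boxes.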

Self-Tuning Transport Assignment
Dense prediction produces many candidate predictions for each image; thus, it is important to select the best training samples. The training cost is established to quantify the learning difficulty of a training sample according to its similarity to the target label. If too few samples are selected as positives, the number of training samples is too small and the supervision is insufficient. If all samples are considered positive, then the optimization process will center on the many low-quality samples, which overwhelm the few high-quality ones.
Since the size and shape of each defect label affect the distribution of candidate samples that match the GT, it is important to be able to divide samples flexibly into positive and negative categories. Meanwhile, it is important to pair each positive sample with an appropriate GT as its training target. Therefore, a self-tuning transport assignment (STTA) is proposed based on the OTA [33] technique. The number of positive samples is determined dynamically using the GT dimension, and the matching relationship between the sample set and the label set is then established.
Instead of gauging the relative cost difference between samples for each GT, the SimOTA approach simply uses the sum of IoUs (the dynamic k) as the number of samples for each GT. As a result, certain samples are wrongly included among the k positive samples even though the cost gap between them and the other positives is substantial. STTA takes into account the cost disparity between individual samples in addition to the overall level of the sample set. Incorporating both the classification quality and the localization quality of the training samples reduces the overall training cost. Dynamic criteria can then be used to filter out low-quality training samples, allowing the training of the network to be fine-tuned.
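For reference, the baseline dynamic-k rule that STTA builds on can be sketched as follows. This is a minimal NumPy illustration of SimOTA-style assignment as commonly implemented for YOLOX (sum of the top candidate IoUs, truncated to an integer and kept at least 1), not of STTA itself, whose cost-gap criterion is not specified in this section. The function name and the candidate count of 10 are assumptions.

```python
import numpy as np

def simota_assign(cost, ious, n_candidates=10):
    """Baseline SimOTA-style assignment.

    cost, ious: arrays of shape (num_gt, num_anchors).
    Returns a boolean matrix of the same shape marking positive matches.
    """
    num_gt, num_anchors = cost.shape
    k_top = min(n_candidates, num_anchors)
    # Dynamic k: sum of the largest IoUs for each GT, at least 1.
    top_ious = np.sort(ious, axis=1)[:, ::-1][:, :k_top]
    dynamic_k = np.maximum(top_ious.sum(axis=1).astype(int), 1)

    matching = np.zeros(cost.shape, dtype=bool)
    for g in range(num_gt):
        lowest = np.argsort(cost[g])[:dynamic_k[g]]
        matching[g, lowest] = True

    # An anchor matched to several GTs keeps only its lowest-cost GT.
    conflict = matching.sum(axis=0) > 1
    if conflict.any():
        best_gt = cost[:, conflict].argmin(axis=0)
        matching[:, conflict] = False
        matching[best_gt, np.flatnonzero(conflict)] = True
    return matching

# Hypothetical costs and IoUs for 2 GT boxes and 4 anchors.
cost = np.array([[0.1, 0.2, 0.6, 0.9], [0.9, 0.15, 0.5, 0.1]])
ious = np.array([[0.9, 0.8, 0.5, 0.1], [0.1, 0.6, 0.7, 0.9]])
matching = simota_assign(cost, ious)
```

Note that the k lowest-cost samples are taken regardless of how far apart their costs are, which is the weakness the text attributes to SimOTA.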

Experiments

Datasets and Metrics
Our proposed model is evaluated on two datasets, NEU-DET and GC10-DET. NEU-DET [36] is a dataset of steel surfaces consisting of 1800 images of hot-rolled steel strips. The images depict six types of defects: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Each image is 200 × 200 pixels. A total of 1440 images are used as the training set and 360 as the test set in the experimental settings.
To measure the precision of the model, we use the COCO metric mAP [38]. The average precision (AP) is obtained by taking the integral of the P-R curve, which is obtained by computing the precision and recall of the detection results at varying confidence thresholds. The mean AP (mAP) is then calculated by averaging the APs of all categories. The COCO criterion calculates mAP under a series of IoU thresholds (from 0.5 to 0.95 in steps of 0.05) and takes the average value, denoted as mAP@[0.5:0.95]; this indicator is more comprehensive. In addition, when mAP is calculated under the IoU threshold of 0.5, it is denoted as mAP@0.5, which is commonly used in defect detection.
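The AP computation described above can be sketched as follows for one category at one IoU threshold. This is a minimal illustration that integrates the P-R curve over all recall points after making precision monotonically non-increasing. The official COCO tooling instead samples the curve at fixed recall points, so values may differ slightly. The function names and the example detections are hypothetical.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class at one IoU threshold.

    scores: confidence of each detection.
    is_tp:  1 if the detection matches a GT at the chosen IoU threshold, else 0.
    num_gt: number of ground-truth boxes of this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve and make precision monotonically non-increasing.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangle areas at the points where recall changes.
    idx = np.flatnonzero(r[1:] != r[:-1])
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-category APs."""
    return float(np.mean(ap_per_class))

# Three detections sorted by confidence: hit, miss, hit; two GT boxes.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

mAP@[0.5:0.95] would repeat this computation with `is_tp` re-evaluated at each IoU threshold from 0.5 to 0.95 and average the ten resulting mAP values.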
Frames per second (FPS) is used to measure how quickly a model can process images. It is computed from the inference time of the model for a single image (including NMS).
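A minimal timing sketch of this measurement is given below. The callable `infer` is a hypothetical stand-in for a full forward pass plus NMS; on a GPU, the device would also need to be synchronized before reading the clock, which is omitted here.

```python
import time

def measure_fps(infer, images):
    """Average frames per second of `infer`, called on one image at a time."""
    if not images:
        raise ValueError("need at least one image")
    start = time.perf_counter()
    for image in images:
        infer(image)  # assumed to include post-processing such as NMS
    elapsed = time.perf_counter() - start
    return len(images) / elapsed
```

For example, an average per-image time of 20 ms corresponds to 50 FPS.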

Implementation Details
The implementation settings of ST-YOLO are as follows. Since the downsampling factor of our model is 32, the input size is set to 224 × 224 for the NEU-DET dataset and to 512 × 256 for the GC10-DET dataset. Before training, the CSPDarkNet-53 weights pre-trained on ImageNet are loaded into the backbone network. Training proceeds in two phases: the backbone network is frozen in the first phase and unfrozen in the second. The batch size is 16. In the freezing phase, the initial learning rate is 0.001 and training runs for 150 epochs. In the unfreezing phase, the initial learning rate is 0.0001 and training runs for 100 epochs. The Adam optimizer is used, and training is stabilized by a learning-rate warm-up. The learning rate schedule applies exponential decay with a ratio of 0.97 per epoch.
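The two-phase schedule can be sketched in a few lines of Python. The base learning rates, epoch counts, and decay ratio come from the settings above; the linear shape and three-epoch length of the warm-up, and the choice to restart warm-up and decay at the beginning of the unfreezing phase, are assumptions, since the text does not specify them.

```python
def learning_rate(epoch, base_lr, warmup_epochs=3, decay=0.97):
    """Learning rate at a given epoch within one training phase.

    warmup_epochs and the linear warm-up shape are assumed values.
    """
    if epoch < warmup_epochs:
        # Linear warm-up towards base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Exponential decay by a fixed ratio per epoch after warm-up.
    return base_lr * decay ** (epoch - warmup_epochs)

def two_phase_schedule():
    """Per-epoch learning rates: 150 frozen epochs, then 100 unfrozen epochs."""
    frozen = [learning_rate(e, 1e-3) for e in range(150)]
    unfrozen = [learning_rate(e, 1e-4) for e in range(100)]
    return frozen + unfrozen
```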
The same training strategy is used for the other methods, including backbone weights pre-trained on ImageNet, two-phase training (frozen and unfrozen), and the warm-up and decay schedule. Because the computational cost differs between models, the batch size is adjusted accordingly, and the initial learning rate is scaled with it. Training terminates when the accuracy on the test set converges, i.e., when the standard deviation over every 10 epochs falls below 0.005 for mAP@0.5 and below 0.001 for mAP@[0.5:0.95].
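The termination condition can be expressed as a simple check on the recent accuracy history. The sketch below is one possible reading: it uses the population standard deviation over the last 10 recorded epochs and assumes that accuracy is stored on the same scale as the tolerances. Both choices are assumptions, and the function name is hypothetical.

```python
import statistics

def has_converged(history, window=10, tol=0.005):
    """True if the last `window` accuracy values vary by less than `tol`."""
    if len(history) < window:
        return False
    return statistics.pstdev(history[-window:]) < tol

def should_stop(map50_history, map5095_history):
    """Stop only when both metrics have converged under their own tolerance."""
    return (has_converged(map50_history, tol=0.005)
            and has_converged(map5095_history, tol=0.001))
```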
All experiments are run on a GeForce RTX 3060 GPU. FPS is measured on the test set, one image at a time.

Experiment Results
S-YOLO refers to the SFN-structured model, and ST-YOLO refers to the S-YOLO model trained with STTA. Tables 1 and 2 report the mAP@0.5 and the per-category AP@0.5 of the proposed model on the NEU-DET and GC10-DET test sets, respectively (with the evaluation threshold set to IoU = 0.5). Figures 6 and 7 show a few examples of the defects detected by ST-YOLO. The blue boxes indicate the model's predictions, and the tags at their top-left corners give the predicted defect type and the confidence of the prediction. The green boxes show the ground-truth labels of the defects, with the defect type at the upper right. On the NEU-DET dataset, the S-YOLO model with SFN outperforms the baseline YOLOX by 2.2 mAP@0.5. ST-YOLO trained with STTA then achieves 80.3 mAP@0.5, an improvement of 3.2 mAP@0.5 over the baseline. Table 1 shows significant progress in the detection of difficult discrete-type defects such as crazing and rolled-in scale.
As shown in Figure 6, the detector can correctly identify the defect category. In an environment with poor lighting and low contrast, the detector still has good detection ability for the collected images, as illustrated in Figure 6b,e. As can be seen in Figure 6a,e, the shunt fusion structure allows the model to accurately distinguish and characterize defect locations in the presence of discretely distributed defect groups. Small defects are also easily identified by the detector, as shown in Figure 6b,c. Figure 6b,f demonstrates how the adaptive core prior adjusts the bounding box so that it closely encloses irregularly shaped defects such as inclusions and scratches. In cases where multiple types of defects coexist or even overlap, the algorithm remains competent, as depicted in Figure 6d,f.
In the experiment on the GC10-DET dataset, the SFN improves on the baseline YOLOX by 1.5 mAP@0.5. After training with STTA, ST-YOLO improves by a further 1.2 mAP@0.5 to reach 72.8. In terms of the categories shown in Table 2, the shunt fusion network architecture yields a large improvement in the detection of discrete defects such as silk spot. The adaptive core prior adapts better to elongated defects (such as welding lines) and makes their localization more accurate.
Figure 7a,c shows that the model can accurately locate defects of small size and extreme aspect ratio (such as punching holes, oil spots, and welding lines). For speckled-texture defects (such as silk spot and inclusion), the model shows promising feature-extraction ability, as shown in Figure 7d,e. Although some non-salient defects blend into the background, the model is robust and can still detect them (such as water spots and waist folding), as shown in Figure 7b,h. It is worth noting that the detector can also detect small defects omitted from the dataset annotations, such as the water spot on the left side of Figure 7e.

Comparison with the State-of-the-Art Methods
To test the performance of the model more comprehensively, the proposed model is compared with existing mainstream detection models in terms of accuracy, inference speed, and parameter scale. The compared models include anchor-based detectors (Faster R-CNN, RetinaNet, and YOLOv4) and anchor-free detectors (FCOS, CenterNet, and YOLOX). The results of the different methods on the same test sets are shown in Tables 3 and 4. The proposed ST-YOLO achieves the highest accuracy, with steady improvement over the baseline on both the NEU-DET and GC10-DET datasets, while remaining similar to YOLOX in speed and scale. In addition, STTA is only performed during training and does not affect the inference speed or parameter scale of the model.
From the perspective of the anchor mechanism, the anchor-free methods generally outperform the anchor-based methods in terms of accuracy and speed. Because the shape and size of defects vary greatly, the pre-defined anchors fit the data poorly when the bounding boxes are regressed. As a result, the flexible anchor-free mechanism has a greater advantage in accuracy. Furthermore, the anchor-based methods generate a large number of predicted boxes, which adds computational burden and makes them slower.
Among the anchor-based detectors, Faster R-CNN (with FPN) reaches the highest mAP because its two-stage architecture fine-tunes the candidates in the second stage. However, this second computation also makes it slower than the one-stage models (RetinaNet and YOLOv4). Meanwhile, YOLOv4 uses clustering on the training set to optimize the size of the anchor templates to suit the data distribution of defects. Therefore, benefiting from its backbone and the optimized anchors, YOLOv4 achieves a higher mAP than RetinaNet. Among the three, RetinaNet shows the largest drop in accuracy (22% mAP@0.5) on the GC10-DET dataset, where defect sizes vary widely.
Among the anchor-free detectors, the structure of FCOS is most similar to that of RetinaNet. Apart from the anchor mechanism, the difference between the two lies in the label assignment method. FCOS employs spatial constraints and scale constraints to assign labels, instead of the fixed IoU threshold used in RetinaNet. Both factors increase the speed and accuracy of FCOS. Among the detectors, only CenterNet computes its results on a single-scale feature map instead of multi-scale feature maps, so it is the fastest. However, the single-scale feature map is less robust when tested on GC10-DET.
In the YOLO series of detectors, compared with YOLOv4, YOLOX replaces the anchor mechanism with an anchor-free design, replaces the sibling head with decoupled heads, and adopts the dynamic label assignment method SimOTA. As a result, YOLOX improves on YOLOv4 in both speed and accuracy. Based on YOLOX, our model ST-YOLO further extends the feature decoupling to the feature fusion stage and employs a more flexible, self-tuning label assignment, achieving stable and outstanding accuracy with little additional computation and few additional parameters.
In addition to the classic detectors mentioned above, we also compare ST-YOLO with related research, including Li's Optimized-Inception-ResnetV2 [39], Chen's FRCN [40], Cheng's DE_RetinaNet [16], Liu's RAF-SSD [41], Tang's ECA+MSMP [42], and an improved YOLOX with boosting [43]. The detailed comparison results are recorded in Table 5. Compared with this related research, ST-YOLO still holds advantages in both recognition accuracy and speed. These comparative experiments demonstrate the capabilities and advantages of the proposed ST-YOLO defect-recognition framework.

Shunt Fusion Network Structure
The feature network fuses features from different depths of the backbone network to provide feature maps with different receptive fields for the head. In this section, five feature fusion networks with different structures are compared with the original YOLOX feature fusion network to explore the optimal feature fusion method. The tested models adopt the same backbone (CSPDarkNet-53) and the same training strategy (SimOTA), and the training parameters are consistent as well. The results on the test set of the NEU-DET dataset are listed in Table 6.
The original YOLOX adopts a top-down and bottom-up bi-directional fusion flow, and the structure is FPN-PAN. This structure facilitates information sharing between deep and shallow features. It is shown in Figure 8a, and its mAP@0.5 is only 77.1. When a single top-down fusion flow is applied (single FPN, Figure 8b), mAP@0.5 improves slightly. This suggests that some valid information may be submerged in the late PAN structure. In the structure of Figure 8c, the fusion processes for classification and localization are separated into two parallel FPNs (double FPN), and mAP@0.5 reaches 78.7. This demonstrates that shunted feature fusion helps to improve the detection performance of the model.
In our proposed SFN structure, which can also be called S-Loc-Cls, the localization features are first fused from the top down, and then the classification features are fused from the bottom up, as illustrated in Figure 8f. The inverse structure of SFN, named S-Cls-Loc and shown in Figure 8e, is also designed, and its mAP@0.5 is inferior to that of our method. This shows that features from the early stage are more helpful for localization learning, while features fused in the later stage are more helpful for classification learning. To verify the superiority of SFN in YOLOX, the recently proposed BTFPN [44] is also compared; it is shown in Figure 8d. Summarizing the above comparisons, our method achieves the highest mAP@0.5 of 79.3 and is the best feature fusion structure among those tested.

In essence, from the perspective of the computation flow in Figure 8a, as the receptive field expands with the convolution stack, the feature information deepens layer by layer from the bottom to the top, and gradually deepens from left to right. The features F_l extracted by the backbone network contain much position information, which needs to be captured by the localization head in time. In the fusion process, convolution further abstracts the feature information to obtain category-related features, which are fed into the classification head. As shown in Table 6, when the localization head is fed with shallow features, as in the structures of Figure 8b,c,f, the accuracy on discrete defects (such as crazing, pitted surface, and rolled-in scale) improves. Because the distinctive pixels of these defects are discontinuous and small, deep convolutions drown out the detailed position information. However, the category features of these defects, which are represented by combinations of distinctive pixel groups, require deep extraction. By contrast, for continuous defects (such as patches and scratches), the depth of the feature maps processed by the head has little influence on detection. Because such defects appear as simply connected regions in the image, feature extraction is relatively easy and concentrated.
Therefore, the computational characteristics of the proposed SFN conform to the feature depth of defects, and the structure performs better on detection.

Self-Tuning Label Assignment for Training
A suitable label assignment strategy is very important because the quality of the samples affects the training of the detector. To measure sample quality, for the positive sample set {p} finally matched to the GT g, the quality score of the positive sample set is defined as Q_{g,{p}}, where c_{g,{p}} represents the costs of the positive samples {p}. The term min c_{g,{p}} represents the minimum cost of the positive samples matched to g, that is, the cost of the best sample. The overall quality of the selected positive samples is evaluated by comparing the average cost of the positive samples with the cost of the best sample. The closer Q_{g,{p}} is to 1, the higher the quality of the sample set.
During training on NEU-DET, the quality scores of the positive sample set corresponding to each GT are counted, and their distribution is illustrated in Figure 9.
Further, to test the robustness of STTA, it is employed for training on the network structures of YOLOX and S-YOLO and evaluated on the test sets of NEU-DET and GC10-DET, as shown in Table 7. Compared with the models using SimOTA, STTA stably improves model performance, and the extra training time is negligible.

Limitations and Future Works
Although our model ST-YOLO has made good progress in testing, some limitations remain. First, the computational loads of the two parallel branches after splitting in SFN are different, so part of the computing time is wasted during model operation. Therefore, the computational load between the two branches needs to be balanced to optimize the computational efficiency of the network. Second, the image form of discrete defects is special, and it would be helpful to explore how features extracted by convolutional networks at different semantic levels affect detection. In addition, STTA still has some hyperparameters to be set. In the future, descriptive statistics will replace the hyperparameters used for sample screening, so that the number of primary screening samples is determined adaptively from the distribution of the samples. Finally, efficient information flow between cameras and image data, as well as information exchange on industrial sites, are key issues that must be addressed to deploy defect-monitoring technology in the field, and they are also a focus of our future research [45,46].

Conclusions
In this work, we present the ST-YOLO model for identifying surface defects in steel. The identification of discrete defects is enhanced by an architecture that uses a shunt feature fusion network to optimize the computation for defect classification and localization. The learning and fitting abilities of the model are enhanced by an assignment algorithm that adapts to the dynamic data distribution of defects and uses dynamic training criteria to choose high-quality samples for training. Our model scores 80.3 mAP@0.5 at 46.0 FPS on the NEU-DET dataset and 72.8 mAP@0.5 at 44.7 FPS on the GC10-DET dataset. Accuracy is improved over the baseline while speed and scale remain similar. The mechanism of the proposed methods is illustrated by the comparative experiments and the accompanying analysis.
In future work, we will continue to optimize the structure of the network from the perspective of computational efficiency, and further explore the relationship between the semantic level of defect images and the mechanics of the detection network.In addition, developing a training strategy with less hyperparameter intervention is also a direction for future research.

Figure 1. Images of different objects: (a) defect of rolled-in scale without a closed outline, (b) car with a clear outline in the natural scene, and (c) different defects of various shapes and various sizes.

Figure 3 .
Downsample(•) is a Basic conv(3 × 3/2) pipeline, while Upsample(•) is a nearest-sample interpolation.To make the change, Transit(•) uses Basic conv(3 × 3/1) to compress the channels of features.To obtain l M , l F is sent to the CSP layer directly to

Figure 4. Distribution of ground-truth boxes' aspect ratios in NEU-DET dataset.

Figure 5. Groups of positive anchor points sampled by different core priors in the same ground-truth box. GT is indicated by a solid green line and the core areas are indicated by blue dashed lines. The anchor points within the core area are shown in the illustration, but only those in GT can be selected. The green points are finally selected as potential positive anchor points, while the yellow ones are abandoned. (a) Anchor points are sampled using the multi-positives method. (b) Anchor points are sampled with the adaptive core prior.

Figure 6. Defect-detection results on NEU-DET. The blue boxes represent the prediction results from our model, and the ground-truth labels of defects are indicated by the green boxes.

When STTA is applied for training, the overall quality of positive samples is high, and the quality scores of the sample sets matched by most of the GTs are close to 1. When training with SimOTA, the quality scores of most positive sample sets are concentrated around 0.8, and some sample sets have seriously low scores. SimOTA only considers the relative rank of the training cost and does not consider the value of the training cost c itself, which causes some low-quality samples with large cost gaps to be selected as positive samples. In contrast, STTA adjusts the concerned anchors through the adaptive core prior, and then discards deteriorating samples and adds tolerance bands. The method corrects the screening threshold for training costs, thereby improving the quality of the positive sample set.

Figure 9. Distribution of quality scores of positive sample sets: (a) samples trained by STTA, and (b) samples trained by SimOTA.

Table 1. Results of detection on NEU-DET.

Table 2. Results of detection on GC10-DET.

Table 3. Performance comparison results on NEU-DET.

Table 5. Performance comparison results on NEU-DET with some related research.

Table 6. Accuracy of different fusion structures in the neck of the YOLOX model.

Table 7. The influence of different label assignment methods on model accuracy.