Enhancing Precision with an Ensemble Generative Adversarial Network for Steel Surface Defect Detectors (EnsGAN-SDD)

Defects are the primary problem affecting steel product quality in the steel industry. The specific challenges in developing defect detectors are the vagueness and tiny size of defects. To solve these problems, we propose incorporating a super-resolution technique, a sequential feature pyramid network, and boundary localization. Initially, an ensemble of enhanced super-resolution generative adversarial networks (ESRGAN) is proposed for the preprocessing stage to generate a more detailed contour of the original steel image. Next, in the detector section, the latest state-of-the-art feature pyramid network, known as DetectoRS, uses the recursive feature pyramid technique to extract deeper multi-scale steel features by learning from the feedback of the sequential feature pyramid network. Finally, Side-Aware Boundary Localization is used to precisely generate the output prediction of the defect detectors. We name our approach EnsGAN-SDD. Extensive experimental studies showed that the proposed methods improved the defect detector's performance, which also surpassed the accuracy of state-of-the-art methods. Moreover, the proposed EnsGAN achieved better performance and processing-time effectiveness than the original ESRGAN. We believe our innovation could significantly contribute to improved production quality in the steel industry.


Introduction
Vision-based steel surface inspection systems (detectors) are the key to maintaining the production quality of steel products. Even when a high-resolution camera is used to acquire images of the steel surface, some difficult characteristics of defects affect detector performance. More specifically, defects that are vague in appearance and tiny in size may be challenging to detect since they appear against a wide background area [1]. Therefore, enhancing the quality of input images plays an important role in improving the localization performance of defect detectors. These two problems mean that a preprocessing stage, such as recalibrating the pixel quality, is required before the steel image enters the detection stage. The output of this process can increase the difference between the defect and the background area.

Related Works
Super-resolution generative adversarial network (SRGAN) [36] uses a deep neural network combined with an adversarial network to generate higher-resolution images. It employs a perceptual loss function that comprises an adversarial loss and a content loss. Specifically, a high-resolution (HR) image is downsampled to a low-resolution (LR) image during training. Then, using the GAN generator, LR images are upsampled to super-resolution (SR) images. Finally, a discriminator is used to distinguish the SR images from the HR images, and backpropagation of the GAN loss trains the discriminator and generator. The adversarial loss pushes the solution toward the natural image manifold by using the trained discriminator network to distinguish between super-resolved images and original photo-realistic images, while, rather than relying on similarity in pixel space, the content loss is motivated by perceptual similarity. To improve visual quality, the study in ref. [9] enhanced three critical components of SRGAN (network architecture, adversarial loss, and perceptual loss) by proposing Enhanced SRGAN (ESRGAN). Specifically, the Residual-in-Residual Dense Block (RRDB) without batch normalization was introduced as the basic network building unit. Furthermore, the relativistic GAN [37] enables the discriminator to predict relative realness rather than an absolute value. Finally, using the features before activation, ESRGAN improves the perceptual loss and provides stronger supervision for brightness consistency and texture recovery. Furthermore, Real-ESRGAN [10] applies the powerful ESRGAN to practical restoration applications, trained on purely synthetic data. Specifically, a high-order degradation modeling process was introduced to better simulate complex real-world degradations, and ringing and overshoot artifacts were also taken into account during the synthesis process.
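The relativistic average formulation that ESRGAN adopts can be sketched numerically. The critic outputs below are made-up logits for a toy mini-batch, not the output of a trained network; only the shape of the loss is illustrated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relativistic_avg(real_logits, fake_logits):
    """D_Ra: probability that a real image is relatively more
    realistic than the average fake image (and vice versa)."""
    d_real = sigmoid(real_logits - fake_logits.mean())
    d_fake = sigmoid(fake_logits - real_logits.mean())
    return d_real, d_fake

def discriminator_loss(real_logits, fake_logits):
    d_real, d_fake = relativistic_avg(real_logits, fake_logits)
    # -E[log D_Ra(x_r, x_f)] - E[log(1 - D_Ra(x_f, x_r))]
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

# Toy critic outputs for a mini-batch: real images score higher,
# so the discriminator loss should be small.
real = np.array([2.0, 3.0, 2.5])
fake = np.array([-1.0, -0.5, 0.0])
loss = discriminator_loss(real, fake)
```

The generator's symmetrical adversarial loss simply swaps the roles of the two terms, which is why a well-separated batch that is cheap for the discriminator is expensive for the generator.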
In addition, a U-Net discriminator with spectral normalization improves discriminator capability and stabilizes training. Some state-of-the-art methods [11,12] propose using not a single generator but multiple generators in a GAN. Specifically, various generators were combined as a mixture of probabilistic models, and the best generator was selected to yield the final output.
A feature pyramid network (FPN) [15] is a feature extractor that takes a single-scale image of any size as the input and outputs proportionally sized feature maps at multiple levels in a fully convolutional manner. This process is independent of the backbone convolutional architecture. Therefore, the FPN can be used as a generic solution for building feature pyramids inside deep convolutional networks for tasks like object detection [19,33,34,38-40]. The construction of the pyramid involves a bottom-up pathway and a top-down pathway. Specifically, the bottom-up pathway is the backbone ConvNet's feedforward computation, which computes a feature hierarchy consisting of feature maps at various scales with a fixed scaling step. On the other hand, higher-resolution features are produced by the top-down pathway, which upsamples spatially coarser but semantically stronger feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway through lateral connections; each lateral connection merges feature maps of the same spatial size from the two pathways. For instance, the R-CNN family, such as Faster R-CNN [19] and Cascade R-CNN [33], as well as anchor-free methods like FoveaBox [39], use the FPN inside their structures. The YOLO family, including YOLOv4 [34], YOLOv5 [41], and YOLOX [38], as well as the recent pyramid vision transformer PVTv2 [40], use enhanced feature pyramid networks as a primary structural component. Moreover, the latest feature pyramid model, DetectoRS [18], proposed the recursive feature pyramid (RFP), which incorporates extra feedback connections from the FPN into the bottom-up backbone layers and significantly improves object detection performance.
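One top-down merge step of an FPN can be sketched with toy feature maps. The 1 x 1 lateral convolutions and 3 x 3 smoothing convolutions of the real FPN are omitted for brevity; only the upsample-and-add structure is shown.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of an (H, W) feature map."""
    return f.repeat(2, axis=0).repeat(2, axis=1)

def top_down_merge(coarse, lateral):
    """One FPN merge step: upsample the semantically stronger coarse
    map and add the same-resolution bottom-up feature via the lateral
    connection (1x1 lateral conv replaced by identity here)."""
    return upsample2x(coarse) + lateral

# Bottom-up pyramid: finer maps have higher resolution.
c3 = np.ones((8, 8))   # high resolution, weaker semantics
c4 = np.ones((4, 4))
c5 = np.ones((2, 2))   # low resolution, stronger semantics

p5 = c5
p4 = top_down_merge(p5, c4)   # (4, 4)
p3 = top_down_merge(p4, c3)   # (8, 8)
```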
To hypothesize object locations, state-of-the-art object detection networks [18,19,33,39] rely on region proposal algorithms. A region proposal network (RPN) [19] shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained inside the end-to-end network to generate high-quality region proposals. Furthermore, Side-Aware Boundary Localization (SABL) [20] was proposed for precise localization in object detection, in which each side of the bounding box is localized with a dedicated network branch. This method stems from the empirical observation that, when manually annotating a bounding box for an object, it is frequently easier to align each side of the box to the object boundary than to move the box while tuning its size. In response to this observation, SABL positions each side of the bounding box based on its surrounding context.
Currently, deep learning-based methods have been widely adopted in many application tasks, including the development of vision-based inspection systems. For instance, the work in ref. [42] applied deep convolutional neural networks to determine defect areas on highway roads, and the system proposed in ref. [43] used deep learning object detection to build a defect inspection system for highway roads. Moreover, object detection for defect detection systems (detectors) has been widely applied in manufacturing processes, including the steel industry [21-26]. Feature pyramid networks (FPN) and region proposal networks (RPN) are the main detector components in these previous works. Specifically, the approaches in refs. [21,22] utilized an FPN inside a YOLO structure to extract multi-scale features from images of steel and aircraft products. The frameworks in refs. [23-26] also adopted an FPN to acquire multi-scale features from images of steel, wood, solar cells, and electrical products, combined with an RPN to localize the defect regions accurately.
The proposed EnsGAN-SDD is mainly inspired by the literature reviewed above to construct the steel surface inspection system. First, ESRGAN was selected as the input image enhancement technique rather than Real-ESRGAN, since Real-ESRGAN produces more noise at lower resolutions. Moreover, we propose a novel EnsGAN that adopts the multi-generator GAN concept to improve the baseline method in terms of accuracy and processing speed. Then, in the detection stage, or steel defect detector (SDD), Side-Aware Boundary Localization (SABL) is integrated into the latest state-of-the-art feature pyramid network, DetectoRS. Experiments on the common steel surface dataset (the Severstal steel dataset) showed that the proposed work, on top of residual network and aggregated residual network backbones, achieved significant improvement over state-of-the-art object detection methods [19,33,34,38-40]. Moreover, the proposed EnsGAN-SDD also meets the inference time criteria [44,45] for steel surface inspection systems, making it suitable for real-world scenarios in the steel industry.

Proposed Methods
We propose two processes for detecting steel defect areas: preprocessing and feature extraction, as shown in Figure 1. The preprocessing stage in our method, named the EnsGAN model, is integrated with the state-of-the-art recursive feature pyramid network (DetectoRS) [18] on top of a residual network (ResNet) [46] or aggregated residual network (ResNeXt) [47], with boundary localization [20] as the localization and recognition network in the feature extraction stage. Moreover, in this work, we aim to improve the super-resolution quality of the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) and reduce the inference latency of the steel image enhancement method. This section introduces the base architectures of the proposed EnsGAN (generator and discriminator) used in this research and our strategy to improve quality and reduce latency for the steel defect problem. The resulting image from the EnsGAN is used as the input for DetectoRS with the SABL prediction head [20] as the steel defect detector (SDD) to produce a highly accurate defect localization system for steel surface images.

In the first step of the preprocessing procedure, the discriminator is trained to classify an image as "High Resolution" (HR) or "Low Resolution" (LR). The generator is then trained to generate the super-resolution image, which is tested against the discriminator. The loss and feedback from the discriminator are used to update the generator weights so that it creates better resolution. For further detail, Algorithm 1 shows the complete process flow from preprocessing through detection:

Algorithm 1 Algorithm of the proposed method
1. Procedure EnsGAN:
2. Train discriminator (VGG28) with HR and LR images to identify HR/LR
3. For i = 1 to iteration do
4. Generate HR with a generator for each piece of the LR image
5. Combine all generator results into one image according to their positions
6. Check whether X_generator is HR or LR using the discriminator
7. Update weights and biases in the generators
8. Repeat training the generators
9. Boundary localization using SABL

Proposed EnsGAN
Our base model uses ESRGAN [9], which achieves better quality than previous super-resolution models such as SRCNN [48], EDSR [49], SRGAN [36], and RCAN [50]. The ESRGAN model improves on SRGAN by removing batch normalization (BN) in the generator structure and using the residual-in-residual dense block (RRDB), as depicted in Figure 2.

Removing batch normalization [49] can reduce memory usage by up to four times, achieve better performance than the standard ResNet structure, and serve as a simple but effective strategy for sharpening and deblurring images [51]. In the generator structure, ESRGAN keeps the high-level SRGAN architecture. The main difference is that each block contains several convolution layers with Leaky Rectified Linear Unit (LReLU) activations to prevent dead nodes [9]. Stability is further improved by scaling down the residuals (multiplying them by a constant between 0 and 1 before adding them to the main path) and by using an initialization with small variance during training.

For the discriminator, ESRGAN uses a relativistic discriminator to approximate the probability of an image being real or fake. During adversarial training, the generator uses a linear combination of the perceptual difference between real and fake images computed with the VGG-28 network architecture, the pixel-wise absolute difference between real and fake images, and the relativistic average loss between real and fake images. In a standard GAN, the discriminator can be defined as

D(x) = σ(R(x)),

where R(x) is the non-transformed layer output, D(x) is the discriminator, and σ is the sigmoid function [36]. The relativistic average discriminator can be represented as

D_Ra(x_r, x_f) = σ(R(x_r) − E[R(x_f)]),

where E[·] represents the operation of taking the average over the data in the mini-batch, x_r is a real (HR) image, and x_f is a generated (fake) image. The discriminator loss is then defined as

L_D = −E_{x_r}[log(D_Ra(x_r, x_f))] − E_{x_f}[log(1 − D_Ra(x_f, x_r))],

and the adversarial loss for the generator, in the symmetrical form, can be written as

L_G = −E_{x_r}[log(1 − D_Ra(x_r, x_f))] − E_{x_f}[log(D_Ra(x_f, x_r))],

where x_f = G(x_i) and x_i denotes the input LR image.

We present our model to achieve better results and lower time consumption for steel defect images. Our idea is to use an ensemble of multiple generators rather than the single one of the general ESRGAN model. The input data used for training are parts of the split image, as shown in Figure 3. Specifically, we used a split array image to implement the splitting operation. Next, the proposed EnsGAN feeds each generator a split low-resolution image, as shown in Figure 4. As illustrated in Figure 3, the split array image (Image LR) from one single image is input into the multi-Generator (G_1, G_2, . . . , G_n), which then uses the ensemble technique to reconstruct the output Image HR.
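The splitting and position-wise reassembly at the heart of EnsGAN can be sketched as follows. A nearest-neighbor x4 upscaler stands in for the trained RRDB generators, so only the tiling logic, not the super-resolution quality, is illustrated; the function names are ours, not the paper's.

```python
import numpy as np

def split_image(img, n):
    """Split an (H, W) image into an n x n grid of tiles
    (H and W must be divisible by n)."""
    rows = np.split(img, n, axis=0)
    return [np.split(r, n, axis=1) for r in rows]

def stub_generator(tile, scale=4):
    """Placeholder for one ensemble generator: nearest-neighbor x4
    upscaling stands in for the RRDB super-resolution network."""
    return tile.repeat(scale, axis=0).repeat(scale, axis=1)

def ensemble_sr(lr_img, n=4):
    """Run each tile through its generator and reassemble the
    results into one image according to their positions."""
    tiles = split_image(lr_img, n)
    sr_rows = [np.concatenate([stub_generator(t) for t in row], axis=1)
               for row in tiles]
    return np.concatenate(sr_rows, axis=0)

lr = np.arange(64, dtype=float).reshape(8, 8)
sr = ensemble_sr(lr, n=4)
```

Because the stub generator is a pure upsampler, reassembling the per-tile outputs must reproduce exactly what upscaling the whole image would give, which confirms the split/merge bookkeeping is position-correct.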

Detection Baseline
In the proposed steel defect detector (SDD), we utilized DetectoRS as the detection baseline. This deep learning model has two main cores: the recursive feature pyramid (RFP) and switchable atrous convolution (SAC). The RFP in DetectoRS, shown in Figure 5, works at the macro level, incorporating extra feedback connections into the bottom-up backbone layers, while SAC works at the micro level, convolving the features with different atrous rates and gathering the results using switch functions.
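The switch idea behind SAC can be sketched in one dimension: the same weights are applied at two atrous rates and a soft switch blends the results. The weights and the constant switch value below are illustrative stand-ins for the learned components.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1D convolution with a dilated (atrous) kernel: zeros are
    inserted between taps so the receptive field grows with `rate`
    while the number of weights stays the same."""
    dilated = np.zeros((len(w) - 1) * rate + 1)
    dilated[::rate] = w
    return np.convolve(x, dilated, mode="same")

def switchable_atrous_conv(x, w, switch, r_small=1, r_large=3):
    """SAC sketch: a per-position soft switch in [0, 1] blends the
    same weights applied at two different atrous rates."""
    return (switch * atrous_conv1d(x, w, r_small)
            + (1.0 - switch) * atrous_conv1d(x, w, r_large))

x = np.ones(16)
w = np.array([1.0, 1.0, 1.0])
switch = np.full(16, 0.5)          # stand-in for the learned switch
y = switchable_atrous_conv(x, w, switch)
```

In DetectoRS the switch itself is produced by a small convolutional branch, so different spatial positions can favor small or large receptive fields.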

In feature pyramid networks, the output feature δ_i can be defined by

δ_i = α_i(δ_{i+1}, x_i), x_i = β_i(x_{i−1}), i = 1, . . . , S,

where β_i denotes the i-th stage of the bottom-up backbone, and α_i denotes the i-th top-down FPN operation. The backbone, equipped with FPN, outputs a set of feature maps {δ_i | i = 1, . . . , S}, where S is the number of stages, x_0 is the input image, and δ_{S+1} = 0. After adding the recursive feature pyramid (RFP) for feedback connections, the output feature δ_i is defined by

δ_i = α_i(δ_{i+1}, x_i), x_i = β_i(x_{i−1}, ϑ_i(δ_i)),

where ϑ_i denotes the feature transformations applied before connecting the features back to the bottom-up backbone. To implement the recursive operation, we unroll it to a sequential network, i.e., for all i = 1, . . . , S and t = 1, . . . , T:

δ_i^t = α_i^t(δ_{i+1}^t, x_i^t), x_i^t = β_i^t(x_{i−1}^t, ϑ_i^t(δ_i^{t−1})),

where T is the number of unrolled iterations, and the superscript t denotes operations and features at the unrolled step t. The detection baseline, DetectoRS with a ResNeXt backbone, performs very well in extracting deeper feature information. Moreover, the RFP yields a robust network because the output of the FPN is fed back to each stage of the bottom-up backbone through the feedback connection, meaning the feature information from the defect image is read twice.
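The unrolled recursion can be traced with scalar stand-ins for the feature maps. Here `alpha`, `beta`, and `theta` are toy linear functions chosen for readability, not the actual DetectoRS operations; the point is the order of the bottom-up and top-down sweeps and the one-step-delayed feedback.

```python
import numpy as np

S, T = 3, 2   # pyramid stages and unrolled iterations

def beta(x_prev, feedback=0.0):
    """Bottom-up backbone stage (toy linear stand-in)."""
    return 0.5 * x_prev + feedback

def alpha(d_next, x):
    """Top-down FPN operation (toy linear stand-in)."""
    return d_next + x

def theta(d):
    """Feedback transform before re-entering the backbone."""
    return 0.1 * d

def rfp_unrolled(x0):
    d = np.zeros(S + 2)                    # d[i] = delta_i; d[S+1] stays 0
    for t in range(T):
        x = np.zeros(S + 1)
        x[0] = x0
        for i in range(1, S + 1):          # bottom-up sweep with feedback
            x[i] = beta(x[i - 1], theta(d[i]))
        for i in range(S, 0, -1):          # top-down sweep
            d[i] = alpha(d[i + 1], x[i])
    return d[1:S + 1]

features = rfp_unrolled(x0=8.0)
```

On the second unrolled step the backbone stages receive the previous step's pyramid outputs through `theta`, which is the "double reading" of the defect features described above.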

Boundary Localization
To improve the defect localization accuracy, we add Side-Aware Boundary Localization (SABL) [20] to replace the prediction head of DetectoRS. SABL is a methodology for precise localization in object detection in which each side of the bounding box is localized with a dedicated network branch. It uses a two-step localization scheme that first predicts a range of movement through bucket prediction and then pinpoints the precise position within the predicted bucket. Figure 6 shows that it first looks for the correct bucket, i.e., the one in which the boundary is located. Fine regression is then performed by predicting offsets from the selected bucket's centerline, which serves as a coarse estimate. This scheme enables very precise localization even in the presence of large variances in displacements.
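The two-step bucketing-then-regression scheme can be sketched numerically for a single side of a box. The candidate range, bucket scores, and offsets below are hypothetical values, not outputs of the SABL network.

```python
import numpy as np

def localize_side(lo, hi, bucket_scores, offsets):
    """SABL-style two-step localization of one bounding-box side.

    The candidate range [lo, hi] is divided into equal buckets.
    Step 1 (coarse): pick the bucket with the highest score.
    Step 2 (fine): regress an offset from that bucket's centerline,
    expressed in units of the bucket width.
    """
    k = len(bucket_scores)
    width = (hi - lo) / k
    best = int(np.argmax(bucket_scores))
    centerline = lo + (best + 0.5) * width
    return centerline + offsets[best] * width

# Hypothetical outputs for the left edge of a defect box in [0, 80):
scores = np.array([0.05, 0.1, 0.7, 0.1, 0.05])   # bucket 2 wins
offsets = np.array([0.0, 0.0, -0.2, 0.0, 0.0])   # shift left by 0.2 widths
x_left = localize_side(0.0, 80.0, scores, offsets)
```

The classification step only needs to be right to within one bucket, so the regression target stays small and well-conditioned even for large displacements.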

Experiments
For the experiments, we used the common steel surface defect dataset provided by Kaggle, which consists of 12,568 images. This dataset has been used for real industrial application scenarios [52-54]. According to the original source (https://www.kaggle.com/c/severstal-steel-defect-detection, accessed on 24 April 2022), this dataset has four class categories: Class 1, Class 2, Class 3, and Class 4. Specifically, the study in ref. [54] explained that Class 1 contains pitted surface defects, Class 2 crazing defects, Class 3 scratch defects, and Class 4 patch defects. A clear presentation of the defect conditions can be found in Figure 7. Moreover, we trained on an Nvidia RTX 3090 Ti 24 GB with a splitting ratio of 80% of the images for training, 10% for validation, and 10% for testing. Originally, this dataset used pixel-wise annotation labels; however, in this experiment, we converted the labels into bounding boxes using the Pascal VOC annotation format [55] to reduce the annotation cost and make the system more suitable for real-world scenarios.
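The 80/10/10 split over the 12,568 images can be sketched as follows; the seed and the index-based bookkeeping are illustrative, not the exact pipeline used in the experiments.

```python
import random

def split_dataset(n_images, seed=0):
    """Shuffle image indices and split them 80/10/10 into
    train / validation / test subsets."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    n_train = int(0.8 * n_images)
    n_val = int(0.1 * n_images)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_dataset(12568)
```

Assigning the remainder to the test set guarantees every image lands in exactly one subset even when the percentages do not divide the dataset size evenly.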

Specifically, in the generator of the proposed preprocessing EnsGAN, the RRDB network used 2D convolutions with a 3 × 3 kernel and 1 × 1 strides with same padding, 2D transposed convolutions with a 3 × 3 kernel and 2 × 2 strides with same padding, and Leaky ReLU with α = 0.2 as the activation function. Furthermore, the VGG28 architecture used 2D convolutions with a 3 × 3 kernel and same padding, the Leaky ReLU activation function with α = 0.2, and fully connected layers with 100 neurons. In the localization and recognition network experiments, we used a batch size of a single image with stochastic gradient descent as the training optimizer. Moreover, the initial learning rate was set to 0.001, and it was decayed at epochs 16 and 19 of the 40 training epochs with a decay step of 0.0001. In addition, we used only one general data augmentation method, random flipping, to increase the feature diversity of the input images.
Finally, in this section, we discuss the results of our comparison with ESRGAN [9] and Real-ESRGAN [10] for super-resolution in the preprocessing stage. Furthermore, to offer a complete view of the proposed method's performance, we compared our proposed inspection system, EnsGAN-SDD, against state-of-the-art object detection models that are feature pyramid network-based: from the YOLO family, YOLOv4 [34], YOLOv5 [41], and YOLOX [38]; from the anchor-free methods, FoveaBox [39]; from the transformer networks, PVTv2 [40]; and from the R-CNN family, Faster R-CNN [19] and Cascade R-CNN [33].

Results and Discussion of Proposed Preprocessing
In this subsection, we demonstrate the performance of our proposed method in the preprocessing stage. This experiment sampled split numbers ranging from 2 up to 32 and selected the best one based on the Peak Signal-to-Noise Ratio (PSNR), which measures the quality of the reconstructed image against the original. We used three generated defect images produced by Real-ESRGAN, ESRGAN, and the proposed EnsGAN to perform the quantitative comparison.
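PSNR, the quality metric used here, compares two images through their mean squared error on a logarithmic scale. A minimal implementation (the test images are synthetic, not from the dataset):

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images."""
    diff = original.astype(float) - reconstructed.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 2 gray levels gives MSE = 4:
ref = np.full((16, 16), 100.0)
gen = np.full((16, 16), 102.0)
value = psnr(ref, gen)
```

Higher values mean the reconstruction is closer to the original, which is why the split number with the maximum PSNR is selected below.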
Moreover, we cropped specific regions in these three defect sample images and show them in Figure 8 as Image 1 on the top, Image 2 in the middle, and Image 3 on the bottom for the qualitative comparison. As shown in Table 1, the objective results of the proposed EnsGAN surpass the quality of the original Real-ESRGAN and ESRGAN, with maximum PSNR values on the three images of 42.005, 39.278, and 43.135 when the split number was 16. Moreover, Table 2 shows the benefit of the proposed EnsGAN in terms of processing speed: as the split number increased, the processing time decreased. As shown in this table, for the three images, Real-ESRGAN took 11.927, 12.473, and 11.638 s and ESRGAN took 11.687, 11.723, and 11.656 s, respectively, whereas our proposed method with a split number of 16 improved greatly to 7.937, 8.250, and 8.123 s.
Figure 6. Visualization of Side-Aware Boundary Localization (SABL). To replace the proposal, bucketing estimation predicts the candidate buckets, and fine regression acquires the final bounding box (bbox) prediction.

Experiments
For the experiments, we used the common steel surface defect dataset provided by Kaggle, which consists of 12,568 images. This dataset has been used for the real industrial application scenarios [52][53][54]. According to the original source (https://www.kaggle.com/c/severstal-steel-defect-detection, accessed on 24 April 2022), this dataset has four class categories: Class 1, Class 2, Class 3, and Class 4, respectively. Specifically, the study in ref. [54] explained that Class 1 has pitted surface defect conditions, Class 2 has crazing defect conditions, Class 3 has scratch defect conditions, and Class 4 has patch defect conditions. A clear presentation of the defect conditions can be found in Figure 7. Moreover, we trained the dataset using an Nvidia RTX 3090 Ti 24GB with a splitting ratio of 80% of images for the training, 10% for validation, and 10% for the testing. Originally, this dataset used pixel-wise annotation labels. However, in this experiment, we converted the label into a bounding box using the Pascal VOC annotation format [55] to reduce the annotation cost and make the system more suitable for real-world scenarios. Specifically, in the generator of the proposed preprocessing EnsGAN, for RRDB Network, we used 2D Convolution with a 3 × 3 kernel size and 1 × 1 strides with the same padding, and 2D Transpose Convolution with a 3 × 3 kernel size and 2 × 2 strides with the same padding; for the activation function, we used Leaky ReLU with α = 0.2. Furthermore, in VGG28 Architecture, we used Convolution 2D with a 3 × 3 kernel size with the same padding, Leaky ReLU activation function with α = 0.2, and fully connected layers with 100 neurons. In the localization and recognition network experiments, we used a single image as the batch size with stochastic gradient descent as the training optimizer. Moreover, the initial learning rate was set to 0.001, and the decay rate step was 0.0001 in epochs 16 and 19 of the 40 epochs. 
In addition, we used only one general data augmentation method, random flipping, to increase the feature diversity of the input images.
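The augmentation above can be sketched as follows. This is a minimal sketch assuming horizontal flips and Pascal-VOC-style `(xmin, ymin, xmax, ymax)` boxes; the actual pipeline may also apply vertical flips.

```python
import random
import numpy as np

def random_flip(image, boxes, p=0.5):
    """With probability p, horizontally flip an image (H x W [x C] array)
    and mirror its (xmin, ymin, xmax, ymax) bounding boxes accordingly."""
    if random.random() >= p:
        return image, boxes
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = [(w - xmax, ymin, w - xmin, ymax)
             for (xmin, ymin, xmax, ymax) in boxes]
    return flipped, boxes
```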
Finally, in this section, we discuss the results of comparing our super-resolution preprocessing stage with ESRGAN [9] and Real-ESRGAN [10]. Furthermore, to offer a complete view of the performance of the proposed method, we compared our proposed inspection system, EnsGAN-SDD, with state-of-the-art, feature-pyramid-network-based object detection models: from the YOLO family, YOLOv4 [34], YOLOv5 [41], and YOLOX [38]; from the anchor-free methods, FoveaBox [39]; from the transformer networks, PVTv2 [40]; and from the R-CNN family, Faster R-CNN [19] and Cascade R-CNN [33]. For the qualitative, or subjective, comparison, Figure 8 shows the original sample defect image and the defect images generated by Real-ESRGAN, ESRGAN, and the proposed EnsGAN. The figure shows that the adversarial image of the proposed method has enhanced resolution quality, especially in the defect regions marked by the red and yellow boxes. Moreover, the green box shows that the proposed method reduces noise in the original images better and attains a smoother texture than ESRGAN does. On the other hand, the output of Real-ESRGAN enhances the resolution but unfortunately increases the noise because of aliasing issues. The problem arises because the traditional degradation model, which consists of blur, downsampling, noise, and JPEG compression, is insufficient to model real-world degradations.
Based on the discussion above, we conclude that the proposed super-resolution method EnsGAN performs well in all aspects compared with the state-of-the-art methods. As we know, good sample data are very important for a neural network model. Therefore, the quality of the produced images and the speed efficiency of our proposed method can benefit the performance of the steel surface inspection system.
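The PSNR values reported in Table 1 follow the standard definition for 8-bit images; a minimal sketch:

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two same-shape images."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```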

Results and Discussion of Proposed Inspection System
In the experiments on the localization and recognition network highlighted in this section, we used recall (REC), average precision (AP), and mean average precision (mAP) to align with the quantitative evaluation in Pascal VOC object detection [55]. First, Table 3 shows the ablation improvement of the proposed method. We used two state-of-the-art backbones, residual network 50 (ResNet-50) as the light backbone and aggregated residual network 101 (ResNeXt-101, denoted X-101) as the deeper backbone, for DetectoRS as the defect detector's baseline. With the ResNet-50 backbone, the mAP of the system was 77.1%. When we added EnsGAN to the system, the accuracy increased to 78.8%. After Side-Aware Boundary Localization (SABL) was substituted for the general prediction head, accuracy improved to 79.0%. Furthermore, we demonstrated the proposed system with the deeper network backbone: the initial mAP of DetectoRS on top of X-101 was 78.5%, and after we integrated the proposed EnsGAN and SABL, the mAP improved to 79.2% and 80.4%, respectively.
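The AP values above follow the Pascal VOC evaluation. A minimal sketch of the all-point AP computation from a precision-recall curve is given below; the detection-matching step (IoU thresholding of predictions against ground truth) that produces these curves is omitted.

```python
import numpy as np

def voc_ap(recall, precision):
    """Pascal VOC-style AP: area under the precision-recall curve after
    making precision monotonically non-increasing (all-point method)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Envelope: precision at recall r becomes the max precision at recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area of each segment where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of this AP over the four defect classes.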

Table 4 further shows the proposed system's objective performance compared with the state-of-the-art object detection models. The proposed EnsGAN-SDD on top of ResNet-50 improved on the accuracy of the YOLO family (YOLOv4, YOLOv5, and YOLOX), the anchor-free method (FoveaBox), the transformer network (PVTv2), and the R-CNN family (Faster R-CNN and Cascade R-CNN) by margins of 18.2%, 18.9%, 13.8%, 12.4%, 7.7%, 13.7%, and 11.5%, respectively. Figure 9, in turn, shows how accurately the proposed steel surface inspection system localized the four defect classes in the Severstal steel defect dataset. Moreover, Figure 10 shows the proposed system's capabilities in multi-class defect localization. These results show that the proposed method achieved high performance in detecting challenging steel product defects. In addition, the proposed system achieved 11.6 frames per second (FPS) on a single Nvidia RTX 3090 Ti GPU on top of the ResNet-50 backbone. This shows that EnsGAN-SDD can be applied in the steel industry, which requires a processing speed above 10 FPS, as explained in refs. [44,45].
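Throughput figures such as the 11.6 FPS above are typically measured as wall-clock averages over many frames; a minimal sketch (the `infer` callable and the warm-up count are illustrative, not part of the original system):

```python
import time

def measure_fps(infer, frames, warmup=5):
    """Average frames-per-second of an inference callable over a list of
    frames, excluding a few warm-up runs from the timed interval."""
    for f in frames[:warmup]:
        infer(f)  # warm-up: exclude one-time setup costs from the timing
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```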

Conclusions
The proposed EnsGAN-SDD can offer a solution to improve steel surface inspection systems. EnsGAN achieved a good PSNR on the steel defect dataset and had a low processing time relative to ESRGAN and Real-ESRGAN. Moreover, when it was integrated with a robust detector, DetectoRS, and Side-Aware Boundary Localization (SABL) as the steel defect detector (SDD), the localization accuracy significantly improved over those of state-of-the-art object detection models while maintaining the processing speed necessary to meet standards for steel surface inspection systems. Furthermore, in the steel defect detection domain, we found that the main challenge lies not only in low resolution, but also in the difficulty of examples, e.g., the tiny defect size in Class 1 (pitted surface defects), which has lower accuracy than the other classes. Future work must extend the proposed preprocessing stage to address this issue.
