High Quality Coal Foreign Object Image Generation Method Based on StyleGAN-DSAD

Research on coal foreign object detection based on deep learning is of great significance to the safe, efficient, and green production of coal mines. However, foreign object image datasets are scarce because of difficult collection conditions, which poses an enormous challenge to coal foreign object detection. To augment foreign object datasets, a high-quality coal foreign object image generation method based on an improved StyleGAN is proposed. Firstly, a dual self-attention module is introduced into the generator to strengthen the long-distance dependence of features in the spatial and channel dimensions, refine the details of the generated images, accurately distinguish foreground and background information, and improve the quality of the generated images. Secondly, depthwise separable convolution is introduced into the discriminator to address the low efficiency caused by the large number of parameters in multi-stage convolutional networks, making the model lightweight and accelerating training. Experimental results show that the improved model has significant advantages over several classical GANs and the original StyleGAN in terms of the quality and diversity of the generated images, with an average improvement of 2.52 in IS and a decrease of 5.80 in FID for each category. As for model complexity, the parameters and training time of the improved model are reduced to 44.6% and 58.8% of those of the original model without affecting the quality of the generated images. Finally, the results of applying different data augmentation methods to the foreign object detection task show that our image generation method is more effective than the traditional methods, and that, under the optimal conditions, it improves APbox by 5.8% and APmask by 4.5%.


Introduction
Coal is the most abundant and widely distributed conventional energy source in the world, and is also an important strategic resource [1]. In the process of coal production, coal is frequently mixed with gangue, woods, anchor rods, woven bags, iron, and other foreign objects, which seriously affects the safe, efficient, and green production of coal mines [2,3], and it is urgent to choose an automated and intelligent foreign object detection and separation method [4]. With the rapid development of coal mine intelligence, foreign object detection methods based on deep learning have received wide attention from scholars [5]. However, deep learning is a data-driven method [6], and the training process usually requires the support of large datasets to prevent model overfitting. For example, Zhang [7] used 18,714 images to train a semantic segmentation model containing four types of coal foreign objects and achieved 91.24% segmentation accuracy. Hao [8] used 2300 images to train a large foreign object detection model, and the authors noted in their conclusion that the total number of foreign object samples had limited the detection accuracy. According to the investigation, there are significant differences in the features of foreign bodies between different coal mines, and there is no public high-quality foreign object dataset to support model training, due to constraints such as sparse foreign object sample content in regular production and harsh foreign object collection conditions. Therefore, achieving efficient data augmentation on limited datasets becomes the key to improve foreign object detection performance.
Image data augmentation methods are mainly divided into traditional augmentation and GAN (Generative Adversarial Network) based augmentation. Traditional augmentation methods are based on image processing, which mainly include geometric and color transformations, adding noise and filtering, random elimination, image blending, etc. [9,10]. At present, most of the research on coal foreign object image data augmentation focuses on traditional methods [11,12]. Traditional augmentation methods have great efficiency in image preprocessing, but they do not fundamentally solve the problem of insufficient foreign object data diversity, and are of limited help in foreign object detection. With the development of the data augmentation, GAN provides a new direction for image data augmentation, which can effectively solve the problems of small samples and data imbalance by generating a large number of samples with the same distribution as the real dataset [13]. The early GAN has disadvantages, such as gradient disappearance, gradient explosion, convergence difficulties, and low resolution of generated images. With scholars' in-depth research on GAN, several works, such as WGAN [14], WGAN-GP [15], DCGAN [16], ProGAN [17], and BigGAN [18], have been introduced to gradually improve the training stability of GAN and the resolution of generated images, which laid the foundation for its application in machine vision tasks. Shi [19] added Wasserstein divergence to WGAN to improve the diversity of defect samples on the surface of micro-electromechanical systems (MEMS), and the experimental results showed that the mAP and F1 scores of defect detection were improved by 8.16% and 6.73%, respectively, after data augmentation. Deng [20] applied WGAN-GP to data augmentation for facial expression recognition, and improved the recognition accuracy of multi-angle facial expressions. In the field of coal mining, the application of GAN is still in the initial stage. 
To improve the accuracy of coal and rock recognition, Wang [21] introduced a pyramid structure in ConSinGAN to generate coal and rock images with a resolution of up to 250 × 250 by refining the output progressively from coarse to fine. Wang [22] incorporated a new convolution module in DCGAN to expand the resolution of generated images from 64 × 64 to 512 × 512, thus facilitating the training of coal-gangue detection models. The above research demonstrates the feasibility of applying GAN to visual recognition and detection tasks. However, coal foreign objects with diverse categories and irregular shapes impose higher requirements on the quality of the dataset, and the images generated by the above methods fail to balance resolution and diversity well enough to meet the needs of this paper.
StyleGAN [23] is one of the best performing generative frameworks. It combines a style module (AdaIN) with a progressive generation strategy, can generate 1024 × 1024 high-resolution images while increasing the diversity of datasets, and has been widely used in many fields [24,25]. Nevertheless, limited by the receptive field of the convolutional structure, it is difficult for the generative model to capture the long-range and global features of the object, which makes CNN-based models such as StyleGAN prone to detail defects, such as teardrop artifacts and shape distortion, in the generated images. To address such problems, Li [26] proposed a generative model based on the WD (wide and deep feature extraction block) module, which improves the quality of the generated images by combining the depth information extracted by ResNet with the global information extracted by Inception V1. However, the WD module contains a large number of convolution operations and has a large impact on the complexity of the model. SAGAN [27] introduced spatial self-attention into GAN to construct connections between different regions, and verified the ability of the self-attention mechanism to grasp the local details of the generated images; however, it ignored the relationship between channels of feature maps, which led to anomalous structures in the generated images. Yang [28] added channel attention to SAGAN to further refine the texture, structure, and other features of generated images, but the suppressive effect of channel attention also leads to the absence of partial details.
In summary, to meet the need for fast, large-scale acquisition of high-resolution, high-diversity, and sufficiently detailed coal foreign object images that can provide data support for coal foreign object detection models, a novel high-quality coal foreign object image generation method based on StyleGAN-DSAD is proposed. The main contributions of this paper are summarized as follows:

• We introduced a dual self-attention module (DSAM) into the generator of StyleGAN to strengthen the long-distance dependence of features in the spatial and channel dimensions, which refines the details of the generated images and alleviates the problems of artifacts, distortions, and foreground-background adhesion in the generated images.

• Through research and experiments, we found that the discriminator has little effect on the quality of the generated images; thus, we replaced the standard convolution in the discriminator with a depthwise separable convolution (DSC) to reduce the time and space complexity of StyleGAN and improve the training efficiency.

• Compared with the baseline method, the proposed method generates higher-quality and more diverse foreign object images. Meanwhile, the accuracy of coal foreign object detection was effectively improved after data augmentation using the proposed method, indicating that the application of StyleGAN-DSAD to coal foreign object image augmentation is feasible.

Generative Adversarial Network
The generative adversarial network is based on the idea of a zero-sum game and, as shown in Figure 1, mainly consists of a generator G and a discriminator D [13]. G takes random noise z as input and outputs a generated image; D takes an image as input and outputs the probability that the image is a real sample.
GAN performs model optimization through a min-max objective function, as shown in Equation (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

where x and z denote the real data and input noise, respectively, and $P_{data}(x)$ and $P_z(z)$ denote the probability distributions obeyed by the real data and input noise, respectively.
During the entire training process, the generator and the discriminator are trained alternately. The parameters of the generator are frozen while the discriminator is trained, and vice versa. As training proceeds, the generator and discriminator approach an equilibrium (Nash equilibrium), at which point the generator can produce a large number of fake images that resemble real samples.
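To make the objective concrete, the following minimal sketch estimates the standard GAN value function V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))] from a handful of discriminator outputs. The function name `gan_value` and the probability values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples (probabilities).
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs: a discriminator that separates real from fake well ...
confident_d = gan_value(d_real=[0.9, 0.8, 0.95], d_fake=[0.1, 0.2, 0.05])
# ... and one that cannot tell them apart (outputs 0.5 everywhere).
fooled_d = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident_d)  # closer to 0: D is winning the game
print(fooled_d)     # equals -2 * log(2): the value at the equilibrium point
```

The discriminator tries to push this value up, while the generator tries to push it down, which is why the two networks are updated alternately.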

StyleGAN
StyleGAN was proposed by Karras et al. [23] to generate high-resolution and highly diverse images. As shown in Figure 2, it is mainly composed of a generator (consisting of a mapping network and a synthesis network), a discriminator, and a loss function.


• Mapping network
The role of the mapping network is to cooperate with the synthesis network to control the visual features of the generated images. The input of the mapping network is a set of Gaussian-distributed random vectors Z, which are encoded through eight fully connected layers into feature-disentangled intermediate vectors W' of the same size. The different elements in W' are used to control the image generation style, such as color, texture, and shape.
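As a rough illustration of the mapping network, the sketch below passes a Gaussian vector through eight fully connected layers and returns a vector of the same size. It is an assumption-laden toy, not the authors' implementation: the latent size of 512, the leaky ReLU activation, and the random weights are all assumptions made here so that the example runs.

```python
import numpy as np


def mapping_network(z, weights, biases, negative_slope=0.2):
    """Map a latent vector z to an intermediate vector W' of the same size."""
    w = z
    for weight, bias in zip(weights, biases):
        w = weight @ w + bias
        # Leaky ReLU activation (assumed; the text does not name the activation).
        w = np.where(w > 0, w, negative_slope * w)
    return w


rng = np.random.default_rng(0)
dim = 512          # assumed latent size
num_layers = 8     # eight fully connected layers, as described in the text

# Random stand-ins for trained parameters.
weights = [rng.normal(scale=np.sqrt(2.0 / dim), size=(dim, dim))
           for _ in range(num_layers)]
biases = [np.zeros(dim) for _ in range(num_layers)]

z = rng.normal(size=dim)                     # Gaussian-distributed input Z
w_prime = mapping_network(z, weights, biases)
print(w_prime.shape)
```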

• Synthesis network
The synthesis network consists of several sub-networks. Its initial input is a constant feature of size 4 × 4 × 512, and during forward propagation the resolution of the generated image increases smoothly from 4 × 4 to 8 × 8 and so on up to the configured maximum resolution, which avoids the high training cost and training collapse of direct generation. Each synthesis sub-network consists of an upsampling layer, a convolutional layer, an Adaptive Instance Normalization (AdaIN) module, a control vector A, and a noise input B, which act jointly to enhance the diversity of the generated images. The noise is first added to the feature x along each channel, and then the AdaIN module influences the generation style, as shown in Figure 3 and Equation (2):

$$\mathrm{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \quad (2)$$

where $x_i$ denotes the feature map of layer i, $\mu(x_i)$ and $\sigma(x_i)$ denote its mean and standard deviation, respectively, and $y_{s,i}$ and $y_{b,i}$ denote the scaling (deflation) factor and the bias (deviation) factor, respectively.
Firstly, the output of the convolutional layer is normalized per channel using its mean and standard deviation, while W' is expanded into a scaling (deflation) factor and a bias (deviation) factor by a learnable affine transformation A. The normalized output is then multiplied by the scaling factor and shifted by the bias factor, which completes the influence of W' on the original output. AdaIN is applied after each convolution operation, so each sub-network is styled twice, and the styles injected at different resolutions affect different levels of image detail.
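The AdaIN step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the paper's code: the tensor sizes, the random affine matrix standing in for the learnable transformation A, and the offset of 1.0 added to the scale are all hypothetical choices.

```python
import numpy as np


def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive instance normalization for one feature map.

    x:   array of shape (C, H, W), output of a convolutional layer.
    y_s: per-channel scaling factors, shape (C,).
    y_b: per-channel bias factors, shape (C,).
    """
    mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)     # per-channel standard deviation
    x_norm = (x - mu) / (sigma + eps)
    return y_s[:, None, None] * x_norm + y_b[:, None, None]


rng = np.random.default_rng(0)
channels = 4
x = rng.normal(size=(channels, 8, 8))             # toy feature map
w_prime = rng.normal(size=16)                     # toy intermediate vector W'

# Random stand-in for the learnable affine transformation A: W' -> (y_s, y_b).
affine = rng.normal(size=(2 * channels, 16)) * 0.1
style = affine @ w_prime
y_s = style[:channels] + 1.0                      # offset keeps scales near 1 (assumption)
y_b = style[channels:]

out = adain(x, y_s, y_b)
print(out.shape)
```

After AdaIN, each channel has mean `y_b` and standard deviation `|y_s|`, which is how W' overrides the statistics of the convolution output.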

• Discriminator
The discriminator acts like an inverse of the generator: through filtering, downsampling, and convolution, it reduces the input image to a feature map of the same size as the generator's initial input. The feature map is then scored by a fully connected layer to judge the quality of the input image.

• Loss function
The loss function consists of two parts: the generator loss and the discriminator loss. The generator uses a standard loss function. The discriminator loss uses WGAN-GP [15], which improves the stability of the model training process by combining the Wasserstein loss with a gradient penalty. The loss functions are as follows:

$$L_G = -\mathbb{E}_{G(z) \sim p_g}[D(G(z))] \quad (3)$$

$$L_D = \mathbb{E}_{G(z) \sim p_g}[D(G(z))] - \mathbb{E}_{x \sim p_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \quad (4)$$

where $p_g$ denotes the distribution of generated data and $p_r$ denotes the distribution of real data, $\mathbb{E}_{G(z) \sim p_g}[D(G(z))]$ denotes the mathematical expectation when the generated data are used as inputs to the discriminator, $\mathbb{E}_{x \sim p_r}[D(x)]$ denotes the mathematical expectation when the real data are used as inputs to the discriminator, λ (=10) denotes the coefficient of the gradient penalty term, and $\hat{x}$ denotes the gradient penalty sample, drawn uniformly along straight lines between pairs of generated and real data points.
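The WGAN-GP terms can be illustrated with a toy critic whose gradient is known in closed form. The sketch below is an assumption-heavy illustration, not the training code used in the paper: the linear critic, the helper names, and the sample values are hypothetical, and the generator loss is assumed to be the usual Wasserstein form −E[D(G(z))], since the text only states that a standard generator loss is used.

```python
import math
import random


def critic(x, w):
    """Hypothetical linear critic D(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def critic_grad(x, w):
    """Gradient of the linear critic with respect to x (constant, equal to w).

    A real implementation would obtain this gradient by automatic differentiation.
    """
    return list(w)


def wgan_gp_losses(real, fake, w, lam=10.0, seed=0):
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    d_real = mean([critic(x, w) for x in real])
    d_fake = mean([critic(x, w) for x in fake])

    penalties = []
    for x_r, x_f in zip(real, fake):
        eps = rng.random()  # uniform interpolation coefficient in [0, 1)
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(x_r, x_f)]
        grad = critic_grad(x_hat, w)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalties.append((grad_norm - 1.0) ** 2)

    loss_d = d_fake - d_real + lam * mean(penalties)
    loss_g = -d_fake  # assumed Wasserstein generator loss
    return loss_d, loss_g


real = [[1.0, 1.0], [2.0, 0.0]]
fake = [[0.0, 0.0], [0.0, 1.0]]
w = [0.6, 0.8]  # unit-norm weights, so the gradient penalty vanishes

loss_d, loss_g = wgan_gp_losses(real, fake, w)
print(loss_d, loss_g)
```

With unit-norm critic weights the penalty term is zero, so the discriminator loss reduces to the plain Wasserstein estimate; scaling `w` away from unit norm makes the penalty grow quadratically.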

Proposed Methods
Coal foreign object detection usually uses features such as shape, texture, and color of the foreign object as the main bases for detection, which places higher demand on the quality, detail, and diversity of the generated images. StyleGAN has a clear advantage in generating high-resolution images, but there are still the following problems when directly applying it to coal foreign object image generation:

• The limited receptive field of the convolutional structure makes it difficult to learn global, long-range dependencies between features, resulting in missing details in key parts of the generated foreign object images [27] and producing artifacts, shape distortion, and foreground-background adhesion.

• The multi-level convolutional structure leads to a large number of model parameters, which increases the time and space complexity of the model training process.

Thus, to improve the quality of the generated images and assist the training of foreign object detection models, this paper improves StyleGAN to address the above problems; the structure of the improved StyleGAN is shown in Figure 4. Firstly, to match the original dataset (resolution 640 × 480), the maximum resolution of the generated images is set to 512 × 512. Secondly, to mitigate the impact of the limited receptive field on the quality of the generated images, DSAM is introduced into the last three sub-networks of the synthesis network. By learning the spatial interdependence of features, the synthesis network captures the details of image shape and texture and depicts more detailed and realistic images; by learning the channel interdependence of features, it distinguishes foreground and background information more accurately and reduces foreground-background adhesion. Finally, to address the inefficiency caused by the large number of network parameters, we replaced the standard convolution in the discriminator with DSC, which reduces the number of network parameters and improves the training efficiency.

DSAM
It has been shown that introducing the self-attention mechanism into the generative network helps to model the long-distance and multi-level dependencies in the generated images and improves the generation quality. To address low-quality problems such as artifacts, distortions, and adhesions in the foreign object images generated by StyleGAN, the DSAM [29] is introduced into the generator of StyleGAN; its principle is shown in Figure 5.

The original feature map is fed into the two modules of DSAM: the Spatial Self-Attention Module and the Channel Self-Attention Module. Each module extracts the interdependencies of features along its own dimension (spatial or channel) and adds them to the original features to obtain a refined feature map. Finally, the two output features are fused to obtain features with rich context information, thus improving the quality of the generated images. The entire process is as follows:

$$A' = A_S(A) + A_C(A) \quad (5)$$

where A denotes the input feature map, and $A_S(*)$ and $A_C(*)$ denote the spatial and channel attention operations on the input features, respectively. The final refined feature map A' is obtained by element-wise summation of the two results. The specific details of the Spatial and Channel Self-Attention Modules are shown in Figure 6.
The Spatial Self-Attention Module is shown in Figure 6a. Firstly, the feature map $A \in \mathbb{R}^{C \times H \times W}$ is fed into three 1 × 1 convolution layers to obtain new feature maps Q, K, and V ($Q, K, V \in \mathbb{R}^{C \times H \times W}$). Subsequently, Q, K, and V are reshaped to C × N (N = H × W), the transpose of Q is matrix-multiplied with K to obtain the spatial similarity between Q and K, and the result is processed by softmax to obtain the spatial attention map $S \in \mathbb{R}^{N \times N}$. Finally, the feature V is multiplied by the attention matrix S, and the result is reshaped back to C × H × W. The feature is then multiplied by a scale factor α and summed element-wise with A to obtain the spatially refined feature map E. S and E are calculated as in Equations (6) and (7), respectively.

$$S_{ji} = \frac{\exp(Q_i \cdot K_j)}{\sum_{i=1}^{N} \exp(Q_i \cdot K_j)} \quad (6)$$

where $S_{ji}$ denotes the influence of the i-th position on the j-th position.

$$E_j = \alpha \sum_{i=1}^{N} (S_{ji} V_i) + A_j \quad (7)$$

where α is initialized to 0 and gradually learns to allocate more weight, and $A_j$ denotes the original feature map at position j.
The Channel Self-Attention Module is shown in Figure 6b. Unlike the spatial attention module, the channel attention module computes the channel attention map directly from the original feature map A, because the original feature map better preserves the relationship between different channel feature maps. Firstly, the feature map A is reshaped into three features of shape $\mathbb{R}^{C \times N}$ (N = H × W), denoted as B, C, and D. Subsequently, the product of C and the transpose of D is fed into the softmax layer to obtain the channel attention map $X \in \mathbb{R}^{C \times C}$. Finally, the feature B is multiplied by the attention matrix X, and the result is reshaped back to C × H × W. The feature is then multiplied by a scale factor β and summed element-wise with A to obtain the channel-refined feature map E. X and E are calculated as in Equations (8) and (9), respectively. In these equations, $C_i$ and $D_j$ denote rows of the features C and D, and the summation runs over the C channels.

$$X_{ji} = \frac{\exp(C_i \cdot D_j)}{\sum_{i=1}^{C} \exp(C_i \cdot D_j)} \quad (8)$$

where $X_{ji}$ denotes the influence of the i-th channel on the j-th channel.

$$E_j = \beta \sum_{i=1}^{C} (X_{ji} B_i) + A_j \quad (9)$$

where β is initialized to 0 and gradually learns to allocate more weight, and $A_j$ denotes the original feature map of channel j.
From Equations (5) to (9), it is clear that DSAM is a "plug-and-play" module that does not change the size or dimension of the input features and can be seamlessly connected to any part of the model. Meanwhile, experiments showed that inserting the DSAM module into the deep sub-networks of the generator achieves better generative results. Thus, DSAM is inserted into the last three sub-networks of the generator. As shown in Figure 4, each sub-network of the generator contains a convolution operation and a noise addition operation before the AdaIN module, and our DSAM is inserted between these two operations. Firstly, the input size and dimension of the DSAM are adjusted according to the input features (the feature map sizes of the final three sub-networks are R 128×64×64 , R 64×128×128 , and R 32×512×512 , respectively). The refined features output by the DSAM are then added to noise B and fed into the AdaIN module, so that the model generates higher-quality and more diverse foreign object images.
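The two attention branches described in this section can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1 × 1 convolutions are modeled as C × C matrices applied at every position, their weights are random stand-ins, and the toy feature map size is hypothetical.

```python
import numpy as np


def softmax(m, axis):
    e = np.exp(m - m.max(axis=axis, keepdims=True))  # shifted for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(a, wq, wk, wv, alpha):
    c, h, w = a.shape
    flat = a.reshape(c, h * w)                 # C x N
    # A 1 x 1 convolution is a per-position linear map over channels.
    q, k, v = wq @ flat, wk @ flat, wv @ flat
    attn = softmax(q.T @ k, axis=0)            # attn[i, j] = S_ji, normalized over i
    refined = v @ attn                         # column j: sum_i S_ji * V_i
    return (alpha * refined + flat).reshape(c, h, w)


def channel_attention(a, beta):
    c, h, w = a.shape
    flat = a.reshape(c, h * w)                 # B, C and D are all the reshaped A
    attn = softmax(flat @ flat.T, axis=0)      # attn[i, j] = X_ji, normalized over i
    refined = attn.T @ flat                    # row j: sum_i X_ji * B_i
    return (beta * refined + flat).reshape(c, h, w)


def dsam(a, wq, wk, wv, alpha, beta):
    """Element-wise sum of the spatially and channel-refined feature maps."""
    return spatial_attention(a, wq, wk, wv, alpha) + channel_attention(a, beta)


rng = np.random.default_rng(0)
a = rng.normal(size=(8, 4, 4))                 # toy feature map, C=8, H=W=4
wq, wk, wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

out_init = dsam(a, wq, wk, wv, alpha=0.0, beta=0.0)     # scale factors at initialization
out_trained = dsam(a, wq, wk, wv, alpha=0.5, beta=0.5)  # hypothetical learned values
print(out_init.shape, out_trained.shape)
```

Because α and β start at 0, each branch initially passes the input through unchanged, and the output shape always equals the input shape, which is what makes the module easy to insert between the convolution and the noise addition.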

DSC
Depthwise separable convolution was first proposed in MobileNetV1 [30]. It reduces the computation and the number of parameters of a CNN, at the cost of a small loss in accuracy, by decomposing the standard convolution into a Depthwise (DW) Convolution and a Pointwise (PW) Convolution. For GANs, the discriminator's outputs for real and fake samples only determine the loss value that is back-propagated through the current GAN model, and thus have little impact on the quality of the generated images [26]. Therefore, in this paper, we introduced the depthwise separable convolution into StyleGAN's discriminator to speed up the overall training of the network while reducing its complexity. The depthwise separable convolution structure is shown in Figure 7.

Sensors 2023, 23, 374

As can be seen from Figure 7, DSC is mainly divided into two parts: DW convolution and PW convolution. The DW convolution differs from the standard convolution in that each convolution kernel is responsible for only one channel, which reduces its complexity by a factor equal to the number of output channels. The PW convolution is a standard convolution with a kernel size of 1 × 1, which reduces its complexity by a factor equal to the product of the kernel's height and width. Combined, the two convolutions achieve a feature extraction effect similar to that of the standard convolution while reducing the model's computation and parameters. For example, suppose the input feature size is $C_F \times C_F \times M$, the output feature size is $C_F \times C_F \times N$, and the convolution kernel size is $C_K \times C_K \times M \times N$ (N kernels, each spanning M input channels). The computational amount and number of parameters of the standard convolution are:

$$C_{SC} = C_K \times C_K \times M \times N \times C_F \times C_F \quad (10)$$

$$P_{SC} = C_K \times C_K \times M \times N \quad (11)$$

where $C_{SC}$ and $P_{SC}$ denote the computational amount and number of parameters of the standard convolution, respectively.
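The computational amount and number of parameters of DSC are:

$$C_{DSC} = C_K \times C_K \times M \times C_F \times C_F + M \times N \times C_F \times C_F \quad (12)$$

$$P_{DSC} = C_K \times C_K \times M + M \times N \quad (13)$$

where $C_{DSC}$ and $P_{DSC}$ denote the computational amount and number of parameters of DSC, respectively; in each equation, the first term corresponds to the DW convolution and the second term to the PW convolution.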
The complexity ratio of DSC to the standard convolution can be derived by combining Equations (10) to (13):

$$\frac{C_{DSC}}{C_{SC}} = \frac{C_K \times C_K \times M \times C_F \times C_F + M \times N \times C_F \times C_F}{C_K \times C_K \times M \times N \times C_F \times C_F} = \frac{1}{N} + \frac{1}{C_K^2} \quad (14)$$

$$\frac{P_{DSC}}{P_{SC}} = \frac{C_K \times C_K \times M + M \times N}{C_K \times C_K \times M \times N} = \frac{1}{N} + \frac{1}{C_K^2} \quad (15)$$

As can be seen from Equations (14) and (15), DSC substantially reduces both the computational amount and the number of parameters compared to the standard convolution, and the reduction becomes more pronounced as the number of output channels N and the kernel size $C_K$ grow, that is, as the complexity of the network increases. In this paper, we added the DSC to the discriminator as shown in Figure 4. Except in the last sub-network, each DSC is placed before the downsampling operation, where it replaces the standard convolution; in the last sub-network, the DSC is added before the fully connected layer (FC). In this way, the time and space complexity of the model is reduced.
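The cost comparison between standard convolution and DSC can be checked numerically. The sketch below counts multiplications and weights for one layer under the usual simplifications (no bias terms, output size equal to input size); the function names and the layer sizes are hypothetical examples, not values from the paper.

```python
def standard_conv_cost(c_f, c_k, m, n):
    """Computation and parameter count of a standard convolution.

    c_f: spatial size of the feature map, c_k: kernel size,
    m: input channels, n: output channels. Bias terms are ignored.
    """
    params = c_k * c_k * m * n
    computation = params * c_f * c_f
    return computation, params


def dsc_cost(c_f, c_k, m, n):
    """Computation and parameter count of a depthwise separable convolution."""
    dw_params = c_k * c_k * m        # depthwise: one c_k x c_k kernel per input channel
    pw_params = m * n                # pointwise: 1 x 1 convolution mixing channels
    params = dw_params + pw_params
    computation = params * c_f * c_f
    return computation, params


# Hypothetical layer: 64 x 64 feature map, 3 x 3 kernel, 128 -> 256 channels.
c_f, c_k, m, n = 64, 3, 128, 256
comp_sc, params_sc = standard_conv_cost(c_f, c_k, m, n)
comp_dsc, params_dsc = dsc_cost(c_f, c_k, m, n)

param_ratio = params_dsc / params_sc
comp_ratio = comp_dsc / comp_sc
print(params_sc, params_dsc)         # 294912 33920
print(param_ratio, comp_ratio)       # both equal 1/n + 1/c_k**2
```

For a 3 × 3 kernel the ratio is dominated by the 1/9 term, so a single layer of this kind keeps only about 11.5% of the original parameters and computation.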

Dataset
The training samples of the model were divided into two parts. The first part was collected in the field at the mine area and mainly includes coal, gangue, and a small number of other samples; the second part was collected from scrap recycling and mainly includes bags, wood, and iron. Our dataset consists of a total of 7825 images (1009 coal, 1050 gangue, 1495 wood, 1590 bag, 856 iron, and 1825 multi-target images) with a resolution of 640 × 480. Part of the foreign object image data is shown in Figure 8.

Experimental Settings
Experiments were executed on Ubuntu 18.04 with the following configuration: the graphics card is an Nvidia Tesla A100 with 40 GB of video memory, the processor is an Intel Xeon 4212R with 128 GB of RAM, and the training environment is Python 3.9 + PyTorch 1.9.0 + CUDA 11.6. The Adam optimizer was used to train the model for 80 K iterations with a batch size of 24. The learning rate of the synthesis network and discriminator is set to 0.0005, and the learning rate of the mapping network is set to 0.005. Figure 9 shows the loss curve of our model; the loss fluctuates more in the early stage of training because the GAN needs to train two networks at the same time. As training proceeds, the losses of the generator and discriminator gradually approach each other and level off after a certain number of iterations.

Experiments consisted of two major parts. The first part was an intuitive evaluation of the performance of StyleGAN-DSAD. Firstly, the quality and diversity of the generated images were evaluated using the Inception Score (IS) [31] and the Frechet Inception Distance (FID) [32], the most credible metrics in the field of GANs; a higher IS and a lower FID indicate better quality and diversity of the generated images. Secondly, the model complexity was evaluated in terms of the model parameters (Params) and the total model training time (time/min). The second part tested the practical effect of data augmentation on the foreign object detection task. The instance segmentation model BlendMask [33] was chosen as the test model, and the average segmentation accuracy APmask and the average detection accuracy APbox were used to evaluate the instance segmentation results.
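To make the IS metric concrete, here is a minimal sketch of its standard definition, the exponential of the mean KL divergence between each image's class distribution p(y|x) and the marginal p(y). In practice the class probabilities come from a pretrained Inception classifier; in this sketch they are passed in directly, and the function name and toy inputs are assumptions for illustration only. FID, which compares the mean and covariance of Inception features of real and generated images, is not sketched here.

```python
import math


def inception_score(probs):
    """Compute IS from per-image class probabilities.

    probs: list of lists, one probability vector p(y|x) per generated image.
    """
    n_images = len(probs)
    n_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[c] for p in probs) / n_images for c in range(n_classes)]
    # Mean KL(p(y|x) || p(y)); zero-probability terms contribute nothing.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(
            pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0
        )
    mean_kl /= n_images
    return math.exp(mean_kl)


# Toy inputs (hypothetical): confident and varied predictions vs. uninformative ones.
confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5]] * 4
print(inception_score(confident_and_diverse))
print(inception_score(uninformative))
```

The score is highest when each image is classified confidently (quality) and the predicted classes are spread evenly (diversity), which is why a higher IS is read as better.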

Performance Evaluation of StyleGAN-DSAD
To verify the performance of the improved model in the image data augmentation task, GAN models are trained for five types of foreign objects and multi-target foreign objects, respectively. Subsequently, the quality, diversity, and model complexity of the images generated by each model were evaluated comprehensively. The comparison models were selected as DCGAN, ProGAN, BigGAN, StyleGAN, StyleGAN

• Comparison of model generation quality and diversity
The quality and diversity comparison results of the different models are shown in Figures 10 and 11. It can be seen from the figures that the FID and IS of DCGAN and ProGAN are far behind those of the other networks, so these models are not suitable for foreign object image generation. BigGAN is slightly better than StyleGAN in IS, but its FID is considerably worse than StyleGAN's for three types of foreign objects (coal, gangue, and woven bag). This indicates that BigGAN has certain advantages in generating high-quality images, whereas StyleGAN achieves a better balance between image quality and diversity.

• Comparison of model complexity
The complexity of the StyleGAN model before and after optimization is compared in Figure 12. G-Params, D-Params, and A-Param indicate the parameters of the generator, discriminator, and whole model (sum of G-Params and D-Params), respectively, and Time indicates training time. As can be seen from the figure, the discriminator accounts for the larger share of StyleGAN's parameters, about 65.8% of the whole network, so it is necessary to reduce the parameters of the discriminator. After the introduction of DSAM in the generator, the parameters of the generator increased by about 11.8% relative to the original generator, and the total training time of the model increased by 157 min, about 4% of the original training time. After replacing the standard convolution with a depthwise separable convolution, the parameters of the discriminator were reduced to 9.5% of the original discriminator, the total parameters of the model to 40.5% of the original model, and the total training time to 57.7% of the original. In summary, the introduction of depthwise separable convolution effectively reduces the number of parameters and the training time of the model, lowering its time and space complexity.
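The percentages quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 11.8% figure is a relative increase of the generator's parameters and that the generator and discriminator together make up the whole model; the variable and function names are illustrative.

```python
D_SHARE = 0.658        # discriminator share of all StyleGAN parameters
G_SHARE = 1 - D_SHARE  # remaining share belongs to the generator
G_GROWTH = 0.118       # relative generator increase after adding DSAM (assumed reading)
D_KEPT = 0.095         # fraction of discriminator parameters left after DSC


def total_ratio(use_dsam, use_dsc):
    """Total parameters relative to the original StyleGAN."""
    g = G_SHARE * ((1 + G_GROWTH) if use_dsam else 1)
    d = D_SHARE * (D_KEPT if use_dsc else 1)
    return g + d


print("DSC only:   ", round(100 * total_ratio(False, True), 1), "%")
print("DSAM only:  ", round(100 * total_ratio(True, False), 1), "%")
print("DSAM + DSC: ", round(100 * total_ratio(True, True), 1), "%")
```

Under these assumptions the three cases come out at about 40.5%, 104.0%, and 44.5% of the original parameter count, in line with the 40.5% stated here and with the roughly 4% increase and 44.5% total quoted in the overall performance evaluation.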


• Overall performance evaluation
Table 1 shows the overall performance of the model after adding DSAM and DSC (IS and FID in the table are the average metrics of the generative models for each category). As the third row of Table 1 shows, after adding only the DSAM module, the IS and FID of the model improve significantly, while the number of parameters and the training time increase by only about 4% and 3.8%, respectively. This is because the DSAM is added only at the last three layers of the generative model, which incurs a small complexity cost. As the fourth row shows, the IS and FID each fluctuate only slightly after adding the DSC module, indicating that the discriminator is not the main factor affecting the quality of the generated images and that compressing its convolutional structure can substantially improve training efficiency and reduce memory usage without changing the generation quality. As the fifth row shows, after adding the DSAM and DSC modules simultaneously, the quality and diversity of the images generated by StyleGAN-DSAD are almost the same as those of StyleGAN-DSAM, but the number of parameters is reduced to 44.5% of the original model and the training time to 58.8%, so the comprehensive performance of the model is optimized.

To visually demonstrate the effect of the improved model, six types of foreign object images generated by StyleGAN and StyleGAN-DSAD were selected for comparison. As shown in Figure 13, for each type, the first row of images was generated by StyleGAN and the second row by StyleGAN-DSAD. In each group, the StyleGAN images clearly show teardrop artifacts, as well as adhesion between foreground and background and shape distortion. For example, in column 2 of Figure 13a, the coal generated by StyleGAN seriously overlaps with the background, and in column 3, the shape and color are so distorted that the generated image is no longer recognizable as coal. Similar problems appear in columns 3, 4, and 5 of Figure 13b, columns 3 and 4 of Figure 13c, columns 3, 4, and 5 of Figure 13d, and column 4 of Figure 13e,f. In contrast, with StyleGAN-DSAD, the improved generator makes the texture, contour, and color of the generated images clearer and closer to the original dataset. At the same time, problems such as foreground-background adhesion and shape distortion are substantially reduced.

Practical Effects of Data Augmentation for Foreign Object Detection
To further demonstrate the effectiveness of our research, the instance segmentation framework BlendMask was selected as the foreign object detection model to test the actual effect of data augmentation. The original dataset is the same as the one used to train StyleGAN, and 20% and 10% of it are randomly selected as the validation set and test set, respectively. Different training sets are substituted for experimental comparison. To balance the dataset, the final ratio of the amplified foreign object data is kept at 1.2:1.2:1:1:1.2:1 for coal, gangue, bag, iron and wood. The hardware environment for BlendMask training remains unchanged, and the deep learning environment is the detectron2 framework under Python 3.6 + PyTorch 1.7.0. The training details are as follows. The backbone weights are initialized from a ResNet101 model pre-trained on ImageNet to accelerate convergence. SGD is used to train the model for 48 K iterations with a batch size of 8. The learning rate follows a warm-up strategy with the initial learning rate set to 0.002, and it is reduced by a factor of 10 at 60% and 90% of the total number of iterations. The remaining hyperparameters are kept the same as in the original paper.
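The learning rate schedule described above can be sketched as a step schedule with a linear warm-up. The base rate, iteration count, and decay points follow the text; the warm-up length and starting factor are not stated in the paper, so `WARMUP_ITERS` and `WARMUP_FACTOR` below are assumptions chosen for illustration.

```python
BASE_LR = 0.002
MAX_ITER = 48_000
# Decay points at 60% and 90% of training, computed with integer arithmetic.
MILESTONES = (MAX_ITER * 6 // 10, MAX_ITER * 9 // 10)
WARMUP_ITERS = 1000    # assumption: not given in the paper
WARMUP_FACTOR = 0.001  # assumption: not given in the paper


def learning_rate(iteration):
    """Learning rate at a given iteration: linear warm-up, then step decay."""
    # Divide by 10 once for every milestone already reached.
    lr = BASE_LR * 0.1 ** sum(iteration >= m for m in MILESTONES)
    if iteration < WARMUP_ITERS:
        # Interpolate linearly from WARMUP_FACTOR * lr up to lr.
        alpha = iteration / WARMUP_ITERS
        lr *= WARMUP_FACTOR * (1 - alpha) + alpha
    return lr


for it in (0, 500, 1000, 28_800, 43_200):
    print(it, learning_rate(it))
```

With 48 K iterations the two decay points fall at iterations 28,800 and 43,200, so the rate drops from 0.002 to 0.0002 and then to 0.00002.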
Firstly, the usefulness of our method for coal foreign object detection is verified by comparing the results of training on generated images with those of training on real images; the experimental results are shown in Table 2. As can be seen from the table, although there is a certain gap compared with training on real data, using the generated data as the training set can still improve the foreign object detection accuracy, and the training effect of the generated data gradually improves as the amount of training data increases. When the generated data is expanded to 10,000 images, the model accuracy no longer improves; the APbox is 71.9% and the APmask is 62.6%, which are 3.9% and 4.9% lower than the optimal results on the real data. This indicates that, although the generated images can support the training of the foreign object detection model to a certain extent, training the generative model is essentially a process of learning the distribution of the real dataset, so the generated data cannot completely replace the real data. Secondly, to compare the generation method of this paper with traditional data augmentation methods, experiments were conducted under different augmentation methods. The traditional data augmentation methods used are shown in Figure 14 and mainly include geometric transformation, Cutout, CutMix, and so on.
The comparison results are shown in Figures 15 and 16, where the horizontal coordinates indicate the original dataset plus different amounts of augmentation data. It can be seen that both augmentation methods improve the performance of the foreign object detection model, but the generation method is more effective. The images obtained by the traditional augmentation method are derived from the original images and lack diversity, while the generation method improves the diversity of the dataset to a certain extent and thus better facilitates model training. When the traditional data augmentation method has added roughly as many images as the original dataset contains, the model accuracy no longer improves, and the final APbox is improved by 2.4% and APmask by 1.6%. With the generation method, the model accuracy essentially stops improving once the data is expanded to 8000 images, and the model performance reaches its optimum at 10,000 images, where the APbox is improved by 5.8% and APmask by 4.5%.

Figure 16. Comparison of APmask.

The PR curves of each type of foreign object under different data augmentation methods are shown in Figure 17. From Figure 17d,f, it can be seen that the accuracy on the three datasets is close because the characteristics of bag and wood are more obvious, and the accuracy of our method is slightly higher. From Figure 17a-c,e, it can be seen that the overall accuracy of the model and the detection accuracy of coal, gangue, and iron are significantly improved after using our method compared with non-augmentation and traditional augmentation methods, which indicates that our method better promotes the detection of each type of foreign object.

The actual detection results of the model before and after data augmentation using the method in this paper are shown in Figure 18. After the data augmentation, the foreign object detection effect is significantly improved: the segmentation contour is clearer, and false detections and missed detections are effectively reduced.

Conclusions
Our goal was to solve the problems of difficult training and low detection accuracy of coal foreign object detection models caused by the lack of datasets. In this paper, we perform foreign object data augmentation by image generation in order to improve the quality and diversity of foreign object datasets and facilitate foreign object detection model training. Specifically, a high-quality foreign object image generation method based on StyleGAN-DSAD is proposed. Firstly, the quality and diversity of the generated images are improved by introducing a dual self-attention module, which reduces the artifacts, shape distortion, and foreground-background adhesion of the generated images. Secondly, the number of parameters of the model is greatly reduced and its training efficiency is improved by replacing the convolutional structure of the discriminator with a depthwise separable convolution. Finally, high-quality foreign object images with a resolution of 512 × 512 are generated to expand the coal foreign object dataset. The experimental results show that the improved model effectively improves the quality and diversity of the generated images while greatly reducing the complexity of the model. After data augmentation, the performance of the foreign object detection model is significantly improved compared with both non-augmentation and traditional augmentation methods, which effectively reduces false detections and missed detections and demonstrates the feasibility of applying StyleGAN-DSAD to the field of coal foreign object detection.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to project confidentiality.