DCGAN-Based Image Data Augmentation in Rawhide Stick Products’ Defect Detection

Ding, Shuhui; Guo, Zhongyuan; Chen, Xiaolong; Li, Xueyi; Ma, Fai

doi:10.3390/electronics13112047

by Shuhui Ding ¹, Zhongyuan Guo ¹, Xiaolong Chen ¹, Xueyi Li ¹,* and Fai Ma ²

¹ College of Mechanical and Electronic Engineering, Shandong University of Science and Technology, Qingdao 266590, China

² Department of Mechanical Engineering, University of California, Berkeley, CA 94709, USA

* Author to whom correspondence should be addressed.

Electronics 2024, 13(11), 2047; https://doi.org/10.3390/electronics13112047

Submission received: 17 April 2024 / Revised: 21 May 2024 / Accepted: 23 May 2024 / Published: 24 May 2024

(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network)

Abstract

The online detection of surface defects in irregularly shaped products such as rawhide sticks, a kind of pet food, is still a challenge for the food industry. Developing deep learning-based detection algorithms requires a diverse defect database, which is crucial for artificial intelligence applications. Acquiring a sufficient amount of realistic defect data is challenging, especially at the beginning of production, because defects occur only occasionally and are costly to collect. Herein, we present a novel image data augmentation method that generates a sufficient number of defect images. A Deep Convolution Generation Adversarial Network (DCGAN) model based on a Residual Block (ResB) and Hybrid Attention Mechanism (HAM) is proposed to generate large numbers of defect images for the training of deep learning models. Starting from a DCGAN, a ResB and a HAM are incorporated into both the generator and the discriminator. The ResB extracts deep image features and the HAM strengthens the important feature information, while the Wasserstein distance with a gradient penalty is used to calculate the loss function that updates the model training parameters; together, these changes improve the quality of the generated images and the stability of the model. The approach is validated by generating augmented defect image data and comparing it with other methods, such as a DCGAN and WGAN-GP, on a rawhide stick experimental dataset.

Keywords:

image data augmentation; defect detection; deep convolution generation adversarial network (DCGAN); residual block (ResB); hybrid attention mechanism (HAM)

1. Introduction

With the development of deep learning technology, image-based product defect detection has been widely applied in industrial production. As an emerging light industry, the pet food manufacturing industry has developed rapidly in recent years, and various intelligent technologies have been applied in it. Rawhide sticks are a kind of pet food made from rawhide, and enterprises mainly produce them on semi-automatic lines. Given the constraints regarding the raw materials, production processes, and production environment, defects such as surface stains, black cores, irregular shapes, and bending are inevitable in the finished product. Keeping defective products out of the market is an important way to reduce customer complaints, protect pet health, and improve product quality. Therefore, pet food manufacturers have invested significantly to improve the efficiency and accuracy of testing and reduce the testing costs. However, because rawhide stick products are irregular in shape, traditional automatic detection methods are difficult to implement, and manufacturers currently rely on manual detection. Neural network models for online product defect detection can replace manual detection, improving the detection efficiency and accuracy. However, due to the lack of a sufficient number of rawhide stick defect images, it is difficult to train an ideal rawhide stick defect detection model. We aim to establish a data augmentation method that utilizes Generative Adversarial Networks (GANs) to generate images and augment the existing images with typical defects, thereby establishing the training and test datasets that are used to train deep learning models for rawhide stick defect detection.

Traditional data augmentation expands the training samples by applying geometric affine transformations, color space transformations, rotation and distortion transformations, polar coordinate transformations, and other methods to the existing image samples. These methods are simple to operate but cannot enrich the distribution of samples in high-dimensional feature spaces. GANs can combine the high-dimensional features from the existing dataset to generate images that differ from those in the original dataset, providing more image features for model training and improving the classification performance of the model. A traditional Deep Convolutional Generation Adversarial Network (DCGAN) can be used to generate rawhide stick defect images, but it suffers from problems such as an unstable training process and poor generated image quality. Based on the above, this article proposes a rawhide stick defect image augmentation method based on a DCGAN. The model applies a Residual Block (ResB) and the Hybrid Attention Mechanism (HAM) to the generator and discriminator to improve the quality of the generated rawhide stick defect images by extracting deep-level image features; the loss function is calculated using the Wasserstein distance with a gradient penalty and is used to update the model training parameters.

The remainder of the paper is organized as follows. Section 2 reviews the literature relating to the GAN, ResB, and attention mechanism approaches. In Section 3, a novel method of image augmentation is introduced. The proposed method is validated through a case in Section 4, followed by the conclusion in Section 5.

2. Literature Review

Images that are completely different from the original dataset can be generated by a GAN, providing more image features for model training and improving the classification performance of the model. Scholars have conducted active research in fields such as GANs and their derivative models, image data augmentation based on GANs, residual networks, and attention mechanisms.

2.1. GANs and Their Derivative Models

Goodfellow et al. [1] proposed the GAN, consisting of a generator and a discriminator. The former generates fake data, while the latter distinguishes between fake and real data. Both are trained simultaneously in an adversarial manner until reaching a Nash equilibrium state [2]. A GAN can generate fake images similar to the training set images, but, due to the implicit training process of GANs, the specific parameters of the model cannot be determined, resulting in mode collapse [3], unstable training, and low image quality during GAN training.

With the optimization and development of the GAN, various improved variants based on GANs continue to emerge. Ratliff et al. [4] combined a traditional GAN with a Convolutional Neural Network (CNN) to propose the DCGAN approach, which uses multi-layer convolution to replace the fully connected layer in traditional GANs when extracting image features. This improved the training stability and generated image quality but did not fully solve problems such as the difficult training and unstable gradients of the traditional GAN.

During the GAN training process, due to the lack of overlap between the real data distribution and the generated data distribution, the Jensen–Shannon (JS) divergence will remain unchanged, leading to the phenomenon of gradient disappearance. In response to the aforementioned issues, Mao et al. [5] proposed Least Squares Generative Adversarial Networks (LSGANs), replacing the cross-entropy loss function with the least squares loss function to enhance the quality of the generated images, yet this failed to fundamentally address the instability problem in traditional GAN training. Arjovsky et al. [6] proposed the Wasserstein GAN (WGAN), which employs the Wasserstein distance instead of the JS divergence to assess the disparity between real and generated image distributions, mitigating the instability in traditional GAN training, yet without explicitly providing a calculation method for the Wasserstein distance. Gulrajani et al. [7] proposed the WGAN-GP (Wasserstein GAN-Gradient Penalty) on the basis of the WGAN, which uses the Wasserstein distance with a gradient penalty instead of the weight clipping strategy to improve the training speed and convergence of the GAN, but it did not fundamentally solve the problem of the 1-Lipschitz restriction of the discriminator. Wei et al. [8] proposed CTGAN, a Wasserstein GAN improvement scheme based on the WGAN-GP, which adds a new penalty term to the discriminator to make the loss function of the discriminator more Lipschitz-continuous, addressing the problem that the simple WGAN-GP penalty term cannot be effective throughout the training process. Miyato et al. [9] proposed the Spectral Normalization Generative Adversarial Network (SNGAN), which constrains the spectral norm of the parameter matrix W for each layer of the discriminator and thereby constrains the Lipschitz constant of the discriminator, limiting the gradient within a fixed range to improve the training stability of the GAN, although the quality of the generated images is not high. Wu et al. [10] proposed Gradient Normalization for Generative Adversarial Networks (GNGAN), which applies a gradient norm constraint only to the function of the discriminator, improving the capacity of the discriminator.

In order to further improve the image generation quality of GANs, researchers have gradually incorporated Attention Mechanisms (AMs) into GANs. Zhang et al. [11] proposed the Self-Attention Generative Adversarial Network (SAGAN), which, for the first time, introduces a self-attention module in the generator and discriminator, solving the problems of unclear local details in the image distant space and training stability. Wu et al. [12] proposed the Twin Attention mechanism-based Generative Adversarial Network (TAGAN), which simulates the dependency relationship between the local and global features to model real images and generate realistic fake images. Liu et al. [13] proposed a faster and more stable GAN (Towards Faster and Stable GAN training for High Fidelity few shot Image Synthesis, FastGAN), which is a classic model for generating high-resolution images with fewer samples. Hinz et al. [14] proposed ConSinGAN as an improved version of the SinGAN (Learning a Generative Model from a Single Natural Image), enhancing the efficiency of the generative model. Chen et al. [15] proposed a diversity image style transformation framework by implementing reversible cross-spatial mapping to achieve significant diversity.

There is also a great deal of research on GAN-based image data augmentation. Zheng et al. [16] used a DCGAN for data augmentation to improve the pedestrian recognition accuracy. However, the generated image quality was not high. Shi et al. [17] proposed a data augmentation method using a StyleGAN to improve the training data. In the model, styles and semantic labels are extracted from the dataset, and, for each semantic label, a style is randomly chosen to synthesize enhanced CT images, with the experiments confirming its ability to generate realistic nodule samples and enable accurate nodule segmentation. Tran et al. [18] proposed a data augmentation optimization GAN framework to enhance the GAN learning by capturing the original dataset’s distribution, utilizing various data augmentation techniques to improve the stability of both the generators’ and discriminators’ learning processes. Upadhyay et al. [19] developed a novel deep learning framework for automatic borescope inspection, which partially solves the problems of data imbalance, small defects, and data availability by testing different loss functions and using customized Generative Adversarial Networks to generate composite images. He et al. [20] achieved a controllable facial super-resolution reconstruction model by designing a style modulation module and a feature modulation module. Grigorev et al. [21] proposed a 3D character object model based on one or more full-body images of the human body that corresponds to the clothing. The model achieved good generation results by combining polygon network modeling with neural rendering, but its modeling efficiency was low. Jiang et al. [22] proposed the GPEN (GAN Prior Embedded Network) model, which first learns a GAN for generating high-quality face images and then embeds it into a U-shaped Deep Neural Network (DNN) as a prior decoder. By feeding in low-quality face images and fine-tuning the GAN embedded in the DNN, a high-resolution reconstructed face image can be obtained, solving the problem of excessive smoothness in GAN-based super-resolution reconstruction networks. Esser et al. [23] proposed a new network model that combines the inductive bias of CNNs with the high expressive power of the Transformer, uniting a CNN’s ability to effectively extract the global and local visual features of images with the Transformer’s ability to learn long-distance interaction information and improving the visual effect of the generated images. Suthar et al. [24] focused on using a multi-scale SinGAN model to generate additional Kurtogram images to effectively train machine learning models for bearing composite fault diagnosis in the presence of limited experimental data. Jalayer et al. [25] addressed imbalanced fault data by using the Wasserstein GAN and WGAN-GP to generate composite samples to fill in rare fault categories and enrich the training set. Kim et al. [26] utilized a GAN to expand the infrared small-target detection dataset and demonstrated through experiments that a network trained on a mix of real and synthetic data is better than a network trained on real data alone. It can be seen that data augmentation using a GAN effectively improves the detection performance.

In summary, many studies have been conducted on GANs in image data augmentation, but GANs still face problems, such as unstable training and low image quality during data augmentation.

2.2. Residual Network

In the training process of ordinary CNNs, increasing the depth of the network can lead to problems such as gradient vanishing, which can cause the network not to converge and therefore fail to extract deeper features. To address the aforementioned issues, He et al. [27] proposed a residual network that solves the problem of the functional degradation caused by the increasing network depth in ordinary CNNs. The basic structure of a residual network is the ResB, which uses skip connections to improve the information exchange ability of each layer of the network, effectively solving the problems of gradient disappearance and model degradation caused by too many layers of the model, and to some extent accelerating the convergence speed of the neural network.
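The effect of skip connections on gradient flow can be illustrated with a toy calculation. The sketch below is not taken from the cited work; it assumes scalar layers, identity skip connections, and a hypothetical per-layer derivative of 0.1, and simply compares the chain-rule products of a plain stack and a residual stack.

```python
def plain_gradient(layer_derivs):
    """Chain rule for a plain stack y = f_n(...f_1(x)): product of f_i'."""
    grad = 1.0
    for d in layer_derivs:
        grad *= d
    return grad


def residual_gradient(layer_derivs):
    """Chain rule for a residual stack y_i = y_(i-1) + f_i(y_(i-1)):
    each block contributes a factor (1 + f_i') because of the identity skip."""
    grad = 1.0
    for d in layer_derivs:
        grad *= 1.0 + d
    return grad


# Hypothetical derivatives: 20 layers, each with a small local derivative.
derivs = [0.1] * 20
g_plain = plain_gradient(derivs)      # shrinks towards zero as depth grows
g_res = residual_gradient(derivs)     # stays away from zero
print(g_plain, g_res)
```

In this simplified setting, the plain product decays geometrically with depth, while the residual product does not, which is the intuition behind using ResBs in deeper generators and discriminators.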

Park et al. [28] proposed a new shortcut method to construct the structure of GANs, which inherits the advantages of residual modules and further improves the performance of GANs. Based on the gating mechanism, the proposed method preserves the information related to the generated image in the remaining blocks. Zhu et al. [29] proposed a residual network structure with spectral normalization, constructed a new generator model and discriminator model, and introduced the Wasserstein distance with a gradient penalty, resulting in enhanced image features compared to the conventional methods and effectively boosting the classification network recognition accuracy. Ling et al. [30] proposed a generative adversarial network with a hybrid attention mechanism group residual module, which embeds a hybrid attention mechanism to adaptively learn the key region features and enhances the learning ability of the image key regions. Li et al. [31] proposed a GAN based on ResBs and a self-attention mechanism. By introducing ResBs, the characteristics of the different-scale receptive fields in signals are extracted and the data dimensions are expanded and reduced. By introducing the self-attention mechanism, the time correlation between the discrete moments is excavated. The GAN has the ability to generate close to real EEG samples.

The residual module enables the network to learn deep-level feature information and avoid the performance degradation caused by increasing network depth through skip connections, providing an effective method to alleviate the degradation problem of deep neural networks and providing useful experience for the design of subsequent deep neural networks. The residual module has become a very important network structure and has been widely applied in multiple fields of computer vision.

2.3. AM

An AM is a resource allocation method that allocates computing resources reasonably and mitigates information overload. It captures the weight differences between pieces of information to measure their importance and dynamically adjusts the weights, strengthening attention to important feature information, ignoring unimportant feature information, and improving the algorithm efficiency.

In deep neural networks, different channels in different feature maps often represent different objects. Woo et al. [32] combined channel attention with spatial attention, where channel attention dynamically calibrates channel weights to enhance channel feature expression, and spatial attention adaptively selects spatial regions of interest, thus preserving the positional information of important features. However, due to the use of multiple pooling operations, the texture and spatial information in the image is lost, which directly affects the texture details of the generated image. Hu et al. [33] proposed a Squeeze and Excitation Network (SENet) that generates a weight for each feature channel through learned parameters and uses the weights to represent the importance of each channel, thus recalibrating the original features in the channel dimension and achieving excellent performance with low computational complexity. Wang et al. [34] proposed a local cross-channel information exchange strategy without dimensionality reduction and an Efficient Channel Attention (ECA) mechanism with an adaptive one-dimensional convolution kernel size, which efficiently achieves SENet-like effects. Hou et al. [35] proposed a Coordinate Attention (CA) mechanism that embeds position information into channel information, which decomposes the channel attention mechanism into two parallel 1D feature codes to effectively integrate the spatial coordinate information and generate attention feature maps. Compared to the SENet, the CA module can process features across channels and is more purposeful. Wang et al. [36] proposed a dual discrimination generative adversarial network combining hybrid attention and designed a hybrid attention mechanism to fully capture the image feature information from two dimensions, generating more realistic and detailed images. Yang et al. [37] proposed a generative adversarial network based on a hybrid attention mechanism for data augmentation, which improves the ability of GANs to generate image details by relating the remote features in images and further increases the stability of training. Wang et al. [38] proposed a multi-scale fusion self-attention GAN to address the degradation of image quality and of the accuracy of visual processing algorithms caused by bad weather. By introducing self-attention and using different scales to extract the input features of the rain line, the network pays more attention to the extracted features of different scales. The experiment shows that the introduction of a self-attention mechanism can achieve a better denoising effect. Zhang et al. [39] designed a multi-scale spatial AM based on a GAN to process the changes in underwater image scenes, enhancing the ability of the network to extract semantic and contextual information and improving the similarity between the generated image and the original image.

2.4. Discussion

The traditional GAN model and derivative model mentioned above, as generative models, rely on adversarial learning through generators and discriminators to generate samples that approximate the true distribution of training data. However, the GAN model is implicitly trained, and there are still problems, such as mode collapse and the instability during its training process. Many studies have been conducted on GAN-based image data augmentation, but GAN models still face the problem of low generated image quality during image data augmentation, which cannot meet the needs of the model training. Therefore, it is still necessary to conduct in-depth research on GAN-based image data augmentation and develop and design efficient and high-quality GAN models to meet the needs of the subsequent model training.

A DCGAN is a product of the combination of a traditional GAN and CNN. However, considering the characteristics of rawhide stick defect images and the generation of rawhide stick images, a DCGAN still has the following problems. Firstly, the size of the rawhide stick defect image is large, and a DCGAN needs to increase the depth of the network. However, with the increase in the network depth, the phenomena of gradient disappearance and gradient explosion will occur, which will lead to network function degradation and training process instability, ultimately making it difficult for a DCGAN to stably generate high-quality rawhide stick images. Secondly, as the parameters of the DCGAN network increase, the amount of information stored in the model also increases. Due to traditional CNNs only focusing on local features, useful information may be ignored, which can easily lead to gradient vanishing during gradient updates, resulting in difficulties in network convergence and other issues. Thirdly, the image generation model of DCGAN adopts a fully convolutional structure, which cannot make the model focus on richer image features. When data augmentation is applied to rawhide stick defect images, there may be problems, such as overlapping and blurring of the generated image regions, loss of the image feature information, and low quality of the generated images.

Using rawhide stick images as training and test datasets, some pre-existing models cannot obtain ideal augmented image data. It can be seen that there is currently no applicable model for generating images of rawhide stick defects. Based on the above research and a DCGAN as the basic framework of the deep learning model, a new generative data augmentation model is established by applying a ResB, a HAM, and the Wasserstein distance loss function with a gradient penalty. This model can effectively solve the problem of generating rawhide stick defect images and provide a sufficient training dataset for the deep learning models used for rawhide stick defect detection.

3. Residual Block and Hybrid Attention Mechanism-Based DCGAN

The proposed approach utilizes an artificial intelligence model to generate an augmented dataset of rawhide stick images. It introduces a novel method based on a Residual Block and hybrid attention mechanism within a DCGAN framework (ResB–HAM–DCGAN). This method offers a unique approach to augmenting images for training rawhide stick defect detection models.

3.1. Theoretical Basis

The GAN generator converts random noise into fake data that approximate the real data, and its discriminator then distinguishes the generated fake data from the real data. The principle is shown in Figure 1. Random noise z is fed into the generator to generate fake data G(z). The discriminator receives both fake data G(z) and real data x and outputs the probability that its input is real. Then, based on the output of the discriminator, the loss function is used to calculate the loss, which is backpropagated to the generator and discriminator to update their training parameters. When the discriminator is unable to distinguish whether the input data are real or fake, the model reaches its optimal state.

The objective function of GAN is shown as follows:

\min_{G} \max_{D} V(D, G) = E_{x \sim P_{data}(x)} [\log D(x)] + E_{z \sim P_{z}(z)} [\log (1 - D(G(z)))]

(1)

where x represents real data, z represents random noise, G and D represent the generator and discriminator, P_{data}(x) represents the distribution of real samples, P_{z}(z) represents the prior distribution of the random noise from which generated samples are produced, and D(x) represents the probability, output by the discriminator, that its input x is real data.
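As a small numerical illustration of Equation (1), the sketch below estimates V(D, G) from batches of discriminator outputs. The function name `gan_value` and the sample probabilities are hypothetical and chosen only for illustration; no networks are trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Equation (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well gives a value near 0.
confident = gan_value([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])

# A discriminator that outputs 0.5 everywhere cannot tell the two apart.
confused = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(confident, confused)
```

When the discriminator outputs 0.5 for every input, the estimate equals -log 4, which is the value of the objective at the equilibrium described above.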

DCGAN combines traditional GAN with CNN and uses multi-layer convolution to replace the fully connected layer in GAN when extracting image features. DCGAN outperforms GAN in terms of training stability and generated image quality.

3.2. ResB–HAM–DCGAN Model Framework

To address the issues of unstable training and of overlapping and blurring in the generated images when using the DCGAN to augment rawhide stick images, ResB and HAM are added to the DCGAN, and the original loss function is replaced by the Wasserstein distance with gradient penalty. The resulting ResB–HAM–DCGAN model is established to solve the above issues in the original DCGAN. The network structure is shown in Figure 2, where the ResB and HAM introduced in the generator and discriminator enhance the parameter transfer between different layers and strengthen the important image feature information. Replacing the original loss function with the Wasserstein distance with gradient penalty helps maintain the training stability.

In Figure 2, ResB-G and ResB-D represent the residual blocks in generator and discriminator. Skip connections are utilized in ResB-G and ResB-D to improve the information exchange ability between different layers of the network, thereby alleviating the problems of gradient disappearance and explosion caused by the increase in the model layers and enhancing the model’s ability to extract deeper level features. HAM assesses the importance of features by capturing the weight difference between different features, dynamically adjusts the weight parameters of the corresponding region, strengthens attention to the important feature information, ignores the unimportant feature information, and enhances the detail features of the generated image. The Wasserstein distance with gradient penalty can enhance the generator’s ability to generate images.
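To make the role of the gradient penalty concrete, the following sketch evaluates the critic loss in the form given by Gulrajani et al. [7], E[D(G(z))] - E[D(x)] + λ E[(‖∇D(x̂)‖ - 1)²], for a toy critic. The critic D(x) = tanh(w · x), the sample values, and the helper names are all assumptions made for illustration; λ = 10 is the default from [7] and is not necessarily the value used in this paper. The gradient is derived by hand so that no autodiff library is needed.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def critic(w, x):
    """Hypothetical toy critic D(x) = tanh(w . x)."""
    return math.tanh(dot(w, x))


def critic_grad_norm(w, x):
    """L2 norm of dD/dx = (1 - tanh(w . x)^2) * w, derived analytically."""
    scale = 1.0 - math.tanh(dot(w, x)) ** 2
    return abs(scale) * math.sqrt(dot(w, w))


def critic_loss_gp(w, real, fake, lam=10.0, rng=None):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)|| - 1)^2]."""
    rng = rng or random.Random(0)
    wasserstein = (sum(critic(w, x) for x in fake) / len(fake)
                   - sum(critic(w, x) for x in real) / len(real))
    pairs = list(zip(real, fake))
    penalty = 0.0
    for x, x_tilde in pairs:
        eps = rng.random()
        # x_hat is a random point on the line between a real and a fake sample.
        x_hat = [eps * a + (1.0 - eps) * c for a, c in zip(x, x_tilde)]
        penalty += (critic_grad_norm(w, x_hat) - 1.0) ** 2
    return wasserstein + lam * penalty / len(pairs)


real = [[1.0, 0.5], [0.8, 0.7]]      # made-up "real" samples
fake = [[-0.2, 0.1], [0.0, -0.3]]    # made-up "generated" samples
print(critic_loss_gp([0.5, 0.5], real, fake))
```

The penalty term pushes the gradient norm of the critic towards 1 at the interpolated points, which is how the 1-Lipschitz constraint is softly enforced without weight clipping.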

The generator of the ResB–HAM–DCGAN is a residual hybrid attention network containing transposed convolutional layers, and the discriminator is a binary convolutional network containing residual hybrid attention. After feeding a random noise z that conforms to a normal distribution into the generator, a virtual image G(z) is generated. The discriminator judges the virtual image based on the distribution of real image data and finally optimizes the network parameters of the generator and discriminator based on the discrimination results until a balanced state is reached.

3.2.1. ResB

The rawhide stick images used for detection are 512 × 512, whereas the standard DCGAN generates images of only 64 × 64. In order to increase the network depth to generate the required large-size rawhide stick defect images while avoiding the gradient disappearance that comes with greater depth, ResBs are utilized in the generator and discriminator on the basis of the residual network [30]. Parameter transfer between different network layers in the model is realized through skip connections, which stabilizes the learning process and produces high-quality images. At the same time, normalization operations are added to the backbone and skip connection so that the model can learn more data features stably and efficiently. The normalization operations in the generator are Spectral Normalization (SN) and Batch Normalization (BN), and those in the discriminator are SN and Instance Normalization (IN).
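For readers unfamiliar with SN, the sketch below shows the underlying idea in the sense of Miyato et al. [9]: a weight matrix is divided by an estimate of its largest singular value, obtained by power iteration. It is a standalone illustration on a small matrix with hypothetical helper names, not the layer implementation used in the model.

```python
import math


def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W (a list of rows) by power iteration."""
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(n_iter):
        # u = W v, normalized
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(a * a for a in u))
        u = [a / u_norm for a in u]
        # v = W^T u, normalized
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        v_norm = math.sqrt(sum(a * a for a in v))
        v = [a / v_norm for a in v]
    Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(a * a for a in Wv))


def spectrally_normalize(W, n_iter=50):
    """Return W / sigma(W), whose largest singular value is approximately 1."""
    sigma = spectral_norm(W, n_iter)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]         # example matrix with singular values 3 and 1
W_sn = spectrally_normalize(W)
print(spectral_norm(W), W_sn)
```

Deep learning frameworks usually run only one or a few power-iteration steps per training update and reuse the vectors between updates; the 50 iterations here are only to make the toy estimate accurate.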

ResB-G mainly consists of deconvolution layers (including TranConv2d, SN, BN, and ReLU) and convolutional layers (including Conv2d and BN) on the backbone path, and deconvolution layers (TranConv2d, SN, and BN) on the skip connection. Finally, the outputs of the backbone path and the skip connection path are summed and passed through the ReLU activation function to increase the nonlinear fitting ability of the model. The structure of ResB-G is shown in Figure 3.

ResB-D mainly consists of convolutional layers (including Conv2d, SN, IN, and LeakyReLU) and convolutional layers (including Conv2d and IN) on the backbone path, and convolutional layers (including Conv2d, SN, and IN) on the skip connection. Finally, the outputs of the two paths are summed and then passed through the LeakyReLU activation function. The structure of ResB-D is shown in Figure 4.

3.2.2. Hybrid Attention Mechanism

CBAM [32] combines the advantages of channel attention and spatial attention (SA) to enhance the ability of channel expression and spatial region expression. In order to alleviate the problems such as low image quality and difficult model convergence caused by the increase in DCGAN parameters, on the basis of upgrading the channel attention in CBAM to Efficient Channel Attention (ECA), a Hybrid Attention Mechanism (HAM) is proposed, which combines the advantages of ECA and SA to focus on important image features, enhance the detailed features of the generated image, and stabilize the training process. The structure of HAM is shown in Figure 5.

For an input feature map F_1 \in R^{H \times W \times C}, the ECA and SA are applied sequentially to calculate the one-dimensional channel attention weight matrix M_{E}(F_1) \in R^{1 \times 1 \times C} and the two-dimensional spatial attention weight matrix M_{S}(F_2) \in R^{H \times W \times 1}, in order to capture rich image features. The above process is shown in Equations (2) and (3),

F_2 = M_{E}(F_1) \otimes F_1

(2)

F_3 = M_{S}(F_2) \otimes F_2

(3)

where \otimes represents element-wise multiplication, with the attention weights broadcast along the dimensions of size 1.

Specifically, a given input feature F_1 \in R^{H \times W \times C} is aggregated by global average pooling to obtain a one-dimensional feature x \in R^{1 \times 1 \times C}. ECA then applies a one-dimensional convolution with kernel size k, without dimensionality reduction, to achieve channel-level feature interactions and capture the local cross-channel interaction weights. Finally, the output feature map F_2 is obtained by multiplying each channel of F_1 by the corresponding weight in M_{E} \in R^{1 \times 1 \times C}. The calculation is shown as follows,

M_{E}(F_1) = σ(C1D_{k}(x))

(4)

k = ψ(C) = \left| \frac{\log_{2}(C)}{γ} + \frac{b}{γ} \right|_{odd}

(5)

where C1D represents one-dimensional convolution, k represents the size of the convolution kernel, C represents the number of channels, | \cdot |_{odd} indicates that only odd values can be taken, and γ and b are parameters, set to 2 and 1 in the paper, that change the ratio between the number of channels and the size of the convolution kernel.
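Equations (4) and (5) can be sketched in a few lines of plain Python. The rounding rule used for | · |_{odd} below (truncate, then move to the next odd number if the result is even) follows a common convention in ECA implementations and is an assumption here, as are the function names and the example convolution weights.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Equation (5): adaptive 1D kernel size, forced to be odd (assumed rounding rule)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca_weights(feature, kernel):
    """Equation (4) for a C x H x W nested-list feature map.

    kernel: list of k one-dimensional convolution weights (k odd, hypothetical values).
    Returns one sigmoid weight per channel.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    k, pad, n_ch = len(kernel), len(kernel) // 2, len(pooled)
    weights = []
    for c in range(n_ch):
        acc = 0.0
        for j in range(k):
            idx = c + j - pad
            if 0 <= idx < n_ch:          # zero padding at the channel boundaries
                acc += kernel[j] * pooled[idx]
        weights.append(1.0 / (1.0 + math.exp(-acc)))
    return weights


for c in (8, 64, 256, 512):
    print(c, eca_kernel_size(c))

feature = [[[0.0, 0.0]], [[1.0, 3.0]], [[0.0, 0.0]]]   # 3 channels, 1 x 2 each
print(eca_weights(feature, [0.0, 1.0, 0.0]))
```

With γ = 2 and b = 1, wider layers receive a slightly larger kernel, so the number of neighbouring channels that interact grows slowly with the channel count.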

In the spatial attention mechanism, the input feature map $F_2 \in \mathbb{R}^{H \times W \times C}$ undergoes global max pooling and global average pooling along the channel direction to generate two 2D maps, $F_{avg}^{s} \in \mathbb{R}^{H \times W \times 1}$ and $F_{max}^{s} \in \mathbb{R}^{H \times W \times 1}$. The two maps are concatenated to aggregate the channel information of the feature map, and a convolution operation is then applied to generate the spatial attention weight matrix $M_S(F) \in \mathbb{R}^{H \times W \times 1}$, which highlights the important spatial features. Finally, the input feature $F_2$ is multiplied by the weight matrix $M_S$ to obtain the output feature map $F_3$. The calculation is as follows,

$$M_S(F) = \sigma(f^{7 \times 7}([F_{avg}^{s}; F_{max}^{s}])) \tag{6}$$

where $\sigma$ represents the sigmoid function and $f^{7 \times 7}$ represents a convolution operation with a kernel size of 7 × 7.
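The spatial attention step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the feature map is stored as nested lists indexed `feature[c][h][w]`, zero padding is used at the borders, and a single hypothetical uniform kernel weight stands in for the learned 7 × 7 convolution weights.

```python
import math


def spatial_attention(feature, kernel_weight=1.0 / 98.0, ksize=7):
    """Apply F3 = M_S(F2) * F2 to a feature map stored as feature[c][h][w]."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    # Pool along the channel axis to obtain two H x W maps.
    avg_map = [[sum(feature[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    max_map = [[max(feature[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    pad = ksize // 2
    output = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            z = 0.0
            # 7 x 7 convolution over the two stacked maps with zero padding.
            for di in range(-pad, pad + 1):
                for dj in range(-pad, pad + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        z += kernel_weight * (avg_map[y][x] + max_map[y][x])
            weight = 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps M_S in (0, 1)
            for c in range(channels):
                output[c][i][j] = weight * feature[c][i][j]
    return output


# Toy 4-channel, 8 x 8 feature map standing in for F2.
f2 = [[[float(c + i + j) for j in range(8)] for i in range(8)] for c in range(4)]
f3 = spatial_attention(f2)
```

Because every spatial weight lies strictly between 0 and 1, the output keeps the shape of the input and only rescales each spatial position.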

3.2.3. Generators and Discriminators

The generator of ResB–HAM–DCGAN is mainly responsible for learning the probability distribution of real data and generating fake data to confuse the discriminator. Its structure is shown in Figure 6. The generator has a 10-layer network structure. Random noise (with a size of 1 × 1 × 100) is taken as input and converted into image data by the first transposed convolutional layer. The data then pass sequentially through 5 ResBs in layers 2–6, which extract deep-level image features to generate clearer and more stable images. The HAM in the seventh layer is used to extract features according to their importance, and the important features are further extracted through the eighth and ninth layers. Finally, the features pass through the tenth transposed convolutional layer, and the Tanh activation function is applied to obtain the generated image sample. To ensure the stability of model training and alleviate the vanishing gradient problem, SN and BN are used in the first transposed convolutional layer and the 6 ResBs, and the feature map is then sent to the next layer of the network through the ReLU activation function.

The structure of the ResB–HAM–DCGAN generator is shown in Table 1. Its main workflow is as follows. In the first layer, a 100-dimensional noise vector is passed through a transposed convolution with 100 input channels, 512 output channels, and a 4 × 4 convolution kernel with a step size of 1. SN and BN are then applied, followed by the ReLU activation function, so that the first layer outputs feature maps with a size of 4 × 4 × 512. In the following 8 layers, 6 ResBs and 2 HAMs are used to deepen the network and improve its ability to capture deep-level and important features. In the 10th layer, a transposed convolution with 8 input channels, 1 output channel, a 4 × 4 convolution kernel, and a step size of 2 is applied, followed by the Tanh activation function, and generated images with a size of 512 × 512 × 1 are output.

The discriminator of ResB–HAM–DCGAN is mainly responsible for distinguishing between the fake samples generated by the generator and the real samples. It has a 10-layer network structure, as shown in Figure 7. An input image, either real or generated, is first converted into feature maps by the first convolutional layer. Then, deep-level image features are extracted through 5 ResBs in layers 2–6. Next, in the seventh layer, the HAM is used to extract important features and obtain an attention feature map. The important features are further extracted through the eighth and ninth layers. Finally, the features pass through the tenth convolutional layer, and the LeakyReLU activation function is applied to output the decision result. To ensure the stability of model training and alleviate the vanishing gradient problem, SN and IN are used in the first convolutional layer and the 6 ResBs.

The structure of the ResB–HAM–DCGAN discriminator is shown in Table 2. Its main workflow is as follows. In the first layer, images with a size of 512 × 512 × 1 pass through a convolution operation followed by SN, IN, and the LeakyReLU activation function, and feature maps with a size of 256 × 256 × 8 are output. Then, 6 ResBs and 2 HAMs are used to deepen the network and improve its ability to capture local features. In the tenth layer, a convolution with 512 input channels, 1 output channel, a 4 × 4 convolution kernel, and a step size of 1 is performed, followed by the LeakyReLU activation function, to output the decision result with a size of 1 × 1 × 1.
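The layer sizes quoted for the first and last layers of the generator and discriminator can be checked with the standard output-size formulas for strided and transposed convolutions. The sketch below assumes no dilation or output padding and takes the kernel, stride, and padding values from Tables 1 and 2; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def tconv_out(size, kernel, stride, padding=0):
    """Spatial output size of a transposed convolution (no output padding or dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


g_first = tconv_out(1, kernel=4, stride=1)               # generator layer 1
g_last = tconv_out(256, kernel=4, stride=2, padding=1)   # generator layer 10
d_first = conv_out(512, kernel=4, stride=2, padding=1)   # discriminator layer 1
d_last = conv_out(4, kernel=4, stride=1)                 # discriminator layer 10
print(g_first, g_last, d_first, d_last)                  # 4 512 256 1
```

These values match the 4 × 4, 512 × 512, 256 × 256, and 1 × 1 spatial sizes given in the two tables.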

3.2.4. Improved Loss Function

During the training of a traditional DCGAN, the stronger the discriminator, the more severe the vanishing gradient of the generator, and the stronger the generator, the faster the loss of the discriminator grows. In ResB–HAM–DCGAN, the Wasserstein distance loss function with a gradient penalty replaces the JS divergence-based loss function of the traditional DCGAN, and the discriminator is constrained to satisfy a Lipschitz condition, which makes the network easier to converge.

The Wasserstein distance [10] represents the minimum energy consumed to move from one distribution to coincide with another and can measure the degree of overlap between two distributions. Compared to JS divergence, Wasserstein distance can also provide effective gradients for completely non-overlapping distributions. The calculation of Wasserstein distance is as follows:

$$W(P_{data}, P_g) = \inf_{\gamma \in \Pi(P_{data}, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\| x - y \|] \tag{7}$$

where $P_{data}$ represents the real data distribution, $P_g$ represents the generated data distribution, $\gamma$ is a joint distribution, $\inf$ denotes the infimum (greatest lower bound), $\Pi(P_{data}, P_g)$ is the set of all possible joint distributions whose marginals are $P_{data}$ and $P_g$, $x$ represents the real data, $y$ represents the generated data, and $\mathbb{E}_{(x, y) \sim \gamma}[\| x - y \|]$ is the expected distance between the real data and the generated data under the joint distribution $\gamma$.
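To illustrate why the Wasserstein distance remains informative for non-overlapping distributions, the sketch below computes the empirical Wasserstein-1 distance between one-dimensional samples, for which the optimal coupling simply pairs the sorted values. The one-dimensional, equal-sample-size setting and the toy samples are assumptions chosen for simplicity; image distributions are far higher-dimensional.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples")
    # In one dimension the optimal coupling matches the sorted samples in order.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


real = [0.0, 1.0]    # toy "real" sample
near = [2.0, 3.0]    # toy generated sample close to the real one
far = [10.0, 11.0]   # toy generated sample far from the real one
d_near = wasserstein_1d(real, near)
d_far = wasserstein_1d(real, far)
```

Both toy generated samples are disjoint from the real one, so the JS divergence would take the same value (log 2) in each case, whereas the distance here grows from 2.0 to 10.0 as the samples move further apart.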

The Wasserstein distance is continuous and differentiable almost everywhere, which enables the model to be trained to its optimal state and effectively avoids the vanishing gradient phenomenon.

The Wasserstein distance in Equation (7) cannot be solved directly, and a further transformation is needed to obtain a loss function for the GAN. Based on the Kantorovich–Rubinstein duality [10], the Wasserstein distance can be written as follows:

$$W(P_{data}, P_g) = \frac{1}{K} \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim P_{data}}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \tag{8}$$

where $\sup$ denotes the supremum (least upper bound) and $f$ is a K-Lipschitz function, that is, a function that satisfies the constraint $\| f \|_L \leq K$.

Equation (8) constrains the continuous function $f$ to satisfy $| f(x) - f(y) | \leq K | x - y |$, where $K$ is the Lipschitz constant. Equation (8) can be approximately transformed into the following,

$$K \cdot W(P_{data}, P_g) = \max_{w : \| f_w \|_L \leq K} \mathbb{E}_{x \sim P_{data}}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)] \tag{9}$$

For the set of functions $f_w$ that satisfy the K-Lipschitz constraint, substituting the Wasserstein distance formula transforms the optimization objective of the GAN into Equation (10),

$$L = \mathbb{E}_{x \sim P_{data}}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)] \tag{10}$$

There are various ways to add K-Lipschitz constraints to function f_w. The ResB–HAM–DCGAN loss function adopts a gradient penalty-based method [11]. So, the ResB–HAM–DCGAN loss function is as follows:

$$L = \mathbb{E}_{x \sim P_{data}}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)] + \lambda \mathbb{E}_{n \sim P_n}[(\| \nabla_n D(n) \|_2 - 1)^2] \tag{11}$$

where $n$ is a random interpolation between the real image data and the generated image data, $P_n$ is the distribution of these interpolated samples, $\lambda$ is the penalty coefficient, and $\nabla_n D(n)$ is the gradient of the discriminator output with respect to $n$, whose norm the penalty term constrains to be close to 1.
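The penalty term in Equation (11) can be illustrated with a small sketch. It is a simplified stand-in rather than the training code: the critics are toy functions on 2D points, the gradient is approximated by central finite differences instead of automatic differentiation, and the penalty coefficient of 10 is an assumed value that the text does not specify.

```python
import math
import random


def gradient_norm(critic, point, h=1e-5):
    """L2 norm of the critic gradient at `point`, via central finite differences."""
    squared = 0.0
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        squared += ((critic(up) - critic(down)) / (2 * h)) ** 2
    return math.sqrt(squared)


def gradient_penalty(critic, real, fake, lam=10.0, rng=random):
    """lam * mean((||grad D(n)||_2 - 1)^2) over random interpolates n."""
    total = 0.0
    for x, y in zip(real, fake):
        eps = rng.random()  # random mixing weight in [0, 1)
        n = [eps * xi + (1 - eps) * yi for xi, yi in zip(x, y)]
        total += (gradient_norm(critic, n) - 1.0) ** 2
    return lam * total / len(real)


def unit_critic(p):
    return p[0]            # gradient norm 1 everywhere, so no penalty


def steep_critic(p):
    return 2.0 * p[0]      # gradient norm 2 everywhere, so it is penalised


rng = random.Random(0)
real = [[0.0, 0.0], [1.0, 1.0]]   # toy "real" points
fake = [[2.0, 2.0], [3.0, 1.0]]   # toy "generated" points
gp_unit = gradient_penalty(unit_critic, real, fake, rng=rng)
gp_steep = gradient_penalty(steep_critic, real, fake, rng=rng)
```

The first critic already has unit gradient norm and incurs essentially no penalty, while the second is pushed back towards the Lipschitz constraint by a penalty of about 10.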

3.3. ResB–HAM–DCGAN Model Training Process

When the ResB–HAM–DCGAN model is trained, the parameters of the generator and discriminator are initialized first, and random noise is fed into the generator to produce augmented data. Then, the augmented data and real data are input into the discriminator to calculate the loss and update the parameters. Finally, when the discriminator cannot distinguish whether the input data are real data or augmented data, the model reaches an equilibrium state. The training process is shown in Figure 8, and the training steps are as follows.

(1) A truncated normal distribution with a variance of 0.02 is used to initialize the weight parameter W and the bias value b of the convolutional kernels. The learning rate η, namely the step size of each parameter update, is initialized. During training, the parameters are updated in the direction that decreases the loss function, that is, along the negative gradient, as follows:

$$W_{n+1} = W_n - \eta \Delta \tag{12}$$

where $\Delta$ represents the gradient, that is, the derivative of the loss function with respect to the parameters.

(2) Random noise is sampled from a uniform distribution on the interval [−1, 1].

(3) A batch of training samples is drawn at random, and the data are preprocessed in the input queue.

(4) The random noise generated in step (2) is input into the generator to generate augmented images. The generated images and the training samples obtained in step (3) are input into the discriminator at the same time to determine whether each image is real or generated. The discriminator loss is calculated, and the discriminator parameters are updated by backpropagation.

(5) The gradient penalty is calculated and added to the discriminator loss, and the optimizer is then used to update the discriminator parameters by backpropagation.

(6) Determine whether the specified number of discriminator updates has been reached. If so, proceed to step (7); otherwise, return to step (3).

(7) The random noise generated in step (2) is fed into the generator, the generator loss is calculated, and the generator parameters are updated by backpropagation using the optimizer.

(8) Determine whether the specified number of iterations has been reached, that is, whether the entire training set has been traversed. If so, proceed to step (9); otherwise, return to step (2).

(9) Determine whether the scheduled number of training epochs has been reached. If so, end the training; otherwise, return to step (2).
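The control flow of steps (2)–(9) can be summarized by the following skeleton, which only records the order of the updates. It is a sketch under stated assumptions: the number of discriminator updates per generator update (`n_critic`) and the toy loop sizes are hypothetical, the actual loss computations are replaced by comments, and step (7) is taken to update the generator, as is usual in GAN training.

```python
def training_schedule(epochs, iters_per_epoch, n_critic):
    """Return the ordered list of update steps implied by steps (2)-(9)."""
    log = []
    for _epoch in range(epochs):                 # step (9): loop over epochs
        for _iteration in range(iters_per_epoch):  # step (8): loop over iterations
            # step (2): draw noise z from a uniform distribution on [-1, 1]
            for _update in range(n_critic):      # step (6): repeat discriminator updates
                # steps (3)-(5): sample a batch, compute loss + penalty, update D
                log.append("D")
            # step (7): compute the generator loss and update G
            log.append("G")
    return log


steps = training_schedule(epochs=2, iters_per_epoch=3, n_critic=5)
```

With these toy settings the schedule contains 30 discriminator updates and 6 generator updates, with each generator update preceded by five discriminator updates.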

4. Experimental Results and Analysis

To validate the proposed image data augmentation algorithm, case studies are performed in collaboration with a food enterprise. This enterprise specializes in the production and further processing of rawhide sticks, and the rawhide sticks produced need to be inspected for defects. Against this application background, a limited number of rawhide stick images with defects are used to construct the rawhide stick defect image dataset, which we refer to as the original dataset. Our image data augmentation method is verified on the original dataset.

The experiments are run with the PyTorch deep learning framework in the PyCharm integrated development environment. The server runs a Windows operating system with an AMD Ryzen 7 5800H CPU and an NVIDIA GeForce RTX 3060 GPU (18 GB memory). The cooperative enterprise provides defective rawhide sticks, and the original dataset is constructed by photographing these rawhide sticks and performing data cleaning operations, such as removing duplicate images and normalizing the image size. The original dataset includes 1800 grayscale images with a size of 1024 × 1024. The defects mainly include surface stains and local irregular shapes, as shown in Figure 9.

The generated images were evaluated using three metrics: the Inception Score (IS), Fréchet Inception Distance (FID), and Structural Similarity Index Measure (SSIM). The IS evaluates image quality through the predicted class probability distribution; the higher the score, the higher the diversity and quality of the generated samples and the better the model. The FID is the Fréchet distance between the Gaussian distributions of the real and generated images in feature space; a smaller value indicates a higher similarity between the generated and real images. The SSIM evaluates the similarity of two images in terms of brightness, contrast, and structure; the larger its value, the more similar the generated image is to the real image.
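As a rough illustration of how the SSIM combines brightness, contrast, and structure, the sketch below evaluates the SSIM formula over a single global window. This is a simplification: practical implementations average the index over local sliding windows, and the constants 0.01 and 0.03 and the 8-bit dynamic range are the commonly used defaults, assumed here rather than taken from the paper.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of pixel values."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilising constants (assumed defaults)
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


img = [10.0, 50.0, 90.0, 130.0]                          # toy grayscale pixels
same = global_ssim(img, img)                             # identical images
inverted = global_ssim(img, [255.0 - p for p in img])    # intensity-inverted copy
```

Identical images give a value of 1, while the inverted copy, whose structure is anti-correlated with the original, scores much lower.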

4.1. Comparative Experiment

To verify the effectiveness of our ResB–HAM–DCGAN model, data augmentation experiments were conducted and the model was compared with the DCGAN and WGAN-GP models. The above four experiments use the Adam [40] optimizer to ensure the training speed of the model. The parameter settings of each machine learning model are shown in Table 3. Using the original set as the training set, each experiment spans 2500 epochs, the batch size is set to 32, and each model undergoes a total of 140,625 iterations. The training times of our proposed ResB–HAM–DCGAN and the comparison methods DCGAN, and WGAN-GP were 48 h, 10 h, 25 h, and 20 h, respectively.
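The stated iteration count follows directly from the dataset size, batch size, and number of epochs, as the short check below shows. The variable names are chosen for this example only.

```python
images = 1800        # size of the original dataset used as the training set
batch_size = 32
epochs = 2500

iters_per_epoch = images / batch_size            # a non-integer number of batches
total_iterations = images * epochs / batch_size
print(iters_per_epoch, total_iterations)         # 56.25 140625.0
```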

4.1.1. Analysis of Images Generated at Different Training Stages of ResB–HAM–DCGAN

The ResB–HAM–DCGAN model is trained according to the above process, and the generated images are recorded every 500 epochs, as shown in Figure 10.

It can be seen that the original input noise is displayed at epoch 0, and the images at epoch 500 already have recognizable rawhide stick contours, but the images are blurred and heavily gridded. At epoch 1000, the gridding phenomenon of the generated image is improved, and the outline of the rawhide sticks is relatively clear. At epoch 1500, some areas in the image begin to show color differences, but they are not obvious and have low clarity. At epoch 2000, the stains and irregular shape in the generated image are close to the real image, but the image background is blurred. At epoch 2500, more realistic rawhide stick images are generated.

4.1.2. Comparison Experiment of Images Generated by Different Machine Learning Models

In this study, we qualitatively compared the original images with the images generated by the different deep learning models. The result is shown in Figure 11: (a) displays the original captured images, while (b)–(d) depict the images generated after 2500 epochs of training using the DCGAN, WGAN-GP, and ResB–HAM–DCGAN models, respectively. The images generated by DCGAN show most of the rawhide stick contours but are blurry. The images generated by WGAN-GP have almost no noise points and are better than those of DCGAN. The images generated by the proposed ResB–HAM–DCGAN have no noise points, show clearer rawhide sticks, and present more obvious stains and irregular shape defects, making them closer to the real images.

The quantitative comparison of our ResB–HAM–DCGAN model with the existing DCGAN and WGAN-GP models in terms of the IS, FID, and SSIM is shown in Table 4. The table reports the evaluation results for the 2000 images generated by each of the four models after training for the same number of iterations on the original dataset. The FID of ResB–HAM–DCGAN is the lowest, while its IS and SSIM are the highest, indicating that the images generated by the proposed model are the most similar to the real images. In particular, the FID is 14.1 lower than that of DCGAN and 8.61 lower than that of WGAN-GP, indicating that the proposed model generates images of higher quality. The improvement in the IS and SSIM indicates that the HAM embedded in our model helps to spatially constrain the generated images and bring their spatial distribution closer to that of the real images.

4.2. Ablation Study of ResB–HAM–DCGAN Model

Ablation experiments are conducted to verify the effectiveness of the above modules by removing the ResB, HAM, and improved loss functions separately. The experiment still uses the original images as the training set, and each experimental scheme is iterated for 10 epochs, and the same learning rate and optimizer are used for each model in the training process. Figure 12 shows a qualitative comparison of the images generated after removing each module during the ablation experiment. Figure 12a shows the images generated by the DCGAN, while Figure 12b shows the images trained by adding a ResB to the generator and discriminator of the DCGAN. Compared to Figure 12a, Figure 12b alleviates the phenomena of image overlap and blurring. Figure 12c is the image trained with the HAM added on the basis of the DCGAN + ResB, and the image is further focused on the rawhide stick area. Figure 12d shows the image generated by the model established in this paper, and its detailed features are clearer.

The comparison of the images generated by the four schemes in the ablation study in terms of the three evaluation indicators of the IS, FID, and SSIM is shown in Table 5. The data in the table represent the evaluation results of 2000 images generated after training for the four schemes mentioned above. The comparison of the first two rows of data indicates that the addition of the ResB enhances the ability of the DCGAN model to extract deep-level features. All the evaluation indicators in the third row of data have improved, indicating that the HAM helps the model to learn the details and spatial information. The improvement in the last row of data indicates that using the Wasserstein distance with a gradient penalty to calculate the loss function can also improve the quality of the generated images. From the above experimental data, it can be concluded that the above three improvement measures are necessary for the data enhancement model, which is conducive to the generation of high-quality images.

4.3. Effectiveness Assessment of Augmented Image Dataset

This study assessed the effectiveness of our model by comparing the rawhide stick defect image datasets generated by ResB–HAM–DCGAN and by other data augmentation models, including the DCGAN and WGAN-GP. The classic LeNet-5 CNN is used as the classification model, and its parameters are shown in Table 6. The dataset used in training consists of two parts. The first is the original dataset, which contains 1800 photographed images and is divided into a training set of 1440 images and a test set of 360 images. The other is the augmented data, obtained by using each of the above three deep learning models to generate 4000 images, which are added to the respective training set. In this way, three augmented datasets were obtained, each consisting of 5440 training images and 360 test images. The LeNet-5 model is used to carry out image classification experiments on these datasets with a batch size of 32 and 50 epochs.

Using the above four datasets, the LeNet-5 CNN was trained for image classification. The training loss trend is shown in Figure 13. In the case of the same number of iterations, the training loss value decreases the fastest when trained with the dataset generated by the ResB–HAM–DCGAN, and the loss is close to 0 at epoch 50.

The image classification accuracy of the LeNet-5 CNN on the four datasets is shown in Figure 14. The classification accuracy fluctuates the most when the dataset generated by the DCGAN is used and the least when the dataset generated by the ResB–HAM–DCGAN is used. At epoch 50, the classification accuracy obtained with the dataset generated by the ResB–HAM–DCGAN is higher than that obtained with the other three datasets, indicating that the image augmentation method proposed in this paper is more capable of generating rawhide stick images with defects.

5. Conclusions

When an artificial intelligence model is used to generate an augmented rawhide stick defect image dataset, the challenges of mode collapse, an unstable training process, and the low quality of the generated images can be addressed by our proposed DCGAN model based on a ResB and HAM. ResBs are added to the generator and discriminator of the DCGAN-based deep learning network model, which ensures that the model can reliably learn the deep data features. The proposed HAM based on ECA and SA enhances the model’s ability to capture image features. The Wasserstein distance loss function with a gradient penalty replaces the traditional JS divergence-based DCGAN loss function, and a Lipschitz constraint is imposed on the discriminator to improve the convergence of the model. The image data augmentation experiment using the rawhide stick defect images provided by the enterprise has confirmed the superiority and effectiveness of our proposed ResB–HAM–DCGAN model. This experiment also offers valuable data support for training deep learning models for rawhide stick defect recognition.

At present, our research can only be appropriately applied to the image data augmentation of rawhide sticks, and the learning speed is slowed down due to the introduction of the ResB and HAM. Our future work will focus on the optimization of ResBs and the HAM structure to improve the computational efficiency of the model and explore its application in other industries, such as healthcare.

Author Contributions

S.D. contributed to the conceptualization and methodology. X.L. contributed to the investigation and project administration. X.C. contributed to the resources. Z.G. contributed to the software and writing—original draft. F.M. contributed to the supervision and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant no. 52105463).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy issues.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
2. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
3. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Comput. Sci. 2015, 3–5.
4. Ratliff, L.J.; Burden, S.A.; Sastry, S.S. Characterization and computation of local Nash equilibria in continuous games. In Proceedings of the 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–4 October 2013; pp. 917–924.
5. Mao, X.; Li, Q.; Xie, H.; Lau, R.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2813–2821.
6. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 9–11 August 2017; pp. 214–223.
7. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved training of wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5769–5779.
8. Wei, X.; Gong, B.; Liu, Z.; Lu, W.; Wang, L. Improving the improved training of wasserstein gans: A consistency term and its dual effect. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–17.
9. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
10. Wu, Y.; Shuai, H.; Tam, Z.; Chiu, H. Gradient normalization for generative adversarial networks. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, Virtual, Online, Canada, 11–17 October 2021; pp. 6353–6362.
11. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363.
12. Wu, S.; Yang, J.; Shan, Y.; Xu, B. Research on Generative Adversarial Networks Using Twins Attention Mechanism. J. Front. Comput. Sci. Technol. 2020, 14, 833–840.
13. Liu, B.; Zhu, Y.; Song, K.; Elgammal, A. Towards faster and stabilized gan training for high-fidelity few-shot image synthesis. In Proceedings of the 9th International Conference on Learning Representations, Virtual, Online, 3–7 May 2021.
14. Hinz, T.; Fisher, M.; Wang, O.; Wermter, S. Improved techniques for training single-image gans. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision, Virtual, Online, USA, 5–9 January 2021; pp. 1299–1308.
15. Chen, H.; Zhao, L.; Zhang, H.; Wang, Z.; Zuo, Z.; Li, A.; Xing, W.; Lu, D. Diverse image style transfer via invertible cross-space mapping. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, Virtual, Online, Canada, 11–17 October 2021; pp. 14860–14869.
16. Zheng, Z.; Yang, X.; Yu, Z.; Zheng, L.; Yang, Y.; Kautz, J. Joint discriminative and generative learning for person reidentification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2133–2142.
17. Shi, H.; Lu, J.; Zhou, Q. A novel data augmentation method using style-based GAN for robust pulmonary nodule segmentation. In Proceedings of the 2020 Chinese Control and Decision Conference, Hefei, China, 22–24 August 2020; pp. 2486–2491.
18. Tran, N.; Tran, V.; Nguyen, N.; Nguyen, T.; Cheung, N. On Data Augmentation for GAN Training. IEEE Trans. Image Process. 2020, 30, 1882–1897.
19. Upadhyay, A.; Li, J.; King, S.; Addepalli, S. A Deep-Learning-Based Approach for Aircraft Engine Defect Detection. Machines 2023, 11, 192.
20. He, J.; Shi, W.; Chen, K.; Fu, L.; Dong, C. GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1889–1898.
21. Grigorev, A.; Iskakov, K.; Ianina, A.; Bashirov, R.; Zakharkin, I.; Vakhitov, A.; Lempitsky, V. Stylepeople: A generative model of fullbody human avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5147–5156.
22. Jiang, B.; Wang, L.; Cheng, J.; Tang, J.; Luo, B. Gpens: Graph data learning with graph propagation-embedding network. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3925–3938.
23. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12868–12878.
24. Suthar, V.; Vakharia, V.; Patel, V.K.; Shah, M. Detection of Compound Faults in Ball Bearings Using Multiscale-SinGAN, Heat Transfer Search Optimization, and Extreme Learning Machine. Machines 2022, 11, 29.
25. Jalayer, M.; Kaboli, A.; Orsenigo, C.; Vercellis, C. Fault Detection and Diagnosis with Imbalanced and Noisy Data: A Hybrid Framework for Rotating Machinery. Machines 2022, 10, 237.
26. Kim, J.H.; Hwang, Y. GAN-based synthetic data augmentation for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002512.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Park, S.; Yoo, C.H.; Shin, Y.G. Effective Shortcut Technique for Generative Adversarial Networks. Appl. Intell. 2022, 53, 2055–2067.
29. Zhu, J.H.; Zhou, X.Y.; Xu, M.S.; Wang, Y.; Hou, J.J.; Zhao, X.Y.; Cheng, L. Improved DCGAN Data Enhanced Tomato Leaf Disease Image Recognition. Radio Eng. 2023, 53, 1235–1241.
30. Lin, B.W.; Zhao, G.Z.; Wang, X.P.; Li, H. Facial Expression Generation Based on group residual Block Generative Adversarial Networks. Comput. Eng. Appl. 2024, 60, 240–249.
31. Li, M.A.; Peng, W.M. EEG signal augmentation method based on generative adversarial network with ResBlock and self-attention mechanism. J. Comput. Appl. 2022, 42, 80–86.
32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
33. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–28 June 2018; pp. 7132–7141.
34. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
35. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13708–13713.
36. Wang, L.; Yang, J.; Zhang, C.; Dai, Z. Dual Discriminative Generative Adversarial Networks with Hybrid Attention. Comput. Eng. Appl. 2024, 60, 212–221.
37. Yang, Y.; Sun, L.; Mao, X.; Zhao, M. Data Augmentation Based on Generative Adversarial Network with Mixed Attention Mechanism. Electronics 2022, 11, 1718.
38. Wang, X.; Cheng, H.X.; Sun, S.Y.; Jiang, Z.Q.; Cheng, K.; Cheng, L. MSFSA-GAN: Multi-Scale Fusion Self Attention Generative Adversarial Network for Single Image Deraining. IEEE Access 2022, 10, 34442–34448.
39. Zhang, D.H.; Wu, C.Y.; Zhou, J.C.; Zhang, W.S.; Li, C.L.; Lin, Z.F. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement. Eng. Appl. Artif. Intell. 2023, 125, 106743.
40. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.

Figure 1. GAN network structure.

Figure 2. ResB–HAM–DCGAN network structure.

Figure 3. ResB-G structure in ResB–HAM–DCGAN.

Figure 4. ResB-D structure in ResB–HAM–DCGAN.

Figure 5. HAM structure.

Figure 6. ResB–HAM–DCGAN generator structure.

Figure 7. ResB–HAM–DCGAN discriminator structure.

Figure 8. The training process of ResB–HAM–DCGAN.

Figure 9. Defect grayscale image. (a) Surface stain; (b) local irregular shape.

Figure 10. Rawhide stick defect map generated by ResB–HAM–DCGAN. (a) At epoch 0; (b) at epoch 500; (c) at epoch 1000; (d) at epoch 1500; (e) at epoch 2000; (f) at epoch 2500.

Figure 11. Comparison between original images and images generated by different deep learning models. (a) Real images; (b) DCGAN-generated; (c) WGAN-GP-generated; (d) ResB–HAM–DCGAN-generated.

Figure 12. Ablation study. (a) DCGAN-generated; (b) DCGAN + ResB-generated; (c) DCGAN + ResB + HAM-generated; (d) ResB–HAM–DCGAN-generated.

Figure 13. Training loss chart.

Figure 14. Classification accuracy of various methods. (a) Training accuracy; (b) test accuracy.

Table 1. Generator network structure.

No.	Input	Operation	Output
1	1 × 1 × 100	TranConv2d (100, 512, k = 4 × 4, s = 1) + SN + BN + ReLU	4 × 4 × 512
2	4 × 4 × 512	ResB_G(512)	8 × 8 × 256
3	8 × 8 × 256	ResB_G(256)	16 × 16 × 128
4	16 × 16 × 128	ResB_G(128)	32 × 32 × 64
5	32 × 32 × 64	ResB_G(64)	64 × 64 × 32
6	64 × 64 × 32	ResB_G(32)	128 × 128 × 16
7	128 × 128 × 16	MA-block (16, b = 1, gamma = 2, k = 3)	128 × 128 × 16
8	128 × 128 × 16	ResB_G(16)	256 × 256 × 8
9	256 × 256 × 8	MA-block (8, b = 1, gamma = 2, k = 3)	256 × 256 × 8
10	256 × 256 × 8	TranConv2d (8, 1, k = 4 × 4, s = 2, p = 1) + Tanh	512 × 512 × 1
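
The spatial sizes in Table 1 can be checked against the standard transposed-convolution size formula, out = (in − 1) · s − 2p + k. The following is a minimal sketch in plain Python, not the authors' code. It assumes p = 0 for row 1 (the table lists no padding there), assumes each ResB_G doubles the height and width while halving the channel count, and treats the MA-blocks as shape-preserving, as the Output column indicates. The helper names are hypothetical.

```python
def tconv_out(size, k, s, p=0):
    """Output side length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * s - 2 * p + k


def trace_generator():
    """Return the (height, width, channels) output of each row of Table 1."""
    shapes = []
    side = tconv_out(1, k=4, s=1, p=0)  # row 1; p = 0 is an assumption
    channels = 512
    shapes.append((side, side, channels))
    # Rows 2-9: ResB_G upsamples by 2 and halves channels; MA-block keeps the shape.
    for op in ["resb", "resb", "resb", "resb", "resb", "ma", "resb", "ma"]:
        if op == "resb":
            side, channels = side * 2, channels // 2
        shapes.append((side, side, channels))
    side = tconv_out(side, k=4, s=2, p=1)  # row 10: final transposed convolution
    shapes.append((side, side, 1))
    return shapes


for row, shape in enumerate(trace_generator(), start=1):
    print(row, shape)
```

Under these assumptions the trace ends at a single-channel 512 × 512 output, matching the last row of the table.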

Table 2. Discriminator network structure.

No.	Input	Operation	Output
1	512 × 512 × 1	Conv2d (1, 8, k = 4 × 4, s = 2, p = 1) + SN + IN + LeakyReLU	256 × 256 × 8
2	256 × 256 × 8	ResB_D(8)	128 × 128 × 16
3	128 × 128 × 16	ResB_D(16)	64 × 64 × 32
4	64 × 64 × 32	ResB_D(32)	32 × 32 × 64
5	32 × 32 × 64	ResB_D(64)	16 × 16 × 128
6	16 × 16 × 128	ResB_D(128)	8 × 8 × 256
7	8 × 8 × 256	MA-block (256, b = 1, gamma = 2, k = 3)	8 × 8 × 256
8	8 × 8 × 256	ResB_D(256)	4 × 4 × 512
9	4 × 4 × 512	MA-block (512, b = 1, gamma = 2, k = 3)	4 × 4 × 512
10	4 × 4 × 512	Conv2d (512, 1, k = 4 × 4, s = 1) + LeakyReLU	1 × 1 × 1

Table 3. Parameter settings.

Model	Generator Learning Rate	Discriminator Learning Rate	Penalty Coefficient
DCGAN and WGAN-GP	0.0002	0.0002	/
ResB–HAM–DCGAN	0.0001	0.0004	10
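
To illustrate what the penalty coefficient in Table 3 controls, the sketch below evaluates the gradient-penalty term in its standard WGAN-GP form, λ · (‖∇D(x̂)‖₂ − 1)², with λ = 10. It is a toy example under stated assumptions, not the paper's implementation: the critic is assumed to be linear, D(x) = w · x, so its gradient with respect to the input is simply w, and the usual interpolation between real and generated samples is omitted. The function and variable names are hypothetical.

```python
import math

LAMBDA_GP = 10.0  # penalty coefficient listed for ResB-HAM-DCGAN in Table 3


def gradient_penalty(grad, lam=LAMBDA_GP):
    """Standard WGAN-GP term lam * (||grad||_2 - 1)^2 for one sample's critic gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


# Toy assumption: for a linear critic D(x) = w . x, the input gradient is w everywhere.
w_unit = [0.6, 0.8]   # gradient norm 1: the penalty vanishes
w_steep = [1.2, 1.6]  # gradient norm 2: the penalty is lam * (2 - 1)^2

print(gradient_penalty(w_unit))
print(gradient_penalty(w_steep))
```

The term is zero only when the critic's gradient norm equals 1, so a larger coefficient pushes the discriminator more strongly toward the 1-Lipschitz behaviour that the Wasserstein loss relies on.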

Table 4. Comparison of evaluation metrics for each model.

Model	IS ↑	FID ↓	SSIM ↑
DCGAN	6.12	105.61	0.72
WGAN-GP	7.53	100.12	0.79
ResB–HAM–DCGAN	10.41	91.51	0.83
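
The gaps in Table 4 are easier to compare as relative changes. The short sketch below only re-expresses the tabulated values as percentages against each baseline; the dictionary layout and the function name are illustrative choices, not part of the original study.

```python
# Values copied from Table 4 (IS and SSIM: higher is better; FID: lower is better).
results = {
    "DCGAN": {"IS": 6.12, "FID": 105.61, "SSIM": 0.72},
    "WGAN-GP": {"IS": 7.53, "FID": 100.12, "SSIM": 0.79},
    "ResB-HAM-DCGAN": {"IS": 10.41, "FID": 91.51, "SSIM": 0.83},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


proposed = results["ResB-HAM-DCGAN"]
for baseline in ("DCGAN", "WGAN-GP"):
    for metric in ("IS", "FID", "SSIM"):
        pct = relative_change(proposed[metric], results[baseline][metric])
        print(f"{metric} vs {baseline}: {pct:+.1f}%")
```

With the Table 4 values, the proposed model's IS is roughly 70% higher and its FID roughly 13% lower than those of DCGAN.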

Table 5. Comparison of evaluation results for images generated by each model.

Model	IS ↑	FID ↓	SSIM ↑
DCGAN	6.12	105.61	0.72
DCGAN+Res	6.93	100.25	0.77
DCGAN+Res+MA	9.05	95.88	0.80
ResB–HAM–DCGAN	10.41	91.51	0.83

Table 6. Classification network structure based on LeNet-5.

No.	Input	Operation	Output
1	512 × 512 × 1	Conv2d (1, 16, k = 4 × 4, s = 2, p = 1) + BN + ReLU	256 × 256 × 16
2	256 × 256 × 16	MaxPool2d (k = 2 × 2, s = 2)	128 × 128 × 16
3	128 × 128 × 16	Conv2d (16, 32, k = 4 × 4, s = 2, p = 1) + BN + ReLU	64 × 64 × 32
4	64 × 64 × 32	MaxPool2d (k = 2 × 2, s = 2)	32 × 32 × 32
5	32 × 32 × 32	Conv2d (32, 64, k = 4 × 4, s = 2, p = 1) + BN + ReLU	16 × 16 × 64
6	16 × 16 × 64	MaxPool2d (k = 2 × 2, s = 2)	8 × 8 × 64
7	8 × 8 × 64	Conv2d (64, 128, k = 4 × 4, s = 2, p = 1) + BN + ReLU	4 × 4 × 128
8	4 × 4 × 128	MaxPool2d (k = 2 × 2, s = 2)	2 × 2 × 128
9	2 × 2 × 128	Linear (2 × 2 × 128, 2)	1 × 2
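
The feature-map sizes in Table 6 follow from the usual convolution and pooling size formula, out = ⌊(in + 2p − k) / s⌋ + 1. The sketch below is plain Python shape arithmetic under that formula, not the authors' code; it traces the four Conv2d + MaxPool2d stages and derives the number of inputs to the final Linear layer. The names `conv_out` and `flat_features` are hypothetical.

```python
def conv_out(size, k, s, p=0):
    """Output side length of a convolution or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * p - k) // s + 1


side, channels = 512, 1  # grayscale 512 x 512 input, as in row 1 of Table 6
for out_channels in (16, 32, 64, 128):
    side = conv_out(side, k=4, s=2, p=1)  # Conv2d (+ BN + ReLU) halves the side length
    side = conv_out(side, k=2, s=2)       # MaxPool2d halves it again
    channels = out_channels

flat_features = side * side * channels  # flattened input to the Linear layer
print(side, channels, flat_features)
```

Each stage reduces the side length by a factor of four, so four stages take 512 down to 2 and leave 2 × 2 × 128 = 512 features for the two-class Linear layer.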

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
