Defect Detection of MEMS Based on Data Augmentation, WGAN-DIV-DC, and a YOLOv5 Model

Surface defect detection of micro-electromechanical system (MEMS) acoustic thin film plays a crucial role in MEMS device inspection and quality control. The performance of deep learning object detection models is significantly affected by the number of samples in the training dataset. However, it is difficult to collect enough defect samples during production. In this paper, an improved YOLOv5 model was used to detect MEMS defects in real time. Mosaic augmentation and an additional prediction head were added to the YOLOv5 baseline model to improve its feature extraction capability. Moreover, Wasserstein divergence for generative adversarial networks with a deep convolutional structure (WGAN-DIV-DC) was proposed to expand the number of defect samples and to make the training samples more diverse, which improved the detection accuracy of the YOLOv5 model. The optimal detection model achieved 0.901 mAP, a 0.856 F1 score, and a real-time speed of 75.1 FPS. As compared with the baseline model trained using a non-augmented dataset, the mAP and F1 score of the optimal detection model increased by 8.16% and 6.73%, respectively. This defect detection model would provide significant convenience during MEMS production.


Introduction
Micro- and nanoscale devices have been applied in many areas [1], especially microelectromechanical system (MEMS) devices. They have many advantages, such as low cost, small size, and the capability to integrate with on-chip circuits [2]. However, MEMS surface defects are inevitable during production, and they affect the function of the device. The real-time detection of MEMS surface defects is crucial but challenging [3]. On the one hand, small-size defects may lead to low detection accuracy, so an object detection deep learning network with better feature extraction performance is needed. On the other hand, a deep learning network requires sufficient training samples for learning to avoid overfitting [4], and it is difficult to collect enough diverse defect samples from the products in practice [5].
Thus, in this work, we present a methodology to deal with these problems. Based on data augmentation, the method effectively solves the problem that an insufficient original dataset hampers the training of a detection model, which reduces the reliance on dataset collection during an industrial fabrication process. In addition, the methodology uses a single-stage detection model, so smaller defects can be detected on the production line more accurately and in real time.

Related Works
In this section, we provide a brief overview of previous studies related to data augmentation with GANs, and then we discuss the relevant work on object detection with single-stage and two-stage models.
Data augmentation is widely used to avoid overfitting. Conventional data augmentation methods, such as flipping, rotation, and color space transformations, can only change the appearance of existing samples and cannot substantially enrich the diversity of the underlying data distribution.

Paper Contribution
In this paper, we aimed to train a real-time detection model based on YOLOv5 to detect MEMS surface defects more accurately and efficiently.
First, a MEMS defect dataset was expanded with an improved GAN, i.e., Wasserstein divergence for generative adversarial networks with deep convolutional structure (WGAN-DIV-DC). To improve the quality of the generated images, different network structures were applied to the generator and discriminator.
Secondly, in order to improve the detection performance of the YOLOv5 baseline model, Mosaic and one more prediction head were added into the baseline model for comparative experiments. The optimal model trained by using an augmented dataset performed well both in accuracy and speed during MEMS surface defect detection.

Dataset Collection
The defect dataset was collected from a MEMS wafer containing thousands of MEMS cells. The data acquisition system consisted of three parts, as shown in Figure 1: an area array charge-coupled device (CCD) camera, a microscope, and a MEMS wafer. The microscope was equipped with a 10× eyepiece and a 20× objective lens. By moving the stage in the x and y directions, every MEMS cell could be captured by the camera. Finally, 1200 MEMS surface defect images, with a size of 1600 × 1600, were collected from over 10,000 MEMS cells.


Data Augmentation
It was noticed during data collection that only a few images could be used as training samples. In this step, data augmentation based on generative adversarial networks was applied to the original dataset.
A GAN training strategy defines a minimax two-player game between a generator network (G) and a discriminator network (D). The generator takes random noise as input and outputs a generated image. The discriminator distinguishes the generated sample from the real sample and rejects the fake one. As shown in Figure 2, the two networks improve their performance by competing with each other. During GAN training, G and D are trained alternately, constantly optimizing the following minimax value function [7]:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_g(z)}[\log(1 - D(G(z)))]

where p_r(x) is the distribution of the real data and p_g(z) is the distribution of the noise. The value of V(D, G) is minimized by training G and then maximized by training D.
During training, the binary cross-entropy loss (BCEL) is used to define the loss functions of the generator (L_G) and discriminator (L_D):

L_D = -\mathbb{E}_{x \sim p_r(x)}[\log D(x)] - \mathbb{E}_{z \sim p_g(z)}[\log(1 - D(G(z)))]

L_G = \mathbb{E}_{z \sim p_g(z)}[\log(1 - D(G(z)))]
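The alternating objectives above can be made concrete with a small numerical sketch. The following illustrative NumPy computation evaluates the BCEL-style losses from the discriminator's outputs; it is a demonstration of the formulas, not the paper's implementation:

```python
import numpy as np

def bce_gan_losses(d_real, d_fake):
    """Binary cross-entropy GAN losses.

    d_real: discriminator outputs D(x) on real samples, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1).
    Returns (L_D, L_G) following the minimax formulation: the discriminator
    maximizes log D(x) + log(1 - D(G(z))) (so L_D is its negation), and the
    generator minimizes log(1 - D(G(z))).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    loss_d = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))
    loss_g = np.mean(np.log(1.0 - d_fake))
    return loss_d, loss_g

# A near-perfect discriminator (D(x) -> 1, D(G(z)) -> 0) drives L_D toward 0,
# while the generator's loss becomes increasingly negative as D is fooled.
ld, lg = bce_gan_losses([0.99, 0.98], [0.02, 0.01])
```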


The generator and discriminator are both multilayer perceptrons (MLP), consisting of linear connection layers, one-dimensional normalization layers, and activation layers, as shown in Figure 3a.


Deep Convolutional Generative Adversarial Network
In the DCGAN [19], convolutional layers were used throughout to reduce the number of parameters, as shown in Figure 3b. In the generator, ReLU was used for all activation layers except the output, which used Tanh. In the discriminator, LeakyReLU was used for all activation layers to prevent gradient sparsity. In addition, pooling layers were replaced by convolution layers (stride = 2) to prevent losing too many features.
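Replacing pooling with stride-2 convolutions fixes the spatial size at each stage. As a minimal sketch of the size arithmetic (assuming 64 × 64 images, 4 × 4 kernels, stride 2, and padding 1, which are typical DCGAN choices rather than settings reported here):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a 2-D convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (the inverse map)."""
    return (size - 1) * stride - 2 * padding + kernel

# Discriminator: stride-2 convolutions downsample 64 -> 32 -> 16 -> 8 -> 4,
# taking the place of pooling layers.
sizes = [64]
for _ in range(4):
    sizes.append(conv_out(sizes[-1]))

# Generator: transposed convolutions upsample 4 -> 8 -> 16 -> 32 -> 64.
up = [4]
for _ in range(4):
    up.append(deconv_out(up[-1]))
```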

Wasserstein Generative Adversarial Network
The WGAN introduced the Earth-Mover (EM) distance to define the loss functions of the discriminator and generator, and the EM distance (W distance) is defined as follows:

W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]

where W is the minimum cost required to move one data distribution to another. By the Kantorovich–Rubinstein duality, the distance can be estimated as

W(p_r, p_g) \approx \sup_{w} \left( \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{\tilde{x} \sim p_g}[f_w(\tilde{x})] \right)

where p_r and p_g indicate the real and the generated data distributions mapped by f_w(x), respectively, and f_w can be considered a discriminator network with parameters w. f_w should satisfy the following condition [11]:

\|f_w\|_L \le K

where \|f_w\|_L denotes the Lipschitz norm of f_w. To enforce the Lipschitz constraint, the loss function can be defined in the following ways: weight clipping, gradient penalty [20], and Wasserstein divergence [21], corresponding to three models, i.e., WGAN, WGAN-GP, and WGAN-DIV, respectively. WGAN limits the discriminator parameters to [−c, c] by clipping the weights directly, and the loss functions of the generator and discriminator can be calculated by using Equation (5):

L_D = \mathbb{E}_{\tilde{x} \sim p_g}[f_w(\tilde{x})] - \mathbb{E}_{x \sim p_r}[f_w(x)], \qquad L_G = -\mathbb{E}_{\tilde{x} \sim p_g}[f_w(\tilde{x})]

WGAN-GP makes the model more stable by penalizing the gradient of the discriminator:

L_D = \mathbb{E}_{\tilde{x} \sim p_g}[f_w(\tilde{x})] - \mathbb{E}_{x \sim p_r}[f_w(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1\right)^2\right], \qquad \hat{x} = \varepsilon x + (1 - \varepsilon)\tilde{x}

where λ is a hyperparameter to be determined and ε is a random number ranging from 0 to 1. As the input should be continuous considering the Lipschitz constraint, the input data are made to be distributed evenly in the whole space by interpolating between the generated image and the real image. Wasserstein divergence was introduced into WGAN to define the loss function, which does not require the 1-Lipschitz constraint. The loss function of the discriminator in WGAN-DIV is:

L_D = \mathbb{E}_{\tilde{x} \sim p_g}[f_w(\tilde{x})] - \mathbb{E}_{x \sim p_r}[f_w(x)] + k \, \mathbb{E}_{\hat{x}}\left[\|\nabla_{\hat{x}} f_w(\hat{x})\|^p\right]

where k and p are hyperparameters. The experimental results [21] show that the model had the best performance when k = 2 and p = 6.
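The divergence penalty can be illustrated with a toy linear critic f_w(x) = w·x, whose input gradient is w everywhere; this analytic stand-in replaces the discriminator network only for demonstration, so no autograd is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # parameters of a toy linear critic
f = lambda x: x @ w             # f_w(x) = w . x, so grad_x f_w(x) = w

def wgan_div_d_loss(x_real, x_fake, k=2.0, p=6.0):
    """WGAN-DIV discriminator loss: the Wasserstein estimate plus the
    divergence penalty k * E[||grad f_w(x_hat)||^p] on samples
    interpolated between real and generated data."""
    eps = rng.uniform(size=(len(x_real), 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    # For the linear critic the input gradient is w at every x_hat.
    grad = np.broadcast_to(w, x_hat.shape)
    penalty = k * np.mean(np.linalg.norm(grad, axis=1) ** p)
    return np.mean(f(x_fake)) - np.mean(f(x_real)) + penalty

x_real = rng.normal(size=(16, 8))
x_fake = rng.normal(size=(16, 8))
loss = wgan_div_d_loss(x_real, x_fake)
```

When the generated distribution matches the real one exactly, the Wasserstein term vanishes and only the (positive) penalty remains.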
In this study, inspired by DCGAN, the MLP structure of WGAN was replaced with a deep convolution network (DC). As shown in Figure 3c, ReLU activation in the generator was replaced with LeakyReLU to obtain better performance. In addition, the final Sigmoid activation and batch normalization layer were removed to slow down the convergence of the discriminator and to optimize the generator stably. Thus, we obtained WGAN-DC, WGAN-GP-DC, and WGAN-DIV-DC. The validity of the models was verified through the experiments described in Section 3.1.

YOLOv5 Algorithm
As introduced in the previous sections, a YOLOv5 model was used to detect MEMS surface defects. The YOLOv5 network structure mainly consisted of four parts: model input, backbone, neck, and prediction. Figure 4 shows the baseline model of this experiment. Mixup [22], a straightforward data augmentation principle, constructs a new sample by linearly interpolating two random samples and their labels from the training set. Mixup increases the robustness to adversarial samples and greatly improves the generalization of the model. The backbone network included a Convolution-BatchNorm-SiLU module (CBS), C3_n, and spatial pyramid pooling (SPP). The CBS is the most basic module in the backbone network, consisting of a two-dimensional convolutional layer, a two-dimensional batch normalization layer, and a SiLU activation function. The CBS2, whose stride equals 2, performs the downsampling operation. The bottleneck module was inspired by the idea of CSPNet [23]: only half of the feature channels go through the CBS modules. SPP executes maximum pooling with kernel sizes of 5 × 5, 9 × 9, and 13 × 13 and concatenates the feature maps to avoid computing the convolutional features repeatedly.
The neck module, a feature fusion part, combines a feature pyramid network (FPN) and a path aggregation network (PAN). FPN [24] extracts the features through a top-down architecture with lateral connections, while PAN [25] transmits features in a bottom-up pyramid. The two structures showed significant improvement in feature extraction.
Three feature maps, with sizes of 20 × 20, 40 × 40, and 80 × 80, were used as three prediction layers to detect objects of different sizes. Every prediction layer outputs the corresponding prediction head and finally works out the predicted bounding box and class. The generalized intersection-over-union (GIoU) [26] loss function can solve the problem of non-overlapping bounding boxes during training. As shown in Figure 5, box C is the minimum rectangle containing A and B. GIoU and the loss function were calculated by using Equation (11):

GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \qquad L_{GIoU} = 1 - GIoU

where IoU (intersection-over-union) is the intersection ratio of the prediction box A and the original label box B.
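A minimal sketch of Equation (11) for axis-aligned boxes in (x1, y1, x2, y2) form might look like:

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2).

    GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest box
    enclosing both A and B; the training loss is L = 1 - GIoU.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```

Identical boxes give GIoU = 1, while disjoint boxes get a negative GIoU, which still provides a useful gradient signal where plain IoU is stuck at zero.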


Modules for Comparative Experiments
In this paper, the following three modules were introduced into the baseline model to improve its detection accuracy. First, as compared with Mixup, Mosaic [27] selects four samples from the training set each time for random scaling, regional clipping, disorderly arrangement, and splicing into a new image, as shown in Figure 6. Mosaic can increase the diversity of the dataset and improve the robustness of the model. In the backbone network, spatial pyramid pooling fast (SPPF), an improved version of SPP, was applied to speed up inference. As shown in Figure 7, SPPF serially executes maximum pooling with a kernel size of 5 × 5 and fuses the features by concatenation, followed by a CBS module to adjust the output channels.
In the prediction network, one more prediction head was used to detect objects of different sizes, especially small MEMS defects. As shown in Figure 7, the feature map was expanded to 160 × 160 by upsampling at the 19th layer and then concatenated with the 2nd layer. Four feature maps of different sizes were used to predict the bounding box and class, enhancing the feature extraction ability of the network.
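The speed advantage of SPPF comes from the fact that serial 5 × 5 max poolings reproduce the receptive fields of SPP's parallel 5 × 5, 9 × 9, and 13 × 13 poolings. A small NumPy check of this equivalence (stride-1 pooling with "same" padding; a demonstration, not the model code):

```python
import numpy as np

def maxpool(x, k):
    """Stride-1 k x k max pooling with 'same' padding (padded with -inf
    so the padding never wins the max)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="constant", constant_values=-np.inf)
    h, w = x.shape
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].max()
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 16))
# Two serial 5x5 poolings have the receptive field of one 9x9 pooling, and
# three have that of one 13x13 pooling -- SPPF reuses the intermediate
# results instead of pooling three times in parallel as SPP does.
p5 = maxpool(x, 5)
p9_serial = maxpool(p5, 5)
p13_serial = maxpool(p9_serial, 5)
```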


Data Augmentation
Considering that most of the defects only occupied a small area of the original images, defects with a size of 64 × 64 were picked up by sliding windows. A total of 640 selected defect images were used to train the GAN models. According to existing experiments, Adam optimization [21] with a learning rate of 0.0002 was used to update G and D. In order to get better convergence results, the batch size was set to 64 and the training epoch was set to 2700.
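The patch extraction step can be sketched as follows; the window step size is an illustrative assumption, since the paper only specifies the 64 × 64 crop size:

```python
import numpy as np

def sliding_crops(image, size=64, step=32):
    """Collect (row, col, crop) for every size x size window, moved by
    `step` pixels; the step value is a hypothetical choice for illustration."""
    h, w = image.shape[:2]
    crops = []
    for r in range(0, h - size + 1, step):
        for c in range(0, w - size + 1, step):
            crops.append((r, c, image[r:r + size, c:c + size]))
    return crops

# One 1600 x 1600 wafer image yields a 49 x 49 grid of candidate windows,
# from which the defect-containing patches would be selected.
image = np.zeros((1600, 1600), dtype=np.uint8)
crops = sliding_crops(image)
```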
Fréchet inception distance (FID) [28] is commonly used to describe the performance of GAN models. A pretrained Inception V3 network was used to propagate all real and generated images, and its last pooling layer was used as the coding layer. For this coding layer, the mean and the covariance of the real and generated images were calculated, respectively. When the real and generated images are assumed to follow Gaussian distributions, the difference between the two Gaussians is measured by FID, which is defined as:

FID = \|m_x - m_g\|_2^2 + \mathrm{Tr}\left(C_x + C_g - 2\left(C_x C_g\right)^{1/2}\right)

where m_x and C_x are the mean and covariance of the real images, respectively, and m_g and C_g are the mean and covariance of the generated images, respectively. A lower FID indicates that the generated images are more similar to real images. As shown in Figure 8a, the FID of DCGAN decreased rapidly in the early stage with small fluctuations, indicating that DCGAN training was stable. DCGAN training converged at around 1000 epochs, and the lowest FID was 43.48. WGAN limited the parameters to [−0.01, 0.01] by clipping weights directly, which could easily lead to weight binarization and falling into a local extreme point. The FID of WGAN gradually converged at 1800 epochs, and the lowest FID was 54.47. Figure 8b shows the training progress of WGAN and WGAN-DC. The FID of WGAN-DC dropped rapidly and stayed low with small fluctuations, indicating that WGAN-DC training was more stable. Its lowest FID was 39.55, which meant the quality of the generated samples was better. As shown in Figure 9, the training process of WGAN with the DC structure was more stable, converged faster, and reached a much lower FID than WGAN with the MLP structure.
The lowest FID values of WGAN-GP-DC and WGAN-DIV-DC were 33.75 and 29.44, respectively. The lowest FID values of the models in Table 1 show that WGAN-DIV-DC outperformed the compared methods.
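Given the means and covariances from the coding layer, the FID formula can be evaluated directly. The helper below is an illustrative sketch that computes the trace of the matrix square root from the eigenvalues of C_x C_g (non-negative for covariance matrices, up to numerical noise):

```python
import numpy as np

def fid(m_x, c_x, m_g, c_g):
    """Frechet inception distance between two Gaussians:
    ||m_x - m_g||^2 + Tr(C_x + C_g - 2 (C_x C_g)^(1/2))."""
    diff = np.sum((m_x - m_g) ** 2)
    eig = np.linalg.eigvals(c_x @ c_g)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eig.real, 0.0)))
    return diff + np.trace(c_x) + np.trace(c_g) - 2.0 * tr_sqrt

# Identical Gaussians give FID = 0; shifting the mean raises it.
d = 4
m = np.zeros(d)
c = np.eye(d)
```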

Table 1. The lowest FID achieved by each model.

Model          Lowest FID
DCGAN          43.48
WGAN           54.47
WGAN-DC        39.55
WGAN-GP-DC     33.75
WGAN-DIV-DC    29.44

The improved model WGAN-DIV-DC was used to expand the MEMS dataset, and 15,000 defect samples were generated. Figure 10 shows some of the generated defect samples. The generated defects were randomly pasted into defect-free MEMS images, with one to five defects on each image. Finally, the original 200 training images were expanded to 4000 images. One of the synthetic training images is shown in Figure 11.
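The paste-based synthesis can be sketched as follows; the image sizes and the direct (unblended) pasting are illustrative simplifications of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(42)

def paste_defects(clean, patches, max_defects=5):
    """Paste one to five defect patches at random positions on a
    defect-free image; return the image and the (x1, y1, x2, y2) boxes
    that would serve as detection labels."""
    image = clean.copy()
    boxes = []
    for _ in range(rng.integers(1, max_defects + 1)):
        patch = patches[rng.integers(len(patches))]
        ph, pw = patch.shape[:2]
        y = int(rng.integers(0, image.shape[0] - ph + 1))
        x = int(rng.integers(0, image.shape[1] - pw + 1))
        image[y:y + ph, x:x + pw] = patch
        boxes.append((x, y, x + pw, y + ph))
    return image, boxes

# Toy inputs: a blank "defect-free" image and one bright 64 x 64 patch.
clean = np.zeros((640, 640), dtype=np.uint8)
patches = [np.full((64, 64), 255, dtype=np.uint8)]
augmented, boxes = paste_defects(clean, patches)
```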


Training Setting
After data augmentation, the number of images was expanded to 5000, consisting of 1200 real images and 3800 generated images. As shown in Table 2, the dataset was divided according to an 8:1:1 ratio: the training set contained all 3800 generated images and 200 real images, while the validation set and the test set each contained 500 real images. Our experimental environment is shown in Table 3. The model training epoch was set to 3000, and the batch size was set to 16. The initial learning rate was 0.01. We trained this model on the MEMS dataset from scratch, without using pretrained weights.
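The split arithmetic works out as follows (a sketch of the counts stated above, not the actual data pipeline):

```python
def split_counts(real=1200, generated=3800, ratio=(8, 1, 1)):
    """Reproduce the paper's 8:1:1 split: validation and test each take
    500 real images; the training set gets the remaining real images
    plus all generated ones."""
    total = real + generated            # 5000 images in total
    unit = total // sum(ratio)          # one ratio unit = 500 images
    val = test = unit * ratio[1]        # 500 real images each
    train_real = real - val - test      # 200 real images remain
    train = train_real + generated      # 4000 training images
    return train, val, test

train, val, test = split_counts()
```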

Model Evaluation Metrics
Some metrics were used to evaluate the performance of the detection models: precision (P), recall (R), mean average precision (mAP), F1 score, and detection speed. The detection speed can be measured by the number of images detected per second (FPS). Precision and recall are defined as P = TP/(TP + FP) and R = TP/(TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. It is common to use the F1 score and mAP to comprehensively measure the accuracy of a model. The F1 score is the harmonic mean of precision and recall, as defined in Equation (11):

F1 = \frac{2 \times P \times R}{P + R}

The P-R curve is plotted with R as the x axis and P as the y axis. The area under the P-R curve (AP) and mAP are calculated using Equation (12):

AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i

where mAP is the average AP over all N categories. The closer the mAP and F1 score get to one, the better the performance of the model. Figure 12 shows the influence of data augmentation on the baseline model. If the metrics did not increase, model training was terminated in advance. The model trained on the non-augmented dataset started to converge at around 600 epochs, while after data augmentation, training started to converge at around 270 epochs and terminated in advance at about 330 epochs. The detection accuracy of the models was greatly improved, as shown in Table 4. Additionally, the inference time was tested. Table 5 presents a comparison of the inference time of the different models. The detection speed of the model trained with SPPF was not significantly improved, so SPP was not replaced by SPPF in the end.
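The F1 and AP computations can be sketched as follows (trapezoidal integration of the P-R curve is one common convention; the paper's exact AP interpolation is not specified here, and mAP is simply the mean of the per-class APs):

```python
import numpy as np

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """Area under a P-R curve by trapezoidal integration; assumes the
    recall values are sorted in ascending order."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# A perfect detector (P = 1 at every recall level) has AP = 1;
# mAP would average such APs over all defect classes.
f1 = f1_score(0.9, 0.8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```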
It can be seen from Table 4 that the detection mAP of the optimal model in this paper for MEMS surface defects is 0.901 and the F1 score is 0.856. Data augmentation and Mosaic greatly improved the accuracy of the detection model. In addition, the extra prediction head added one more feature map, which performed better in detecting small MEMS defects. Some randomly selected detection results of the optimal model are shown in Figure 13, where green denotes the label box and red denotes the prediction box.

Training Results and Discussions
In order to evaluate the overall success of our research, it was essential to provide a comparison against similar studies conducted in recent years. Table 6 presents a comparison of our method with those in [14,29].
The objective in all three cases was surface defect detection. The authors of [14], based on a two-stage detector, did not achieve a good balance between detection accuracy and speed, while the authors of [29], based on a one-stage detector, reported better performance in both detection accuracy and speed. Although our method had a lower mAP than [29], the difference was minimal, at around 2.4%. In addition, our method achieved a faster detection speed. Thus, our method provides a more accurate and efficient defect detection model for MEMS production.

Conclusions
In conclusion, an improved generative model WGAN-DIV-DC and an optimal detection model were proposed to detect MEMS surface defects, which addressed the challenges of real-time defect detection during the fabrication process. First, WGAN-DIV-DC generated more diverse defect samples to expand the original dataset. Secondly, the YOLOv5 baseline model was optimized by introducing Mosaic and one more prediction head. The comparative experimental results showed that the optimal model achieved the highest mAP of 0.901 and F1 score of 0.856. As compared with the baseline model, the optimal model trained by using the augmented dataset performed better. This proposed framework has the potential to be used in other similar surface defect detection scenarios, which could reduce the reliance on original data collection. In addition, the visual detection method would significantly improve production efficiency and product quality.
Although the experimental results have shown the feasibility of the proposed method for MEMS surface defect detection, we plan to explore different networks and parameters for model optimization in future work.

Data Availability Statement: This study did not report any data.