Assembly Quality Detection Based on Class-Imbalanced Semi-Supervised Learning

Because assembly processes are imperfect, an unqualified assembly with a missing gasket or lead seal can degrade the product's performance and may cause safety accidents. Machine vision methods based on deep learning have been widely used in quality inspection, and semi-supervised learning (SSL) has been applied to train deep learning models with a reduced annotation burden. Datasets obtained from the production line tend to be class-imbalanced because the assemblies are qualified in most cases. However, most SSL methods perform worse on class-imbalanced datasets. Therefore, we propose a new semi-supervised algorithm that achieves high classification accuracy on the class-imbalanced assembly dataset with limited labeled data. Based on the mean teacher algorithm, the proposed algorithm uses certainty to dynamically select reliable teacher predictions for student learning, and the loss functions are modified to improve the model's robustness against class imbalance. Results show that when only 10% of the total data are labeled and the imbalance ratio is 5.3, the proposed method improves the accuracy from 85.34% to 93.67% compared with supervised learning. When 20% of the data are annotated, the accuracy reaches 98.83%.


Introduction
Screw fasteners are simple in construction and easy to operate. They are widely used in the mechanical structure as the crucial part of large equipment, such as aero-engine, high-speed railways, production machinery, wind turbines, air conditioning systems, and elevator cranes. The assembly quality should be carefully inspected because it determines the mechanical properties and safety of products. Screw-gasket-seal is a typical connector assembly structure. The gasket is used to protect the surface of the connector from screw abrasion. The seal can pass through several structures to form a closed loop, preventing the structure from being loose. The images of qualified assembly samples are shown in Figure 1(a1-a4), with a lead threading through two parallel holes of the hexagon screw and a gasket placed between the two mating surfaces to increase friction. Unqualified assemblies (Figure 1(b1-b4)) without gaskets or lead seals have hidden safety hazards, which may cause inestimable loss of life and property.
Therefore, it is crucial to detect effectively whether such assemblies are qualified. With the advent of Industry 4.0 and the continuous development of machine learning technology, the manufacturing industry has an increasing demand for automatic and intelligent production [1,2]. Both industrial practitioners and academic researchers are exploring intelligent detection methods to replace manual inspection. Common noncontact inspection methods include 3D scanning [3][4][5] and machine vision. The 3D scanning methods can provide the precise position of a component to assist industrial robots. Machine vision detection methods use a camera to capture images from production lines and then apply algorithms to extract and analyze the image information. They provide objective, high-speed, and reliable measurement with a simple system and strong adaptability. Before the rise of deep learning algorithms, many studies were based on low-level visual features. For example, in [6], the Hough transform was applied to bolt-loosening detection in the connections of wind turbine tower segments; Liu et al. [2] used the gradient coded co-occurrence matrix (GCCM) to inspect for missing bogie block keys on freight cars. Because they are data-driven, the fast-developing deep learning algorithms can extract knowledge from historical data, reducing the dependence on expert domain knowledge and avoiding the manual design of visual features. They have been widely used in the detection of key components of high-speed trains [7][8][9][10], fault diagnosis [11][12][13], high-voltage transmission line detection [14][15][16], and other industrial applications. The high generalization performance of deep learning relies on a large amount of labeled data.
However, collecting and annotating tens of thousands of samples requires massive manpower and repetitive labor, and the number of samples in the unqualified class is often low, resulting in a class-imbalanced training set. These factors significantly affect the neural network model. Few-shot learning, generative methods, and semi-supervised learning have been presented to alleviate these problems. Few-shot learning fully utilizes a small number of labeled data [17]. It is suitable for tasks where a variety of prior knowledge is available, including supervised data from other domains and modalities. Wang et al. [11] used a classification algorithm based on the similarity of sample pairs in intelligent bearing fault diagnosis, which included feature learning and metric learning modules. The feature learning module used twin neural networks to extract features from each sample of the pair separately, and the metric learning module predicted the similarity of the sample pair. Classification was conducted according to the similarity between the test sample and the labeled sample. To avoid manually designing a similarity measure function, [18] introduced meta-learning into the metric learning module in machine fault diagnosis to learn the distance function adaptively.
Some studies attempt to generate data artificially. A common generating method is to use the generative adversarial network (GAN). Two models are trained simultaneously: one generative model G capturing the data distribution, and a discriminant model D estimating the probability that samples come from real data rather than G [19]. One study [20] used a generator to generate faulty mode data in compressor fault detection by minimizing the cluster center distance between the real and the generated data. In [21], the generative model was also applied to intelligent fault diagnosis of rotating machinery, where feature differences obtained by a feature extractor were minimized to obtain high-quality generated data. Compared with the classification network, the generative model requires an additional design of the generator structure. Moreover, it needs careful adjustment of the hyper-parameters and more time in optimizing the network structure to prevent divergence.
When unlabeled samples are available in large quantities, semi-supervised learning (SSL) is a promising approach. There are two main kinds of semi-supervised learning methods: unsupervised preprocessing and perturbation-based method. Unsupervised preprocessing methods usually extract features from unlabeled data. For example, in detecting freight car plate bolts, [22] used unlabeled data to pre-train the stacked autoencoder and then assigned its weights to the classification network. Zhang et al. [12] combined the training stages of auto-encoder and classifier and used two identical encoder networks to process labeled and unlabeled data for bearing fault diagnosis simultaneously. This integrated training method can achieve higher classifier accuracy.
The consistency regularization method based on perturbation is used to find the smooth manifold where the dataset lies by leveraging the unlabeled data [23]. It is assumed that similar samples have similar labels in the dense data space. Consistency regularization methods usually add perturbations to the input data, network structure, or training mode and constrain the probability distribution of the model's output to remain unchanged. They do not depend on any intermediate steps or pre-trained supervised learners and generally extend the existing supervised loss function to cover the unlabeled data. In the Π model [24], each sample of the labeled and unlabeled datasets is propagated twice in every epoch, and perturbations are introduced by random noise on the input data and network dropout. To reduce the computing burden, the temporal ensembling method [25] replaces one forward propagation with the exponential moving average (EMA) of earlier predictions. The mean teacher (MT) method [26] directly applies the EMA to the regular network parameters to obtain a separate teacher model. Compared with temporal ensembling, it produces more accurate predictions. There are few studies on semi-supervised learning methods for detecting unqualified assemblies or other products. Academic studies of consistency regularization methods almost always assume that the distribution of instances across classes is balanced [19]. As a result, these methods have difficulty ensuring ideal performance on imbalanced data, especially for minority classes. Sometimes they are even worse than supervised learning methods [27].
To address the issues above, this article introduces a semi-supervised learning algorithm to detect unqualified assembly samples. It can achieve an accuracy of 93.67% when the labeled fraction of the training dataset is 10% and the imbalance rate is 5.3. This algorithm improves the mean teacher algorithm and makes up for the deficiency of the semi-supervised learning method in the class-imbalanced scenarios. Firstly, the certainty values of teacher predictions are measured, and the teacher predictions with high certainty are selected for the consistency constraint. Then, label-distribution-aware margin loss (LDAM Loss) [28] is applied to the labeled data training of the student model to enhance the robustness against class imbalance under supervised learning, and compression consistency loss (CCL) is adopted to prevent decision boundaries from skewing into the minority class regions. This paper is organized as follows: Section 2 introduces the proposed semi-supervised learning algorithm in detail. In Section 3, the assembly dataset and comparative experiments between the proposed method and other existing methods are described, and the results are discussed. Finally, Section 4 presents the main conclusions.

Class-Imbalanced Semi-Supervised Learning
Assume that the training set $D$ contains $N$ samples, $N_l$ of which are labeled. Let $D_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ represent the labeled training set, where $x_i$ denotes a training sample, $y_i \in \{0, 1\}^C$ is the corresponding one-hot label, and $C$ is the number of classes.

Model Framework for Assembly Quality Detection
The method adopted in this paper is based on the mean teacher algorithm, which includes a teacher model and a student model with the same network structure. The overall algorithm is shown in Figure 2. The models use the DenseNet121 [29] structure, known for achieving high performance with a reduced number of parameters. As shown in Figure 2, DenseNet121 includes four dense blocks connected by transition layers. The transition layer consists of one 1 × 1 convolutional layer and one 2 × 2 pooling layer. The dense block is composed of multiple dense layers, and the output of each dense layer is connected to the other layers in a feedforward manner to prevent the disappearance of features. The dense block and dense layer are illustrated in Figure 3, and each dense layer is a sequence of BN-ReLU-1 × 1 Conv-BN-ReLU-3 × 3 Conv.
Inspired by knowledge distillation, the mean teacher (MT) method [26] uses a teacher-student structure. The weights of the teacher are the exponential moving average of the weights of the student. The mean teacher algorithm introduces perturbations in the model weights and input data and encourages the predictions to remain the same. Defining $\theta_s$ as the weights of the student model, the corresponding weights $\theta_t$ of the teacher model are

$$\theta_t^{iter} = \mu\,\theta_t^{iter-1} + (1-\mu)\,\theta_s^{iter}, \qquad \mu = \min\left(1 - \frac{1}{iter+1},\ \mu_0\right)$$

where $\mu$ is the smoothing coefficient hyper-parameter, $iter$ is the global iteration step, and $\mu_0$ is the maximum value of $\mu$. Early in training, $\mu$ is small, so the teacher is rapidly updated by the new student weights. Later in training, when $\mu$ reaches $\mu_0$, the teacher has a longer memory because the student improves slowly. As perturbations to the input, random noise augmentations $(\eta, \eta')$ are applied to the original sample $x_i$ before it is fed to the models, so the predictions of the student and the teacher are

$$\hat{y}_i^s = f(x_i;\ \theta_s, \eta), \qquad \hat{y}_i^t = f(x_i;\ \theta_t, \eta')$$

where $f(\cdot)$ denotes the network output. The usual mean teacher approach uses the mean square error (MSE) as the consistency regularization loss (CRL) to minimize the Euclidean distance between the teacher prediction and the student prediction.
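The teacher update and the MSE consistency term described above can be illustrated with a minimal sketch in plain Python. It assumes the ramp $\mu = \min(1 - 1/(iter+1), \mu_0)$ used in common mean teacher implementations; the function names and the flat-list weight representation are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
def ema_coefficient(step, mu_max=0.999):
    """Smoothing coefficient: small at early steps, capped at mu_max later.

    Assumption: the ramp min(1 - 1/(step + 1), mu_max) is used.
    """
    return min(1.0 - 1.0 / (step + 1), mu_max)


def ema_update(teacher, student, mu):
    """New teacher weights: mu * teacher + (1 - mu) * student, element-wise."""
    return [mu * t + (1.0 - mu) * s for t, s in zip(teacher, student)]


def mse_consistency(student_probs, teacher_probs):
    """Mean squared error between student and teacher class probabilities."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


# Toy weights (hypothetical values): the teacher tracks the student.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for step in range(3):
    teacher = ema_update(teacher, student, ema_coefficient(step))
```

At step 0 the coefficient is 0, so the teacher simply copies the student; as training proceeds the coefficient approaches its cap and the teacher changes more slowly.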
Not all teacher predictions are reliable, and consistency constraints on unreliable predictions damage model performance. To make the student model dynamically select reliable predictions from the teacher, this paper adopts the certainty-driven mechanism, which is explained in detail in Section 2.2. At each iteration, m samples with reliable teacher predictions are chosen from the mini-batch to form the subset M, and the consistency constraint is computed on M. The cost function is then the sum of the consistency regularization losses of the samples in M.
To improve the robustness of MT to class imbalance, LDAM loss is applied to the $b_l$ labeled samples in the mini-batch to enhance the accuracy of the student model in supervised training. Moreover, the compression consistency loss (CCL) is introduced to weaken the decision boundary smoothing effect of samples predicted to be the majority class. Details of the class-imbalanced loss functions are described in Section 2.3. Finally, the semi-supervised cost function of the mini-batch combines the LDAM loss on the labeled samples with the CCL on the selected subset M. The pseudocode in Algorithm 1 presents the whole process of the proposed method.

Algorithm 1 Training of the proposed method

Certainty Driven Selection
The deficiency of existing perturbation-based semi-supervised learning methods is that all the outputs are regularized without exception. A large part of the outputs can be unreliable due to the confirmation bias [26]. Confirmation bias results from incorrect predictions of unlabeled data used in subsequent training, increasing the confidence of wrong predictions and making the model resist new changes. In this case, maintaining the consistency regularization will result in the student model converging incorrectly. In the absence of labeled targets as supervision, evaluating the certainty of teacher predictions, and filtering out low-certainty samples is necessary to ensure that the consistency constraints only apply to high-certainty samples.
The certainty-driven selection method is shown in Figure 4. It is assumed that the teacher network has $H$ layers, with the parameter set $\theta_t = \{\Phi_h\}_{h=1}^{H}$ determined by limited random variables, where $\Phi_h$ represents the parameters of layer $h$. For sample $x_i$, the predicted distribution $q(\hat{y}_i^t \mid x_i)$ of the teacher is approximated as [30]

$$q(\hat{y}_i^t \mid x_i) = \int p(\hat{y}_i^t \mid x_i, \theta_t)\, q(\theta_t)\, d\theta_t$$

where $p(\hat{y}_i^t \mid x_i, \theta_t)$ is the prediction probability based on the input data $x_i$ and model parameters $\theta_t$, and $q(\theta_t)$ is the posterior distribution of the model parameters, which cannot be obtained directly but can be estimated by dropout variational inference. Dropout variational inference is a practical method for approximating large and complex models [31]. Dropout is applied to every weight layer in both the training and the testing phases. Inference is performed by sampling the approximate posterior, which is also referred to as Monte Carlo dropout. In addition, Liu et al. [32] argue that a prediction with high certainty should be consistent under randomly sampled subnetworks and random noise in the inputs. Given the set of $K$ teacher predictions obtained from $K$ random augmentations of the input $x_i$, each passed through a randomly sampled subnetwork, the prediction variance (PV) [30] over this set is used to measure the uncertainty $U(x_i)$ of the prediction. The higher the variance, the higher the uncertainty.
For the input data $x_i$, $i \in \{1, \dots, B\}$ of the mini-batch, the uncertainty values of the teacher predictions are $[U(x_1), \dots, U(x_B)]$. The inputs are sorted in ascending order of uncertainty to form the ordered input set $\{P_1, \dots, P_B\}$. The reliable input sample set $M = \{P_1, \dots, P_m\}$ contains the $m$ lowest-uncertainty samples of the ordered input set and is used for the consistency constraints. Here $m = \min(\beta e, B)$, where $e$ is the current epoch and $\beta$ is the ramp-up coefficient. The number of samples selected by certainty increases over time, because the teacher predictions become more accurate during training.
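The selection step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the prediction variance is assumed to be the per-class variance over the K stochastic teacher passes, summed over classes, and the function names are hypothetical.

```python
def prediction_variance(samples):
    """Uncertainty of one input from K stochastic teacher predictions.

    samples: list of K probability vectors (one per MC-dropout pass).
    Assumption: variance is taken per class over the K passes and summed.
    """
    k = len(samples)
    num_classes = len(samples[0])
    total = 0.0
    for c in range(num_classes):
        values = [p[c] for p in samples]
        mean = sum(values) / k
        total += sum((v - mean) ** 2 for v in values) / k
    return total


def select_reliable(uncertainties, epoch, beta=2):
    """Indices of the m = min(beta * epoch, B) least uncertain inputs."""
    batch_size = len(uncertainties)
    m = min(beta * epoch, batch_size)
    order = sorted(range(batch_size), key=lambda i: uncertainties[i])
    return order[:m]


# Hypothetical mini-batch of three inputs, two stochastic passes each.
mc_outputs = [
    [[0.9, 0.1], [0.9, 0.1]],  # stable prediction, zero variance
    [[1.0, 0.0], [0.0, 1.0]],  # contradictory passes, high variance
    [[0.8, 0.2], [0.6, 0.4]],  # moderately stable
]
scores = [prediction_variance(s) for s in mc_outputs]
reliable = select_reliable(scores, epoch=1, beta=2)
```

With `beta=2` and `epoch=1`, only the two most stable inputs are kept for the consistency constraint; in later epochs the whole mini-batch is eventually used.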

Class-Imbalanced Learning
Imbalanced learning is a machine learning paradigm in which the classifier learns from a dataset with a skewed class distribution. In this paper, we modify the training losses to further improve the robustness of the model to the imbalanced dataset. There are two main methods for solving class imbalance in supervised learning: loss re-weighting and mini-batch resampling [33][34][35]. These methods make the proportion of samples of different classes in the training loss closer to the test distribution to achieve a better trade-off between the accuracy of majority and minority classes. However, the model is usually very large relative to the number of samples of the minority class, so it tends to over-fit the minority class. Label-distribution-aware margin loss (LDAM Loss) [28] regularizes different classes according to the number of samples: the regularization of the minority class should be stronger than that of the majority class to boost the generalization ability of the model on the minority class without sacrificing the fitting ability on the majority class.
Figure 5 shows an example of binary classification, where $\chi_1$ and $\chi_2$ represent the margins of the majority class and the minority class, respectively. The class margin is the minimum distance from all samples of this class to the decision boundary. The minority class should have a larger margin than the majority class. For the multi-class problem, the minimum test error can be obtained when the margin of class $c$ satisfies $\chi_c \propto 1/N_c^{1/4}$, where $N_c$ is the sample number of class $c$.

Appl. Sci. 2021, 11, 10373

Therefore, hinge loss is adopted to enforce the class margin. For the labeled sample $(x_i, y_i)$, the hinge loss is

$$L_{hinge}(x_i, y_i) = \max\left(\max_{l \neq y_i} z_l - z_{y_i} + \Delta_{y_i},\ 0\right)$$

where $z_l$ is the $l$th output of $\hat{y}_i^s$ predicted by the student model, $\Delta_{y_i}$ is the margin of class $y_i$, satisfying $\Delta_{y_i} = A / N_{y_i}^{1/4}$, and $A$ is a constant for margin tuning. Since the hinge loss is non-smooth, it is hard to optimize. A smoother cross-entropy loss with enforced class margin is adopted:

$$L_{LDAM}(x_i, y_i) = -\log \frac{e^{z_{y_i} - \Delta_{y_i}}}{e^{z_{y_i} - \Delta_{y_i}} + \sum_{l \neq y_i} e^{z_l}}$$

Class imbalance brings more challenges to the semi-supervised learning algorithm because, based on the smoothness assumption, the decision boundary is located in the low-density area of the data space [23]. However, in a class-imbalanced dataset, the high-density area of the minority class is sparse relative to the majority class, so the decision boundary enters the minority class region and the model predicts minority class samples to be the majority class. Therefore, to prevent the decision boundary from being overly smooth and infiltrating the minority class areas, the consistency constraints should be suppressed when the prediction given by the teacher model is the majority class. For the $m$ samples $x_j$ from the certainty-driven selection, the compression consistency loss is defined as

$$L_{CCL} = \sum_{j=1}^{m} g(N_{\hat{c}_j})\, \lVert \hat{y}_j^t - \hat{y}_j^s \rVert^2, \qquad g(N_{\hat{c}}) = \delta^{\,1 - N_{\min}/N_{\hat{c}}}$$

where $\hat{c}$ represents the class predicted by the teacher model, $\delta \in (0, 1]$ is the compression coefficient, and $N_{\min}$ is the sample number of the class with the fewest samples. When $N_{\hat{c}} = N_{\min}$, $g(N_{\hat{c}}) = 1$, so the compression consistency loss of samples predicted to be the smallest class is the same as the typical consistency regularization loss (CRL). The larger the data size of the predicted class $\hat{c}$, the smaller $g(N_{\hat{c}})$ is. Therefore, the final semi-supervised cost function combines the LDAM loss computed on the $b_l$ labeled samples of the mini-batch with the compression consistency loss computed on the $m$ selected samples.
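The two class-imbalanced losses can be sketched in plain Python as follows. The sketch assumes the margin-shifted cross-entropy form of LDAM and a compression factor of the form $\delta^{1 - N_{\min}/N_{\hat{c}}}$, which equals 1 for the smallest class and decreases towards $\delta$ for larger classes. The class counts are hypothetical (chosen to give an imbalance ratio of 5.3), and all function names are illustrative.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Margins proportional to N_c^(-1/4), scaled so the largest is max_margin."""
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]


def ldam_loss(logits, label, margins):
    """Cross-entropy after subtracting the class margin from the true-class logit."""
    shifted = list(logits)
    shifted[label] -= margins[label]
    peak = max(shifted)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in shifted))
    return log_sum - shifted[label]


def ccl_weight(predicted_count, min_count, delta=0.5):
    """Assumed compression factor g(N) = delta ** (1 - N_min / N)."""
    return delta ** (1.0 - min_count / predicted_count)


def compressed_consistency(student_probs, teacher_probs, class_counts, delta=0.5):
    """Squared error scaled by g of the class predicted by the teacher."""
    predicted = max(range(len(teacher_probs)), key=lambda c: teacher_probs[c])
    weight = ccl_weight(class_counts[predicted], min(class_counts), delta)
    error = sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs))
    return weight * error


# Hypothetical class counts: one majority class and two minority classes.
counts = [530, 100, 100]
margins = ldam_margins(counts)
loss = ldam_loss([2.0, 0.0, 0.0], label=0, margins=margins)
```

The minority classes receive the largest margin, and a consistency term whose teacher prediction falls in the majority class is scaled down, which is the behavior the text asks for.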

Dataset
The original images of the assembly were captured from the production line by the industrial camera BM-500GE (produced by JAI) with a CCD resolution of 2456 × 2058. We cropped the original images to form an assembly image dataset to reduce the influence of irrelevant background objects and focus on the central area of the assembly. The minimum resolution of images in the dataset was 279 × 235, and the maximum resolution was 313 × 528. All assembly images were divided into three classes: two unqualified classes (missing lead seal, and missing both lead seal and gasket) and a qualified class. The three types of images in the dataset are shown in Figure 6. There are only fine-grained differences between the two unqualified classes, so distinguishing the two unqualified minority classes is very difficult.
The training set contained a total of 16,663 images. Three labeled fractions ε of 10%, 20%, and 50% were adopted in experiments. We randomly extracted images for manual labeling to ensure that the labeled and unlabeled datasets had the same distribution. The sample numbers of the three classes in the training set and test set are shown in Table 1. It is worth noting that, to verify the model's classification performance on minority classes, we used a class-balanced test set in this study. In the labeled training set, the imbalance ratio of the majority class to the minority class was as high as 5.3, which is severe for traditional classification methods.
Table 1. Image numbers of the classes in the training set and the test set.

Training Settings and Metrics
The teacher and student were initialized with the weights of DenseNet121 pre-trained on ImageNet [36], and training ran for 100 epochs with a mini-batch size of 16, of which 8 samples were labeled. Because there were at least as many unlabeled images as labeled ones, in every epoch the labeled images were cycled repeatedly until every unlabeled image had been used once. We used stochastic gradient descent (SGD) to optimize the network, with a learning rate of 0.1, weight decay of 0.0001, and momentum of 0.9. The maximum value µ 0 of the EMA coefficient of the teacher model was 0.999. To obtain the certainty values of the teacher model predictions, we ran MC dropout five times, and the ramp-up coefficient β was 2. For the LDAM loss, parameter A was tuned so that the maximum margin was 0.5. For CCL, δ was set to 0.5. Random augmentations applied to the training data included random horizontal flipping and color jitter of brightness and contrast. After random augmentation, all images were resized to 224 × 224 before being sent to the network.
We used two common metrics in class-imbalanced learning to evaluate the performance of the model on the test set: balanced accuracy (bACC) [37] and geometric mean score (GM) [38], which are arithmetic and geometric mean scores, respectively, defined as where N c represents the sample number of class c, and TP c represents the sample number both belonging to class c and predicted to be class c. We trained the proposed and conventional methods for comparison on the training sets with the labeled fraction ε of 10%, 20%, and 50%. Then the classification performances were compared on the test dataset. Conventional methods included: supervised learning with limited labeled data; The standard mean teacher method; The mean teacher method with three commonly used class-imbalanced learning strategies: (1) re-weighting: the loss of each sample was re-weighted by the inverse of the sample number of the corresponding class, and re-normalized to make the average weight in the mini-batch was 1; (2) resampling: the sampling probability of each sample was inversely proportional to the sample number in its class; (3) focal loss [39]: the loss of the relatively correctly classified sample was reduced, and the loss of the difficult and incorrectly classified sample was increased. To ensure fair comparisons, all methods adopted the DenseNet121 model structure and the same hyper-parameters as the proposed method, such as pre-training initialization, labeled data batch size, and the optimization method mentioned above. All the training and testing experiments were repeated ten times, and the experimental results on the test set were averaged. The algorithm in this paper was implemented using Python toolkit PyTorch, and experiments were carried out on a computer with Intel Core I5-8500 @ 3.00 GHz CPU and 12G NVIDIA Titan RTX GPU. Figure 7 shows the differences in the representation of the training set between the supervised learning method (a) and the proposed method (b). 
A t-SNE [40] projection with a perplexity of 50 was used for visualization. In Figure 7a, the boundaries of the three classes were mixed. Therefore, with limited and imbalanced data, the model trained by the conventional supervised learning method struggled to learn a discriminative data representation. The proposed method formed clearer class boundaries and obtained better classification performance.

Table 2 shows the mean and standard deviation of bACC and GM for the above methods on the test set. The proposed algorithm achieved an average bACC of 93.67% and an average GM of 93.57% when the labeled fraction was 10%. When the labeled fraction was 20%, an average bACC of 98.83% and an average GM of 98.83% were achieved. When the labeled fraction was 50%, an average bACC of 99.17% and an average GM of 98.99% were reached. The proposed method performed better than the supervised learning method and all the mean teacher methods with existing class-imbalanced learning strategies, indicating that it was effective for limited labeled data with an imbalanced class distribution. In addition, the advantage of the proposed method over the other methods grew as the amount of annotated data decreased.

Figure 8 shows the error rates of all methods in the three classes. Figure 8a-c were obtained by training the models with labeled fractions ε of 10%, 20%, and 50%, respectively. The proposed method kept a low error rate in both the majority and minority classes. When labeled data were few (Figure 8a,b), the supervised learning method showed high error rates in all classes. In contrast, although the mean teacher method achieved higher accuracy, its error rates in the minority classes did not decrease significantly because of confirmation bias. The mean teacher method combined with class-imbalanced learning strategies led to overfitting in the lead seal missing class, which was the greatest minority class.
Although this lowered the error rate for that class, it sacrificed the fitting ability in the similar sub-minority class (both lead seal and gasket missing). In addition, as the amount of labeled data increased, the supervised learning algorithm already obtained a low error rate, and the mean teacher method with class-imbalanced learning strategies did not noticeably improve on it.
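The two summary metrics in Table 2 and the per-class error rates in Figure 8 all derive from the per-class recall. The following is a minimal sketch of that computation under the standard definitions of balanced accuracy and geometric mean score; the function names are illustrative and are not taken from the authors' code.

```python
import math


def per_class_recall(y_true, y_pred, num_classes):
    """Return TP_c / N_c for every class c (labels are integers 0..C-1)."""
    n = [0] * num_classes   # N_c: number of samples whose true class is c
    tp = [0] * num_classes  # TP_c: samples of class c predicted as class c
    for t, p in zip(y_true, y_pred):
        n[t] += 1
        if t == p:
            tp[t] += 1
    if 0 in n:
        raise ValueError("every class needs at least one test sample")
    return [tp[c] / n[c] for c in range(num_classes)]


def balanced_accuracy(recalls):
    """Arithmetic mean of the per-class recalls (bACC)."""
    return sum(recalls) / len(recalls)


def geometric_mean_score(recalls):
    """Geometric mean of the per-class recalls (GM)."""
    return math.prod(recalls) ** (1.0 / len(recalls))


def per_class_error_rate(recalls):
    """Error rate of each class, i.e. 1 - recall."""
    return [1.0 - r for r in recalls]
```

Because GM multiplies the recalls, a single class with very low recall pulls GM down much more strongly than it pulls bACC down, which is why the two scores are reported side by side.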

Experimental Results
Figure 9 compares the accuracy and loss of the proposed method and the standard mean teacher method during training. The accuracy values are the validation accuracy after each epoch. A thousand uniformly sampled loss values from all iterations were used to plot Figure 9(a2-c2). The proposed method converged faster and was more stable, and a model with higher accuracy was obtained in fewer iteration steps. Figure 10 compares the accuracy and the prediction variance (PV) during training. As the model accuracy increased, the uncertainty of the predicted labels gradually decreased. There was a strong inverse relationship between the classification accuracy and the average PV, which supports the effectiveness of certainty-driven selection.
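To make the prediction variance concrete, the sketch below computes one plausible PV for a single sample from several stochastic (MC dropout) forward passes. It is an assumption made for illustration: PV is taken here as the variance of each class probability across the passes, averaged over the classes, and the exact definition used for Figure 10 may differ.

```python
from statistics import pvariance


def prediction_variance(mc_probs):
    """Average per-class variance across stochastic forward passes.

    mc_probs: list of T probability vectors for one sample, one vector
    per MC dropout pass (T = 5 in the setup described in this paper).
    Assumed definition for illustration; not taken from the authors' code.
    """
    num_classes = len(mc_probs[0])
    per_class = [
        pvariance([probs[c] for probs in mc_probs])  # spread of class c over passes
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes
```

A sample whose passes all agree has a PV of zero, while passes that flip between classes give a large PV, so a low PV can be read as high certainty in the teacher prediction.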

Conclusions
This paper presents a semi-supervised class-imbalanced learning method based on the mean teacher to detect unqualified assembly samples. For the consistency constraint, samples with high reliability are selected according to the model prediction certainty to improve performance. Label-distribution-aware margin loss and compression consistency loss are employed to guarantee classification accuracy without sacrificing the fitting ability on the majority class. Experiments were carried out on the assembly image dataset, and the performance was evaluated and compared with traditional deep learning classification methods. To verify the performance of the proposed method with a small amount of labeled data, 10% of the total data were labeled. Experimental results show that the prediction accuracies of the supervised learning method, the mean teacher algorithm, and the proposed method were 85.34%, 88.22%, and 93.67%, respectively. The proposed method overcame the performance degradation of the traditional semi-supervised learning algorithm on class-imbalanced datasets, kept low error rates in all classes, and effectively avoided the over-fitting on the minority class that occurred with the commonly used class-imbalanced learning methods. The model performance was also examined when the labeled fraction increased to 20% and 50%, and the proposed method still achieved the highest accuracies of 98.83% and 98.99%. Future work will focus on applying the proposed method to more manufacturing scenarios and further enhancing classification accuracy.