1. Introduction
As an important class of electronic components, varistors play a key role in protecting circuits and stabilising voltage, acting as the 'safety valve' of a circuit system. However, owing to the complexity of the production process, varied and irregularly distributed defects readily appear on the surface of varistors in actual production [1], and these defects adversely affect their performance and service life to a certain extent. Traditional manual inspection methods generally suffer from high miss rates and low detection efficiency, and struggle to meet the quality and efficiency requirements of modern production [2]. Therefore, an effective surface defect detection method can not only improve product quality and production efficiency, but also significantly reduce labour costs, while providing strong support for the intelligent transformation of the production process.
With the rapid development of industrial technology and the continuous expansion of application scenarios, traditional machine vision inspection methods, by virtue of their efficient data processing capabilities, have shown significant advantages in high-speed, continuous operation, strongly supporting the improvement of modern industrial production efficiency. However, these methods mainly rely on manually designed algorithms and domain prior knowledge for feature extraction [3]; although they can achieve good results for a specific type of varistor in a specific scenario, their generalisation ability and environmental adaptability are clearly limited, especially in industrial field environments with large fluctuations in lighting conditions or complex background interference [4]. In such conditions, the detection accuracy of traditional machine vision systems degrades significantly, making it difficult to meet the strict stability requirements of modern intelligent manufacturing.
To overcome these limitations of traditional machine vision inspection in practical applications, deep learning has brought new breakthroughs to the field of defect detection. Supervised learning [5], as an important deep learning paradigm, can accurately establish the mapping between defect features and category labels by training on a large amount of labelled data, thereby achieving high-precision defect detection. However, this approach has significant shortcomings. First, acquiring high-quality labelled data requires substantial human and material costs, and subjective errors are inevitably introduced during labelling. Second, the generalisation performance of the model depends heavily on the distributional completeness of the training data; when facing out-of-distribution samples [6], the model is prone to false detections and missed detections. More critically, in some industrial scenarios, the scarcity of defective samples further restricts the scalability of supervised methods. In contrast, unsupervised methods construct a feature representation space from defect-free samples and reconstruct defective samples as defect-free ones [7], localising defective regions precisely by comparing the differences before and after reconstruction. The significant advantage of this approach is that defect categories need not be predefined, effectively circumventing the cumbersome sample labelling process.
In the defect detection task, Transformer-based masked image modelling masks the defective region with a target mask [8] so that the model reconstructs it from the features of the surrounding defect-free region. This is prone to target mask residuals, i.e., mask artifacts, during reconstruction, making it difficult to achieve high-precision defect localisation by comparing the images before and after reconstruction. CNNs mainly rely on convolution over local receptive fields, which makes it difficult to model long-distance dependencies between pixels; the resulting weak perception of global semantic information leads to unclear defect segmentation boundaries. Moreover, the non-periodic random texture on the surface of varistors is coupled with complex background interference, and the CNN's inherent translation-invariant feature extraction mechanism can misidentify discretely distributed defect features as normal texture. This feature confusion makes it difficult for the model to distinguish defective from normal regions, resulting in mis-segmentation. Diffusion models, with their inherently multi-step iterative denoising mechanism, incur substantial computational complexity and memory overhead [7], posing significant challenges for resource-constrained industrial deployment. Their performance is also highly dependent on substantial high-quality training data to accurately learn the distribution of normal data, so overfitting readily occurs in industrial scenarios with limited training samples. In addition, the random sampling inherent in diffusion models may introduce unnecessary noise or uncertainty, yielding unstable reconstruction outcomes; this instability can lead to false detections when comparing pre- and post-reconstruction differences.
Based on the above problems, an unsupervised detection method for surface defects of varistors based on reconstructing the normal distribution under mask constraints is proposed. To make the model focus on the texture distribution of the main region of the image and reduce its attention to the background region, an image preprocessing method based on colour space and morphology is developed to remove the background and extract the main body image of the varistor, and a mask-constrained body pseudo-anomaly generation strategy is adopted, which effectively alleviates the residual-defect phenomenon in the reconstruction results and improves the model's localisation ability. The KAN is combined with the U-Net to construct a segmentation sub-network, and the Gaussian radial basis function is introduced as the learnable activation function of the KAN to enhance the model's ability to express image features, thereby revealing the differences between the original and reconstructed images more accurately and realising more accurate recognition of subtle defects.
The main contributions of this paper are as follows:
(a) An image preprocessing method based on colour space and morphology is proposed to extract the body image of the varistor, and a mask-constrained body pseudo-anomaly generation strategy is adopted. Together, these enable the model to focus on the texture distribution of the main region of the image, reduce its attention to the background region, alleviate the residual-defect phenomenon in the reconstruction results, and enhance the model's localisation capability.
(b) The KAN is combined with the U-Net to construct a segmentation sub-network, and the Gaussian radial basis function (GRBF) is introduced as the learnable activation function of the KAN to enhance the model's ability to express image features, thereby realising more accurate recognition of subtle defects.
(c) A multi-colour, multi-specification varistor dataset is established to evaluate the performance of the proposed unsupervised varistor defect detection model. The experimental results show that the proposed method achieves superior performance and good generalisation in the varistor defect detection task.
This article is organised as follows.
Section 2 reviews image reconstruction methods and the KAN architecture.
Section 3 presents the proposed unsupervised method for detecting surface defects on varistors by reconstructing the normal distribution under mask constraints.
Section 4 describes the dataset collection equipment, datasets, training details, and evaluation metrics, and discusses the results of each experiment.
Section 5 summarises the experimental results and outlines future research directions.
3. Proposed Methods
The Var-MNDR framework is divided into three parts: varistor image preprocessing, a model training phase, and a model testing phase, as shown in Figure 1.
(a) The original varistor image is processed through pixel channel processing and background colour removal to obtain a background-free varistor image. Elliptical feature fitting and affine transformation are used to obtain an elliptically corrected varistor image. Finally, Canny edge detection is used to find the maximum bounding rectangle to obtain the preprocessed varistor image.
(b) The model mainly consists of a reconstruction sub-network and a segmentation sub-network. In the training stage, defect-free original samples are first preprocessed and then controllably corrupted with artificially synthesised pseudo-anomaly information to form pseudo-anomaly images. After the pseudo-anomaly images are input into the reconstruction sub-network, it establishes the mapping between normal and abnormal texture features and finally reconstructs the pseudo-anomalous regions accurately according to the feature distribution of normal samples.
(c) In the testing stage, sample images containing real defects are input. After the same preprocessing, the reconstruction sub-network reconstructs potential defective regions that deviate from the normal texture space, based on the prior knowledge of normal texture obtained during training, and generates defect-free reconstruction results consistent with the normal texture distribution. Finally, the reconstructed image is fed into the segmentation sub-network together with the preprocessed image to generate the final defect segmentation image.
3.1. Image Preprocessing Algorithms
Since the varistor dataset is collected manually and the position of the varistor in the image is random, the varistor image preprocessing algorithm shown in Figure 2 is used to remove redundant background features and unify the pose of the varistor images.
The steps of the algorithm are described as follows:
Step 1 First, both the background image (captured without a varistor) and the image to be processed (with a varistor) are converted from the RGB colour space to the HSV colour space, and the H-channel pixel values of the two HSV images are retained [19].
Step 2 The H-channel colour space image obtained in step 1 is subjected to image differencing and image threshold segmentation to obtain a mask image [
20]. The differential image process can be described as follows:
where
denotes the background image,
denotes the image to be processed, and
denotes the differential image.
Step 3 Record the pixel positions in the mask image of Step 2 whose value is 0, and set the corresponding positions of the image to be processed to 0, thereby obtaining the image to be processed with the background removed.
Step 4 Fit the mask image with elliptic features [21]; the standard form of the ellipse equation is as follows:

A x^2 + B xy + C y^2 + D x + E y + F = 0

where B^2 − 4AC < 0. If the data point set {(x_i, y_i)}, i = 1, …, N, is given, define the parameter vector θ = (A, B, C, D, E, F)^T, and the least squares optimisation objective is as follows:

min_θ Σ_{i=1}^{N} (u_i^T θ)^2

where u_i = (x_i^2, x_i y_i, y_i^2, x_i, y_i, 1)^T is the design vector of the ith data point, subject to a normalisation constraint on θ to exclude the trivial solution. After fitting the elliptic equation coefficients A, B, C, D, E, F, calculate the centre coordinates, axis lengths and rotation angle, and perform rotation and affine transformation according to the calculated results.
Step 5 Rotate and affine transform the resultant image obtained in Step 3 according to the centre coordinates [22], axis lengths and rotation angle from Step 4 to obtain an ellipse-corrected image to be processed.
Step 6 Perform Canny edge detection on the resultant image obtained in Step 4, find the maximum outer rectangle and crop to obtain a normalised mask image [23]. The edge detection gradient strength M and direction θ can be described as follows:

M = sqrt(G_x^2 + G_y^2),  θ = arctan(G_y / G_x)

where G_x and G_y denote the horizontal and vertical gradients of the image, respectively.
Step 7 The resultant image obtained in Step 5 is cropped according to the maximum outer rectangle obtained in Step 6 to obtain the normalised image to be processed.
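Steps 1-3 above can be sketched as follows. This is a minimal, hypothetical Python illustration on toy single-channel "H-channel" arrays (a real pipeline would use OpenCV routines such as cvtColor, absdiff and threshold); the helper names and threshold value are assumptions for illustration only.

```python
# Minimal sketch of Steps 1-3: H-channel differencing, threshold
# segmentation, and background removal. Toy 4x4 "H-channel" images are
# plain nested lists; the threshold t=20 is illustrative only.

def absolute_difference(background, image):
    """Pixel-wise absolute difference of two equally sized images."""
    return [[abs(p - b) for b, p in zip(brow, prow)]
            for brow, prow in zip(background, image)]

def threshold_mask(diff, t):
    """Binary mask: 255 where the difference exceeds t, else 0."""
    return [[255 if v > t else 0 for v in row] for row in diff]

def remove_background(image, mask):
    """Set pixels to 0 wherever the mask is 0 (Step 3)."""
    return [[p if m else 0 for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]

# Background H-channel (uniform) and a scene with a "varistor" patch.
bg  = [[30] * 4 for _ in range(4)]
img = [[30, 30, 30, 30],
       [30, 90, 95, 30],
       [30, 92, 88, 30],
       [30, 30, 30, 30]]

diff = absolute_difference(bg, img)
mask = threshold_mask(diff, t=20)
fg   = remove_background(img, mask)
print(fg)  # background pixels are zeroed, the 2x2 body patch survives
```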
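The least squares conic fit of Step 4 can also be sketched. The fragment below is a hedged illustration, not the paper's implementation: it fixes the normalisation F = −1 so each point yields one linear equation, and solves the resulting 5×5 normal equations with plain Gaussian elimination (a production pipeline would more likely call cv2.fitEllipse).

```python
import math

# Hedged sketch of Step 4: fit the general conic
#   A x^2 + B xy + C y^2 + D x + E y + F = 0
# by least squares with the normalisation F = -1, so every sample point
# gives one linear equation  A x^2 + B xy + C y^2 + D x + E y = 1.

def solve(a, b):
    """Solve a square linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_conic(points):
    """Least-squares conic coefficients (A, B, C, D, E) with F = -1."""
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(5)]
           for i in range(5)]
    atb = [sum(r[i] for r in rows) for i in range(5)]
    return solve(ata, atb)

# Sample points on the circle x^2 + y^2 = 4 (radius 2, centred at origin):
pts = [(2 * math.cos(t), 2 * math.sin(t))
       for t in [k * math.pi / 3 for k in range(6)]]
A, B, C, D, E = fit_conic(pts)
print(round(A, 4), round(C, 4))  # both ~0.25, recovering x^2/4 + y^2/4 = 1
```

The recovered coefficients describe the same conic up to scale; centre, axis lengths and rotation angle then follow from A through F by the standard conic-to-ellipse conversion.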
3.2. Pseudo-Anomaly Generation Strategy
Since defects only exist in the main body region of the image, a mask-constrained body pseudo-anomaly generation strategy is proposed, as shown in Figure 3, to make the model focus on the texture distribution of the body region and reduce its attention to the background region.
First, the Perlin noise generator is used to generate a random noise image with natural texture features (Figure 3, Perlin); dynamic thresholds obtained through uniform random sampling produce diverse noise patterns, ranging from minute pinholes, cracks, and scratches to larger patches, while controlling the size and sparsity of the generated defect regions. The noise image then undergoes binarisation to produce an initial defect mask image (Figure 3, Mp). Combined with the normalised mask obtained in Step 7 of Section 3.1 (Figure 3, Mt) as a spatial constraint, morphological logic operations screen out the noise regions that intersect with the body region as effective pseudo-anomaly candidates (Figure 3, Ma). To ensure that the texture features of the generated pseudo-anomaly image remain reasonable, a random dynamic weighting coefficient is used to spatially superimpose the anomaly source image (Figure 3, anomaly source) on the original normal image (Figure 3, I), ultimately generating pseudo-anomalies consistent with the feature distribution of the body region. The synthetic image can be described as follows:
I_s = (1 − M_a) ⊙ I + (1 − β)(M_a ⊙ I) + β(M_a ⊙ A)

where I is the original normal image, A is the anomaly source image, M_a is the effective pseudo-anomaly mask, β ∈ (0, 1) is the random dynamic weighting coefficient, and ⊙ denotes element-wise multiplication, with (1 − M_a) being the inverse of M_a.
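The superposition described above can be sketched in a few lines of Python. This is a toy illustration on 2×2 grayscale grids; the function name and the sampling range for β are assumptions for illustration.

```python
import random

# Sketch of the mask-constrained superposition:
#   I_s = (1 - Ma) * I + (1 - beta) * Ma * I + beta * Ma * A
# Pixels outside the pseudo-anomaly mask Ma keep the normal image I;
# masked pixels become a beta-weighted blend of I and the anomaly
# source A. Images are toy 2x2 grayscale grids in [0, 1].

def synthesize(I, A, Ma, beta):
    return [[(1 - m) * i + (1 - beta) * m * i + beta * m * a
             for i, a, m in zip(irow, arow, mrow)]
            for irow, arow, mrow in zip(I, A, Ma)]

I  = [[0.2, 0.2], [0.2, 0.2]]          # normal body texture
A  = [[0.8, 0.8], [0.8, 0.8]]          # anomaly source image
Ma = [[0, 1], [0, 0]]                  # one masked pixel
beta = random.uniform(0.2, 0.8)        # random dynamic weighting coefficient

Is = synthesize(I, A, Ma, beta)
print(Is[0][0], Is[0][1])  # unmasked pixel unchanged; masked pixel blended
```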
3.3. Reconstruction Sub-Network
The reconstruction sub-network consists of an encoder and decoder architecture. In the training phase, pseudo-anomalous images subjected to the body mask constraint are fed into the reconstruction sub-network, which establishes the mapping between normal and abnormal texture features and reconstructs the pseudo-anomalous images as close approximations of the original images according to the feature distribution of normal samples, thus achieving an accurate reconstruction of pseudo-anomalous regions. Structural similarity (SSIM) and mean square error (MSE) are used as constraints: the former ensures global semantic consistency by measuring the similarity of local structures in the image, while the latter preserves detailed features by constraining pixel-level differences. The loss function of the reconstruction network can be expressed as follows:
L_rec = λ L_SSIM + (1 − λ) L_MSE

where L_SSIM and L_MSE denote the structural similarity loss and the mean square error between the preprocessed and reconstructed images, respectively, and λ is the weighting parameter that regulates the balance between the two losses, which is set to 0.5 in the experiments.
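A minimal sketch of this combined loss is given below. Note one simplification that is our assumption, not the paper's method: SSIM is computed globally over the whole image rather than over sliding windows as in the standard implementation, and images are flat lists of intensities in [0, 1].

```python
# Hedged sketch of the reconstruction loss L_rec = λ·L_SSIM + (1-λ)·L_MSE.
# SSIM is computed globally (single window) for brevity; C1 and C2 are
# the usual stabilising constants for intensities in [0, 1].

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def reconstruction_loss(x, y, lam=0.5):
    return lam * (1 - ssim_global(x, y)) + (1 - lam) * mse(x, y)

original      = [0.1, 0.5, 0.9, 0.4]
reconstructed = [0.1, 0.5, 0.9, 0.4]   # perfect reconstruction
print(reconstruction_loss(original, reconstructed))  # → 0.0
```

A perfect reconstruction drives both terms to zero, while either structural or pixel-level deviation raises the loss.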
3.4. Segmentation Sub-Network
To achieve the precise identification of minor defects, we retain the overall topology of the U-Net, namely the 'encoder-decoder-skip connection' structure, while replacing the standard convolutional blocks in the encoder and decoder with G-KAN blocks. We also introduce Gaussian radial basis functions (GRBFs) as learnable activation functions for the KAN, enhancing the model's ability to express non-linear features and improving segmentation accuracy. Additionally, we use the GRBF to approximate B-spline functions and employ layer normalisation to keep the input within the RBF domain, accelerating model training without compromising performance.
The Kolmogorov–Arnold representation theorem states that any multivariate continuous function defined on a bounded domain can be represented as a two-level nested composition of a finite number of univariate functions; the mathematical expression of the theorem is as follows [14]:

f(x_1, x_2, …, x_n) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} φ_{q,p}(x_p) )

where φ_{q,p} are a set of learnable activation functions, each acting on the pth component of the input and combining the results by summation, and Φ_q is another set of learnable activation functions, which performs a non-linear transformation on the combined intermediate result; the output is obtained by summing over the 2n + 1 terms.
Assuming that the input of the KAN is x = x_0, x_k is the output of the kth layer, and Φ_k is the activation function from the kth layer to the (k + 1)th layer, the output of the (k + 1)th layer can be expressed as follows:

x_{k+1} = Φ_k(x_k)
The GRBF is a typical local response function, exhibiting high sensitivity to minute variations in the input space near the centre point. As minute defects often manifest as slight anomalies in grayscale, texture, or edge information within local image regions, this function effectively amplifies such subtle differences, enhancing sensitivity for detecting minute defects. Moreover, the GRBF is continuously differentiable, exhibiting smooth and stable output variations near defect edges or central regions. This facilitates the precise localisation of defect boundaries. Using the GRBF to approximate the B-spline function, each channel feature is mapped in detail through the learnable GRBF. Compared with traditional activation functions, it can better capture the non-linear boundary features of subtle defects. The GRBF expression can be written as follows:
φ(x) = exp(−‖x − c‖^2 / (2σ^2))

where x is the input vector, c is the centre of the basis function, σ is the shape parameter used to control the width of the function, and ‖x − c‖ is the Euclidean distance between x and c. If W_k is defined as the weight of the kth layer, the Gaussian radial basis activation function joining the KAN can be defined as follows:

x_{k+1} = W_k [φ_1(x_k), φ_2(x_k), …, φ_K(x_k)]^T

where φ_j denotes the GRBF with the jth learnable centre c_j.
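The GRBF-based learnable activation can be sketched as below. This is an illustrative fragment only: in the actual network the centres, widths and weights are learnable parameters, whereas here they are fixed hypothetical values.

```python
import math

# Sketch of a GRBF-based learnable activation as used in the G-KAN
# block: a univariate activation formed as a weighted sum of K Gaussian
# radial basis functions. Centres c_k, widths sigma_k, and weights w_k
# are illustrative stand-ins for learnable parameters.

def grbf(x, c, sigma):
    """Gaussian radial basis response exp(-(x - c)^2 / (2 sigma^2))."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def kan_activation(x, centres, sigmas, weights):
    """Learnable activation: weighted sum of K GRBF responses."""
    return sum(w * grbf(x, c, s)
               for w, c, s in zip(weights, centres, sigmas))

centres = [-1.0, 0.0, 1.0]   # K = 3 centres (illustrative)
sigmas  = [0.5, 0.5, 0.5]
weights = [0.3, 1.0, -0.2]

print(grbf(0.0, 0.0, 0.5))   # peak response at the centre → 1.0
print(kan_activation(0.0, centres, sigmas, weights))
```

The local response is what the text emphasises: the function peaks at its centre and decays smoothly, so small deviations near a centre produce large, differentiable changes in the output.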
The segmentation sub-network employs multiple G-KAN blocks to construct a U-Net architecture with symmetric skip connections. The encoder progressively extracts high-level semantic features of defects through operations such as convolution and downsampling, while simultaneously reducing spatial resolution and expanding the receptive field. The decoder then gradually restores the spatial dimensions of the feature maps through upsampling and convolution, mapping the abstract semantic features extracted by the encoder back to the original image resolution to generate pixel-level defect segmentation feature maps. Skip connections transmit feature maps from corresponding encoder layers to corresponding decoder layers, enabling the fusion of high-resolution detail features from shallow encoder layers with semantic features from deep decoder layers. This compensates for the spatial information loss caused by downsampling. The overall architecture of the segmentation sub-network is shown in
Figure 4.
The specific steps for the G-KAN block are as follows:
Step 1 To capture complex, non-linear patterns in the input features, we introduce the GRBF as the learnable activation function for the KAN, which maps the input features in a non-linear manner, transforming them from the original channel space to a high-dimensional feature space defined by learnable centre points.
Specifically, for the input feature map X ∈ R^{B×C×H×W}, where B denotes the batch size, C denotes the number of input channels, and W and H denote the width and height of the feature map, respectively, we independently apply GRBF(x) with K learnable centres c and shape parameters σ to each input channel to calculate its response values, resulting in a K-dimensional feature vector. Subsequently, this transformation is applied at all spatial positions of the original C channels, and the C K-dimensional vectors corresponding to each position are concatenated along the channel dimension to form a new (K × C)-dimensional channel feature X′; the input is thus transformed from X ∈ R^{B×C×H×W} to X′ ∈ R^{B×(K·C)×H×W}.
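The channel expansion of Step 1 can be sketched as follows, a toy Python illustration (batch dimension omitted) in which a (C, H, W) feature map becomes a (K·C, H, W) map; the centres and width are hypothetical stand-ins for the learnable parameters.

```python
import math

# Sketch of the Step 1 channel expansion: each of the C input channels
# is mapped through K GRBF centres, giving a (K*C)-channel feature map.
# The (C, H, W) feature map is represented as nested lists.

def grbf(x, c, sigma):
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def expand_channels(fmap, centres, sigma=0.5):
    """Map (C, H, W) -> (K*C, H, W) via per-channel GRBF responses."""
    out = []
    for channel in fmap:                      # C input channels
        for c in centres:                     # K centres per channel
            out.append([[grbf(v, c, sigma) for v in row]
                        for row in channel])
    return out

C, H, W = 2, 3, 3
fmap = [[[0.1 * (i + j + ch) for j in range(W)] for i in range(H)]
        for ch in range(C)]
centres = [0.0, 0.5, 1.0]                     # K = 3

expanded = expand_channels(fmap, centres)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))  # → 6 3 3
```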
Step 2 Process X′ using a SplineConv layer, whose smooth feature distribution is adapted to the GRBF, to promote non-linear feature combination. The parallel standard convolution path retains local correlations, and the fusion of the two enhances feature diversity. This processing can be described as follows:

Y_s = W_s ∗ X′ + b_s,  Y_c = W_c ∗ X + b_c

where W_s and b_s represent the weights and biases of the SplineConv layer, respectively, and W_c and b_c represent the weights and biases of the standard convolution layer, respectively, with ∗ denoting the convolution operation.
Step 3 Finally, the output paths of the SplineConv layer and the standard convolution layer are fused via element-wise addition to generate the output feature map. This process can be described as follows:

Y = Y_s + Y_c

where Y_s and Y_c denote the outputs of the SplineConv path and the standard convolution path, respectively, and Y is the output feature map of the G-KAN block.
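The two-path fusion of Steps 2-3 can be sketched as below. This is a deliberately reduced illustration: both convolutions are replaced by 1×1 scalar stand-ins on single-channel toy maps purely to show the parallel paths and the element-wise addition, and it does not reproduce the actual SplineConv layer.

```python
# Sketch of Steps 2-3: a SplineConv-style path on the GRBF-expanded
# features X' and a standard convolution path on X, fused by
# element-wise addition. Both convolutions are reduced to 1x1 weights
# on single-channel toy maps; weights and inputs are illustrative.

def conv1x1(fmap, w, b):
    """1x1 convolution stand-in: scale every pixel and add a bias."""
    return [[w * v + b for v in row] for row in fmap]

def fuse(a, b):
    """Element-wise addition of the two path outputs (Step 3)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X       = [[0.2, 0.4], [0.6, 0.8]]   # original features
X_prime = [[0.9, 0.7], [0.5, 0.3]]   # GRBF-expanded features (toy)

Ys = conv1x1(X_prime, w=0.5, b=0.1)  # SplineConv-style path
Yc = conv1x1(X,       w=1.0, b=0.0)  # standard convolution path
Y  = fuse(Ys, Yc)
print(round(Y[0][0], 6))  # 0.5*0.9 + 0.1 + 0.2 → 0.75
```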
3.5. Definition of the Decision Threshold
This paper employs the '3σ criterion' to determine the decision threshold, which may be expressed as follows:

Threshold = μ + k σ
where μ denotes the mean of the image, σ represents the standard deviation of the image, k is the coefficient of the standard deviation σ, and Threshold is the threshold value. Assuming that the segmentation result map after model segmentation is S = {s_1, s_2, …, s_N}, its mean μ and standard deviation σ can be expressed as follows:

μ = (1/N) Σ_{i=1}^{N} s_i,  σ = sqrt( (1/N) Σ_{i=1}^{N} (s_i − μ)^2 )
where N denotes the total number of pixels across all images involved in the calculation. Consequently, the threshold determination formula can be transformed as follows:

Threshold = (1/N) Σ_{i=1}^{N} s_i + k · sqrt( (1/N) Σ_{i=1}^{N} (s_i − μ)^2 )
Through experimentation, the value of k adopted in this paper is 3.
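The decision rule can be sketched in a few lines of Python; the score values are illustrative, and the helper name is our own.

```python
import math

# Sketch of the 3-sigma decision rule: the threshold is mu + k*sigma
# computed over the segmentation score map, and pixels whose score
# exceeds it are flagged as defective.

def three_sigma_threshold(scores, k=3):
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return mu + k * sigma

scores = [0.0] * 99 + [10.0]          # one strong defect response
thr = three_sigma_threshold(scores)
defective = [s > thr for s in scores]
print(sum(defective))  # only the outlier pixel exceeds mu + 3*sigma → 1
```

Because the threshold adapts to the statistics of each score map, a single strong outlier is flagged while uniformly low background scores are not.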