Buckle Pose Estimation Using a Generative Adversarial Network

Abstract: The buckle that secures a lens before coating is still typically disassembled manually. The difference between a buckle and its background is small, while the differences among buckles are large, and mechanical disassembly can also damage the lens. Therefore, it is important to estimate the buckle pose with high accuracy. This paper proposes a buckle pose estimation method based on a generative adversarial network. An edge extraction model based on a segmentation network is designed as the generator. Spatial attention is added to the discriminator to help it better distinguish between generated and real images. The generator thus produces delicate external contours and center edge lines with the help of the discriminator. The circumscribed rectangle and least squares methods are used to determine the center position and deflection angle of the buckle, respectively. The center point and angle accuracies on the test datasets are 99.5% and 99.3%, respectively. The pixel error of the center point distance and the absolute error of the angle to the horizontal line are within 7.36 pixels and 1.98°, respectively. This method achieves the highest center point and angle accuracies compared with Hed, RCF, DexiNed, and PidiNet. It can meet practical requirements and boost the production efficiency of lens coating.


Introduction
Intelligent manufacturing has transformed the manufacturing industry. Although it improves production efficiency and product quality, it also faces new challenges in the automated production of mechanical parts. In industrial production, computer vision technology aims to process collected images into the information needed for industrial production, such as the pose of materials [1] and the presence of defects [2], through algorithmic processing. This information is converted into instruction codes and transmitted to industrial robots that perform operations such as grasping and disassembly. Newly manufactured lenses must first be removed from their plastic plates, and their encapsulating buckles must be disassembled before they are placed on an aluminum plate for coating. Manual buckle disassembly is laborious and constrained in time, efficiency, and quality. Therefore, to automate the coating process, it is necessary to design an accurate real-time buckle pose estimation algorithm that assists robots in automatically disassembling the buckle.
To address this need, this paper proposes a model that identifies the pose of a buckle on a plastic plate by obtaining its center position and deflection angle relative to the horizontal plane. To achieve this task, three key requirements must be met: (1) The model must be generalizable enough to adaptively handle significant buckle shape differences. Figure 1 presents three common buckle types (buckles are inside the red box).
(2) The model must distinguish the buckle from its background, as the differences between the two are small. Figure 1 illustrates this scenario.
(3) The accuracy of buckle pose estimation must be high enough to reduce the loss caused by the clumsy disassembly of the machine.
To meet these requirements, this paper proposes a generative adversarial network (GAN)-based [3] buckle pose estimation algorithm. A generator applies an edge extraction network to classify each pixel in the image, and the outer contours and center edge lines of buckles are regressed to estimate their center position and deflection angle. A discriminator-assisted generator is applied during training for edge refinement. First, the encoder is used for feature extraction and the decoder for feature reconstruction. A dilated convolution [4] is then performed so that the network can obtain information about the larger receptive field while paying attention to the overall target characteristics. Thus, the network can regress the overall outline information of the target to determine its pose, meeting Requirements (1) and (2). A discriminator is introduced during training to assist the generator. Spatial attention is added to the discriminator, causing it to pay more attention to the difference between the generated image and its edge in the ground-truth image. This discriminator assistance increases the attention paid to detailed edge information and improves the accuracy to meet Requirement (3).

Related Work
In recent years, an increasing number of visual detection algorithms have been developed [5][6][7]; they are mainly divided into traditional and deep learning algorithms.

Traditional Detection Approaches
Traditional methods require manually designed features tailored to specific conditions, together with classifiers or template matching methods that detect targets using these features. The algorithms enumerate all possible targets in input images by combining several classical methods with parametric fine-tuning. Examples include cascade classification [8], sparse Fourier transformation [9], histogram of oriented gradients for edge shape description [10], deformable part detection based on components [11], and Haar wavelet characterization with support vector machine (SVM) prediction [12]. Yu et al. [13] proposed a crack extraction method that used multiscale morphological operations for connector crack detection; however, it was susceptible to changes in background brightness. Zhi et al. [14] used a double-threshold method and the gear involute geometric relationship to determine the tooth pitch in a local image of a gear by reversely mapping the filtered pixel information to the base circle and calculating the phase angle; however, this measurement method was susceptible to data diversity problems. Meanwhile, Zhang et al. [15] proposed a method to detect train center plate bolts: Gabor wavelets of different scales were applied to the image, the weight of each channel was optimized by a genetic algorithm, and the weighted features were classified using SVM. However, the reliance on manually designed features depended too much on human subjectivity, and the method lacked generalizability.

Owing to the characteristics of large background interference and diverse buckle shapes, traditional methods require the characteristics of each buckle to be analyzed individually. Moreover, their recognition accuracy depends largely on the feature generation process, which limits their applicability to industrial applications. Therefore, this study compares traditional binarization and segmentation methods. Binarization methods can be divided into local and global categories: global methods use only one threshold over the entire image, whereas local methods use multiple regional thresholds. We considered two representative methods, those developed by Otsu [16] and Wellner [17], for global and local thresholds, respectively. Furthermore, we used the watershed segmentation algorithm [18] to segment images into disjoint regions that are related in terms of certain attributes. The results of these three methods are shown in Figure 2.
The results shown in Figure 2 indicate that traditional methods perform poorly on these data. Otsu's method regresses a rough outline of the target, but it is very difficult to find the target position and estimate the pose from it. The watershed method eliminates the influence of the background, but it is highly sensitive to noise; hence, much of the valuable target information is lost. Wellner's method treats parts of the background as foreground, resulting in a complete loss of the target information.
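For reference, Otsu's global threshold used in this comparison can be sketched in a few lines of numpy; this is an illustrative reimplementation, not the exact code used in the experiments:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Global threshold maximizing the between-class variance (Otsu's method)
    for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    mu_total = float(np.dot(np.arange(256), prob))
    best_t, best_var = 0, -1.0
    w0 = 0.0   # cumulative probability of the background class
    mu0 = 0.0  # cumulative first moment of the background class
    for t in range(256):
        w0 += prob[t]
        mu0 += t * prob[t]
        if w0 < 1e-12 or 1.0 - w0 < 1e-12:
            continue  # one class is empty; variance undefined
        m0 = mu0 / w0
        m1 = (mu_total - mu0) / (1.0 - w0)
        var_between = w0 * (1.0 - w0) * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

A single global threshold of this kind is exactly what fails on buckle images, because foreground and background gray levels overlap heavily.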

Deep Learning Detection Approaches
Deep learning [19] algorithms automatically learn the targets to be detected while avoiding the influence of human subjectivity and other noise factors; thus, they provide strong feature generalization. Nonlinear combinations are used to build convolutional neural network (CNN) architectures [20][21][22], and their capacities are controlled by varying the breadth and depth of features so that they can make strong and correct assumptions about the nature of an image [23]. Ge et al. [24] proposed a recognition method using two-dimensional (2D) instance segmentation and three-dimensional feature consistency pairing to assist in automatic workpiece painting. They used a mask region-based CNN (R-CNN) [25] to combine the fast segmentation and recognition of 2D workpieces with the strong discrimination of local details, using fast point feature histogram point cloud features to accurately distinguish dissimilar multi-view components and coarse-to-fine parts. Li et al. [26] proposed a method based on an improved "You Only Look Once" (YOLO) Version 3 (YOLOv3) real-time object detection algorithm [27] to identify workshop workpieces, wherein a depthwise separable convolution was used to improve the performance of the darknet backbone. However, its implementation was limited in its high-precision measurement capability, and it depended considerably on the accuracy of the regression detection framework. Li et al. [28] embedded an improved squeeze-and-excitation network (SENet) [29] module into a 50-layer convolutional residual network (ResNet50) [30] backbone and adopted a feature pyramid structure [31] to fuse dimensional features, which significantly improved the detection of vehicle-bottom parts. However, because an improved SENet module was added to each residual block, the complexity of the network increased and the detection speed decreased. Faster R-CNN [32], YOLO [33], and SSD [34] are representative deep learning object detection algorithms. These algorithms focus on classifying the target; for its specific location, they only provide a rough candidate box. However, this study aims to identify the pose of the buckle, which requires determining the deflection angle, and such detection algorithms cannot estimate the rotation angle of the target.


Methods
This paper proposes a high-precision buckle pose estimation algorithm based on a GAN that addresses the problems of large intra-class and small inter-class data differences. Figure 3 illustrates the proposed architecture, which consists of three components: a feature extraction module (encoder), a feature reconstruction module (decoder), and a refinement edge module (discriminator). A batch of standard-sized original images and corresponding ground-truth labels are input into the network, and feature extraction is performed through convolution, pooling, and residual block operations in the encoder. The dilated convolution operation expands the receptive field of the network without increasing its depth, which is conducive to retaining spatial information. Thus, the network pays more attention to the overall characteristics of the target area. The decoder is used for feature reconstruction, and skip connections are introduced in each layer of the encoder and decoder to combine different feature levels and compensate for the loss incurred by max-pooling. After obtaining the results of the feature reconstruction module, the channel is adjusted using a 1 × 1 convolution to obtain the final segmentation image. The model learns the effective features of each image in the training sample through forward propagation and calculates the loss by comparison with the ground-truth label. A back-propagation algorithm is used to minimize the generation loss and optimize the network. During training, the generated prediction and ground-truth label images are input into the discriminator. Feature extraction is carried out through the convolution block, and feature screening is performed by the spatial attention module, which makes the discriminator pay more attention to edge information, enhances its identification of generated labels, and assists the generator in refining the edges.


Feature Extraction Module
CNNs [35] are widely used to extract the features of objects and are trained by stacking multiple convolution kernels and pooling layers. Our feature extraction network uses a ResNet50 backbone to extract the main features. As shown in Figure 4, the network consists of three blocks: a low layer that contains low-level features, a middle layer containing transition features, and a deep layer containing high-level features. First, a convolutional layer with a convolution block and two residual blocks is used to obtain the most original low-layer features. The middle-layer features are used to obtain contour and edge information. Finally, two convolution blocks and eight residual blocks are used to obtain deep semantic features.

An appropriate receptive field of the deep features is key to the segmentation task. The receptive field size of each layer is calculated using Equation (1):

R(i, j − 1) = (R(i, j) − 1) × s + k, (1)

which iterates from the uppermost to the lowest layer, where R(i, j) represents the local receptive field from the ith to the jth layer, s represents the stride, and k represents the size of the convolution kernel. The resolution of the training image is 1024 × 768, and the width of the target buckle accounts for ~40% of the entire image, which is approximately 400 pixels. The ideal receptive field size therefore cannot be lower than 400 pixels. However, the maximum receptive field of the feature extraction network is 267 pixels, so the network cannot focus on the overall characteristics of the target after obtaining the deep features; using a pooling layer to enlarge the receptive field would lose part of the information. Therefore, dilated convolutions are added after the deep features to increase the receptive field of the network (see Table 1). The receptive field of the final network is 427 pixels. To calculate the receptive field of a dilated convolution, its equivalent ordinary convolution is first obtained, and the receptive field is recalculated using the equivalent kernel. The conversion between the two is shown in Equation (2):

k′ = k + (k − 1) × (d − 1), (2)

where k′ represents the equivalent convolution kernel size, k the original kernel size, and d the dilation ratio. In this study, dilated convolutions with dilation ratios of three and two and a kernel size of three correspond to equivalent convolution kernels of sizes seven and five, respectively.
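The recurrence in Equation (1) and the equivalent-kernel conversion in Equation (2) are easy to check numerically. The following sketch uses an illustrative layer list, not the full ResNet50 configuration, and reproduces the equivalent kernel sizes of seven and five quoted above:

```python
def equivalent_kernel(k: int, d: int) -> int:
    """Equivalent ordinary kernel size of a dilated convolution, Equation (2)."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers) -> int:
    """Iterate Equation (1) from the deepest layer back to the input.
    `layers` lists (kernel_size, stride, dilation) in forward order;
    each dilated kernel is first replaced by its equivalent kernel."""
    rf = 1  # receptive field of a single output pixel
    for k, s, d in reversed(layers):
        rf = (rf - 1) * s + equivalent_kernel(k, d)
    return rf
```

For example, a stride-2 3 × 3 convolution followed by a stride-1 3 × 3 convolution yields a receptive field of 7 pixels.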

Feature Reconstruction Module
After obtaining the overall features of the target using dilated convolutions, the features are reconstructed by the decoder. The feature extraction network obtains three scale features that are recorded as deep features, F1; middle-layer features, F2; and low-layer features, F3. As shown in Figure 5, the deep features are upsampled to each scale using deconvolutions, and the original scale features are fused by skip connections to obtain the enhanced feature, fi (i = 1, 2), for each scale. The final segmentation image, R, is obtained by adjusting the channel dimension of the feature map through a 1 × 1 convolution, as shown in Equations (3)-(5).

Buckle pose estimation must regress the outer contour and the center edge line of the target. In each training sample, the positive sample pixels consist of the label positions, whereas all other positions are negative sample pixels. Therefore, the outer contour and the central edge line account for a small proportion of the overall pixels, which causes a serious imbalance between the positive and negative samples. We apply focal loss (FL) as the loss function of the edge extraction network. FL balances the weights between positive and negative samples, including those of easy and difficult examples, as shown in Equation (6).

FL(p, y) = −α(1 − p)^γ log(p), if y = 1; −(1 − α)p^γ log(1 − p), if y = 0, (6)
where p is the predicted value of each pixel, y is the true label of each pixel, α is the weight coefficient for controlling positive and negative samples, and γ is the weight coefficient for controlling difficult samples.
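A per-pixel sketch of the focal loss in Equation (6) is given below; the values α = 0.25 and γ = 2 are the common defaults from the focal loss literature, not values reported in this paper:

```python
import math

def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Per-pixel focal loss, Equation (6): the (1 - p)^gamma factor
    down-weights easy examples, and alpha re-balances positive vs.
    negative pixels."""
    eps = 1e-12  # numerical guard for log(0)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p + eps)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p + eps)
```

Confident correct predictions contribute almost nothing, so the sparse edge pixels dominate the gradient despite the class imbalance.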

Refinement Edge Module
For high-precision buckle pose estimation, it is necessary to regress the outer contour and center edge line closer to the ground-truth label. The feature reconstruction module returns the approximate pose of the target; however, some data samples yield rough contour lines that reduce accuracy. Therefore, it is necessary to refine the regression results by adding a discriminator to assist the generator during training. The discriminator is shown in Figure 6, which follows the concept of adversarial segmentation. The segmentation model, G, and the adversarial discriminator, D, play a minimax game: G aims to generate a label image that fools D, and D aims to distinguish the prediction of G from the ground truth. The discriminator thus judges the quality of the edges generated by the edge extraction network, and the edge extraction network learns to generate better edges to deceive the discriminator, accomplishing edge refinement. The mixture of segmentation and discrimination losses is expressed as the sum of two terms, where the first term represents the original segmentation loss function of Equation (6), and the latter term is the loss of discriminator D, where z is a binary number indicating whether the input data contain the predicted (0) or ground-truth image (1). Thus, the output of D is the probability that the input matches the predicted or ground-truth image.

For the discriminator, D, convolutional block stacking is used to extract features, and a spatial attention mechanism is added after the deep features so that the network focuses more on the generated contour label values. Average pooling encodes global statistics, while max pooling encodes the salient parts. As in the attention module proposed in [36], the feature map, S, is first obtained by stacking convolutional blocks. The average and maximum values over the channels of S are then calculated to obtain the average and maximum feature maps. The two maps are channel-fused, and the channel is adjusted by a 1 × 1 convolution to obtain the attention map, M, which is multiplied by the feature map S pixel-by-pixel. The final attention feature map, SA, is expressed as

SA = ϕ(f1×1([Ga(S); Gm(S)])) ⊗ S,

where ϕ represents the sigmoid activation function, ⊗ represents pixel-by-pixel multiplication, f1×1 is the 1 × 1 convolution, and Ga and Gm are the global average and maximum values over all channels of the feature map, respectively.
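The spatial attention computation can be sketched in numpy as follows; the 1 × 1 convolution is reduced to a two-element weight vector for illustration, and the weights shown are placeholders rather than learned values:

```python
import numpy as np

def spatial_attention(S: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Spatial attention over a feature map S of shape (C, H, W).
    Channel-wise average (Ga) and maximum (Gm) maps are fused, passed through
    a 1 x 1 convolution (reduced here to the 2-element weight vector `w`) and
    a sigmoid, giving the attention map M that re-weights S pixel-by-pixel."""
    Ga = S.mean(axis=0)                # (H, W): global average over channels
    Gm = S.max(axis=0)                 # (H, W): channel-wise maximum
    fused = w[0] * Ga + w[1] * Gm + b  # 1 x 1 conv on the 2-channel fusion
    M = 1.0 / (1.0 + np.exp(-fused))   # sigmoid activation (phi)
    return S * M                       # SA = M multiplied into S, broadcast
```

The attention map M has a single channel, so the multiplication broadcasts the same spatial weighting across all feature channels.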

Datasets
The datasets used in this study were taken from real industrial production scenarios, and their resolutions were normalized to 1024 × 768. A total of 1572 real images were used: 500 for training, 200 for validation, and 872 for testing.

Training Label
To meet the high-precision buckle pose estimation requirement, the training label was determined by analyzing the data of the outer contour and center edge line of the target buckle, as shown in Figure 7. The data were labeled using the labelme annotation tool, which mainly marks the outer contour and the center of the target. After model inference, the center point of the minimum circumscribed rectangle was considered the center position, and the least squares fitting slope of the center edge line was used as the deflection angle relative to the horizontal. Two outer contour label styles were produced: edge labels and mask labels. Based on the experimental comparison below, the edge label was superior to the mask label.
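Assuming the regressed contour and center-line pixels are available as point arrays, the pose readout described above can be sketched as follows; for simplicity this uses the axis-aligned circumscribed rectangle rather than the minimum-area rectangle, and ignores image-coordinate sign conventions:

```python
import numpy as np

def buckle_pose(contour: np.ndarray, centerline: np.ndarray):
    """Read out the pose from regressed edge pixels.
    `contour`: (N, 2) outer-contour (x, y) points; the center is taken as the
    center of the circumscribed rectangle (axis-aligned here for simplicity).
    `centerline`: (M, 2) center-edge-line points; the least-squares slope of
    the fitted line gives the deflection angle to the horizontal, in degrees."""
    x_min, y_min = contour.min(axis=0)
    x_max, y_max = contour.max(axis=0)
    center = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    slope, _ = np.polyfit(centerline[:, 0], centerline[:, 1], 1)
    angle = float(np.degrees(np.arctan(slope)))
    return center, angle
```

In practice a rotated minimum-area rectangle (e.g., as provided by OpenCV) would be used so that tilted buckles do not bias the center estimate.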

Experimental Environment
In this study, we used an NVIDIA 2080Ti graphics card with 11 GB of memory and the PyTorch framework in an Ubuntu system environment. The Adam optimizer was used for generator training, with a learning rate of 0.001 for 40 training epochs. The RMSprop optimizer was used for the discriminator; to make its learning process slower than that of the generator, its learning rate was set to half that of the generator (0.0005). During the first five epochs, only the generator was trained; once it reached a certain level of generation ability, the discriminator was trained as well.
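The training schedule described above can be summarized schematically; the real implementation would attach these settings to PyTorch optimizers, and this sketch only encodes the epoch gating and learning-rate ratio:

```python
def training_plan(num_epochs: int = 40, g_lr: float = 1e-3, warmup: int = 5):
    """Per-epoch schedule: the generator (Adam, lr = 0.001) trains from the
    start; the discriminator (RMSprop, lr = 0.0005, half the generator's)
    joins only after the warm-up epochs so that it learns more slowly."""
    return [
        {
            "epoch": epoch,
            "train_generator": True,
            "train_discriminator": epoch >= warmup,
            "g_lr": g_lr,
            "d_lr": g_lr / 2.0,
        }
        for epoch in range(num_epochs)
    ]
```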

Evaluation Methodology
To evaluate the performance of our algorithm, edge detection metrics were used to measure the quality of the results, and absolute error distances were used as the evaluation metrics for the final result. For the edge extraction network, the optimal dataset scale (ODS), optimal image scale (OIS), average precision (AP), and R50 measures were used, where the F-measure is the harmonic mean of precision (P) and recall (R). ODS represents the global optimal threshold, which is fixed across all images to maximize the overall F-measure. OIS represents the optimal threshold of each image and is used to maximize the F-measure of that image. AP is the integral of the P/R curve, and R50 is the recall rate at 50% precision. For edge detection, a distance tolerance parameter determines whether predicted boundary pixels are correctly predicted, allowing small positioning errors between the predicted and ground-truth boundaries. The distance tolerance was obtained by multiplying the width and height of the image by maxDist (0.0075). The formulae for P and R are shown in Equations (10) and (11); the formulae for the F-measure and AP are shown in Equations (12) and (13).
P = sum(matchE) / nnz(E1), (10)

R = sum(matchG) / sum(allG), (11)

F = (1 + β²)·P·R / (β²·P + R), (12)

AP = ∫ P dR, (13)

where E1 is the binarized result of all edge images, matchE denotes the predicted edge points that match ground-truth points, allG denotes the ground-truth edge points, and matchG is the number of predicted edge points matched to the ground truth. nnz(·) counts the number of non-zero elements, and sum(·) sums the points. The F-measure can be tuned by adjusting β to weight the proportions of P and R; here β = 1.
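The metrics above can be sketched directly from their definitions. The variable names mirror the text; the diagonal-based tolerance scaling follows the common BSDS evaluation convention and is an assumption on our part.

```python
import math

# Sketch of the edge evaluation metrics described above.

def precision(match_e: int, nnz_e1: int) -> float:
    """P: matched predicted edge points over all predicted edge points."""
    return match_e / nnz_e1

def recall(match_g: int, all_g: int) -> float:
    """R: matched ground-truth edge points over all ground-truth points."""
    return match_g / all_g

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of P and R; beta = 1 weights them equally."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def distance_tolerance(width: int, height: int, max_dist: float = 0.0075) -> float:
    """Pixel tolerance for matching predicted and ground-truth edges
    (assumed to scale with the image diagonal, as in the BSDS toolbox)."""
    return max_dist * math.hypot(width, height)
```

With P = R, the F-measure reduces to their common value, so a balanced detector's F-score tracks its precision directly.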
The center point and angle predictions were evaluated using the Euclidean distance and the absolute error, respectively:

D = sqrt((x1 − x2)² + (y1 − y2)²),

Δα = |α − α0|,

where (x1, y1) and (x2, y2) represent the center point coordinates of the ground truth and the prediction, respectively, and α and α0 represent the deflection angles of the ground truth and the prediction, respectively.
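These two final-result metrics are straightforward; a minimal sketch (function names are ours):

```python
import math

# Euclidean distance between true and predicted centers, and absolute
# angle difference, as defined above.

def center_error(true_pt, pred_pt):
    """D = sqrt((x1 - x2)^2 + (y1 - y2)^2)."""
    (x1, y1), (x2, y2) = true_pt, pred_pt
    return math.hypot(x1 - x2, y1 - y2)

def angle_error(alpha, alpha0):
    """Absolute deflection-angle error in degrees."""
    return abs(alpha - alpha0)
```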

Experimental Results and Analysis
The two label styles were evaluated, and the compared state-of-the-art contour regression algorithms were Hed [37], RCF (richer convolutional features for edge detection) [38], DexiNed [39], and PidiNet [40]. First, the two label styles were compared with all training parameters held constant; only the label style was changed. As Figure 8 shows, the maximum center distance error of the edge label lies in the range of seven to eight pixels, whereas that of the mask label exceeds eight pixels. Because the disassembly machinery lacks fine compliance, the lens may be damaged when the error exceeds eight pixels.
We selected results from the mask and edge labels, as shown in Figure 9. The segmentation image generated with the mask label forces the network to classify surrounding background pixels as positive, because every pixel is classified during training. Owing to the small difference between the background and target classes, the mask label causes the network to learn uninformative pixel features, which introduces errors. With the edge label, although parts of the regressed outer contour were locally missing, the overall target shape was unaffected and the center position could still be estimated from the circumscribed rectangle. Based on this comparison, the edge label is superior to the mask label.

We verified the advantages of our contour regression method by comparing it with the other methods. As shown in Table 2, the ODS, AP, and R50 of our method are higher than those of the other methods, and its OIS is only 0.1% lower than the best method.
Although the overall indicators of all methods did not differ greatly, the other methods performed poorly on specific difficult samples. As shown in Figure 10, the background interference of some samples is significant and the features of the target areas are weak. The proposed method was superior to the other methods in both contour and line regression; the other methods produced incomplete or inaccurate contours and lines, leading to inaccurate results. To show the advantage of our method more clearly, we counted a sample estimate as correct when the center point pixel distance error was under seven pixels and the absolute angle error was below 1.5°, and recalculated accuracy on this basis. As shown in Table 3, the center point accuracy of the proposed method reached 99.5% and the angle estimation accuracy reached 99.3%, both higher than the other methods. Over all samples, the maximum center point pixel distance error was 7.36 pixels and the maximum absolute angle error was 1.98°. Hence, the results meet the three requirements stated in the introduction.
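The threshold-based accuracy recalculation above can be sketched as follows. The thresholds (7 pixels, 1.5°) are from the text; the sample error values are purely illustrative.

```python
# A sample counts as correct when its error falls below the threshold
# (7 px for center distance, 1.5 degrees for angle).

def accuracy(errors, threshold):
    """Fraction of samples whose error is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)

center_errors = [1.2, 3.4, 6.9, 7.36]  # pixels (illustrative values)
angle_errors = [0.3, 1.1, 1.4, 1.98]   # degrees (illustrative values)

center_acc = accuracy(center_errors, 7.0)  # 3 of 4 samples pass
angle_acc = accuracy(angle_errors, 1.5)    # 3 of 4 samples pass
```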

Ablative Study
To evaluate the impact of each module on overall performance, ablation experiments were conducted with the same training parameters and datasets for the dilated convolution and refinement edge modules. The refinement edge module was used only to assist segmentation network training and was not used in the inference phase. As shown in Table 4, after adding dilated convolutions to expand the network's receptive field, the accuracy of center point estimation increased from 93.2% to 98.7%, and the accuracy of angle estimation increased from 93.0% to 98.6%. With the refinement edge module of the GAN assisting generator training, the network refined the contours of the generated images further: center point estimation accuracy increased by another 0.8% and angle estimation accuracy by another 0.7%.
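The receptive field enlargement that drives the first ablation gain can be computed with the standard recurrence for stacked convolutions. The recurrence itself is standard; the layer stacks below are illustrative and not the paper's exact architecture (cf. Table 1).

```python
# How dilated convolutions enlarge the receptive field: each layer's
# effective kernel is d*(k-1)+1, and the field grows by (effective_k-1)
# times the accumulated stride ("jump").

def receptive_field(layers):
    """layers: iterable of (kernel, stride, dilation); returns the
    receptive field size of the final layer's output units."""
    rf, jump = 1, 1
    for k, s, d in layers:
        effective_k = d * (k - 1) + 1   # dilation spreads the kernel taps
        rf += (effective_k - 1) * jump
        jump *= s
    return rf

plain = receptive_field([(3, 1, 1)] * 3)    # three plain 3x3 convs
dilated = receptive_field([(3, 1, 2)] * 3)  # same stack with dilation 2
```

With dilation 2 the same three-layer stack nearly doubles its receptive field (7 → 13 pixels) at no extra parameter cost, which is why the paper uses dilated convolutions to capture the overall target shape.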

The improvements were most evident for samples with poor regression results from the edge extraction network alone: the regressed contours and lines became more refined and closer to the ground truth. As shown in Figure 11, when only the edge extraction network is used to regress the target, a double-contour phenomenon may occur, which degrades the regression results. Adding dilated convolutions to enlarge the receptive field reduces the double-contour and line regression artifacts to a certain extent, although the regressed contour remains rough. Finally, adding the adversarial training of the GAN, in which the discriminator judges the authenticity of the generated image, eliminates the double-contour phenomenon.
Appl. Sci. 2023, 13, x FOR PEER REVIEW

After obtaining the outer contour and the center edge line of the target with the proposed method, the minimum circumscribed rectangle of the outer contour is computed, and its center is taken as the center point of the target. The center edge line is fitted using the least squares method, and the resulting slope gives the rotation angle of the target relative to the horizontal. Some final results are shown in Figure 12.
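The final pose extraction step can be sketched as follows. For simplicity this uses an axis-aligned bounding box rather than the paper's minimum circumscribed (rotated) rectangle, and a hand-rolled least squares fit; both are stand-ins for what would, in practice, be OpenCV's `minAreaRect` and `fitLine` on the regressed contour and center edge line.

```python
import math

# Center from the bounding box of the outer contour, angle from a least
# squares fit of the center edge line (simplified stand-ins, see above).

def bbox_center(contour):
    """Center of the bounding box of the regressed outer contour."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def line_angle(points):
    """Least squares slope of the center edge line, in degrees."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    var = sum((x - mx) ** 2 for x, _ in points)
    return math.degrees(math.atan2(cov, var))  # slope = cov / var

square = [(0, 0), (4, 0), (4, 2), (0, 2)]
center = bbox_center(square)                   # midpoint of the box
angle = line_angle([(0, 0), (1, 1), (2, 2)])   # a 45-degree line
```

Using `atan2(cov, var)` instead of dividing first avoids a zero-division error when the fitted line is vertical.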

Conclusions
In this study, to meet the high-precision pose estimation requirements for buckles under strong background interference and complex data, a GAN-based buckle pose estimation algorithm was proposed. The algorithm uses dilated convolutions to make the network focus on the overall target characteristics, and a discriminator with a spatial attention module was added for edge refinement. This solves the problems of incomplete and low-accuracy regression of buckle contours and lines, improving the accuracy of the overall pose estimation. Finally, the circumscribed rectangle and least squares methods were used to estimate the center position and deflection angle of the target, respectively. The maximum center point distance error on the test set was 7.36 pixels, and the maximum absolute angle error was 1.98°. The inference speed of the code deployed on an industrial computer equipped with an NVIDIA 2080Ti was approximately 30 ms per frame, which meets real-time requirements and accelerates the production of lens coatings. For some samples (the first row in Figure 11), the regression results of our method were incomplete. In the future, we plan to use a more advanced network to optimize our pose estimation technique and to extend its generalizability to other industrial parts.

Figure 1. Different buckle types (buckles are inside the red box).


Figure 2. Experimental results of traditional lens buckle segmentation.


Figure 3. Generative adversarial network structure. The encoder is a feature extraction module, the decoder is a feature reconstruction module, and the discriminator serves as an auxiliary classifier that helps the generator generate more realistic labels. A batch of standard-sized original images and corresponding ground-truth labels is input into the network, and feature extraction is performed through convolution, pooling, and residual block operations in the encoder. The dilated convolution operation expands the receptive field.



Figure 4. Feature extraction module. Through different convolution blocks, the features of different layers are obtained: low-level features F3, mid-level features F2, and deep features F1.


Figure 5. Feature reconstruction module. The features are upsampled through deconvolution blocks and fused, via skip connections, with the features F1, F2, and F3 from the feature extraction module to obtain enhanced features.

Figure 6. Refinement edge module. The generated result map is input into the discriminator, and features are extracted through stacked convolution blocks. Spatial attention focuses the discriminator on the target area. Finally, the output probability is produced by the fully connected layer.


Figure 8. Statistical chart of center distance error for different labels.


Figure 9. Part of the results of different labels.


Figure 10. Results of different methods.

Figure 11. Comparison of the results of ablation experiments.


Figure 12. Final result image (the center point and the rotation angle are obtained from the minimum circumscribed rectangle of the outer contour and the least squares fit of the center edge line, respectively).

Table 1. Receptive field size of each layer of the feature extraction network. (k × k)·n denotes n convolution kernels of the same k × k size.

Table 2. Evaluation indexes of different methods.


Table 3. Accuracy and error values of different methods.

Table 4. Comparison of ablation experiments of different modules.
