EA-CTFVS: An Environment-Agnostic Coarse-to-Fine Visual Servoing Method for Sub-Millimeter-Accurate Assembly

Abstract: Peg-in-hole assembly, a crucial component of robotic automation in manufacturing, continues to pose challenges due to its strict tolerance requirements. To date, most conventional peg-in-hole assembly algorithms have been validated only within simulated environments or under limited observational scenarios. In this paper, an environment-agnostic coarse-to-fine visual servoing (EA-CTFVS) assembly algorithm is proposed. Firstly, to solve the frequent issue of visual blindness during visual servoing, a bottleneck pose is used as the desired pose for the visual servoing. Secondly, to achieve accurate assembly, a coarse-to-fine framework is constructed, in which the coarse controller provides a rough pose to remove large initial alignment errors, and a twin network-based fine controller then improves assembly accuracy. Furthermore, EA-CTFVS utilizes the Oriented Bounding Box (OBB) of objects as the input for visual servoing, which allows the system to operate effectively in diverse and complex scenes. The proposed EA-CTFVS achieves assembly success rates of 0.92/0.89 for initial alignment errors of 15/30 cm with a 0.6 mm tolerance in real-world D-sub plug assembly tasks under complex scenarios.


Introduction
With the continuous development of industrial automation, intelligent robot manipulation is increasingly replacing manual labor, resulting in a more efficient, intelligent, and safer mode of production. Despite the rapid developments in robot automation, achieving high-precision peg-in-hole assembly in unstructured scenarios, such as 3C assembly, remains challenging. For peg-in-hole assembly tasks that require sub-millimeter accuracy, even slight errors in pose estimation can result in task failure. Furthermore, variable unstructured scenes with the presence of distractor objects can significantly impact the accuracy and robustness of localization algorithms. Therefore, studying high-precision peg-in-hole assembly algorithms in unstructured real-world scenarios is crucial for advancing the industrial robotics industry.
The traditional peg-in-hole assembly method involves establishing a force model of the assembly process and manually designing the controller and assembly strategy. Nevertheless, this model-based approach has limited adaptability to diverse tasks and has difficulty accurately modeling the contact states of intricate components. Recently, visual servoing has garnered increasing interest from researchers in the field of peg-in-hole assembly because it does not require physical contact [1][2][3]. Haugaard et al. [2] and Triyonoputro et al. [4] used image inputs from multiple cameras to train neural networks on synthetic data, using visual servoing networks to achieve assembly with sub-millimeter accuracy. However, their tasks did not consider orientation requirements, and the methods may fail when there are large angular and alignment deviations between the hole and the manipulator's end. Lu et al. [3] designed a six-degrees-of-freedom (DoF) algorithm. However, it was only applied in a simulated scene, and the problem of visual blindness may arise in the real world. Yu et al. [5] used a twin network to compare the difference between the current and desired images, realizing sub-millimeter peg-in-hole assembly. However, it only focused on the final part of the visual servoing process and assumed that the two images were taken in the same scene without interference from background objects. Valassakis et al. [6] applied a segmentation algorithm to remove background information interference. However, they did not consider the high-precision requirements of the task.
Generally, current visual servoing peg-in-hole assembly algorithms suffer from the following issues: (1) visual blindness challenges during visual servoing, (2) inability to achieve sub-millimeter assembly precision with significant initial alignment errors in real-world scenarios, and (3) difficulties in handling complex and changeable unstructured scenes. To address the first challenge, the bottleneck pose was employed as the target pose for visual servoing. To address the second issue, a visual servoing pipeline that integrates open- and closed-loop control [7] strategies, progressing from coarse to fine adjustments, was introduced. In response to the third problem, an oriented bounding box (OBB) mask was utilized to mitigate background interference during visual servoing.
Specifically, the bottleneck pose represents a predefined relative pose between the robot's end-effector and the object to be assembled (e.g., when the robot's end-effector is in the bottleneck pose, the assembly can be completed by moving 12 cm downward along the Z-axis of the world frame). The bottleneck pose avoids the occlusion that can occur when the camera is close to the object. Moreover, an environment-agnostic coarse-to-fine visual servoing (EA-CTFVS) method is presented to address the large initial alignment errors and background interference. The proposed EA-CTFVS consists of two main components: a coarse positioning network based on oriented object detection and keypoint detection, which adopts open-loop control, and an end-to-end visual servoing fine controller based on a twin network, which adopts closed-loop control. An overview of EA-CTFVS is shown in Figure 1. In the coarse stage, oriented object detection is used to identify assembly objects, obtain category and size information, and estimate the rough pose of objects by detecting key points. The robot's end-effector then approaches the bottleneck pose through the coarse controller. In the fine controller stage, the OBB mask detected by the oriented object detection network is used to remove background information and achieve environment-agnostic performance. Additionally, the Siamese network is used to predict the pose differences between live and bottleneck images, ultimately guiding objects to the bottleneck pose with sub-millimeter precision. Unlike the method proposed by Lu et al. [3], sub-millimeter-precision assembly tasks were achieved using only consumer-grade depth cameras without the need for complex 3D point cloud calculations. Moreover, the proposed EA-CTFVS can handle complex assembly scenarios in the presence of interfering objects.
Experimental results show that the proposed EA-CTFVS can complete D-sub plug assembly tasks with sub-millimeter accuracy under large initial alignment errors. In addition, a series of experiments show that EA-CTFVS can cope with the assembly task in the presence of background interference. Ablation experiments highlight the effectiveness of the coarse-to-fine framework. In general, the main contributions of this paper are summarized as follows:
1. EA-CTFVS, a coarse-to-fine visual servoing network designed to accomplish peg-in-hole assembly tasks with sub-millimeter precision in unstructured real-world environments without any force-sensing mechanism, is introduced.
2. EA-CTFVS solves the problem of visual blindness existing in traditional visual servoing assembly by introducing the bottleneck pose as the desired pose.
3. EA-CTFVS demonstrates the ability to achieve precise and rapid completion of the peg-in-hole assembly task, even when confronted with large initial alignment errors.
4. EA-CTFVS can accomplish peg-in-hole assembly under complex background interference rather than being limited to a single observation scene.

Figure 1. Overview of EA-CTFVS (caption only partially preserved): the fine controller uses visual feedback to align the end-effector to the fine bottleneck pose with sub-millimeter accuracy, and the predefined trajectory is then repeated to complete the high-precision assembly.
The paper is organized as follows. In Section 2, the related work is briefly reviewed. The proposed EA-CTFVS is introduced in detail in Section 3. Experimental details and discussion are presented in Section 4. Finally, the paper is concluded in Section 5.

Related Work
This section reviews related work on peg-in-hole assembly tasks and on coarse-to-fine strategies for robotic manipulation.

Peg-in-Hole Assembly Task
Industrial robots are widely used in manufacturing, especially for assembling peg and hole parts. However, achieving better accuracy in searching and positioning for more complicated operations remains a challenge [8]. Some researchers use force feedback control for peg-in-hole assembly [9][10][11][12]. However, this method requires abundant contact between the manipulator's end and the object, so assembly safety is difficult to guarantee. Moreover, when the end of the manipulator has a large initial alignment error with the hole, the search time of force-feedback-based methods is too long.
Researchers have gradually favored visual sensors in peg-in-hole assembly tasks in recent years because of their contactless characteristics. Nigro et al. [13] use a Convolutional Neural Network to detect the hole location and a three-dimensional reconstruction method to determine the orientation of the hole. However, they only use an open-loop method to estimate the pose of the hole, which makes it difficult to meet high precision requirements because there is no subsequent fine-tuning process. Triyonoputro et al. [4] use synthetic data to train a learning-based visual servoing network to predict the position of the hole and approach the hole using iterative visual servoing. However, its working space is limited to three DoF. In [14], an ICP [15]-based open-loop control method is used to estimate the rough pose, after which control is handed over to a learning-based end-to-end visual servoing network for fine-tuning. The sim-to-real training policy achieves sub-millimeter accuracy, but the method works in a four-DoF space. In [3], a six-DoF peg-in-hole assembly algorithm based on 3D visual servoing was proposed. In the first stage, 3D key points are used to determine the initial pose, and in the second stage, 3D visual feedback is used to provide refinement. However, this method is only verified in a simulation environment and may be limited by some factors in real scenes, such as visual blindness. Yu et al. [5] design a visual servoing network based on Siamese networks and achieve sub-millimeter accuracy in real-world D-sub plug assembly tasks. However, they only focus on the final refinement stage of visual servoing, and their method is heavily influenced by background interference. Valassakis et al. [6] use segmentation networks to remove the influence of background information, but their work focuses on the generality of the task and does not meet sub-millimeter accuracy requirements. In contrast to the above work, in this paper, an environment-agnostic coarse-to-fine peg-in-hole assembly visual servoing network is proposed, which is robust to complex and variable environments while ensuring sub-millimeter accuracy and can cope with large initial alignment errors.

Coarse-to-Fine Strategy for Robotic Manipulation
The coarse-to-fine control strategy was proposed many years ago and has been applied in the field of robotic manipulation [16,17]. Combining a rough, model-based controller with a more granular, learning-based approach can significantly improve the search efficiency of the robot in the early stages of operation while demonstrating high accuracy in the final stages. Johns et al. [18] propose an imitation learning framework that uses a coarse controller for sequential pose estimation to reach the bottleneck position. Then, a fine controller based on behavioral cloning is adopted to complete the task. Valassakis et al. [6] propose a one-shot imitation learning paradigm, which reaches a bottleneck by visual servoing and then completes the operation task by repeating the demonstration. Paradis et al. [19] applied the coarse-to-fine control method to a surgical robot and used the coarse and fine controllers in cycles to complete the surgical operation. Valassakis et al. [14] use a model-based ICP algorithm as a coarse controller to move the manipulator's end to the bottleneck position and then employ an end-to-end controller to complete the insertion task in a simulation environment. Lu et al. [3] use key-point detection to determine the rough coordinate system of the object, drive the end-effector to reach a bottleneck pose, and then use 3D visual servoing for fine control to complete the peg-in-hole assembly. Keypoint detection-based methods are more convenient than model-based coarse controllers and can be easily migrated to different objects. In this paper, the rough pose of the object is determined using the key point detection method, the end-effector is driven to reach the bottleneck position with a slight error, and then the end-effector is accurately controlled to the bottleneck position using the end-to-end fine controller based on the twin network, with the assembly being completed according to prior knowledge. Unlike [3], sub-millimeter accuracy in the fine control phase is achieved using only 2D image information.

Method Overview
Inspired by [5,6], the proposed EA-CTFVS aims to learn a controller that precisely moves the end-effector to a particular pose relative to the object, called the bottleneck pose. From this pose, a sub-millimeter peg-in-hole assembly task can be completed by moving the end-effector vertically downward by 12 cm, thereby avoiding blindness problems in visual servoing. Specifically, the proposed EA-CTFVS mainly consists of two parts: (A) open-loop coarse control based on oriented object detection and keypoint prediction, and (B) environment-agnostic visual servoing fine control based on offset prediction, as shown in Figure 2.

Figure 2. The pipeline of EA-CTFVS. First, in the preparation phase, the end-effector is positioned at the bottleneck pose, and an image of the bottleneck is captured. An Oriented Bounding Box (OBB) is acquired through oriented object detection. Subsequently, an environment-independent bottleneck image is generated by employing the OBB as a mask. The bottleneck transformation matrix between the female D-sub plug and the end-effector is determined via key point detection. During the subsequent deployment phase, the pose of the female D-sub plug is determined through a combination of oriented object detection and key point detection. Then, the coarse controller guides the end-effector to approach the coarse bottleneck pose. Subsequently, an environment-agnostic live image is passed to the twin network visual servo controller together with the environment-agnostic bottleneck image to obtain the offset output. The fine controller employs the offset to move the end-effector to the fine bottleneck pose with sub-millimeter accuracy, ultimately executing a predefined trajectory to accomplish the assembly task.

Coarse Controller
A coarse controller was designed to guide the end-effector to the coarse bottleneck pose before further refinement. Oriented object detection is used to isolate the target from the cluttered scene as much as possible, and key point detection is used to determine the pose information needed by the open-loop controller. The details are as follows:
(1) Oriented object detection. In unstructured scenes, objects to be assembled are often not axis-aligned, have arbitrary orientations, and sit in cluttered surroundings. The coarse controller network is based on YOLO-based arbitrary-oriented detection (YOLOOD) [20], which has been shown to be effective for oriented object detection. Using YOLOOD, the coordinates of the four corner points of the OBB, $(x_i, y_i)$, $i = 1, 2, 3, 4$, are obtained, where $(x_i, y_i)$ represents the pixel coordinates of the $i$-th corner point. This allows the area of the object of interest to be extracted from the original image, enabling the subsequent visual servoing network to focus solely on the object and avoid environmental interference. YOLOOD also helps identify the class and size of the object of interest.
(2) Keypoint-based open-loop control. The pose information is represented by three key points $K = \{k_1, k_2, k_3\}$, where $k_1$ represents the three-dimensional coordinates $t \in \mathbb{R}^3$ of the center point of the hole. The orientations of the x- and y-axes are determined as $v_x = k_2 - k_1$ and $v_y = k_3 - k_1$, respectively, and the z-axis orientation follows from these two vectors. The rotation matrix $R \in SO(3)$ can be calculated from these key points, as shown in Algorithm 1. Finally, the end-effector can reach the desired pose $[R \mid t] \in SE(3)$ through inverse kinematics. Therefore, the key to the problem is obtaining the coordinates of the three key points. Different from [3], a 2D key point detection method combined with a depth camera is used to obtain the 3D coordinates of key points. The 2D detection method requires less computation and maintains higher accuracy.
(3) Keypoint prediction. Like YOLOOD, a key point detection branch based on the YOLOv5 detection network was added to acquire key point coordinates while detecting objects of interest. Although only three key points are needed for open-loop control to obtain pose information, the position characteristics of these three key points are not obvious in the image, which makes them unsuitable for manual annotation and network regression. Therefore, four corner points with more obvious features are selected as the target of network regression, and the final three key points are then obtained indirectly through the geometric relationship of the four corner points, as shown in Figure 3. Specifically, the 2D coordinates of the four corners of the object of interest are obtained first; the 2D pixel coordinates of the three key points can then be obtained from YOLOOD, and by combining them with the depth camera data, the 3D coordinates of these key points can be determined.
(4) Loss. For key-point regression, considering the accuracy of the task, Wing-Loss [21] is adopted to ensure that the network is sensitive to small errors, which is calculated by:

$$\mathrm{wing}(x) = \begin{cases} w \ln\left(1 + |x|/\epsilon\right), & |x| < w \\ |x| - C, & \text{otherwise} \end{cases}$$

where $x$ represents the difference between the predicted keypoint coordinates and the ground-truth keypoint coordinates, the non-negative $w$ sets the range of the nonlinear part to $(-w, w)$, $\epsilon$ limits the curvature of the nonlinear region, and $C = w - w \ln(1 + w/\epsilon)$ is a constant that smoothly links the piecewise-defined linear and nonlinear parts. The key-point regression loss $L_{keypoint}$ is defined on the key point vector $s = (x_h, y_h)$. The coarse controller network is trained with supervised learning to minimize a total loss that combines $L_{reg}$, $L_{obj}$, $L_{cls}$, $L_{ang}$, and $L_{keypoint}$, which denote the bounding box regression loss, confidence loss, object classification loss, angular classification loss, and key-point regression loss, respectively.
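As a minimal sketch of the Wing loss described above, the piecewise function can be written in a few lines of Python. The default values `w = 10.0` and `eps = 2.0` and the helper `keypoint_wing_loss` are illustrative assumptions: the text does not report the values used for training, and summing over coordinates is only one plausible way to aggregate the per-coordinate losses.

```python
import math


def wing_loss(x: float, w: float = 10.0, eps: float = 2.0) -> float:
    """Wing loss for one coordinate residual x (prediction minus ground truth).

    w and eps are illustrative defaults, not values taken from the paper.
    """
    # C makes the logarithmic and linear branches meet at |x| = w.
    c = w - w * math.log(1.0 + w / eps)
    ax = abs(x)
    if ax < w:
        # Nonlinear branch: large gradient for small residuals.
        return w * math.log(1.0 + ax / eps)
    # Linear branch: behaves like an L1 loss for large residuals.
    return ax - c


def keypoint_wing_loss(pred, target, w: float = 10.0, eps: float = 2.0) -> float:
    """Hypothetical aggregation: sum of Wing losses over flattened (x, y) values."""
    return sum(wing_loss(p - t, w, eps) for p, t in zip(pred, target))


if __name__ == "__main__":
    # Small residuals are penalised more steeply than an L1 loss would.
    print(wing_loss(0.5), wing_loss(20.0))
```

The logarithmic branch is what keeps the gradient large near zero, which matches the stated goal of staying sensitive to small key-point errors.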

Fine Controller
Inaccuracies in camera calibration, depth sensors, and keypoint estimation can lead to the failure of coarse controllers for high-precision control. Therefore, a fine-grained, end-to-end visual servoing controller is necessary. In addition, assembly is not completed in an identical scene every time, and the fine controller needs to be applicable to scenes with different interferences. Therefore, an environment-agnostic offset prediction visual servoing network is proposed. The proposed visual servoing network predicts the pose offset from the current end-effector pose to the desired pose. In addition, rather than taking the whole image as direct input, the proposed visual servoing network uses the object's OBB to remove the redundant background and takes the processed image as input to separate the object of interest from the cluttered scene. Therefore, the network focuses on the object of interest itself and is suitable for assembly scenes under different environmental conditions.
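The background-removal idea can be illustrated with a small, dependency-free sketch. It assumes that the OBB is given as four corner points ordered around the box perimeter, that the image is a nested list of pixel values, and that masked-out pixels are set to zero; the function names and the fill value are hypothetical and not taken from the paper.

```python
def point_in_convex_quad(px, py, corners):
    """Return True if (px, py) lies inside or on a convex quadrilateral.

    `corners` must list the four (x, y) corners in order around the perimeter.
    """
    sign = 0
    n = len(corners)
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        # Cross product tells on which side of edge i the point lies.
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross != 0:
            s = 1 if cross > 0 else -1
            if sign == 0:
                sign = s
            elif s != sign:
                return False  # Point is on different sides of two edges.
    return True


def apply_obb_mask(image, corners, fill=0):
    """Keep pixels inside the OBB and replace all other pixels with `fill`."""
    return [
        [pix if point_in_convex_quad(x, y, corners) else fill
         for x, pix in enumerate(row)]
        for y, row in enumerate(image)
    ]


if __name__ == "__main__":
    image = [[1] * 5 for _ in range(5)]
    obb = [(2, 0), (4, 2), (2, 4), (0, 2)]  # A box rotated by 45 degrees.
    for row in apply_obb_mask(image, obb):
        print(row)
```

In practice an image library would rasterize the polygon far more efficiently; the point of the sketch is only that everything outside the detected box is discarded before the image reaches the servoing network.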
(1) Environment-agnostic image input. Unlike traditional visual servoing networks [1,14], an OBB mask is used to remove complex backgrounds, and the processed image is used as input to make the visual servoing network suitable for different production environments. For each control time step $t$, an environment-agnostic live image $I^{obb}_t$ is obtained from $I_t$ and $B_t$, where $I_t$ represents the live image at time step $t$ and $B_t$ represents the OBB mask $(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$ detected by oriented object detection in the coarse control phase. Using $I^{obb}_t$ instead of $I_t$ as the input solves the problem that traditional visual servoing networks can only be applied to a single scene, making the algorithm robust to diverse scenes with different interferences. Similarly, the environment-agnostic bottleneck image $I^{obb}_{bot}$ is collected as the desired image for visual servoing.
(2) Six-DoF offset prediction. The pose of the end-effector is refined to achieve precise insertion by estimating the offset that represents the relative pose between the current pose and the bottleneck pose. Traditional visual servoing methods often restrict the problem to a limited number of DoF, such as three DoF [2,4] or four DoF [14]. Instead, the proposed visual servoing network predicts the six-DoF relative pose between the current pose and the bottleneck pose, including three-DoF translation offsets $\Delta t = (\Delta x, \Delta y, \Delta z)$ in 3D coordinates and three-DoF rotation offsets $\Delta r = (\Delta q_1, \Delta q_2, \Delta q_3, \Delta q_4)$ in quaternion representation. Quaternions are chosen instead of Euler angles as the rotation representation because quaternions have no singularities and are more convenient for network learning.
(3) Visual servoing with Siamese architecture. The fine controller moves the end-effector to align the live image with the bottleneck image. To do this, a Siamese CNN takes in the environment-agnostic live and bottleneck images and outputs a six-DoF offset that represents the relative pose between the current and bottleneck poses. Each branch of the Siamese network uses CaffeNet to extract features with shared weights. The feature maps are flattened, subtracted, and fed into five additional fully connected layers for the final output $\Delta x, \Delta y, \Delta z, \Delta q_1, \Delta q_2, \Delta q_3, \Delta q_4$. An overview of visual servoing with Siamese architecture is shown in Figure 4.
(4) Loss. The loss of the Siamese visual servoing network is composed of a translation error and a rotation error, each measured with a root mean square error loss function over $m$ translation components and $n$ rotation components, where $m = 3$ for the translation and $n = 4$ due to the quaternion representation of rotation. $\Delta t$ and $\Delta t'$ are the estimated and ground-truth translation values, respectively, in meters. $\Delta q$ and $\Delta q'$ are the estimated and ground-truth rotation values, respectively, in the form of normalized quaternions. $w = 0.99$ is a hyper-parameter that balances the magnitude of rotation loss and translation loss.
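The loss described above can be sketched as a weighted sum of two RMSE terms. The text gives $w = 0.99$ but the exact way it enters the loss is not preserved here, so the sketch below makes a loud assumption: $w$ weights the translation term and $1 - w$ weights the rotation term, which is a common choice when translations are expressed in meters and are numerically much smaller than quaternion components. The function names are hypothetical.

```python
import math


def rmse(pred, target):
    """Root mean square error between two equally long sequences."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def normalize_quaternion(q):
    """Scale a quaternion to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def offset_loss(dt_pred, dq_pred, dt_true, dq_true, w_trans=0.99, w_rot=0.01):
    """Weighted translation + rotation RMSE.

    ASSUMPTION: w_trans = w and w_rot = 1 - w with w = 0.99; the source does
    not state which term receives which weight.
    """
    loss_t = rmse(dt_pred, dt_true)  # m = 3 components, in meters
    loss_q = rmse(normalize_quaternion(dq_pred), normalize_quaternion(dq_true))  # n = 4
    return w_trans * loss_t + w_rot * loss_q


if __name__ == "__main__":
    identity = [1.0, 0.0, 0.0, 0.0]
    print(offset_loss([0.003, 0.0, 0.0], identity, [0.0, 0.0, 0.0], identity))
```

Swapping the two weights is a one-line change if the opposite convention turns out to be the intended one.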

Assembly Process
The core idea of the proposed EA-CTFVS is to align the live image with the bottleneck image to complete the assembly, so the bottleneck image must be obtained first.
First, manually guide the end-effector to completely insert the male D-sub plug into the female D-sub plug, and then move the end-effector to a bottleneck pose (for example, lift it 12 cm vertically) so that the female D-sub plug is visible and in the center of the camera's field of view. Then, the bottleneck image $I_{bot}$ can be obtained. By using the oriented object detection and key point detection method in Section 3.2, the bottleneck image without background information $I^{obb}_{bot}$ and the pose transformation matrix from the key point coordinate system to the camera coordinate system $[R_{bot} \mid t_{bot}] \in SE(3)$ can be obtained.
Next, change the positions of the female D-sub plug and the end-effector at will, but ensure that the object appears in the view of the camera. After moving, the male D-sub plug (end-effector) has a large alignment error with the female D-sub plug. Hence, the coarse controller is used to move the end-effector to the coarse bottleneck pose. The camera is used to capture the current image, and the oriented object detector of the coarse controller is used to detect the female D-sub plug and its key points. Then, with Algorithm 1, the current transformation matrix from the female D-sub plug to the end-effector $[R_{cur} \mid t_{cur}] \in SE(3)$ can be obtained. Combining the desired $[R_{bot} \mid t_{bot}] \in SE(3)$ matrix collected at the bottleneck pose and the hand-eye calibration results, the end-effector can be driven to the bottleneck with a minor alignment error.
Then, due to the high precision required for this assembly task, the fine controller is used for further fine-tuning. With the help of the Siamese network visual servoing Algorithm 2 mentioned in Section 3.3, the pose after the operation of the coarse controller is taken as the initial pose, the live image is continually aligned with the bottleneck image, and eventually the end-effector reaches the bottleneck position within a specific error tolerance range. Finally, the end-effector executes a predetermined trajectory to complete the assembly.
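The closed-loop refinement described above can be pictured as a simple iterate-until-converged loop. Algorithm 2 is not reproduced in this text, so the sketch below is a toy stand-in under stated assumptions: only translation is handled, the network is replaced by a hypothetical `predict_offset` callable, and the stopping tolerance, gain, and step limit are illustrative values.

```python
def servo_to_bottleneck(pose, predict_offset, tol=1e-4, max_steps=50, gain=1.0):
    """Iteratively apply predicted offsets until they fall below `tol`.

    pose           -- current translation [x, y, z] in meters (rotation omitted)
    predict_offset -- callable returning the estimated offset to the bottleneck
    Returns the final pose and the number of correction steps taken.
    """
    for step in range(max_steps):
        offset = predict_offset(pose)
        if max(abs(o) for o in offset) < tol:
            return pose, step  # Within tolerance: hand over to the insertion move.
        pose = [p + gain * o for p, o in zip(pose, offset)]
    return pose, max_steps


# Toy predictor: recovers only 80% of the true offset, mimicking an imperfect network.
TARGET = [0.0, 0.0, 0.12]


def toy_predictor(pose):
    return [0.8 * (t - p) for t, p in zip(TARGET, pose)]


if __name__ == "__main__":
    final_pose, steps = servo_to_bottleneck([0.004, -0.003, 0.13], toy_predictor)
    print(final_pose, steps)
```

Because each step re-observes the scene, an imperfect predictor still converges as long as it consistently points toward the bottleneck pose, which is the usual argument for closing the loop in the fine stage.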

Experiments

Experimental Setup
For the real-world sub-millimeter assembly experiments, the Aubo-i5 robotic arm (AUBO Intelligent Technology Co., Ltd., Beijing, China) with a 3D-printed end-effector mounted at the end of the arm was used. Male D-sub plugs of different shapes were attached to the end-effector, while the corresponding female D-sub plug was placed on the workbench using a 3D-printed base. The base was considered part of the D-sub plug, as it is typically secured by mechanical parts in a real production environment. To simulate the cluttered scene of an actual assembly, distractors were randomly placed around the female D-sub plug. An inexpensive RealSense D435-i camera (Intel (China) Co., Ltd., Shanghai, China) was installed on the end-effector to capture RGB and depth images. The camera's horizontal and vertical fields of view are approximately 87° and 58°, respectively. The experiment used a pixel resolution of 640 × 480. When at the bottleneck pose (12 cm above the object), the horizontal and vertical resolutions can be calculated to be approximately 0.356 mm/pixel and 0.277 mm/pixel, respectively. The initial position of the end-effector was approximately 35 cm above the table, and the task space was defined as the area where the object was at least partially visible in the image from the initial position of the end-effector, covering approximately 30 cm × 20 cm. The bottleneck position was set at 12 cm above the female D-sub plug. Several experiments were conducted with different bottleneck pose distances. The results show that the closer the camera is to the female connector, the better the performance of the fine controller algorithm. It is hypothesized that a closer camera increases the proportion of the female connector in the camera's field of view, providing more information for the algorithm. However, tests indicate that 12 cm is a critical distance; when the camera is closer than 12 cm, the depth camera fails, outputting a depth value of 0, causing the coarse controller to malfunction. Therefore, 12 cm was ultimately chosen as the bottleneck pose distance.
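The quoted per-pixel resolution can be checked with a short pinhole-camera calculation: the width of the scene covered at a given distance is twice the distance times the tangent of half the field of view, divided by the number of pixels. The function name below is hypothetical, and the calculation ignores lens distortion.

```python
import math


def mm_per_pixel(distance_mm: float, fov_deg: float, pixels: int) -> float:
    """Approximate ground resolution of an ideal pinhole camera."""
    footprint_mm = 2.0 * distance_mm * math.tan(math.radians(fov_deg) / 2.0)
    return footprint_mm / pixels


if __name__ == "__main__":
    horizontal = mm_per_pixel(120.0, 87.0, 640)  # 12 cm, 87 deg, 640 px
    vertical = mm_per_pixel(120.0, 58.0, 480)    # 12 cm, 58 deg, 480 px
    print(round(horizontal, 3), round(vertical, 3))  # 0.356 0.277
```

Both values agree with the figures reported in the setup description, and both are below the 0.6 mm tolerance quoted in the abstract.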

Task
In the experiment, a peg-in-hole assembly task was designed. Three different shapes of D-sub plugs named A1, A2, and A3 were chosen as experimental objects, as shown in Figure 5. A successful assembly means that the male D-sub plug is fully inserted into the female D-sub plug, as shown in Figure 6, and it can be observed that this task is fine-grained. To evaluate the tolerance range required for the insertion task, an evaluation experiment was designed. Offsets were added to the bottleneck pose to test the possibility of successful insertion. In the evaluation experiment, A1 was used as the test subject, with multiple sets of translation and rotation parameters. The experimental results are shown in Table 1. The results demonstrate that low angular and translational errors are necessary to ensure successful insertion.
In the task, female D-sub plugs are randomly placed on the table, with distractors around them. The initial alignment translation error was set to [−30 cm, 30 cm). This task tests whether the algorithm can still complete the accurate, low-tolerance assembly task under large initial alignment errors and complex backgrounds.

Data Generation
Three different shapes of D-sub plugs (A1, A2, and A3) were used as experimental subjects, as shown in Figure 5. Coarse (500 images) and fine (2000 × 2000 = 4,000,000) datasets were built to train the coarse and fine controllers, respectively.
(1) Coarse dataset. Hundreds of female D-sub plug images in different positions and orientations were collected under various backgrounds and illumination. In addition, the shape of the female D-sub plug also changes. The oriented bounding boxes and key points of the female D-sub plug in the dataset were labeled using LabelMe tools.
(2) Fine dataset. The fine controller is designed to estimate the pose transformation matrix corresponding to any two images instead of only from the current image to the bottleneck image. Therefore, when the fine dataset is created, an initial pose $T_0$ is first selected. Then, a pair of new poses $T_A$ and $T_B$ are obtained by applying two six-DoF transformations $T_{02A}$ and $T_{02B}$ to the initial pose $T_0$, where $T_{02A}$ and $T_{02B}$ refer to the relative transformation matrices from $T_0$ to $T_A$ and $T_B$. The images $I_A$ and $I_B$ are recorded at these two poses, respectively. The transformation matrix label $T_\Delta$ from $T_A$ to $T_B$ can be calculated from $T_{02A}$ and $T_{02B}$. Therefore, $T_\Delta$, $I_A$, and $I_B$ constitute a set of training data. Since the role of the fine controller is to fine-tune the end pose, the sampling range of rotation and translation is very small. The movement range from the initial pose is set to a radius of 5 mm along the x-axis and y-axis, a radius of 10 mm along the z-axis, pitch and yaw angles from −5 degrees to 5 degrees, and a roll angle from −10 degrees to 10 degrees. For each shape of the female D-sub plug, 4000 samples were sampled and recorded. Of the 4000 samples, half were collected in the absence of the distractors and the other half in the presence of the distractors. Hence, there are 2000 × 2000 = 4,000,000 sets of training data for each shape of female D-sub plug under each distractor condition. The sampling process is completed automatically.
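The label computation for a training pair can be sketched with plain 4 × 4 homogeneous matrices. The equation itself is not preserved in this text, so the sketch assumes the convention $T_A = T_0 T_{02A}$ and $T_B = T_0 T_{02B}$, under which the transform taking $T_A$ to $T_B$ is $T_\Delta = T_{02A}^{-1} T_{02B}$; the convention and the function names are assumptions, not the paper's stated formula.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    inv_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [inv_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_label(t_02a, t_02b):
    """ASSUMED convention: T_B = T_A @ T_delta, so T_delta = inv(T_02A) @ T_02B."""
    return matmul(invert_rigid(t_02a), t_02b)


# Every ordered pair of the 2000 recorded poses yields one training sample.
PAIRS_PER_CONDITION = 2000 * 2000

if __name__ == "__main__":
    # A: 90-degree rotation about z plus 1 unit along x; B: 2 units along y.
    t_02a = [[0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    t_02b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 2.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    print(relative_label(t_02a, t_02b), PAIRS_PER_CONDITION)
```

Because the initial pose cancels out of the product, the label depends only on the two sampled perturbations, which is why the same 2000 recordings can be paired freely.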

Training
All the experiments are implemented with the PyTorch 1.9 deep learning framework and Python 3.8.0 on a PC with 16 GB of RAM and two NVIDIA GeForce RTX 3080 GPUs (Nvidia Corporation, Santa Clara, United States). The computer operating system is Ubuntu 20.04.
(1) Coarse controller. The Stochastic Gradient Descent (SGD) algorithm was employed to optimize the network, where the momentum, weight decay coefficient, and initial learning rate were set to 0.937, 0.0005, and 0.0001, respectively. K-means and genetic learning algorithms were used to automatically generate anchor sizes. Data augmentation methods such as flipping, rotation, mosaic, and multi-scale techniques were used to enhance the model's generalization performance. The YOLOv5m model was chosen as the base model, and the COCO dataset pre-trained model was loaded. The batch size was set to 8 and the number of epochs to 300.
(2) Fine controller. Two models, $M_{ei}$ and $M_{whole}$, were trained. Model $M_{ei}$ was trained for ten epochs on A1, A2, and A3 D-sub plugs in the absence of distractors, using the environment-agnostic strategy. The only difference between $M_{whole}$ and $M_{ei}$ is that $M_{whole}$ takes the whole image as input instead of the image filtered with the OBB mask. A quarter of the maximum number of training pairs, $1900^2 \times 0.25 \times 3 = 2{,}707{,}500$ input pairs, was used in the training. The learning rate was $10^{-4}$ initially and was halved after the fourth, sixth, and eighth epochs. The Adam optimizer was used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and no weight decay. The batch size was set to 128. Random variations in brightness, contrast, and saturation were used for data augmentation on the fine dataset. Data augmentation can improve the model's performance under different lighting conditions.
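The training-set size and the step-wise learning-rate schedule for the fine controller can be reproduced with a few lines of arithmetic. The sketch assumes that epochs are counted from 1 and that "halved after the fourth epoch" means the reduced rate applies from the fifth epoch onward; the function name is hypothetical.

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-4, halve_after=(4, 6, 8)) -> float:
    """Learning rate in effect during a 1-indexed epoch (assumed convention)."""
    halvings = sum(1 for e in halve_after if epoch > e)
    return base_lr * 0.5 ** halvings


# A quarter of all ordered pairs from 1900 images, for three plug shapes.
TRAINING_PAIRS = 1900 ** 2 * 3 // 4

if __name__ == "__main__":
    print(TRAINING_PAIRS)  # 2707500
    for epoch in range(1, 11):
        print(epoch, lr_at_epoch(epoch))
```

Under this reading the last two epochs run at one eighth of the initial rate, that is $1.25 \times 10^{-5}$.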

Experiments and Results
A series of experiments was conducted to verify the effectiveness of the proposed EA-CTFVS method.First, the performance of the overall framework was examined and compared to other advanced studies to demonstrate the effectiveness of EA-CTFVS.Then, ablation experiments for environmentally agnostic strategies were performed to verify the robustness of EA-CTFVS in the presence of distractors.Finally, the performance of the coarse-to-fine framework was compared with that of an end-to-end network to validate the proposed strategy.

Overall Framework Evaluation
This experiment tests the overall performance of the proposed EA-CTFVS. To reproduce real-world assembly tasks as closely as possible and to verify whether the algorithm meets the sub-millimeter accuracy requirement, D-sub plugs of three different sizes were chosen as experimental objects. The female D-sub plug was placed randomly on the table, and the initial alignment error between the male and female D-sub plugs was set to 15 cm or 30 cm. The initial pose of the male D-sub plug was random, but the female D-sub plug was guaranteed to be in the camera's field of view. For each experimental group, defined by a combination of initial alignment error and D-sub plug, 50 different poses were designed. For each pose, it was recorded whether each method completed the assembly task, and the success rate was computed from these outcomes.
The proposed EA-CTFVS was compared with four baselines.
(1) ICP [15]: A traditional point cloud registration method that involves no neural network learning and is an open-loop control method. The initial transformation matrix is set to the transformation from the end-effector to the center point of the workspace. ICP calculates the pose of the current female D-sub plug by registering the scene point cloud with the template point cloud, and this pose guides the end-effector to complete the assembly.
(2) ICP with keypoints (ICP w/kpts) [15]: Similar to ICP, but initialized with a rough female D-sub plug pose calculated by the coarse controller. A rough initial pose can prevent ICP from falling into a local optimum.
(3) KOVIS [22]: A learning-based visual servoing framework that uses a robust keypoint latent representation for robotic manipulation tasks.
(4) P2HNet [23]: A learning-based neural network that directly extracts the desired landmarks for visual servoing. In addition, force control is used in the final insertion process.
As shown in Table 2, the ICP algorithm could not complete the sub-millimeter peg-in-hole assembly task. Even with an approximate initial transformation matrix, ICP could not achieve the required precision. In addition, the experimental results show that KOVIS cannot perform sub-millimeter insertion tasks. It is suspected that KOVIS focuses on robust simulation-to-reality transfer rather than on assembly precision. It is worth noting that P2HNet achieved results similar to ours; however, this is because P2HNet incorporates force control during insertion, whereas the proposed EA-CTFVS relies solely on visual information. The proposed EA-CTFVS successfully completed the sub-millimeter D-sub plug assembly task with an average success rate of 0.92/0.89 for large initial errors (15 cm/30 cm) and was robust to D-sub plugs of different shapes. The results demonstrate that EA-CTFVS can effectively complete complex, high-precision assembly tasks without relying on expensive hardware. Figure 7 shows an example pipeline of EA-CTFVS.

Table 2. Success rates of EA-CTFVS and the four baselines on peg-in-hole assembly tasks. The A1, A2, and A3 D-sub plug assembly tasks were tested with 15 cm/30 cm initial alignment errors. The results show that EA-CTFVS outperforms the other baselines under sub-millimeter tolerances and large initial alignment errors.

This experiment aims to verify the effect of the proposed environment-agnostic image input strategy on the fine controller network. To this end, models $M_{ei}$ and $M_{whole}$ were trained on the fine dataset without distractors and tested on datasets with and without distractors, using $100^2 \times 3 = 30{,}000$ input pairs for testing.

As shown in Table 3, models $M_{ei}$ and $M_{whole}$ both perform well on the dataset without distractors. Specifically, $M_{whole}$ performs slightly better than $M_{ei}$: its errors $e_x$, $e_y$, $e_z$, $e_r$, $e_p$, and $e_y$ (translation along x, y, and z, followed by roll, pitch, and yaw) are 0.08 mm, −0.01 mm, 0.07 mm, 0.04°, 0.05°, and 0.05° lower than those of $M_{ei}$, respectively. A likely reason is that, after the OBB mask is used to remove the background, some object information may be lost because of errors in rotation detection. Overall, however, the error of $M_{ei}$ remains within the acceptable range. When the two models were tested in scenes with background interference, the error of $M_{ei}$ did not change significantly, whereas the error of $M_{whole}$ far exceeded its original value. With the background removed by the OBB mask, the algorithm focuses on the characteristics of the object itself and is therefore able to adapt to complex and changeable unstructured scenes. Figure 8 shows similar results. For the dataset without distractors, except for a few outliers, the translation and rotation errors of both models were approximately 0.3 mm and 0.2°, respectively. For the dataset with distractors, however, the errors of $M_{whole}$ were unacceptable, whereas $M_{ei}$ still performed well. In conclusion, this experiment shows that the proposed environment-agnostic strategy makes EA-CTFVS robust to background interference and widely applicable in changeable, unstructured environments.
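The background removal discussed above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the OBB is assumed to be given as a center, a size, and an angle in pixel coordinates, pixels outside the box are assumed to be set to zero, and the function names `obb_mask` and `apply_obb_mask` are hypothetical. The paper does not specify how the mask is applied.

```python
import numpy as np


def obb_mask(shape, center, size, angle_rad):
    """Boolean mask of the pixels that lie inside an oriented bounding box.

    shape:     (height, width) of the image
    center:    (cx, cy) box center in pixel coordinates
    size:      (w, h) box extent along its own two axes, in pixels
    angle_rad: angle between the box's width axis and the image x axis
    """
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    dx, dy = xs - center[0], ys - center[1]
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    # Project each pixel offset onto the box's own axes.
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return (np.abs(u) <= size[0] / 2) & (np.abs(v) <= size[1] / 2)


def apply_obb_mask(image, mask):
    """Keep the pixels inside the mask and zero out everything else."""
    masked = np.zeros_like(image)
    masked[mask] = image[mask]
    return masked


# Toy example: a uniform white image with an axis-aligned 40 x 20 box.
image = np.full((100, 100, 3), 255, dtype=np.uint8)
mask = obb_mask(image.shape[:2], center=(50, 50), size=(40, 20), angle_rad=0.0)
masked_image = apply_obb_mask(image, mask)
```

Because the fine controller would then see only the pixels inside the box, distractors elsewhere in the scene cannot influence its input. An inaccurate box angle, on the other hand, clips part of the object, which is consistent with the small accuracy loss reported for $M_{ei}$ on the distractor-free dataset.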

Coarse to Fine Framework Evaluation
This experiment tested the effectiveness of the coarse-to-fine framework. To this end, the fine controller was used on its own as an end-to-end visual servoing controller and compared with the complete EA-CTFVS framework. The experiment was conducted with initial alignment errors of 15 cm and 30 cm, using the A1, A2, and A3 D-sub plugs as assembly objects. The validity of the coarse-to-fine framework was assessed by comparing the success rates of the D-sub plug assembly experiment under the two frameworks. The time required to complete the assembly was also considered.
As shown in Table 4, the proposed EA-CTFVS performs well at both 15 cm and 30 cm alignment errors, whereas the end-to-end visual servoing method performs adequately at 15 cm but fails at 30 cm. The role of the fine controller is to fine-tune the result of the coarse controller. Its training samples were therefore drawn from a small translation and rotation interval to ensure sub-millimeter accuracy. As a consequence, it is difficult for the fine controller to complete the task alone when faced with large alignment errors; this sacrifice is necessary for high-precision tasks. Moreover, as shown in Table 5, the visual servoing time required by the end-to-end controller grows as the alignment error increases, because the algorithm needs more iterations to approach the desired pose. In contrast, with the rough pose given by the coarse controller, the end-effector in EA-CTFVS quickly reaches the desired pose with only a slight error, which significantly shortens the subsequent visual servoing. Consequently, the times EA-CTFVS required for assembly tasks with 15 cm and 30 cm alignment errors were similar. Figure 9 shows the time diagram of the robot end pose correction during assembly with a 15/30 cm initial alignment error. In conclusion, the proposed EA-CTFVS completes the assembly task quickly and accurately despite large initial alignment errors and is more efficient than the end-to-end method.

Conclusions
In this paper, EA-CTFVS, an environment-agnostic coarse-to-fine visual servoing framework for real-world sub-millimeter peg-in-hole assembly tasks, was presented. EA-CTFVS employs a bottleneck pose as the desired pose for visual servoing, effectively addressing the prevalent issue of visual blindness in real-world peg-in-hole assembly tasks. Furthermore, EA-CTFVS integrates a keypoint-based coarse controller with a fine controller that employs a twin network. This combination addresses the challenge of achieving sub-millimeter accuracy in peg-in-hole assembly tasks, even with large initial alignment errors. More importantly, EA-CTFVS uses an OBB mask to eliminate background information, enabling the algorithm to handle real-world scenarios with intricate background interference and thereby enhancing its robustness. A series of real-world experiments using three distinct D-sub plugs was conducted to assess the efficacy of EA-CTFVS. The results show that EA-CTFVS outperforms other advanced methods under sub-millimeter tolerance and large initial alignment errors. Furthermore, EA-CTFVS is more suitable for complex production scenarios. Although significant results were obtained, this study has some limitations. For example, testing was limited to D-sub plugs, whereas the objects in real-world robotic assembly tasks are varied and intricate. In future studies, we plan to improve the adaptability of the network to objects of diverse shapes.

Figure 1. An overview of the proposed EA-CTFVS. EA-CTFVS adopts a coarse-to-fine framework. First, the coarse open-loop controller moves the end-effector close to the bottleneck pose. Then, the fine controller uses visual feedback to align the end-effector to the fine bottleneck pose with sub-millimeter accuracy. Finally, the predefined trajectory is repeated to complete the high-precision assembly.

Figure 2. The pipeline of EA-CTFVS. First, in the preparation phase, the end-effector is positioned at the bottleneck pose, and an image of the bottleneck is captured. An Oriented Bounding Box (OBB) is acquired through oriented object detection. Subsequently, an environment-agnostic bottleneck image is generated by employing the OBB as a mask. The bottleneck transformation matrix between the female D-sub plug and the end-effector is determined via key point detection. During the subsequent deployment phase, the pose of the female D-sub plug is determined through a combination of oriented object detection and key point detection. Then, the coarse controller guides the end-effector to approach the coarse bottleneck pose. Subsequently, an environment-agnostic live image is passed to the twin network visual servo controller together with the environment-agnostic bottleneck image to obtain the offset output. The fine controller uses the offset to move the end-effector to the fine bottleneck pose with sub-millimeter accuracy, and a predefined trajectory is finally executed to accomplish the assembly task.

Figure 3. Coordinate transformation relationship from four key points to three key points.

Figure 4. Siamese network architecture.

By using the six-DoF pose offset output of the Siamese visual servoing network, the pose of the end-effector can be iteratively refined in a closed-loop control manner. First, the current end-effector pose $[R \mid t] \in SE(3)$ is recorded. Then, the end-effector is moved to the next pose $[R' \mid t'] \in SE(3)$ according to the offset predicted by the network, where $R' = \Delta R \cdot R$ and $t' = \Delta t + t$. Finally, this process is repeated until the network-predicted offset is less than a specific threshold. The overall algorithm is shown in Algorithm 2.
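The closed-loop refinement described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 2: `predict_offset` is a hypothetical stand-in for the Siamese network, the thresholds `trans_tol` and `rot_tol` are placeholder values, and the toy predictor simply returns a fraction of the remaining error to a known goal so that the loop can be run without a robot or camera.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (rad) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_angle(R):
    """Rotation angle (rad) encoded by a 3x3 rotation matrix."""
    cos_a = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_a, -1.0, 1.0)))


def refine_pose(R, t, predict_offset, trans_tol=1e-4, rot_tol=1e-3, max_iters=50):
    """Apply predicted offsets until they fall below the thresholds.

    predict_offset(R, t) -> (dR, dt) is a hypothetical stand-in for the
    network; the update is R' = dR @ R and t' = dt + t, as in the text.
    """
    for step in range(max_iters):
        dR, dt = predict_offset(R, t)
        if np.linalg.norm(dt) < trans_tol and rotation_angle(dR) < rot_tol:
            return R, t, step
        R = dR @ R
        t = dt + t
    return R, t, max_iters


def make_toy_predictor(theta_goal, t_goal, gain=0.5):
    """Toy predictor: returns a fraction of the remaining error (z rotation only)."""
    def predict(R, t):
        theta = np.arctan2(R[1, 0], R[0, 0])
        return rot_z(gain * (theta_goal - theta)), gain * (t_goal - t)
    return predict


t_goal = np.array([0.10, -0.05, 0.20])
predict = make_toy_predictor(theta_goal=0.3, t_goal=t_goal)
R_final, t_final, n_steps = refine_pose(np.eye(3), np.zeros(3), predict)
```

In a real system the predictor would be driven by the live and bottleneck images rather than by the pose itself, and each update would be executed by the robot before the next image is captured. The stopping test on the predicted offset mirrors the threshold condition stated in the text.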

Figure 5. The D-sub plugs come in different shapes.

Figure 7. Example pipeline of EA-CTFVS. (a) Initialization. (b) Oriented object detection and key point detection results of the coarse controller. (c) Pose after running the coarse controller. (d) Environment-agnostic live image while running the fine controller. (e) Reaching the bottleneck pose after running the fine controller.

Figure 8. Box plots of the translation and rotation error distributions for the three D-sub plugs on different models. Maximum outliers are shown as circles above each plot. The five horizontal lines, from highest to lowest, correspond to the maximum fence, third quartile, mean, first quartile, and minimum fence, respectively. (a) Results of $M_{ei}$ on the dataset without distractors. (b) Results of $M_{ei}$ on the dataset with distractors. (c) Results of $M_{whole}$ on the dataset without distractors. (d) Results of $M_{whole}$ on the dataset with distractors.

Figure 9. Time diagram of robot end pose correction during assembly with 15/30 cm initial alignment error. (a,b) depict the time curves under initial alignment errors of 15 cm and 30 cm, respectively. The blue curves represent the adjustment of the robotic arm's end-effector coordinates along the x, y, and z axes, and the red curves indicate the adjustment of the Euler angles. The time points at which the end-effector reaches the initial pose, coarse bottleneck pose, fine bottleneck pose, and insertion pose are marked below the curves.

Table 5. Visual servoing time (s) of EA-CTFVS and the end-to-end baseline on peg-in-hole assembly tasks with 15 cm and 30 cm initial alignment errors. The results show that EA-CTFVS is more efficient.

Table 1. The relationship between different rotation (roll, yaw, and pitch) and translation (x, y, and z) offsets and the success of insertion.

Table 3. Errors of models $M_{ei}$ and $M_{whole}$ on the datasets without/with distractors. The results show that model $M_{ei}$ is more robust to background interference.