Article

SO-YOLO11-CDP: An Instance Segmentation-Based Approach for Cross-Depth-of-Field Positioning Micro Image Sensor Modules in Precision Assembly

Xi Lu, Juan Zhang, Yi Yang and Lie Bi
Research Center of Laser Fusion, China Academy of Engineering Physics, Mianyang 621900, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 411; https://doi.org/10.3390/electronics15020411
Submission received: 15 December 2025 / Revised: 11 January 2026 / Accepted: 14 January 2026 / Published: 16 January 2026

Abstract

During the batch soldering and assembly of micro image sensor modules, random initial poses and partial occlusion of component features in the microscopic image lead to missed and erroneous detections, while cross-depth-of-field detection errors in microscopic vision degrade 3D spatial positioning accuracy. This paper proposes Small Object-YOLO11-Cross-Depth-of-Field Positioning (SO-YOLO11-CDP), an instance segmentation-based approach for precise cross-depth-of-field positioning of micro-components. First, an improved Small Object-YOLO11 (SO-YOLO11) image segmentation algorithm is designed: a coordinate attention (CA) mechanism is incorporated into the segmentation head to enhance localization of micro-targets, non-stride convolution in the backbone preserves fine-grained features, and target regression performance is boosted via an Efficient-IoU (EIoU) loss combined with the normalized Wasserstein distance (NWD). Subsequently, to further improve spatial position detection accuracy in cross-depth-of-field detection, a calibration error compensation model for the image Jacobian matrix is established based on pinhole imaging principles. Experimental results indicate that SO-YOLO11 achieves a 16.1% increase in precision, a 4.0% increase in recall, and a 9.9% increase in mean average precision (mAP0.5) over the baseline YOLO11. Furthermore, it attains a spatial detection accuracy better than 6.5 μm for target micro-components. The method presented in this paper holds significant engineering application value for high-precision spatial position detection of micro image sensor components.

1. Introduction

With the advancement of micro-electro-mechanical systems (MEMS) technology, the physical dimensions of micro-sensors have shrunk from the millimeter scale to the micrometer scale, imposing higher demands on assembly precision and efficiency. Micro image sensor modules represent a typical product category within micro-sensors, whose core components, including micro image sensors and inner conductors, have sizes on the hundred-micrometer scale. Currently, the assembly of these components relies heavily on manual labor, resulting in low precision and poor efficiency. Therefore, researching micro-assembly techniques to achieve precision assembly of CMOS sensors is a crucial technological direction for replacing manual labor and enabling consistent, mass-production assembly.
During automated micro-assembly of components, the microscopic vision system extracts target geometric parameters through machine vision algorithms, providing robots with spatial position feedback to achieve precise positioning [1]; force-sensing control then enables compliant assembly of the target components [2]. Unlike conventional microdevices, inner conductors feature weak rigidity and slender dimensions. This physical characteristic makes them prone to deformation during assembly, hindering state consistency. Consequently, random initial states and feature occlusions appear in images, causing component feature extraction to fail. Furthermore, high-magnification microscopic vision exhibits a shallow depth of field (DOF). The random pose of the inner conductor causes a coarse positioning deviation between the micro image sensors and the inner conductors, resulting in partial defocusing and cross-depth-of-field detection issues within the same field of view. Consequently, image feature detection and cross-depth-of-field spatial position detection represent the core challenges in micro-assembly technology.
For detecting multiple types of targets, the microscopic vision system employs low-magnification objectives to expand the camera’s field of view, which reduces the pixel coverage of targets within the image. The challenge lies in accurately extracting target contours and feature information from such low-quality images. First, image algorithms are used to segment micro-part outlines and extract feature point pixel coordinates. Yao [3] reviewed the principles of optical microscopic vision and traditional image processing algorithms such as template matching and feature matching; however, these approaches are unsuitable for irregular, easily deformable parts. End-to-end deep learning image segmentation algorithms demonstrate superior generalization capabilities. Widely adopted approaches in industrial applications include one-stage algorithms such as the YOLO (You Only Look Once) series [4] and SSD [5], as well as two-stage variants of the R-CNN series [6,7,8]. For instance, Gong [9] employed Faster R-CNN for preliminary coarse localization of small-diameter, deep-aperture electrical connectors, followed by contour refinement using an edge enhancement technique. Fontana [10] developed a dataset for detecting soldering defects on small PCB components and achieved pixel-level precision and high accuracy in defect contour detection by training a Mask R-CNN model. Xu [11] combined the CBAM attention mechanism with YOLOv5 to identify droplets in occluded scenes during micro-assembly processes. Kang [12] proposed the ASF-YOLO model for rapid segmentation of cell images under microscopic vision, demonstrating high detection and segmentation performance. Yan [13] utilized a U-Net network to extract groove features on ring-shaped components and calculated the component’s pose from image features and corresponding 3D coordinates. These algorithms detect target features within the pixel coordinate system, necessitating further research on calibrating the robotic manipulator’s 3D spatial coordinate system against the pixel coordinate system. Liu [14] proposed BI-YOLO, a lightweight real-time detection algorithm improved from YOLOv8, capable of detecting the positions of cells and microinjector tips in real time during cellular micro-manipulation while providing visual feedback. Liu [15] employed YOLOv5 for coarse localization of small targets, followed by fine localization using a position correction algorithm, thereby enhancing localization accuracy.
Current marker-based or active motion-based eye-to-hand calibration methods fail to adequately account for errors introduced by cross-depth-of-field detection when calculating the spatial positions of micro-parts, thereby limiting their detection accuracy. To calibrate errors arising from different focal planes, multi-scale micro-pattern markers have been designed to calibrate depth errors. For instance, Yao [16] developed a cross-scale auxiliary marker board for microscopic vision systems to enhance cross-depth-of-field detection accuracy. Zhang [17] designed a cross-scale detection system integrating binocular global vision and binocular micro vision, calibrated using a cross-scale micro chessboard. Vismanis [18] combined image enhancement with YOLOv7 to detect blurred cell centroid positions, employing a standard micro-scale grid to calibrate the mapping between camera and manipulator coordinates. Another strategy to circumvent defocus issues involves estimating depth information so that components are manipulated on the same focal plane. For instance, Fan [19] embedded the global attention mechanism (GAM) into YOLOv8 to pinpoint target center positions and angles, establishing a learning-driven manipulation model to control dual robotic arms for collaborative micro-part manipulation on a 2D plane without image calibration. Ming [20] used an eye-in-hand vision system with YOLO and depth information to roughly localize the peg; after approach, deep reinforcement learning was employed to achieve precise alignment and insertion of a micro gear onto the peg. However, this strategy has significant limitations and is not applicable to many types of parts. Active motion eye-to-hand calibration methods within multi-camera systems provide superior accuracy and stability. For instance, Qin [21] proposed a point-contour extraction network (MicroPCIE) with robust performance on defocused, blurred images to detect biomedical microelectrode positions; by actively translating and rotating to calibrate the image Jacobian matrix, the system enables precise control of microelectrode implantation. Zhang [22] achieved pose detection by calibrating the image Jacobian matrices of a multi-channel microscopic vision system and proposed a coarse-fine alignment control strategy, in which horizontal vision was used for precise pose detection and oblique vision for high-precision position detection of microsphere holes and microtubes.
Currently, the YOLO series algorithms demonstrate excellent real-time performance, meeting the requirements of assembly tasks [23,24,25,26]. They also hold application potential for segmenting dense small targets in complex environments [27,28,29]. However, in micro-assembly applications, achieving sub-pixel segmentation accuracy for targets spanning only a few tens of pixels poses a considerable challenge to the positioning precision of existing segmentation algorithms. Although many existing methods address small-object detection, segmentation frameworks that deliver both high robustness and high contour accuracy for low-resolution targets remain lacking. Furthermore, existing eye-to-hand calibration methods based on active part motion do not account for the motion errors introduced by camera refocusing during cross-depth-of-field detection.
To further overcome the challenge of low spatial position detection accuracy for micro image sensor modules, this paper proposes a method integrating a small-object instance segmentation algorithm with image Jacobian matrix calibration error compensation: SO-YOLO11-CDP (Small Object-YOLO11-Cross-Depth-of-Field Positioning). This approach extracts the pixel locations of micro-component features through SO-YOLO11 and then achieves precise positioning of the micro image sensor module by compensating for errors in image Jacobian matrix calibration. The method achieves micrometer-level high-precision position detection for micro image sensor modules and can be applied to fields such as the assembly of micro-devices in MEMS.
In summary, the main contributions of this paper are as follows:
  • Enhance instance segmentation performance: SO-YOLO11 integrates three modules suited to small-object segmentation. A coordinate attention mechanism is introduced into the segmentation head to embed precise target location information; non-stride convolution is employed in the backbone network to enhance fine-grained feature extraction for small targets; and an EIoU loss function combined with the normalized Wasserstein distance is introduced to improve target regression performance. SO-YOLO11 thereby achieves high robustness and accuracy in instance segmentation.
  • Improve spatial position detection accuracy: A cross-depth-of-field image Jacobian matrix calibration error compensation model is established based on pinhole imaging principles, and a cross-depth-of-field eye-to-hand calibration workflow is designed to improve the spatial position detection accuracy of object feature points. The error compensation method enables higher precision in cross-scale inspection of micro-components under microscopic vision.
The overall structure of this paper is as follows: Section 2 introduces the assembly process of micro image sensor modules and the structure and method of the position detection system. Section 3 presents the image segmentation model and the image position detection method. Section 4 elaborates on the cross-depth-of-field calibration error compensation model. Section 5 experimentally validates the feasibility of the detection method and the accuracy of the calibration approach. Finally, Section 6 summarizes the paper and discusses limitations and future work.

2. Assembly Task Description and Precise Positioning Method

2.1. Assembly Process for Micro Image Sensor Modules

The micro image sensor module comprises two components: a micro image sensor and inner conductors. Figure 1a illustrates the dimensions and structure of the components. The inner conductor consists of four sub-conductors, each with a diameter of approximately 50 μm. The micro image sensor includes four equally spaced solder spheres, each typically not exceeding 150 μm in diameter.
The micro image sensor module is assembled by connecting inner conductor to micro image sensor via laser micro-welding of solder balls. The assembly procedure, illustrated in Figure 1b, consists of the following steps: (1) coarse positioning of the components in their initial state; (2) guided by the vision system, the gripper captures inner conductors and aligns them with the solder sphere on the micro image sensor; (3) approach of the welding nozzle and laser melting of the solder ball to form the connection; and (4) the final assembled product after repeating the previous steps.
During the assembly process, the clamping, alignment, and assembly of inner conductors are accomplished through vision servo control. Consequently, the core challenge lies in achieving high-precision spatial position detection, characterized by the following two key issues:
  • Cross-depth-of-field position detection. With target sizes in the hundred-micrometer range, their representation in the image occupies a small pixel area. High-magnification microscopic vision suffers from a narrow field of view and shallow depth of field. This leads to partial defocusing of component features within the image space, introducing errors in 3D position detection.
  • Stochastic initial conditions. During assembly, data cables exhibit flexibility. The welding fixture may deform cable cores during fixation, causing random variations in initial conditions. Overlapping target contour features lead to misdetection or missed detection in image segmentation.
The assembly of irregular inner conductors with micro image sensors is constrained by a confined workspace, which limits the layout of the vision system. In this setup, the components are held by a fixed fixture, with the laser welding machine mounted directly above the targets and robotic arms arranged on both sides. The optical paths of conventional horizontal and vertical microscopic vision systems would interfere with the existing equipment. Therefore, this platform adopts a non-orthogonal dual-channel microscopic vision system that utilizes mirrors to redirect the optical paths for real-time inspection, as shown in Figure 1c. The end-effector of the robotic arm is equipped with a reflector mirror and a microgripper, enabling both optical path redirection and manipulation of the inner conductors. The mirrors deflect the optical path by 90°. The initial orientation of the part forms an angle θ1 with the XWYW plane of the world coordinate system, while the vertical vision optical axis forms an angle θ2 with the same plane. Typically, θ1 = θ2, ensuring that the vertical vision channel detects along the module’s vertical axis.

2.2. Precision Position Detection Framework

To meet practical assembly requirements, we establish a method combining instance segmentation algorithms with cross-depth-of-field calibration for precise positioning, as illustrated in Figure 2. To overcome the poor imaging quality of small targets, we design the small-object instance segmentation algorithm SO-YOLO11, which extracts pixel-level contour masks of micro-components; feature points are then accurately extracted using a sub-pixel approach. Subsequently, the image-space coordinates are converted into 3D Cartesian coordinates by offline calibration of the cross-depth-of-field image Jacobian matrix. To enhance position detection accuracy, errors introduced during the focusing process of the vision system are taken into account and incorporated into the 3D coordinate calculations.

3. Visual Recognition and Image Position Detection

3.1. SO-YOLO11 Segmentation Model

YOLO11 demonstrates superior performance in detecting small objects and occluded scenes, owing to its C2PSA spatial attention mechanism and C3K2 blocks [30]. Building upon YOLO11’s strength in multi-scale feature extraction, this paper further proposes SO-YOLO11, a dedicated network for small-object image segmentation; the framework is illustrated in Figure 3. The model incorporates multiple modules dedicated to optimizing small-object detection performance as follows: (1) non-stride convolution (space-to-depth convolution, SPD-Conv) [31] is introduced to enhance feature extraction capabilities for small objects; (2) the segmentation head integrates coordinate attention (CA) [32], which embeds precise positional information; and (3) a loss function combining the normalized Wasserstein distance [33] and Efficient-IoU (NWD-EIoU) [34] is designed to enhance object regression performance.
SO-YOLO11 incorporates SPD-Conv in the backbone network for downsampling, which helps preserve the edge contours and fine-texture details of small objects. Combined with a multi-scale feature fusion network that aggregates shallow features with deep semantic features, this enhances the ability to extract small-object features. While the C2PSA module in the YOLO11 backbone strengthens the semantic representation of small objects through multi-scale attention, which benefits detection, it remains insufficient for the precise localization required in segmentation tasks. Therefore, CA is introduced in the segmentation head, enabling spatial perception and accurate boundary segmentation of small objects. Furthermore, the conventional CIoU loss is replaced with the proposed NWD-EIoU loss, which combines overlap and geometric metrics with a 2D Gaussian probability distribution metric. This design reduces sensitivity to scale variations in small objects and improves both regression accuracy and training stability. Addressing the high-precision detection demands of micro-assembly, SO-YOLO11 achieves highly accurate and robust small-object contour extraction.
SPD-Conv. Convolutional neural networks (CNNs) comprise convolutional and pooling layers and are primarily used for feature extraction in image recognition and segmentation tasks. The Conv block in YOLO uses strided convolution; however, this approach tends to overlook fine-grained features of small objects, thereby degrading performance. In contrast, SPD-Conv mitigates the loss of shallow-layer feature information when processing low-resolution targets, thereby improving the accuracy of small-object segmentation in complex environments.
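To make the idea concrete, the following is a minimal PyTorch sketch of a space-to-depth convolution block in the spirit of SPD-Conv [31]; the 2× downsampling factor, channel widths, and SiLU activation are illustrative assumptions rather than the exact SO-YOLO11 configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth rearrangement followed by a non-strided convolution (sketch of SPD-Conv)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # after the 2x2 space-to-depth step the channel count quadruples
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        # space-to-depth: fold each 2x2 spatial block into the channel dimension,
        # halving H and W without discarding any pixel (unlike strided convolution)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))
```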
CA. Coordinate attention decouples the 2D space into two 1D feature encodings through 1D average pooling along the height and width directions. It then generates two directional attention maps embedded with coordinate information through operations such as convolution and nonlinear activation. With its lightweight design, this module effectively captures channel-wise long-range spatial dependencies, thereby enabling accurate localization of small objects. Compared to other attention mechanisms such as SE and CBAM, CA demonstrates higher efficiency and precise localization of small objects. Incorporating CA into the segmentation head substantially enhances instance segmentation accuracy, with significant gains on small targets.
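For illustration, a compact PyTorch sketch of the coordinate attention block described above is given below, following the structure of the original CA design [32]; the reduction ratio and Hardswish activation are assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention: 1D pooling along H and W, then direction-aware gating."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # pool over width  -> (N, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # pool over height -> (N, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # height-wise attention map
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # width-wise attention map
        return x * a_h * a_w                                          # position-aware reweighting
```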
NWD-EIoU. The EIoU loss function computes regression terms for width and height independently. It minimizes the discrepancies in width and height between the ground-truth and predicted boxes separately, leading to higher regression accuracy and faster convergence [34]. However, EIoU exhibits high sensitivity to localization deviations in small targets. When the predicted bounding box shares a low intersection-over-union (IoU) with the ground-truth box, the training loss increases sharply, causing the target to be misclassified as a negative sample. Consequently, small-object features become difficult to learn, degrading the overall regression performance. The normalized Wasserstein distance (NWD) [33] models each bounding box as a 2D Gaussian distribution and measures the similarity between the predicted and ground-truth boxes by computing the NWD between their corresponding distributions. This approach provides stable gradients that facilitate small-object learning. By integrating the strengths of both methods, the proposed NWD-EIoU loss function improves regression performance and localization accuracy for low-resolution targets. The NWD-EIoU loss is defined in Equation (1):
$$L_{NWD} = 1 - \exp\!\left(-\frac{\sqrt{W_2^2(A, B)}}{C}\right), \qquad L_{NWD\text{-}EIoU} = \lambda\, L_{EIoU} + (1 - \lambda)\, L_{NWD} \quad (1)$$
where A and B denote the 2D Gaussian distributions modeling the predicted box and the ground-truth box, W2 denotes the second-order Wasserstein distance, C is a normalization constant set to 12.8, LNWD is the NWD loss term, LEIoU is the EIoU loss term, and λ is a weighting hyperparameter. Given the prevalence of small objects in our application scenario, λ is set to a relatively low value of 0.2.
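A minimal PyTorch sketch of Equation (1) is shown below; boxes are assumed to be in (cx, cy, w, h) format, and the EIoU term is a standard textbook implementation assumed for illustration rather than the authors' exact code.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """Standard EIoU: 1 - IoU plus center-distance, width and height penalty terms."""
    p1, p2 = pred[:, :2] - pred[:, 2:] / 2, pred[:, :2] + pred[:, 2:] / 2
    t1, t2 = target[:, :2] - target[:, 2:] / 2, target[:, :2] + target[:, 2:] / 2
    inter = (torch.min(p2, t2) - torch.max(p1, t1)).clamp(0).prod(-1)
    union = pred[:, 2:].prod(-1) + target[:, 2:].prod(-1) - inter + eps
    iou = inter / union
    cwh = torch.max(p2, t2) - torch.min(p1, t1)                 # smallest enclosing box (w, h)
    rho2 = ((pred[:, :2] - target[:, :2]) ** 2).sum(-1)         # squared center distance
    return (1 - iou
            + rho2 / (cwh ** 2).sum(-1).clamp(min=eps)
            + (pred[:, 2] - target[:, 2]) ** 2 / (cwh[:, 0] ** 2 + eps)
            + (pred[:, 3] - target[:, 3]) ** 2 / (cwh[:, 1] ** 2 + eps))

def nwd_eiou_loss(pred, target, lam=0.2, C=12.8):
    """Combined loss of Eq. (1): lambda * EIoU + (1 - lambda) * NWD."""
    # squared 2-Wasserstein distance between the 2D Gaussians modelling the two boxes
    w2 = ((pred[:, :2] - target[:, :2]) ** 2).sum(-1) \
         + ((pred[:, 2:] - target[:, 2:]) ** 2).sum(-1) / 4.0
    l_nwd = 1.0 - torch.exp(-torch.sqrt(w2) / C)
    return lam * eiou_loss(pred, target) + (1 - lam) * l_nwd
```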

3.2. Sub-Pixel Image Position Detection Method Based on Image Gradients

The image position detection process begins with the SO-YOLO11 segmentation algorithm, which coarsely locates the target to obtain a pixel-accurate contour and its corresponding binary mask. To enhance contour quality, the mask first undergoes preprocessing. A Gaussian filter is applied to smooth the contour, suppress noise, and reduce jagged edges. This is followed by morphological opening and closing operations to eliminate small holes and improve both the continuity and smoothness of the contour. Subsequently, feature points are extracted from the preprocessed contours.
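As a concrete illustration, a minimal OpenCV sketch of this mask preprocessing step is given below; the kernel sizes, Gaussian sigma, and threshold value are illustrative assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np

def preprocess_mask(mask):
    """Smooth and clean a binary mask (values 0/1) produced by the segmentation network."""
    blurred = cv2.GaussianBlur(mask.astype(np.uint8) * 255, (5, 5), sigmaX=1.0)
    _, binary = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY)   # re-binarize after smoothing
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # remove small spurs and noise
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)  # fill small holes along the contour
    return closed
```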
Finally, a gradient-based method is employed for sub-pixel detection. For any point p on the target contour edge within the image I(u, v), the image gradient direction is perpendicular to the edge tangent. A sub-pixel contour point q is then determined based on the local image gradient Gi at point p.
$$G_i = \left[\frac{\partial I}{\partial u},\; \frac{\partial I}{\partial v}\right]^{T}, \qquad G_i \cdot (p - q) = 0 \quad (2)$$
where I(u, v) is the image grayscale function, Gi is the image gradient at point p, p is the initial pixel coordinate, and q is the sub-pixel image coordinate.
The coordinates of the sub-pixel feature point q are obtained by a least-squares solution, expressed as follows (Equation (3)):
$$q = \left(G_i^{\,T} G_i\right)^{-1} \left(G_i^{\,T} G_i\, p\right) \quad (3)$$
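A minimal NumPy/OpenCV sketch of this gradient-based refinement is given below. Accumulating the normal equations of Equations (2) and (3) over a small window around each coarse point is an assumption made here to keep the least-squares system well-posed; the window size and Sobel kernel are also illustrative.

```python
import cv2
import numpy as np

def refine_subpixel(gray, p, win=5):
    """Refine one coarse contour point p = (u, v) to sub-pixel accuracy.

    Sketch only: assumes p lies away from the image border and that the window
    contains a well-defined edge (otherwise the normal matrix may be near-singular).
    """
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    u0, v0 = int(round(p[0])), int(round(p[1]))
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for dv in range(-win, win + 1):
        for du in range(-win, win + 1):
            u, v = u0 + du, v0 + dv
            g = np.array([gx[v, u], gy[v, u]])   # image gradient G_i at neighbour point p_i
            GG = np.outer(g, g)                   # G_i G_i^T
            A += GG
            b += GG @ np.array([u, v], dtype=float)
    return np.linalg.solve(A, b)                  # q = (sum G G^T)^-1 (sum G G^T p_i)
```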

4. Spatial Position Detection for Micro-Devices

4.1. Calibration Workflow

A calibration workflow is designed for cross-depth position detection. First, the vision system is actuated to perform n cycles (n ≥ 5) of fine increment, long stroke, and equidistant reciprocating motion along its focus axis. The image offset matrix for the i-th microscopic vision channel is then calculated using Equation (4):
$$J_{Bi} = \Delta P_{Bi}\, \Delta L^{T} \left(\Delta L\, \Delta L^{T}\right)^{-1} \quad (4)$$
where ∆PBi is the change in pixel coordinates during the i-th focusing iteration, and ∆L is the focal-axis displacement applied during the image offset matrix calibration.
The image Jacobian matrix is then calibrated through active movement of the end-effector. During this process, defocus blur caused by cross-depth-of-field variations necessitates a visual refocusing strategy; the sources of error are illustrated in Figure 4. To compensate for the resulting positional errors, the camera image offset matrix is incorporated into the calibration, yielding the cross-depth-of-field image Jacobian matrix J, as shown in Equation (5):
$$J = \left(\Delta P - J_B\, \Delta L\right) \Delta D^{T} \left(\Delta D\, \Delta D^{T}\right)^{-1} \quad (5)$$
where JB is the image offset matrix, ∆D is the end-effector displacement, ∆P is the change in pixel coordinates, and ∆L is the focal-axis displacement.
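For clarity, a NumPy sketch of the two least-squares fits in Equations (4) and (5) is provided below; stacking the measurements as matrix columns (one column per motion sample) is an assumed convention, not a detail given in the paper.

```python
import numpy as np

def calibrate_offset_matrix(dP, dL):
    """Eq. (4): J_B = dP dL^T (dL dL^T)^-1, image offset per unit focus travel.
    dP: 2xN pixel offsets, dL: 1xN focus-axis displacements (one camera channel)."""
    return dP @ dL.T @ np.linalg.inv(dL @ dL.T)

def calibrate_jacobian(dP, dD, dL=None, J_B=None):
    """Eq. (5): J = (dP - J_B dL) dD^T (dD dD^T)^-1, cross-depth-of-field Jacobian.
    dD: 3xN end-effector displacements; with dL/J_B omitted this reduces to the
    uncompensated active-motion calibration."""
    residual = dP if J_B is None else dP - J_B @ dL
    return residual @ dD.T @ np.linalg.inv(dD @ dD.T)
```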
Finally, n points are arbitrarily chosen within the 3D workspace (n ≥ 5). The accuracy of the spatial position detection and the effectiveness of the compensation method are validated by comparing the measured positions against their ground truth values. The complete calibration workflow is depicted in Figure 5.

4.2. Image Jacobian Matrix Model

The micro-manipulator grips the conductors and actuates its translation with three degrees of freedom within the workspace. A dual-channel, non-orthogonal microscopic vision system is employed to measure the relative deviations of the target’s image features. These deviations are then converted into spatial relative positions via the pseudo-inverse of the image Jacobian matrix, which subsequently serves as the input for alignment control. The image Jacobian matrix J describes the mapping between the incremental changes in the part’s spatial position and the resulting changes in its image features. Denoting the conductor’s spatial coordinates in the workspace as D = [xD, yD, zD]T and its coordinates in the image feature space as P = [ux, vx, uz, vz]T, the image Jacobian matrix J is defined by Equation (6).
$$\Delta P = J\, \Delta D, \qquad \left[\Delta P_1, \Delta P_2, \cdots, \Delta P_N\right] = J \left[\Delta D_1, \Delta D_2, \cdots, \Delta D_N\right] \quad (6)$$
The optical imaging principle of microscopic vision is based on the pinhole model, as shown in Figure 6. In this model, the center of the camera’s optical axis is equivalent to a pinhole through which light rays from the target object pass before projecting onto the image plane. For a monocular vision system, the matrix relationship between the image feature space coordinates (u, v) and the camera coordinate system {C} coordinates (xc, yc, zc) is given by Equation (7):
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} \begin{bmatrix} k_x & 0 & u_0 \\ 0 & k_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = M_c \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (7)$$
where kx and ky are the magnification coefficients in the X-axis and Y-axis directions, (u0, v0) are the optical center coordinates, Mc is the internal parameter matrix, and zc = f, f is the focal length.
As illustrated in the optical path design of Figure 1c, the microscopic vision system utilizes a mirror {Mi} to deflect the optical path. Assuming the optical axis aligns with the positive ZC direction of the camera coordinate system and defining counterclockwise reflection as positive, the matrix relationship between coordinates (xc, yc, and zc) in the camera coordinate system {C} and coordinates (xD, yD, and zD) in the part coordinate system {D} is given by Equation (8):
$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R_M M_D \begin{bmatrix} x_D \\ y_D \\ z_D \end{bmatrix} + P_D \quad (8)$$
Among them, RM is the rotation matrix between {C} and {Mi}. Within this matrix, α, β, and θ are the deflection angles of the optical path toward the YCZC, XCZC, and XCYC planes, each ranging from −90° to 90°. PD is the translation vector representing the origin offset from {C} to {D}. MD is the transformation matrix from {Mi} to {D}, represented here by a 3 × 3 matrix of non-zero coefficients for the positional transformation, as given in Equation (9):
$$R_M = \begin{bmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\beta & -\sin\beta & 0 \\ \sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad M_D = \begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{bmatrix}, \quad P_D = \begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix} \quad (9)$$
Based on Equations (7)–(9), the coordinate transformation matrix JP for a monocular microscopic vision system can be derived. For systems with a fixed physical structure, JP is a constant 3 × 3 matrix. The image Jacobian matrix J for a monocular camera corresponds to the first two rows of this constant matrix JP. In a dual-channel microscopic vision system, J = [J1, J2], J1 and J2 represent the image Jacobian matrices for the two respective cameras.
$$\begin{bmatrix} P \\ 1 \end{bmatrix} = M_C R_M M_D \begin{bmatrix} x_D \\ y_D \\ z_D \end{bmatrix} + M_C P_D, \qquad \begin{bmatrix} \Delta P \\ 0 \end{bmatrix} = M_C R_M M_D \begin{bmatrix} \Delta x_D \\ \Delta y_D \\ \Delta z_D \end{bmatrix} = J_P \begin{bmatrix} \Delta x_D \\ \Delta y_D \\ \Delta z_D \end{bmatrix} \quad (10)$$
where ∆P = [∆u, ∆v]T denotes the relative position of the part within the image, and ∆xD, ∆yD, and ∆zD represent the relative positional deviations of the part along the x, y, and z axes, respectively.
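The following NumPy sketch assembles the constant matrix J_P = M_c R_M M_D from Equations (7)–(10) and extracts the monocular image Jacobian as its first two rows; the rotation-order convention and any numeric inputs are assumptions made for illustration only.

```python
import numpy as np

def rotation_from_mirror(alpha, beta, theta):
    """R_M of Eq. (9), built from three elementary rotations (order assumed)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    ct, st = np.cos(theta), np.sin(theta)
    Ry = np.array([[ca, 0, sa], [0, 1, 0], [-sa, 0, ca]])
    Rx = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
    Rz = np.array([[cb, -sb, 0], [sb, cb, 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

def monocular_jacobian(M_c, R_M, M_D):
    """J_P = M_c R_M M_D (Eq. (10)); the image Jacobian is its first two rows."""
    J_P = M_c @ R_M @ M_D        # constant 3x3 matrix for a fixed optical setup
    return J_P[:2, :]             # 2x3 monocular image Jacobian
```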

4.3. Calibration Error Compensation

The application of microscopic vision to spatial pose estimation is constrained by its inherent narrow field of view and shallow depth of field. When a micro-part is positioned within the system’s depth of field, the vision system can acquire clear image features. Under this condition, the relationship between the micropart’s 3D spatial vector and image spatial vector is given by Equation (6) above.
In cross-depth-of-field pose detection, where the target depth exceeds the depth of field, the camera must refocus to resolve features clearly. This focusing motion, which moves the camera toward or away from the target, introduces a perspective scaling effect (nearby objects appear larger than distant ones). Furthermore, any angular misalignment between the focusing axis and the camera’s optical axis introduces additional positional errors in the image. To address these errors in cross-depth-of-field detection, this paper moves the camera along its focusing axis to capture images and calibrates the image offset matrix to compensate for the detection errors, as formulated in Equation (11):
$$\Delta P = J\, \Delta D + J_B\, \Delta L \quad (11)$$
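A minimal NumPy sketch of inverting Equation (11) with the pseudo-inverse of the stacked dual-channel Jacobian is shown below; treating ΔL as a single scalar refocus travel shared by the stacked offset matrix is an assumption for illustration.

```python
import numpy as np

def spatial_displacement(J, J_B, dP, dL):
    """Recover the 3D relative displacement from Eq. (11): dD = pinv(J) (dP - J_B * dL).

    J: stacked dual-channel image Jacobian (4x3), J_B: stacked image offset matrix (4x1),
    dP: measured pixel offsets (4x1), dL: focus-axis travel used for refocusing (scalar).
    """
    return np.linalg.pinv(J) @ (dP - J_B * dL)
```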

5. Experimental Results and Analysis

5.1. Multi-Microparts Segmentation and Contour Precise Localization Experiments

5.1.1. Experimental Parameters and Datasets

To ensure the reliability of the model in industrial applications, images were manually captured during small-batch assembly experiments. A total of 132 images form a small sample dataset—MicroAssembly. The dataset includes various complex scenarios and occlusion conditions encountered in the assembly, together with images of micro image sensors, four differently colored inner conductors, and the gripper used in the assembly operation, as shown in Figure 7. The dataset is randomly partitioned into training, validation, and test subsets in an 8:1:1 ratio. The training and validation sets are used for model training and hyperparameter tuning, while the test set is reserved for final performance evaluation and is excluded from training and generalization experiments.
This study utilizes a small sample dataset. To enhance the model’s generalization performance, multiple online data augmentation techniques are integrated, including Mosaic, Mixup, Copy-Paste, and Augmix. Specifically, Mosaic randomly selects four training images, applies geometric transformations such as scaling and cropping, and stitches them into a composite image. Mixup improves the model’s robustness in segmenting objects with indistinct boundaries through linear interpolation between samples. Copy-Paste augments small objects by rotating and replicating them via flipping. Augmix automatically enhances training samples by adjusting visual attributes like saturation, hue, and brightness. The application probabilities for these augmentations, along with other training hyperparameters, are detailed in Table 1. Furthermore, to stabilize late-stage training, Mosaic augmentation is disabled in the final ten epochs. Overfitting is mitigated through an early stopping mechanism and the following two regularization strategies: weight decay and random dropout.
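As an illustration, a training call with the Table 1 hyperparameters might look like the following sketch using the Ultralytics API; the model and dataset configuration file names are hypothetical placeholders, not the authors' actual files.

```python
from ultralytics import YOLO

# hypothetical model/data configs; hyperparameter values follow Table 1
model = YOLO("so-yolo11m-seg.yaml")
model.train(
    data="MicroAssembly.yaml",
    epochs=500, patience=30,          # training length and early stopping
    optimizer="SGD", lr0=0.01, lrf=0.01,
    weight_decay=0.005, dropout=0.2,
    mosaic=1.0, mixup=0.2, copy_paste=0.5,
    close_mosaic=10,                  # disable Mosaic for the final ten epochs
)
```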
Common evaluation metrics for YOLO include precision (P), recall (R), and mean average precision (mAP). The mAP metric is further reported at two IoU thresholds: mAP0.5 and mAP0.5:0.95. P, R, and mAP are calculated as shown in Equations (12)–(14).
$$P = \frac{TP}{TP + FP} \quad (12)$$
$$R = \frac{TP}{TP + FN} \quad (13)$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i, \qquad AP = \int_0^1 P(R)\, dR \quad (14)$$
Among them, TP represents true positives; FP represents false positives; and FN represents false negatives. AP is defined as the area under the P–R curve, and mAP is the mean of the AP values across all categories.

5.1.2. Image Segmentation and Precise Extraction of Contour Feature Points

Balancing model performance and training efficiency, the medium (m) size configuration was selected as the baseline. To evaluate the contribution of each proposed module, we conducted ablation experiments starting from the YOLO11m-seg base model (baseline). Modules were incrementally integrated as follows: A introduces CA; B introduces SPD-Conv; C combines both A and B; and D further enhances model C by replacing its loss function with the proposed NWD-EIoU. Detailed results are presented in Table 2, where “√” indicates the module is active and “×” indicates it is omitted. The dataset images contain the following six object categories: gripper, solder balls, and conductors in four distinct colors. Among these, solder balls are categorized as small-scale targets.
The experimental results in Table 2 demonstrate that SO-YOLO11m surpasses the baseline, achieving improvements of 16.1% in precision, 4.0% in recall, 9.9% in mAP0.5, and 5.8% in mAP0.5:0.95. To validate the effectiveness of the individual modules for small-object segmentation, the CA and SPD-Conv modules were separately integrated into the baseline model. The CA module substantially boosted both precision and recall for small objects while maintaining robust performance across targets of various sizes. In contrast, the SPD-Conv module primarily enhanced feature extraction for small objects but offered limited benefits for multi-scale object processing. By integrating both CA and SPD-Conv, this work not only strengthens small-object segmentation but also improves the model’s capability to handle multi-scale objects.
Generalization Experiment. To evaluate the generalization capability of the model, K-fold cross-validation was performed. The dataset, excluding the test set, was randomly shuffled and evenly split into five subsets. In each fold, one subset was used as the validation set and the remaining four as the training set. The results are summarized in Figure 8.
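A minimal sketch of such a 5-fold split with scikit-learn is shown below; the file names and image count are hypothetical placeholders for the training-plus-validation images (the test set is already held out).

```python
from sklearn.model_selection import KFold

# hypothetical file list standing in for the shuffled training + validation images
image_paths = [f"images/img_{i:03d}.png" for i in range(119)]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_files = [image_paths[i] for i in train_idx]   # four subsets for training
    val_files = [image_paths[i] for i in val_idx]        # remaining subset for validation
    print(f"fold {fold}: {len(train_files)} train / {len(val_files)} val")
```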
For the detection task, the average recall over the five folds was 0.659 and the average mAP50 was 0.694, compared with 0.603 and 0.716, respectively, for the trained SO-YOLO11 model; the cross-validation average is thus 0.056 higher in recall but 0.022 lower in mAP50. For the segmentation task, the average recall was 0.614 with a standard deviation of 0.045, while the average mAP50 was 0.611 with a standard deviation of 0.025. Both standard deviations are below 0.05, indicating that the model generalizes well and performs consistently across different data subsets for segmentation. It is worth noting that while precision shows a slight increase in segmentation, the observed drop in mAP50 suggests a mild tendency toward overfitting when trained on limited data. Given the limited size of the dataset, this level of overfitting remains acceptable and can be further mitigated by expanding both the quantity and quality of the training images to enhance the model’s generalization capability. Overall, the K-fold cross-validation results confirm that the proposed model exhibits strong generalization ability and robustness.
Comparative experiments of different models. For the precise segmentation comparison experiment, we selected several commonly used model frameworks for evaluation, including YOLO12m-seg, YOLO11m-seg, YOLOv9-seg, YOLOv8s-seg, YOLOv5s-seg, and an RT-DETR-ResNet50 detection framework equipped with the segmentation head from YOLO11. The detailed performance comparison results are presented in Table 3. As shown in the table, compared with other algorithms, SO-YOLO11 achieves improved performance across all evaluation metrics, demonstrating outstanding capability. For the component localization task in the assembly process of micro image sensor modules, it exhibits practical value in scenarios involving small targets and partial occlusion.
To provide a visual comparison of segmentation performance, this study presents the segmentation results obtained during the micro image sensor assembly process and evaluates the recognition capabilities of different models. Figure 9 compares YOLOv8s-seg, YOLO11m-seg, and SO-YOLO11m, which were selected based on their strong performance in quantitative evaluation. In the initial state images (Figure 9a), both YOLOv8s-seg and YOLO11m-seg fail to detect certain spherical features of the micro image sensor and show difficulty in distinguishing between highly overlapping red and white inner conductors. In horizontal-view images (Figure 9b), where the contrast between inner conductors and the background is low, YOLOv8s-seg produces incomplete segmentation at inner conductors ends. In contrast, SO-YOLO11m delivers more coherent overall segmentation and exhibits higher confidence in localizing small objects compared to YOLO11m-seg. Figure 9c shows a robotic arm grasping an inner conductor under substantial occlusion and overlap. Here, YOLOv8s-seg cannot separate the mutually occluding silver and white inner conductors, while YOLO11m-seg yields inaccurate contours for the spherical structure of the micro image sensor. SO-YOLO11m achieves notably better segmentation accuracy across all components in this challenging scenario. These image comparisons demonstrate that the proposed model can accurately segment and localize micro-component contours even in complex assembly environments.
Feature point extraction. The inner conductor is an irregular, flexible component whose shape changes dynamically during assembly. First, the target contour is coarsely segmented using the image segmentation model. This initial binary mask is then used to precisely localize the contour’s feature points. Through forty iterations with a sub-pixel threshold of 0.01 pixels, sub-pixel curvature feature points are extracted from the contour, enabling the fitting of a high-precision contour. Figure 10 shows the results of sub-pixel feature point extraction for multiple targets during the assembly process.

5.2. Verification Experiment for 3D Spatial Position Detection Accuracy

Visual system calibration was conducted on the micro-assembly platform mounted on an optical vibration isolation stage. This configuration eliminated dynamic variations in the robotic arm and vision system caused by external vibrations. The vision system used a fixed-magnification microscope lens under constant-intensity illumination, providing a stable detection environment for calibration. Under these conditions, the motion of the robotic arm and the camera focus axis can be regarded as linearly related to the corresponding measured values.
We performed comparative experiments to validate the effectiveness of the compensation method. First, the image Jacobian matrix was calibrated using an active robotic manipulation-based approach, which offers excellent precision and stability for fixed structural platforms. Eight arbitrary points were selected within the workspace, with displacements between consecutive points not exceeding 500 μm along each axis. Within this range, the target on the end-effector appeared partially defocused in the image, while its contour features retained sufficient clarity for segmentation. Given that the linear stage has a resolution better than 0.5 μm and a repeatability within ±1 μm, its displacement readings were used as the ground truth. The image Jacobian matrix was then calculated using the image coordinates and the linear stage’s motion data.
Subsequently, the cross-depth-of-field image Jacobian matrix was calibrated using a similar procedure, with a key modification: prior to calibration, the vision system was actively translated along its focusing axis for ten reciprocating cycles with a step size of 100 μm. Images were captured to record the gradual defocusing, and the image offset matrix JB was fitted from the relationship between image feature point offsets and the focus displacement. During the active motion of the end-effector for calibration, if defocusing occurred, the vision system was refocused to maintain clear target contours. Finally, following the error compensation principle, the cross-depth-of-field Jacobian matrix was computed via the least squares method.
Following the cross-depth-of-field image Jacobian calibration, we validated the method by selecting five arbitrary spatial points. At each point, focused images were captured and the pixel coordinates of the micro-part’s contour features were detected. The spatial positions were then calculated with both the active motion method and the cross-depth-of-field compensation calibration method. A quantitative comparison of the detection accuracy of the two methods was performed by evaluating their calculated positions against the corresponding ground-truth coordinates. The dual-channel microscopic vision system of the platform was calibrated using the following methods: the active motion method yielded the image Jacobian matrix, and the error compensation method produced the image offset matrix and the cross-depth-of-field image Jacobian matrix. The corresponding calibration results are summarized in Table 4.
A comparison of detection accuracy between the proposed error-compensated calibration method and the active motion method is presented in Figure 11. The results show that the root mean squared error (RMSE) for the compensated method is below 6.5 μm, whereas the conventional method yields an RMSE below 9 μm. The error compensation reduces the detection error by approximately 2.5 μm, confirming an improvement in accuracy. This result demonstrates the feasibility of the proposed error compensation method for cross-depth-of-field image Jacobian matrix calibration.
To assess the influence of the focusing displacement on the final positioning accuracy during the calibration of the image offset matrix with the proposed compensation method, the image Jacobian matrix JB was calibrated using the following two different step distances: 50 μm and 150 μm. The corresponding calibration results are presented in Table 5.
At the same set of five coordinate points, the root mean square error after compensation remains below 6.2 μm with a 50 μm focusing step and below 7 μm with a 150 μm step. The analysis indicates that the magnitude of error compensation increases with larger focus steps. Compared to the calibration accuracy achieved by the active motion method, improvements are observed across all three step sizes. Moreover, within a hundred-micrometer focal length step variation range, the change in error is kept within 0.8 μm, demonstrating the robustness of the proposed method.

6. Conclusions

This paper addresses the challenges in spatial position detection for micro image sensor modules, such as missed detections, false positives, and large cross-depth errors caused by partially occluded features under random initial states. This study presents SO-YOLO11-CDP, a method that integrates the instance segmentation algorithm SO-YOLO11 with cross-depth-of-field calibration. The SO-YOLO11 algorithm is employed to segment the contours of small objects, with feature points accurately extracted through sub-pixel fitting. Furthermore, an error compensation model and calibration workflow for the image Jacobian matrix are introduced for microscopic vision systems, reducing errors induced by repeated refocusing to enable high-precision detection.
Ablation experiments show that SO-YOLO11 outperforms the baseline YOLO11, with improvements of 16.1% in precision, 4.0% in recall, 9.9% in mAP0.5, and 5.8% in mAP0.5:0.95. Further precision validation experiments demonstrate that, after compensation, the vision system reduces the root mean squared error of 3D spatial detection to 6.5 μm, a 2.5 μm improvement over the active motion method, confirming its superior accuracy. The proposed method holds practical value for automated assembly, providing robust support for high-precision, vision-guided micro-assembly tasks.
Despite the significant progress achieved by the proposed method for visual inspection in micro-assembly, certain limitations remain. Specifically, in complex environments, the algorithm may fail to accurately locate contours and extract features from targets that are slightly defocused or severely occluded, leading to detection failures. Future work should therefore focus on enhancing the model’s robustness to defocused images. This will require augmenting the dataset with high-quality images representing complex assembly scenarios. Furthermore, we plan to investigate the application of super-resolution reconstruction algorithms during the image preprocessing stage to obtain clearer target images, thereby improving detection accuracy and efficiency.

Author Contributions

Methodology, validation, and writing—original draft preparation, X.L.; writing—review and editing, J.Z.; software, Y.Y.; investigation, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Academy of Engineering Physics Dean’s Fund Self-Enhancement Program, grant number YZJJZQ2023021.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to an ongoing study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tian, W.; Ding, Y.; Du, X.; Li, K.; Wang, Z.; Wang, C.; Deng, C.; Liao, W. A Review of Intelligent Assembly Technology of Small Electronic Equipment. Micromachines 2023, 14, 1126. [Google Scholar] [CrossRef] [PubMed]
  2. Yan, S.; Xu, D.; Tao, X. Hierarchical Policy Learning with Demonstration Learning for Robotic Multiple Peg-in-Hole Assembly Tasks. IEEE Trans. Ind. Inform. 2023, 19, 10254–10264. [Google Scholar] [CrossRef]
  3. Yao, S.; Li, H.; Pang, S.; Zhu, B.; Zhang, X.; Fatikow, S. A Review of Computer Microvision-Based Precision Motion Measurement: Principles, Characteristics, and Applications. IEEE Trans. Instrum. Meas. 2021, 70, 5007928. [Google Scholar] [CrossRef]
  4. Hussain, M. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9905. [Google Scholar] [CrossRef]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 378–385. [Google Scholar]
  9. Gong, X.; Su, H.; Xu, D.; Zhang, J.; Zhang, L.; Zhang, Z. Visual Defect Inspection for Deep-Aperture Components With Coarse-to-Fine Contour Extraction. IEEE Trans. Instrum. Meas. 2020, 69, 3262–3274. [Google Scholar] [CrossRef]
  10. Fontana, G.; Calabrese, M.; Agnusdei, L.; Papadia, G.; Del Prete, A. SolDef AI: An Open Source PCB Dataset for Mask R-CNN Defect Detection in Soldering Processes of Electronic Components. J. Manuf. Mater. Process. 2024, 8, 117. [Google Scholar]
  11. Xu, Z.; Zhao, X.; Wang, X.; Kong, Y.; Ren, T.; Wang, Y. Interpretability Analysis and Attention Mechanism of Deep Learning-Based Microscopic Vision. Preprints 2024. [Google Scholar] [CrossRef]
  12. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C.-W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  13. Yan, S.; Tao, X.; Xu, D. High-precision robotic assembly system using three-dimensional vision. Int. J. Adv. Robot. Syst. 2021, 18, 17298814211027029. [Google Scholar] [CrossRef]
  14. Liu, G.; Kang, S.; Li, T.; Xu, Y.; Zhou, J. Deep-Learning-Based Real-Time Visual Detection Method for Robotic Cell Microinjection System With High Accuracy and Efficiency. J. Robot. 2025, 2025, 8556780. [Google Scholar] [CrossRef]
  15. Liu, H.; Yu, H.; Ding, K.; Li, X. Two-Stage Visual Detection of Micro-Objects: Theory and Experiments. In Proceedings of the 2023 IEEE International Conference on Real-time Computing and Robotics (RCAR), Datong, China, 17–20 July 2023; pp. 859–863. [Google Scholar] [CrossRef]
  16. Yao, S.; Zhang, X.; Fatikow, S. Marker-assisted cross-scale measurement for robotic macro–micro manipulation utilizing computer microvision. Measurement 2024, 235, 114908. [Google Scholar] [CrossRef]
  17. Zhang, L.; Zhang, X.; Li, H.; Zhang, F. Development and calibration of a cross-scale multi-camera vision system for 3D micro-assembly. IEEE Sens. J. 2024, 25, 5394–5404. [Google Scholar] [CrossRef]
  18. Vismanis, O.; Arents, J.; Subačiūtė-Žemaitienė, J.; Bučinskas, V.; Dzedzickis, A.; Patel, B.; Tung, W.-C.; Lin, P.-T.; Greitans, M. A vision-based micro-manipulation system. Appl. Sci. 2023, 13, 13248. [Google Scholar] [CrossRef]
  19. Fan, Q.; Wu, Y.; Bi, K.; Liu, Y. Autonomous vision-guided two-arm collaborative microassembly using learned manipulation model. IEEE Robot. Autom. Lett. 2024, 9, 2375–2382. [Google Scholar] [CrossRef]
  20. Ming, J.; Bargmann, D.; Cao, H.; Caccamo, M. Flexible gear assembly with visual servoing and force feedback. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 8276–8282. [Google Scholar]
  21. Qin, F.; Xu, D.; Zhang, D.; Pei, W.; Han, X.; Yu, S. Automated hooking of biomedical microelectrode guided by intelligent microscopic vision. IEEE/ASME Trans. Mechatron. 2023, 28, 2786–2798. [Google Scholar] [CrossRef]
  22. Zhang, J.; Dai, X.; Wu, W.; Du, K. Micro-vision based high-precision space assembly approach for trans-scale micro-device: The CFTA example. Sensors 2023, 23, 450. [Google Scholar] [CrossRef]
  23. Cheng, J.; Wu, W.; Yang, Y.; Zhang, J. YOLACT in Micro-Assembly Robot System. In Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI’21), Sanya, China, 22–24 December 2021; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  24. Su, Q.; Li, H. YOLO-DST: MEMS Small-Object Defect Detection Method Based on Dynamic Channel–Spatial Modeling and Multi-Attention Fusion. Sensors 2026, 26, 369. [Google Scholar] [CrossRef]
  25. Markert, T.; Matich, S.; Neykov, D.; Muenig, M.; Theissler, A.; Atzmueller, M. Visual Detection of Tiny and Transparent Objects for Autonomous Robotic Pick-and-Place Operations. In Proceedings of the 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), Stuttgart, Germany, 6–9 September 2022; pp. 1–4. [Google Scholar] [CrossRef]
  26. Hao, Y.; Feng, J. An Industrial Micro Parts Recognition Technology Based on Improved Yolov8. J. Phys. Conf. Ser. 2024, 2872, 012012. [Google Scholar] [CrossRef]
  27. Guo, C.; Tan, F. SWRD–YOLO: A Lightweight Instance Segmentation Model for Estimating Rice Lodging Degree in UAV Remote Sensing Images with Real-Time Edge Deployment. Agriculture 2025, 15, 1570. [Google Scholar] [CrossRef]
  28. Yi, W.; Zhang, Z.; Chang, L. M4MLF-YOLO: A Lightweight Semantic Segmentation Framework for Spacecraft Component Recognition. Remote Sens. 2025, 17, 3144. [Google Scholar] [CrossRef]
  29. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-Object-Detection Algorithm Based on Improved YOLOv8 for UAV Images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
  30. Jegham, N.; Koh, C.Y.; Abdelatti, M.; Hendawi, A. Yolo evolution: A comprehensive benchmark and architectural review of yolov12, yolo11, and their previous versions. arXiv 2024, arXiv:2411.00201. [Google Scholar]
  31. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects; Amini, M.R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13715, pp. 443–459. [Google Scholar]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2022, arXiv:2110.13389. [Google Scholar] [CrossRef]
  34. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Figure 1. Assembly task and visual system structure diagram. (a) Dimension and structure of components. (b) Assembly process. (c) Visual system structure design.
Figure 2. Framework for SO-YOLO11-CDP.
Figure 3. SO-YOLO11 network architecture.
Figure 4. Source of image feature point offset errors.
Figure 5. Calibration workflow chart.
Figure 6. Pinhole imaging principle.
Figure 7. Component images during assembly.
Figure 8. K-fold cross-validation results.
Figure 9. Visual comparison of segmentation results. (a) Initial state image. (b) Horizontal camera image. (c) Grasp inner conductor images.
Figure 10. Subpixel feature point extraction results during assembly.
Figure 11. Spatial position detection error. (a) Error compensation method. (b) Active motion method.
Table 1. Hyperparameter settings.
Hyperparameter | Value
Optimizer | SGD
Initial learning rate lr0 | 0.01
Final learning rate lrf | 0.01
Training epochs | 500
Early stopping (patience) | 30
Weight decay | 0.005
Dropout | 0.2
Mosaic | 1.0
Mixup | 0.2
Copy-Paste | 0.5
Table 2. Ablation experiment results.
Model | CA | SPD-Conv | NWD-EIoU | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5:0.95 (%)
Baseline | × | × | × | 57.6 | 56.4 | 58.2 | 29.6
A | √ | × | × | 63.2 | 57.9 | 60.2 | 31.3
B | × | √ | × | 58.5 | 53.6 | 55.8 | 29.0
C | √ | √ | × | 66.8 | 56.9 | 59.1 | 32.1
D (Ours) | √ | √ | √ | 73.7 | 60.4 | 68.1 | 35.4
Table 3. Performance comparison of different models.
Model | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5:0.95 (%)
RT-DETR-ResNet50 | 59.9 | 54.6 | 54.4 | 27.8
YOLOv5s-seg | 55.7 | 47.5 | 45.7 | 22.6
YOLOv8s-seg | 64.7 | 52.2 | 54.3 | 28.5
YOLOv9-seg | 47.2 | 46.5 | 45.4 | 24.4
YOLO11m-seg | 57.6 | 56.4 | 58.2 | 29.6
YOLO12m-seg | 47.5 | 50.6 | 49.0 | 23.8
SO-YOLO11 | 73.7 | 60.4 | 68.1 | 35.4
Table 4. Calibration results.
Image Jacobian matrix (active movement method):
  Horizontal vision: [0.0644, 1.9522, 0.0554; 0.0038, 0.0590, 1.5554]
  Vertical vision: [0.9578, 0.0768, 0.0180; 0.1071, 1.2388, 0.3289]
Image offset matrix (100 μm pitch):
  Horizontal vision: [0.0275; 0.0169]
  Vertical vision: [0.0046; 0.4350]
Cross-depth-of-field image Jacobian matrix (error compensation method):
  Horizontal vision: [0.0812, 1.9629, 0.0447; 0.0215, 0.0505, 1.5469]
  Vertical vision: [0.9720, 0.0939, 0.0358; 0.1769, 1.3598, 0.3330]
Table 5. Calibration results at different step lengths.
Image offset matrix (50 μm step):
  Horizontal vision: [0.0330; 0.0014]
  Vertical vision: [0.0044; 0.4274]
Image offset matrix (150 μm step):
  Horizontal vision: [0.0362; 0.0072]
  Vertical vision: [0.0133; 0.4349]
Cross-depth-of-field image Jacobian matrix (50 μm step):
  Horizontal vision: [0.0866, 1.9629, 0.0447; 0.0061, 0.0505, 1.5469]
  Vertical vision: [0.9719, 0.0940, 0.0359; 0.1726, 1.3641, 0.3287]
Cross-depth-of-field image Jacobian matrix (150 μm step):
  Horizontal vision: [0.0899, 1.9629, 0.0447; 0.0025, 0.0505, 1.5469]
  Vertical vision: [0.9769, 0.0890, 0.0309; 0.1769, 1.3598, 0.3330]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
