A Deep Learning-Based Approach for Extraction of Positioning Feature Points in Lifting Holes

Abstract: Due to uncontrollable influences of the manufacturing process and differing construction environments, extracting accurate positioning points for the lifting holes in prefabricated beams poses significant challenges. In this study, we propose a two-stage feature detection method, which comprises the ADD (multi-Attention DASPP DeepLabV3+) model and the VLFGM (Voting mechanism line fitting based on Gaussian mixture model) method. First, the YoloV5s model is employed for coarse image localization to reduce the impact of background noise, and the ADD model then segments the target region. Within ADD, a multi-step ECA mechanism is introduced to mitigate the loss of features of interest in the pooling layer of the backbone while retaining the details of the original features, and DASPP is adopted to fuse features at different scales and enhance the correlation of features among channels. Finally, VLFGM is used to reduce the dependency of positioning accuracy on the segmentation results. The experimental results demonstrate that the proposed model achieves a mean intersection over union (mIoU) of 95.07%, a 3.48% improvement, and a mean pixel accuracy (mPA) of 99.16% on the validation set. The improved method reduces the vertex error by 30.00% (to 5.39 pixels) and the centroid error by 28.93% (to 1.72 pixels), exhibiting superior stability and accuracy. This paper provides a reliable solution for the visual positioning of prefabricated beams in complex environments.


Introduction
Prefabricated components offer advantages such as a shorter construction time, lower costs, and better quality. As a result, prefabricated construction is gradually becoming the direction of development in the construction industry [1]. Prefabricated components are often transported through assembly and disassembly methods. However, traditional manual installation and dismantling methods for lifting equipment are unproductive and pose safety risks [2], which fail to meet the requirements in engineering. With the rise of smart construction, machine vision and image processing are being introduced into construction operations [3]. Non-contact approaches not only enable precise positioning of targets and improve the efficiency of many operations [4], but also reduce potential safety risks, providing greater opportunities for the sustainable development of prefabricated transportation operations.
Due to variations in scenes and operation methods, the strategies for visual alignment also differ, including robot-guided alignment [5], predicting component motion trajectories [6], and assessing assembly progress through landmark perception [7]. When evaluating the spatial orientation of prefabricated components, binocular vision (BV) is often employed [8,9]. This method utilizes feature matching points to obtain the object's spatial information, such as spatial coordinates and angles, and adjusts the translation feed

• The lightweight YOLOv5s model is used for rough positioning of the lifting holes, reducing the impact of sample imbalance on pixel classification results.
• The ADD model is developed by improving the ResNet50 backbone network and introducing multi-stage attention mechanism and additional shallow channels. This model alleviates the influence of lighting conditions, ensures the integrity of complex image segmentation, and achieves higher accuracy.
• In order to mitigate the reliance on segmentation results, VLFGM is proposed to evaluate the coordinates of alignment points based on line features, and has better robustness and accuracy.
geometric shape of the rectangular hole, using the vertices of the fitted shape as feature points for image matching. Additionally, to achieve sufficient adjustment by utilizing an adequate number of points, we calculate the centroid coordinates of the lifting holes, which hold significant importance and can even be applied on depth images. The workflow of this study is shown in Figure 1.

Figure 1. Flow chart of this study.

The YoloV5s Model for Hole Detection
Although the Yolo series has been updated to version 8, YoloV5 [23] is still capable of handling many detection tasks. In this scenario, moreover, the network model needs to be deployed on a hoisting device. We compared several existing models, as shown in Table 1. YoloV5s outperforms EfficientDet in both accuracy and prediction speed (FPS). Despite its notable speed advantage, YoloV7-tiny requires a substantial amount of graphics memory during training, approximately 9.8 GB. On the same dataset, training YoloV7-tiny takes three to four times as long as training YoloV5s, which implies a higher resource cost for model training and refinement. While YoloXs marginally surpasses YoloV5s in mAP0.50:0.95, its greater complexity leaves it 11 frames behind YoloV5s in processing speed. YoloV8s does not show a significant gain in accuracy, and its larger number of parameters demands more computational resources. Weighing these results, we selected YoloV5s for its small parameter count and its balance between processing speed and precision.
The YoloV5s feature extraction network consists of the Backbone, Neck, and Head components, as shown in Figure 2. First, Mosaic data augmentation is applied at the input stage, adaptively scaling and cropping the input images. The Focus module transforms high-resolution images into multiple low-resolution feature maps, capturing feature information of different sizes. Additionally, the CSP (cross-stage partial network) structure enables local feature connections across stages, effectively reducing computation and parameters. The SPP (spatial pyramid pooling) structure captures semantic information at different scales, facilitating multi-scale fusion and improving the network's ability to detect objects of different sizes.
The Neck layer consists of the FPN (feature pyramid network) and PAN (path aggregation network). The FPN propagates deep semantic features through a top-down pathway with upsampling and fuses them, via lateral connections, with the shallower features to form a multi-scale feature pyramid [24]. The PAN then aggregates features layer by layer along a bottom-up path starting from the shallow layers, and each feature map is fused with a high-resolution shallow feature map to preserve detailed features [25]. The Head layer predicts the location features and class information of the hoisting holes from the feature maps output by the Neck layer.
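The slicing performed by the Focus module can be illustrated with a minimal sketch. It assumes the stride-2 interleaved sampling used in the public YOLOv5 code; `focus_slice` is a hypothetical helper, not part of any library, and it works on a single channel stored as nested lists rather than on a real tensor.

```python
def focus_slice(channel):
    """Split one H x W channel into four (H/2) x (W/2) sub-maps.

    Each sub-map keeps every second pixel, starting from a different
    row/column offset, so no pixel is discarded: spatial resolution is
    traded for a fourfold increase in the number of channels.
    """
    if len(channel) % 2 or len(channel[0]) % 2:
        raise ValueError("height and width must be even")
    return [
        [row[0::2] for row in channel[0::2]],  # even rows, even columns
        [row[0::2] for row in channel[1::2]],  # odd rows, even columns
        [row[1::2] for row in channel[0::2]],  # even rows, odd columns
        [row[1::2] for row in channel[1::2]],  # odd rows, odd columns
    ]


# Illustrative 4 x 4 input with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
sub_maps = focus_slice(image)
```

For a three-channel image, the same slicing would yield twelve half-resolution channels, which are then passed to a convolution.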

Figure 2. The architecture of the YoloV5s.

Overall Structure
Despite being introduced in 2018, DeepLabV3+ [10] continues to rank among the best-performing models on the PASCAL VOC 2012 test dataset. The model offers strong overall capability and scalability: it uses dilated convolutions to enlarge the receptive field, depthwise separable convolutions to reduce model parameters, multi-scale feature fusion to extract semantic information from different layers, and the combination of shallow and deep information to balance original and high-level features. ADD is therefore built as a semantic segmentation model on an improved DeepLabV3+. The main optimizations are: using three ECA modules to strengthen the focus on features of interest, replacing the ASPP (atrous spatial pyramid pooling) module with DASPP (dense connection atrous spatial pyramid pooling), and adding a shallow feature branch. The network structure of ADD is shown in Figure 3.
The proposed model retains the encoder-decoder structure and uses ResNet50 as the backbone. In the encoding phase, the input image is first fed into the ResNet50-ECA residual network for feature extraction. The features output by the backbone network are of two types: shallow semantic features with 256 channels from the first layer and deep features from the last layer. The shallow features are passed directly to the decoder, while the deep features undergo ECA (efficient channel attention) weight assignment and further feature extraction through the DASPP module. The DASPP module performs parallel feature extraction, comprising a 1 × 1 convolution, dilated convolutions with dilation rates of 6, 12, and 18, and global average pooling, converting the 2048-channel backbone output into five 256-channel feature tensors. The DASPP module employs dense connections for each convolutional branch, and the five differently processed feature tensors are concatenated to achieve multi-scale feature fusion. Finally, a 1 × 1 convolution reduces the number of feature channels to 256, yielding the high-level feature.
In the feature decoding phase, the input features consist of the original low-level feature 1, the ECA-processed low-level feature 2, and the high-level feature. Adding the new low-level feature 2 has almost no impact on the network's parameter count and provides a better weight distribution for the shallow features, which better preserves the original information. Both shallow features undergo a 1 × 1 convolution for feature dimension reduction. The deep semantic features are first upsampled by a factor of 4, bringing the feature map to 1/4 of the original image size. The three feature tensors are then fused by addition, resulting in a 256-channel feature map. The fused features need further refinement, so a 3 × 3 convolution is applied, which changes the channel number and avoids aliasing between the original features and the upsampled features [26,27]. Finally, the convolved feature map is upsampled by a factor of 4 to restore the original size and then classified pixel by pixel to obtain the predicted image.
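The decoding steps can be checked with a small shape-bookkeeping sketch. It is not the authors' implementation: the strides (4 for the low-level features, 16 for the deep features) are inferred from the two fourfold upsampling steps, and the assumption that the 3 × 3 convolution maps 256 channels to the number of classes is ours. `add_decoder_shapes` is a hypothetical helper.

```python
def add_decoder_shapes(height, width, num_classes, low_stride=4, deep_stride=16):
    """Trace (channels, height, width) through the described decoding steps.

    Assumptions (not stated in the text): strides of 4 and 16, and a
    final 3 x 3 convolution that maps 256 channels to `num_classes`.
    """
    if height % deep_stride or width % deep_stride:
        raise ValueError("height and width must be multiples of deep_stride")

    # Low-level features 1 and 2 after their 1 x 1 convolutions.
    low = (256, height // low_stride, width // low_stride)
    # High-level feature produced by DASPP and its 1 x 1 convolution.
    high = (256, height // deep_stride, width // deep_stride)

    # First upsampling: bring the high-level feature to the low-level size.
    scale = deep_stride // low_stride
    high_up = (high[0], high[1] * scale, high[2] * scale)
    if high_up != low:
        raise ValueError("features must share one shape before addition")

    fused = low  # element-wise addition leaves the shape unchanged
    refined = (num_classes, fused[1], fused[2])  # 3 x 3 convolution
    # Second upsampling: restore the input resolution.
    output = (num_classes, refined[1] * low_stride, refined[2] * low_stride)
    return {"low": low, "high": high, "fused": fused, "output": output}


shapes = add_decoder_shapes(512, 512, num_classes=2)
```

With a 512 × 512 input, the fused map is 256 × 128 × 128 and the prediction returns to 512 × 512, which is why fusion by addition requires all three tensors to have 256 channels.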

ResNet50-ECA
When selecting the backbone, we also considered lightweight networks, as in [28]. However, MobileNet did not perform well on our dataset, so ResNet50 was chosen instead for its stronger feature extraction capabilities. ResNet50 [29] introduces shortcut connections into the DCNN (deep convolutional neural network) structure and uses residual blocks (ResBlocks) as its main building blocks. Each ResBlock contains two types of bottleneck structures, as illustrated in Figure 4. For a given input x, the outputs y of BottleNeck1 and BottleNeck2 are, respectively:
$$y = F(x, \{W\}) + Q(x, \{W_q\})$$
$$y = F(x, \{W\}) + x$$
where F(x, {W}) represents the feature maps obtained from the three convolution operations in the main branch, and {W} denotes the convolution weights. For BottleNeck1, Q(x, {W_q}) represents the feature maps from the shortcut branch, where {W_q} denotes the weights of its 1 × 1 convolution.
Different bottleneck blocks serve different purposes, as shown in Figure 5a. The main function of BottleNeck1 is to alter the dimensionality and channel size of the feature map using a linear function, matching the feature map size with the input dimensions for subsequent operations. This enables the extraction of deeper features through multiple convolutional operations. BottleNeck2, on the other hand, is primarily used to maintain the input feature map's dimensionality and channel size. It achieves this by employing a shortcut connection that directly adds the input x to the main-branch output to give y, which helps prevent the vanishing and exploding gradient problems when training deep networks. In ResNet50, each ResBlock incorporates a different number of shortcut connections, as indicated in Table 2. Since BottleNeck1 alters the shape of the feature map, it is generally not used sequentially, whereas BottleNeck2, which preserves the tensor's shape, can be stacked.
Conventional DCNN modules treat the semantic information of every channel as equally important, which is undesirable when specific features of interest should be emphasized. The ECA mechanism [30] addresses this issue by computing the global average of the convolutional layer outputs to capture channel-wise contributions, enabling channel feature compression. The channel weights obtained from the sigmoid function are then multiplied with the original outputs, enhancing useful features while suppressing irrelevant ones, as illustrated in Figure 5b. This approach facilitates the extraction of features of interest. ECA is a lightweight attention mechanism for convolutional neural networks. Unlike the traditional SE mechanism, it does not require computing the correlations between global positions, thus reducing computational and parameter overhead.
In Figure 5b, H, W, and C represent the height, width, and number of channels of the image, while X and $\tilde{X}$ denote the input and output, respectively. GAP stands for global average pooling, and σ is the activation function that modulates and outputs the weight w for each channel. The range of cross-channel information interaction, denoted as k, is the size of the one-dimensional convolutional kernel. It is calculated as
$$k = \begin{cases} t, & \text{if } t \text{ is odd} \\ t + 1, & \text{if } t \text{ is even} \end{cases} \qquad t = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|$$
where C is the number of channels, which is 2048 in ResNet50. Generally, γ and b are set to 2 and 1, respectively; substituting these values gives t = 6 and hence a convolution kernel size of 7.
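The kernel-size rule and the ECA weighting steps (global average pooling, one-dimensional convolution across channels, sigmoid, channel-wise scaling) can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: `eca_kernel_size` and `eca_forward` are hypothetical names, the feature maps are nested lists instead of tensors, zero padding at the channel boundaries is an assumption, and the convolution weights, which are learned in practice, are passed in by hand.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: nearest odd number derived from log2(C)."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


def eca_forward(feature_maps, conv_weights):
    """Apply ECA to C feature maps, each an H x W nested list.

    conv_weights holds the k weights of the 1D convolution (k odd).
    Returns the re-weighted feature maps and the per-channel weights.
    """
    k = len(conv_weights)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    pad = k // 2

    # Global average pooling: one scalar descriptor per channel.
    pooled = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_maps
    ]
    # 1D convolution over the channel axis (zero padding is an assumption).
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(conv_weights[j] * padded[i + j] for j in range(k))
        for i in range(len(pooled))
    ]
    # Sigmoid turns the mixed descriptors into channel weights in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # Scale every channel of the input by its weight.
    scaled = [
        [[w * v for v in row] for row in ch]
        for ch, w in zip(feature_maps, weights)
    ]
    return scaled, weights


kernel = eca_kernel_size(2048)  # 7 for the 2048-channel ResNet50 output
maps = [[[2.0, 4.0]], [[6.0, 8.0]]]  # two toy channels of size 1 x 2
scaled, weights = eca_forward(maps, [0.0, 0.0, 0.0])
```

With all-zero convolution weights, every channel weight equals 0.5, which makes the scaling step easy to verify by hand.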
In the overall structure of ADD, as shown in Figure 3, the ECA mechanism is incorporated after the output of ResNet50 and at the low-level feature 2. Placed between layer 4 of ResNet50 and global average pooling, the ECA module enhances the network's capability to express local features and expands the receptive field, thereby facilitating the capture of more diverse feature information. A fully connected layer commonly compresses feature tensors into lower dimensions, which can result in information loss. Introducing the ECA module after the output of the fully connected layer alleviates the information loss caused by this feature compression (cascaded ECA mechanism). This approach can bring significant benefits, especially in tasks that require accurate modeling of global features.
Additionally, in the context of image semantic segmentation tasks, low-level feature maps play a crucial role as they contain the original details and texture information of the image [31]. These feature maps assist the model in identifying object boundaries and shapes. By applying the ECA module, adaptive channel weighting is performed on the feature maps, enhancing the correlation between each channel. This enables better exploration of fine details and texture information, leading to improved recognition of object boundaries and shapes. The concatenation of these three types of feature maps allows for the comprehensive utilization of their respective advantages, thereby enhancing the performance of the model.

DASPP
DeepLabV3+ employs the ASPP module, which applies dilated convolutions with different dilation rates in parallel and concatenates their output feature maps to achieve a larger receptive field. The ASPP module addresses the trade-off between resolution and receptive field by employing dilated convolutions, which enlarge the receptive field without increasing the number of convolutional parameters and thus encode high-level semantic information. The receptive field size R of a dilated convolution in the ASPP module is calculated as follows:
$$R = (r - 1)(k - 1) + k$$
where r represents the dilation rate of the atrous convolution, and k indicates the size of the convolutional kernel. The equation shows that, with other factors held constant, R increases with the dilation rate. In DeepLabV3+, three dilation rates, 6, 12, and 18, are employed with 3 × 3 convolutional kernels. The corresponding receptive fields are as follows:
$$R_{6} = (6 - 1)(3 - 1) + 3 = 13, \quad R_{12} = (12 - 1)(3 - 1) + 3 = 25, \quad R_{18} = (18 - 1)(3 - 1) + 3 = 37$$
By running three atrous convolution layers with different dilation rates in parallel, ASPP processes the same input feature and combines the results, so that the output features are a multi-scale sampling of the input. A wider receptive field requires a larger dilation rate. However, as the dilation rate increases, the sampling becomes sparser than in traditional convolution, and more detailed information is lost. At this point, the effectiveness of dilated convolutions diminishes.
The DASPP module addresses this issue by cascading multiple atrous convolution layers and densely connecting the output of each atrous convolution layer to subsequent ones. This cascading approach combines the advantages of both serial and parallel atrous convolution layers, enabling the generation of feature representations with more scales. As the dilation rate increases layer by layer, the output of each atrous convolution layer is combined with its input and the outputs of other layers, facilitating the fusion of multi-scale features and enlarging the receptive field [32].
The original dilated convolution layers are retained and cascaded in this study, combined with 1 × 1 convolutions to jointly encode and generate a denser feature pyramid. Furthermore, the concatenation of two dilated convolution layers enables the creation of a larger receptive field. Assuming that the receptive field sizes of the two dilated convolutions are R1 and R2, the resulting receptive field after concatenating these two dilated convolutions is illustrated in Figure 6.
From Figure 6, it can be observed that the receptive field of two stacked dilated convolution kernels with r = 2 and k = 3 becomes larger and denser. The receptive field of the stacked layers is calculated as follows:
$$R = R_1 + R_2 - 1$$
It should be noted that in DASPP, the 1 × 1 convolution does not affect the size of the receptive field. Writing $R_r$ for the receptive field of a 3 × 3 dilated convolution with dilation rate r (13, 25, and 37 for r = 6, 12, and 18), the maximum receptive fields of ASPP and DASPP are:
$$\text{ASPP}: \; R_{max} = \max(R_{6}, R_{12}, R_{18}) = 37$$
$$\text{DASPP}: \; R_{max} = R_{6} + R_{12} + R_{18} - 2 = 73$$
The comparison of the maximum receptive fields shows that the DASPP module provides sufficient contextual information for pixel-level classification of the same image, which enables better segmentation details.
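The receptive-field arithmetic can be reproduced with a short sketch. It assumes the standard relation R = (r - 1)(k - 1) + k for a single dilated convolution and R = R1 + R2 - 1 for two stacked layers; the function names are hypothetical, and the DASPP figure assumes that the three dilated layers with rates 6, 12, and 18 are cascaded in sequence.

```python
def dilated_rf(rate, kernel=3):
    """Receptive field of one dilated convolution: (r - 1)(k - 1) + k."""
    return (rate - 1) * (kernel - 1) + kernel


def stacked_rf(fields):
    """Receptive field of layers applied in sequence.

    Each additional layer extends the field by its own size minus the
    one pixel it shares with the previous layer.
    """
    return sum(fields) - (len(fields) - 1)


rates = (6, 12, 18)
fields = [dilated_rf(r) for r in rates]

aspp_max = max(fields)         # parallel branches: the largest one dominates
daspp_max = stacked_rf(fields)  # cascaded branches: the fields accumulate

# The two-layer case with r = 2 and k = 3, as in the stacked example.
two_layer = stacked_rf([dilated_rf(2), dilated_rf(2)])
```

Under these assumptions the parallel design is capped by its largest branch, while cascading roughly doubles the maximum receptive field without adding larger dilation rates.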

Existing Problems and Solutions
Corner detection can extract feature points, but it is difficult to obtain the centroid points required for localization (as shown in Figure 7a). Edge detection, on the other hand, is susceptible to texture interference (Figure 7b), which makes extracting the points of interest challenging. Pixel-level segmentation needs to classify a large number of pixels, and because of the limitations of model performance, the segmented region may in some situations be larger or smaller than the true region. To demonstrate clearly how the segmentation results influence the centroid, we use an exaggerated illustration, shown in Figure 7c. Obvious misclassified pixel areas within the target region can result in missing vertices or centroid offset. Therefore, to reduce the interference of such special cases, we abandoned the approach of extracting vertices from edges or calculating centroids from connected regions. This improvement makes it possible to ignore boundary protrusions or depressions, reducing their impact on the corner points and centroid calculation of the lifting hole.
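The effect of a misclassified area on a connected-region centroid can be made concrete with a toy example. The sketch below is purely illustrative: the rectangle and protrusion sizes are invented, and `rectangle_pixels` and `region_centroid` are hypothetical helpers.

```python
def rectangle_pixels(x0, y0, width, height):
    """Set of integer pixel coordinates covered by an axis-aligned rectangle."""
    return {(x, y) for x in range(x0, x0 + width) for y in range(y0, y0 + height)}


def region_centroid(pixels):
    """Centroid of a pixel region: the mean of all pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


# Ideal 40 x 20 hole mask and a 10 x 10 misclassified blob on its right edge.
hole = rectangle_pixels(0, 0, 40, 20)
protrusion = rectangle_pixels(40, 5, 10, 10)

clean = region_centroid(hole)                # (19.5, 9.5)
biased = region_centroid(hole | protrusion)  # pulled toward the blob
shift = biased[0] - clean[0]                 # about 2.78 pixels in x
```

A blob covering only one ninth of the combined region already moves the centroid by almost three pixels, which is larger than the 1.72-pixel centroid error reported for the improved method.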


Clustering Method for Boundary Points
Different situations call for different methods. If the target region is larger (Figure 7a), it is preferable to fit the principal components of the boundary points. This allows us to disregard the protruding portion and use only the points within the normal range. Because of manufacturing processes and limited segmentation performance, the boundary points of the segmentation result may instead resemble the situations depicted in Figures 7c and 8b, or even more severe cases. In such cases, fitting the convex hull points of the contour provides better results.
Whether a contour is convex (CX) or concave (CA) can typically be determined with the vector cross-product method, which yields the number of non-convex points and scales well. The method computes the normal vector at a vertex as the cross-product of its two adjacent edge vectors, and the normal vector of the polygon is then computed counterclockwise. If the normal vector at a specific vertex points opposite to the polygon's normal vector, the vertex is considered concave, as shown in Figure 8a. The specific calculation method is as follows:
(1) For an $N$-sided shape, select any point $P$ as the starting point and, proceeding counterclockwise, construct an edge vector $V_k$ from the endpoints of each edge, where $k \le N$.
(2) For any three adjacent points $P_m$, $P_t$, $P_n$, with corresponding edge vectors $V_m$, $V_n$, if $V_m \times V_n \ge 0$, $P_t$ is a non-concave point; otherwise, it is a concave point.
The discriminative illustration is shown in Figure 8a, where concave and convex points have normal vectors with different rotational directions.
Even if the predicted region appears to be a convex polygon, there may still exist some concave points in certain details. Therefore, we conducted tests on 200 predicted images and found that when the threshold for the number of concave points is set to 3, the algorithm can perfectly determine whether to use boundary points fitting or convex hull points fitting.
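The cross-product test and the concave-point threshold can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the contour orientation is inferred from the signed area so that either traversal direction works, and the rule "at least 3 concave points selects convex hull fitting" is an assumption, since the text gives the threshold value but not the comparison operator or its direction.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def count_concave_vertices(polygon: List[Point]) -> int:
    """Count vertices whose turn direction opposes the polygon orientation."""
    n = len(polygon)
    if n < 4:
        return 0  # a triangle has no concave vertex

    # Twice the signed area (shoelace sum); its sign gives the traversal direction.
    twice_area = 0.0
    for k in range(n):
        x0, y0 = polygon[k]
        x1, y1 = polygon[(k + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    orientation = 1.0 if twice_area > 0 else -1.0

    concave = 0
    for t in range(n):
        p_m, p_t, p_n = polygon[t - 1], polygon[t], polygon[(t + 1) % n]
        v_m = (p_t[0] - p_m[0], p_t[1] - p_m[1])  # edge entering P_t
        v_n = (p_n[0] - p_t[0], p_n[1] - p_t[1])  # edge leaving P_t
        cross = v_m[0] * v_n[1] - v_m[1] * v_n[0]
        if cross * orientation < 0:  # turn opposes the polygon normal
            concave += 1
    return concave


def choose_fitting_points(polygon: List[Point], threshold: int = 3) -> str:
    """Assumed rule: many concave points -> fit the convex hull points."""
    if count_concave_vertices(polygon) >= threshold:
        return "convex_hull"
    return "boundary"


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(count_concave_vertices(square), count_concave_vertices(l_shape))
```

Collinear vertices (zero cross-product) are treated as non-concave, matching the $V_m \times V_n \ge 0$ condition in step (2).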
The boundary points are distributed in a strip shape in the two-dimensional space. For elongated samples with a flattened elliptical distribution (normal distribution ellipse), Gaussian mixture models are well suited for data clustering. The Gaussian mixture model (GMM) treats the sample points as a combination of multiple Gaussian distributions (GDs), where each GD corresponds to a cluster in the point set. Since GDs exhibit a bell-shaped curve in space, they adapt well to the clustering of strip-shaped point sets. For a multivariate random variable $x$ following a GD, the density is

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right),$$

where $\mu$ is the mean vector of dimension $d$, $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix, and $|\Sigma|$ represents the determinant of $\Sigma$.
In the GMM, each cluster is represented by a multivariate Gaussian distribution (GD), where $p(x_i \mid \theta_j) = p(x_i \mid \mu_j, \Sigma_j)$. Let $C$ represent the number of clusters. The purpose of the GMM is to estimate the unknown parameters $\Theta = \{w_1, \dots, w_C; \mu_1, \dots, \mu_C; \Sigma_1, \dots, \Sigma_C\}$ based on the observed dataset $X = \{x_1, \dots, x_N\}$. For any observed value $x_i$, let the set of parameters for the $j$th cluster be denoted as $\theta_j \in \Theta$, where $1 \le j \le C$. The probability of $x_i$ belonging to the $j$th cluster is modeled as $w_j$ ($\sum_j w_j = 1$), so each point $x_i$ follows a linear combination of $C$ GDs:

$$p(x_i \mid \Theta) = \sum_{j=1}^{C} w_j \, p(x_i \mid \theta_j).$$

The logarithm of the likelihood function, $L(\Theta)$, is used to evaluate the unknown parameters $\Theta$:

$$L(\Theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{C} w_j \, p(x_i \mid \theta_j).$$

$L(\Theta)$ can be maximized using the Expectation-Maximization (EM) algorithm [33], with the initialization of the parameters $w_j$. The posterior probability $\gamma_{ij}^{(s)}$ given $\Theta^{(s)}$ can be represented as

$$\gamma_{ij}^{(s)} = \frac{w_j^{(s)} \, p(x_i \mid \theta_j^{(s)})}{\sum_{k=1}^{C} w_k^{(s)} \, p(x_i \mid \theta_k^{(s)})}.$$

According to the assumption of independence, the next values of $w$, $\mu$, $\Sigma$ maximize the expectation function, which is defined as

$$Q(\Theta, \Theta^{(s)}) = \sum_{i=1}^{N} \sum_{j=1}^{C} \gamma_{ij}^{(s)} \log\left[w_j \, p(x_i \mid \theta_j)\right].$$

The recursive formulas for the parameters are as follows:

$$w_j^{(s+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}^{(s)}, \qquad \mu_j^{(s+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}^{(s)} x_i}{\sum_{i=1}^{N} \gamma_{ij}^{(s)}}, \qquad \Sigma_j^{(s+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}^{(s)} (x_i - \mu_j^{(s+1)})(x_i - \mu_j^{(s+1)})^{T}}{\sum_{i=1}^{N} \gamma_{ij}^{(s)}}.$$

The parameters in the above equations are updated sequentially until the change in $L(\Theta)$ falls below a certain threshold.

The performance of the GMM relies heavily on the density of sample points. If the points are too sparse or too dense, clustering errors can result. To address this, the distance between every two adjacent points was evaluated, and if the distance exceeded 12 pixels, an additional point was inserted at the midpoint of those two points. This method effectively controlled the density of points. The resulting clustering outcome is depicted in Figure 8c.
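The density-control step described above (inserting a midpoint when two adjacent boundary points are more than 12 pixels apart) can be sketched as below. The function name and the `closed` flag are hypothetical, and a single pass is assumed, because the text mentions one additional point per gap rather than repeated subdivision. The GMM clustering itself is not shown here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def densify_boundary(points: List[Point], max_gap: float = 12.0,
                     closed: bool = True) -> List[Point]:
    """Insert one midpoint between adjacent points farther apart than max_gap."""
    n = len(points)
    if n < 2:
        return list(points)

    result: List[Point] = []
    # For a closed contour the last point is also adjacent to the first one.
    last = n if closed else n - 1
    for k in range(last):
        p = points[k]
        q = points[(k + 1) % n]
        result.append(p)
        if math.hypot(q[0] - p[0], q[1] - p[1]) > max_gap:
            result.append(((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0))
    if not closed:
        result.append(points[-1])
    return result


contour = [(0.0, 0.0), (20.0, 0.0), (20.0, 5.0)]
print(densify_boundary(contour, closed=False))
```

Whether the original pipeline treats the contour as closed, or repeats the pass until every gap is below the limit, is not stated in the text; both are easy variations of this sketch.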

Feature Points Fitting
In certain cases, because of inadequate segmentation results, there may be some outliers within the given cluster groups, or even within the target region itself. Line fitting is sensitive to the distribution of points, and VLFGM is proposed to avoid the influence of such special cases. Its principle is as follows. For a given set of points $P = \{P_1, \dots, P_n\}$, two points $P_i = (x_i, y_i)$ and $P_j = (x_j, y_j)$ form a line $P_iP_j$, whose equation can be expressed as

$$y = Ax + B, \qquad A = \frac{y_j - y_i}{x_j - x_i + \varepsilon}, \qquad B = y_i - A x_i,$$

where $\varepsilon$ is a very small floating-point number that prevents division by zero, $A$ is the slope of the line, and $B$ is the intercept. The distance $D(q)$ from any other point $P_q = (x_q, y_q)$ in the set to the line can be calculated as

$$D(q) = \frac{|A x_q - y_q + B|}{\sqrt{A^2 + 1}}.$$

To select the principal component points directly and simply, each point in the set is assigned a weight based on its distance to the line. Points closer to the line are treated as more important reference points, so the fitted line is biased towards the points with higher weights, which contribute more votes. Specifically, scores are allocated to all points according to their distances: closer points receive a higher share of the score and farther points a lower one. This mapping is detailed in Table 3. The total vote count for the line $P_iP_j$ is the sum of the scores of all points. The line that receives the maximum number of votes has the majority of points close to it and is therefore taken as the optimal fitting line.

Once the four fitting lines are obtained, the coordinates of the four vertices follow from the intersections of adjacent lines. Let $(x_0, y_0)$ denote the starting vertex and $n$ the number of sides of the polygon, and number the vertices clockwise from 0 to $n - 1$.
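The voting scheme can be illustrated with a brute-force sketch over all point pairs. This is not the authors' code: the score thresholds in `SCORE_TABLE` are placeholders standing in for Table 3 (whose values are not reproduced in this section), the function names are hypothetical, and the two points that define a candidate line are excluded from its own vote, following the phrase "any other point".

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]

EPS = 1e-9  # small constant that prevents division by zero

# Placeholder mapping (max distance in pixels, score); NOT the values of Table 3.
SCORE_TABLE = [(1.0, 3), (2.0, 2), (4.0, 1)]


def score(distance: float) -> int:
    """Closer points receive higher scores; distant points receive none."""
    for limit, value in SCORE_TABLE:
        if distance <= limit:
            return value
    return 0


def line_through(p_i: Point, p_j: Point) -> Tuple[float, float]:
    """Slope A and intercept B of the line through two points."""
    a = (p_j[1] - p_i[1]) / (p_j[0] - p_i[0] + EPS)
    b = p_i[1] - a * p_i[0]
    return a, b


def point_line_distance(p: Point, a: float, b: float) -> float:
    return abs(a * p[0] - p[1] + b) / math.sqrt(a * a + 1.0)


def vote_fit(points: List[Point]) -> Tuple[float, float, int]:
    """Return (A, B, votes) of the candidate line with the most votes."""
    best = (0.0, 0.0, -1)
    for i, j in combinations(range(len(points)), 2):
        a, b = line_through(points[i], points[j])
        votes = sum(score(point_line_distance(p, a, b))
                    for q, p in enumerate(points) if q not in (i, j))
        if votes > best[2]:
            best = (a, b, votes)
    return best


# Six points on y = 2x + 1 plus one outlier.
cluster = [(float(x), 2.0 * x + 1.0) for x in range(0, 60, 10)] + [(25.0, 90.0)]
slope, intercept, votes = vote_fit(cluster)
print(round(slope, 6), round(intercept, 6), votes)
```

The exhaustive pair search costs on the order of $n^3$ distance evaluations, which is manageable for clusters of around 100 points; near-vertical edges are handled only through the $\varepsilon$ term, so their slopes become very large.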
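For a given set of four vertices $(x_k, y_k)$, $k = 0, \dots, n - 1$ (here $n = 4$), which enclose a polygon, the centroid of the polygon can be quickly computed. With $(x_n, y_n) = (x_0, y_0)$, the signed area of the polygon is

$$S = \frac{1}{2} \sum_{k=0}^{n-1} (x_k y_{k+1} - x_{k+1} y_k).$$

The coordinates of the centroid are given by

$$C_x = \frac{1}{6S} \sum_{k=0}^{n-1} (x_k + x_{k+1})(x_k y_{k+1} - x_{k+1} y_k), \qquad C_y = \frac{1}{6S} \sum_{k=0}^{n-1} (y_k + y_{k+1})(x_k y_{k+1} - x_{k+1} y_k).$$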
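A polygon centroid from ordered vertices can be computed with the standard shoelace formulation, sketched below. The function name is hypothetical, and it is an assumption that this is the formulation used in the paper. Because the signed area appears in both the numerator and the denominator, the result is the same for clockwise and counterclockwise vertex order.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_centroid(vertices: List[Point]) -> Point:
    """Centroid of a simple polygon given its vertices in traversal order."""
    n = len(vertices)
    twice_area = 0.0  # 2 * signed area
    cx = 0.0
    cy = 0.0
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]  # wraps around to the first vertex
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0.0:
        raise ValueError("degenerate polygon with zero area")
    # 6 * area equals 3 * twice_area
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(rectangle))  # (2.0, 1.0)
```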

Experiment and Model Training
Based on the actual position parameters at the site, a variety of experiments were designed in different environments. Sample images from the dataset are shown in Figure 9. To ensure that the camera can capture the lifting holes completely, the camera was set up at the site as shown in Figure 10b.

To simulate a greater range of environmental variations, the dataset was created by altering other conditions on top of changes in perspective. When capturing images from specific angles, we varied the lighting angle and intensity. Furthermore, for 20% of the original images, we applied 1 to 3 random augmentation techniques, including Gaussian blur, color jittering, and image rotation, to accommodate diverse scenarios. This approach ensures both the authenticity of the training set and the diversity of images within the dataset. Ultimately, after subjecting the initial set of 3102 images to partial augmentation, a total of 4083 images were obtained.
These images were randomly divided into training, testing, and validation sets in an 8:1:1 ratio. The training set was used for training the YoloV5s model, and the cropped images of the holes were used as the training set for the ADD model. The software environment for model training was the Windows 10 operating system with Python 3.7 and PyTorch 1.12. The hardware setup included an AMD Ryzen9 5900X CPU, an RTX 3080 (10 GB) GPU, and DDR4 RAM (64 GB). The camera used for capture was a HikVision MV-CH120-10GC, with a frame rate of 9.4 FPS and an image resolution of 4096 × 3000. The training process used the SGD optimizer with a batch size of 16 and 100 epochs. The learning rate was adjusted using cosine annealing, starting at 0.01. A momentum of 0.9 and a weight decay of 0.0004 were used.
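For reference, a cosine-annealed learning rate with the stated starting value can be written in a few lines. The sketch assumes that the schedule is stepped once per epoch over the 100 epochs, that the minimum learning rate is 0, and that there is no warm-up; none of these details are given in the text, and the function name is hypothetical.

```python
import math


def cosine_annealed_lr(epoch: int, total_epochs: int = 100,
                       lr_start: float = 0.01, lr_min: float = 0.0) -> float:
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_start - lr_min) * (1.0 + cosine)


# Learning rate at the start, middle, and end of training.
schedule = [cosine_annealed_lr(e) for e in (0, 50, 100)]
print(schedule)
```

Under these assumptions the rate starts at 0.01, passes through 0.005 at the halfway point, and decays to 0 at the final epoch.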
To validate the effectiveness of the proposed method, it was compared with existing pixel-based classification methods, as well as DeepLabV3+-MobileNetV3. The evaluation criteria included model parameter size, mIoU, mPA, and segmentation result images. For pixel binary classification problems, mIoU and mPA scores are calculated using the following formulas:

$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right), \qquad \mathrm{mPA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),$$

where TP denotes the number of pixels correctly classified as the positive class (lifting holes), TN refers to the number of pixels correctly classified as the negative class (background), FP corresponds to the number of pixels incorrectly classified as the positive class, and FN represents the number of pixels incorrectly classified as the negative class.

Furthermore, to evaluate the fitting performance on the vertices, the Hough line detector, line segmentation detection (LSD), LS, and VLFGM were compared. For the centroid coordinate error, the proposed method was compared with LS, the minimum bounding rectangle method (MBR), edge centroid (EC), and connected domain centroid (CD). These methods are compared in terms of the Euclidean distance error, measured in pixels.
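The two segmentation metrics can be computed from the four pixel counts as sketched below, assuming the usual class-averaged definitions for a two-class problem (per-class IoU and per-class pixel accuracy, each averaged over lifting hole and background). The function name and the example counts are illustrative only.

```python
from typing import Tuple


def binary_miou_mpa(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Class-averaged IoU and pixel accuracy for a binary segmentation."""
    iou_pos = tp / (tp + fp + fn)  # lifting-hole class
    iou_neg = tn / (tn + fn + fp)  # background class
    pa_pos = tp / (tp + fn)        # share of hole pixels recovered
    pa_neg = tn / (tn + fp)        # share of background pixels recovered
    return (iou_pos + iou_neg) / 2.0, (pa_pos + pa_neg) / 2.0


# Illustrative counts, not taken from the paper.
miou, mpa = binary_miou_mpa(tp=80, tn=900, fp=10, fn=10)
print(round(miou, 4), round(mpa, 4))
```

The sketch does not guard against empty classes (zero denominators), which would need handling for images without a lifting hole.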

Training Results of YoloV5s
Figure 11a presents the curves of training loss and mAP. The curves change relatively smoothly. In particular, the mAP@0.5:0.95 indicator reached 87.60%, indicating high recall and precision of the model. Furthermore, in Figure 11b, the left image shows the detection results in the presence of various interfering objects: the YoloV5s model accurately identifies the real lifting hole instead of the adjacent interfering hole. The other three images depict environments with strong lighting, weak lighting, and lighting shadows, respectively. The confidence values and the positions of the bounding boxes show that the model still handles target localization well under these conditions.

Based on the experimental result images, YoloV5s consistently and accurately identifies the positions of lifting holes, regardless of lighting conditions or shooting angles. This indicates that the model generalizes well and can reliably achieve rough positioning of lifting holes in the described lifting scenarios. These results indicate that YoloV5s can consistently acquire the ROI, which supports the safety of lifting operations.

Image Segmentation Results
To evaluate the segmentation performance of the ADD model, this study compared its results with those of various methods, as shown in Table 4 and Figure 12. In Table 4, considering mIoU alone, DeepLabV3+ does not emerge as the best baseline model. However, after considering the overall impact on the prediction results using mPA, we opted to improve this model. Unet, with its classic encoder-decoder structure, achieves the highest mIoU among the baseline models, 93.04% on the dataset. Nevertheless, its mPA is relatively low, indicating a higher number of misclassified pixels. As shown in Figure 12e, almost every predicted image from Unet exhibits missing boundaries, particularly in areas with strong lighting shadows. The unsophisticated upsampling and downsampling processes in Unet, coupled with the lack of global contextual information in the convolution operations, may result in the loss of features [34], causing the smaller predicted boundaries.
Although PSPNet and SegNet have slightly lower mIoU scores than Unet, their higher mPA scores indicate more accurate prediction of region areas. However, both models have noticeable drawbacks. As depicted in Figure 12c,d, when the segmentation results are relatively complete, PSPNet tends to produce less smooth edges, particularly at the vertices. The spatial pyramid pooling structure in PSPNet integrates features from different scales, but the varying semantic information across scales affects the perception of edges, so some features are suppressed and edge smoothness is compromised. SegNet performs slightly better in terms of prediction results, but almost all images exhibit minor jaggedness along the edges. This is due to the lack of modules that capture contextual information effectively during the encoding phase, which reduces the model's segmentation capability in edge regions. When prominent lighting shadows are present, the aforementioned models classify pixels accurately mainly in well-illuminated areas.
From Table 4 and Figure 12g, it is evident that the DeepLabV3+-ResNet50 model exhibits excellent overall performance metrics, segmentation results and FPS. This model achieves a remarkable mPA of 99.10%, indicating high accuracy in pixel classification and minimal missegmentation, making it well suited for segmenting hole images. However, DeepLabV3+ also has certain drawbacks. In certain images, similar to PSPNet's segmentation results, distortion at the corner points can occur. Nevertheless, this phenomenon is somewhat alleviated due to the adoption of atrous convolution in DeepLabV3+, which offers a larger receptive field and richer semantic information. However, even with these improvements, discontinuities in connected regions still occur in images with strong lighting shadows. Additionally, we employed the lightweight MobileNetV3 as the backbone network. Unfortunately, despite boasting the best real-time performance, this approach yielded poor results across various aspects except for the high mPA, as depicted in Figure 12f.
The ADD model builds upon the DeepLabV3+ model by introducing the ECA mechanism and DASPP module. The improved algorithm incorporates attention mechanisms at multiple stages, resulting in more reasonable feature weight allocation and a better focus on relevant features. Moreover, the model's parameter count increases only slightly, approximately 4.23 M. The densely connected multi-scale feature fusion module offers a larger receptive field and stronger semantic information extraction capability compared to the original model. As a result, ADD achieves the highest mIoU and mPA scores, as shown in Table 4. Simultaneously, the proposed model exhibits the best segmentation results across seven different environments, as illustrated in Figure 12h. In the first group of images, ADD's predictions along the right edge do not exhibit downward bending, particularly noticeable in the lower right corner. Conversely, scenarios (d-g) manifest such curvatures, although these might seem present in the original images due to imaging effects. In the second and fifth image groups, both ADD and scenarios (d-g) display jagged edges along the left and right boundaries. Within the fourth image group, ADD's superiority over other models is distinctly pronounced. In the sixth experimental set, only ADD and the baseline show favorable outcomes. However, the baseline's classification outcome in the upper left corner appears rounded and lacks sharpness, akin to an obtuse angle, which indicates evident distortion. Although some resulting images may exhibit unevenness, overall, the ADD model produces the smoothest edge effects. Furthermore, the model demonstrates superior resistance to lighting interference, preventing the disconnection of connected regions, which is crucial. The ADD provides a guarantee for subsequent detection tasks.
Moreover, we have conducted an analysis of the contributions of each module to the overall model, and the results are presented in the following Table 5. mIoU and mPA are employed as evaluation metrics. The DASPP connects features across different scales. As evident from Table 5, the introduction of DASPP leads to a 0.36% enhancement in the model's mIoU metric. The incorporation of the ECA mechanism in low-level features results in a performance gain of 1.17%, signifying the retention of significant original information in the shallow network layers and emphasizing the favorable role of attending to these features. The introduction of attention mechanisms at the output end of the backbone network (referred to as the deep level in the table) optimizes both mIoU and mPA by 2.18% and 0.08%, respectively, underscoring the importance of deep-level semantic information for model performance, despite its susceptibility to loss and thus meriting profound consideration. Comparing experiments in groups 6 and 8, it is evident that introducing the ECA mechanism in ResNet effectively mitigates the impacts of operations such as pooling. Even with an already optimized foundation, the model's mIoU metric can be further augmented by 1.01%, effectively counteracting excessive information loss caused by the backbone network's pooling layers. By contrasting experiments in groups 4, 6, and 8, it is discernible that introducing attention mechanisms both before and after the backbone network's output end can preserve essential feature information.

Visual Results Comparison
Through the aforementioned evaluation method, we conducted tests on point clusters under special circumstances. Whether the segmented region is oversized or incomplete, or even when the clustering results deviate, the VLFGM method consistently outperforms LS and corrects the vertex coordinates more effectively. The comparative results are presented in Figure 13. With the improved corner-fitting approach, feature points can be accurately extracted from various segmented images. The technique also corrects deviations in vertex positions and compensates for incomplete image segmentation, which makes the corner information in the images more usable and improves accuracy and reliability across application scenarios.

For images with better imaging quality, such as those in groups 1, 5, and 6 of Figure 12, fitting the key points for localization is not challenging. To compare clearly the fitting performance of different methods on images with poor imaging quality, this study presents five groups of result images, as shown in Figure 14. When the image segmentation results are relatively normal, the fitting results of Hough line detection, LSD, LS, and VLFGM are nearly identical and realistic, as depicted in Figure 14a,b. In this scenario, VLFGM effectively degenerates into LS, because nearly all points lie close to the best line and receive high weighted votes.
However, in special cases such as Figure 14c-e, there are some differences in the effects of different methods. Hough and LSD are feature-based methods relying on edge points or corner features, which are sensitive to significant gradient changes in pixels but cannot differentiate interest points, resulting in more noisy lines in the detection results and thus appearing more chaotic. Furthermore, LS cannot discard outliers in the points, leading to deviations in the fitted line slope as it directly regresses on the sample points. Consequently, LS fitting results exhibit larger errors in the vertexes of the enclosed shapes. In contrast, the proposed method can assign weights, extracting the principal component information from the sample data and outputting a line most related to the majority of points. This method demonstrates more stable fitting performance and, visually, can still achieve satisfactory results even in poorly performing images.

Error Analysis of Feature Points
A total of 60 experiments were designed to quantify the improvement effects of the proposed method, using GT as the baseline. The experimental subjects included both normal and abnormal segmented result images, and the comparative result is shown in Figure 15.
From Figure 14, it can be observed that the LS method is highly sensitive to abnormal images. Moreover, its fitting performance relies heavily on the results of Gaussian mixture clustering, as depicted in Figure 13a,b. For these reasons, LS did not perform well in the experiments: even though some points had fitting errors as low as 0 pixels, the overall error still reached 7.70 pixels, as shown in Figure 15. The VLFGM method, in contrast, demonstrates stronger anti-interference capability; as long as the principal components of the point clusters are normal, the fitting results do not vary significantly. When the segmentation results are extremely poor, the improved method may still produce points with larger errors, but these errors are smaller than those of LS within the same set of experiments. Overall, the VLFGM method concentrates the vertex errors, and even when some image segmentation results are unsatisfactory, the average vertex fitting error is 5.39 pixels.
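The error figures quoted here can be related with a short calculation. The sketch below shows a mean Euclidean pixel error against ground-truth points and the relative reduction between two mean errors; the function names are hypothetical, and only the two averages 7.70 and 5.39 pixels are taken from the text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_euclidean_error(predicted: List[Point], ground_truth: List[Point]) -> float:
    """Average pixel distance between predicted and ground-truth points."""
    distances = [math.hypot(p[0] - g[0], p[1] - g[1])
                 for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def relative_reduction(baseline: float, improved: float) -> float:
    """Error reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Mean vertex errors reported for LS and VLFGM.
print(round(relative_reduction(7.70, 5.39), 2))  # 30.0
```

With these two averages, $(7.70 - 5.39) / 7.70 = 0.30$, which matches the 30.00% reduction in vertex error reported for VLFGM.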

To show from another perspective that the proposed method has better fitting accuracy, its centroid errors were compared with those of other methods, and the experimental result is shown in Table 6. In theory, there is no direct relationship between vertex error and centroid error. Because the image contains numerous internal points, errors can compensate for one another and even result in smaller centroid errors, as shown in Table 6. MBR exhibits the larger centroid error, reaching 4.05 pixels, due to factors such as the target pose during center extraction. EC relies solely on edge extraction for centroid estimation, which can lead to significant errors when pixels within the target region are misclassified, as shown in Figure 12d. As a result, EC performs only slightly better than MBR. LS, by fitting quadrilaterals close to the original results, yields errors similar to those of CD. However, since the original connected regions contain more points, CD achieves a better average error, surpassing LS by 0.15 pixels. VLFGM, on the other hand, is capable of fitting connected regions closer to the ground truth, resulting in a centroid error of only 1.72 pixels for the proposed method. This demonstrates its superior fitting performance.

Conclusions and Discussion
In this study, the YoloV5s model was employed to extract the lifting hole regions from panoramic photos. Subsequently, the ADD model was utilized for pixel-level segmentation of the target regions, and the segmented areas were then fitted using the VLFGM method to output the key point coordinates for alignment. Experimental results demonstrate that the proposed approach achieves higher accuracy in coordinate estimation, thereby providing assurance for spatial point extraction. The following conclusions are drawn from this research: (1) The YoloV5s model is capable of identifying lifting holes in various environments, with accurate boundary predictions of anchor boxes, ensuring reliable coarse localization and laying the foundation for subsequent tasks. (2) The ADD model exhibits superior stability and precision compared to other commonly used models, achieving an mIoU of 95.07% and an mPA of 99.16%. (3) Even in cases where the segmentation results are suboptimal, the VLFGM method still performs well in most fitting tasks and depends less on the image segmentation results than other methods. Moreover, the method achieves higher accuracy: the average vertex coordinate error is 5.39 pixels in the presence of abnormal images, a reduction of 30.00% in overall error, and the centroid error is as low as 1.72 pixels, a reduction of approximately 28.93%.
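As a quick consistency check on the reported reductions, the arithmetic can be sketched as follows. It assumes that the 30.00% vertex reduction is measured against the 7.70-pixel LS result reported in the vertex comparison; the centroid baseline is not restated here, so the value computed below is only what the 28.93% figure implies, not a reported measurement.

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Vertex error: LS baseline (assumed) versus VLFGM, in pixels.
vertex_reduction = relative_reduction(7.70, 5.39)

# Centroid error: baseline implied by a 28.93% reduction down to 1.72 pixels.
implied_centroid_baseline = 1.72 / (1.0 - 0.2893)

print(f"vertex reduction: {vertex_reduction:.2f}%")
print(f"implied centroid baseline: {implied_centroid_baseline:.2f} px")
```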
Our method also has some limitations. The weights in the VLFGM method are currently predefined; weight acquisition could be improved through Gaussian distributions and machine learning to give better adaptive performance. Because the method is implemented in Python, VLFGM requires approximately 0.05 s when the number of sample points is within 100, a figure that depends on the efficiency of both the programming language and the code. Furthermore, construction operations prioritize detection speed and accuracy. Owing to limitations in image resolution and hardware capabilities, our method achieves a frame rate of only 9.4 FPS. To enhance the real-time capability of the localization method, two approaches can be considered. First, the prediction speed of the object detection and image segmentation models can be increased by adopting a smaller backbone, with a careful trade-off between accuracy and speed. Second, given the frame rate constraints of industrial cameras, image capture devices with higher frame rates can be employed, although this may involve lower resolution, lower accuracy, or increased investment, which is a further limitation of our approach. In practical operational scenarios, suspensions are often difficult to keep still. If noticeable vibrations occur, the captured images may exhibit blurring or other interference, which is another aspect to consider.
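The timing figures above can be put into a rough per-frame budget. The sketch below is illustrative only: it assumes that the 0.05 s fitting time is included in the 9.4 FPS figure, which is not stated explicitly, and the 25 FPS target is a hypothetical value rather than a requirement taken from this work.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

budget_ms = frame_budget_ms(9.4)   # reported end-to-end frame rate
fit_ms = 0.05 * 1000.0             # approximate VLFGM time for up to 100 points

# Share of each frame spent on fitting, IF fitting is part of the 9.4 FPS figure.
fit_share = fit_ms / budget_ms

# Hypothetical real-time target, for comparison only.
target_budget_ms = frame_budget_ms(25.0)
fit_exceeds_target = fit_ms > target_budget_ms

print(f"budget per frame: {budget_ms:.1f} ms, fitting share: {fit_share:.0%}")
print(f"budget at 25 FPS: {target_budget_ms:.1f} ms, fitting alone exceeds it: {fit_exceeds_target}")
```

Under these assumptions the fitting step would account for a substantial part of each frame, so a faster backbone alone would not reach such a target unless the fitting code is also optimized.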
Obtaining accurate coordinate information for key point extraction forms the foundation of visual alignment and is crucial for visual guidance. Future work will primarily focus on enhancing the usability of semantic segmentation models on target images, particularly improving the restoration of boundaries and vertices. Target region fitting can only serve as a mitigating approach. At the same time, visual localization has become an important direction in intelligent construction. More measures need to be taken to address the accuracy of feature point coordinates, including improving the real-time performance and accuracy of target detection and segmentation, reducing the complexity of coordinate point extraction, and ultimately achieving efficient and high-precision visual alignment to raise the level of intelligence in construction projects.
Author Contributions: Z.Z. put forward many constructive suggestions and provided experimental equipment; W.X. and F.Q. provided some hardware devices, completed the experiment, and recorded the data; J.Q. proposed specific ideas and solutions and wrote this article. All authors have read and agreed to the published version of the manuscript.