Article

A Lightweight Improved YOLOv8-Based Method for Rebar Intersection Detection

1 College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
2 Key Laboratory of Special Purpose Equipment and Advanced Processing Technology, Ministry of Education & Zhejiang Province, Zhejiang University of Technology, Hangzhou 310014, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 12898; https://doi.org/10.3390/app152412898
Submission received: 6 November 2025 / Revised: 29 November 2025 / Accepted: 4 December 2025 / Published: 7 December 2025
(This article belongs to the Special Issue Advances in Smart Construction and Intelligent Buildings)

Abstract

As industrialized construction and smart building continue to advance, rebar-tying robots place higher demands on the real-time and accurate recognition of rebar intersections and their tying status. Existing deep learning-based detection methods generally rely on heavy backbone networks and complex feature-fusion structures, making it difficult to deploy them efficiently on resource-constrained mobile robots and edge devices, and there is also a lack of dedicated datasets for rebar intersections. In this study, 12,000 rebar mesh images were collected and annotated from two indoor scenes and one outdoor scene to construct a rebar-intersection dataset that supports both object detection and instance segmentation, enabling simultaneous learning of intersection locations and tying status. On this basis, a lightweight improved YOLOv8-based method for rebar intersection detection and segmentation is proposed. The original backbone is replaced with ShuffleNetV2, and a C2f_Dual residual module is introduced in the neck; the same improvements are further transferred to YOLOv8-seg to form a unified lightweight detection–segmentation framework for joint prediction of intersection locations and tying status. Experimental results show that, compared with the original YOLOv8L and several mainstream detectors, the proposed model achieves comparable or superior performance in terms of mAP@50, precision and recall, while reducing model size and computational cost by 51.2% and 58.1%, respectively, and significantly improving inference speed. The improved YOLOv8-seg also achieves satisfactory contour alignment and regional consistency for rebar regions and intersection masks. Owing to its combination of high accuracy and low resource consumption, the proposed method is well suited for deployment on edge-computing devices used in rebar-tying robots and construction quality inspection, providing an effective visual perception solution for intelligent construction.

1. Introduction

As industrialized construction and intelligent construction technologies rapidly advance, the demand for automated and intelligent equipment on construction sites is becoming increasingly urgent. Among the numerous procedures involved in reinforced concrete (RC) construction, rebar tying is one of the key processes for ensuring the structural load-bearing performance. At present, rebar tying is still predominantly performed manually under generally harsh site conditions, characterized by high labor intensity and low operational efficiency; moreover, the operation process is susceptible to fatigue and distraction, leading to quality risks such as missing ties and incorrect ties [1,2,3]. In complex construction environments, the speed and accuracy with which workers identify rebar intersections are further degraded by illumination variations and visual fatigue [4,5]. Consequently, achieving reliable automation of the rebar-tying process has become an important research direction in the field of construction robotics.
In recent years, deep-learning-based computer vision has demonstrated significant advantages in condition assessment and visual inspection of civil infrastructure [6,7,8]. When integrated with mobile robotic platforms, it has shown great potential in applications such as autonomous inspection, construction monitoring, and quality control in the building sector [9,10,11]. With regard to the automation of rebar tying, several rebar-tying robot prototypes have already been validated in engineering practice. Jin et al. [12] designed a rebar-tying robot with active perception and planning capabilities, which combines an RGB-D camera with a neural network model and achieves an intersection localization accuracy of approximately 89%. Feng et al. [13] developed a planar rebar-tying robot that employs an industrial camera for large-area detection of rebar intersections, attaining a recognition accuracy of 95.17%. These studies indicate that rebar-tying robots can effectively reduce manual labor intensity and improve construction safety and operational consistency; however, their overall system performance remains highly dependent on accurate and robust visual detection of rebar intersections under real construction-site conditions.
The rapid development of deep-learning-based object detection has provided a new technical pathway for visual perception in complex construction scenarios [14,15,16]. Compared with methods that rely on handcrafted features and traditional image-processing pipelines, convolutional neural network (CNN)-based detection frameworks can learn multi-scale and highly expressive features in an end-to-end manner, exhibiting stronger robustness under complex background conditions [9,10]. For rebar-related targets, Fan et al. [17] proposed a CNN-DC model for rebar counting and centroid localization, enabling automatic estimation of the number and positions of rebars; Xi et al. [18] investigated automatic measurement of rebar spacing using Faster R-CNN, based on two-dimensional deep-learning vision and camera imaging geometry; Kardovskyi et al. [19] developed the AI-QIM model based on Mask R-CNN for quality inspection of rebar tying on construction sites. Although these methods achieve high accuracy in rebar detection and quality assessment, their network structures are relatively complex with high computational and storage costs and a strong dependence on large-scale, high-quality annotated datasets. They still tend to miss or falsely detect small, densely distributed rebar intersections and are difficult to deploy directly on resource-constrained mobile robots and edge-computing devices [20,21].
To improve real-time performance and reduce model complexity, researchers have further focused on single-stage detection frameworks represented by YOLO and SSD. Li et al. [22] proposed a YOLOv3-based method for automatic rebar detection and counting, which significantly increases inference speed while maintaining high detection accuracy; Feng et al. [13] constructed a YOLOv5-P6-based rebar-intersection detection model for recognizing intersections over large-area rebar meshes, substantially improving intersection localization accuracy; Cheng et al. [23] investigated a rebar-intersection recognition network within a MobileNetV3-SSD framework, which simultaneously outputs the rebar positions and their tying status, thereby better matching the practical operational requirements of rebar-tying robots. Such YOLO-based approaches strike a balance between detection accuracy and inference speed in rebar and rebar-intersection detection. Nevertheless, under real-world conditions with severe illumination changes, heavy occlusions, and complex backgrounds, they still suffer from insufficient detection accuracy and limited feature representation capability for rebar-intersection targets that are small, densely distributed, and visually similar, leaving considerable room for further lightweight optimization in edge-computing scenarios with constrained computational resources [24].
In response to the above issues, many researchers have focused on lightweight object detection models. On the one hand, lightweight backbone networks such as GoogLeNet, MobileNet, and EfficientNet have been adopted to replace complex backbones, where designs such as depthwise separable convolutions, channel splitting, and channel shuffling are used to substantially reduce the number of parameters and FLOPs [25,26,27]; on the other hand, in YOLO-like architectures, the detection heads and residual structures have been modified by incorporating grouped convolutions, attention mechanisms, and feature recombination to reduce redundant computation and enhance the representation of small targets. Zhang et al. [28] integrated a GhostNet backbone into YOLOv5 and combined it with an improved IoU loss, effectively reducing redundant computation. Duan et al. [29] proposed the YOLO-FAS model, which builds upon YOLOv5s by incorporating lightweight modules and quantization-aware training to achieve efficient detection of rebar-intersection locations and tying status. Zheng et al. [30] proposed the RebarNet network, which enhances the detection of small rebar targets through multi-scale feature fusion and embedded attention mechanisms. In addition, some studies have combined model compression and knowledge distillation techniques to further reduce model size while striving to preserve accuracy [31]. However, most of these methods are designed and evaluated on generic datasets or specific scenarios, and the potential of feature fusion and lightweight design for dense small targets such as rebar intersections has not yet been fully exploited. Compared with earlier versions of the YOLO family, YOLOv8 achieves a more favorable trade-off between accuracy and inference speed. Nevertheless, although YOLOv8 surpasses its predecessors in detection performance by incorporating CSPDarkNet and C2f modules, the CSP structure still introduces redundant computation and substantial memory-access overhead in stacked convolutions [32], while the C2f modules incur considerable computational cost due to repeated convolutions and frequent feature concatenations. These limitations are particularly critical on resource-constrained edge devices, thereby hindering the long-term stable deployment and operation of the model. Furthermore, high-quality rebar-intersection datasets and systematic data acquisition schemes tailored to rebar-tying scenarios are still limited [33].
To address the aforementioned problems, this paper proposes a rebar-intersection detection method based on a lightweight improved YOLOv8 and conducts systematic validation on a self-constructed dataset and a dedicated server platform. The main contributions of this study are summarized as follows:
(1) A rebar-intersection dataset comprising 12,000 images is constructed, accompanied by systematic preprocessing and annotation. The dataset covers diverse illumination conditions, occlusion patterns, and rebar layout configurations, thereby providing a comprehensive reflection of real construction-site environments and strong data support for model accuracy and generalization in complex scenarios.
(2) In terms of model architecture, the original CSPDarkNet backbone in YOLOv8 is replaced with ShuffleNetV2. By leveraging channel splitting and channel shuffling, the proposed backbone effectively reduces the number of parameters and floating-point operations while maintaining multi-scale feature extraction capability, thereby markedly improving runtime efficiency in resource-constrained environments. To alleviate the computational redundancy of the C2f modules in YOLOv8, a DualConv structure is introduced, which combines grouped convolutions and pointwise convolutions in parallel. By exploiting group sparsity, DualConv reduces redundant computation and enhances the extraction and fusion of small-target features, further improving the model’s lightweight characteristics without compromising detection accuracy.
(3) On the self-constructed dataset, comparative experiments are conducted between the proposed model and mainstream detectors such as Faster R-CNN and YOLOv5/6/7/8. The results show that the proposed method maintains a high level of accuracy in terms of mAP@50, while significantly reducing GFLOPs and model size, thereby demonstrating the feasibility and superiority of the improved model for deployment on edge devices and in practical engineering applications.

2. Materials and Methods

2.1. Dataset Construction and Processing

2.1.1. Dataset Construction

In rebar-intersection object detection and image segmentation tasks, the performance and generalization ability of deep learning models are highly dependent on the quality of the training data. However, there is currently no publicly available dataset dedicated to rebar intersections, and existing studies mostly rely on generic object detection or road/industrial-scene datasets, which cannot adequately cover the complex illumination, occlusion and background interference conditions encountered in real rebar construction. To meet the visual perception requirements of rebar-tying robots, this study constructed a dedicated rebar-grid intersection dataset containing 12,000 images to support the training and validation of the improved YOLOv8 models.
According to different task requirements, the dataset was divided into two subsets: an object detection subset and an image segmentation subset. The detection subset was used to train and evaluate the rebar-intersection detection model, with annotations in the form of bounding boxes and class labels for each intersection; the segmentation subset was used for pixel-level segmentation of rebar regions and intersection regions, providing shape constraints for fine localization and subsequent structural analysis. Some raw images were used in both tasks, differing only in annotation type and usage.
To enhance data diversity and scene representativeness, three image acquisition environments were designed, including two indoor scenes and one outdoor scene, as shown in Figure 1. The outdoor scene was arranged in an open area and captured under natural lighting conditions, covering situations such as strong sunlight, shadow boundaries and background variations, in order to simulate illumination changes and environmental disturbances during outdoor tying operations. The first indoor scene adopted a relatively simple background with uniform artificial lighting, highlighting rebar textures and intersection structures and facilitating basic feature learning by the model. In the second indoor scene, a complex background with textures or patterns was deliberately placed beneath the rebar grid and the lighting angle was adjusted, so that background interference and local occlusions appeared simultaneously in the images, making the scene closer to actual construction conditions.
In terms of hardware configuration, all three acquisition environments employed the same imaging system to ensure data consistency. A DCXW800 industrial camera (Jieruiweitong Electronics Technology Co., Ltd., Shenzhen, China) was used as the capture device and was mounted on a fixed stand with the lens oriented vertically toward the rebar grid on the ground. The imaging distance and camera height were kept constant to obtain images with similar viewpoints and controlled distortion. During acquisition, key parameters such as lens type, focal length, exposure time, and gain were maintained unchanged, while images were collected at multiple time periods to naturally incorporate variations in illumination intensity and shadow distribution. All images were recorded at a uniform resolution of 1280 × 720 pixels and were managed through a host-computer software system for acquisition and storage, as illustrated in Figure 2.
By adopting a multi-scene, multi-time acquisition strategy with unified camera parameters, the constructed dataset exhibits substantial diversity and representativeness in terms of illumination conditions, background complexity, rebar layout patterns, and the number of intersections. This provides a reliable data foundation for the training and evaluation of subsequent object detection and image segmentation models, and establishes a solid basis for analyzing the performance of the improved YOLOv8 in the rebar-intersection detection task.

2.1.2. Dataset Annotation and Processing

During the annotation and preprocessing stage, the object detection and image segmentation subsets were labeled and organized separately according to their respective task requirements. For the object detection subset, the open-source annotation tool LabelImg was used to manually label the collected rebar grid images. In each image, all visible rebar intersections were enclosed with rectangular bounding boxes and assigned corresponding class labels. The annotations were saved in VOC-format XML files, which mainly contain the image filename, image size, and the class name and bounding-box coordinates of each target. Figure 3 shows an example of the annotated samples for object detection, and Figure 4 illustrates the structure of the corresponding XML annotation files.
For the image segmentation subset, the objective was to obtain precise pixel-level contours of the rebar and intersection regions. The open-source tool Labelme was used to annotate a selected set of representative rebar grid images. During annotation, polygonal regions were manually drawn point by point along the edges of the rebars or intersections to form closed masks, and each polygon was assigned an appropriate class label. Upon completion, Labelme automatically generated JSON-format annotation files containing the image path and size, the class name of each annotated object, and the coordinates of the polygon vertices. This information provides the complete metadata required for subsequent mask generation and data loading during model training. Figure 5 presents an example of the annotated samples in the image segmentation subset.
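As a concrete illustration of the two annotation formats described above, the following minimal Python sketch loads bounding boxes from a VOC-format XML file and polygons from a Labelme JSON file. The field names follow the standard VOC and Labelme layouts; the helper functions and file paths are illustrative, not part of the released toolchain.

```python
# Minimal loaders for the two annotation formats used in this dataset.
# VOC XML: <object><name>...</name><bndbox>... ; Labelme JSON: "shapes" list.
import json
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    """Return (class_name, (xmin, ymin, xmax, ymax)) tuples from a VOC XML file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes

def load_labelme_polygons(json_path):
    """Return (class_name, [(x, y), ...]) tuples from a Labelme JSON file."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return [(shape["label"], [tuple(pt) for pt in shape["points"]])
            for shape in data.get("shapes", [])]
```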
After completing the annotations for both tasks, the dataset was unified, cleaned, and quality-checked. Images that were blurry, severely occluded, or clearly missing annotations were removed, and each image was inspected to ensure that all bounding boxes corresponded one-to-one with actual rebar intersections, thereby avoiding missing or incorrect labels. In addition, the class names in the JSON files were verified against the predefined category set to ensure semantic consistency in the annotations. The processed dataset was subsequently divided into training, validation, and test sets according to a fixed ratio, in order to ensure fair model evaluation and reproducible experimental results. Through the above annotation and processing workflow, the resulting dataset captures the distribution characteristics of rebar intersections under different scenes, illumination conditions, and background settings, and provides high-quality data support for training and performance evaluation of the proposed lightweight YOLOv8 models.

2.2. Overview of the YOLOv8 and YOLOv8-Seg Frameworks

In this study, all improvements are developed on top of the YOLOv8 framework. As shown in Figure 6, the overall architecture consists of an input layer, a backbone network (Backbone), a neck network (Neck), and a detection head (Head). The input images are first fed into the Backbone to extract multi-scale features, which are then fused in a top-down and bottom-up manner by the Neck, and finally passed to the Head to predict the class and bounding-box location of each rebar intersection.
The backbone network is built by stacking multiple convolutional layers and residual structures, with CBS, C2f, SPPF and Bottleneck as its core modules. As shown in Figure 7a, the CBS module consists of a convolutional layer, Batch Normalization and a SiLU activation function, and is used for basic local feature extraction and channel transformation. Figure 7b shows the C2f module, where the input first passes through a CBS block and is then split into two branches: one branch goes through several Bottleneck blocks for deep feature extraction, while the other serves as a shortcut branch that is concatenated with the Bottleneck output, followed by a final CBS to achieve feature reuse and efficient gradient flow. Figure 7c presents the SPPF module, in which the input is processed by a CBS block and multiple serial max-pooling layers to obtain features with different receptive fields; these are concatenated and fed into a concluding CBS, enabling multi-scale context aggregation with relatively low computational cost. Figure 7d illustrates two forms of the Bottleneck unit: the left variant includes a shortcut connection that adds the input to the output of two CBS layers to alleviate gradient vanishing and enhance feature representation, whereas the right variant omits the shortcut and uses only two stacked CBS layers for feature transformation when a residual path is not required.
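For clarity, the CBS block and the Bottleneck unit described above can be sketched in PyTorch roughly as follows; the channel settings and class names are illustrative and do not reproduce the exact Ultralytics implementation.

```python
# Hedged PyTorch sketch of the CBS block (Conv + BatchNorm + SiLU) and the
# Bottleneck unit with an optional shortcut, following Figure 7a,d.
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = CBS(c, c)
        self.cv2 = CBS(c, c)
        self.add = shortcut  # left variant in Figure 7d keeps the residual path

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y
```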
The neck part adopts a combined FPN–PAN structure, as shown in Figure 8. The FPN passes high-level semantic information in a top-down manner, while the PAN propagates low-level detailed features in a bottom-up manner. Through upsampling, downsampling, and feature concatenation, multi-scale feature maps are projected into a unified feature space, enabling the network to exploit both local texture and global structural information when detecting small, densely distributed rebar intersections.
The detection head adopts a decoupled structure, as shown in Figure 9, where the classification branch predicts the category probability and the regression branch performs bounding-box localization. The classification branch uses the binary cross-entropy (BCE) loss to constrain the consistency between predicted probabilities and ground-truth labels. The regression branch incorporates the CIoU loss, which jointly considers overlap area, center-point distance, and aspect-ratio consistency between the predicted and ground-truth boxes, providing more precise localization guidance than IoU-based losses alone. To further enhance coordinate regression accuracy, YOLOv8 introduces the Distribution Focal Loss (DFL), which discretizes continuous box coordinates into several candidate positions and applies weighted constraints around the true location, making the predicted distribution more concentrated near the ground truth. In addition, YOLOv8 employs a Task-Aligned Assigner that integrates classification scores and IoU into a unified task-alignment metric, enabling the selection of positive samples that are both well-classified and well-localized.
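The coordinate decoding implied by DFL can be made concrete with a short sketch: each box offset is predicted as a distribution over discrete candidate positions and decoded as its softmax-weighted expectation. The bin count is an assumed hyperparameter here, and the function simplifies the actual YOLOv8 head.

```python
# Illustrative DFL-style decoding: a distribution over reg_max discrete bins
# is collapsed to a continuous offset via the softmax expectation.
import torch

def dfl_decode(logits: torch.Tensor) -> torch.Tensor:
    """logits: (..., reg_max) per-coordinate bin scores -> continuous offset."""
    reg_max = logits.shape[-1]
    bins = torch.arange(reg_max, dtype=logits.dtype, device=logits.device)
    probs = logits.softmax(dim=-1)     # weighted probabilities around the truth
    return (probs * bins).sum(dim=-1)  # expected, sub-bin accurate position
```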
When performing object detection alone, the YOLOv8 head contains only the classification and regression branches described above. For instance segmentation, YOLOv8-seg keeps the Backbone and Neck unchanged and introduces an additional segmentation branch in the Head, as illustrated in Figure 10. This branch first generates a set of prototype feature maps with K channels from the Neck features and outputs a K -dimensional mask coefficient vector for each predicted instance. By applying a channel-wise weighted sum between the prototype masks and the corresponding mask coefficients, followed by cropping and upsampling within each predicted bounding box, YOLOv8-seg produces instance masks that match the resolution of the original image. This design, which combines globally shared prototypes with a small set of per-instance coefficients, significantly reduces the dimensionality and computational cost of mask prediction while maintaining fine-grained segmentation quality.
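The prototype mechanism amounts to a per-instance linear combination of shared mask bases; a minimal sketch (omitting the box cropping and upsampling steps) is given below, with the tensor shapes as the only assumptions.

```python
# Sketch of prototype-based mask assembly: N instances, K shared prototypes.
import torch

def assemble_masks(protos: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """protos: (K, H, W) shared prototype maps; coeffs: (N, K) per-instance vectors."""
    k, h, w = protos.shape
    masks = coeffs @ protos.reshape(k, h * w)  # channel-wise weighted sums
    return masks.reshape(-1, h, w).sigmoid()   # (N, H, W) soft instance masks
```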
In terms of the loss function, YOLOv8-seg retains the BCE classification loss, the CIoU bounding-box regression loss, and the DFL-based coordinate refinement loss used in detection, and additionally incorporates a mask segmentation loss. The overall loss can be expressed as:
$$L_{seg} = \alpha L_{cls} + \beta L_{CIoU} + \gamma L_{DFL} + \omega L_{mask}$$
where $L_{cls}$ denotes the binary cross-entropy (BCE) classification loss, which constrains the consistency between the predicted class probabilities and the ground-truth labels; $L_{CIoU}$ is the CIoU regression loss, which extends IoU by additionally considering the distance between box centers and the consistency of aspect ratios; $L_{DFL}$ is the Distribution Focal Loss, which refines the bounding-box coordinates by modeling discrete locations with weighted probabilities around the true position; and $L_{mask}$ is the mask segmentation loss, measuring the pixel-level discrepancy between the predicted and ground-truth instance masks, typically implemented using BCE or BCE combined with a Dice-based overlap constraint. The coefficients $\alpha$, $\beta$, $\gamma$, and $\omega$ are loss-weight hyperparameters used to balance the relative importance of the detection and segmentation sub-tasks during joint optimization.
It is worth noting that YOLOv8-seg shares the same Backbone and Neck as the standard YOLOv8 detection model. Therefore, the lightweight improvements proposed in this study—including replacing the original backbone with ShuffleNetV2 and introducing the C2f_Dual residual module in the Neck—can be seamlessly transferred to YOLOv8-seg without modifying the loss formulation above, enabling both the detection and segmentation branches to benefit from the same efficient feature representations.

2.3. Lightweight Improved YOLOv8 Network

2.3.1. Backbone Improvement Using ShuffleNetV2

The backbone of YOLOv8 is built upon the CSPDarkNet53 architecture, which is responsible for extracting multi-scale feature maps from the input image and providing high-quality feature representations for the subsequent Neck and Head. As the core computational component of the network, the backbone has a direct impact on the overall parameter count, computational complexity, and inference efficiency. In practical applications—particularly when targeting mobile and embedded platforms—enhancing real-time inference performance under constrained computational resources and power limitations becomes a key design challenge. To address this issue, this study replaces the original CSPDarkNet53 backbone in YOLOv8 with ShuffleNetV2. As an improved version of ShuffleNetV1, ShuffleNetV2 effectively alleviates problems such as high memory access cost (MAC), excessive fragmentation caused by heavily grouped convolutions, and additional time and memory overhead introduced by element-wise additions in shortcut connections. Integrating ShuffleNetV2 into the YOLOv8 backbone allows the model to significantly reduce computational load and memory access while maintaining comparable detection accuracy. Figure 11 illustrates the basic building blocks of ShuffleNetV2, where Figure 11a shows the standard basic unit and Figure 11b shows the downsampling unit. These units are stacked hierarchically to form the complete ShuffleNetV2 backbone, optimizing the computational efficiency of the network as a whole.
In the basic unit of ShuffleNetV2, a channel split operation is first applied to divide the input feature map into left and right branches. The left branch usually serves as a shortcut path with a relatively simple structure, while the right branch undertakes the main nonlinear transformations and feature extraction. Channel splitting reduces the computation of each individual branch without significantly increasing the overall operation complexity, alleviates the performance bottleneck caused by multi-channel convolutions, and lowers memory access pressure. In the right branch, depthwise separable convolutions are introduced, whose structure is shown in Figure 12. A standard convolution is decomposed into a depthwise convolution followed by a pointwise convolution: the former performs spatial convolution independently within each channel, and the latter carries out a linear combination along the channel dimension. Compared with conventional convolutions, this factorization can significantly reduce the number of parameters and floating-point operations while maintaining feature representation capability, making ShuffleNetV2 particularly suitable for deployment on edge devices with limited computational resources.
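A minimal PyTorch sketch of this factorization, with illustrative layer settings, is shown below: a per-channel depthwise convolution followed by a 1 × 1 pointwise convolution that mixes channels.

```python
# Depthwise separable convolution as in Figure 12 (illustrative settings).
import torch.nn as nn

def depthwise_separable(c_in: int, c_out: int, k: int = 3, s: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, s, padding=k // 2, groups=c_in, bias=False),  # depthwise: spatial, per channel
        nn.BatchNorm2d(c_in),
        nn.Conv2d(c_in, c_out, 1, bias=False),  # pointwise: linear combination across channels
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )
```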
In the downsampling basic unit, ShuffleNetV2 maintains the lightweight, dual-branch design philosophy of the standard unit but no longer performs channel splitting. Instead, spatial downsampling is achieved by introducing a stride-2 depthwise convolution followed by a pointwise convolution. The two branches process the input features in parallel and are subsequently concatenated along the channel dimension to form the output feature map. Because the spatial resolution is reduced by half, the number of channels is doubled to preserve the overall information capacity and representation power. After concatenation, the features pass through a Channel Shuffle operation, whose principle is illustrated in Figure 13. By grouping and reordering channels, information from different branches is fully mixed, improving cross-branch feature interaction, enhancing feature transmission efficiency, and strengthening the network’s ability to model complex patterns. This provides a more effective feature representation foundation for downstream rebar-intersection detection.
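The Channel Shuffle operation itself reduces to a reshape–transpose–reshape on the channel dimension, as the following sketch shows:

```python
# Channel Shuffle (Figure 13): group channels, swap the group and per-group
# axes, then flatten back, interleaving features across branches.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group number"
    return (x.reshape(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))
```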
An important design principle of ShuffleNetV2 is to minimize the memory access cost (MAC) of the 1 × 1 convolution while keeping its computational complexity (FLOPs) fixed. To illustrate this, we provide the following analysis.
Assume the input feature map has a size of $H \times W \times C_1$, where $H$ and $W$ denote the spatial height and width, and $C_1$ is the number of input channels. The $1 \times 1$ convolution kernel has a shape of $(C_1, C_2, 1, 1)$, where $C_2$ is the number of output channels. If the output feature map retains the same spatial resolution $H \times W$, the corresponding computational cost is
$$B = HWC_1C_2$$
where $B$ represents the total FLOPs of the $1 \times 1$ convolution.
The total MAC of this convolution mainly arises from three sources. Before computation, the input feature map must be read from memory, resulting in an access cost of approximately $HWC_1$. After computation, the output feature map is written back to memory, incurring an access cost of approximately $HWC_2$. In addition, the convolution kernel parameters must be read once, generating an access cost of $C_1C_2$. Summing these three components, the MAC can be approximated as
$$MAC = HWC_1 + HWC_2 + C_1C_2 = HW(C_1 + C_2) + C_1C_2$$
The three terms in (3), respectively, correspond to the memory access cost of reading the input feature map, writing the output feature map, and loading the kernel parameters. This analysis highlights why ShuffleNetV2 emphasizes balanced channel allocation and reduced reliance on expensive $1 \times 1$ convolutions to achieve efficient network execution on resource-limited devices.
From (2), it follows that
$$C_1C_2 = \frac{B}{HW}$$
Substituting (4) into (3) gives
$$MAC = HW(C_1 + C_2) + \frac{B}{HW} = \sqrt{(HW)^2(C_1 + C_2)^2} + \frac{B}{HW}$$
Since $HW > 0$, we can apply the arithmetic–geometric mean inequality
$$(C_1 + C_2)^2 \geq 4C_1C_2$$
For a fixed product $C_1C_2$, the lower bound of $(C_1 + C_2)^2$ is thus $4C_1C_2$. Combining this with (2), i.e., $B = HWC_1C_2$, yields
$$MAC \geq \sqrt{(HW)^2 \cdot 4C_1C_2} + \frac{B}{HW} = 2HW\sqrt{C_1C_2} + \frac{B}{HW} = 2\sqrt{HW \cdot B} + \frac{B}{HW}$$
Equation (7) gives the theoretical lower bound of the MAC when the computational cost $B$ and spatial size $H$, $W$ are fixed. According to the arithmetic–geometric mean inequality, equality in (6) holds if and only if $C_1 = C_2$, i.e., $(C_1 + C_2)^2 = 4C_1C_2$. Therefore, under given $B$ and feature-map size $H$, $W$, the MAC of a $1 \times 1$ convolution is minimized only when the numbers of input and output channels are equal.
This result directly guides the network design of ShuffleNetV2: by keeping the input and output channel numbers of each 1 × 1 convolution as close as possible (ideally equal), the memory access cost can be significantly reduced without increasing the computational cost B . After integrating ShuffleNetV2 into the YOLOv8 backbone, the model achieves not only lower parameter count and FLOPs, but also reduced MAC, thereby improving deployment efficiency on resource-constrained edge devices.
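A quick numerical check of this bound (with arbitrary illustrative sizes) confirms that, at fixed $B$, the balanced configuration $C_1 = C_2$ attains the theoretical minimum exactly, while unbalanced splits pay a growing MAC penalty:

```python
# MAC vs. channel balance at fixed FLOPs B = HW * C1 * C2 (illustrative sizes).
import math

H = W = 56
B = H * W * 128 * 128  # fix B by fixing the product C1 * C2 = 128 * 128

for c1, c2 in [(128, 128), (64, 256), (32, 512)]:
    assert c1 * c2 == 128 * 128  # same B for every configuration
    mac = H * W * (c1 + c2) + c1 * c2
    print(f"C1={c1:>3}, C2={c2:>3}: MAC={mac:,}")

bound = 2 * math.sqrt(H * W * B) + B / (H * W)  # lower bound from Eq. (7)
print(f"theoretical lower bound: {bound:,.0f}")  # met exactly at C1 = C2
```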

2.3.2. Residual Module Improved with Dual Convolution

YOLOv8 makes extensive use of C2f residual blocks in both the Backbone and Neck for multi-scale feature extraction and information fusion. By combining shortcut connections with feature concatenation, C2f effectively enhances feature representation and alleviates gradient vanishing. However, in high-resolution, densely populated small-target scenarios such as rebar intersection detection, the standard convolutions used in the original C2f incur substantial computational cost on high-resolution feature maps, which significantly increases inference latency and hinders real-time deployment on embedded platforms and rebar-tying robots. To address this issue, while keeping the overall topology of C2f unchanged, this study introduces a lightweight Dual Convolutional Kernels (DualConv) structure to modify the C2f residual block, resulting in the C2f_Dual module that reduces computational load and improves inference efficiency. In order to enhance feature representation while lowering complexity, part of the standard convolutions in the original C2f are replaced with the DualConv structure. DualConv consists of two parallel convolution branches: one branch employs a grouped 3 × 3 convolution to focus on capturing local spatial context, and the other branch uses a 1 × 1 pointwise convolution to reorganize and fuse features along the channel dimension. Both branches share the same input feature channels and extract different types of information, which are then fused at the output (e.g., by element-wise addition or channel-wise concatenation), thereby providing richer feature representations at a lower computational cost. The filter design of DualConv is illustrated in Figure 14.
In Figure 14, the left branch represents a grouped 3 × 3 convolution kernel that performs spatial context modeling on the input feature map, while the right branch represents a 1 × 1 convolution kernel that conducts channel compression and linear transformation. The outputs of the two branches are fused at the end to form the output feature map of DualConv. In the figure, M denotes the number of input channels, N denotes the number of output channels, and G is the number of groups in the grouped convolution. In the DualConv structure, the N convolution kernels are divided into G groups, each operating on a subset of the input channels. Specifically, the input feature map with M channels is evenly partitioned into G groups along the channel dimension. For each group, a portion of the channels (e.g., M / G channels) is processed by the grouped 3 × 3 convolution to extract local spatial features, while the remaining channels are processed by the 1 × 1 convolution to perform channel reorganization and linear transformation. The outputs of the two branches are then fused along the channel dimension. This grouping mechanism induces a block-diagonal sparse structure in the convolutional weight matrix, encouraging strongly correlated channels to be grouped and convolved locally, thereby reducing redundant computation without significantly degrading representation capacity.
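Under these design rules, a DualConv block can be sketched in PyTorch as two parallel convolutions fused by element-wise addition; the group number and channel sizes below are illustrative, and the group count must divide both channel counts.

```python
# Hedged sketch of DualConv (Figure 14): a grouped 3x3 branch for spatial
# context plus a 1x1 pointwise branch for channel recombination, summed.
import torch
import torch.nn as nn

class DualConv(nn.Module):
    def __init__(self, m: int, n: int, groups: int = 4, stride: int = 1):
        super().__init__()
        # groups (G) must divide both m (input) and n (output) channels
        self.gc = nn.Conv2d(m, n, 3, stride, padding=1, groups=groups, bias=False)
        self.pw = nn.Conv2d(m, n, 1, stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gc(x) + self.pw(x)  # fuse the spatial and channel branches
```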
To quantitatively analyze the reduction in complexity brought by DualConv, we compare the FLOPs of a standard convolution and DualConv. The total computational cost of a standard convolution layer is given by
$$FL_{std} = D_0^2 \cdot K^2 \cdot M \cdot N$$
where the input feature map has spatial size $D_0 \times D_0$ and $M$ channels, the output feature map has $N$ channels, and the convolution kernel size is $K \times K$. In this case, the input feature map is processed by $N$ kernels of size $K \times K \times M$, producing an output feature map of size $D_0 \times D_0 \times N$. The FLOPs of DualConv can be expressed as
$$FL_{Dual\_3\times3} = \frac{D_0^2 K^2 M N}{G} + \frac{D_0^2 M N}{G}$$
$$FL_{Dual\_1\times1} = D_0^2 M N \left(1 - \frac{1}{G}\right)$$
$$FL_{Dual} = FL_{Dual\_3\times3} + FL_{Dual\_1\times1} = \frac{D_0^2 K^2 M N}{G} + D_0^2 M N$$
where $G$ is the number of groups in the grouped convolution and serves as a tunable hyperparameter. Equation (11) shows that the total FLOPs of DualConv are composed of the computational cost of the grouped $3 \times 3$ branch and the $1 \times 1$ branch, as given in (9) and (10), respectively. The cost of additional linear operations such as element-wise addition is relatively small and can be neglected.
It should be noted that, in the earlier description of the DualConv design, the feature extraction process was abstracted into two types of operations (e.g., applying a $3 \times 3$ convolution to $M/G$ channels and a $1 \times 1$ convolution to the remaining $M - M/G$ channels), where $M/G$ and $M - M/G$ describe the two branches from the perspective of channel allocation. In contrast, the symbol $G$ introduced in (11) and in Figure 14 denotes the number of groups in the grouped convolution, which is a structural hyperparameter of the network. In practice, the input feature map is evenly partitioned into $G$ channel groups along the channel dimension, following a fixed order or rule, and the network automatically learns the weight distribution within each group during training; it is not necessary to manually specify which physical channel belongs to which group. The choice of $G$ directly affects both the computational cost and the representation capacity of the model. In this work, a fixed small integer value of $G$ is adopted, and the specific setting is detailed in the experimental configuration and ablation studies in Section 3.
By comparing the computational cost of a standard convolution in (8) with that of DualConv in (11), the reduction ratio can be expressed as
$$R = \frac{FL_{Dual}}{FL_{std}} = \frac{1}{G} + \frac{1}{K^2}$$
where $R$ represents the ratio of the DualConv FLOPs to those of the standard convolution. When $R < 1$, DualConv requires fewer computations than the standard convolution, and a smaller $R$ indicates a more significant reduction in computational cost. For example, with $K = 3$ and $G = 4$, $R = 1/4 + 1/9 \approx 0.36$, i.e., DualConv needs only about 36% of the FLOPs of a standard $3 \times 3$ convolution. The remaining symbols are consistent with those in (8) and (11). Equation (12) shows that, with a reasonable choice of the group number $G$ and channel partition strategy, DualConv can achieve a substantial reduction in FLOPs and parameters compared with a standard convolution. The ablation experiments in Section 3 further confirm that replacing the original C2f block with the DualConv-based C2f_Dual block in the rebar-intersection detection task leads to a clear decrease in FLOPs and inference time, while maintaining or even slightly improving detection accuracy. Overall, the C2f_Dual module based on DualConv effectively enhances the lightweight nature of YOLOv8 without sacrificing feature representation capability, making the improved network more suitable for real-time applications such as rebar-tying robots.

2.3.3. Overall Network Architecture

After applying the lightweight improvements to the backbone and residual blocks, the proposed modules are integrated into a unified detection–segmentation framework to construct an improved YOLOv8-based rebar intersection detection network. The overall architecture of the modified network is shown in Figure 15. Compared with the original YOLOv8, the proposed model follows the same global “Backbone–Neck–Head” design paradigm, but replaces the backbone with ShuffleNetV2 and adopts the C2f_Dual modules in the residual structures, thereby achieving a more favorable balance between computational efficiency and detection accuracy.
In the backbone part, the original YOLOv8 adopts a CSPDarkNet-based network to progressively extract multi-scale semantic features. In this work, it is replaced with a lightweight ShuffleNetV2-based backbone, as shown in Figure 15, which is composed of a series of ShuffleNetV2 basic units and downsampling units. By incorporating channel split, depthwise separable convolutions and channel shuffle, the improved backbone significantly reduces the number of parameters and multiply–accumulate operations while still effectively capturing rebar textures and intersection structures, thus providing high-quality multi-scale features for subsequent fusion. In the neck part, the FPN–PAN structure of YOLOv8 is retained to achieve top-down and bottom-up feature fusion, while all original C2f modules are replaced by C2f_Dual modules based on DualConv. The neck, which consists of multiple C2f_Dual blocks together with upsampling and downsampling layers, enhances the perception of rebar intersections at different scales during multi-scale feature concatenation and fusion, and, compared with the original C2f, further reduces computational cost so that high inference speed can be maintained even under high-resolution inputs. In the head part, the improved network keeps the decoupled detection and segmentation heads of YOLOv8: the detection branch receives multi-scale features and outputs class probabilities and bounding-box offsets for rebar intersection localization and classification, while the segmentation branch generates rebar masks based on prototype masks and mask coefficients. Both branches share the same lightweight backbone and C2f_Dual-based fused features, enabling unified modeling of detection and segmentation tasks without incurring a significant increase in computational burden.
It should be emphasized that the lightweight modifications to YOLOv8 proposed in this paper are not only applicable to object detection, but can also be directly transferred to the YOLOv8-seg instance segmentation model. YOLOv8-seg shares exactly the same Backbone and Neck as YOLOv8 and only adds a prototype-mask and mask-coefficient branch in the Head. Therefore, once CSPDarkNet is replaced with ShuffleNetV2 in the backbone and the C2f modules in the neck are replaced with DualConv-based C2f_Dual modules, both the detection and segmentation heads benefit from the same lightweight feature representations. The improved Backbone and Neck provide efficient multi-scale features for bounding-box prediction, as well as for prototype-mask generation and mask-coefficient regression, allowing YOLOv8-seg to significantly reduce computational cost and model size while largely preserving segmentation accuracy. In Section 3, comparative experiments between the lightweight YOLOv8/YOLOv8-seg and the original YOLOv8 as well as other mainstream detectors will quantitatively evaluate the performance of the proposed architecture on rebar-intersection detection in terms of parameter count, FLOPs, inference time and accuracy.

2.4. Experimental Environment and Evaluation Metrics

With the widespread application of deep learning in object detection and image segmentation, network architectures have become increasingly complex, and both training and inference are progressively more dependent on computational resources. To ensure fair and reproducible comparisons, all models in this study—including the original YOLOv8/YOLOv8-seg, the proposed lightweight YOLOv8/YOLOv8-seg, and comparison algorithms such as Faster R-CNN, YOLOv5, YOLOv6, and YOLOv7—are trained and tested under the same software and hardware environment.
All experiments are conducted on a deep-learning server equipped with a GPU, running a 64-bit Linux operating system with the corresponding NVIDIA driver, CUDA, and cuDNN installed to support GPU acceleration. The deep-learning framework is PyTorch, combined with Python for model construction, training, and inference. During training, TensorBoard and other visualization tools are used to monitor the convergence of the loss and the evolution of evaluation metrics such as mAP, precision, recall, and F1-score with the number of epochs, which facilitates timely observation of the training status and potential overfitting.
The detailed software and hardware configuration of the server is listed in Table 1, including CPU model and frequency, memory capacity, GPU model and memory, operating system version, as well as the versions of PyTorch and CUDA. By adopting a unified experimental platform, the influence of hardware differences on training speed and inference performance can be effectively eliminated.
During training, all models use the same dataset split, i.e., they share a common training set, validation set, and test set. The image preprocessing procedures are kept consistent: all input images are resized to the same resolution and normalized, and the data augmentation strategy is unified to include random flipping, brightness perturbation, and contrast perturbation. In addition, under the premise of not affecting normal convergence, the training epochs, batch size, base learning rate, and learning-rate decay strategy are kept as consistent as possible across models, with only minor adjustments when necessary for specific architectures. These settings ensure that the comparison results mainly reflect differences in network structure rather than biases introduced by training conditions.
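For reference, a typical Ultralytics-style training call under this unified protocol might look as follows; the dataset YAML name, epoch count, batch size, image size, and augmentation strengths are placeholder values for illustration, not the exact settings of this study.

```python
# Illustrative unified training/validation setup (placeholder hyperparameters).
from ultralytics import YOLO

model = YOLO("yolov8l.yaml")          # baseline; a modified YAML for the lightweight variant
model.train(
    data="rebar_intersections.yaml",  # hypothetical dataset split config
    epochs=300,
    batch=16,
    imgsz=640,
    fliplr=0.5,                       # random horizontal flip
    hsv_v=0.4,                        # brightness perturbation
)
metrics = model.val()                 # precision, recall, mAP on the shared val split
```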
For the object detection and image segmentation of rebar intersections, there is currently no unified evaluation standard, and different applications place varying emphasis on accuracy, recall, real-time performance, and model complexity. To comprehensively evaluate the proposed lightweight improved YOLOv8/YOLOv8-seg models, this section introduces the evaluation metrics from two perspectives: detection accuracy and model complexity.
(1) Precision
$$P = \frac{TP}{TP + FP} \times 100\%$$
Precision measures the proportion of predicted positive samples that are true positives and reflects the level of false positives (FP).
(2) Recall
$$R = \frac{TP}{TP + FN} \times 100\%$$
Recall measures the proportion of true positive samples that are successfully detected and reflects the level of false negatives (FN).
(3) Average Precision (AP)
Average Precision (AP) is defined on the basis of the precision–recall (P–R) curve and provides a comprehensive evaluation of a model’s performance for a given class across different recall levels. It can be expressed as
$$AP = \int_0^1 P(r)\,dr$$
where $P(r)$ is the precision at recall $r$. In practice, AP is computed via discrete sampling and numerical approximation.
(4) Mean Average Precision (mAP)
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
where $n$ is the number of classes in the detection task and $AP_i$ is the Average Precision of the $i$-th class. In this study, mAP@50 (IoU threshold of 0.5) is adopted as the core accuracy metric for rebar-intersection detection and segmentation.
(5) F1-score
$$F1 = \frac{2 \times P \times R}{P + R}$$
The F1-score is the harmonic mean of precision and recall and is used to assess the overall balance between low false-positive and low false-negative rates.
(6) Model Complexity Metrics
GFLOPs: the number of floating-point operations (in billions) required for a single forward pass at a standard input size, used to measure computational cost.
Parameters (Params): the total number of trainable parameters, usually measured in millions (M), which affects storage overhead and the risk of overfitting.
Model Size: the storage size of the model weight file (MB), reflecting the memory requirement on the deployment device.
In the subsequent experiments and discussions, the above accuracy and complexity metrics are jointly used to systematically compare the original YOLOv8/YOLOv8-seg, the proposed improved models, and the other baseline algorithms.
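The accuracy metrics above can be reproduced from raw counts and a sampled precision–recall curve with a few lines of Python; the counts used here are illustrative.

```python
# Precision, recall, F1 from TP/FP/FN counts, and AP as the area under a
# sampled P-R curve (trapezoidal rule), matching the definitions above.
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions) -> float:
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2))

p, r, f1 = precision_recall_f1(tp=96, fp=3, fn=4)  # illustrative counts
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")
```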

3. Results and Discussion

3.1. Object Detection Experiments

To evaluate the performance of the improved YOLOv8 model on the rebar-intersection detection task, several typical object detection algorithms, including Faster R-CNN, YOLOv5, YOLOv6, and YOLOv7, are selected as comparison methods. All models are trained and tested on the same dataset with identical data preprocessing procedures and similar training strategies. The comparison results are summarized in Table 2.
As shown in Table 2, the improved YOLOv8 achieves a favorable balance among accuracy, computational efficiency, and model lightweighting. Compared with two-stage algorithms such as Faster R-CNN, the improved YOLOv8 provides higher detection speed and a smaller model size. Relative to YOLOv5, YOLOv6, and YOLOv7, the improved YOLOv8 maintains competitive accuracy while offering clear advantages in terms of complexity. In other words, the proposed improved YOLOv8 model significantly reduces computational resource consumption while preserving high detection accuracy, thereby providing a practical solution for real-time, low-power object detection in industrial scenarios.
To examine the convergence behavior of the improved YOLOv8 during training, Figure 16 presents the curves of mAP@50, precision, F1-score, and recall as functions of the number of epochs:
Figure 16a shows the variation in mAP@50 with epochs, reflecting the improvement of overall detection accuracy. Figure 16b plots the precision curve, which characterizes the change in false detections as training progresses. Figure 16c gives the F1-score curve, illustrating the evolution of the balance between precision and recall, while Figure 16d shows the recall curve, which describes how the coverage of true rebar intersections changes with training. As can be seen from Figure 16, all four metrics increase gradually from relatively low initial values and tend to stabilize in the later training stages. The precision and recall both exceed 0.96, mAP@50 reaches more than 0.97, and the F1-score remains above 0.95. These results indicate that the model converges well on the proposed dataset without noticeable oscillation or degradation.
Figure 17 illustrates qualitative detection results of the improved YOLOv8 on several test images. In the figure, rectangular bounding boxes are drawn around the detected rebar intersections, and the class labels and confidence scores are shown next to each box. It can be observed that most rebar intersections are accurately localized and correctly classified under complex backgrounds and varying illumination conditions, with only a small number of false detections and missed detections, thereby confirming the detection capability of the improved YOLOv8 on real images.
From the comparison results in Table 2, it can be seen that, for the rebar-intersection detection task, the improved YOLOv8 attains performance in precision, recall, and mAP@50 that is on par with, or slightly superior to, mainstream one-stage detectors such as YOLOv5, YOLOv6, and YOLOv7, while its GFLOPs and model size are both significantly reduced. This demonstrates that, without sacrificing detection accuracy, the proposed lightweight design effectively compresses the computational and storage requirements of the model, making it more suitable as a detection module for edge devices such as rebar-tying robots.
This performance gain mainly benefits from the joint improvements to the backbone and neck structures. The ShuffleNetV2 backbone employs channel splitting, depthwise separable convolutions, and channel shuffling to substantially reduce the multiply–accumulate operations associated with 1 × 1 convolutions while preserving the key feature representation capability, thereby greatly lowering the computational complexity of the backbone at a given input resolution. In the multi-scale feature fusion stage of the neck, the C2f_Dual residual blocks based on DualConv decompose part of the standard convolutions into a grouped 3 × 3 convolution branch and a 1 × 1 convolution branch, which reduces redundant computation while maintaining or even enhancing the representation of dense small targets. Combined with the convergence behavior of the training curves in Figure 16, it can be concluded that the improved YOLOv8 does not exhibit evident underfitting or overfitting on this task, indicating that the lightweight structures do not weaken the effective representation capability of the model.

3.2. Image Segmentation Experiments

In the image segmentation experiments, an improved lightweight segmentation model is developed on the basis of YOLOv8-seg. The same rebar image data source and training strategy as in the object detection task are adopted, and instance-level segmentation is performed for the rebar regions and intersections. During training, the loss function follows the same structure as that used for the detection model, with an additional mask segmentation loss term $L_{mask}$ introduced to constrain the quality of the instance masks. The performance evolution of the improved YOLOv8-seg during training is illustrated in Figure 18.
Figure 18a presents the mAP@50 curve, which measures the overall accuracy of the instance segmentation results at an IoU threshold of 0.5. Figure 18b shows the precision curve, reflecting the proportion of correctly identified instance masks among the segmentation results. Figure 18c gives the F1-score curve, indicating the overall balance between precision and recall in the segmentation task, while Figure 18d displays the recall curve, representing the coverage of true instance masks by the model. As can be seen from Figure 18, all four metrics exhibit an overall upward trend with the number of training epochs and become stable after more than 300 epochs. The precision exceeds 0.93, the recall is higher than 0.96, mAP@50 surpasses 0.96, and the F1-score remains above 0.95. This indicates that, although the growth rates of precision and recall gradually slow down, the model has reached a stable high-performance regime and can converge well on the segmentation task.
To intuitively demonstrate the segmentation performance, Figure 19 provides visualizations of the segmentation results on several test images. The semi-transparent regions in different colors represent the predicted instance masks, whose polygonal boundaries are generally consistent with the actual rebar contours. It can be observed that, under varying backgrounds and illumination conditions, the improved YOLOv8-seg can accurately segment rebar regions and intersections, thereby providing pixel-level information support for subsequent geometric analysis and robot localization.
Judging from both the training process and the visualization results, the lightweight modifications do not impair the effective learning capability of the segmentation branch. The training curves in Figure 18 show that the mAP@50, precision, recall, and F1-score of the improved YOLOv8-seg on the segmentation task increase steadily with the number of epochs and gradually level off in the later stage, indicating that, under shared lightweight features, the segmentation branch can still sufficiently learn the shape and contour information of the targets. The visual results in Figure 19 further demonstrate that, in typical rebar scenes, the instance masks produced by the improved YOLOv8-seg exhibit good contour alignment and regional consistency. In resource-constrained scenarios, the lightweight design based on ShuffleNetV2 and C2f_Dual thus provides a reasonable starting point for segmentation tasks. For applications that demand higher segmentation accuracy, part of the original C2f structures may be retained, or the degree of backbone lightweighting may be reduced, thereby enabling a more refined trade-off between accuracy and complexity.

3.3. Ablation Study

To quantitatively analyze the impact of each lightweight module on model performance, ablation experiments are designed around the backbone and the residual blocks, and tests are conducted on the improved YOLOv8 and improved YOLOv8-seg models. For the object detection task, four types of network configurations are considered. The first configuration is the original YOLOv8L baseline model, i.e., the large-size detection variant of YOLOv8, which offers high accuracy but requires substantial computational resources and is used as the reference benchmark. The second configuration modifies only the residual blocks by replacing the C2f modules with C2f_Dual, in order to isolate the effect of the lightweight residual structure. The third configuration replaces only the backbone with ShuffleNetV2, in order to observe the changes in model complexity and performance introduced by the lightweight backbone. The fourth configuration is the complete lightweight model, in which the C2f_Dual residual blocks are incorporated on the basis of the lightweight backbone, so as to evaluate the overall trade-off between accuracy and efficiency when both the backbone and residual structures are lightweighted simultaneously. The experimental results of these four configurations on the rebar-intersection dataset are summarized in Table 3.
From Table 3, it can be observed that the original YOLOv8L baseline reaches an mAP@50 of 0.9888, at a computational cost of 165.4 GFLOPs and a model storage size of 83.6 MB. When only the residual blocks are modified, mAP@50 decreases slightly, by approximately 0.6%, while the computational cost is reduced by 18.7% and the model size by 17.7%. This indicates that improving only the residual blocks can effectively lower the computational complexity of the model while maintaining high detection accuracy. When only the improved backbone is introduced, mAP@50 decreases by 1.0%, but the computational cost is reduced by 51.0% and the model size by 42.7%, demonstrating that the backbone replacement is highly effective for model lightweighting, albeit at the expense of some detection accuracy. When the improved backbone and residual blocks are employed simultaneously, mAP@50 decreases by only 1.1%, while the computational cost is reduced by 58.1% and the model size by 51.2%. These results show that combining the two improvements does not lead to a simple additive loss in accuracy, suggesting a degree of complementarity in their optimization directions. Moreover, the reduction in computational cost even exceeds the reduction in model size, indicating that the proposed structures are particularly effective at eliminating compute-intensive operations.
On the other hand, to evaluate the impact of the lightweight improvements on the segmentation task, analogous ablation experiments are conducted on YOLOv8L-seg. Table 4 summarizes the performance of four configurations (the original YOLOv8L-seg model, replacing only the backbone, replacing only the C2f modules, and jointly replacing the backbone and C2f modules with their lightweight counterparts) in terms of mAP@50, GFLOPs, and model size on the segmentation task.
As shown in Table 4, the YOLOv8L-seg model exhibits a clear accuracy–efficiency trade-off during the lightweight optimization process. When only the C2f modules are optimized, the computational cost is reduced by 9.9%, the model size is compressed to 54.0 MB, and mAP@50 decreases by only 0.47%. When the backbone is replaced with its lightweight counterpart, the computational cost drops by 28.7% and the model size is reduced by 15.2%, whereas the accuracy decreases by 1.8%. When both improvements are applied jointly, the computational cost is reduced by 30.6% and the model size is shrunk by 18.6%, while mAP@50 decreases by 2.0% to a final value of 0.9437. These results indicate that the lightweight structures exert varying degrees of influence on segmentation performance and model complexity.
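These relative figures can be recomputed directly from the absolute values in Table 4; a minimal check is sketched below, with the values hard-coded from the table (small rounding differences against the percentages quoted in the text are expected):

```python
# Relative reductions for the segmentation ablation, recomputed from Table 4.
rows = {
    "YOLOv8L-seg (baseline)": (0.9633, 191.4, 61.7),
    "C2f_Dual only":          (0.9588, 172.3, 54.0),
    "backbone only":          (0.9464, 136.5, 52.3),
    "backbone + C2f_Dual":    (0.9437, 132.8, 50.2),
}
map0, gf0, mb0 = rows["YOLOv8L-seg (baseline)"]
for name, (m, gf, mb) in rows.items():
    print(f"{name:24s} mAP -{(1 - m / map0) * 100:.2f}% | "
          f"GFLOPs -{(1 - gf / gf0) * 100:.1f}% | "
          f"size -{(1 - mb / mb0) * 100:.1f}%")
```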
From a structural-mechanism perspective, the roles of ShuffleNetV2 and C2f_Dual in the rebar-intersection detection and segmentation tasks can be understood in terms of feature representation and computational efficiency. Rebar intersections belong to a category of dense, regular, and small-sized targets that are highly sensitive to local textures and geometric structures. ShuffleNetV2 introduces depthwise separable convolutions and channel shuffling in the shallow layers of the network, thereby improving feature representation efficiency per unit computation and enabling the model to capture sufficiently rich local texture information under limited computational resources. Combined with the MAC analysis formula in Section 2.3.1, it can be shown that, under relatively balanced configurations of input and output channels, the memory-access cost of 1 × 1 convolutions can be significantly reduced, which provides theoretical support for the lightweight design of the backbone.
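For completeness, that inequality can be restated here (this is the standard ShuffleNetV2 result, not notation specific to this paper): for a 1 × 1 convolution over an h × w feature map with c1 input and c2 output channels, the FLOPs are B = hw c1 c2 and the memory access cost satisfies

```latex
\mathrm{MAC} = hw\,(c_1 + c_2) + c_1 c_2 \;\geq\; 2\sqrt{hwB} + \frac{B}{hw},
\qquad B = hw\,c_1 c_2,
```

with equality if and only if c1 = c2. In other words, for a fixed computational budget, balanced input and output channel widths minimize memory access, which is exactly what the channel-split-and-shuffle design of the ShuffleNetV2 units enforces.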
C2f_Dual decomposes the standard convolutions in the original C2f module into a grouped 3 × 3 convolution branch and a 1 × 1 convolution branch, which primarily focus on local spatial relationship modeling and channel-wise information recombination, respectively. This design is more suitable for representing complex textures around rebar intersections, such as intersecting line segments, overlapping regions, and rust-induced noise. The grouped 3 × 3 convolutions concentrate on local structures within a relatively small receptive field, whereas the 1 × 1 convolutions accomplish information fusion along the channel dimension. With an appropriately chosen number of groups G, DualConv can markedly reduce computational cost while maintaining the number of output channels and the feature representation capability, as verified by the ablation results in Table 3 and Table 4. Both the theoretical analysis and experimental results demonstrate that the proposed lightweight design does not merely rely on naively reducing the number of layers or channels; instead, it achieves highly cost-effective modeling of rebar-intersection features by employing more efficient convolutional structures within the overall YOLOv8 framework.
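To make the two-branch structure concrete, the following is a minimal PyTorch sketch of the dual convolution described above. It is an illustrative approximation rather than the authors' exact implementation; the module name, the default G = 4, and the BatchNorm/SiLU placement are assumptions:

```python
import torch
import torch.nn as nn

class DualConvSketch(nn.Module):
    """Sum of a grouped 3x3 branch (local spatial modeling) and a
    1x1 branch (channel-wise recombination), as described in the text."""

    def __init__(self, c_in: int, c_out: int, groups: int = 4):
        super().__init__()
        # Grouped 3x3: roughly 9*c_in*c_out/G multiply-accumulates per pixel,
        # versus 9*c_in*c_out for a standard (ungrouped) 3x3 convolution.
        self.conv3x3 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1,
                                 groups=groups, bias=False)
        # Pointwise 1x1: fuses information across all input channels.
        self.conv1x1 = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two branches are summed channel-wise, then normalized.
        return self.act(self.bn(self.conv3x3(x) + self.conv1x1(x)))

# Example: a feature map at one neck scale keeps its spatial resolution.
x = torch.randn(1, 64, 80, 80)
print(DualConvSketch(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

Relative to a standard 3 × 3 convolution, the per-pixel cost falls from 9·c1·c2 to roughly (9/G + 1)·c1·c2, which is the source of the GFLOPs reductions reported in Tables 3 and 4.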

4. Conclusions

This paper addresses the problem of rebar-intersection detection and segmentation for rebar-tying robots and construction quality control and proposes a rebar-intersection detection method based on a lightweight improved YOLOv8. On this basis, a corresponding lightweight YOLOv8-seg segmentation scheme is developed. To address the lack of publicly available rebar-intersection datasets, rebar mesh images were collected in two indoor scenes and one outdoor scene under diverse illumination conditions, background complexities, and tying statuses. LabelImg and Labelme were used to perform object-detection and image-segmentation annotation, resulting in a dedicated dataset for training and evaluation. While keeping the overall YOLOv8 framework unchanged, the backbone is replaced with ShuffleNetV2, and the C2f residual blocks in the neck are modified into C2f_Dual modules based on DualConv. By analyzing the computational cost of standard convolutions and DualConv, it is theoretically shown that, with appropriately chosen group numbers and channel-splitting strategies, the proposed design can substantially reduce FLOPs and the number of parameters while preserving the critical feature representation capability. Experimental results indicate that, compared with mainstream detection algorithms such as Faster R-CNN, YOLOv5, YOLOv6, and YOLOv7, the improved YOLOv8 achieves competitive or slightly better performance in terms of mAP@50, precision, and recall, while GFLOPs, the number of parameters, and model size are significantly reduced, making it more suitable for deployment on resource-constrained embedded platforms. The improved YOLOv8-seg likewise exhibits favorable performance in segmentation accuracy and instance mask quality, and ablation studies further demonstrate that the ShuffleNetV2 backbone and C2f_Dual residual blocks play a key role in achieving a favorable balance between accuracy and efficiency.
Nevertheless, this study still has several limitations. The current dataset size and diversity of working conditions remain to be further improved, and the adaptability of the model to extreme illumination, severe occlusions, and more complex rebar layouts has yet to be fully verified. In addition, the current lightweight design mainly relies on architectural modifications; in future work, model compression techniques such as knowledge distillation, network quantization, and pruning can be incorporated to further reduce computational and storage overhead while essentially preserving accuracy. Moreover, future work may explore integrating the rebar-intersection detection and segmentation results with 3D vision, multi-sensor fusion, and building information modeling (BIM) to establish a multi-source cooperative perception framework for rebar engineering, thereby more closely supporting quality control and path planning of construction robots in intelligent construction processes.

Author Contributions

Conceptualization, R.W. and F.S.; methodology, R.W.; software, L.F.; validation, L.F., R.W. and F.S.; formal analysis, R.W.; investigation, F.S.; resources, K.L.; data curation, J.S.; writing—original draft preparation, R.W.; writing—review and editing, F.S. and J.S.; visualization, F.S.; supervision, L.Z. and Y.S.; project administration, Y.S.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. U21A20122).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the Key Laboratory of Special Purpose Equipment and Advanced Processing Technology, Ministry of Education & Zhejiang Province, Zhejiang University of Technology, for access to facilities and equipment that enabled data collection and experimentation. We also thank our laboratory colleagues for their assistance in constructing the rebar-intersection dataset and performing image annotation. We thank the members of the Structural Vision and Sensing group for their meticulous efforts in specimen preparation, camera rig calibration, and multi-operator cross-checking of the annotation workflow, which helped ensure the reliability and reproducibility of the experimental results.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP: Average Precision
BCE: Binary Cross-Entropy
BIM: Building Information Modeling
C2f: Faster Implementation of CSP Bottleneck with 2 Convolutions
C2f_Dual: C2f Block with DualConv
CBS: Conv–BatchNorm–SiLU
CIoU: Complete Intersection over Union
CNN: Convolutional Neural Network
CSP: Cross Stage Partial
CSPDarknet53: Cross Stage Partial Darknet-53
CUDA: Compute Unified Device Architecture
cuDNN: CUDA Deep Neural Network Library
DFL: Distribution Focal Loss
DualConv: Dual Convolution
F1: F1-Score (Harmonic Mean of Precision and Recall)
FLOPs: Floating-Point Operations
FPN: Feature Pyramid Network
GFLOPs: Giga Floating-Point Operations
IoU: Intersection over Union
MAC: Memory Access Cost
mAP: Mean Average Precision
OpenCV: Open Source Computer Vision Library
PAN: Path Aggregation Network
RC: Reinforced Concrete
RGB-D: Red–Green–Blue plus Depth (Color–Depth Camera)
R-CNN: Region-based Convolutional Neural Network
Faster R-CNN: Faster Region-based Convolutional Neural Network
SPPF: Spatial Pyramid Pooling–Fast
VRAM: Video Random-Access Memory
YOLO: You Only Look Once
YOLOv8L: Large Variant of YOLOv8
YOLOv8-seg: YOLOv8 for Instance Segmentation
YOLOv8L-seg: Large Variant of YOLOv8 for Instance Segmentation

References

1. Ferreiro-Cabello, J.; Fraile-Garcia, E.; Lara-Santillán, P.M.; Mendoza-Villena, M. Assessment and Optimization of a Clean and Healthier Fusion Welding Procedure for Rebar in Building Structures. Appl. Sci. 2020, 10, 7045.
2. Liu, M.; Guo, J.; Deng, L.; Wang, S.; Wang, H. Enhanced Vision-Based 6-DoF Pose Estimation for Robotic Rebar Tying. Autom. Constr. 2025, 171, 105999.
3. Tan, X.; Xiong, L.; Zhang, W.; Zuo, Z.; He, X.; Xu, Y.; Li, F. Rebar-Tying Robot Based on Machine Vision and Coverage Path Planning. Robot. Auton. Syst. 2024, 182, 104826.
4. Guo, W.; Liang, G.; Ren, S.; Zeng, C. Automatic Crack Detection Method for High-Speed Railway Box Girder Based on Deep Learning Techniques and Inspection Robot. Structures 2024, 68, 107116.
5. Gong, Y.; Yang, K.; Seo, J.; Lee, J.G. Wearable Acceleration-Based Action Recognition for Long-Term and Continuous Activity Analysis in Construction Site. J. Build. Eng. 2022, 52, 104448.
6. Mahamivanan, H.; Matthews, J.; Love, P.E.D.; Nasirzadeh, F. Toward Accurate Detection of Small Objects in Rail Construction: A Deep Learning Perspective. Eng. Appl. Artif. Intell. 2025, 160, 111977.
7. Cai, Z.; Chebil, G.A.; Tan, Y.; Kessler, S.; Fottner, J. Deep Learning-Based on-Site Identification and Volume Measurement of Bulk Material in Construction Industry. Eng. Appl. Artif. Intell. 2026, 163, 112797.
8. Xu, J.; Pan, W. Deep Learning-Based Object Detection for Dynamic Construction Site Management. Autom. Constr. 2024, 165, 105494.
9. Zeng, L.; Guo, S.; Wu, J.; Markert, B. Autonomous Mobile Construction Robots in Built Environment: A Comprehensive Review. Dev. Built Environ. 2024, 19, 100484.
10. Du, S.; Du, M.; Gao, Y.; Yang, M.; Hu, F.; Weng, Y. Optimized Motion Planning for Mobile Robots in Dynamic Construction Environments with Low-Feature Mapping and Pose-Based Positioning. Autom. Constr. 2025, 177, 106334.
11. Chen, P.; Huang, B.; Yuan, S.; Jin, Y.; Hou, J. Automated Mobile Robot-Based Detection of in-Process Tunnel Lining Deformation during Construction. Autom. Constr. 2025, 180, 106560.
12. Jin, J.; Zhang, W.; Li, F.; Li, M.; Shi, Y.; Guo, Z.; Huang, Q. Robotic Binding of Rebar Based on Active Perception and Planning. Autom. Constr. 2021, 132, 103939.
13. Feng, R.; Jia, Y.; Wang, T.; Gan, H. Research on the System Design and Target Recognition Method of the Rebar-Tying Robot. Buildings 2024, 14, 838.
14. Chua, W.P.; Cheah, C.C. Deep-Learning-Based Automated Building Construction Progress Monitoring for Prefabricated Prefinished Volumetric Construction. Sensors 2024, 24, 7074.
15. Helian, B.; Huang, G.; Geimer, M. Temporal Sequence-Based Object Detection and Action Recognition for Mobile Machinery on Construction Sites. Adv. Eng. Inform. 2025, 68, 103691.
16. Xiao, H.; Yang, B.; Lu, Y.; Chen, W.; Lai, S.; Gao, B. Automated Detection of Complex Construction Scenes Using a Lightweight Transformer-Based Method. Autom. Constr. 2025, 177, 106330.
17. Fan, Z.; Lu, J.; Qiu, B.; Jiang, T.; An, K.; Josephraj, A.N.; Wei, C. Automated Steel Bar Counting and Center Localization with Convolutional Neural Networks. arXiv 2019, arXiv:1906.00891.
18. Xi, J.; Gao, L.; Zheng, J.; Wang, D.; Tu, C.; Jiang, J.; Miao, Y.; Zhong, J. Automatic Spacing Inspection of Rebar Spacers on Reinforcement Skeletons Using Vision-Based Deep Learning and Computational Geometry. J. Build. Eng. 2023, 79, 107775.
19. Kardovskyi, Y.; Moon, S. Artificial Intelligence Quality Inspection of Steel Bars Installation by Integrating Mask R-CNN and Stereo Vision. Autom. Constr. 2021, 130, 103850.
20. Mangal, M.; Li, M.; Gan, V.J.L.; Cheng, J.C.P. Automated Clash-Free Optimization of Steel Reinforcement in RC Frame Structures Using Building Information Modeling and Two-Stage Genetic Algorithm. Autom. Constr. 2021, 126, 103676.
21. Wang, H.; Ye, Z.; Wang, D.; Jiang, H.; Liu, P. Synthetic Datasets for Rebar Instance Segmentation Using Mask R-CNN. Buildings 2023, 13, 585.
22. Li, Y.; Lu, Y.; Chen, J. A Deep Learning Approach for Real-Time Rebar Counting on the Construction Site Based on YOLOv3 Detector. Autom. Constr. 2021, 124, 103602.
23. Cheng, B.; Deng, L. Vision Detection and Path Planning of Mobile Robots for Rebar Binding. J. Field Robot. 2024, 41, 1864–1886.
24. Zhao, X.; Yang, Z.; Zhao, H. DCS-YOLOv8: A Lightweight Context-Aware Network for Small Object Detection in UAV Remote Sensing Imagery. Remote Sens. 2025, 17, 2989.
25. Yang, S.; Chen, L.; Wang, J.; Jin, W.; Yu, Y. A Novel Lightweight Object Detection Network with Attention Modules and Hierarchical Feature Pyramid. Symmetry 2023, 15, 2080.
26. Liu, H.; Yao, D.; Yang, J.; Li, X. Lightweight Convolutional Neural Network and Its Application in Rolling Bearing Fault Diagnosis under Variable Working Conditions. Sensors 2019, 19, 4827.
27. Farman, H.; Nasralla, M.M.; Khattak, S.B.A.; Jan, B. Efficient Fire Detection with E-EFNet: A Lightweight Deep Learning-Based Approach for Edge Devices. Appl. Sci. 2023, 13, 12941.
28. Zhang, X.; Zhang, X.; Jiang, H.; Ma, B.; Si, Y. Efficient Rebar Bundling: Vision Robotics Innovations. J. Chin. Inst. Eng. 2024, 47, 312–324.
29. Duan, H.; Yu, M.; Ai, T.; Zhu, M.; Jiang, H.; Guo, S. YOLO-FAS: A Lightweight Model for Detecting Rebar Intersections Location and Tying Status. Neurocomputing 2025, 624, 129485.
30. Zheng, Y.; Zhou, G.; Lu, B. A Multi-Scale Rebar Detection Network with an Embedded Attention Mechanism. Appl. Sci. 2023, 13, 8233.
31. Zhu, A.; Xie, J.; Wang, B.; Guo, H.; Guo, Z.; Wang, J.; Xu, L.; Zhu, S.; Yang, Z. Lightweight Defect Detection Algorithm of Tunnel Lining Based on Knowledge Distillation. Sci. Rep. 2024, 14, 27178.
32. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Network. arXiv 2023, arXiv:2303.03667.
33. Zhao, Y.; Sun, F.; Wu, X. FEB-YOLOv8: A Multi-Scale Lightweight Detection Model for Underwater Object Detection. PLoS ONE 2024, 19, e0311173.
Figure 1. Rebar grid image acquisition site.
Figure 2. Host computer for image acquisition.
Figure 3. Annotation examples of the object detection dataset.
Figure 4. Object detection annotation results.
Figure 5. Annotation examples of the image segmentation dataset.
Figure 6. Overall architecture of the YOLOv8 network.
Figure 7. Structural diagrams of the core modules in the YOLOv8 backbone. (a) CBS module; (b) C2f module; (c) SPPF module; (d) Bottleneck module.
Figure 8. FPN and PAN structure diagram.
Figure 9. Detection head principle.
Figure 10. Principle of the detection head in image segmentation.
Figure 11. Network structure of the ShuffleNetV2 basic units. (a) Standard basic unit; (b) downsampling unit.
Figure 12. Depthwise separable convolution structure.
Figure 13. Channel shuffle operation.
Figure 14. DualConv convolutional filter design.
Figure 15. Improved YOLOv8 network architecture.
Figure 16. Training performance metrics for the improved YOLOv8: (a) mAP@50; (b) precision; (c) F1-score; (d) recall.
Figure 17. Object detection results visualization.
Figure 18. Training performance metrics for the improved YOLOv8-seg: (a) mAP@50; (b) precision; (c) F1-score; (d) recall.
Figure 19. Image segmentation results visualization.
Table 1. Server software and hardware configuration.

Configuration Item | Specification
CPU | Intel® Xeon® Gold 6133
GPU | NVIDIA GeForce RTX A5000 × 2
GPU memory (VRAM) | 24 GB × 2
OS | Ubuntu 22.04 LTS (Desktop)
Deep learning framework | PyTorch 1.9.0
Python version | 3.8.19
OpenCV | 4.9.0.80
CUDA | 11.0
cuDNN | v8.05.39
Table 2. Comparison of experimental results.

Model | mAP@50 | GFLOPs | Model Size (MB)
Faster R-CNN | 0.9435 | 131.5 | 91.6
YOLOv5 | 0.9878 | 135.3 | 101
YOLOv6 | 0.9809 | 391.9 | 211
YOLOv7 | 0.9691 | 105.2 | 71.3
This Paper | 0.9774 | 69.3 | 40.8
Table 3. Object detection ablation study results.

YOLOv8L | Improved Backbone | Improved C2f | mAP@50 | GFLOPs | Model Size (MB)
✓ | – | – | 0.9888 | 165.4 | 83.6
✓ | – | ✓ | 0.9828 | 134.5 | 68.8
✓ | ✓ | – | 0.9789 | 81.1 | 47.9
✓ | ✓ | ✓ | 0.9774 | 69.3 | 40.8
Table 4. Instance segmentation ablation study results.

YOLOv8L-seg | Improved Backbone | Improved C2f | mAP@50 | GFLOPs | Model Size (MB)
✓ | – | – | 0.9633 | 191.4 | 61.7
✓ | – | ✓ | 0.9588 | 172.3 | 54.0
✓ | ✓ | – | 0.9464 | 136.5 | 52.3
✓ | ✓ | ✓ | 0.9437 | 132.8 | 50.2