The proposed pipeline, as illustrated in
Figure 1, aims to reconstruct the geometry of the bridge cross-section from pixel-based input data. It begins with a cross-sectional view in pixel format, segmented from a larger construction drawing. This segmentation step lies outside the scope of the pipeline, as it has already been addressed in prior work by Mafipour et al. [
33] and Peng et al. [
76]; alternatively, the cross-sectional view can be cropped manually. Within the segmented view, an object detection model localizes bridge cross-sections and identifies the corresponding cross-section types, as described in
Section 3.1. The resulting bounding box then guides a segmentation step, explained in
Section 3.2, where a pre-trained segmentation network generates a binary mask indicating which pixels belong to the cross-section. During post-processing, detailed in
Section 3.3, this mask is converted into a polygon representation. Next, a parametric template is fitted to the polygon to reconstruct the cross-section geometry. A global optimization algorithm adjusts the template parameters by minimizing a loss function, as discussed in
Section 3.4. These parameters can then be used in modeling tools such as Allplan Bridge to generate the cross-section geometry, as outlined in
Section 3.5.
3.1. Cross-Section Detection
A single cross-sectional view may contain multiple bridge cross-sections, as is often the case in road bridges with separate superstructures for each direction of travel. Each cross-section requires separate parameter extraction. Because their positions can vary across the view, accurate detection is essential to ensure reliable processing in subsequent steps.
To enable individual detection, this study employs the object detection network YOLOv8 (
https://github.com/ultralytics/ultralytics (accessed on 15 December 2025)), which was selected for its high detection accuracy, well-maintained documentation, and ease of use. As one of the latest versions in the YOLO model family, YOLOv8 builds on several architectural improvements. The original model, introduced by Redmon et al. [
78], pioneered real-time object detection using a fully Convolutional Neural Network (CNN) that simultaneously predicted bounding boxes and class labels. Over time, the model has been refined through several key innovations: Anchor boxes were introduced to improve localization [
79], mosaic data augmentation and Self-Adversarial Training (SAT) enhanced training efficiency [
80], and compound scaling provided more effective network scaling [
81].
The architecture of YOLOv8 is illustrated in
Figure 2 and follows the conventional design of single-stage detection networks, consisting of three main components [
82]: backbone, neck, and head. The backbone, based on a modified Darknet CNN, extracts visual features from the input image. These features are then processed by the neck, which fuses and refines them, acting as a bridge between the backbone and the head. In YOLOv8, this is implemented using a Path Aggregation Network (PANet), which enhances detection across different object scales. Finally, the head receives the fused features and makes predictions using two branches: one for object classification and one for bounding box regression. To support multi-scale detection, YOLOv8 employs three parallel detection heads, each operating at a different scale.
The model processes each drawing view as a complete image. Although patch-based processing is commonly applied to large-format drawings to improve the detection of small-scale objects [
83,
84,
85], this strategy is not required in the present study. This is because the target cross-sections occupy a substantial portion of each drawing view, enabling reliable detection at full-view scale without spatial segmentation.
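As a sketch of this detection step: with the Ultralytics package, full-view inference would look roughly as below. The weight and image file names are hypothetical, and the ultralytics-specific calls are shown only as comments so the snippet stays self-contained; a small helper for clamping predicted boxes to the view extent is included, since the boxes are used downstream as crop regions and prompts.

```python
# With the Ultralytics package, full-view detection would look roughly like
# this (not executed here; it requires ultralytics plus trained weights, and
# the file names are hypothetical):
#
#   from ultralytics import YOLO
#   model = YOLO("cross_sections.pt")
#   result = model("cross_section_view.png")[0]
#   boxes = result.boxes.xyxy.tolist()                 # (x1, y1, x2, y2)
#   types = [result.names[int(c)] for c in result.boxes.cls]

def clip_boxes(boxes, width, height):
    """Clamp (x1, y1, x2, y2) boxes to the image extent.

    Clipping is a common safeguard before the boxes are used as crop
    regions or as segmentation prompts in later steps.
    """
    return [
        (max(0, x1), max(0, y1), min(width, x2), min(height, y2))
        for x1, y1, x2, y2 in boxes
    ]
```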
3.2. Cross-Section Segmentation
Using the bounding boxes predicted by the YOLOv8 model, semantic segmentation is applied to derive pixel-level masks of the cross-sections. Traditional segmentation networks such as Mask R-CNN [
86] or YOLOv7-Mask [
81] can detect and segment objects in a single pass. However, they require training datasets containing both bounding boxes and corresponding pixel-level masks. While bounding boxes can be annotated efficiently, producing accurate segmentation masks remains significantly more labor-intensive [
26]. To overcome this limitation, this study adopts an alternative method that eliminates the need for time-consuming mask annotations by leveraging the pre-trained Segment Anything Model (SAM) [
87]. SAM’s zero-shot capability enables accurate segmentation of bridge cross-sections without any additional training. This approach significantly reduces development time and facilitates rapid adaptation to other cross-section types.
The architecture of the SAM model is depicted in
Figure 3. SAM processes an image to predict object masks based on additional guidance. Because the model is agnostic to object classes, it requires an additional input, referred to as a prompt, to indicate which object in the image should be segmented. This prompt is processed alongside the image and can take various forms, including sparse inputs such as keypoints, bounding boxes, or text, as well as dense inputs like masks. In this study, the bounding boxes predicted by the YOLOv8 model serve as prompts. Preliminary tests using keypoints, specifically the center point of each bounding box, were also conducted, but they did not yield satisfactory results (cf.
Section 4.5.1).
SAM consists of three main components: an image encoder, a prompt encoder, and a mask decoder. The image encoder generates feature embeddings from the input image using a pre-trained backbone based on the masked autoencoder framework [
88], implemented with a Vision Transformer (ViT) architecture [
89]. Notably, image encoding is performed only once per image, and the resulting embeddings can be reused for multiple object prompts.
The prompt encoder processes these input prompts using techniques tailored to their type. Geometric prompts, i.e., keypoints or bounding boxes, are encoded through positional encoding combined with learned embeddings. Bounding boxes are defined by the coordinates of their top-left and bottom-right corners. Each keypoint prompt includes coordinates and an associated label indicating whether the point belongs to the object. Text prompts are handled by the CLIP encoder [
90], while mask prompts are processed using a custom CNN in combination with learned embeddings. Since this study does not involve text or mask prompts, these components are omitted in
Figure 3.
Lastly, the mask decoder combines the image and prompt embeddings to generate the segmentation mask. It employs a transformer-based architecture with cross-attention mechanisms to fuse the two inputs and produce accurate predictions. However, prompt ambiguity can occur. For instance, a bounding box around an object may refer to the entire object or only a part of it. To account for such cases, SAM generates three candidate masks per prompt, each with an associated confidence score. In this study, only the mask with the highest confidence is processed.
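Under these assumptions, prompting SAM with a YOLOv8 bounding box and keeping only the highest-confidence of the three candidate masks can be sketched as follows. The segment-anything calls are shown as comments (they require the package and a downloaded checkpoint, and the file names are hypothetical), while the selection helper itself is runnable:

```python
import numpy as np

# With the official segment-anything package, box-prompted prediction would
# look roughly like this (not executed here):
#
#   from segment_anything import sam_model_registry, SamPredictor
#   sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
#   predictor = SamPredictor(sam)
#   predictor.set_image(image)                  # image encoded only once
#   masks, scores, _ = predictor.predict(
#       box=np.array([x1, y1, x2, y2]),         # YOLOv8 box as prompt
#       multimask_output=True,                  # three candidate masks
#   )

def best_mask(masks, scores):
    """Keep only the candidate mask with the highest confidence score."""
    return masks[int(np.argmax(scores))]
```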
3.4. Template-Based Parameter Extraction
To extract cross-section parameters, a parametric template is fitted to the polygon obtained from the segmentation mask. Since cross-sections can exhibit a wide variety of shapes, reflecting differences in structural function and design, multiple templates are defined, each tailored to approximate a specific cross-section type. Specifically, analysis of the dataset introduced in
Section 4.1 identified three common cross-section types: a simple slab girder, a T-girder, and a tapered T-girder. However, the proposed set of cross-sections can be readily extended to include additional types, increasing the method’s flexibility. Each template is defined by a set of parameters listed in
Table 1. Parameters are reused across templates to simplify design and maintenance without affecting the overall process. The cross-section types and their corresponding parametric templates are illustrated in
Figure 5.
Because the templates can be freely positioned within the two-dimensional pixel space, parameters P1 and P2 define their horizontal and vertical placement, respectively, relative to the top-left corner. Parameters P3 to P5 control the heights of structural members such as the flange and web, while P6 to P8 specify their widths, thereby defining the shape of the cross-section. This study assumes that cross-sections are vertically symmetric and aligned without rotation. This assumption holds for most cases in the dataset; however, the approach does not depend on this restriction. The parameter set and associated templates can be extended to represent asymmetric or rotated cross-sections without requiring fundamental changes to the method.
To extract the true parameters of the cross-section depicted in the view, the template is adjusted to best match the shape derived from the segmentation mask. For any given configuration $(P_1, P_2, \ldots, P_n)$, the corresponding template polygon $T$ is calculated as
$$T = f(P_1, P_2, \ldots, P_n)$$
where $f$ is the generator function that produces the polygonal template. The specific form of $f$, along with the number of parameters $n$ and the number of vertices $m$ of $T$, is determined by the chosen cross-section type. The geometric similarity between this polygon $T$ and the simplified segmentation polygon $S$ is then quantified to assess how well they match. Based on this, the direction of optimization can be inferred, that is, how the parameters should be adjusted to improve the match. A well-designed geometric similarity metric is therefore essential for achieving both efficient optimization and accurate reconstruction.
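As an illustration, a generator function f for the T-girder type might be realized as below. The vertex layout is an assumption made for demonstration only (a full-width top flange of height P3 + P4 over a centred web of height P5); the authors' actual templates follow Figure 5:

```python
def t_girder_template(p):
    """One plausible realization of the generator f for a T-girder template.

    The vertex layout here is assumed for illustration: a full-width flange
    of height P3 + P4 above a centred web of height P5. `p` maps parameter
    names "P1".."P7" to pixel values; P1/P2 place the top-left corner.
    """
    x0, y0 = p["P1"], p["P2"]              # placement in pixel space
    w_total = 2 * p["P6"] + p["P7"]        # two flange widths plus web width
    y_flange = y0 + p["P3"] + p["P4"]      # underside of the flange
    y_web = y_flange + p["P5"]             # underside of the web
    return [
        (x0, y0), (x0 + w_total, y0),           # top edge
        (x0 + w_total, y_flange),               # right flange side
        (x0 + p["P6"] + p["P7"], y_flange),     # right flange underside
        (x0 + p["P6"] + p["P7"], y_web),        # right web side
        (x0 + p["P6"], y_web),                  # web underside
        (x0 + p["P6"], y_flange),               # left web side
        (x0, y_flange),                         # left flange side
    ]
```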
In this work, the Complete IoU (CIoU) loss [
93] is used, as it captures relevant geometric factors: the overlap between the polygons, the distance between their centroids, and the ratio of their widths and heights. Furthermore, CIoU is invariant to resolution, ensuring consistent evaluation across cross-sectional views of varying scales. A perfectly aligned polygon pair yields a CIoU loss of zero, indicating complete geometric agreement. The CIoU loss is computed as
$$\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{d^2}{c^2} + \alpha V \tag{3}$$
In this formula, the Intersection over Union (IoU) measures the overlap between the two polygons, while the variables $d$, $c$, $\alpha$, and $V$ capture additional geometric relationships, each term addressing a distinct aspect of geometric similarity.
Figure 6 provides an overview of these factors.
The first factor, overlap, is measured using the IoU:
$$\mathrm{IoU} = \frac{A_{T \cap S}}{A_T + A_S - A_{T \cap S}}$$
Here, $A_{T \cap S}$ denotes the intersection area between the two polygons, $A_T$ is the area of the parametric template polygon, and $A_S$ is the area of the polygon derived from the segmentation mask. The IoU quantifies the extent to which the two polygons overlap, as visualized in
Figure 6a. However, when the polygons do not intersect, the IoU returns a score of zero, failing to distinguish between polygons that are closely positioned and those that are far apart. As a result, it provides no information on how the parameters should be adjusted to improve geometric similarity.
To address this limitation, the CIoU loss incorporates the distance between the polygons’ centroids, as initially introduced in the Distance IoU (DIoU) loss by Zheng et al. [
94]. In Equation (3), the variables $d$ and $c$ represent this spatial relationship: $d$ denotes the Euclidean distance between the centroids of the two polygons, and, to normalize this distance, $c$ is defined as the length of the diagonal of the smallest enclosing axis-aligned bounding box containing both polygons, as illustrated in
Figure 6b.
Despite the improvements introduced by the distance term, one limitation remains: when one polygon fully encloses the other and their centroids coincide, the centroid distance is near zero, and the IoU only suggests increasing the area of the enclosing polygon without indicating how to adjust its geometry, such as making it wider or taller. As a result, the optimization process lacks a clear direction for modifying the parameters to improve geometric similarity. To address this, Zheng et al. [
93] introduced an additional penalty term based on the aspect ratio of the bounding boxes (cf.
Figure 6c). This refinement is included in Equation (3) through the parameters $\alpha$ and $V$, defined in Equations (5) and (6), respectively:
$$\alpha = \begin{cases} 0, & \mathrm{IoU} < 0.5 \\ \dfrac{V}{(1 - \mathrm{IoU}) + V}, & \mathrm{IoU} \ge 0.5 \end{cases} \tag{5}$$
$$V = \frac{4}{\pi^2}\left(\arctan\frac{w_S}{h_S} - \arctan\frac{w_T}{h_T}\right)^2 \tag{6}$$
Here, $w_S$, $h_S$ and $w_T$, $h_T$ refer to the widths and heights of the minimum enclosing bounding boxes of the polygon derived from the segmentation mask and of the template polygon, respectively.
The parameter $\alpha$ controls the influence of the aspect ratio penalty based on the current IoU score. As proposed by Zheng et al. [
94], this penalty is applied only when the IoU exceeds a threshold of 0.5, reflecting the rationale that proportion differences become meaningful only when the polygons already share sufficient overlap. When they are far apart, spatial proximity is the more relevant factor; in low-overlap cases, the CIoU loss therefore effectively reduces to the DIoU loss.
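A minimal sketch of this loss, written for axis-aligned boxes so that it stays self-contained (the paper evaluates the same terms on polygons, using polygon areas and centroids):

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap term: Intersection over Union of the two boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # Distance term: squared centroid distance d^2, normalized by the squared
    # diagonal c^2 of the smallest box enclosing both inputs.
    d2 = ((ax1 + ax2 - bx1 - bx2) / 2) ** 2 + ((ay1 + ay2 - by1 - by2) / 2) ** 2
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    # Aspect-ratio term V, gated by alpha: it contributes only once the
    # overlap is sufficient (IoU >= 0.5); otherwise the loss reduces to DIoU.
    v = (4 / math.pi ** 2) * (
        math.atan((ax2 - ax1) / (ay2 - ay1)) - math.atan((bx2 - bx1) / (by2 - by1))
    ) ** 2
    alpha = v / ((1.0 - iou) + v + 1e-9) if iou >= 0.5 else 0.0
    return 1.0 - iou + d2 / c2 + alpha * v
```

For identical boxes the loss is zero, and for disjoint boxes it exceeds one, with the normalized centroid distance supplying the optimization direction that the IoU alone cannot provide.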
To reduce the parameter search space during optimization, providing an initial guess within a plausible range is beneficial. This improves computational efficiency by reducing the effort needed to locate the loss minimum. In this study, the axis-aligned bounding box of the segmentation polygon is used to initialize the parameters. A more advanced method for estimating the initial parameters was also evaluated; however, it did not improve convergence speed or result quality. A comparison of both approaches is presented in
Section 4.5.2.
The bounding box-based approach was therefore adopted, with its coordinates used to estimate the template parameters as follows. The top-left corner of the bounding box provides the initial estimate for the offset parameters P1 and P2. For the slab girder, the height parameters P3 and P4 are assumed equal, each set to half the bounding box height. The flange width parameter P6 is initialized as one-eighth of the bounding box width, assuming that the total flange width accounts for one-quarter and is symmetrically distributed. The web width P7 is assumed to occupy the remaining three-quarters of the width.
For the T-girder, the web height P5 is set to half the bounding box height, with the total flange height accounting for the other half. Therefore, the flange height parameters, P3 and P4, are each set to one-quarter of the height. The web width P7 is estimated as half the bounding box width, while the remaining half is allocated to the flanges, making each flange width parameter P6 one-quarter of the total width.
The tapered T-girder builds on the assumptions used for the T-girder, maintaining the same height proportions for
P3,
P4, and
P5. However, the width is divided differently: the web width
P7 and flange width
P6 each occupy one-quarter of the bounding box width, while the tapered web parameter
P8 is assigned the remaining one-quarter, resulting in a value of one-eighth of the total width. An overview of the initial parameter estimates is presented in
Table 2.
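The initialization rules above can be collected into a small helper. The type labels ("slab", "t_girder", "tapered_t_girder") are illustrative names for this sketch, not the class labels used in the paper:

```python
def initial_params(cs_type, bbox):
    """Initial template parameters from a detection box (x1, y1, x2, y2),
    following the width/height fractions described in the text (cf. Table 2).
    """
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    p = {"P1": x1, "P2": y1}              # top-left corner as offset
    if cs_type == "slab":
        p.update(P3=h / 2, P4=h / 2, P6=w / 8, P7=3 * w / 4)
    elif cs_type == "t_girder":
        p.update(P3=h / 4, P4=h / 4, P5=h / 2, P6=w / 4, P7=w / 2)
    elif cs_type == "tapered_t_girder":
        p.update(P3=h / 4, P4=h / 4, P5=h / 2, P6=w / 4, P7=w / 4, P8=w / 8)
    else:
        raise ValueError(f"unknown cross-section type: {cs_type}")
    return p
```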
For each initial parameter guess, a search range of ±500 pixels is defined, assuming that the optimal configuration lies within these bounds. These ranges are passed to the optimization algorithm, which searches for the parameter values that minimize the loss function defined in Equation (
3). The optimization is performed using the Dual Annealing optimization algorithm [
95], as implemented in the SciPy package (
https://scipy.org/ (accessed on 15 December 2025)). The template used for optimization is selected based on the cross-section classification produced by the YOLOv8 model. Consequently, introducing new cross-section template types requires retraining the YOLOv8 classifier. An alternative approach would fit all available template types to the segmentation mask and select the configuration that yields the best fit. In this approach, the YOLOv8 model would only be used to localize cross-sections, making the method independent of the number of supported template types. This design may simplify training and improve robustness, particularly when the set of templates is extended.
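A sketch of the fitting step with SciPy's `dual_annealing` is shown below. Since the CIoU objective requires the full polygon machinery, a stand-in quadratic loss with a known minimum is used so the snippet remains self-contained; the bounds follow the ±500-pixel search range described above:

```python
import numpy as np
from scipy.optimize import dual_annealing

# The real objective is the CIoU loss between the generated template polygon
# and the segmentation polygon; a stand-in quadratic loss with a known
# minimum is used here purely for illustration.
target = np.array([120.0, 80.0, 40.0, 40.0])   # "true" parameter values

def loss(params):
    return float(np.sum((params - target) ** 2))

# Initial guess (in practice derived from the bounding box, cf. Table 2)
# and the +/-500 px search range around it.
x0 = np.array([100.0, 100.0, 50.0, 50.0])
bounds = [(g - 500.0, g + 500.0) for g in x0]

result = dual_annealing(loss, bounds=bounds, x0=x0, maxiter=200)
fitted_params = result.x                       # optimized parameter values
```

Because Dual Annealing combines stochastic global search with local refinement, the minimum of this smooth stand-in loss is located essentially exactly.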