Visual Guidance for Construction Equipment: A Masked-Attention Multi-Scale Transformer for Vibrated Concrete Recognition

Lei, Lei; Ji, Yu; Zhou, Yihong; Zhao, Chunju; Wang, Fang; Zhou, Huawei; Liang, Zhipeng

doi:10.3390/app16115479


by Lei Lei ¹,², Yu Ji ¹,², Yihong Zhou ¹,²,*, Chunju Zhao ¹,², Fang Wang ¹,², Huawei Zhou ¹,² and Zhipeng Liang ¹,²

¹ Key Laboratory of Intelligent Health Perception and Ecological Restoration of Rivers and Lakes, Ministry of Education, Hubei University of Technology, Wuhan 430068, China

² School of Civil Engineering, Architecture and Environment, Hubei University of Technology, Wuhan 430068, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5479; https://doi.org/10.3390/app16115479

Submission received: 30 April 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 1 June 2026

(This article belongs to the Topic Advancing Construction Safety and Health: Innovations and Strategies)


Abstract

The quality of concrete consolidation is critical to the structural safety of dams. However, current vibration machines lack reliable information on the distribution of concrete vibration states within a pouring block. As a result, machinery often repeatedly traverses previously vibrated regions, which compromises the internal structural integrity of the concrete. Accurately recognizing these vibrated regions is challenging because vibration-induced particle rearrangement and concrete flowability lead to high inter-class similarity, minimal color variance, and blurred boundaries in practical construction environments. To address this issue, this study proposes a recognition method based on a Transformer architecture with masked attention. This design enhances feature representation, focusing on target regions and ambiguous boundaries while suppressing background noise. Experimental results demonstrate the robustness of the proposed method, which achieves a mean Average Precision of approximately 0.9. This result exceeds those of the benchmark models YOLACT, YOLO11, CondInst, and SOLOv2 by roughly 97%, 53%, 17%, and 12%, respectively. In practice, embedding concrete state distribution information into machinery provides visual guidance for path planning, enabling lean control from data acquisition to physical execution. Future research will integrate these findings with construction equipment to enable autonomous operation and decision-making for intelligent quality control.

Keywords: quality monitoring; vibration region recognition; image segmentation; masked attention; transformer architecture

1. Introduction

As primary components of hydraulic engineering projects, dams play a critical role in flood control, power generation, and irrigation [1]. Consequently, ensuring dam quality is essential for both public safety and economic development [2]. Specifically, the quality control of the concrete pouring process directly determines the ultimate strength, durability, and structural stability of the dam [3]. Currently, research on concrete vibration quality control primarily evaluates the immediate efficacy of the operational process. Existing studies extensively utilize sensors or cameras installed directly on the vibration frames to acquire and analyze key performance indicators, such as surface conditions [4], vibration duration [5], and spatial coverage [6]. Despite these advancements, the continuous monitoring of vibrated concrete during its sensitive early phase of consolidation and curing has been largely overlooked. In practical dam construction, the randomness of vibrator trajectories, compounded by the visual blind spots of human operators, frequently causes machines to repeatedly traverse already-vibrated regions. These redundant passes impose severe external dynamic loads that induce the unintended rearrangement of aggregate particles. During the critical setting stage of concrete, such disturbances severely disrupt the continuity and adhesion within the cement micro-structure and accelerate the initiation and propagation of internal defects and micro-cracks [7]. In addition, intermittent dynamic disturbances from construction equipment reduce the hydration rate of cementitious minerals, leading to a decreased formation of critical hydration products, such as calcium silicate hydrate gel, hydrated calcium aluminate crystals, and calcium hydroxide crystals [8]. These components are essential for maintaining the structural integrity of concrete.
Fundamentally, preventing redundant traversal requires the precise recognition and distribution feedback of vibrated regions across the pouring block. Furthermore, with the rapid advancement of intelligent construction equipment, concrete vibration machines are evolving from traditional automated devices into intelligent agents equipped with advanced perception and autonomous decision-making capabilities [9], so bridging this informational gap is becoming increasingly urgent. Therefore, accurately mapping the distribution of vibrated regions is no longer merely a monitoring requirement, but a fundamental prerequisite for enabling higher-level autonomy in operation planning and trajectory optimization for intelligent construction.

To facilitate effective on-site information feedback, deep learning-based image recognition methods have emerged as promising tools for construction quality control [10]. These algorithms extract critical construction parameters directly from site videos and image sequences, enabling objective quality monitoring [11]. Specifically, object detection models, such as YOLOv4, have been utilized to track hammer trajectories, thereby automating the derivation of tamping points, durations, and counts [12]. To enhance environmental robustness, Yang et al. [13] combined YOLO-Fastestv2 with a time series model, effectively mitigating visual disturbances from fog and rain to ensure continuous and reliable feedback. Furthermore, integrating deep learning algorithms with positioning techniques enables the synchronous extraction of multiple key indicators for dynamic compaction monitoring [14]. In the context of concrete vibration, deep learning architectures have been adapted to evaluate physical states and temporal parameters. For instance, a hybrid framework employing YOLOv8 and SENet-50 has been introduced to assess the completion degree of concrete vibration, facilitating highly precise temporal control over the vibration process [15]. To comprehensively assess concrete compaction quality, deep residual networks (e.g., ResNet-50) have been integrated with Internet of Things (IoT) technologies to capture physical metrics such as insertion depth and vibration duration [16]. Additionally, a multi-model framework based on convolutional neural networks (CNNs) has been employed for on-site worker identification and localization; these results can be integrated with pose estimation and action recognition to generate operational instructions, thereby standardizing on-site practices [17]. Despite these advancements, significant challenges remain in applying CNN architectures to real-world raw images collected from dam construction pouring surfaces.
The inherent flowability of concrete [18] and the rearrangement of particles during vibration [19] inevitably lead to ambiguous edge features between different concrete states. Compounding this physical challenge are the algorithmic limitations of CNNs. Their restricted receptive fields hinder the effective fusion of local and global features. Furthermore, convolution operations inherently cause a loss of edge pixel information. This drastically reduces the model’s accuracy in recognizing vibrated regions, thereby exacerbating the difficulty of accurately acquiring their distribution.

To overcome the inherent limitations of CNNs in capturing global contextual information, recent state-of-the-art studies have explored Transformer-based end-to-end architectures for instance segmentation tasks. Unlike CNN-based models constrained by local receptive fields, Transformer-based models such as ChangeFormer [20] employ an architecture with non-local self-attention. Thus, Transformer architectures excel at modeling long-range dependencies across different spatial locations within an image, thereby facilitating the extraction of global contextual information [21]. In scenarios characterized by complex backgrounds and large-scale variations, they effectively mitigate the over-segmentation and under-segmentation of indistinct boundaries, significantly enhancing edge detection accuracy and robustness [22]. However, Transformer-based architectures exhibit inherent limitations in recognizing fine-grained texture features [23]. In concrete pouring scenarios, this deficiency makes it exceedingly difficult to distinguish the subtle textural differences between vibrated and non-vibrated regions, ultimately inducing frequent false and missed detections.

This study is driven by the need for a practical solution to accurately map the distribution of vibrated concrete regions on the pouring block surface, thereby preventing structural damage caused by repeated traversal. We recognize that a hybrid architecture combining CNN and Transformer is essential for processing the complex raw images acquired from construction sites [24]: (1) CNNs leverage their inherent translation invariance to effectively extract local features of vibrated regions, enhancing the model’s ability to capture fine-grained textural details within shallow network layers [25]; (2) Transformers utilize their self-attention mechanisms to strengthen the extraction of ambiguous edge features, exploiting long-range dependencies to ensure the complete recognition of discretely distributed vibrated regions [26]. Notably, dynamic on-site interference exists in construction environments, such as moving equipment, visual occlusions, and illumination variations caused by uneven lighting. These factors further exacerbate the difficulty of distinguishing between vibrated and non-vibrated regions, which inherently share high textural similarity. Consequently, such interference degrades the predictive performance of existing segmentation models and exacerbates the challenge of accurately mapping the distribution of vibrated regions.

To address these challenges, this paper proposes a novel end-to-end segmentation model featuring a masked attention-driven Transformer architecture [27]. Specifically, building upon the CNN-Transformer architecture, the masked attention mechanism leverages the generated masks to allocate differential attention thresholds across distinct regions. This strategy strictly constrains the attention scope to target-specific regions, thereby filtering and mitigating interference from complex background noise. Such noise includes non-target construction factors (e.g., machinery, tools, and personnel) as well as the high inter-class similarity between vibrated and non-vibrated regions. Simultaneously, a Multi-Scale Deformable Attention (MSDeformAttn) module is employed to fuse feature maps across different scales. Hence, the proposed model achieves a substantially enhanced capability to capture the blurred edge and textural features of vibrated concrete regions. In summary, the experiments demonstrate that the proposed end-to-end framework achieves superior segmentation accuracy in the concrete pouring process. By overcoming texture similarities and boundary ambiguities, this approach provides reliable automated feedback on the distribution of vibrated concrete from raw site images. Ultimately, this work establishes a robust visual data foundation for trajectory optimization, helping to prevent structural damage caused by repeated traversal in intelligent dam construction projects.

The remainder of this paper is organized as follows. Section 2 provides a review of image segmentation methods based on CNN and Transformer frameworks. Section 3 details the dataset construction for the vibrated concrete region recognition task and presents the three-part core architecture of the proposed model. Section 4 validates the performance of the proposed algorithm through comparative experiments with various mainstream segmentation models. Finally, Section 5 concludes the paper by summarizing the key contributions and outlining directions for future research.

2. Related Works

Image segmentation techniques have undergone a paradigm shift since the advent of the deep learning era, continuously evolving to address increasingly complex computer vision tasks. This section provides a brief review of deep learning-based image segmentation methodologies, with a particular emphasis on the evolutionary trajectory from CNNs to Vision Transformer (ViT) architectures.

2.1. CNN-Based Segmentation Models

Deep learning-based image segmentation methods have rapidly advanced with the development of computer technology. CNN-based architectures have become the mainstream paradigm for image segmentation. Representative models, such as U-Net [28], Faster R-CNN [29], YOLO [30], and their variants, have been widely adopted and demonstrated remarkable performance. These methods have been successfully applied to tasks such as crack detection [31] and particle gradation [32]. Single-stage detectors offer high inference efficiency and end-to-end detection capability, enabling researchers to develop numerous YOLO-based variants to optimize feature extraction for fine-grained target detection. For instance, Wang et al. [33] proposed an instance segmentation model based on YOLOv8-seg, which improves feature representation and the detection of fine cracks in complex backgrounds. Similarly, Yang et al. [34] introduced the YOLOv8-GSD model for metro tunnel crack segmentation, significantly enhancing the preservation of crack texture features. YOLO-based variants generally achieve satisfactory detection performance, as cracks typically exhibit thin and elongated structures that align well with the local spatial neighborhoods captured by convolutional receptive fields. In addition to single-stage models, two-stage detectors and encoder–decoder architectures have been widely adopted. Li et al. [29] developed a multi-layer feature fusion network based on Faster R-CNN for automatic tunnel surface defect detection under insufficient illumination. In parallel, architectures such as U-Net and YOLACT have also been adopted for image segmentation tasks. For example, Li et al. [35] embedded a residual network structure into the U-Net framework to improve the segmentation accuracy of tunnel lining cracks. Liu et al. [36] proposed an enhanced YOLACT-based method for simultaneously identifying the shapes, categories, and depths of tunnel defects. 
Overall, CNN-based models exhibit profound advantages in extracting high-precision local information and identifying targets with regular geometric shapes or distinct textural gradients. However, owing to the inherent inductive biases of convolution operations, which include limited receptive fields and translation invariance, CNNs encounter substantial bottlenecks in extracting global contextual features. This is especially evident when dealing with irregularly shaped object regions (e.g., the vibrated concrete regions) or targets whose textures heavily resemble the background. This deficiency restricts the acquisition of high-level semantic information, frequently resulting in incomplete structural representations and a consequent decline in detection precision.

To tackle the limitation of global feature modeling in CNNs, numerous studies have explored various strategies such as attention mechanisms. For instance, Al-Huda et al. [37] proposed an ADDU-Net model based on U-Net, which introduced a dual attention module to improve the recognition of both thick and fine cracks under diverse environmental conditions. Qu et al. [38] developed a concrete pavement crack detection algorithm that utilizes attention mechanisms with multi-feature fusion to enhance the model’s ability to capture global contextual information. Attention mechanisms partially compensate for the limited global modeling capability of CNNs by focusing on specific channel and spatial features [39]. However, in large-area domains with highly heterogeneous surface textures, conventional attention mechanisms often suffer from attention collapse or fail to maintain a balanced, comprehensive feature distribution. Therefore, Transformers have been introduced into computer vision (CV), enabling models to capture long-range dependencies and generate more globally consistent feature representations.

2.2. Transformer-Based Segmentation Models

In image segmentation tasks, such as crack recognition [40], Transformers effectively extract global features through the self-attention mechanism while preserving more spatial information than CNNs. For instance, Wang et al. [41] proposed the Transformer-based SegCrack model, which enhances contextual information capture in concrete crack segmentation by modeling relationships between all pixel pairs. Cao et al. [42] introduced the Swin-Unet model, a purely Transformer-based U-shaped network for image segmentation. In this model, the input image is divided into non-overlapping patches and fed into a Transformer-based encoder, enabling progressive restoration of spatial resolution. In extreme underwater environments, SegFormer relies solely on linear MLP layers in its decoder, making it difficult to suppress background noise and leading to missed detections of micro-cracks. To address this, Li et al. [43] adopted SegFormer as the architectural backbone and introduced a hybrid attention mechanism, thereby improving crack continuity and segmentation accuracy. These purely Transformer-based paradigms can effectively mitigate missed detections in environments with complex backgrounds by establishing a global receptive field from the initial layers. Nevertheless, the lack of spatial inductive biases in pure Transformers limits their ability to model fine-grained local features, frequently leading to degraded segmentation performance at intricate edge regions or boundaries with subtle morphological variations.

To balance global contextual modeling and local feature preservation, recent research has gravitated towards hybridizing CNNs with Transformer-based architectures [24]. This architectural hybridization leverages the complementary strengths of both paradigms. Specifically, the CNN architecture possesses spatial inductive biases, enabling efficient extraction of low-level, high-frequency local details, such as edge gradients and fine-grained textures. In contrast, the Transformer architecture captures long-range spatial dependencies and global semantic structures through dynamic self-attention mechanisms. For example, Zhang et al. [44] proposed the UTCD-Net dam crack segmentation network, integrating Transformer and CNN in parallel. In this design, the Transformer branch extracts global context features for crack detection, while the CNN branch captures local feature information. Gan et al. [45] combined convolutional operations with Transformer to improve crack detection accuracy by mitigating the loss of long-range information, enabling the model to learn both detailed features and global context. Similarly, Shamsabadi et al. [46] proposed TransUNet, which employs a hybrid architecture of CNN and ViT for crack detection on asphalt and concrete surfaces. Experimental results indicate that this hybrid approach improves generalization while retaining the capability to extract both global and local features. These models based on the combination of Transformer and CNN enhance image segmentation performance by learning complex features such as shape, size, and spatial relationships [25]. Despite these architectural advancements, deploying CNN-Transformer hybrid models in real-world, complex environments remains challenging. These environments typically encompass both the region of interest and extensive, heterogeneous background noise.
Hybrid models remain susceptible to background interference because standard self-attention mechanisms compute correlations uniformly across all tokens. This interference dilutes the attention scores of critical foreground features and subsequently degrades segmentation fidelity.

Masked attention enhances model discrimination and robustness in complex environments by suppressing non-essential background noise [47]. Zuo et al. [48] proposed a masked attention method that focuses on defective regions to enhance the model’s discriminative capacity for the local spatial features of pipe cracks. Fundamentally, unlike standard cross-attention that uniformly attends to all spatial locations across the flattened feature map, masked attention restricts the attention computation strictly within the predicted foreground regions. This mechanism utilizes learnable semantic queries combined with dynamically generated binary masks. This design forces the network to allocate its representational capacity exclusively to the target object, thereby decoupling target features from overwhelming background noise. The MP-Former [49] and the Mask DINO [50] models both introduce mask branches to enhance attention to object regions. Masked attention methods address the issue of insufficient object attention in Transformer models by enhancing focus on target regions, particularly under complex recognition conditions [49]. Therefore, this paper introduces an architecture comprising a CNN backbone integrated with a masked attention-based Transformer decoder. The proposed methodology leverages the local detail-preserving strengths of convolutions while employing masked attention to suppress complex background interference, thereby improving focus on target regions and edge feature extraction.

3. Methodology

While CNN-based models excel at extracting local textural features, their restricted receptive fields hinder the effective fusion of local and global information. In contrast, Transformer-based models leverage long-range dependencies to capture interactions across distant spatial locations, thereby enhancing the accurate recognition of distributed regions. However, the inherent flowability of concrete and the rearrangement of aggregate particles during vibration often result in ambiguous edge features between vibrated and non-vibrated regions. Furthermore, on the dam pouring surface, dynamic interference such as equipment movement and illumination variations exacerbates the difficulty of differentiation, thereby degrading the model’s predictive performance. To accurately map the distribution of vibrated concrete, this paper proposes a novel end-to-end segmentation model that features a masked attention-driven CNN-Transformer architecture. In this framework, the convolution operations of the CNN capture fine-grained texture details, particularly within vibrated regions. To address boundary ambiguity, MSDeformAttn is employed to fuse feature maps across different scales, improving edge feature extraction in vibrated regions. Additionally, masked attention suppresses interference from complex background noise, significantly enhancing the model’s focus on target regions.

3.1. The Proposed Framework

This research proposes a method for segmenting the vibrated concrete regions in digital images of dam pouring surfaces. The proposed framework comprises three main steps, as illustrated in Figure 1. In the first step, on-site video data are converted into image frames, followed by the pixel-level annotation of the vibrated concrete regions. In the second step, the framework extracts and enhances texture features from images via a backbone network and the MSDeformAttn module, and then models the texture features of the target regions using a Transformer architecture with masked attention. Specifically, the input images are processed by the backbone network for multi-scale feature extraction and encoding, which allows the model to capture hierarchical features from local textures to global morphological characteristics. Subsequently, MSDeformAttn enhances and fuses the multi-scale feature maps, gradually restoring high-resolution feature representations and generating mask features for subsequent predictions. Next, a fixed number of learnable query vectors are introduced into the Transformer decoder to interact with the enhanced multi-scale feature maps through masked attention. This interaction allows the model to focus on target regions by filtering out irrelevant background noise. In the final step, two parallel prediction heads utilize the refined target representations from each query to simultaneously output a class label and a corresponding binary mask. End-to-end segmentation of all target regions is achieved by integrating mask and class predictions.

3.2. Multi-Scale Feature Extraction for Input Images

Vibrated concrete regions on a dam pouring block exhibit multi-scale visual characteristics, manifested as surface texture variations and specular reflections induced by cement slurry exudation and enhanced compactness. To effectively extract these multi-scale features, ResNet-50 is adopted, where residual connections mitigate gradient degradation, ensuring stable training and robust feature extraction. Furthermore, ResNet-50 affords a favorable balance between representational capacity and computational efficiency. The architecture comprises five hierarchical stages. As the network deepens, each stage extracts increasingly abstract features with richer semantic information, accompanied by a progressive reduction in spatial resolution. In the first stage, input images are downsampled to generate low-level feature representations. From the second to the fifth stages, each stage utilizes identity and projection shortcuts, formulated in Equations (1) and (2), respectively.

$$y = F(x, W_{i}) + x \tag{1}$$

where $x$ denotes the input vector to the residual block and also serves as the identity mapping in the shortcut connection; $y$ is the output vector of the residual block; $F(x, W_{i})$ represents the residual mapping function to be learned, which is typically composed of a series of operations such as convolutional layers, batch normalization, and nonlinear activation functions; and $W_{i}$ denotes the set of learnable parameters associated with the $i$-th layer within the residual function $F$.

$$y = F(x, W_{i}) + W_{s} x \tag{2}$$

where $W_{s} x$ denotes the linear projection on the shortcut connection; and $W_{s}$ is a learnable projection matrix used to transform the input $x$ so that its dimensionality matches that of $F(x, W_{i})$.

At each stage, feature maps of different scales are progressively generated. Through successive convolution, downsampling, and nonlinear transformations, the input image undergoes hierarchical abstraction and semantic reconstruction, enabling an effective transition from low-level fine-grained texture information to high-level semantic representations.
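The two shortcut types in Equations (1) and (2) can be illustrated with a minimal sketch on plain Python lists. The residual mapping `residual_fn` (one linear layer followed by ReLU) and all weight values below are hypothetical stand-ins chosen for illustration; they are not the convolutional blocks of ResNet-50.

```python
# Minimal sketch of Equations (1) and (2). The residual mapping F is a toy
# stand-in (linear layer + ReLU), an assumption made only for illustration.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def residual_fn(x, W_i):
    """Toy residual mapping F(x, W_i)."""
    return relu(matvec(W_i, x))

def identity_block(x, W_i):
    """Equation (1): y = F(x, W_i) + x (input and output sizes must match)."""
    fx = residual_fn(x, W_i)
    assert len(fx) == len(x), "identity shortcut needs matching dimensions"
    return [a + b for a, b in zip(fx, x)]

def projection_block(x, W_i, W_s):
    """Equation (2): y = F(x, W_i) + W_s x (W_s matches x to the size of F)."""
    fx = residual_fn(x, W_i)
    sx = matvec(W_s, x)
    assert len(fx) == len(sx), "projection must match the residual output"
    return [a + b for a, b in zip(fx, sx)]

x = [1.0, -2.0]
W_same = [[0.5, 0.0], [0.0, 0.5]]              # keeps dimension 2
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # maps dimension 2 -> 3
W_s = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # projection shortcut 2 -> 3

y_identity = identity_block(x, W_same)         # [1.5, -2.0]
y_projection = projection_block(x, W_up, W_s)  # [2.0, -2.0, 0.0]
```

The projection shortcut is only needed where the residual branch changes the feature dimensionality; elsewhere the identity shortcut adds no parameters.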

3.3. Edge Feature Enhancement for Vibrated and Non-Vibrated Regions

To mitigate the issue of indistinct edge features, an MSDeformAttn module is introduced. This architectural design improves the boundary perception capability of the network and constructs a more robust multi-scale feature representation framework. Initially, multi-scale feature maps are projected to a unified channel dimension. Subsequently, they are processed by MSDeformAttn, where each reference point adaptively learns sampling points across multiple scales. The structural implementation of the deformable attention (DeformAttn) module is depicted in Figure 2, with its mathematical formulation detailed in Equation (3). Building upon this foundation, the formulation for MSDeformAttn is expressed in Equation (4). Specifically, the dimensionally aligned feature maps are individually flattened into token sequences and concatenated into a single sequence. Concurrently, scale-level and spatial positional embeddings are incorporated into each feature map to preserve spatial and hierarchical positional contexts. Each token is assigned a unique reference point, which directs the generation of sampling points across multiple scales within each attention head, thereby facilitating cross-scale feature aggregation and enhancing the representation of target regions.

$$\mathrm{DeformAttn}(z_{q}, p_{q}, x) = \sum_{m=1}^{M} W_{m} \left[ \sum_{k=1}^{K} A_{mqk} \cdot W_{m}' \, x(p_{q} + \Delta p_{mqk}) \right] \tag{3}$$

where $z_{q}$ is the feature vector of the $q$-th query; $p_{q}$ represents the normalized two-dimensional reference point indicating the spatial location of the query on the feature map; $x$ denotes the input feature map; $M$ is the number of attention heads; $W_{m}$ is the output projection matrix for the $m$-th attention head; $K$ denotes the number of sampled key points per query in each attention head; $A_{mqk}$ represents the attention weight assigned to the $k$-th sampled point for the $q$-th query in the $m$-th head; $W_{m}'$ denotes the input projection matrix for the $m$-th head; $x(p_{q} + \Delta p_{mqk})$ represents the feature value obtained via bilinear interpolation at the spatial location $p_{q} + \Delta p_{mqk}$ on the feature map; $\Delta p_{mqk}$ is a learnable two-dimensional offset for the $k$-th sampling point of the $q$-th query in the $m$-th head; and $\mathrm{DeformAttn}(z_{q}, p_{q}, x)$ denotes the output of DeformAttn at the query position $p_{q}$.
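To make the sampling in Equation (3) concrete, the following is a minimal single-query sketch. It assumes scalar features per pixel (so the projections for each head reduce to the hypothetical scalars `w_in` and `w_out`), coordinates given directly in pixel units as (column, row), zero padding outside the map, and attention weights obtained by a softmax over the K sampling points of each head. In the actual module, the offsets and attention logits are predicted from the query feature; here they are passed in as fixed illustrative values.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly interpolate grid feat[row][col] at (x, y); outside counts as 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * feat[yy][xx]
    return value

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deform_attn(p_q, feat, offsets, logits, w_in, w_out):
    """Equation (3) for one query with scalar features.

    offsets[m][k] is the offset (dx, dy) of sampling point k in head m,
    logits[m][k] yields A_mqk after the softmax over k, and w_in[m] and
    w_out[m] stand in for the projections W'_m and W_m.
    """
    out = 0.0
    for m in range(len(offsets)):
        attn = softmax(logits[m])
        head = 0.0
        for k, (dx, dy) in enumerate(offsets[m]):
            sampled = bilinear(feat, p_q[0] + dx, p_q[1] + dy)
            head += attn[k] * w_in[m] * sampled
        out += w_out[m] * head
    return out

# Toy 3x3 feature map with feat[y][x] = 3*y + x.
feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
result = deform_attn(
    p_q=(1.0, 1.0),
    feat=feat,
    offsets=[[(0.5, 0.0), (0.0, 0.5)]],  # one head, K = 2 sampling points
    logits=[[0.0, 0.0]],                 # equal attention weights of 0.5
    w_in=[1.0],
    w_out=[1.0],
)
# The two samples are 4.5 and 5.5, so the weighted sum is 5.0.
```

Because each query attends to only K sampled locations instead of every pixel, the cost per query is independent of the feature-map size.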

$$\mathrm{MSDeformAttn}(z_{q}, \hat{p}_{q}, \{x^{l}\}_{l=1}^{L}) = \sum_{m=1}^{M} W_{m} \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W_{m}' \, x^{l}(\phi(\hat{p}_{q}) + \Delta p_{mlqk}) \right] \tag{4}$$

where $\{x^{l}\}_{l=1}^{L}$ represents the set of feature maps from $L$ different scales; the reference point $\hat{p}_{q} \in [0, 1]^{2}$ represents the normalized spatial location of the $q$-th query; $\phi(\cdot)$ represents a scale mapping function that transforms the normalized reference point $\hat{p}_{q}$ to the actual coordinate location on the $l$-th scale feature map; $A_{mlqk}$ represents the attention weight for the $q$-th query in the $m$-th head at the $l$-th scale and the $k$-th sampled point; and $\Delta p_{mlqk}$ represents the corresponding spatial offset.
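The multi-scale aggregation in Equation (4) can be sketched for a single head as follows. Several simplifications are assumed: features are scalars, both projection matrices are set to 1, the scale mapping `phi` uses the convention x = p·(W − 1), and the attention weights are normalized jointly over all L·K sampling points, as in the original Deformable DETR formulation. None of these choices is specified by the text above, so the sketch is illustrative only.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly interpolate grid feat[row][col] at (x, y); outside counts as 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                value += (1 - abs(x - xx)) * (1 - abs(y - yy)) * feat[yy][xx]
    return value

def phi(p_hat, feat):
    """Map a normalized point in [0, 1]^2 to pixel coordinates of one level.

    The convention x = p * (W - 1) is an assumption made for this sketch.
    """
    h, w = len(feat), len(feat[0])
    return p_hat[0] * (w - 1), p_hat[1] * (h - 1)

def ms_deform_attn_head(p_hat, feats, offsets, logits):
    """One head of Equation (4) with scalar features and unit projections.

    offsets[l][k] and logits[l][k] index scale level l and sampling point k.
    Attention weights are normalized jointly over all levels and points.
    """
    peak = max(s for level in logits for s in level)
    exps = [[math.exp(s - peak) for s in level] for level in logits]
    total = sum(sum(level) for level in exps)
    out = 0.0
    for l, feat in enumerate(feats):
        cx, cy = phi(p_hat, feat)
        for k, (dx, dy) in enumerate(offsets[l]):
            out += (exps[l][k] / total) * bilinear(feat, cx + dx, cy + dy)
    return out

level0 = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]  # finer scale
level1 = [[10.0, 20.0], [30.0, 40.0]]                         # coarser scale
aggregated = ms_deform_attn_head(
    p_hat=(0.5, 0.5),
    feats=[level0, level1],
    offsets=[[(0.0, 0.0)], [(0.0, 0.0)]],  # K = 1 point per level, no offset
    logits=[[0.0], [0.0]],                 # equal weights across the levels
)
# Level 0 is sampled at (1, 1) -> 4.0; level 1 at (0.5, 0.5) -> 25.0.
```

The same normalized reference point is reused on every level, which is what lets a single query gather evidence from coarse and fine maps at once.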

Finally, the original 1/4-scale feature map and the enhanced 1/8-scale feature map are first aligned in dimension via convolutional layers and then fused through element-wise addition. To mitigate the aliasing effect induced by the upsampling process and further refine the fused representations, an additional convolutional layer is applied for smoothing. The resulting high-resolution feature map serves directly as the per-pixel embedding for subsequent mask generation.
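The fusion step can be sketched on single-channel toy maps. The sketch assumes nearest-neighbor 2× upsampling of the coarser map and replaces the learned smoothing convolution with a fixed 3×3 mean filter; the convolutional layers that align the channel dimensions are omitted. These substitutions are assumptions for illustration, not the layers used in the proposed model.

```python
def upsample_nearest_2x(feat):
    """Repeat every value twice along both axes (nearest-neighbor upsampling)."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    """Element-wise addition of two equally sized 2D maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def smooth_3x3(feat):
    """Average each position with its in-bounds 3x3 neighborhood.

    A fixed mean filter stands in for the learned smoothing convolution.
    """
    h, w = len(feat), len(feat[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [feat[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

fine = [[1.0] * 4 for _ in range(4)]   # stands in for the 1/4-scale map
coarse = [[0.0, 2.0], [4.0, 6.0]]      # stands in for the 1/8-scale map
fused = add_maps(fine, upsample_nearest_2x(coarse))
smoothed = smooth_3x3(fused)
```

In this toy case the upsampled map has blocky 2×2 steps, and the smoothing pass softens the transitions between them, which is the aliasing effect the additional convolution is meant to reduce.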

3.4. Mask Modeling and Prediction for Vibrated Regions

To suppress interference from complex background noise, a Transformer decoder driven by a masked attention mechanism is introduced, as illustrated in Figure 3. Masked attention is a variant of the cross-attention mechanism that imposes spatial constraints on the attention range of each query based on its preceding mask predictions. Specifically, the mask predictions generated by the previous decoder layer are first transformed into probability maps through the Sigmoid function. Subsequently, binary thresholding is performed using a threshold of 0.5 to generate the attention mask. Positions with probabilities greater than or equal to 0.5 are regarded as foreground regions and assigned a value of 1. Conversely, positions with probabilities lower than 0.5 are treated as background regions and assigned a value of 0. The binary mask generation rule is defined in Equation (5).

$$M(x, y) = \begin{cases} 1, & p(x, y) \geq 0.5 \\ 0, & p(x, y) < 0.5 \end{cases} \tag{5}$$

where $M(x, y)$ denotes the generated binary attention mask at position $(x, y)$, and $p(x, y)$ represents the foreground prediction probability at the corresponding position after the Sigmoid transformation.

When the mask value at a certain position is 0, the corresponding attention score is set to $-\infty$. After Softmax normalization, the resulting attention weight becomes zero, thereby suppressing irrelevant background interference. In contrast, when the mask value is 1, no spatial constraint is imposed on the attention computation, allowing the query to focus on informative foreground regions. Through this mechanism, the decoder progressively refines the segmentation region in a coarse-to-fine manner while enhancing region-aware feature aggregation. The formulation of masked attention is presented in Equation (6).

$$X_{l} = \mathrm{softmax}\left(M_{l-1} + Q_{l} K_{l}^{T}\right) V_{l} + X_{l-1} \tag{6}$$

where $X_{l}$ represents the updated feature output by the $l$-th layer of the Transformer decoder; $X_{l-1}$ represents the output from the previous layer, which serves as both the identity mapping in the residual connection and the initial query features for the current layer; $Q_{l}$ represents the query matrix obtained by applying a linear transformation to the output feature $X_{l-1}$ of the previous layer; $K_{l}$ and $V_{l}$ represent the key matrix and value matrix generated by linear transformations of the enhanced multi-scale feature maps, respectively; and $M_{l-1}$ represents the attention mask obtained from the previous layer, which contributes 0 at positions where the binary mask of Equation (5) equals 1 and $-\infty$ at positions where it equals 0.
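To make Equations (5) and (6) concrete, the following is a minimal single-query sketch in plain Python rather than the authors' PyTorch implementation. It assumes one attention head, omits the learned linear projections and any score scaling, and all function names are illustrative. The fallback for a query whose mask is entirely background (attend everywhere) is an assumption added here so that the Softmax stays defined; it is not stated in the text.

```python
import math

NEG_INF = float("-inf")

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_mask(mask_logits, threshold=0.5):
    """Equation (5): 1 where sigmoid(logit) >= threshold, otherwise 0."""
    return [1 if sigmoid(z) >= threshold else 0 for z in mask_logits]

def additive_mask(binary):
    """Additive form used inside the Softmax: 0 for foreground, -inf for background."""
    # Assumption: if every position is background, attend everywhere instead,
    # because a Softmax over only -inf scores is undefined.
    if not any(binary):
        return [0.0] * len(binary)
    return [0.0 if b == 1 else NEG_INF for b in binary]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(query, keys, values, mask_logits, residual):
    """Equation (6) for one query: softmax(M + q K^T) V + X_prev."""
    bias = additive_mask(binary_mask(mask_logits))
    scores = [b + sum(q * k for q, k in zip(query, key))
              for b, key in zip(bias, keys)]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return [a + r for a, r in zip(attended, residual)], weights

# Three key positions; the middle one was predicted as background by the previous layer.
output, weights = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
    values=[[1.0, 1.0], [5.0, 5.0], [3.0, 3.0]],
    mask_logits=[2.0, -2.0, 0.0],
    residual=[0.0, 0.0],
)
print(weights)
```

In this toy case the masked position receives a weight of exactly zero, so its value vector cannot influence the updated query regardless of how similar its key is to the query.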

The Transformer decoder is implemented as follows. First, the enhanced multi-scale feature maps serve as the key and value matrices. Masked cross-attention is then executed between these matrices and the query vectors, operating under the spatial constraint of the prediction mask from the prior layer. This mechanism allows the network to preliminarily localize image regions corresponding to potential targets. Subsequently, the query vectors interact through the self-attention mechanism to model contextual relationships among targets and mitigate duplicate predictions. Each query vector is then processed by a feed-forward network (FFN), further enriching its representational capacity. Through a cascade of multiple decoder layers, the initial queries are iteratively refined into a set of output query vectors that carry high-level semantic information. Finally, the mask prediction head performs a dot product between the output query vectors and the mask features to generate mask predictions for each target. To address the presence of multiple disconnected vibrated regions, our method defines each spatially isolated vibrated patch as an independent instance. With the bipartite matching mechanism, the model assigns distinct object queries to individual disjoint vibrated regions in a one-to-one manner. This assignment enables effective instance-level separation. In parallel, the classification head applies a linear transformation followed by a Softmax function to produce class probabilities. Through this query-based parallel processing, the network can identify and segment multiple disconnected vibrated regions simultaneously, thereby completing the segmentation of the input image.

3.5. Image Augmentation, Loss Function, Optimizer, and Learning Rate Scheduling Strategy

Considering the random sizes and irregular shapes of vibrated regions, this study adopts global resize with aspect ratio preservation and random horizontal flipping as the primary image augmentation strategies. In conventional Large Scale Jittering (LSJ), random cropping may disrupt the natural boundaries of targets and introduce artificially truncated edges, thereby weakening the global contextual relationship between the target and the background. In contrast, the introduced global resizing strategy preserves the authentic morphology of the target while maintaining complete background context information during image transformation. This approach prevents irregular structural features from being distorted or damaged. As a result, the network is encouraged to focus on learning the texture differences and natural boundary characteristics between vibrated and non-vibrated regions.

The loss function provides both optimization guidance and constraint mechanisms for the model. In this study, three loss functions are jointly employed, including Cross Entropy Loss for classification, Binary Cross Entropy (BCE) Loss for pixel-level mask supervision, and Dice Loss for global mask optimization. For the single-category detection task considered in this work, the construction-site environment contains numerous non-target regions, such as construction machinery and concrete formwork, resulting in highly complex backgrounds. Without appropriate constraints on the 100 queries generated by the decoder, most queries tend to be assigned to the background class. Consequently, background gradients dominate the optimization process, making the model prone to converging to suboptimal local minima. To alleviate this issue, an imbalanced class weighting strategy (class_weight = [1.0, 0.1]) is introduced into the classification loss, where the background penalty is reduced to one-tenth of the foreground target weight. This strategy effectively suppresses excessive interference from complex negative samples and significantly improves the recall capability for vibrated regions. In addition, BCE Loss provides stable local pixel-wise gradients, enabling the model to rapidly learn the initial contours of target regions. However, relying solely on pixel-wise BCE supervision makes the optimization process highly sensitive to the imbalance between foreground and background regions. To address this issue, Dice Loss is further introduced to constrain the prediction results from a global spatial perspective. Through the complementary interaction between local pixel-level supervision and global shape-aware optimization, the network is able to effectively learn the segmentation boundary characteristics of targets with highly variable sizes and irregular shapes.
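The three loss terms described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function names, the class index order (0 for the vibrated region, 1 for background, matching class_weight = [1.0, 0.1]), and the smoothing constant in the Dice term are assumptions. The relative weights used to combine the three terms are not given in the text and are therefore left out.

```python
import math

def weighted_cross_entropy(probs, target, class_weight=(1.0, 0.1)):
    """Classification loss for one query.

    probs: class probabilities after Softmax.
    target: 0 for the vibrated region, 1 for background (assumed order).
    """
    return -class_weight[target] * math.log(probs[target])

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy over flattened per-pixel mask probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice_loss(pred, target, smooth=1.0):
    """1 - Dice coefficient; depends on region overlap rather than pixel count."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)

# A background query contributes one tenth of the loss of an equally
# uncertain foreground query.
fg = weighted_cross_entropy([0.5, 0.5], target=0)
bg = weighted_cross_entropy([0.5, 0.5], target=1)
print(fg, bg)
```

With most of the 100 queries matched to background, the reduced background weight keeps those queries from dominating the classification gradient, which is the motivation given in the paragraph above.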

A well-designed optimization and learning rate scheduling strategy is essential for balancing convergence efficiency and model generalization. The iteration-based Poly decay strategy often suffers from excessively long convergence periods due to its slow decay behavior and is also sensitive to fluctuations in batch size. Therefore, this study adopts the SGD optimizer together with an epoch-based scheduling scheme that combines linear warmup and MultiStepLR decay. Compared with other optimization methods, SGD exhibits a lower risk of overfitting on datasets of this scale. When integrated with staged learning rate decay, it provides a more stable convergence trajectory and better exploits the feature extraction capability of the proposed framework. The initial linear warmup stage provides essential gradient stabilization for the decoder module, effectively preventing severe gradient oscillation and weight collapse caused by random initialization at the early stage of optimization. In the middle and later stages, the stepwise learning rate decay enables the model to transition smoothly from coarse target localization to fine-grained boundary refinement. Considering the irregular and highly variable shapes of the targets, this strategy substantially improves the boundary prediction accuracy of the final segmentation masks.
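The epoch-based schedule can be illustrated with a small function that returns the learning rate for a given epoch. The base learning rate of 0.001 and the 50-epoch budget are the values reported in the implementation details; the warmup length, warmup start factor, decay milestones, and decay factor below are hypothetical placeholders, since the text does not state them.

```python
def learning_rate(epoch, base_lr=0.001, warmup_epochs=5, warmup_start=0.1,
                  milestones=(30, 40), gamma=0.1):
    """Learning rate for a 0-based epoch under linear warmup and multi-step decay.

    warmup_epochs, warmup_start, milestones and gamma are assumed values.
    """
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_start * base_lr up to base_lr.
        progress = epoch / warmup_epochs
        return base_lr * (warmup_start + (1.0 - warmup_start) * progress)
    # After warmup, multiply by gamma once for every milestone already reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

schedule = [learning_rate(epoch) for epoch in range(50)]
print(schedule[0], schedule[5], schedule[30], schedule[45])
```

In PyTorch, the same shape is commonly obtained by chaining a warmup scheduler with `torch.optim.lr_scheduler.MultiStepLR`; the pure-Python form is used here only to show the resulting curve.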

4. Experiment

In this section, the overall architecture of the proposed workflow is presented, as illustrated in Figure 4. First, the process involves converting on-site videos into sequential image frames, followed by the precise annotation of the vibrated concrete regions. The proposed method formulates the segmentation of vibrated concrete as the joint prediction of a fixed set of masks and their corresponding category labels. Specifically, the backbone network extracts multi-scale features from the input images, while MSDeformAttn enhances the edge texture features of vibrated concrete. Then, under the guidance of an attention mask, the model progressively learns discriminative representations of the target regions. Finally, the Transformer decoder outputs class predictions and mask predictions in parallel, achieving end-to-end segmentation of the vibrated concrete regions.

The model code is developed using Python (version 3.8.0), and the network architecture is implemented via the open-source deep learning framework PyTorch (version 2.0.0). All training and evaluation procedures are conducted on a machine equipped with an NVIDIA RTX 4090D GPU (NVIDIA, Santa Clara, CA, USA), utilizing CUDA 11.8 for hardware acceleration.

4.1. Dataset

To evaluate the proposed method, a dedicated dataset was first constructed, as no publicly available dataset exists for dam pouring surfaces. Thus, images of construction activities were collected from different pouring surfaces to establish the database. First, to extract representative static images from video streams, we employed a fixed temporal sampling strategy. Specifically, we extracted frames from the construction videos at 10 s intervals to minimize redundancy between adjacent frames. Subsequently, blurred and low-quality images were discarded. We then utilized the Structural Similarity Index to evaluate the similarity between frames. This assessment took into account key visual features such as the positioning of vibration equipment, concrete surface morphology, lighting conditions, and the distribution of vibration regions. Consequently, this evaluation allowed us to filter out highly redundant images. Following this screening process, a total of 436 images were obtained. The dataset was then split into training and validation sets at an 8:2 ratio. Because consecutive video frames are strongly correlated in time, random frame-level splitting may cause data leakage by placing similar neighboring frames into different subsets. We therefore adopted a temporally ordered partitioning strategy, in which data from earlier periods were used for training and later data for validation. Moreover, the concrete dam construction process involves dynamic changes in workspace geometry, surface morphology, equipment positions, vibration distribution, and lighting conditions over time. As a result, data from different temporal stages exhibit significant variations in scene structure, texture, and target distribution, which enhances the independence between training and validation sets and improves the reliability and generalization of model evaluation. 
Due to variations in the duration of effective vibration activities across different videos, the number of effective samples derived from each video varies. Furthermore, to ensure the robustness of the model evaluation, the dataset encompasses three distinct concrete casting sections, as shown in Table 1. Specifically, images were captured from three dam pouring blocks with varying construction conditions, under diverse time periods and weather conditions. While dam blocks contain common components such as leveling machines, vibration machines, and concrete formwork, they possess distinct visual characteristics. Block 1 features a group of massive, cylindrical rebar cages placed at an angle. Block 2 contains a towering red steel scaffolding structure. Meanwhile, Block 3 is characterized by a prominent, vertical cylindrical survey stake standing in the middle of the wet concrete. These clear visual differences facilitate the effective differentiation of images across various scenes. The next step was to annotate the images for model training. To ensure high-quality ground truth labels, this study used LabelMe (version 5.3.1), a Python-based annotation tool, to manually annotate vibrated concrete regions. To guarantee complete consistency in annotation criteria, all images were annotated by a single professional annotator. Annotation was performed following a standardized protocol based on the morphological characteristics of concrete surfaces. Specifically, the annotator delineated vibrated regions by relying on key visual features, including cement paste diffusion patterns, mortar liquefaction, texture uniformity, and local gloss variation. In addition, strict rules were established for boundary delineation and the handling of discontinuous vibrated regions to ensure annotation consistency. To further ensure annotation quality, a second-round review of the entire dataset was conducted. 
This review corrected potential deviations in annotation standards that may have developed during the annotation process. In addition, an independent reviewer further examined a subset of samples to ensure the accuracy and reliability of the mask annotations. Figure 5 presents three annotation examples, all labeled as ‘vibrated region’ and highlighted in green.
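The temporally ordered 8:2 partition described earlier in this subsection can be sketched as follows. The frame list, timestamps, and file names are hypothetical, and the way the boundary is rounded is an assumption, since the exact train and validation counts are not stated in the text. How frames from the three pouring blocks are interleaved is also not specified, so the sketch treats the dataset as a single time-ordered sequence.

```python
def temporal_split(frames, train_ratio=0.8):
    """Split (timestamp_seconds, image_name) pairs so that training data precede validation data."""
    ordered = sorted(frames, key=lambda item: item[0])
    cut = int(len(ordered) * train_ratio)  # assumed rounding: truncate
    return ordered[:cut], ordered[cut:]

# Hypothetical retained frames, one candidate every 10 s, 436 images in total.
frames = [(10 * i, f"frame_{i:04d}.jpg") for i in range(436)]
train, val = temporal_split(frames)
print(len(train), len(val))
```

Because every training timestamp precedes every validation timestamp, adjacent frames from the same moment cannot end up on both sides of the split, which is the leakage risk the temporally ordered strategy is meant to avoid.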

4.2. Implementation Details

We trained and validated our models on the dataset described in Section 4.1. All input images were resized to a resolution of 640 × 640 pixels regardless of their original dimensions. To ensure sufficient model convergence, we systematically configured the training strategies and optimization parameters. Model parameters were initialized using weights pre-trained on the ImageNet dataset to accelerate the convergence process. Based on preliminary experiments, the model was trained for 50 epochs. Due to GPU memory constraints, the batch size was set to 4. Furthermore, parallel data loading with two worker threads was enabled to improve the efficiency of data preprocessing and transfer. Given the limited data scale and the single-category nature of this industrial scenario, we adopted the Stochastic Gradient Descent (SGD) optimizer. Compared to adaptive optimization algorithms, although SGD exhibits a slower initial convergence rate, it typically converges to flatter minima, which contributes to superior generalization capability. The initial learning rate was set to 0.001 to balance convergence efficiency while preventing training instability caused by excessive parameter updates. During preliminary testing, higher rates (e.g., 0.01) induced oscillations or divergence, whereas lower rates (e.g., 0.0001) significantly hindered the convergence speed. The momentum parameter was fixed at 0.9. This introduced a momentum term that incorporates historical gradients to accelerate optimization, suppress gradient fluctuations, and facilitate escaping from shallow local optima. Furthermore, a weight decay coefficient of 0.0001 was applied for L2 regularization. This regularization effectively limits network parameter complexity and mitigates the risk of overfitting inherent in this single-category task.

The initial stage of the proposed architecture entails multi-scale feature extraction from the input images. As depicted in Figure 6, ResNet-50 processes an input RGB image of dimensions 640 × 640 × 3 through a hierarchical sequence of convolutional layers. The ResNet-50 architecture consists of five stages. As the spatial resolution progressively decreases across these stages, the extracted features become increasingly abstract and semantically enriched. Specifically, the first stage employs a 7 × 7 convolution layer and a 3 × 3 max pooling layer for initial downsampling, reducing the spatial resolution of the original image to 1/4 scale. The subsequent four stages consist of cascaded residual bottleneck blocks. At these stages, the ResNet-50 architecture outputs hierarchical feature maps of varying scales, denoted as C₁ (160 × 160 × 256), C₂ (80 × 80 × 512), C₃ (40 × 40 × 1024), and C₄ (20 × 20 × 2048). The multi-scale feature maps {C₂, C₃, C₄} are jointly fed into MSDeformAttn for cross-scale feature enhancement and fusion, while the feature map C₁ is retained to generate a high-resolution per-pixel embedding map.

The next step is to enhance and fuse the multi-scale feature maps to generate enhanced multi-scale feature maps and a high-resolution per-pixel embedding map. Specifically, a 1 × 1 convolution layer projects the multi-scale feature maps {C₂, C₃, C₄} to a uniform channel dimension of 256, producing {D₂, D₃, D₄} with spatial resolutions of 80 × 80, 40 × 40, and 20 × 20, respectively. This channel-wise alignment establishes a unified semantic space across different scales, thereby facilitating subsequent cross-scale feature interactions. The feature maps {D₂, D₃, D₄} are flattened and concatenated into a continuous token sequence, which is further augmented with both scale-level and spatial positional embeddings. Each token in this sequence corresponds to a normalized reference point. Directed by these reference points, multiple sampling locations are generated across different feature levels for each attention head. For each reference point, learnable linear layers predict a set of sparse sampling offsets and their corresponding attention weights. With eight attention heads and four sampling points per head at each scale, the model adaptively samples a small number of informative locations from all three feature levels. The feature values at these sampled locations are obtained via bilinear interpolation and aggregated using Softmax-normalized attention weights. The output of each MSDeformAttn layer is processed through residual connections, layer normalization, and an FFN consisting of two fully connected layers with ReLU activation. As a result, semantic information across different scales undergoes deep interaction and enhancement, producing the enhanced multi-scale feature maps, denoted as P₂ (80 × 80 × 256), P₃ (40 × 40 × 256), and P₄ (20 × 20 × 256). Subsequently, spatial and channel alignment is performed to construct the segmentation embedding. 
The feature map C₁ is processed by a 1 × 1 convolution for channel alignment, while the enhanced feature map P₂ is bilinearly upsampled to match spatial dimensions. These two aligned feature representations are subsequently fused via element-wise addition. As illustrated in Figure 7, the resulting feature map P₁ (160 × 160 × 256) is a high-resolution per-pixel embedding feature map utilized for mask prediction.
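The feature-map sizes and the sparsity of the deformable sampling follow directly from the numbers given above, as the short calculation below shows. The stride values (4, 8, 16, 32) are inferred from the reported 640 × 640 input and the C₁ to C₄ resolutions; the function name is illustrative.

```python
def feature_map_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Side lengths of C1..C4 for a square input, given the cumulative strides."""
    return [input_size // s for s in strides]

sizes = feature_map_sizes()                # side lengths of C1, C2, C3, C4
tokens = sum(s * s for s in sizes[1:])     # D2, D3, D4 flattened and concatenated
samples_per_query = 8 * 3 * 4              # heads x feature levels x points per head per level

print(sizes, tokens, samples_per_query)
```

Each token therefore aggregates 96 sampled locations instead of attending to all 8400 tokens of the concatenated sequence, which is what makes the multi-scale attention tractable at these resolutions.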

In the final step, the pixel-level features of the vibrated concrete regions are processed by a Transformer decoder to delineate the target segmentation masks, as shown in Figure 8. The Transformer decoder is composed of three cascaded layers, each containing three sub-decoder layers, and is designed based on a hierarchical processing paradigm. Within each cascaded layer, masked attention is applied across the three sub-decoder layers to the enhanced feature maps {P₄, P₃, P₂} in sequence, following a progression from low to high spatial resolution. Specifically, the initial decoder layer operates without mask constraints. A set of learnable query vectors is introduced to perform unconstrained cross-attention, where attention weights are computed based on the similarity between the query vectors and the enhanced feature maps. This process establishes the initial spatial representation of the target regions. Then, the self-attention mechanism is applied among the query vectors to suppress redundant predictions. The query vectors are linearly transformed into mask embeddings, which are combined with the high-resolution per-pixel embedding feature map via a dot product to generate initial binary masks. Subsequent cascaded layers iteratively refine the query representations, guided by the attention masks predicted in the preceding layer. To enforce local focus, a large negative bias is imposed on pixels with mask values below a predefined threshold, constraining the query vectors to focus exclusively on candidate regions predicted by the previous layer. During the prediction phase, the model employs a set of learned object queries to represent potential targets. Through a Hungarian bipartite matching strategy, these individual queries are optimally assigned to distinct disconnected vibrated patches, establishing a one-to-one correspondence. 
Subsequently, the mask prediction head computes the dot product between the mask embeddings and the high-resolution per-pixel embedding feature map to generate the final binary segmentation masks. Simultaneously, the classification prediction head projects the query representations through a linear layer, followed by a Softmax activation function, to output the class probability distribution for each instance. This dual-head formulation finalizes the precise segmentation of the vibrated concrete regions.
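The mask prediction head described above reduces to a per-pixel dot product between each mask embedding and the per-pixel embedding map. The sketch below uses nested lists instead of tensors to keep it dependency-free; the function name is illustrative, and binarizing the Sigmoid output at 0.5 is an assumption carried over from the attention-mask rule, since the text does not state the threshold used for the final masks.

```python
import math

def predict_masks(mask_embeddings, pixel_embeddings, threshold=0.5):
    """Dot product of mask embeddings [Q][C] with pixel embeddings [C][H][W].

    Returns binary masks of shape [Q][H][W].
    """
    channels = len(pixel_embeddings)
    height = len(pixel_embeddings[0])
    width = len(pixel_embeddings[0][0])
    masks = []
    for emb in mask_embeddings:
        mask = []
        for y in range(height):
            row = []
            for x in range(width):
                logit = sum(emb[c] * pixel_embeddings[c][y][x] for c in range(channels))
                prob = 1.0 / (1.0 + math.exp(-logit))
                row.append(1 if prob >= threshold else 0)  # assumed threshold
            mask.append(row)
        masks.append(mask)
    return masks

# Two queries, two channels, a 1 x 2 embedding map.
pixels = [[[1.0, -1.0]], [[0.0, 0.0]]]
queries = [[2.0, 0.0], [-2.0, 0.0]]
masks = predict_masks(queries, pixels)
print(masks)
```

In the toy example the two queries select complementary pixels, which mirrors how distinct queries are expected to cover distinct disconnected vibrated patches.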

4.3. Evaluation Metrics

In this paper, the performance of the deep learning model is evaluated using the following metrics [51]: precision, recall, F₁ score, mAP@0.5:0.95, and mAP@0.5. Each metric is specifically selected to assess different aspects of the algorithm’s reliability in practical engineering applications.

Precision calculates the proportion of correctly predicted positive samples out of all predicted positives, as defined in Equation (7). In dam pouring blocks, high precision is crucial for minimizing false alarms, ensuring that non-vibrated regions are not misclassified as vibrated regions, which would waste subsequent inspection resources. Recall assesses the proportion of correctly predicted positive samples out of all actual positive samples, as indicated in Equation (8). This metric is monitored closely to prevent missed detections, so that even isolated vibrated regions are identified. Since precision and recall often trade off against each other, the F₁ score, computed using Equation (9), is introduced as their harmonic mean to provide a balanced measure of the model’s robustness.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$F_{1}\ \mathrm{Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$

In Equations (7) and (8), $TP$ (True Positive) means that both the ground truth and the prediction correspond to the vibrated concrete region; $FN$ (False Negative) occurs when a vibrated concrete region in the ground truth is incorrectly classified by the model as a non-vibrated region; and $FP$ (False Positive) refers to the case where a non-vibrated concrete region is mistakenly predicted as a vibrated concrete region.
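Equations (7) to (9) can be computed directly from the three counts, as in the sketch below. The counts in the example are hypothetical; the final line reuses the precision and recall values that the paper reports for the proposed method (0.939 and 0.937) as a consistency check on the reported F₁ score.

```python
def f1_score(precision, recall):
    """Equation (9): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Equations (7) to (9) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical counts: 90 correct instances, 10 false alarms, 30 missed instances.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)

# Consistency check with the reported precision and recall of the proposed method.
print(round(f1_score(0.939, 0.937), 3))
```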

Furthermore, because this research formulates the task as instance segmentation, each disconnected vibrated concrete region is defined as an independent instance. In actual construction scenarios, factors such as construction techniques, pouring sequences, and uneven vibration processes often cause the already-vibrated concrete to manifest as multiple spatially independent and disconnected regions. While mIoU evaluates global pixel-wise overlap and can be overly dominated by massive continuous vibration regions, it often obscures the failure to detect discrete patches. In contrast, the mAP metric rigorously assesses the capability of the model to individually recall and precisely segment each independent vibrated concrete instance. Consequently, the overall segmentation performance is evaluated using mAP rather than semantic segmentation metrics such as mIoU. The computation of mAP is given in Equation (10). The mAP@0.5 metric refers to the average precision calculated at an Intersection over Union (IoU) threshold of 0.5, which serves as a standard benchmark for segmentation algorithms. Similarly, mAP@0.5:0.95 represents the average precision computed over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This metric provides a more comprehensive evaluation of algorithm performance across a wider range of IoU thresholds.

$$\mathrm{mAP} = \frac{\sum_{i=1}^{N} AP_{i}}{N} \tag{10}$$

In Equation (10), $AP_{i}$ denotes the average precision of the $i$-th category, which is defined as the area under the Precision-Recall curve for that category, and $N$ denotes the total number of categories ($N = 1$ for the single-category task considered here).
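The IoU thresholds behind mAP@0.5 and mAP@0.5:0.95 can be made explicit with a short sketch. It computes the mask IoU between a predicted and a ground-truth instance and lists the thresholds at which that prediction would count as a true positive. The function names are illustrative, and the full AP computation (ranking predictions by confidence and integrating the Precision-Recall curve) is intentionally left out.

```python
def mask_iou(pred, truth):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return intersection / union if union else 0.0

# IoU thresholds for mAP@0.5:0.95: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matched_thresholds(iou):
    """Thresholds at which a prediction with this IoU counts as a true positive."""
    return [t for t in IOU_THRESHOLDS if iou >= t]

iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(iou, matched_thresholds(iou))
```

A prediction with an IoU of 0.72, for example, is a true positive at the five thresholds from 0.50 to 0.70 and a false positive at the remaining five, which is why mAP@0.5:0.95 rewards tighter boundaries than mAP@0.5.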

4.4. Results

The proposed method utilizes ResNet-50 to extract multi-scale features from the input images. Simultaneously, masked attention is introduced to filter out and mitigate interference from complex background noise, and MSDeformAttn is employed to reinforce the edge feature representation of the vibrated concrete regions.

Figure 9 illustrates the training and validation performance of the proposed method, including the loss_mask and loss_dice terms, alongside the mAP@0.5:0.95 and mAP@0.5 metrics. The training curves show that loss_mask and loss_dice present relatively high values during the early stage of training. As the number of iterations increases, all loss values display a clear downward trend and gradually stabilize. In particular, after approximately 30 epochs, fluctuations in the loss curves become minimal, indicating that the training process has reached convergence. This convergence behavior of the loss functions reflects the effectiveness of the optimization process in reducing prediction errors. After 30 epochs, both mAP@0.5:0.95 and mAP@0.5 show stable convergence trends. The mAP@0.5 metric reaches a value of approximately 0.96, while under the stricter IoU threshold range of 0.5 to 0.95, the mAP@0.5:0.95 achieves around 0.90, indicating consistent performance across different evaluation criteria. In summary, the final loss values remain at a low and stable level, and the model maintains balanced performance across multiple evaluation metrics, demonstrating its ability to effectively recognize vibrated concrete.

To further assess the proposed method, several state-of-the-art models, namely YOLACT, YOLO11, SOLOv2, and CondInst, were trained and evaluated on the same dataset. The YOLO11 baseline in this study refers specifically to the YOLO11-n-seg model implemented within the official open-source Ultralytics framework [52], while YOLACT, Mask2Former, SOLOv2, and CondInst were implemented using the open-source MMDetection framework. YOLO11 represents the latest evolution in high-efficiency detection and segmentation models. YOLACT serves as a representative baseline for prototype-based instance segmentation. To ensure a comprehensive evaluation beyond traditional box-based approaches, SOLOv2 and CondInst are included as representative anchor-free and dynamic convolution-based segmentation methods, respectively. This diverse set of comparisons enables a thorough validation of the effectiveness of the proposed method across different architectural paradigms. The experimental results are summarized in Table 2. Figure 10 illustrates the variations in different evaluation metrics over 50 training epochs, with validation performed every two epochs.

As shown in Table 2, the proposed method achieves the highest values in the segmentation of vibrated concrete, with precision, recall, F₁ score, mAP@0.5:0.95, and mAP@0.5 of approximately 0.94, 0.94, 0.94, 0.90, and 0.96, respectively. For the mAP@0.5:0.95 metric, the proposed method demonstrates increases of approximately 12%, 17%, and 97% compared with SOLOv2, CondInst, and YOLACT, respectively. When compared with YOLO11, the proposed method demonstrates improvements of approximately 6%, 20%, 13%, 53%, and 12% in precision, recall, F₁ score, mAP@0.5:0.95, and mAP@0.5, respectively. The results indicate that higher precision is obtained while maintaining a relatively high recall, reflecting a balance between false positives and false negatives across the evaluated metrics.

As shown in Figure 10, the limited representational capacity of the mask prototypes in YOLACT restricts its ability to capture irregular target regions, leading to persistently insufficient performance throughout the training process. YOLO11, SOLOv2, and CondInst exhibit better overall performance than the proposed method during the early stage of training. This trend can be attributed to the local perception capability of CNN architectures, which enables rapid extraction of texture features and leads to stronger initial performance. Conversely, the initial performance of the proposed method is not fully manifested, as the self-attention mechanism increases optimization complexity for global relationship modeling of target regions. Nevertheless, in the later stages of training, the increasing number of iterations allows the global context modeling of Transformer architectures to fully capture the distribution information within the target regions. This effectively improves the model’s capacity to represent irregular target regions. In contrast to CNN architectures, the proposed method demonstrates superior overall consistency and finer detail preservation, thereby achieving higher segmentation accuracy and robust performance.

In addition to the comparisons with the aforementioned models, Mask2Former [27] was further selected as the baseline model for comparative analysis under the same experimental settings and dataset. The corresponding experimental results are presented in Figure 11.

As illustrated in Figure 11, the proposed method outperforms the baseline Mask2Former model across all evaluation metrics. This demonstrates the effectiveness of our integrated strategies, specifically the tailored image augmentation, loss function weight adjustment, and learning rate scheduling. Specifically, the Precision improved from 0.863 to 0.939, the Recall increased from 0.857 to 0.937, and the F1-score rose from 0.860 to 0.938. Notably, the slightly higher improvement in Recall suggests that the proposed method is particularly effective in reducing missed detection. Concurrently, the increase in Precision indicates that false detection is effectively suppressed, thereby achieving a more favorable trade-off between precision and recall. Furthermore, mAP@0.5 increased from 0.915 to 0.962, and mAP@0.5:0.95 improved from 0.845 to 0.901. The stable improvement observed under the more rigorous mAP@0.5:0.95 metric highlights the model’s enhanced capability in delineating target boundaries at higher IoU thresholds, confirming that the resulting segmentation masks are both finer and more robust. Collectively, the consistent growth across all metrics, even under strict evaluation criteria, indicates that the proposed method possesses excellent adaptability and generalization capabilities in complex operational environments.
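The gains over the Mask2Former baseline can be recomputed from the values quoted in the paragraph above. The sketch reports both absolute and relative gains; reading the percentage figures elsewhere in the paper as relative gains of this form is an assumption, and the helper name is illustrative.

```python
def relative_gain(baseline, proposed):
    """Relative improvement of the proposed value over the baseline, in percent."""
    return (proposed - baseline) / baseline * 100.0

# (baseline Mask2Former, proposed method), as quoted in the text.
reported = {
    "Precision": (0.863, 0.939),
    "Recall": (0.857, 0.937),
    "F1-score": (0.860, 0.938),
    "mAP@0.5": (0.915, 0.962),
    "mAP@0.5:0.95": (0.845, 0.901),
}

for name, (baseline, proposed) in reported.items():
    absolute = proposed - baseline
    print(f"{name}: +{absolute:.3f} absolute, +{relative_gain(baseline, proposed):.1f}% relative")
```

The absolute gain in Recall (0.080) is marginally larger than that in Precision (0.076), which is the basis for the observation that missed detections are reduced slightly more than false detections.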

To validate the stability and reliability of the proposed method, multiple independent experiments were conducted using different random seeds under identical data splits and training configurations. A total of five random seeds were adopted to mitigate the potential influence of stochastic variations in a single run, as reported in Table 3.

To evaluate performance fluctuations under different stochastic conditions, this study conducts statistical analysis on the results of five repeated experiments by computing the standard deviation $S$, as illustrated in Figure 12. Let $x_{1}, x_{2}, x_{3}, x_{4}, x_{5}$ denote the values of a given evaluation metric across the five runs. The corresponding mean value is calculated according to Equation (11), with $n = 5$.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} \tag{11}$$

The standard deviation is used to measure the extent of performance variation and the stability of the model under different random initialization conditions. A smaller standard deviation indicates that the model exhibits more consistent performance across different training runs, reflecting better robustness. In contrast, a larger standard deviation suggests that the model is more sensitive to stochastic factors and may suffer from unstable performance. Therefore, in addition to average performance, the standard deviation provides a more comprehensive assessment of model reliability for practical engineering applications. The standard deviation is computed using the sample standard deviation formulation, as shown in Equation (12).

S = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}}

(12)

Overall, all evaluation metrics, including Precision, Recall, F1-score, mAP@0.5:0.95, and mAP@0.5, demonstrate strong consistency across runs, indicating that the training process of the proposed model is robust to random initialization. As reported in Table 3, Precision ranges from 0.897 to 0.939, while Recall varies within a narrower range of 0.911 to 0.937. The average F1-score is approximately 0.92, with relatively minor variations, suggesting a well-balanced trade-off between precision and recall. For the more stringent metric mAP@0.5:0.95, the mean value is 0.8876 with a standard deviation of 0.0087, indicating that the model maintains stable detection performance across different random initializations. Under the more relaxed evaluation criterion mAP@0.5, the mean reaches 0.953, with an even smaller standard deviation of 0.0076, further confirming the consistency of the model’s recognition capability. Taken together, the results of multiple runs demonstrate that the proposed method not only achieves strong average performance but also maintains low variance under different random initialization conditions, supporting its stability and robustness in complex scenarios.
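The mean and sample standard deviation defined in Equations (11) and (12) can be reproduced directly from the per-run values in Table 3. The sketch below is an illustration only, assuming the Python standard library; the names `runs`, `summarize`, and `summary` are introduced here and are not part of the original study.

```python
from statistics import mean, stdev  # stdev uses the n - 1 (sample) denominator

# Per-run values copied from Table 3 (five random seeds).
runs = {
    "Precision":    [0.9390, 0.8970, 0.9060, 0.9130, 0.9250],
    "Recall":       [0.9370, 0.9110, 0.9190, 0.9230, 0.9310],
    "F1":           [0.9380, 0.9039, 0.9124, 0.9180, 0.9280],
    "mAP@0.5:0.95": [0.9010, 0.8780, 0.8830, 0.8860, 0.8901],
    "mAP@0.5":      [0.9620, 0.9430, 0.9490, 0.9520, 0.9590],
}


def summarize(values):
    """Return (mean, sample standard deviation), as in Equations (11) and (12)."""
    return mean(values), stdev(values)


# Hypothetical container: metric name -> (mean, standard deviation).
summary = {name: summarize(values) for name, values in runs.items()}

if __name__ == "__main__":
    for name, (m, s) in summary.items():
        print(f"{name}: {m:.4f} ± {s:.4f}")
```

Rounded to four decimals, these values agree with the mean ± std entries of Table 3, for example 0.9530 ± 0.0076 for mAP@0.5. Using the n - 1 denominator matters with only five runs: the population formula (dividing by n) would give a smaller spread.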

To evaluate the segmentation performance in recognizing vibrated concrete regions, inference was conducted on randomly selected test images from dam pouring block scenarios distinct from those used in the training set. Figure 13 and Figure 14 present the original images along with the corresponding segmentation results. The results of YOLACT, YOLO11, CondInst, and SOLOv2 show varying degrees of false or missed detection. In Figure 13, YOLACT, SOLOv2, and CondInst exhibit evident missed detections. Although SOLOv2 produces relatively high confidence scores within target regions, its overall segmentation results remain incomplete. YOLO11 achieves a lower miss rate; however, its confidence scores are comparatively modest, ranging from 0.60 to 0.90. In Figure 14, YOLO11 generates false detections with confidence scores below 0.90. YOLACT incorrectly segments wheel tracks as target regions, with confidence values ranging from 0.40 to 0.85. SOLOv2 continues to exhibit significant missed detections, while CondInst shows relatively low confidence in localized regions, with scores dropping to 0.72. In contrast, the proposed method demonstrates superior robustness across both complex scenarios. It achieves high-precision and high-confidence segmentation in Figure 13 and effectively distinguishes targets from background interference in Figure 14, thereby ensuring reliable segmentation performance. Notably, the confidence scores of the proposed method consistently exceed 0.95 across different scenarios. As a result, the proposed method exhibits superior accuracy and robustness in the recognition of vibrated concrete regions and remains reliable under complex and dynamic construction conditions.

5. Discussion

In this study, we proposed and validated a novel end-to-end segmentation model based on a CNN-Transformer architecture driven by a masked attention mechanism. This framework successfully achieves high-precision mapping of vibrated concrete regions within complex dam construction environments.

Concrete vibration is a critical process that directly dictates the homogeneity, compactness, and ultimate durability of dam structures. In practice, the randomness of vibrator trajectories, compounded by the visual blind spots of human operators, frequently leads to machines repeatedly traversing already-vibrated concrete, thereby compromising material integrity. However, visually identifying these vibrated concrete regions on the pouring surfaces remains highly challenging due to the ambiguous texture of the concrete, the dense presence of machinery, and the continuous movement of equipment. Conventional CNN-based methods, which rely exclusively on local textural feature extraction, exhibit inadequate performance when applied to raw, low-contrast images collected from unstructured on-site environments.

To bridge this algorithmic gap, this study proposed a hybrid CNN-Transformer architecture that incorporates MSDeformAttn to fuse multi-scale features, effectively capturing both fine-scale local textures and global contextual information. Experimental results show that this method achieves significant improvements in the mAP@0.5:0.95 metric, which serves as a comprehensive measure of detection and segmentation accuracy. Specifically, the proposed method outperformed the baselines YOLACT, YOLO11, SOLOv2, and CondInst by relative margins of approximately 97%, 53%, 17%, and 12%, respectively (Table 2). Correspondingly, in terms of the F₁ score, the proposed method exceeded these baselines by approximately 42%, 13%, 10%, and 6%.
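The percentage margins above are relative improvements over each baseline, not differences in percentage points. As a minimal sketch under that reading, the following recomputes them from the F1 and mAP@0.5:0.95 columns of Table 2; the names `table2`, `ours`, and `relative_gain_percent` are assumptions introduced for illustration.

```python
# (F1, mAP@0.5:0.95) per baseline model, copied from Table 2.
table2 = {
    "YOLACT":   (0.6621, 0.4580),
    "YOLO11":   (0.8327, 0.5908),
    "SOLOv2":   (0.8509, 0.7730),
    "CondInst": (0.8868, 0.8070),
}
# (F1, mAP@0.5:0.95) of the proposed method, also from Table 2.
ours = (0.9380, 0.9010)


def relative_gain_percent(ours_value: float, baseline_value: float) -> int:
    """Relative improvement over a baseline, in percent, rounded to an integer."""
    return round((ours_value - baseline_value) / baseline_value * 100)


# Hypothetical container: model name -> relative gain per metric.
gains = {
    model: {
        "F1": relative_gain_percent(ours[0], f1),
        "mAP@0.5:0.95": relative_gain_percent(ours[1], map_strict),
    }
    for model, (f1, map_strict) in table2.items()
}

if __name__ == "__main__":
    for model, values in gains.items():
        print(f"{model}: F1 +{values['F1']}%, "
              f"mAP@0.5:0.95 +{values['mAP@0.5:0.95']}%")
```

For instance, the YOLACT margin follows from (0.9010 - 0.4580) / 0.4580, which is roughly 0.97; the absolute difference for the same pair is 0.443 in mAP@0.5:0.95.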

For construction stakeholders, these performance gains are practically significant. The results indicate that the masked attention mechanism enables the model to distinguish between visually similar but semantically distinct concrete regions and to remain robust against on-site visual interference. For construction automation research, these findings provide a perceptual link required by intelligent equipment: accurate mapping of vibrated regions supplies the environmental understanding needed to develop more autonomous and intelligent construction systems. Rather than relying on human operation, intelligent vibration equipment can leverage the distribution data for dynamic, data-driven decision-making and trajectory planning, thereby reducing redundant traversals.

Despite its superior accuracy, the proposed method faces limitations regarding computational cost, particularly when processing larger datasets or deploying on resource-constrained edge devices. In addition, dynamic occlusions caused by the frequent movement of mechanical equipment on the pouring surface also pose challenges to target recognition. Figure 15 presents a representative failure case in Scene 2 of Section 4.4, where occlusion from mechanical equipment leads to temporary missed detections in certain target regions. This issue is mainly attributed to the loss of local feature information due to occlusion, which degrades the stability of frame-level detection. However, such occlusion only affects local recognition results and does not significantly impact the overall performance of the model in identifying vibrated concrete regions. Future research will focus on developing lightweight adaptations of this architecture. Techniques such as knowledge distillation and network pruning will be explored to reduce computational complexity without sacrificing segmentation accuracy. Meanwhile, future work will further integrate multi-source sensor data and embedded electronic circuits to enhance robustness and stable recognition capability under dynamic occlusion in real-world industrial scenarios. Additionally, efforts will be directed toward expanding the dataset under varying illumination and weather conditions. A systematic quantitative robustness analysis will also be conducted to further improve the model’s adaptability and generalization capability in diverse construction environments. On this basis, future work will explore and compare other Transformer-based architectures of similar design. The goal is to identify more competitive alternatives in terms of both accuracy and efficiency, thereby further improving the overall technical framework for this task.

In conclusion, the implications of these findings extend beyond improved image segmentation and contribute new insights to civil engineering informatics. By accurately mapping vibrated regions, this visual feedback system provides a novel management perspective that directly addresses the persistent issue of repeated machine traversals. Beyond quality assurance of the concrete vibration process, the proposed framework also demonstrates the potential of vision-based intelligent monitoring for concrete performance assessment and non-destructive quality evaluation in complex construction environments. These capabilities establish meaningful connections with emerging intelligent structural assessment and monitoring frameworks in modern civil engineering. Inspired by recent developments in intelligent structural assessment [56] and non-destructive monitoring [57], future work may further integrate multi-modal sensing strategies to enhance robustness and engineering applicability under complex field conditions. Furthermore, this study lays a foundation for the deployment of autonomous robotics in hydraulic engineering, supporting the industry’s shift toward unmanned, intelligent, and autonomous construction operations.

6. Conclusions

This study proposes a highly accurate visual recognition method for identifying vibrated concrete regions, effectively addressing the dual challenges of ambiguous boundary features between concrete states and dynamic visual interference in construction environments. Comparative experimental results show that the state-of-the-art segmentation baselines, including YOLACT, YOLO11, SOLOv2, and CondInst, suffer from varying degrees of missed and false detection. The proposed method demonstrates superior robustness, effectively suppressing complex background noise and overcoming the inherent challenges of high texture similarity.

The proposed method introduces a novel CNN-Transformer architecture driven by a masked attention mechanism, enabling visual information feedback on the distribution of vibrated concrete regions. Compared with the CNN-based YOLO11 model, the proposed method achieves relative improvements of approximately 6%, 20%, 13%, 53%, and 12% in terms of precision, recall, F₁ score, mAP@0.5:0.95, and mAP@0.5, respectively. The improved performance is attributed to the synergistic architectural design. Specifically, integrating the ResNet-50 backbone with the MSDeformAttn module strengthens the learning of detailed texture features and enhances the capability of edge feature representation. Moreover, the masked attention mechanism filters out complex background interference, supporting high-fidelity feature extraction and improving the overall recognition accuracy of vibrated regions. As a result, the model translates unstructured visual data into an accurate distribution map of vibrated concrete. When integrated into vibrating machinery, this visualized distribution data can serve as an important operational constraint, providing perceptual awareness of construction status that supports higher-level autonomy and facilitates intelligent trajectory planning.

Despite these advancements, several limitations remain to be addressed in future work. First, the high parameter count of the current model limits computational efficiency, making it challenging to meet the demands for high-frame-rate, low-latency feedback on dynamic dam pouring surfaces. Second, model accuracy remains susceptible to degradation under severe illumination changes (e.g., intense back-lighting, shifting shadows) and adverse weather conditions (e.g., rain, fog). Finally, the current reliance on single-source visual data makes the system vulnerable to visual occlusion caused by moving construction equipment, which can interrupt recognition continuity.

To overcome these challenges, future research will focus on developing lightweight models via techniques such as knowledge distillation and network pruning. In addition, deployment-oriented acceleration on embedded edge computing platforms will be investigated to improve real-time inference capability and ensure compatibility with edge devices. Furthermore, the dataset will be expanded to encompass a wider variety of environmental and meteorological conditions, thereby improving model generalization. Multi-modal fusion approaches incorporating ultrasonic sensing [58] or other sensor-based data can also be considered to establish a robust recognition system capable of maintaining stable target identification under dynamic visual occlusions. Finally, rigorous statistical methods such as the Taguchi-ANOVA approach [59] can be employed for hyper-parameter optimization, thereby improving both the stability and accuracy of model performance.

Author Contributions

Conceptualization, L.L. and Y.Z.; data curation, C.Z.; methodology, Y.J. and L.L.; software, C.Z. and Y.J.; resources, Y.Z.; validation, Y.J. and F.W.; formal analysis, C.Z.; funding acquisition, L.L.; project administration, Y.Z.; Supervision, Y.Z.; visualization, H.Z.; writing—original draft, Y.J. and L.L.; writing—review and editing, H.Z. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work and the APC were funded by the Foundation Project for Doctors’ Research in Hubei University of Technology, China (XJ2023006901).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ren, Q.; Li, H.; Li, M. Towards online monitoring of concrete dam displacement subject to time-varying environments: An improved sequential learning approach. Adv. Eng. Inform. 2023, 55, 101881.
2. Huang, B.; Kang, F.; Li, J. Displacement prediction model for high arch dams using long short-term memory based encoder-decoder with dual-stage attention considering measured dam temperature. Eng. Struct. 2023, 280, 115686.
3. Ren, B.; Wang, H.; Wang, D. Vision method based on deep learning for detecting concrete vibration quality. Case Stud. Constr. Mater. 2023, 18, e02132.
4. Quan, Y.; Wang, F. Machine learning-based real-time tracking for concrete vibration. Autom. Constr. 2022, 140, 104343.
5. Li, T.; Wang, H.; Tan, J.; Tan, J. Intelligent quality assessment of concrete vibration using computer vision and large language models. Autom. Constr. 2025, 180, 106507.
6. Li, J.; Tian, Z.; Ma, Y.; Ma, Y. Feedback control system for vibration construction of fresh concrete. Mech. Syst. Signal Process. 2024, 216, 111461.
7. Boumiz, A.; Vernet, C. Mechanical properties of cement pastes and mortars at early ages: Evolution with time and degree of hydration. Adv. Cem. Based Mater. 1996, 3, 94–106.
8. Zhu, B.; Zheng, Y.; Rong, Z.; Zhao, Z.; Yang, J.; Yang, B. Impact of early-age vibration on the permeability and microstructure of mature-age concrete. Constr. Build. Mater. 2024, 453, 139124.
9. Li, T.; Wang, H.; Tan, J.; Tan, J. A continuous concrete vibration method for robots based on machine vision with integrated spatial features. Appl. Soft Comput. 2024, 167, 112231.
10. Xu, Y.; Zhou, Y.; Sekula, P. Machine learning in construction: From shallow to deep learning. Dev. Built Environ. 2021, 6, 100045.
11. Li, T.; Wang, H.; Pan, D. A machine vision approach with temporal fusion strategy for concrete vibration quality monitoring. Appl. Soft Comput. 2024, 160, 111684.
12. Zhang, H.; Jin, Y.; Liu, Q. Intelligent monitoring method for tamping times during dynamic compaction construction using machine vision and pattern recognition. Measurement 2022, 193, 110835.
13. Yang, K.; Wang, H.; Wang, K. An effective monitoring method of dynamic compaction construction quality based on time series modeling. Measurement 2024, 224, 113930.
14. Zhang, H.; Yang, Q.; Liu, Q. Multi-sensor integrated monitoring equipment and its application to dynamic compaction quality in construction. Autom. Constr. 2023, 156, 105151.
15. Jiang, D.; Kong, L.; Wang, H. Precise control mode for concrete vibration time based on attention-enhanced machine vision. Autom. Constr. 2024, 158, 105232.
16. Wang, D.; Ren, B.; Cui, B. Real-time monitoring for vibration quality of fresh concrete using convolutional neural networks and IoT technology. Autom. Constr. 2021, 123, 103510.
17. Wang, S.; Chen, L.; Shi, P. Computer vision based manual concrete vibration quality monitoring. Dev. Built Environ. 2026, 26, 100895.
18. Cao, G.; Bai, Y.; Shi, Y. Investigation of vibration on rheological behavior of fresh concrete using CFD-DEM coupling method. Constr. Build. Mater. 2024, 425, 135908.
19. Koch, J.A.; Castaneda, D.I.; Ewoldt, R.H. Vibration of fresh concrete understood through the paradigm of granular physics. Cem. Concr. Res. 2019, 115, 31–42.
20. Mia, M.S.; Arnob, A.B.H.; Naim, A. ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain. In Proceedings of the 2023 International Conference on the Cognitive Computing and Complex Data (ICCD), Huai’an, China, 21–22 October 2023; pp. 101–117.
21. Shan, J.; Huang, Y.; Jiang, W. DCUFormer: Enhancing pavement crack segmentation in complex environments with dual-cross/upsampling attention. Expert Syst. Appl. 2025, 264, 125891.
22. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
23. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-Pixel Classification is Not All You Need for Semantic Segmentation. arXiv 2021.
24. Goo, J.M.; Milidonis, X.; Artusi, A. Hybrid-Segmentor: Hybrid approach for automated fine-grained crack segmentation in civil infrastructure. Autom. Constr. 2025, 170, 105960.
25. Ma, Y.; Zhang, Z. Position-Guided Hybrid Convolutional Neural Network and Transformer Network for steel strip surface defect detection. Eng. Appl. Artif. Intell. 2025, 162, 112741.
26. Beyene, D.A.; Tola, K.D.; Yigzew, F.E. Hybrid multi-scale CNN-Transformer network for structural surface crack segmentation. Results Eng. 2026, 30, 110145.
27. Cheng, B.; Misra, I.; Schwing, A.G. Masked-attention Mask Transformer for Universal Image Segmentation. arXiv 2022.
28. Tanaka, H.; Shibano, K.; Suzuki, T. Scene classification-assisted deep learning for crack detection in asphalt pavements. Case Stud. Constr. Mater. 2025, 23, e05064.
29. Li, D.; Xie, Q.; Gong, X. Automatic defect detection of metro tunnel surfaces using a vision-based inspection system. Adv. Eng. Inform. 2021, 47, 101206.
30. Fan, Z.; Chen, B.; Wang, Z. Defects recognition in on-site GPR images of mountain tunnel linings based on YOLOv8N model. Case Stud. Constr. Mater. 2025, 23, e05196.
31. Zhao, N.; Song, Y.; Liu, H. A novel MPDENet model and efficient combined loss function for real-time pixel-level segmentation detection of tunnel lining cracks. Case Stud. Constr. Mater. 2025, 22, e04618.
32. Fan, H.; Tian, Z.; Xu, X. Rockfill material segmentation and gradation calculation based on deep learning. Case Stud. Constr. Mater. 2022, 17, e01216.
33. Wang, S.; Han, R.; Wu, X.; Zhao, D.; Zeng, X.; Yin, R.; Han, Z.; Liu, Y.; Shu, S. Crack segmentation and quantification in concrete structures using a lightweight YOLO model based on pruning and knowledge distillation. Expert Syst. Appl. 2025, 283, 127834.
34. Yang, K.; Bao, Y.; Li, J. Deep learning-based YOLO for crack segmentation and measurement in metro tunnels. Autom. Constr. 2024, 168, 105818.
35. Li, H. Tunnel lining crack detection method based on deformable convolution and feature fusion with image enhancement of Retinex theory. Expert Syst. Appl. 2026, 299, 130285.
36. Liu, B.; Zhang, J.; Lei, M. Simultaneous tunnel defects and lining thickness identification based on multi-tasks deep neural network from ground penetrating radar images. Autom. Constr. 2023, 145, 104633.
37. Al-Huda, Z.; Peng, B.; Algburi, R.N.A. Asymmetric dual-decoder-U-Net for pavement crack semantic segmentation. Autom. Constr. 2023, 156, 105138.
38. Qu, Z.; Chen, W.; Wang, S.-Y. A Crack Detection Algorithm for Concrete Pavement Based on Attention Mechanism and Multi-Features Fusion. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11710–11719.
39. Li, M.; Yuan, J.; Ren, Q. CNN-Transformer hybrid network for concrete dam crack patrol inspection. Autom. Constr. 2024, 163, 105440.
40. Ma, Y.; Bao, T.; Li, Y. GANFormerNet: A UAV-based Concrete Crack Segmentation Model for Water-related Structures Using Vision Transformer and Graph Attention Network. Adv. Eng. Inform. 2025, 68, 103725.
41. Wang, W.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275.
42. Cao, H.; Wang, Y.; Chen, J. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Computer Vision – ECCV 2022 Workshops; Springer Nature: Cham, Switzerland, 2023; pp. 205–218.
43. Li, G.; Zhang, J.; Hu, K. HAIS-SegFormer: A Lightweight Underwater Crack Segmentation Network Based on Hybrid Attention and Feature Inhibition. J. Mar. Sci. Eng. 2026, 14, 526.
44. Zhang, E.; Shao, L.; Wang, Y. Unifying transformer and convolution for dam crack detection. Autom. Constr. 2023, 147, 104712.
45. Gan, G.; Xu, X.; Ding, Y. Pixel-Level Detection of Cracks Based on Loop Semantic Diffusion Integration. J. Comput. Civ. Eng. 2025, 39, 04025049.
46. Asadi Shamsabadi, E.; Xu, C.; Rao, A.S. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Autom. Constr. 2022, 140, 104316.
47. Zhu, X.; Yu, W.; Dong, X. MR-Former: Improving universal image segmentation via refined masked-attention transformer. Alex. Eng. J. 2025, 131, 232–244.
48. Zuo, X.; Sheng, Y.; Shen, J. Multilabel Sewer Pipe Defect Recognition with Mask Attention Feature Enhancement and Label Correlation Learning. J. Comput. Civ. Eng. 2025, 39, 04024050.
49. Zhang, H.; Li, F.; Xu, H. MP-Former: Mask-Piloted Transformer for Image Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18074–18083.
50. Li, F.; Zhang, H.; Xu, H. Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation. arXiv 2022.
51. Meda, D.; Ahmed, M.M.; Kalapatapu, P.; Pasupuleti, V.D.K. Enhanced Structural Damage Detection, Segmentation, and Quantification Using Computer Vision and Deep Learning. J. Comput. Civ. Eng. 2025, 39, 04025066.
52. Pantoja-Rosero, B.G.; Salamone, S. Integrating extended reality and AI-based damage segmentation for near real-time, traceable bridge inspections. Autom. Constr. 2025, 180, 106567.
53. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689.
54. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152.
55. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. arXiv 2020, arXiv:2003.05664.
56. Khatir, A.; Capozucca, R.; Khatir, S.; Magagnini, E.; Le Thanh, C.; Riahi, M.K. Advancements and emerging trends in integrating machine learning and deep learning for SHM in mechanical and civil engineering: A comprehensive review. J. Braz. Soc. Mech. Sci. Eng. 2025, 47, 419.
57. Bouabdallah, A.; Benaissa, A.; Bouabdallah, M.A.; Malab, S.; Khatir, A. Development and performance evaluation of self-leveling sand concrete: Enhanced fluidity, mechanical strength, durability, and non-destructive analysis. Constr. Build. Mater. 2025, 468, 140463.
58. Cacciola, M.; Angiulli, G.; Burrascano, P.; Laganà, F.; Versaci, M. A Prototypical Fuzzy Similarity-Based Classification Framework for Ultrasonic Defect Detection in Concrete. Eng 2026, 7, 88.
59. Laganà, F.; Pratticò, D.; Quattrone, M.F.; Pullano, S.A.; Calcagno, S. Hybrid AI–Taguchi–ANOVA Approach for Thermographic Monitoring of Electronic Devices. Eng 2026, 7, 28.

Figure 1. The proposed framework in this study.

Figure 2. Illustration of the DeformAttn.

Figure 3. Architecture of the Transformer Decoder with masked attention.

Figure 4. The architecture of the proposed workflow.

Figure 5. Image Annotation Examples for vibrated concrete regions. (a) Annotation Example of Vibrated Regions in dam pouring block 1. (b) Annotation Example of Vibrated Regions in dam pouring block 2. (c) Annotation Example of Vibrated Regions in dam pouring block 3.

Figure 6. Feature Extraction Pipeline of the ResNet-50.

Figure 7. Generation Pipeline of the embedding feature map.

Figure 8. Process of Target Modeling and Segmentation Prediction.

Figure 9. Training Curves of the Proposed Method: Loss, mAP@0.5, and mAP@0.5:0.95.

Figure 10. Comparison of different models in terms of Precision, Recall, mAP@0.5, and mAP@0.5:0.95.

Figure 11. Performance comparison between Mask2Former and the proposed method.

Figure 12. Mean Performance and Standard Deviation Statistics of Repeated Experiments.

Figure 13. Visual Comparison of Segmentation Results Across Different Models in scene 1.

Figure 14. Visual Comparison of Segmentation Results Across Different Models in scene 2.

Figure 15. Recognition Results under Construction Equipment Occlusion in Scene 2.

Table 1. Composition of the multi-condition dam pouring datasets.

Dam Pouring Block ID	Video ID	Acquisition Time	Weather Condition	Training Images	Validation Images	Block Total
1	V₁	4:00–4:40 PM	Cloudy	104	26	146
1	V₂	5:10–5:30 PM	Cloudy	12	4	146
2	V₃	10:30–11:00 AM	Sunny	56	16	198
2	V₄	2:00–2:30 PM	Sunny	103	23	198
3	V₅	10:40–11:10 AM	Sunny	73	19	92
Total	-	-	-	348	88	436

Table 2. Performance comparison of various models on vibrated region segmentation.

Models	P	R	F1	mAP@0.5:0.95	mAP@0.5
YOLACT [53]	0.8140	0.5580	0.6621	0.4580	0.6860
YOLO11 [52]	0.8888	0.7832	0.8327	0.5908	0.8607
SOLOv2 [54]	0.9220	0.7900	0.8509	0.7730	0.8670
CondInst [55]	0.9000	0.8740	0.8868	0.8070	0.9250
Our study	0.9390	0.9370	0.9380	0.9010	0.9620

Table 3. Results of Five Repeated Experiments Under Different Random Seeds.

Metrics	1	2	3	4	5	Mean ± std
Precision	0.9390	0.8970	0.9060	0.9130	0.9250	0.9160 ± 0.0164
Recall	0.9370	0.9110	0.9190	0.9230	0.9310	0.9242 ± 0.0102
F₁	0.9380	0.9039	0.9124	0.9180	0.9280	0.9201 ± 0.0133
mAP@0.5:0.95	0.9010	0.8780	0.8830	0.8860	0.8901	0.8876 ± 0.0087
mAP@0.5	0.9620	0.9430	0.9490	0.9520	0.9590	0.9530 ± 0.0076

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lei, L.; Ji, Y.; Zhou, Y.; Zhao, C.; Wang, F.; Zhou, H.; Liang, Z. Visual Guidance for Construction Equipment: A Masked-Attention Multi-Scale Transformer for Vibrated Concrete Recognition. Appl. Sci. 2026, 16, 5479. https://doi.org/10.3390/app16115479

