Incorporating Structural Prior Knowledge into YOLO for Robust Infrastructure Damage Detection

Zhang, Zichen; Guo, Chengjun

doi:10.3390/buildings16112105

Open Access Article

by Zichen Zhang and Chengjun Guo *

School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(11), 2105; https://doi.org/10.3390/buildings16112105

Submission received: 28 April 2026 / Revised: 14 May 2026 / Accepted: 20 May 2026 / Published: 25 May 2026

(This article belongs to the Special Issue AI-Enhanced Defect Detection and Quality Assurance in Building Structures)

Abstract

Vision-based structural defect detection methods based on YOLOv11 have achieved promising performance in recent years; however, their robustness in real engineering environments remains limited due to illumination variation, shadow occlusion, surface contamination, and complex background textures. Existing data-driven approaches primarily rely on visual appearance features while neglecting the intrinsic geometric continuity and morphological characteristics associated with structural failures such as cracks and spalling. To address these challenges, this study proposes an enhanced defect detection framework termed GCA-YOLO for intelligent structural inspection. The proposed method integrates a Geometric Constraint Attention (GCA) module and a Residual Efficient Channel Attention (RECA) module to improve feature representation. Instead of explicit physical simulation, the GCA module embeds morphology-guided geometric priors into the attention mechanism using differentiable gradient and Laplacian operators. This enforces structural continuity perception and suppresses geometrically inconsistent responses caused by background noise. Furthermore, a geometry confidence gating mechanism adaptively modulates the contribution of morphological features, while the RECA module recalibrates channel-wise responses to enhance the representation of weak and low-contrast defects. To comprehensively evaluate the proposed method, experiments were conducted on three representative datasets, including a public crack dataset and two self-built datasets (one for peeling/detachment and one for crack defects). These datasets were collected from diverse civil infrastructure scenarios such as bridges, tunnels, and pavements under challenging conditions including low illumination, shadow occlusion, complex textures, and heterogeneous backgrounds. 
Compared with the baseline YOLOv11 model, the proposed GCA-YOLO framework improves mAP@0.5 by 2.2%, 2.5%, and 15.9% on the public crack dataset, the self-built peeling/detachment dataset, and the self-built crack dataset, respectively. Recall improves by 4.6%, 3.8%, and 33.1%, respectively, indicating that the proposed dual-attention framework localizes defects more completely and reduces missed detections. Despite these performance gains, the framework maintains a lightweight architecture and does not introduce significant computational overhead. The experimental results show strong robustness, stable generalization capability, and favorable detection efficiency across different defect categories and engineering scenarios, suggesting promising potential for intelligent infrastructure inspection, urban safety monitoring, and practical engineering deployment.

Keywords:

YOLOv11; object detection; dual attention mechanism; geometry-constrained attention; residual channel attention

1. Introduction

As urbanization continues to accelerate worldwide, the demand for high-quality infrastructure construction, maintenance, and lifecycle management has increased substantially. Meanwhile, a large number of aging infrastructure assets are gradually approaching or exceeding their designed service life, leading to increasingly common structural damages such as cracks, spalling, and corner breakage [1]. These defects not only reduce structural durability, serviceability, and long-term reliability, but may also evolve into severe safety hazards, potentially causing catastrophic failures under extreme loading or adverse environmental conditions. Similar infrastructure risks are widely reported in cases of bridge collapse and earthquake-induced damage [2], while advanced structural monitoring technologies such as satellite-based InSAR underscore the importance of continuous and reliable inspection systems for large-scale infrastructure networks [3]. Therefore, accurate and timely detection of infrastructure damage is of great importance for accident prevention and urban safety assurance.

However, conventional manual inspection is time-consuming, labor-intensive, and heavily reliant on expert knowledge, which limits its suitability for large-scale monitoring tasks. Consequently, deep learning approaches, especially YOLO-based detectors, have become a prevailing solution for automated structural damage detection [4].

Recent advances in YOLO-based architectures have significantly improved defect detection performance by incorporating enhanced feature representation strategies. In particular, recent studies have explored YOLOv11 and its variants for infrastructure inspection tasks. Huang et al. [9] and Tian et al. [10] proposed optimized YOLOv11 frameworks for crack and multi-category defect detection, respectively, demonstrating improved accuracy in concrete surface analysis. In addition, Shan et al. [5] introduced a lightweight YOLOv11-based segmentation model for steel surface defect detection, further highlighting the trend toward efficient and deployable architectures.

Ruggieri et al. [6] proposed an attention-enhanced YOLO11 framework for reinforced concrete bridge inspection, utilizing a real-world dataset of expert-annotated surface defects. Their method integrates attention to focus on relevant structural regions, achieving improved precision and recall while maintaining real-time efficiency for field deployment. They also validated its reliability with explainable AI (e.g., Eigen-CAM), demonstrating model interpretability in practical inspection scenarios.

Despite these improvements, recent YOLO-based methods still rely heavily on static or weakly adaptive attention mechanisms. As a result, they struggle to dynamically adjust feature importance across varying defect scales and morphological characteristics, leading to suboptimal performance in detecting small-scale or early-stage structural damage. Furthermore, stacking multiple attention modules may introduce high-frequency noise amplification and overfitting issues.

To further improve performance, recent studies have introduced attention-enhanced YOLO variants. Weng et al. [11] improved crack feature extraction through attention-based optimization, while Wang et al. [7] incorporated multi-scale contextual modeling for UAV-based road damage detection. Similarly, Badar et al. [8] proposed lightweight transformer-based architectures for real-time edge deployment in intelligent transportation systems. Nevertheless, these approaches still focus on visual feature enhancement without embedding explicit morphology-driven structural prior knowledge.

More importantly, conventional vision-based detectors often underutilize the inherent morphological characteristics observed in civil engineering defect patterns. In practical infrastructure scenarios, defects such as cracks, spalling, and corner breakage usually exhibit certain visual regularities, including edge continuity, boundary sharpness, and local connectivity. However, existing models may still produce discontinuous predictions, such as fragmented crack detections or inconsistent defect boundaries, potentially affecting detection stability and engineering interpretability.

Motivation and Gap:

In civil engineering applications, structural damage is reflected not only by appearance variations but also by morphology-related spatial characteristics within image space. Although recent YOLO-based detectors have demonstrated strong visual recognition capability, limited attention has been paid to incorporating morphology-aware structural information into the feature learning process. Consequently, a gap still exists between purely data-driven visual perception and structure-aware defect representation.

Additionally, while many studies have investigated YOLO-based and attention-enhanced frameworks for defect detection in recent years, the current literature still lacks comprehensive integration of morphology-driven structural priors into the detection pipeline. Existing works mainly focus on improving detection accuracy through architectural refinement, multi-scale feature fusion, or generic attention mechanisms. For instance, Ruggieri et al. [6] proposed an attention-enhanced YOLO11 framework for reinforced concrete bridge inspection and demonstrated improved detection performance under real engineering conditions. However, their method primarily relies on visual attention modeling and does not explicitly incorporate morphology-aware structural constraints to enforce geometric consistency of defect shapes. This reflects a broader limitation of the existing approaches, where structural priors remain insufficiently explored in the context of deep learning-based defect detection.

Comparative Analysis and Novelty:

In contrast to existing YOLO-based and attention-based methods, the proposed GCA-YOLO explicitly introduces morphology-aware structural priors into the feature learning process through a Geometric Constraint Attention (GCA) mechanism. Unlike prior works that mainly enhance feature representation at the semantic or channel level, our approach focuses on embedding geometric continuity and structural consistency directly into the attention formulation. This enables the model to better preserve defect morphology characteristics such as edge continuity, boundary sharpness, and texture discontinuity, which are critical for reliable infrastructure inspection.

Furthermore, while existing methods, including attention-enhanced YOLO variants, improve detection performance through general feature refinement, they do not explicitly model the structural regularities inherent in civil engineering defects. The proposed framework bridges this gap by integrating a dual-attention mechanism (GCA + RECA), where GCA enforces morphology-aware spatial consistency and RECA enhances channel-wise discriminability for weak and low-contrast defects.

Contributions:

To address the above limitations, we propose a novel framework called GCA-YOLO, integrating a Geometric Constraint Attention (GCA) module and Residual Efficient Channel Attention (RECA) into YOLOv11. The main contributions are summarized as follows:

We propose a Geometric Constraint Attention (GCA) module that incorporates morphology-aware structural priors inspired by visual regularities in civil infrastructure defects into feature learning. Characterized by edge continuity, boundary sharpness, and texture discontinuity, these priors are implemented via differentiable image-domain operators to enhance structural consistency without explicit physical modeling.
We design a Residual Efficient Channel Attention (RECA) module to improve channel-wise feature representation while alleviating noise amplification and enhancing robustness for fine-grained defect detection tasks.
We develop a dual-attention YOLOv11 framework combining morphology-aware spatial modeling (GCA) and adaptive channel refinement (RECA), achieving improved detection accuracy and stronger structural consistency in complex real-world infrastructure inspection scenarios.

2. Related Works

2.1. Deep Learning for Structural Damage Detection

The YOLO series has been widely applied to structural damage detection. Recent studies have focused on adapting YOLOv11 and its variants for engineering inspection tasks. Huang et al. [9] proposed a real-time crack detection model based on YOLOv11, while Tian et al. [10] developed a multi-category defect detection framework for concrete surfaces. In addition, Shan et al. [5] introduced a lightweight YOLOv11-based segmentation model for industrial steel surface inspection.

Although these approaches improve detection performance, they remain primarily vision-driven and do not explicitly incorporate morphology-aware structural priors for defect appearance.

2.2. Attention Mechanism in Visual Tasks

Attention mechanisms have been widely applied in defect detection to enhance feature representation. Weng et al. [11] introduced an attention-based feature extraction method for crack detection, while Wang et al. [7] proposed a UAV-based road damage detection method incorporating a multi-scale context enhancement module. Badar et al. [8] further integrated a lightweight transformer-based backbone network into an Internet of Things system for real-time deployment. Ruggieri et al. [6] demonstrated that an attention-enhanced YOLO11 framework can achieve robust concrete bridge defect detection even under complex conditions involving occlusion and varied defect appearances.

Despite these advancements, general-purpose attention mechanisms often lack explicit modeling of consistent structural priors, making them susceptible to background interference such as concrete surface noise, stains, and illumination changes.

2.3. Morphology-Aware Structural Prior Learning

Recent studies have started to assess the integration of structural priors into deep-learning frameworks for defect detection. These priors are usually derived from observed visual regularities in civil infrastructure damage, such as cracks, surface flaking, corner crumbling, and surface c.

Ding et al. [12] explored learning-based structural assessment strategies using sensor-guided representations, while Khan et al. [13] summarized machine learning approaches for structural health monitoring and emphasized the importance of incorporating domain-specific prior knowledge.

However, most existing methods either rely on indirect feature enhancement or do not explicitly define how morphology-based structural priors (e.g., edge continuity, boundary sharpness, and texture discontinuity) are embedded into detection networks. This limits their ability to enforce structural consistency in vision-based defect localization tasks.

2.4. From Detection to Segmentation: The Role of Foundation Models

While the aforementioned studies focus primarily on object-level localization, there is a growing consensus that pixel-level accuracy is indispensable for detailed defect severity grading, such as measuring crack widths or the exact volumetric loss of spalled concrete. Recent paradigm shifts in computer vision, particularly the emergence of foundation models like the Segment Anything Model (SAM), have opened new avenues for civil engineering applications. Specifically, researchers have demonstrated that bounding-box-level outputs can serve as effective visual prompts for downstream segmentation tasks. For instance, recent work on accurate concrete spalling segmentation highlights that bounding box supervision can effectively guide SAM to achieve high-fidelity masks without the need for exhaustive pixel-level training labels. This positioning of detection frameworks—not as an end-state, but as a critical prompt-generator for segmentation—better contextualizes models like GCA-YOLO within the broader structural health monitoring (SHM) ecosystem.

3. Methodology

3.1. Overall Network Architecture

The proposed building defect detection model is based on the YOLOv11 framework [14]. The overall architecture follows the standard object detection paradigm of YOLOv11, with a three-stage structure comprising a Backbone, Neck, and Head. The proposed framework incorporates a parallel dual-attention enhancement mechanism composed of the Geometric Constraint Attention (GCA) module and the Residual Efficient Channel Attention (RECA) module.

The network input resolution is fixed at $640 \times 640$, and the model is trained using multi-scale hierarchical feature representations from the P3, P4, and P5 stages. The overall architecture is illustrated in Figure 1.

In terms of network architecture design, a parallel dual-attention enhancement strategy is introduced at the high-level semantic stage of the Backbone. Specifically, after extracting high-level shared features at the P5 stage, a Geometric Constraint Attention (GCA) branch and a Residual Efficient Channel Attention (RECA) branch are constructed in parallel to process the same semantic feature maps. Similar to recent attention-enhanced YOLO-based defect detection frameworks [15], the proposed design aims to improve feature representation capability under complex engineering environments while maintaining computational efficiency.
The GCA module explicitly embeds morphology-based structural priors of civil infrastructure defects, defined as visual regularities like edge continuity, boundary sharpness, and texture discontinuity. These priors guide the network to generate structurally consistent defect representations, ensuring that predicted regions align with typical morphological characteristics of cracks, spalling, and corner damage. The design motivation is inspired by recent studies emphasizing the importance of morphology-aware and engineering-informed representations in structural defect analysis [16].
In parallel, the RECA module focuses on adaptive channel-wise feature recalibration to enhance the discriminative representation of fine-grained defects such as thin cracks and minor surface spalling. The RECA branch employs adaptive one-dimensional convolution to model local inter-channel interactions, with the convolution kernel size dynamically determined by the channel dimension. This lightweight channel interaction strategy is conceptually related to recent efficient attention-enhanced YOLO architectures [17].
The outputs of the GCA and RECA branches are subsequently fused by feature concatenation, forming a unified high-level feature representation. This explicit parallel design enables the collaborative optimization of both morphology-aware structural consistency and channel-wise semantic feature enhancement.
Backbone:
The Backbone is based on the original YOLOv11 backbone and is responsible for extracting multi-scale feature representations from levels P1–P5 [18]. The backbone mainly consists of convolutional layers, C3k2 modules, and the C2PSA structure for hierarchical semantic extraction. Similar hierarchical feature extraction strategies have been widely adopted in recent lightweight and attention-enhanced YOLO frameworks for defect detection tasks [19].
After generating high-level shared semantic features from the P5 stage, a parallel dual-branch attention structure is introduced.
Specifically, the GCA branch enhances feature representations using morphology-aware structural priors derived from image-domain gradient and Laplacian operators. The structural branch employs depthwise $3 \times 3$ Sobel and Laplacian convolution kernels initialized as weakly learnable operators to enhance edge-aware structural features. Such gradient-aware representations are particularly beneficial for detecting elongated crack patterns and irregular boundary defects commonly observed in civil infrastructure inspection scenarios [20].
Meanwhile, the RECA branch performs adaptive channel-wise recalibration using global average pooling followed by lightweight one-dimensional convolution. The adaptive convolution kernel size is defined as:

$k = \left| \frac{\log_{2}(C) + b}{\gamma} \right|_{odd},$

(1)

where $C$ denotes the channel dimension, with hyperparameters $b = 1$ and $\gamma = 2$. The nearest odd integer is selected to preserve symmetric convolution behavior. Similar adaptive channel interaction mechanisms have demonstrated effectiveness in recent attention-based object detection frameworks [21].
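As a quick illustration of Equation (1), the sketch below computes the kernel size for several channel widths. The paper only states that the nearest odd integer is selected; the rounding used here (truncate, then move even values up to the next odd one) is an assumption borrowed from the common ECA-style implementation, and the function name `adaptive_kernel_size` is hypothetical.

```python
import math


def adaptive_kernel_size(channels: int, b: int = 1, gamma: int = 2) -> int:
    """Odd 1-D convolution kernel size following Equation (1).

    Assumption: the value is truncated to an integer and, if even,
    increased by one so that the kernel stays symmetric.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


if __name__ == "__main__":
    for c in (64, 128, 256, 512, 1024):
        print(f"C = {c:4d} -> k = {adaptive_kernel_size(c)}")
```

Under this rounding rule, typical backbone widths between 128 and 1024 channels all map to a kernel of size 5, so the channel interaction range varies only mildly across stages.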
The outputs of the GCA and RECA branches are fused by feature concatenation, forming an enhanced representation that simultaneously preserves structural continuity and semantic discriminability.
Subsequently, the Spatial Pyramid Pooling Fast (SPPF) module aggregates multi-scale contextual information and enlarges the effective receptive field [22]. The SPPF module employs sequential max-pooling operations with a $5 \times 5$ pooling kernel to efficiently enhance global context representation while maintaining computational efficiency.
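The effect of sequential pooling on the receptive field can be checked numerically. The following dependency-free sketch assumes the usual SPPF layout (three stride-1 max-pooling steps with "same" padding whose outputs are concatenated with the input); the helper names `max_pool2d` and `sppf_branches` are hypothetical, and the surrounding convolutions of a real SPPF block are omitted.

```python
NEG_INF = float("-inf")


def max_pool2d(grid, k):
    """Stride-1 max pooling with 'same' padding (padding never wins the max)."""
    h, w, r = len(grid), len(grid[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            best = NEG_INF
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        best = max(best, grid[y][x])
            row.append(best)
        out.append(row)
    return out


def sppf_branches(grid, k=5):
    """Return the four maps that SPPF would concatenate along channels."""
    p1 = max_pool2d(grid, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return [grid, p1, p2, p3]


# Deterministic toy feature map (one channel, 12 x 12).
grid = [[(7 * i + 13 * j) % 17 for j in range(12)] for i in range(12)]
branches = sppf_branches(grid)

# Two chained 5x5 pools cover a 9x9 window, three cover a 13x13 window.
print(branches[2] == max_pool2d(grid, 9))
print(branches[3] == max_pool2d(grid, 13))
```

Chaining small pools therefore reproduces the 5, 9, and 13 window sizes of a parallel pyramid while reusing intermediate results, which is where the efficiency of the sequential design comes from.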
Neck:
The Neck adopts a bidirectional multi-scale feature fusion strategy based on the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) [23]. Through progressive upsampling operations, high-level semantic features are fused with intermediate- and low-level features (P4 and P3), while subsequent downsampling paths further strengthen cross-scale feature interactions. Similar multi-scale fusion strategies have been widely employed in recent YOLO-based infrastructure inspection systems to improve robustness for small-scale and low-contrast defects.
The upsampling operation uses nearest-neighbor interpolation with a scale factor of 2, while feature fusion is implemented through channel-wise concatenation operations. This bidirectional fusion mechanism enhances semantic consistency across different scales and improves robustness for structural defect detection under complex background conditions, including shadows, weathering traces, and texture interference commonly observed in post-disaster infrastructure environments [17].
Head:
The Head follows the standard YOLOv11 detection head design and receives fused feature maps from three scales, namely P3, P4, and P5 [24]. The detection head jointly performs bounding box regression and defect category prediction for multi-scale defect localization and classification.
The model adopts anchor-free detection and Distribution Focal Loss (DFL)-based bounding box regression, improving localization accuracy for irregular structural defects such as cracks, spalling, and corner breakage. Similar anchor-free regression strategies have shown strong performance for fine-grained structural defect localization tasks in recent YOLO-based studies [7,11].
During training, the detection framework employs an IoU threshold of $0.7$ for non-maximum suppression (NMS), with a maximum detection number of 300 objects per image.
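For readers unfamiliar with the post-processing step, the sketch below shows a plain greedy NMS using the two settings quoted above (IoU threshold 0.7, at most 300 detections). It is a minimal class-agnostic illustration, not the routine used by the detection framework; the function names and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.7, max_det=300):
    """Greedy NMS: keep the best box, drop boxes overlapping it above iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == max_det:
                break
    return keep


# Two nearly identical elongated "crack" boxes and one separate box.
boxes = [(0, 0, 100, 20), (2, 0, 102, 20), (0, 50, 40, 90)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

With the 0.7 threshold the second box is suppressed because it overlaps the first almost entirely, while the third box is kept.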

3.2. Geometric Constraint Attention Formulation

The Geometric Constraint Attention (GCA) module is formally defined as follows (the schematic diagram of the module is shown in Figure 2):

$GCA(X) = \mathrm{softmax}\left( \frac{\nabla_{str} F_{str}(X)}{\| \nabla_{str} F_{str}(X) \|_{2} + \epsilon} \right) \odot X$

(2)

where

$X \in \mathbb{R}^{H \times W \times C}$ denotes the intermediate input feature map with spatial dimensions $H \times W$ and $C$ channels.
$\nabla_{str}$ represents the morphology-guided structural gradient operator implemented using weakly learnable Sobel-initialized convolution kernels with kernel size $3 \times 3$.
$F_{str}(X)$ is the morphology-enhanced structural feature map constructed from first-order gradient responses and second-order Laplacian responses.
$\| \cdot \|_{2}$ denotes the $L_{2}$-norm along the spatial dimensions, serving as a magnitude normalizer to stabilize structural response amplitudes during optimization.
$\epsilon$ is a strictly positive smoothing constant empirically set to $10^{-7}$ for numerical stability and prevention of zero-division in homogeneous regions.
$\odot$ denotes the Hadamard product (element-wise multiplication) used for spatial feature recalibration.

Figure 2. GCA Module Schematic Diagram.

Structural Interpretation:

The operator $\nabla_{str}$ is specifically engineered to extract structural discontinuities and boundary transitions commonly observed in building defects, such as cracks, spalling edges, and corner damage. Crucially, it represents the intensity of image-domain structural variations rather than explicit physical mechanics fields.

Softmax Normalization & Attention Formulation:

By dividing the gradient by its $L_{2}$-norm, we constrain the structural response into a unit-magnitude space. The subsequent softmax function then normalizes the structural response field into a probabilistic attention distribution strictly bounded within $[0, 1]$. This formulation ensures mathematical compatibility with standard attention gating mechanisms while heavily emphasizing regions with dominant structural transitions.
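The normalization and softmax steps of Equation (2) can be traced on a toy example. The sketch below makes two assumptions that the text does not spell out: the softmax is taken over the spatial positions of a single channel, and one $L_{2}$-norm is computed over the whole map. The function name `gca_attention` and the toy inputs are hypothetical.

```python
import math


def gca_attention(x, grad, eps=1e-7):
    """Apply Equation (2) to one channel.

    x, grad: 2-D lists of equal size (feature map and structural response).
    Returns (attention weights, recalibrated features).
    """
    flat = [g for row in grad for g in row]
    norm = math.sqrt(sum(g * g for g in flat))       # L2-norm over positions
    scaled = [g / (norm + eps) for g in flat]        # every |value| <= 1
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]         # shifted for stability
    total = sum(exps)
    weights = [e / total for e in exps]              # softmax over positions
    w = len(x[0])
    attn = [weights[i * w:(i + 1) * w] for i in range(len(x))]
    out = [[a * v for a, v in zip(a_row, x_row)] for a_row, x_row in zip(attn, x)]
    return attn, out


x = [[1.0] * 3 for _ in range(3)]
grad = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
attn, out = gca_attention(x, grad)
print(attn[1][1], attn[0][0])
```

The weights sum to one and peak where the structural response is strongest. Because each normalized response lies in $[-1, 1]$, two weights within the same map can differ by at most a factor of $e^{2}$ under this reading. In homogeneous regions the constant $\epsilon$ keeps the division defined and the attention becomes uniform.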

In practical implementation, the structural response is further refined through lightweight channel attention and spatial attention branches. The channel attention branch adopts adaptive one-dimensional convolution with kernel size

$k = \left| \frac{\log_{2}(C) + b}{\gamma_{c}} \right|_{odd},$

(3)

where $b = 1$ and $\gamma_{c} = 2$. The nearest odd integer is selected to preserve symmetric convolution behavior. The spatial attention branch employs a $7 \times 7$ convolution kernel to enlarge the receptive field and improve long-range structural continuity perception.

Geometry-Enhanced Feature Map:

$F_{str}(X)$ encodes morphology-consistent structural responses derived from image-domain gradient operators, which significantly improves the model’s representational robustness under environmental noise and illumination variations.

Element-Wise Multiplication:

The operator ⊙ applies the computed structural attention weights directly to the original feature maps. This acts as a spatial filter, enhancing the activations of defect-relevant regions while systematically suppressing irrelevant background textures.

3.2.1. Geometry-to-Image Mapping

Coordinate Mapping:
We establish a rigorous correspondence between the discrete image coordinates $(x, y)$ and the continuous morphology-consistent defect regions. This spatial alignment guarantees that the learned feature responses are strictly congruent with geometric structural priors, specifically spatial continuity and boundary concentration.
Field Interpolation:

$F_{str} (x, y) = J (S (x, y))$

(4)

where $J$ represents a bilinear interpolation operator that maps discrete grid values to a continuous sub-pixel representation, and $S (x, y)$ denotes the raw structural response field derived from first-order image gradients and local texture discontinuities.
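The interpolation operator $J$ in Equation (4) can be illustrated with a standard bilinear sampler. This is a generic sketch of bilinear interpolation on a small grid, not the authors' implementation; the function name `bilinear` and the clamping at the image border are assumptions.

```python
import math


def bilinear(S, x, y):
    """Sample grid S (rows indexed by y, columns by x) at real-valued (x, y)."""
    h, w = len(S), len(S[0])
    # Clamp the query point to the valid image area (border handling assumed).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * S[y0][x0] + tx * S[y0][x1]
    bottom = (1 - tx) * S[y1][x0] + tx * S[y1][x1]
    return (1 - ty) * top + ty * bottom


S = [[0.0, 10.0], [20.0, 30.0]]
print(bilinear(S, 0.5, 0.5))  # value halfway between the four grid points
```

At integer coordinates the sampler returns the grid values unchanged, and between them it blends the four neighbors linearly, which is what gives the structural response field a continuous sub-pixel representation.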
Structural Feature Construction:

$F_{morph} = \mathrm{Softplus}(G_{x}^{2} + G_{y}^{2}) + \mathrm{Softplus}(| \mathrm{Laplace}(X) |)$

(5)

This formulation constructs structural features by fusing first-order (squared gradient magnitude) and second-order (Laplacian) differential information. The $\mathrm{Softplus}(x) = \ln(1 + e^{x})$ activation function is intentionally selected over standard ReLU; its smooth, differentiable nature preserves subtle structural gradient responses without introducing abrupt truncation. This design substantially enhances edge continuity and local structural saliency, which are key morphological priors for cracks, spalling regions, and surface degradation.
The horizontal and vertical structural gradients are generated by Sobel-initialized depthwise convolution kernels:

$G_{x} = K_{x} * X, G_{y} = K_{y} * X,$

(6)

where $K_{x}$ and $K_{y}$ denote $3 \times 3$ Sobel operators and $*$ denotes convolution. The second-order structural response, which corresponds to the Laplacian term in Equation (5), is extracted using a Laplacian kernel:

$L (X) = K_{l} * X .$

(7)

The Sobel and Laplacian kernels are initialized as:

$K_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad K_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$

(8)

$K_{l} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$

(9)

Unlike fixed edge detectors, these morphology kernels remain trainable during optimization, enabling adaptive structural representation learning for different defect categories.
Role of Gradient Operators:
Sobel and Laplacian operators are utilized as specialized image-domain structural feature extractors. They systematically highlight high-frequency spatial transitions strictly associated with defect boundaries rather than representing explicit physical mechanics fields.
As shown in Figure 3, the proposed morphology-oriented feature enhancement strategy is designed to improve the representation of crack and defect characteristics by integrating structural shape cues into the feature extraction process.
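Equations (5)–(9) can be traced on a small synthetic image. The sketch below applies the kernels at their initial (untrained) values to a vertical step edge, using plain Python so that it runs without dependencies. It assumes the cross-correlation convention of deep-learning "convolution" and no padding; the helper names are hypothetical, and a real model would use depthwise convolution layers with trainable weights.

```python
import math

K_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]     # Sobel, horizontal gradient
K_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]     # Sobel, vertical gradient
K_L = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]       # Laplacian


def correlate3x3(img, k):
    """Valid 3x3 cross-correlation of a 2-D list with kernel k."""
    h, w = len(img), len(img[0])
    return [[sum(k[a][b] * img[i + a][j + b] for a in range(3) for b in range(3))
             for j in range(w - 2)] for i in range(h - 2)]


def softplus(v):
    """Numerically stable ln(1 + e^v)."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def morph_features(img):
    """Equation (5): Softplus(Gx^2 + Gy^2) + Softplus(|Laplacian|)."""
    gx, gy, lap = (correlate3x3(img, k) for k in (K_X, K_Y, K_L))
    return [[softplus(gx[i][j] ** 2 + gy[i][j] ** 2) + softplus(abs(lap[i][j]))
             for j in range(len(gx[0]))] for i in range(len(gx))]


# 5 x 6 image with a vertical step edge between columns 2 and 3.
img = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
f = morph_features(img)
print([round(v, 3) for v in f[1]])
```

The response is large in the two columns adjacent to the edge and falls back to a constant baseline of $2 \ln 2$ in flat regions, since $\mathrm{Softplus}(0) = \ln 2$ for each of the two terms. The baseline is a property of Softplus itself, so flat areas are damped rather than zeroed.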

3.2.2. Theoretical Foundation of the GCA Module

(1) Morphology-Based Structural Prior Foundation:

The GCA module is grounded in morphological structural priors of civil infrastructure defects. These priors are statistical regularities derived from observed defect patterns such as cracks (continuous and elongated structures), spalling (irregular boundary regions), and corner damage (abrupt discontinuities). These are data-driven visual priors rather than geometry constraints.

(2) Structural Continuity Assumption:

Building defects exhibit local continuity in image space. The GCA module explicitly enhances edge connectivity, directional consistency, and boundary completeness, improving robustness under weak contrast and noisy environments.

(3) Redefined GCA Components:

Softplus activation for stable feature enhancement
Learnable morphology kernels initialized from Sobel/Laplacian operators
Structural confidence gating (phys_gate) for adaptive suppression of background interference

The confidence gate is implemented using global average pooling followed by two $1 \times 1$ convolution layers and sigmoid normalization. The intermediate channel reduction ratio is set to $1/4$ of the input channel dimension.
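The gate described above can be sketched as follows. After global average pooling each channel is a single number, so a $1 \times 1$ convolution reduces to a matrix-vector product. The ReLU between the two layers and the absence of bias terms are assumptions, since the text does not specify them, and the function name `confidence_gate` and the random weights are purely illustrative.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def confidence_gate(feat, w1, w2):
    """Per-channel gate in (0, 1).

    feat: list of C channels, each a 2-D list.
    w1:   (C // 4) x C weights of the first 1x1 convolution.
    w2:   C x (C // 4) weights of the second 1x1 convolution.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # First 1x1 convolution with ReLU (activation is an assumption).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second 1x1 convolution followed by sigmoid normalization.
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]


rng = random.Random(0)
C, H, W = 8, 4, 4
feat = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // 4)]
w2 = [[rng.uniform(-1, 1) for _ in range(C // 4)] for _ in range(C)]
gate = confidence_gate(feat, w1, w2)
print([round(g, 3) for g in gate])
```

Each channel receives one scalar in $(0, 1)$ that scales how much of the structural feature is passed on, which is how the gate can suppress the structural branch when the pooled evidence is weak.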

(4) Role of $\lambda_{str}$:

The coefficient $\lambda_{str}$ controls the strength of the morphology-guided structural priors. A small value weakens structural enhancement, while a large value strengthens the edge consistency constraints. In all experiments, $\lambda_{str} = 1$.

Final Output Formulation

The final recalibrated representation integrates multiple attention dimensions through a residual-like structure:

$y = x \, (1 + \alpha \cdot \mathrm{channel\_attn}) \, (1 + \beta \cdot \mathrm{spatial\_attn}) + \lambda_{str} \, \gamma \, \mathrm{str\_gate} \, \mathrm{str\_features}$

(10)

The addition of ‘1’ in the multiplicative terms guarantees that the original feature identity is preserved (identity mapping), while the gated parameters dynamically modulate the intensity of the incorporated structural priors.

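The learnable scaling coefficients are initialized as $\alpha = 0.1$, $\beta = 0.1$, and $\gamma = 0.2$, respectively, allowing adaptive modulation of channel attention, spatial attention, and morphology-guided structural enhancement during training. Here $\gamma$ is the structural scaling coefficient of Equation (10) and is distinct from the kernel-size hyperparameter $\gamma = 2$ in Equation (1).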
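To make the residual-like structure of Equation (10) concrete, the sketch below evaluates it element-wise on flat lists with the initial coefficient values. It assumes all products are element-wise after broadcasting (in a real model the channel attention would have one value per channel and the spatial attention one value per position); the function name `gca_output` is hypothetical.

```python
def gca_output(x, channel_attn, spatial_attn, str_gate, str_features,
               alpha=0.1, beta=0.1, gamma=0.2, lambda_str=1.0):
    """Equation (10), applied element-wise to equally long lists."""
    return [
        xi * (1 + alpha * c) * (1 + beta * s) + lambda_str * gamma * g * f
        for xi, c, s, g, f in zip(x, channel_attn, spatial_attn,
                                  str_gate, str_features)
    ]


x = [2.0, -1.0, 0.5]
zeros = [0.0, 0.0, 0.0]

# With all attention maps and the gate at zero, the input passes unchanged.
print(gca_output(x, zeros, zeros, zeros, [4.0, 4.0, 4.0]))

# With active attention, features are rescaled and a gated structural term is added.
print(gca_output([2.0], [1.0], [1.0], [0.5], [4.0]))
```

The first call shows the identity mapping guaranteed by the "1 +" terms. In the second call, the multiplicative part gives $2 \times 1.1 \times 1.1 = 2.42$ and the structural part adds $1 \times 0.2 \times 0.5 \times 4 = 0.4$, so the small initial coefficients keep the module close to an identity at the start of training.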

Optimization Objective Overview

$L_{total} = L_{detection} + \lambda_{str} \cdot L_{edge}$

(11)

where $L_{edge}$ functions as a regularization term that explicitly enforces structural continuity consistency for predicted defect boundaries. The edge regularization term is implemented through morphology-consistent structural enhancement rather than explicit physical partial differential constraints.

3.2.3. Visualization and Analysis of Attention Heatmaps

As shown in Figure 4, significant differences in attention distribution are observed between the baseline and improved models. To further investigate the internal decision-making mechanism of the proposed framework, Grad-CAM-based heatmaps visualize the spatial response patterns of intermediate feature representations under different model configurations. This enables a qualitative assessment of how the network allocates attention to potential defect regions during inference, offering an interpretable perspective for model comparison. Similar visualization-based interpretability analyses have recently been adopted in attention-enhanced infrastructure inspection frameworks to evaluate structural feature perception and defect localization behavior [25].

The baseline YOLO11 model generates scattered and less structured activation regions. These activations are often weakly correlated with actual defect locations and are influenced by complex background textures such as concrete surface roughness, illumination variations, shadows, and other irrelevant structural patterns common in civil engineering environments. This dispersed attention distribution indicates that the baseline model relies heavily on global appearance cues rather than localized structural semantics, resulting in limited discriminative capability when dealing with subtle or low-contrast damage patterns. Similar instability under complex environmental interference has also been reported in recent defect detection studies [26].

In contrast, the proposed GCA-enhanced model demonstrates substantially improved attention localization. The generated heatmaps exhibit significantly more concentrated, continuous, and structurally consistent activation patterns. More importantly, the attention is not only confined to the crack or damaged region but also forms a distinct boundary-aware distribution. Specifically, high-response activations are densely concentrated along the edges surrounding the defect regions, forming a ring-like or contour-following structure. This indicates that the model effectively captures geometric discontinuities and local boundary transitions, which are critical cues for accurately characterizing structural damage in civil infrastructure. Similar morphology-aware structural characteristics have been emphasized as important defect descriptors in recent structural health monitoring and crack analysis studies [13,19].

From a mechanistic perspective, this improvement is attributed to the introduced Geometric Constraint Attention (GCA) module, which embeds morphology-aware structural priors into the feature learning process. By explicitly enhancing the modeling of spatial relationships and geometric consistency, the network is guided to shift its focus from global texture interference to local structural variations that are more relevant to defect characterization. As a result, the learned representations become more sensitive to boundary-level features, enabling more precise localization of damage regions while suppressing irrelevant background activations. This design philosophy is also consistent with recent efforts on engineering-informed feature enhancement and geometry-aware defect perception [27].

Furthermore, this boundary-enhanced attention suggests that the proposed method does not merely improve classification confidence, but also significantly enhances spatial awareness of structural anomalies. This property is particularly important in real-world civil engineering inspection scenarios, where defects often appear with irregular shapes, weak contrast, and are easily confused with background noise. The model’s ability to consistently highlight defect boundaries demonstrates improved robustness and generalization capability under complex environmental conditions, including post-disaster and outdoor infrastructure inspection scenarios [17].

Overall, these visual results indicate that the proposed GCA module enhances the distinguishability and spatial correlation of features. The transition from diffuse activations in the baseline model to the boundary-centered, structurally aligned attention pattern in the improved model supports the effectiveness of integrating morphology-aware geometric priors into deep learning-based defect detection. This not only improves detection accuracy but also provides better interpretability, which is crucial for practical applications in safety-critical infrastructure monitoring.

3.3. Residual Efficient Channel Attention (RECA)

The conventional Efficient Channel Attention (ECA) mechanism [28] is widely recognized for its efficiency, but it can suffer from limitations such as insufficient sensitivity to fine-grained features and excessive suppression of informative channels in deep network architectures. Specifically, in deep object detection frameworks, pure channel attention can inadvertently lead to unstable feature propagation or over-suppression of subtle defect features during long-term training [29]. To address these optimization bottlenecks without significant computational overhead, this paper proposes a Residual Efficient Channel Attention (RECA) module [30]. The overall architecture of the proposed RECA module is illustrated in Figure 5.

The proposed RECA module can be formulated as:

RECA (X) = X ⊙ (1 + α \cdot A_{c} (X)),

(12)

where $X \in R^{H \times W \times C}$ denotes the input feature map, $A_{c} (X)$ represents the channel attention response, $α$ denotes the learnable residual scaling parameter, and ⊙ represents element-wise multiplication.

The detailed processing steps are described as follows.

Channel Global Average Pooling:
Global average pooling is applied to compress the input feature maps along the spatial dimension:

$z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} X_{c} (i, j),$

(13)

where $z_{c}$ denotes the channel descriptor of the c-th channel. This operation captures global semantic information while preserving inter-channel correlations relevant to structural defect characteristics [31].
Channel Local Correlation Modeling:
A lightweight one-dimensional convolution is employed to model local dependencies among adjacent channels. To dynamically capture interactions across varying network depths, the convolution kernel size k is adaptively determined according to:

$k = int (|\frac{{log}_{2} (c) + b}{γ}|),$

(14)

where c denotes the number of input channels, with hyperparameters $b = 1$ and $γ = 2$ . To maintain symmetric convolution behavior, the nearest odd integer is selected:

$k = \begin{cases} t, & \text{if } t \text{ is odd}, \\ t + 1, & \text{otherwise}, \end{cases}$

(15)

where

$t = |\frac{{log}_{2} (c) + b}{γ}| .$

(16)

The logarithmic mapping ${log}_{2} (c)$ ensures that the kernel size scales sub-linearly with the channel dimension, thereby maintaining lightweight computational complexity while effectively modeling local inter-channel interactions.
In implementation, the one-dimensional convolution uses padding size $k // 2$ without bias parameters, ensuring channel dimension preservation and computational efficiency.
Attention Weight Generation:
The channel attention response is generated using sigmoid normalization:

$A_{c} (X) = σ (Conv 1 D (z)),$

(17)

where $σ (\cdot)$ denotes the Sigmoid activation function. The resulting attention weights are normalized into the range $[0, 1]$ , enabling adaptive channel-wise feature recalibration [32].
Residual Dynamic Weighting (Core Optimization):
In deep networks, pure channel attention mechanisms may lead to optimization difficulties and excessive suppression of informative features. To mitigate this issue, a learnable residual scaling parameter $α$ is introduced [26].
Unlike the static residual formulation in conventional ECA [24], the proposed RECA module introduces a learnable dynamic residual weighting mechanism formulated as

$Y = X ⊙ (1 + α \cdot A_{c} (X)) .$

(18)

This residual fusion strategy dynamically balances the original feature flow and the attention-enhanced feature flow, thereby strengthening the responses of critical channels (e.g., crack boundaries and defect contours) while preventing the loss of non-critical but potentially useful contextual information.
The learnable residual scaling parameter is initialized as

$α = 0.1 .$

(19)

A relatively small initialization value is intentionally adopted to stabilize early-stage optimization and prevent excessive amplification of attention responses during initial training iterations.

As demonstrated in Table 1, RECA introduces only one additional learnable scalar parameter corresponding to the residual scaling coefficient $α$. Therefore, the computational overhead introduced by the proposed module remains extremely limited while substantially improving feature propagation stability and adaptive channel recalibration capability in deep defect detection networks.

In practical implementation, the RECA module employs adaptive average pooling, one-dimensional convolution, and sigmoid normalization without introducing fully connected layers, thereby preserving the lightweight characteristics of the original ECA design while improving optimization robustness for complex structural defect scenarios.
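The four RECA steps (Equations (13), (17), and (18), with the one-dimensional convolution in between) can be traced on a toy feature map. The sketch below is framework-free and only illustrates the arithmetic; the function name `reca_forward`, the nested-list layout `x[c][i][j]`, and the fixed kernel weights are assumptions made for illustration, whereas in a trained model the kernel weights and $α$ would be learned.

```python
import math


def reca_forward(x, kernel, alpha=0.1):
    """Plain-Python RECA pass for x[c][i][j]; kernel is a 1D list of odd length."""
    num_channels = len(x)
    # Eq. (13): global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # 1D cross-correlation across channels, zero padding k // 2, no bias.
    pad = len(kernel) // 2
    padded = [0.0] * pad + z + [0.0] * pad
    conv = [sum(w * padded[c + t] for t, w in enumerate(kernel))
            for c in range(num_channels)]
    # Eq. (17): the sigmoid maps each response into (0, 1).
    attn = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    # Eq. (18): residual scaling keeps the identity path.
    y = [[[v * (1.0 + alpha * a) for v in row] for row in ch]
         for ch, a in zip(x, attn)]
    return y, attn


# Three 2x2 channels with means 1.0, 0.0, and 2.0 (hypothetical values).
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
# An identity kernel makes the convolution output equal the pooled descriptors.
y, attn = reca_forward(x, kernel=[0.0, 1.0, 0.0])
```

Because the sigmoid output lies strictly between 0 and 1, each channel is multiplied by a factor between 1 and 1 + α when α is positive, so no channel is scaled below its original magnitude.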

3.4. Dual Attention Integration Strategy

To exploit the complementary strengths of the Geometric Constraint Attention (GCA) and Residual Efficient Channel Attention (RECA) modules, this paper adopts a collaborative dual-attention enhancement strategy based on a parallel architecture [3]. Instead of serially stacking attention modules, GCA and RECA are applied in parallel to the same high-level semantic feature representations extracted by the Backbone at the P5 stage.

Specifically, the RECA branch focuses on adaptive channel-wise feature recalibration to enhance the discriminability of fine-grained structural defects, while the GCA branch embeds morphology-aware structural priors derived from image-domain structural regularities, such as edge continuity, boundary sharpness, and texture discontinuity, to enhance structural consistency in defect representation [18].

In the RECA branch, channel attention weights are generated using adaptive global average pooling and one-dimensional convolution. The adaptive convolution kernel size is dynamically determined according to the input channel dimension:

k = {|\frac{{log}_{2} (C) + b}{γ}|}_{odd},

(20)

where $b = 1$ and $γ = 2$. The residual scaling parameter is initialized as $α = 0.1$ to stabilize feature propagation during early-stage optimization.
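The adaptive kernel-size rule of Equation (20) can be evaluated for a few channel counts. The following is a small sketch assuming the truncate-then-round-up-to-odd reading given in Equations (14) to (16); the function name `eca_kernel_size` and the chosen channel counts are illustrative.

```python
import math


def eca_kernel_size(channels, b=1, gamma=2):
    """Adaptive 1D kernel size: truncate, then move up to the next odd integer."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


# Kernel sizes for several power-of-two channel counts.
sizes = {c: eca_kernel_size(c) for c in (64, 128, 256, 512, 1024)}
# Matching padding that preserves the channel dimension.
paddings = {c: k // 2 for c, k in sizes.items()}
```

Under this reading, 64 channels give a kernel size of 3, and 128 to 1024 channels all give a kernel size of 5, which reflects the sub-linear growth of the kernel with channel width.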

In the GCA branch, structural enhancement is generated through weakly learnable Sobel and Laplacian operators implemented using depthwise $3 \times 3$ convolutions. The morphology-guided structural enhancement strength is controlled by the structural weighting parameter $λ_{str} = 1$, while the learnable scaling coefficients for channel attention, spatial attention, and structural enhancement are initialized as $α = 0.1$, $β = 0.1$, and $γ = 0.2$, respectively.
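To illustrate what the gradient and Laplacian operators respond to, the sketch below applies fixed $3 \times 3$ kernels to a toy single-channel patch. The standard Sobel-x and 4-neighbor Laplacian kernels are assumed here as plausible initial values; the paper does not list the exact kernels, and in the GCA branch they are weakly learnable and applied per channel. The helper `filter3x3` is hypothetical.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def filter3x3(image, kernel):
    """Valid (no padding) 3x3 cross-correlation on a 2D list."""
    h, w = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)]
            for i in range(h - 2)]


# Toy patch: dark left half, bright right half (a vertical edge).
patch = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
grad_x = filter3x3(patch, SOBEL_X)    # first-order response
lap = filter3x3(patch, LAPLACIAN)     # second-order response
```

Both responses vanish in the flat regions and are non-zero only around the intensity step; the Laplacian changes sign across the edge, which is the kind of boundary cue the structural branch is described as emphasizing.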

The outputs of the two attention branches are then explicitly fused through feature concatenation, forming an enhanced high-level representation that jointly preserves semantic saliency and morphology-consistent structural information.

After parallel refinement and feature fusion, the enhanced P5 features are further processed by the SPPF module to aggregate multi-scale contextual information and expand the effective receptive field before being forwarded to the Neck for subsequent multi-scale feature fusion.

Through the upsampling and concatenation operations in the Neck, the fused high-level representations—already enriched with channel-aware semantics and morphology-consistent structural priors—are propagated to the P4 and P3 levels, enabling effective cross-scale transmission of optimized feature information [27].

Compared with approaches that repeatedly stack attention mechanisms across multiple feature scales [33], the proposed parallel dual-attention design introduces attention enhancement only at the high-level semantic stage. This design avoids excessive interference with low- and mid-level features (e.g., P3 and P4), reduces the risk of high-frequency noise amplification, and achieves a favorable trade-off between computational efficiency and detection performance.

Moreover, the explicit parallel fusion mechanism ensures stable and consistent synergy between channel-wise feature enhancement and morphology-based structural constraint modeling [32]. Since both branches operate on the same high-level semantic representation, the proposed architecture maintains computational efficiency while effectively improving structural defect representation capability under challenging environmental conditions.

3.5. Loss Function

The total loss function integrates conventional detection objectives with geometry-aware structural regularization, enabling the network to produce predictions that are not only accurate in localization and classification, but also consistent with prior knowledge of crack morphology and structural constraints [34]. Unlike purely data-driven optimization, the proposed framework introduces an additional geometry-informed constraint directly into the bounding-box regression branch. In addition, a lightweight crack-specific width regularization term is embedded into the IoU optimization path, consistent with the implementation in the training loss code.

3.5.1. Detection Loss

The baseline detection loss follows the standard YOLO optimization formulation, consisting of classification loss, localization loss (IoU-based regression loss), and distribution focal loss (DFL). This multi-task objective ensures accurate defect category prediction and precise bounding-box regression.

L_{detection} = λ_{box} L_{box} + λ_{cls} L_{cls} + λ_{dfl} L_{dfl}

(21)

where

$L_{box} = 1 - IoU$ is the IoU-based localization loss (CIoU in implementation).
$L_{cls}$ is the binary cross-entropy classification loss.
$L_{dfl}$ is the Distribution Focal Loss (DFL) for bounding box regression refinement.
$λ_{box}, λ_{cls}, λ_{dfl}$ are task weighting coefficients inherited from YOLOv11 training hyperparameters.

3.5.2. Geometry-Constrained Structural Loss

To enhance sensitivity to crack morphology characteristics, we introduce a geometry-constrained structural regularization term. This constraint is directly integrated into the IoU regression branch in the loss implementation.

In this work, the incorporated prior knowledge does not refer to explicit physical laws or hand-crafted engineering rules. Instead, it refers to morphology-aware structural priors drawn from visual regularities frequently observed in civil infrastructure. These priors include edge continuity, elongated geometry, directional consistency, boundary sharpness, and local texture discontinuity, all of which are common in crack propagation and other structural damage.

Among these morphology-aware priors, cracks typically exhibit thin and elongated spatial distributions with strong directional continuity. Based on this observation, we introduce a width-aware geometric constraint applied to both predicted and ground-truth bounding boxes to improve sensitivity to slender defect structures:

L_{geo} = R_{width} (B_{pred}, B_{gt})

(22)

where

R_{width} = \frac{1}{N} \sum_{i = 1}^{N} [\max (0, τ - w_{i}^{pred}) + \max (0, τ - w_{i}^{gt})]

(23)

Here:

$w_{i}^{pred}$ and $w_{i}^{gt}$ denote the widths of predicted and ground-truth bounding boxes.
$τ$ is the crack-width threshold controlling structural sparsity (default set to 0.25 in implementation).
N is the number of positive samples in the current batch.

This formulation introduces a geometry-aware inductive bias that suppresses physically implausible thick bounding boxes and enhances sensitivity to fine and narrow crack structures. Importantly, this term is implemented inside the IoU loss computation path, ensuring full end-to-end differentiability.
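Equation (23) can be evaluated directly on a small batch. The sketch below assumes box widths normalized to the range [0, 1], which is an assumption suggested by the default $τ = 0.25$ and not stated in the text; the function name `width_regularizer` and the sample widths are hypothetical.

```python
def width_regularizer(pred_widths, gt_widths, tau=0.25):
    """Eq. (23): mean hinge penalty on widths that fall below the threshold tau."""
    if not pred_widths:
        return 0.0  # no positive samples in the batch
    total = sum(max(0.0, tau - wp) + max(0.0, tau - wg)
                for wp, wg in zip(pred_widths, gt_widths))
    return total / len(pred_widths)


# Two positive samples: the first is narrower than tau, the second is wider.
r_width = width_regularizer([0.10, 0.40], [0.20, 0.50])
```

As written, each hinge term is non-zero only when the corresponding width is below $τ$, and the ground-truth term does not depend on the prediction, so only the predicted-width term contributes a gradient during training.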

3.5.3. Integration with RECA and GCA Features

The proposed loss function is calculated using features enhanced by a dual attention mechanism, specifically Geometric Constraint Attention (GCA) and Residual Efficient Channel Attention (RECA). RECA incorporates a learnable residual scaling parameter ($α$) that adaptively balances the original feature flow with the attention-enhanced flow. This contributes to improved gradient stability during IoU optimization, which in turn promotes the convergence of the regression loss.

3.5.4. Total Loss Function

The final optimization objective is defined as:

L_{total} = L_{detection} + λ_{str} L_{geo}

(24)

where $λ_{str}$ is a balancing coefficient controlling the strength of geometric constraints during training. In practice, a warm-up strategy is applied, where $λ_{str}$ is initialized as 0.1 and gradually increased to 1.0 to avoid over-regularization in early training stages.
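The combination of Equations (21) and (24) with the warm-up of $λ_{str}$ can be sketched on scalar loss values. The shape and length of the warm-up are not specified in the text, so a linear ramp over a hypothetical 10 epochs is assumed. The task weights 7.5, 0.5, and 1.5 are the Ultralytics default gains for the box, classification, and DFL terms and are assumed here; the values actually used may differ.

```python
def lambda_str_schedule(epoch, warmup_epochs=10, start=0.1, end=1.0):
    """Linear ramp from start to end over warmup_epochs, then constant (assumed shape)."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs


def total_loss(l_box, l_cls, l_dfl, l_geo, epoch,
               w_box=7.5, w_cls=0.5, w_dfl=1.5):
    """Eq. (21) for the detection loss plus the weighted geometric term of Eq. (24)."""
    l_detection = w_box * l_box + w_cls * l_cls + w_dfl * l_dfl
    return l_detection + lambda_str_schedule(epoch) * l_geo


# Same component losses at the start of training and after warm-up.
early = total_loss(0.5, 1.0, 0.8, 0.1, epoch=0)
late = total_loss(0.5, 1.0, 0.8, 0.1, epoch=50)
```

With these illustrative numbers the geometric term contributes 0.01 at epoch 0 and 0.10 once the ramp has finished, so the detection terms dominate the early updates.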

3.5.5. Consistency with Implementation

To ensure consistency with the implementation, the geometry-constrained loss is added directly to the IoU loss term inside the bounding-box regression branch, without modifying the standard YOLO loss structure. This guarantees compatibility with pretrained weights while introducing minimal computational overhead.

3.5.6. Clarification on Physical Interpretation

No physics-based simulation, partial differential equation (PDE), or mechanical modeling is used in this work. The term “geometry-constrained” strictly refers to shape priors derived from crack morphology statistics (thin, elongated patterns), rather than explicit physical law modeling.

3.5.7. Structural Prior Embedding Strategy

The structural prior is embedded through (1) attention modules (GCA and RECA) in the feature space and (2) width-aware constraints in the loss space. This dual-level design ensures that both representation learning and optimization objectives are aligned with domain-specific crack characteristics, while maintaining computational efficiency.

3.6. Use of Generative AI Tools

Generative artificial intelligence (GenAI) tools were used in a limited manner during manuscript preparation. Specifically, Google Gemini was utilized for partial English language refinement, readability improvement, and preliminary conceptual illustration assistance corresponding to Figure 2 and Figure 3.

The generated outputs were carefully reviewed, modified, and finalized by the authors before inclusion in the manuscript. The GenAI tool was not used for experimental design, dataset construction, model development, data analysis, result interpretation, or scientific conclusion generation.

All technical content, experimental results, and scientific interpretations presented in this study were independently completed and verified by the authors.

4. Experiments

4.1. Implementation Details

4.1.1. Hardware Configuration

Training Platform: NVIDIA GeForce RTX 4090 (24 GB GDDR6X, CUDA 12.2)
CPU: 12th Generation Intel(R) Core(TM) i5-12600K (10 cores, 16 threads, maximum clock frequency 4.90 GHz)

4.1.2. Software Environment

Operating System: Ubuntu 22.04.3 LTS
Deep Learning Framework: PyTorch 2.0.0 + CUDA 11.8
Inference Acceleration: TensorRT 8.6.1
Libraries: OpenCV 4.12.0, Albumentations 1.3.1

4.1.3. Training Hyperparameters

The proposed model was implemented based on the Ultralytics YOLO framework and trained using the hyperparameter settings summarized in Table 2.

During training, pretrained weights were adopted to improve convergence stability. Automatic Mixed Precision (AMP) training was enabled to reduce GPU memory consumption and accelerate computation. Mosaic augmentation was disabled during the final 10 epochs to stabilize convergence.

For the public crack dataset experiments in Section 4.4.1, the model was trained for 1200 epochs to ensure sufficient convergence under complex crack morphology and background variations.

For the self-constructed spalling and corner-damage datasets in Section 4.4.2 and Section 4.4.3, the training process was conducted for 500 epochs due to the relatively smaller dataset scale and faster convergence behavior.

4.2. Dataset and Evaluation Metrics

To verify the effectiveness, robustness, and engineering applicability of the proposed method, experiments were conducted on one public crack dataset and two self-built structural defect datasets, specifically, a self-built peeling/detaching dataset and a self-built crack dataset. The datasets were specifically designed to reflect realistic outdoor engineering inspection scenarios rather than laboratory-controlled environments. Therefore, the collected images exhibit substantial environmental interference, including uneven illumination, shadow occlusion, surface contamination, weathering traces, motion blur, vegetation interference, and complex texture backgrounds.

The datasets were collected from various real engineering structures including bridges, tunnels, pavements, retaining walls, roadside infrastructures, concrete exterior walls, and building facades. Data acquisition was performed using high-resolution digital cameras during field investigations under different environmental and weather conditions, including sunny, cloudy, rainy, dusk, and low-illumination conditions. The shooting distance varied from approximately 0.5 m to 5 m, and both frontal and oblique viewpoints were included to improve viewpoint diversity and practical applicability.

Furthermore, the defect shapes exhibit significant differences in terms of size, continuity, direction, boundary clarity, and texture features. The dataset contains large irregular defects and slender cracks, enabling a comprehensive assessment of different defect sizes and structural complexity [27].

Regarding dataset annotation, all self-built datasets were manually labeled using the LabelImg tool. A polygon-guided annotation strategy was adopted to accurately delineate irregular and complex defect boundaries, particularly for crack-like defects and surface degradation in real engineering environments. Crack defects were annotated to cover continuous crack structures as completely as possible, while surface defects such as peeling and spalling were annotated to fully enclose the damaged regions.

For the public crack dataset, we directly adopted the annotations provided by the original dataset without modification. For the self-built crack and peeling datasets, all images were annotated by the authors of this study. To ensure annotation quality and consistency, all labeling results were further cross-checked and verified by the corresponding author, as an additional layer of quality control [31].

In images containing multiple defect instances, each instance was annotated independently to preserve instance-level discrimination and avoid ambiguity. Moreover, different defect categories were treated as separate classes during annotation to ensure clear semantic separation between structural damage types [26].

Overall, the constructed datasets include multiple structural defect types such as concrete spalling, corner damage, surface peeling, irregular boundary defects, and cracks, enabling a comprehensive evaluation under diverse real-world conditions. In addition, all datasets were trained independently rather than being merged into a unified multi-class dataset, to fairly evaluate the robustness and generalization ability of the proposed method across different structural defect scenarios [35].

Generally, the dataset was divided into training, validation, and test sets in a 6:3:2 ratio. Data augmentation strategies including horizontal flipping, random rotation, brightness adjustment, contrast enhancement, Gaussian blur, and random scaling were adopted to improve the model’s generalization ability and robustness in complex environmental conditions.

The data and code underlying this study are available in the Supplementary Materials, as described in the Data Availability Statement.

Introduction to the Experimental Dataset

Public Crack Dataset:
This publicly available crack dataset was selected from the Crack Detection dataset provided through Roboflow Universe (https://universe.roboflow.com/ho-chi-minh-university-of-technology-y9zrl/crack-detection-3 (accessed on 10 May 2026)) and is widely used in structural crack detection research. The original dataset contains approximately 10,000 images covering diverse crack types collected from pavements, bridge decks, concrete walls, and building facades under varying illumination and background conditions.
To ensure data quality and representativeness, we first performed a strict filtering process including duplicate removal, low-quality image exclusion, and resolution consistency checking. After this refinement, a final subset of 433 images was selected for training and evaluation [30]. This subset preserves the original diversity of crack morphology while improving annotation reliability and computational efficiency.
Self-built Peeling/Detaching Dataset:
This dataset was collected through field surveys across multiple urban infrastructure scenarios, including bridges, tunnels, retaining walls, and building facades. The original dataset contains 1247 images of concrete peeling, spalling, and surface detachment under complex environmental conditions.
Unlike the public dataset, this dataset exhibits significant redundancy and high similarity among images due to continuous video-frame-like acquisition and repeated viewpoints during field inspection. Therefore, a multi-stage selection strategy was applied, including deduplication, viewpoint diversity screening, and quality assessment based on defect visibility and annotation completeness [24]. After this process, 887 representative images were retained.
This selection ensures a balance between dataset richness and computational feasibility while maintaining sufficient coverage of defect morphology variations such as irregular geometry, boundary ambiguity, and texture interference.
Self-built Crack Dataset:
This dataset contains 1018 images collected from real structural inspection scenarios including bridge decks, tunnel linings, pavement surfaces, and concrete walls. The dataset focuses on fine-grained cracks with weak visibility, discontinuous edges, and strong environmental interference such as shadows, stains, and weathering effects.
Similar to the peeling dataset, the raw data contains significant redundancy due to repeated capture of similar crack patterns under different angles and lighting conditions [36]. To ensure high-quality training samples and reduce imbalance in data distribution, we conducted strict filtering based on crack visibility, annotation quality, and morphological diversity. As a result, a final subset of 248 images was selected for experiments.
This reduced but high-quality subset ensures that the model is trained on representative and challenging crack patterns while avoiding overfitting caused by repetitive samples and visually similar instances [37].

To comprehensively evaluate the robustness and generalization capability of the proposed method, experiments were conducted on three different datasets with distinct morphological characteristics and environmental conditions. Detailed comparisons of these datasets are presented in Table 3.

4.3. Comparison with State-of-the-Art Methods

All experiments were conducted under identical software and hardware environments, with differences only in network structure and attention mechanisms [38]. The performance of each model was evaluated using mAP@0.5, mAP@0.5:0.95, Precision, Recall, and Frames Per Second (FPS). FPS was measured under the same input resolution and batch size to ensure fair comparison of real-time inference efficiency.

Based on the YOLOv11 baseline, this study introduces the Geometric Constraint Attention (GCA) mechanism and the Residual Efficient Channel Attention (RECA) module, as well as their parallel dual-attention fusion strategy [26,27]. The GCA module embeds morphology-aware structural priors derived from image-domain defect regularities, while RECA focuses on adaptive channel-wise feature recalibration to enhance fine-grained defect representation.

4.4. Ablation Studies

4.4.1. Results on the Public Crack Dataset

From Table 4, the following observations can be drawn. Figure 6 presents a visual comparison of the different models. The verification results on the publicly available crack dataset are illustrated in Figure 7 [39]:

mAP@0.5: The dual-attention model achieves a mAP@0.5 of 0.521, outperforming the YOLOv11 baseline by 1.1 percentage points, the single RECA by 0.9 percentage points, and the single GCA by 2.8 percentage points. This shows that combining RECA and GCA provides a significant improvement in the model’s ability to detect cracks across a variety of conditions.
mAP@0.5:0.95: The proposed dual-attention fusion model achieves the best performance at multiple IoU thresholds, with a mAP@0.5:0.95 of 0.288. This result demonstrates that the dual-attention approach enhances detection stability across different overlap levels, outperforming both single-module configurations.
Precision: The dual-attention model improves Precision by 5.6 percentage points compared to the baseline, highlighting that the combination of RECA and GCA effectively reduces false positives, ensuring more accurate crack localization. This is especially important when dealing with complex real-world cracks that may be partially occluded or low in contrast.
Recall: The dual-attention model raises Recall from 0.476 to 0.498, a smaller gain than for Precision, indicating that the model maintains a strong ability to detect a wide range of crack features without sacrificing coverage. The single RECA module achieves the highest Recall (0.563), emphasizing its ability to preserve defect continuity, while the dual-attention model achieves a more balanced improvement across Precision and Recall.
FPS: The addition of RECA and GCA modules results in a modest increase in FPS to 294.1, indicating that the model remains computationally efficient. Despite the added complexity of dual-attention, the model manages to maintain real-time inference speed, showing its practicality for deployment in real-world applications.
GFLOPS and Parameters: The dual-attention model shows a moderate increase in GFLOPS due to the additional operations required for geometry reasoning. However, the number of parameters remains relatively low (2.3 M), confirming that the model’s performance improvements are achieved without a substantial increase in model complexity.

Table 4. Ablation results on the public crack dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
YOLOv11	0.510	0.264	0.648	0.476	243.9	6.3	2.6 M
YOLOv11 + RECA	0.512 (+0.4%)	0.266 (+0.8%)	0.599 (−7.6%)	0.563 (+18.3%)	263.2 (+7.9%)	6.1 (−3.2%)	2.3 M (−11.5%)
YOLOv11 + GCA	0.493 (−3.3%)	0.216 (−18.2%)	0.608 (−6.2%)	0.488 (+2.5%)	250.0 (+2.5%)	6.2 (−1.6%)	2.3 M (−11.5%)
YOLOv11 + RECA + GCA	0.521 (+2.2%)	0.288 (+9.1%)	0.704 (+8.6%)	0.498 (+4.6%)	294.1 (+20.6%)	6.7 (+6.3%)	2.3 M (−11.5%)

Note: Percentage values indicate improvement (+) or degradation (−) relative to the baseline YOLOv11. Bold values indicate the best performance among all compared methods.
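The parenthesized entries in Table 4 are relative changes with respect to the baseline, whereas the accompanying text reports absolute differences in percentage points. The short sketch below reproduces both for the mAP@0.5 column; the helper names are illustrative.

```python
def relative_change(value, baseline):
    """Relative change in percent, as in the parenthesized table entries."""
    return 100.0 * (value - baseline) / baseline


def point_difference(value, baseline):
    """Absolute difference in percentage points for metrics in [0, 1]."""
    return 100.0 * (value - baseline)


# mAP@0.5 from Table 4: baseline 0.510, dual-attention model 0.521.
rel = round(relative_change(0.521, 0.510), 1)
pts = round(point_difference(0.521, 0.510), 1)
# FPS from Table 4: baseline 243.9, dual-attention model 294.1.
fps_rel = round(relative_change(294.1, 243.9), 1)
```

For mAP@0.5 this gives a relative change of 2.2% and a difference of 1.1 percentage points, which explains why the table and the text quote different numbers for the same comparison.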

Figure 6. Ablation study results of different model variants.

Figure 7. Validation results on the public crack dataset under the single-category crack detection setting. Only crack defects were considered as annotation and detection targets in this experiment.

4.4.2. Results on the Self-Built Peeling/Detaching Dataset

From Table 5, the following observations can be drawn. Figure 8 presents a visual comparison of the different models. The verification results on the self-built peeling/detaching dataset are illustrated in Figure 9 [39]:

mAP@0.5: The proposed dual-attention model achieves a mAP@0.5 of 0.785, outperforming the YOLOv11 baseline by 1.9 percentage points, the single GCA by 1.1 percentage points, and the single RECA by 0.2 percentage points. This demonstrates the synergistic benefit of combining channel-wise feature enhancement with geometry-constrained attention.
mAP@0.5:0.95: Under multiple IoU thresholds, all three variants outperform the baseline (0.486), with the single RECA reaching 0.512, the dual-attention model 0.508, and the single GCA 0.495. The dual-attention model thus stays within 0.4 percentage points of the best single-module result while also achieving the highest mAP@0.5.
Precision: The dual-attention model improves Precision by 2.1 percentage points compared to the baseline, validating that the geometry constraints introduced by GCA effectively suppress geometrically inconsistent false detections, while RECA enhances channel-level feature discrimination.
Recall: The single RECA module shows a slight advantage in Recall (0.720), suggesting that channel attention tends to preserve the integrity of defect features. The dual-attention model maintains a Recall (0.714) close to that of RECA and above that of GCA (0.684), indicating that accuracy improvements are achieved without significant loss of detection coverage.
FPS: RECA introduces negligible computational overhead and even improves inference speed through effective feature selection [40]. Although GCA runs slower than RECA (285.7 vs. 322.6 FPS) due to additional constraint computation, the dual-attention model still achieves higher FPS than the baseline (294.1 vs. 227.3), demonstrating a favorable balance between accuracy and efficiency.
GFLOPS and Parameters: The dual-attention model exhibits a moderate increase in GFLOPS compared to the baseline and single-module variants, reflecting the additional geometrical reasoning operations. Nevertheless, the total number of parameters remains within a lightweight range (2.38 M), confirming that the proposed dual-attention design achieves performance improvements without substantially increasing model complexity.

Table 5. Ablation results on the self-built peeling/detaching dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
YOLOv11	0.766	0.486	0.807	0.688	227.3	6.3	2.6 M
YOLOv11 + RECA	0.783 (+2.2%)	0.512 (+5.3%)	0.796 (−1.4%)	0.720 (+4.7%)	322.6 (+41.9%)	6.1 (−3.2%)	2.3 M (−9.6%)
YOLOv11 + GCA	0.774 (+1.0%)	0.495 (+1.9%)	0.836 (+3.6%)	0.684 (−0.6%)	285.7 (+25.7%)	6.2 (−1.6%)	2.3 M (−9.1%)
YOLOv11 + RECA + GCA	0.785 (+2.5%)	0.508 (+4.5%)	0.828 (+2.6%)	0.714 (+3.8%)	294.1 (+29.4%)	6.7 (+6.3%)	2.4 M (−7.8%)

Note: Percentage values indicate improvement (+) or degradation (−) relative to the baseline YOLOv11. Bold values indicate the best performance among all compared methods.
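The table reports changes relative to the baseline, whereas the observations above quote absolute differences in percentage points. As a minimal sketch (the helper names `relative_change` and `point_difference` are illustrative and not part of the paper's code), both quantities can be recomputed from the Table 5 rows for the baseline and the dual-attention model:

```python
def relative_change(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline` (as in the table)."""
    return (value - baseline) / baseline * 100.0


def point_difference(value: float, baseline: float) -> float:
    """Absolute difference in percentage points (as quoted in the text)."""
    return (value - baseline) * 100.0


# Rows copied from Table 5 (self-built peeling/detaching dataset).
baseline = {"mAP@0.5": 0.766, "mAP@0.5:0.95": 0.486,
            "Precision": 0.807, "Recall": 0.688, "FPS": 227.3}
dual = {"mAP@0.5": 0.785, "mAP@0.5:0.95": 0.508,
        "Precision": 0.828, "Recall": 0.714, "FPS": 294.1}

# Relative change is meaningful for every column.
relative = {name: round(relative_change(dual[name], baseline[name]), 1)
            for name in baseline}

# Percentage points only make sense for metrics bounded in [0, 1].
points = {name: round(point_difference(dual[name], baseline[name]), 1)
          for name in baseline if name != "FPS"}

for name in baseline:
    print(name, f"{relative[name]:+.1f}%", points.get(name, "n/a"))
```

For mAP@0.5 this gives a relative change of about +2.5% and an absolute difference of 1.9 percentage points, which match the table entry and the text, respectively.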

Figure 8. Ablation study results of different model variants.

Figure 9. Validation results on the self-built peeling/detaching dataset under the single-category defect detection setting. The experiment focused exclusively on peeling/detaching defects, while minor secondary textures or crack-like patterns were not treated as annotation targets.

4.4.3. Results on the Self-Built Crack Dataset

From Table 6, the following observations can be drawn. Figure 10 presents a visual comparison of the different models. The verification results on the self-built crack dataset are illustrated in Figure 11.

mAP@0.5: The proposed dual-attention model significantly improves mAP@0.5 from 0.515 to 0.597 compared with the baseline YOLOv11 model, achieving the best performance among all variants. This demonstrates that the collaborative interaction between RECA and GCA effectively enhances crack feature representation and improves robustness against complex background interference.
mAP@0.5:0.95: Under multiple IoU thresholds, the dual-attention model achieves the highest mAP@0.5:0.95 value of 0.248, outperforming all single-module variants. This indicates that the proposed framework maintains stable localization capability across varying overlap conditions and improves adaptability to cracks with different scales and morphological characteristics.
Precision: The dual-attention model achieves the highest Precision among all compared methods. This demonstrates that the morphology-aware structural priors introduced by GCA effectively suppress structurally inconsistent false detections, while RECA further improves discriminative channel-level feature representation.
Recall: The Recall of the dual-attention model increases substantially from 0.429 to 0.571, representing the most significant improvement among all metrics. This suggests that the collaborative optimization of RECA and GCA enables more complete crack extraction while maintaining accurate localization capability.
FPS: The proposed dual-attention framework maintains the same inference speed as the baseline YOLOv11 model, demonstrating that the introduced GCA and RECA modules do not introduce significant runtime overhead and remain suitable for real-time crack inspection scenarios.
GFLOPS and Parameters: Although the dual-attention model exhibits a moderate increase in GFLOPS due to the additional morphology-aware enhancement operations, the total parameter count remains lightweight (approximately 2.4 M). This confirms that the proposed framework achieves substantial performance gains without relying on excessive model expansion.

4.4.4. Effect of the Structural Prior Weight $λ$

To investigate the influence of the structural prior weight $λ$ on detection performance, a series of ablation experiments are conducted by varying $λ \in \{0, 0.1, 0.5, 1.0, 2.0\}$. For each setting, the model is evaluated on the validation set in terms of mAP@0.5 and the corresponding structural consistency regularization loss $L_{struct}$. The quantitative results are summarized in Table 7, and the overall trends are illustrated in Figure 12.

When $λ = 0$, the model degenerates into a purely data-driven detector without any structural prior constraint. Although reasonable detection accuracy is achieved, the regularization term is completely inactive, indicating that the predicted feature responses may not fully satisfy underlying structural consistency. This suggests that removing the structural prior limits the model’s ability to produce geometrically and topologically coherent predictions [41].

As $λ$ increases from 0 to 0.1, a notable improvement in mAP@0.5 can be observed, accompanied by a clear reduction in $L_{struct}$. This indicates that introducing a mild structural constraint effectively guides the network toward more structure-consistent feature learning while simultaneously enhancing detection accuracy. In this setting, the structural prior acts as a regularization mechanism that suppresses spatially inconsistent feature activations and improves generalization [26,27].

However, further increasing $λ$ to larger values (e.g., $λ = 2.0$) leads to a slight degradation in detection performance, despite stronger enforcement of structural consistency. This suggests that an overly strong structural constraint may dominate the optimization process, causing the model to overemphasize structural regularity while reducing sensitivity to discriminative visual patterns. As a result, the detector becomes overly constrained and less adaptive to complex visual variations [42].

Overall, these results demonstrate that the structural prior weight $λ$ plays a key balancing role between detection accuracy and structural consistency. Neither an overly small nor an excessively large $λ$ yields optimal performance. Instead, a moderate value (e.g., $λ = 1.0$ in this work) provides the best trade-off, achieving high detection accuracy while maintaining structurally coherent predictions. This confirms that the proposed structural prior framework benefits from a carefully tuned regularization strength rather than an extreme emphasis on either data-driven learning or prior-driven guidance [43].
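The balancing role of the weight can be made concrete with a small sketch. It assumes that the training objective adds the structural term to the detection loss with weight λ, as is usual for a weighted regularizer; this subsection does not restate the exact objective, so the additive form, the function names, and the loss magnitudes below are illustrative assumptions rather than values from the experiments.

```python
LAMBDAS = (0.0, 0.1, 0.5, 1.0, 2.0)  # the sweep used in the ablation


def total_loss(det_loss: float, struct_loss: float, lam: float) -> float:
    """Assumed objective: detection loss plus lam-weighted structural term."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return det_loss + lam * struct_loss


def structural_share(det_loss: float, struct_loss: float, lam: float) -> float:
    """Fraction of the total loss contributed by the structural term."""
    total = total_loss(det_loss, struct_loss, lam)
    return (lam * struct_loss) / total if total > 0 else 0.0


# Arbitrary placeholder magnitudes, NOT measurements from the paper.
DET_LOSS, STRUCT_LOSS = 1.0, 0.5

shares = {lam: structural_share(DET_LOSS, STRUCT_LOSS, lam) for lam in LAMBDAS}
for lam, share in shares.items():
    print(f"lambda={lam:<4} structural share of total loss = {share:.2f}")
```

With a weight of 0 the structural term vanishes and only the detection loss is optimized. The share of the structural term grows monotonically with the weight, which is one way to read the observation that a large weight can come to dominate optimization.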

4.5. Real-Time Capability and Computational Efficiency Analysis

To further validate the practical utility of the proposed dual-attention framework in infrastructure health monitoring, we conducted a rigorous analysis of its computational efficiency. This analysis focuses on whether the model can maintain high precision while operating within the strict latency requirements of real-time monitoring systems, such as those used in UAV-based inspections or automated tunnel scanning.

4.5.1. Quantification of Model Complexity

The computational footprint of a deep learning model is primarily determined by its parameter count and GFLOPs. As summarized in Table 8, our proposed method (YOLOv11 + RECA + GCA) maintains a highly compact architecture. Despite the integration of complex morphology-aware structural priors, the total number of parameters is approximately 2.4 M, which represents a 7.8% to 11.5% reduction compared to the baseline YOLOv11 model (2.6 M). This efficiency is achieved through the strategic use of channel-level feature discrimination in the RECA module, which helps the model focus on essential features and prune redundant information.

4.5.2. Inference Speed and Real-Time Throughput

Inference speed is the most critical metric for on-site monitoring. Our experimental results on the RTX 4090 platform show that the proposed framework achieves a peak frame rate of 294.1 FPS. In practical terms, this corresponds to a processing time of about 3.4 ms per frame (1/294.1 s), i.e., less than 4 milliseconds per frame.

On the self-built crack dataset, which is characterized by significantly higher environmental noise (such as shadows and water stains), the model still sustains 161.3 FPS. Sustaining real-time throughput across datasets suggests that the proposed GCA module suppresses background interference without introducing significant computational lag. The dual-attention mechanism helps the model down-weight irrelevant image regions and concentrate on localized defect features, thereby maintaining high throughput even in visually complex scenes.
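Throughput and per-frame latency are reciprocals of each other. The sketch below converts the two reported frame rates into latencies and compares them with the frame interval of a camera stream. The 30 fps stream rate is an assumed example and the helper names are hypothetical. The reported figures were measured on an RTX 4090, so they do not transfer directly to embedded hardware.

```python
def latency_ms(fps: float) -> float:
    """Average per-frame processing time in milliseconds for a given FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def headroom(model_fps: float, stream_fps: float) -> float:
    """How many times faster the model runs than the stream delivers frames."""
    return model_fps / stream_fps


# Frame rates reported in this section.
REPORTED_FPS = {"peak": 294.1, "self-built crack dataset": 161.3}
STREAM_FPS = 30.0  # assumed camera rate, for illustration only

latencies = {name: latency_ms(fps) for name, fps in REPORTED_FPS.items()}
margins = {name: headroom(fps, STREAM_FPS) for name, fps in REPORTED_FPS.items()}

for name in REPORTED_FPS:
    print(f"{name}: {latencies[name]:.2f} ms/frame, "
          f"{margins[name]:.1f}x faster than a {STREAM_FPS:.0f} fps stream")
```

Under these assumptions, the two settings correspond to roughly 3.4 ms and 6.2 ms per frame, both well below the 33.3 ms frame interval of a 30 fps stream.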

4.5.3. Feasibility for On-Site Engineering Deployment

The combination of a small parameter count and high FPS makes this framework exceptionally suitable for deployment on edge computing platforms. In real-world bridge and tunnel inspections, computational resources are often restricted to embedded systems like the NVIDIA Jetson Orin or mobile edge servers. A model exceeding 10 M parameters or requiring high GFLOPs would struggle to maintain real-time performance on such hardware.

Our model’s GFLOPs of 6.7 and parameter count of 2.4 M fall well within the operational limits of current-generation edge devices. By maintaining the same or even higher FPS than the baseline model while raising Recall from 0.429 to 0.571 on the self-built crack dataset, our method provides a reliable, real-time solution for intelligent infrastructure diagnosis that can be deployed directly on mobile sensing platforms.

4.6. Overall Analysis

The Geometric Constraint Attention (GCA) module integrates morphology-aware structural priors derived from defect regularities observed in the image domain into the feature learning process in a differentiable manner. It significantly improves Precision and reduces structurally inconsistent false detections, making it suitable for scenarios requiring high reliability in civil infrastructure inspection [44].
The Residual Efficient Channel Attention (RECA) module enhances channel-level feature responses through residual dynamic weighting, improving Recall and inference speed without noticeably increasing computational complexity, making it advantageous for real-time and resource-constrained applications [45].
The combined RECA + GCA approach achieves the best trade-off among accuracy, robustness, and efficiency. The two modules exhibit strong complementarity: GCA enforces morphology-consistent structural representation, while RECA enhances semantic discriminability at the channel level. This design demonstrates strong practical feasibility for engineering applications such as UAV inspection and intelligent infrastructure monitoring [46].
Robustness and comparative analyses can also be conducted with other YOLO-series models [47].

5. Discussion

5.1. Failure Case Analysis

Performance saturation under large-scale high-quality data conditions
When the training dataset becomes sufficiently large and contains high-resolution images with relatively clear defect characteristics and limited background interference, the performance improvement introduced by the proposed RECA + GCA framework gradually becomes marginal compared with the baseline YOLOv11.
To further investigate this phenomenon, additional experiments were conducted on a large-scale high-quality crack dataset. The quantitative comparison results are summarized in Table 9.
As shown in Table 9, the proposed model exhibits only a slight performance difference compared with the baseline YOLOv11, with mAP@0.5 decreasing from 0.893 to 0.890 and mAP@0.5:0.95 from 0.740 to 0.736. The absolute performance gap remains within 0.4 percentage points, indicating that both models achieve near-saturated detection performance under these ideal data conditions.
This behavior does not indicate model instability or ineffective feature learning. Instead, it can be explained by the diminishing marginal benefit of structural guidance when the training data already provides sufficiently strong visual discriminative information. With large-scale, high-quality datasets, the baseline YOLOv11 can learn highly separable feature representations directly from visual appearance cues, thereby lessening the relative impact of additional structure-guided regularization [48].
In such scenarios, the defect morphology, boundary continuity, and texture distribution are already well represented in the learned feature space, leading to statistically fewer structurally inconsistent predictions. Consequently, the enhancement introduced by the GCA module becomes less pronounced [49].
From the perspective of the bias–variance trade-off, introducing additional structure-aware constraints under abundant data conditions may slightly increase inductive bias while providing limited additional generalization gain [50]. Therefore, although the proposed framework demonstrates substantial improvements under complex low-quality engineering scenarios, its advantages may gradually converge toward those of the baseline detector when trained on very large-scale, clean, and visually well-structured datasets.
Feature interference under extreme illumination and complex backgrounds
In scenarios where defect regions are heavily occluded or exposed to extreme lighting conditions or strong background noise, the detection accuracy of the model may decrease [51].
For instance, with narrow cracks under strong shadow conditions in the dataset, the mAP@0.5 values of some samples may decrease, indicating sensitivity to lighting-induced distribution changes. From a feature representation perspective, this can be interpreted as a degradation of signal-to-noise ratio in intermediate feature maps. The RECA mechanism relies on channel-wise attention weights derived from global semantic statistics; however, under severe illumination variation, the statistical distribution of feature activations may shift, leading to suboptimal attention allocation [52].
Meanwhile, the GCA module relies on gradient-based structural cues (e.g., Sobel/Laplacian responses) to model edge continuity and boundary sharpness. In signal processing terms, strong shadows and complex textures introduce high-frequency components that are not necessarily associated with true structural defects, thereby increasing the likelihood of spurious edge responses [53]. This can be interpreted as a form of gradient-domain contamination, where defect-related edges and background-induced edges become partially indistinguishable in the feature extraction stage.
Furthermore, when background textures exhibit strong morphological similarity to crack patterns—such as wood grain, concrete aging textures, or repetitive linear artifacts—the feature distributions of foreground and background classes may overlap in the embedding space [54]. In such cases, the discriminative power of morphology-based priors is naturally reduced, as these priors rely on statistical differences in structural regularities. From a representation learning perspective, this corresponds to reduced separability between class manifolds in high-dimensional feature space. This observation also aligns with known limitations of 2D image-based learning frameworks, where the absence of depth or temporal consistency may limit robustness under highly ambiguous visual conditions [55]. Therefore, incorporating additional geometric modalities such as 3D point clouds or multi-view constraints may provide complementary structure for improving robustness.
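The gradient-domain contamination described above can be illustrated on two toy intensity images: a thin dark crack and a shadow boundary. The sketch applies the standard 3×3 Sobel and Laplacian kernels in plain Python. It is not the GCA module itself, which operates on feature maps inside the network, and the image values and helper names are assumptions chosen for illustration.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def filter3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation on a 2D list of floats."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            acc = 0.0
            for dr in range(3):
                for dc in range(3):
                    acc += kernel[dr][dc] * image[r + dr - 1][c + dc - 1]
            row.append(acc)
        out.append(row)
    return out


def sobel_magnitude(image):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[(a * a + b * b) ** 0.5 for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]


def max_abs(response):
    """Largest absolute value in a 2D response map."""
    return max(abs(v) for row in response for v in row)


# Toy images (assumed values): bright surface = 1.0.
crack = [[1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]   # thin dark line
shadow = [[0.3, 0.3, 0.3, 1.0, 1.0, 1.0, 1.0] for _ in range(5)]  # shadow boundary

crack_grad = max_abs(sobel_magnitude(crack))
shadow_grad = max_abs(sobel_magnitude(shadow))
crack_lap = max_abs(filter3x3(crack, LAPLACIAN))
shadow_lap = max_abs(filter3x3(shadow, LAPLACIAN))

print("Sobel max:", crack_grad, shadow_grad)
print("Laplacian max:", crack_lap, shadow_lap)
```

In this toy case the shadow boundary produces a Sobel response of the same order as the crack, even though no defect is present, so edge strength alone does not separate the two. The Laplacian responds more strongly to the thin line than to the step edge. The example therefore only indicates the kind of ambiguity that arises under strong shadows; it does not measure the behavior of the trained model.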

5.2. Dual Attention Mechanism for Collaborative Reasoning

To investigate how different collaboration strategies between channel attention (RECA) and geometric constraint attention (GCA) affect detection performance, four representative dual-attention mechanisms were evaluated: serial, cross-guided, gated, and parallel. The experimental results on both the self-built peeling/detaching dataset (Table 10) and the self-built crack dataset (Table 11) consistently demonstrate that the parallel collaboration strategy achieves the best overall performance. Figure 13 presents a structural comparison of four dual-attention strategies for concrete crack feature extraction.

Serial collaboration applies the two attention modules sequentially, which enforces a fixed information flow. Although this design preserves architectural simplicity, it limits mutual interaction between RECA and GCA [56]. As a result, the serial strategy shows relatively stable but suboptimal performance, particularly in Recall and mAP@0.5:0.95, indicating insufficient adaptability to diverse defect morphologies [57].

Cross-guided collaboration introduces mutual guidance between the two attention branches, enabling partial interaction between feature enhancement and geometry constraints. This strategy improves Recall in some cases by encouraging complementary feature responses [58]. However, the bidirectional coupling may introduce conflicting gradients and unstable feature reweighting, leading to inconsistent Precision and limited overall accuracy gains across datasets [59].

Gated collaboration dynamically modulates attention responses through a gating mechanism, aiming to suppress redundant or noisy features. While this strategy enhances robustness in certain scenarios, it tends to over-filter feature responses, especially for fine-grained or low-contrast defects [60]. Consequently, the gated mechanism exhibits reduced mAP@0.5 and inferior generalization compared to other strategies [61].

In contrast, the parallel collaboration strategy processes RECA and GCA in independent branches and fuses their outputs at the feature level. This design avoids information dominance and gradient interference while fully preserving complementary cues from both attention mechanisms [62]. As evidenced by the experimental results, the parallel strategy consistently achieves the highest mAP@0.5, mAP@0.5:0.95, Precision, and Recall on both datasets. Moreover, this performance gain is obtained with only a marginal increase in computational cost, as reflected by the moderate GFLOPs and parameter count [63].

Overall, these results indicate that parallel dual-attention collaboration provides a more balanced and robust mechanism for integrating feature enhancement and geometric constraints. By decoupling attention learning while enabling effective feature fusion, the parallel strategy achieves superior detection accuracy, geometric consistency, and computational efficiency, making it the most suitable design choice for engineering-oriented defect detection tasks.
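The difference between serial and parallel collaboration can be sketched with two fixed element-wise gates acting on a two-element feature vector. This is a deliberately simplified toy: the gate weights are constants rather than learned, input-dependent attention maps, and fusing the parallel branches by averaging is an assumption, since this section only states that the outputs are fused at the feature level.

```python
def apply_gate(features, weights):
    """Element-wise gating: scale each feature by its attention weight."""
    return [f * w for f, w in zip(features, weights)]


def serial(features, gate_a, gate_b):
    """Apply gate A, then gate B on A's output (fixed information flow)."""
    return apply_gate(apply_gate(features, gate_a), gate_b)


def parallel(features, gate_a, gate_b):
    """Apply both gates to the same input and average the two branches."""
    branch_a = apply_gate(features, gate_a)
    branch_b = apply_gate(features, gate_b)
    return [(a + b) / 2.0 for a, b in zip(branch_a, branch_b)]


features = [1.0, 1.0]
channel_gate = [1.0, 0.1]   # stand-in for a channel branch; suppresses feature 1
geometry_gate = [0.2, 1.0]  # stand-in for a geometry branch; suppresses feature 0

serial_out = serial(features, channel_gate, geometry_gate)
parallel_out = parallel(features, channel_gate, geometry_gate)

print("serial:  ", serial_out)
print("parallel:", parallel_out)
```

In the serial case, a feature suppressed by either gate stays suppressed in the output, because the weights multiply. In the parallel case, each feature survives through whichever branch keeps it. This is one reading of how independent branches avoid the dominance of one attention mechanism over the other.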

5.3. Compared with Other Baseline YOLO Models

To provide a more comprehensive evaluation, we extend the comparison to a wider range of representative YOLO-based detectors, including YOLOv5–YOLOv12 as well as recent variants such as YOLOv9t/c and YOLOv10b. Table 12 and Table 13 report the quantitative results on the self-built crack dataset and the self-built peeling dataset, respectively. In addition, Figure 14 and Figure 15 present the corresponding visual comparison results, further illustrating the detection performance differences among all methods.

To ensure a consistent and reproducible evaluation, all models are trained under an identical training protocol, including the same optimizer configuration, learning rate schedule, batch size, number of epochs, and data augmentation strategy. This unified setting enables a fair comparison by isolating architectural differences as the primary factor affecting performance. We note that large-scale models such as YOLOv9c and YOLOv10b are typically optimized under model-specific hyperparameter settings; however, evaluating all methods under a unified configuration is a common practice in benchmark studies, ensuring controlled and directly comparable results across heterogeneous architectures.

Overall detection performance. Across both datasets, the proposed YOLOv11 + RECA + GCA consistently achieves superior or competitive performance among all compared YOLO variants. On the self-built crack dataset, it achieves the highest mAP@0.5 (0.597), demonstrating strong detection capability relative to both lightweight and large-scale models.

In particular, although YOLOv10b achieves relatively high precision (0.798), it exhibits lower recall (0.476), indicating missed detections. In contrast, the proposed method improves recall to 0.571 while maintaining high precision (0.741), achieving a more balanced detection performance.
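One common way to summarize the balance between Precision and Recall in a single number is the F1 score, the harmonic mean of the two. The paper does not report F1; the sketch below only derives it from the Precision and Recall values quoted above, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and Recall on the self-built crack dataset, as quoted in the text.
models = {
    "YOLOv10b": (0.798, 0.476),
    "YOLOv11 + RECA + GCA": (0.741, 0.571),
}

f1 = {name: f1_score(p, r) for name, (p, r) in models.items()}
for name, value in f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

This yields roughly 0.596 for YOLOv10b and 0.645 for the proposed method, which is consistent with the statement that the lower Precision is more than offset by the higher Recall.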

Fairness and evaluation protocol. To ensure a fair and controlled comparison, all models are evaluated under an identical training configuration. This allows us to isolate architectural differences as the main factor influencing performance. While large-scale models such as YOLOv9c and YOLOv10b may achieve higher performance under model-specific hyperparameter tuning, the unified setting provides a consistent and reproducible benchmark across all YOLO variants. Under this condition, the proposed method still achieves competitive or superior accuracy while maintaining significantly lower computational cost.

Effectiveness of RECA and GCA modules. The performance improvements stem from the complementary design of the RECA and GCA modules. RECA enhances channel-wise feature recalibration, improving sensitivity to fine-grained crack-related features, while GCA introduces morphology-aware structural constraints that enforce spatial consistency and suppress geometrically implausible activations. Their integration enables more robust feature representation under challenging conditions such as low contrast, irregular crack geometry, and scale variation.

Efficiency and scalability analysis. Despite improved detection accuracy, the proposed method remains lightweight with only 2.4 M parameters and 6.7 GFLOPs. It is comparable to YOLOv11 and significantly more efficient than large-scale models such as YOLOv9c and YOLOv10b. Although its inference speed is not the highest among all models, it achieves a favorable balance between accuracy and efficiency, making it suitable for real-time deployment.

Robustness and extended comparison. By extending the baseline set to include both classic and recent YOLO variants, a more comprehensive evaluation is achieved. The consistent improvements across both datasets further demonstrate the robustness and generalization capability of the proposed approach under different data distributions.

Summary. Overall, the extended comparative study confirms that the proposed YOLOv11 + RECA + GCA achieves a superior trade-off between accuracy, efficiency, and robustness, highlighting the effectiveness of integrating channel attention and morphology-aware structural guidance for practical defect detection tasks.

5.4. Strategic Transition from Bounding-Box Detection to Pixel-Level Quantification

While the experimental results underscore the efficacy of GCA-YOLO in robustly localizing structural defects, we must critically address the inherent constraints of bounding-box-level detection within the broader context of Structural Health Monitoring (SHM). A fundamental limitation of rectangular bounding boxes is their inability to achieve pixel-level accuracy, which results in the inclusion of redundant background information and the loss of precise boundary morphology. In civil engineering, this precision is not merely a technical preference but a prerequisite for defect severity grading. For instance, determining the structural risk level of a concrete beam requires precise measurements of crack width, orientation, and the total area of spalled concrete—metrics that a standard bounding box cannot directly provide.

However, we argue that GCA-YOLO serves as a “high-fidelity prompt generator” within the defect detection ecosystem. Recent breakthroughs in foundation models, particularly the Segment Anything Model (SAM), have demonstrated that high-performance object detectors can be seamlessly integrated into two-stage assessment pipelines. As highlighted in recent literature, bounding-box-level outputs can serve as effective visual prompts to guide segmentation models toward achieving sub-pixel fidelity without the prohibitive cost of dense pixel-level labeling.

The integration of GCA-YOLO into such a two-stage pipeline offers several transformative advantages for structural assessment:

(i) Synergistic Efficiency: By utilizing GCA-YOLO to rapidly filter large-scale inspection imagery and identify regions of interest (ROIs), the computationally intensive segmentation process is restricted solely to relevant pixels, significantly enhancing real-time deployment potential.
(ii) Morphological Consistency: Unlike generic detectors, our model’s GCA module ensures that the generated boxes are strictly aligned with the underlying structural geometry. This “cleaner” prompt drastically reduces the ambiguity for downstream segmentation models, leading to more accurate delineation of irregular boundaries.
(iii) Quantifiable Assessment: This framework bridges the gap between vision-based “detection” and engineering-based “quantification,” enabling the automated calculation of severity indices (e.g., spalling area ratios or crack propagation rates) that are critical for long-term bridge and tunnel maintenance.

In conclusion, while GCA-YOLO currently operates at the box level, its design principles facilitate its natural extension into a comprehensive, multi-scale structural diagnosis system.
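As a minimal sketch of the quantification step in such a two-stage pipeline, the snippet below computes a spalling area ratio from a binary segmentation mask restricted to a detected box. The mask and box are hand-written placeholders; in a real pipeline the box would come from the detector and the mask from a prompted segmentation model, neither of which is invoked here. The function name and box convention are assumptions.

```python
def area_ratio(mask, box):
    """Fraction of defect pixels inside a box.

    mask: 2D list of 0/1 values (1 = defect pixel).
    box:  (x0, y0, x1, y1) in pixel coordinates, end-exclusive.
    """
    x0, y0, x1, y1 = box
    roi = [row[x0:x1] for row in mask[y0:y1]]
    total = sum(len(row) for row in roi)
    if total == 0:
        raise ValueError("box does not overlap the mask")
    return sum(sum(row) for row in roi) / total


# Placeholder mask (assumed), 4 rows x 6 columns.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = (1, 1, 4, 3)  # hypothetical detector output around the defect

roi_ratio = area_ratio(mask, box)
image_ratio = area_ratio(mask, (0, 0, len(mask[0]), len(mask)))

print(f"defect share of box:   {roi_ratio:.3f}")
print(f"defect share of image: {image_ratio:.3f}")
```

The box alone would attribute all 6 pixels inside it to the defect, whereas the mask counts 5. This is the kind of over-estimate that pixel-level quantification is meant to remove.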

6. Conclusions

This paper proposes a dual-attention detection framework that integrates Geometric Constraint Attention (GCA) and Residual Efficient Channel Attention (RECA) to address key limitations of conventional YOLOv11-based building defect detection, including structurally inconsistent predictions, limited fine-grained feature representation, vulnerability to complex background interference, and instability from excessive attention weighting.

The main conclusions of this study are summarized as follows.

First, by differentiably embedding morphology-aware structural priors derived from civil infrastructure defect appearance patterns into the YOLOv11 framework [46,47,48], the proposed GCA module introduces an explicit structural constraint during feature learning.

Experimental results demonstrate that this mechanism consistently reduces structurally inconsistent detections, particularly for cracks and spalling, where clear improvements in boundary continuity and geometric coherence are observed. This indicates that incorporating image-domain morphological priors can effectively enhance structural reliability beyond purely data-driven feature learning.

Second, the improved RECA mechanism employs residual dynamic channel reweighting to enhance discriminative feature representation while preserving model efficiency [39,42].

Extensive experiments show that RECA improves sensitivity to subtle defects such as thin cracks and surface spalling, and yields consistent gains in mAP, precision, and recall across multiple datasets. Compared with conventional attention mechanisms, it effectively alleviates feature suppression imbalance and reduces the influence of high-frequency background noise, thereby improving robustness in complex visual environments.

Third, the combined RECA + GCA framework achieves a favorable trade-off among detection accuracy, robustness, and computational efficiency, demonstrating strong complementarity between structural modeling and adaptive channel attention.

Experimental evaluations on our self-constructed datasets for crack, spalling, and corner-chipping confirm consistent performance improvements in key metrics, including mAP, precision, and recall, under most tested conditions. Moreover, the inference speed remains comparable to the baseline YOLOv11, indicating that the proposed enhancements do not introduce significant computational overhead.

From a methodological perspective, the interpretability analysis suggests that the model tends to focus on structurally meaningful defect regions while suppressing irrelevant background responses. However, this observation should be regarded as qualitative evidence supporting the role of attention mechanisms, rather than strict causal verification of its internal decision-making processes.

Furthermore, while the effectiveness of GCA and RECA has been validated in the tested datasets, their performance under extreme environmental variations (e.g., severe occlusion, domain shift, or synthetic textures resembling defects) cannot be fully guaranteed. Nevertheless, their design principle suggests potential transferability to other civil engineering scenarios such as bridges, tunnels, and road surfaces [3].

Future research will focus on addressing the current limitations of box-level outputs by integrating the GCA-YOLO framework into a two-stage semantic segmentation pipeline. By utilizing our detection results as prompts for foundation models, we aim to achieve sub-pixel accuracy in defect quantification, thereby enabling precise severity grading and more reliable structural safety assessments.

Overall, the proposed framework provides a practical and scalable solution for integrating deep learning-based perception with morphology-based structural constraints in civil engineering defect detection. Future work will also focus on adaptive prior weight strategies and more physically grounded descriptions to further enhance robustness in highly complex real-world environments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/buildings16112105/s1.

Author Contributions

Conceptualization, C.G. and Z.Z. C.G. was responsible for defining the research scope, establishing the overall framework, and proposing the key scientific problems. Z.Z. contributed to refining the research objectives and forming the core methodology of this work. Methodology, Z.Z. Z.Z. was responsible for the design and implementation of the research methodology, including the experimental setup and data processing pipeline. Writing—original draft, Z.Z. Z.Z. drafted the original manuscript, including the literature review, methodology description, results analysis, and initial discussion. Visualization, Z.Z. Z.Z. was responsible for designing and generating all figures, tables, and graphical summaries presented in the manuscript. Writing—review and editing, C.G. C.G. revised the manuscript, provided critical feedback, and edited the final version for clarity, coherence, and academic rigor. Supervision, C.G. C.G. supervised the entire research process, provided guidance on research direction, and managed the project. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data and code underlying the results presented in this work, including the complete original source code and the majority of the datasets, are available in the Supplementary Materials. Requests for further data or materials should be addressed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used Google Gemini (Google LLC, Mountain View, CA, USA) for partial English language refinement, readability improvement, and preliminary conceptual illustration assistance related to Figure 2, Figure 3 and Figure 12. All generated content was carefully reviewed, modified, and validated by the authors. The authors take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Golding, V.P.; Gharineiat, Z.; Munawar, H.S.; Ullah, F. Crack detection in concrete structures using deep learning. Sustainability 2022, 14, 8117.
Ali, L.; Alnajjar, F.; Jassmi, H.A.; Gocho, M.; Khan, W.; Serhani, M.A. Performance evaluation of deep CNN-based crack detection and localization techniques for concrete structures. Sensors 2021, 21, 1688.
Park, S.E.; Eem, S.H.; Jeon, H. Concrete crack detection and quantification using deep learning and structured light. Constr. Build. Mater. 2020, 252, 119096.
Dung, C.V. Autonomous concrete crack detection using deep fully convolutional neural network. Autom. Constr. 2019, 99, 52–58.
Shan, Z.; Hu, H.; Zhu, C.; Du, S.; Jing, H.; Wang, H. RSM-YOLOv11: Lightweight steel surface defect segmentation algorithm research based on YOLOv11 improvement. IEEE Access 2025, 13, 111681–111698.
Ruggieri, S.; Cardellicchio, A.; Nettis, A.; Renò, V.; Uva, G. Using Attention for Improving Defect Detection in Existing RC Bridges. IEEE Access 2025, 13, 18994–19015.
Wang, Y. A Study on Intelligent Road Damage Detection from UAV Imagery Based on the YOLOv10 Architecture. In Proceedings of the 2025 International Conference on Power, Electrical Engineering, Electronics and Control (PEEEC), Athens, Greece, 15–17 December 2025; IEEE: New York, NY, USA, 2025; pp. 1045–1049.
Badar, H.M.S.; Hussain, I.; Bashir, A.K.; Alturki, N.; Fan, G.; Zhang, C. Edge-Optimized Lightweight and Transformer Backbones for Real-Time Road Damage Detection in IIoT Systems. IEEE Internet Things J. 2026, 13, 19893–19904.
Huang, S.; Liu, Q.; Chen, C.; Chen, Y. A real-time concrete crack detection and segmentation model based on YOLOv11. arXiv 2025, arXiv:2508.11517.
Tian, Z.; Yang, F.; Yang, L.; Wu, Y.; Chen, J.; Qian, P. An optimized YOLOv11 framework for the efficient multi-category defect detection of concrete surface. Sensors 2025, 25, 1291.
Weng, Y. Optimization of Concrete Crack Feature Extraction in YOLO for High-Precision Crack Detection. In Proceedings of the 2025 IEEE 3rd International Conference on Electrical, Automation and Computer Engineering (ICEACE), Changchun, China, 26–28 December 2025; IEEE: New York, NY, USA, 2025; pp. 2030–2036.
Ding, J.; Pang, J.; Li, W.; Yin, X.; Li, X.; Chen, Q.; Zhao, J.; Zhao, J.; Hu, D. Physics-informed Neural Network for real-time imaging and evaluation of defect based on high-definition ACFM probe. IEEE Trans. Instrum. Meas. 2025, 75, 6001014.
Khan, M.Z.; Shahzadi, M.; Khan, A.; Ali, U.; Hassan, M.A.S.; Hussain, M. Review on crack detection in civil infrastructure using structural health monitoring and machine learning techniques. Innov. Infrastruct. Solut. 2025, 10, 348.
Yamane, T.; Chun, P.J. Crack detection from a concrete surface image based on semantic segmentation using deep learning. J. Adv. Concr. Technol. 2020, 18, 493–504.
Bazzucchi, F.; Restuccia, L.; Ferro, G.A. Considerations over the Italian road bridge infrastructure safety after the Polcevera viaduct collapse: Past errors and future perspectives. Frat. Integrità Strutt. 2018, 12, 400–421.
Farneti, E.; Cavalagli, N.; Costantini, M.; Trillo, F.; Minati, F.; Venanzi, I.; Ubertini, F. A method for structural monitoring of multispan bridges using satellite InSAR data with uncertainty quantification and its pre-collapse application to the Albiano-Magra bridge in Italy. Struct. Health Monit. 2023, 22, 353–371.
Durante, M.G.; Di Sarno, L.; Zimmaro, P.; Stewart, J.P. Damage to roadway infrastructure from 2016 central Italy earthquake sequence. Earthq. Spectra 2018, 34, 1721–1737.
Kim, B.; Cho, S. Automated vision-based detection of cracks on concrete surfaces using a deep learning technique. Sensors 2018, 18, 3452.
Mohammadzadeh, M.; Kremer, G.E.O.; Olafsson, S.; Kremer, P.A. AI-driven crack detection for remanufacturing cylinder heads using deep learning and engineering-informed data augmentation. Automation 2024, 5, 578–596.
Xue, M.; Chen, M.; Peng, D.; Guo, Y.; Chen, H. One spatio-temporal sharpening attention mechanism for lightweight YOLO models. Sensors 2021, 21, 7949.
Li, Y.; Zhang, M.; Zhang, C.; Liang, H.; Li, P.; Zhang, W. YOLO-CCS: Vehicle detection algorithm based on coordinate attention mechanism. Digit. Signal Process. 2024, 153, 104632.
Lv, M.; Su, W.H. YOLOv5-CBAM-C3TR: Optimized model with transformer for apple leaf disease detection. Front. Plant Sci. 2024, 14, 1323301.
Jiang, K.; Xie, T.; Yan, R.; Wen, X.; Li, D.; Jiang, H.; Jiang, N.; Feng, L.; Duan, X.; Wang, J. Attention mechanism-improved YOLOv7 for object detection. Agriculture 2022, 12, 1659.
Ye, Z.; Guo, Q.; Wei, J.; Zhang, J.; Zhang, H.; Bian, L.; Guo, S.; Zheng, X.; Cao, S. Improved YOLOv5 with attention mechanism for plant detection. Front. Plant Sci. 2022, 13, 991929.
Yan, J.; Zhou, Z.; Zhou, D.; Su, B.; Xuanyuan, Z.; Tang, J.; Lai, Y.; Chen, J.; Liang, W. Underwater object detection algorithm based on attention mechanism and cross-stage partial fast spatial pyramidal pooling. Front. Mar. Sci. 2022, 9, 1056300.
Chen, B.; Dang, Z. Fast PCB defect detection using improved YOLOv7 with CBAM. IEEE Access 2023, 11, 95092–95103.
Xu, C.; Cao, B.T.; Yuan, Y.; Meschke, G. Transfer learning based physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 2023, 405, 115852.
Zhang, E.; Dao, M.; Karniadakis, G.E.; Suresh, S. Physics-informed neural networks for materials analysis. Sci. Adv. 2022, 8, eabk0644.
Jeong, H.; Bai, J.; Batuwatta-Gamage, C.P.; Rathnayaka, C.; Zhou, Y.; Gu, Y. A physics-informed neural network-based topology optimization (PINNTO) framework for structural optimization. Eng. Struct. 2023, 278, 115484.
Kapoor, T.; Wang, H.; Núñez, A.; Dollevoet, R. PINNs for beam systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 5981–5995.
Guo, X.Y.; Fang, S.E. Structural parameter identification using PINNs. Measurement 2023, 220, 113334.
Martinez, Y.; Rojas, L.; Peña, A.; Valenzuela, M.; Garcia, J. Physics-informed neural networks for the structural analysis and monitoring of railway bridges: A systematic review. Mathematics 2025, 13, 1571.
Du, X. C-YOLO: An attention-enhanced framework for road crack detection. In 2025 IEEE 8th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC); IEEE: New York, NY, USA, 2025; Volume 8, pp. 833–838.
Ma, L.; Chen, M. Road damage detection based on improved YOLO algorithm. Sci. Rep. 2025, 15, 28506.
Li, S.; Lin, Z.; Shi, Y.; Lan, J.; Zhuo, Y. Road crack detection algorithm based on fusion structure re-parameterization with multi-scale parallel convolutions and hybrid attention mechanism. J. Real-Time Image Process. 2025, 22, 189.
Du, Y.; Cheng, Q.; Liu, X.; Xu, J.; Yi, Y. Enhancing road maintenance through cyber-physical integration: The LEE-YOLO model for drone-assisted pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2025, 26, 14169–14178.
Lan, X.; Liu, L.; Wang, X. DAL-YOLO: A multi-target detection model for UAV-based road maintenance integrating feature pyramid and attention mechanisms. J. Real-Time Image Process. 2025, 22, 105.
Zhang, Y. A remote sensing small object detection algorithm based on dynamic snake convolution and attention mechanism. In 2025 IEEE 5th International Conference on Electronic Technology, Communication and Information (ICETCI); IEEE: New York, NY, USA, 2025; pp. 754–759.
Zhang, H.; Jing, S. MLCAM-YOLO: A lightweight small object detection model based on attention mechanism. In 2025 6th International Conference on Machine Learning and Computer Application (ICMLCA); IEEE: New York, NY, USA, 2025; pp. 153–157.
Peng, C. Research on YOLOv4 object detection based on K-means algorithm and fusion attention mechanism. In 2023 International Conference on Advances in Electrical Engineering and Computer Applications (AEECA); IEEE: New York, NY, USA, 2023; pp. 439–444.
Gao, B.; Fang, Y.; Li, W. Research on transformer-based small object detection systems for low-resolution scenarios. In 2025 7th International Academic Exchange Conference on Science and Technology Innovation (IAECST); IEEE: New York, NY, USA, 2025; pp. 1277–1281.
Liu, Y.; Luo, L.; Wang, X.; Zheng, Y. Research on object detection algorithm based on twin attention and adaptive feature fusion. In 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE); IEEE: New York, NY, USA, 2024; pp. 964–968.
Lee, J.; Kim, H.; Park, C.; Jang, J.; Paik, J. Small object detection in infrared images using attention mechanism and sigmoid function. In 2024 IEEE International Conference on Consumer Electronics (ICCE); IEEE: New York, NY, USA, 2024; pp. 1–3.
Iqbal, F.; Pumrin, S. Advance attention-based techniques for small object detection in remote sensing images. In 2024 28th International Computer Science and Engineering Conference (ICSEC); IEEE: New York, NY, USA, 2024; pp. 1–5.
Avci, O.; Abdeljaber, O.; Kiranyaz, S.; Hussein, M.; Gabbouj, M.; Inman, D.J. A review of vibration-based damage detection in civil structures: From traditional methods to machine learning and deep learning applications. Mech. Syst. Signal Process. 2021, 147, 107077.
Pathirage, C.S.N.; Li, J.; Li, L.; Hao, H.; Liu, W.; Ni, P. Structural damage identification based on autoencoder neural networks and deep learning. Eng. Struct. 2018, 172, 13–28.
Azimi, M.; Eslamlou, A.D.; Pekcan, G. Data-driven structural health monitoring and damage detection through deep learning: State-of-the-art review. Sensors 2020, 20, 2778.
Lomazzi, L.; Giglio, M.; Cadini, F. Towards a deep learning-based unified approach for structural damage detection, localisation and quantification. Eng. Appl. Artif. Intell. 2023, 121, 106003.
Zhang, L.; Shen, J.; Zhu, B. A review of the research and application of deep learning-based computer vision in structural damage detection. Earthq. Eng. Eng. Vib. 2022, 21, 1–21.
Pathirage, C.S.N.; Li, J.; Li, L.; Hao, H.; Liu, W.; Wang, R. Development and application of a deep learning-based sparse autoencoder framework for structural damage identification. Struct. Health Monit. 2019, 18, 103–122.
Melville, J.; Alguri, K.S.; Deemer, C.; Harley, J.B. Structural damage detection using deep learning of ultrasonic guided waves. AIP Conf. Proc. 2018, 1949, 230004.
Sun, H.; Song, L.; Yu, Z. A deep learning-based bridge damage detection and localization method. Mech. Syst. Signal Process. 2023, 193, 110277.
Guo, T.; Wu, L.; Wang, C.; Xu, Z. Damage detection in a novel deep-learning framework: A robust method for feature extraction. Struct. Health Monit. 2020, 19, 424–442.
Cha, Y.J.; Choi, W.; Büyüköztürk, O. Deep learning-based crack damage detection using convolutional neural networks. Comput.-Aided Civ. Infrastruct. Eng. 2017, 32, 361–378.
Nie, M.; Wang, C. Pavement crack detection based on YOLOv3. In 2019 2nd International Conference on Safety Produce Informatization (IICSPI); IEEE: New York, NY, USA, 2019; pp. 327–330.
Du, Y.; Pan, N.; Xu, Z.; Deng, F.; Shen, Y.; Kang, H. Pavement distress detection and classification based on YOLO network. Int. J. Pavement Eng. 2021, 22, 1659–1672.
Meng, Z.; Qian, Q.; Xu, M.; Yu, B.; Yıldız, A.R.; Mirjalili, S. PINN-FORM: A new physics-informed neural network for reliability analysis with partial differential equation. Comput. Methods Appl. Mech. Eng. 2023, 414, 116172.
Nie, M.; Wang, C. Pavement crack detection based on YOLOv3. In Proceedings of the IICSPI, Chongqing, China, 28–30 November 2019.
Sidqi, A.J.; Nugraha, A.A.; Saputra, R.; Pratama, D.; Wibowo, T.; Hidayat, R.; Santoso, B.; Firmansyah, M.; Utama, A.; Kurniawan, D. YOLO-based Road Crack Detection System. In Proceedings of the ICISS, Bandung, Indonesia, 4–5 September 2024.
Du, X. C-YOLO attention-enhanced crack detection. In Proceedings of the IAEAC, Guiyang, China, 8–10 August 2025.
Zhou, Z.; Tian, Y.; Khan, A.; Liu, J.; Hameed, N. Improved YOLOv8 for pavement cracks. In Proceedings of the ISCTIS, Xi’an, China, 16–18 May 2025.
Kumar, P.; Batchu, S.; Kota, S.R. Real-time concrete damage detection using deep learning for high rise structures. IEEE Access 2021, 9, 112312–112331.
Yao, H.; Liu, Y.; Li, X.; You, Z.; Feng, Y.; Lu, W. Crack detection with attention mechanism. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22179–22189.

Figure 1. End-to-end architecture of the modified YOLO model with GCA and RECA modules for intelligent crack inspection.

Figure 3. Illustration of morphology-oriented feature enhancement for crack and defect analysis.

Figure 4. Comparison of attention heatmaps between the baseline and the improved YOLOv11 model.

Figure 5. The proposed RECA module with adaptive kernel convolution and learnable scaling parameter.

Figure 10. Ablation study results of different model variants.

Figure 11. Validation results on the self-built crack dataset under the single-category crack detection setting. Only crack defects were labeled and evaluated during training and inference.

Figure 12. Effect of varying the structural prior coefficient $λ$ on detection accuracy and regularization loss.

Figure 13. Structural comparison of four dual-attention strategies for concrete crack feature extraction.

Figure 14. Comparison with YOLO baselines on the self-built crack dataset.

Figure 15. Comparison with YOLO baselines on the self-built peeling dataset.

Table 1. Comparison of parameter count and computational cost between ECA and RECA.

Module	Parameters	Computational Cost (FLOPs)
ECA	$k$	$C \cdot k + C \cdot H \cdot W$
RECA	$k + 1$	$C \cdot k + 2 \cdot C \cdot H \cdot W$
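
The expressions in Table 1 can be evaluated for a concrete feature map to see how much RECA adds over ECA. The sketch below is illustrative only: the function names and the example shape (256 channels, a 20 × 20 map, kernel size 5) are assumptions and are not taken from the paper.

```python
def eca_cost(channels, height, width, k):
    """Parameter count and FLOPs for ECA, using the Table 1 expressions."""
    params = k
    flops = channels * k + channels * height * width
    return params, flops


def reca_cost(channels, height, width, k):
    """Parameter count and FLOPs for RECA, using the Table 1 expressions."""
    params = k + 1
    flops = channels * k + 2 * channels * height * width
    return params, flops


# Hypothetical feature-map shape and kernel size, chosen only for illustration.
C, H, W, K = 256, 20, 20, 5
eca_params, eca_flops = eca_cost(C, H, W, K)
reca_params, reca_flops = reca_cost(C, H, W, K)

extra_params = reca_params - eca_params  # one additional learnable scalar
extra_flops = reca_flops - eca_flops     # one additional element-wise pass, C*H*W
print(eca_params, eca_flops, reca_params, reca_flops, extra_flops)
```

Under these expressions the overhead of RECA is a single extra parameter, which is consistent with the learnable scaling parameter named in the Figure 5 caption, plus one additional $C \cdot H \cdot W$ element-wise term, presumably the residual path.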

Table 2. Training hyperparameter configuration.

Parameter	Value
Input image size	$640 \times 640$
Batch size	32
Initial learning rate ( $l r_{0}$ )	0.001
Final learning rate factor ( $l r f$ )	0.01
Optimizer	Auto-selected optimizer
Weight decay	$5 \times 10^{- 4}$
Momentum	0.937
Warmup epochs	3
Training epochs	500/1200
Early stopping patience	100
IoU threshold	0.7
Mosaic augmentation	1.0
Horizontal flip probability	0.5
HSV augmentation	$(0.015, 0.7, 0.4)$
Scale augmentation	0.5
Translation augmentation	0.1
Mixed precision training	Enabled (AMP)
Crack width threshold	0.25
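
For reproducibility, the settings in Table 2 can be collected into a single configuration mapping. The sketch below assumes that the training was run with an Ultralytics-style trainer, which the `lr0` and `lrf` names suggest but which this section does not confirm; the key names follow that convention and the variable name `train_args` is hypothetical. The crack width threshold (0.25) is omitted because no corresponding standard trainer argument is stated.

```python
# Table 2 values expressed with Ultralytics-style argument names (an assumption).
train_args = {
    "imgsz": 640,            # input image size 640 x 640
    "batch": 32,
    "lr0": 0.001,            # initial learning rate
    "lrf": 0.01,             # final learning rate factor
    "optimizer": "auto",     # auto-selected optimizer
    "weight_decay": 5e-4,
    "momentum": 0.937,
    "warmup_epochs": 3,
    "epochs": 500,           # Table 2 lists 500/1200; the split per dataset is not given here
    "patience": 100,         # early stopping patience
    "iou": 0.7,              # IoU threshold
    "mosaic": 1.0,
    "fliplr": 0.5,           # horizontal flip probability
    "hsv_h": 0.015,          # HSV augmentation (hue, saturation, value)
    "hsv_s": 0.7,
    "hsv_v": 0.4,
    "scale": 0.5,
    "translate": 0.1,
    "amp": True,             # mixed precision training
}

# With this convention the schedule ends at lr0 * lrf.
final_lr = train_args["lr0"] * train_args["lrf"]
print(f"final learning rate: {final_lr:.1e}")
```

If the Ultralytics assumption holds, such a mapping could be unpacked into a `train` call together with a dataset configuration; the final learning rate implied by Table 2 is then $10^{-5}$.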

Table 3. Detailed comparison of the datasets used in this study.

Dataset Type	Source	Key Morphological Features	Environmental Challenges
Public Crack	Roboflow Universe	Standard crack patterns; multi-scale cracks; diverse widths.	Illumination variation; background clutter.
Self-built Peeling	Field Survey	Irregular geometry; spalling; edge detachment.	Vegetation interference; texture confusion; outdoor variability.
Self-built Crack	Field Survey	Fine-grained cracks; discontinuous edges; weak contrast.	Shadows; stains; weathering; low visibility.

Table 6. Ablation results on the self-built crack dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
YOLOv11	0.515	0.227	0.740	0.429	161.3	6.3	2.6 M
YOLOv11 + RECA	0.494 (−4.1%)	0.245 (+7.9%)	0.677 (−8.5%)	0.429 (+0.0%)	166.7 (+3.3%)	6.1 (−3.2%)	2.3 M (−9.6%)
YOLOv11 + GCA	0.483 (−6.2%)	0.221 (−2.6%)	0.708 (−4.3%)	0.462 (+7.7%)	151.5 (−6.1%)	6.2 (−1.6%)	2.3 M (−9.1%)
YOLOv11 + RECA + GCA	0.597 (+15.9%)	0.248 (+9.3%)	0.741 (+0.1%)	0.571 (+33.1%)	161.3 (+0.0%)	6.7 (+6.3%)	2.4 M (−7.8%)

Note: Percentage values indicate improvement (+) or degradation (−) relative to the baseline YOLOv11. Bold values indicate the best performance among all compared methods.
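
The percentages in Table 6 are relative changes with respect to the baseline row. As a check, the short sketch below recomputes them for the combined model from the rounded values printed in the table; the helper name `relative_change` is hypothetical.

```python
def relative_change(value, baseline):
    """Relative change of value with respect to baseline, in percent (one decimal)."""
    return round((value - baseline) / baseline * 100.0, 1)


# Rounded values copied from Table 6 (self-built crack dataset).
baseline = {"mAP@0.5": 0.515, "mAP@0.5:0.95": 0.227, "Precision": 0.740, "Recall": 0.429}
combined = {"mAP@0.5": 0.597, "mAP@0.5:0.95": 0.248, "Precision": 0.741, "Recall": 0.571}

changes = {name: relative_change(combined[name], baseline[name]) for name in baseline}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

The accuracy columns reproduce from the rounded entries. The Params column does not (2.3 M against 2.6 M gives about −11.5%, not −9.6%), so those percentages were presumably computed from unrounded parameter counts.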

Table 7. Ablation study on the effect of structural prior weight $λ$.

$λ$	mAP@0.5	$L_{struct}$
0	0.745	0.000
0.1	0.770	0.138
0.5	0.763	0.094
1.0	0.774	0.082
2.0	0.770	0.121

Bold values indicate the best performance among all compared methods.
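
The sensitivity to $λ$ in Table 7 can be summarized with a few lines of arithmetic. The sketch below uses only the values printed in the table; the variable names are hypothetical.

```python
# Rows of Table 7: (lambda, mAP@0.5, L_struct).
rows = [
    (0.0, 0.745, 0.000),
    (0.1, 0.770, 0.138),
    (0.5, 0.763, 0.094),
    (1.0, 0.774, 0.082),
    (2.0, 0.770, 0.121),
]

# Setting with the highest mAP@0.5.
best_lambda, best_map, best_struct = max(rows, key=lambda r: r[1])

# Gain of the best setting over training without the structural prior (lambda = 0).
gain_over_no_prior = round(best_map - rows[0][1], 3)

# Spread of mAP@0.5 across the settings where the prior is active.
active = [r for r in rows if r[0] > 0]
map_spread = round(max(r[1] for r in active) - min(r[1] for r in active), 3)

print(best_lambda, best_map, gain_over_no_prior, map_spread)
```

On these numbers, $λ = 1.0$ gives both the highest mAP@0.5 and the lowest nonzero $L_{struct}$, while the variation among nonzero settings (0.011) is smaller than the gain over $λ = 0$ (0.029).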

Table 8. Comprehensive efficiency comparison between the baseline and the proposed framework.

Model Variant	Parameters	GFLOPs	FPS (Public)	FPS (Crack)	Latency
YOLOv11 (Baseline)	2.6 M	6.3	243.9	161.3	Low
Proposed Model	2.4 M	6.7	294.1	161.3	Ultra-low

Bold values indicate the best performance among all compared methods.
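
The qualitative latency labels in Table 8 can be made concrete by converting the reported frame rates to per-image times. The sketch below assumes that FPS was measured as the reciprocal of the mean per-image inference time at batch size 1, which this section does not state; the names are hypothetical.

```python
def latency_ms(fps):
    """Per-image latency in milliseconds implied by a throughput in frames per second."""
    return 1000.0 / fps


# FPS values copied from Table 8.
fps = {
    ("YOLOv11 (Baseline)", "Public"): 243.9,
    ("YOLOv11 (Baseline)", "Crack"): 161.3,
    ("Proposed Model", "Public"): 294.1,
    ("Proposed Model", "Crack"): 161.3,
}

latency = {key: round(latency_ms(value), 2) for key, value in fps.items()}
for (model, dataset), ms in latency.items():
    print(f"{model} on {dataset}: {ms:.2f} ms per image")
```

Under that assumption, the proposed model takes about 3.40 ms per image on the public dataset against 4.10 ms for the baseline, and both take about 6.20 ms on the self-built crack dataset.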

Table 9. Performance comparison under large-scale high-quality data conditions.

Method	mAP@0.5	mAP@0.5:0.95
YOLOv11	0.893	0.740
YOLOv11 + RECA + GCA	0.891	0.736

Table 10. Comparison of different dual-attention collaboration strategies on the self-built peeling/detaching dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
Serial	0.552	0.215	0.770	0.429	144.9	6.3	2.6 M
CrossGuided	0.577 (+4.5%)	0.254 (+18.1%)	0.566 (−26.5%)	0.667 (+55.5%)	166.7 (+15.1%)	6.1 (−3.2%)	2.3 M (−11.5%)
Gated	0.491 (−11.1%)	0.231 (+7.4%)	0.692 (−10.1%)	0.428 (−0.2%)	166.7 (+15.1%)	6.2 (−1.6%)	2.4 M (−7.7%)
Parallel	0.597 (+8.2%)	0.248 (+15.3%)	0.741 (−3.8%)	0.571 (+33.1%)	161.3 (+11.3%)	6.7 (+6.3%)	2.4 M (−7.7%)

Note: Percentage values indicate improvement (+) or degradation (−) relative to the Serial baseline. Bold values indicate the best performance among all compared methods.

Table 11. Comparison of different dual-attention collaboration strategies on the self-built crack dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
Serial	0.756	0.489	0.766	0.686	333.3	10.7	3.2 M
CrossGuided	0.764 (+1.1%)	0.491 (+0.4%)	0.787 (+2.7%)	0.696 (+1.5%)	344.8 (+3.5%)	6.1 (−43.0%)	2.3 M (−28.1%)
Gated	0.755 (−0.1%)	0.499 (+2.0%)	0.767 (+0.1%)	0.707 (+3.1%)	285.6 (−14.3%)	6.2 (−42.1%)	2.4 M (−25.0%)
Parallel	0.785 (+3.8%)	0.508 (+3.9%)	0.828 (+8.1%)	0.714 (+4.1%)	294.1 (−11.8%)	6.7 (−37.4%)	2.4 M (−25.0%)

Note: Percentage values indicate improvement (+) or degradation (−) relative to the Serial baseline. Bold values indicate the best performance among all compared methods.

Table 12. Comparison with YOLO baselines on the self-built crack dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
YOLOv5	0.577	0.202	0.693	0.476	175.4	7.1	2.5 M
YOLOv6	0.534	0.254	0.549	0.524	178.6	11.7	4.2 M
YOLOv8	0.524	0.276	0.682	0.524	370.4	8.1	3.0 M
YOLOv8n	0.529	0.263	0.714	0.524	185.2	8.1	3.0 M
YOLOv9t	0.555	0.245	0.652	0.571	140.8	7.6	2.0 M
YOLOv9c	0.513	0.277	0.457	0.619	128.2	102.3	25.3 M
YOLOv10b	0.568	0.243	0.798	0.476	714.3	91.6	19.0 M
YOLOv11	0.515	0.227	0.740	0.429	161.3	6.3	2.6 M
YOLOv12	0.513	0.170	0.591	0.552	166.3	6.3	2.6 M
YOLOv11 + CBAM	0.494	0.205	0.557	0.524	158.7	6.1	2.3 M
YOLOv11 + SE	0.582	0.261	0.619	0.571	169.5	6.1	2.3 M
YOLOv11 + RECA + GCA	0.597	0.248	0.741	0.571	161.3	6.7	2.4 M

Note: Bold values indicate the best performance among all compared methods.

Table 13. Comparison with YOLO baselines on the self-built peeling dataset.

Method	mAP@0.5	mAP@0.5:0.95	Precision	Recall	FPS	GFLOPS	Params
YOLOv5	0.752	0.459	0.830	0.638	322.6	7.1	2.5 M
YOLOv6	0.776	0.499	0.835	0.685	344.8	11.7	4.2 M
YOLOv8	0.768	0.502	0.806	0.696	344.8	8.1	3.0 M
YOLOv8n	0.770	0.498	0.810	0.694	357.1	8.1	3.0 M
YOLOv9t	0.749	0.478	0.795	0.659	294.1	7.6	2.0 M
YOLOv9c	0.742	0.472	0.760	0.699	285.7	102.6	25.3 M
YOLOv10b	0.763	0.519	0.832	0.662	370.4	91.6	19.0 M
YOLOv11	0.766	0.486	0.807	0.688	227.3	6.3	2.6 M
YOLOv12	0.773	0.515	0.837	0.725	303.0	6.3	2.6 M
YOLOv11 + CBAM	0.775	0.504	0.845	0.677	322.6	6.1	2.3 M
YOLOv11 + SE	0.763	0.482	0.854	0.645	333.3	6.1	2.3 M
YOLOv11 + RECA + GCA	0.793	0.505	0.811	0.732	294.1	6.7	2.4 M

Note: Bold values indicate the best performance among all compared methods.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, Z.; Guo, C. Incorporating Structural Prior Knowledge into YOLO for Robust Infrastructure Damage Detection. Buildings 2026, 16, 2105. https://doi.org/10.3390/buildings16112105

