Pipeline Defect Detection Based on Improved YOLOv11

Li, Zhiqiang; Shi, Weimin; Sun, Lei

doi:10.3390/pr14030530


by Zhiqiang Li, Weimin Shi * and Lei Sun

Mechanical College, Zhejiang Sci-Tech University, Hangzhou 310000, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(3), 530; https://doi.org/10.3390/pr14030530

Submission received: 18 December 2025 / Revised: 11 January 2026 / Accepted: 16 January 2026 / Published: 3 February 2026

(This article belongs to the Special Issue Process Engineering: Process Design, Control, and Optimization)


Abstract

Underground utility tunnels face corrosion, cracks, and leakage after long-term use, endangering urban safety. Traditional inspection methods are highly subjective, have high miss rates and poor real-time performance, and fail to meet refined management needs. This paper proposes an attention-enhanced YOLOv11 for pipeline defect detection in utility tunnels; YOLOv11 is chosen over YOLOv10 because its C3k2 backbone and dynamic anchor head already surpass YOLOv10 by 1.8% mAP. The method uses homomorphic filtering to improve low-light image quality; replaces the last two C3k2 modules of the original YOLOv11 with a Multi-Scale Feature Aggregation Module to capture micro-cracks via expanded receptive fields; introduces a bidirectional weighted feature pyramid network in the neck (with C2PSA/BRA attention) for cross-scale feature fusion and background suppression, which yields both fine-grained micro-crack sensitivity and global false-target suppression; and adopts DIoU loss in the detection head to reduce slender defect localization errors. Experiments on 5000 utility tunnel defect images show that the improved algorithm achieves 93.2% precision, 92.4% recall, and 92.6% mAP, outperforming the original YOLOv11, Faster R-CNN, and YOLOv5. Ablation experiments confirm module effectiveness, cutting relative error by 75% compared with the baseline. This algorithm can accurately identify multiple types of defects in complex utility tunnel environments, providing technical support for the safe and efficient operation and maintenance of urban infrastructure.

Keywords:

tunnel defect detection; YOLOv11; multi-scale feature aggregation; BiFPN

1. Introduction

As a vital component of modern urban infrastructure, underground utility tunnels house critical pipelines for power, telecommunication, gas, and water systems. With years of service, aging pipes, external damage, corrosion, and leakage gradually emerge, posing significant risks to city operations and public safety [1]. Timely detection and repair of these defects are therefore essential to safeguard the reliable functioning of urban infrastructure. Traditional inspection of utility tunnel defects relies mainly on manual patrols or periodic checks with specialized equipment, approaches that suffer from large subjective errors, high miss rates, and considerable safety risks. In recent years, new technologies such as closed-circuit television (CCTV) [2,3], infrared thermography [4,5,6], and ultrasonic arrays [7,8] have been introduced (Table 1).

Table 1 summarizes the currently used inspection techniques. However, owing to their poor real-time performance, limited sensitivity to complex defects, and high operational overhead, these methods are difficult to scale to network-level surveys and cannot satisfy the demand for refined asset management. In recent years, the rapid advances in computer vision and artificial intelligence have made vision-based automatic inspection an increasingly active research area. In particular, the rise of deep learning has opened a new route for defect recognition in challenging environments. Convolutional neural networks (CNNs), object detection architectures, and semantic segmentation models [9,10] have all demonstrated superior performance in image feature extraction and classification tasks.

Kumar [11] proposed a CNN-based system to classify sewer defects in CCTV images, including root intrusion, sediment, and cracks, achieving 86.2% accuracy. Hawari [12] fused hand-crafted features with a CNN in the Auto-CCTV tool, attaining an F1-score of 0.83 on three defect types (crack, deformation, and deposit) while processing 1 h of video in only 4.3 min. Yin [13] adopted YOLOv3 for end-to-end detection of four defect classes, reaching 81.4% mAP on 2500 CCTV images; video-tracking post-processing further reduced the miss rate by 3.2%. Li [14] presented a dual-level CNN that combines local and global features for sewer defect classification, obtaining high accuracy. Situ [15] employed StyleGAN to synthesize sewer defect images and fine-tuned a CNN classifier, notably enhancing robustness. Shen [16] embedded an enhanced attention module into ResNet50, merged multi-scale features with a feature-preserving block, and achieved 96.34% classification accuracy on five drainage pipe defect categories. Wang [17] built a two-stage ResNet-50 system that delivered 70.9% precision and 75.1% recall on their self-constructed Pipe-Defect-8K dataset covering eight defect types. Haurum [18] released Sewer-ML, the largest multi-label sewer dataset to date, and provided Faster R-CNN and YOLOv3 baselines; YOLOv3 reached 76.4% mAP. Wang [19] introduced a knowledge distillation framework for defect detection; with 80% parameter reduction (only 3.2 MB), mAP rose from 3.1% to 86.5%, enabling 30 fps real-time inference on edge devices. Luo [20], targeting the strong noise and concealed targets in GPR images, designed the lightweight MTGPR network, significantly improving both the detection rate and speed of void defects. Li [21] incorporated deformable convolutions and a Transformer decoder into YOLOv8, proposing YOLOv8-DTD; crack mAP increased by 10.84% while parameters dropped by 43%, running at 65.46 fps on Android phones. 
Zhou [22] built a unified multi-defect model on Mobile-YOLO, surpassing 93% mAP at >100 fps and showing strong adaptability to complex background changes. Qin [23] developed YOLO-TDD, a batch-processing framework tailored for line-scan images, cutting the inspection time of entire tunnel surface defects by an order of magnitude while maintaining high generalization. Feng [24] focused on the low-contrast problem of water leakage and designed an interpretable segmentation network, achieving 89% IoU on their self-built tunnel dataset and accurately delineating seepage areas even under low illumination and bracket shadows. Although studies by Kumar [11], Haurum [18], and follow-up works [19,20,21,22,23,24] have progressively raised tunnel defect detection mAP from 86% to above 93%, several inherent limitations remain unresolved, preventing real-world CCTV inspection deployment. (1) Insufficient recall of micro-defects: Most detectors still rely on single-scale anchors [13,17] or fixed-receptive-field bottlenecks [21,22], failing on hairline cracks and seepage spots only 0.5–2 pixels wide. These targets occupy < 0.01% of the image area and often share orientation and intensity with reflective pipe joints, leading to persistently high miss rates. (2) Low-illumination robustness depends on external hardware. Low-light enhancement is either ignored [11,18] or delegated to additional infrared/thermal cameras [4,5,6], raising inspection costs and operational complexity. (3) Cross-scale fusion weights are static. Even the recent YOLOv8-DTD [21] and Mobile-YOLO [22] adopt FPN or lightweight PAN structures whose fusion weights are frozen after training. When crack scales vary abruptly around the tunnel circumference, high-level semantic features drown out fine details, causing localization drift.

To overcome the limitations of existing detectors in sub-pixel crack localization, we introduce a topologically re-engineered YOLOv11 that jointly optimizes three mutually reinforcing components: (1) a Multi-Scale Feature Aggregation Module (MSFAM) that replaces the last two C3k2 bottlenecks with parallel 1 × 1, 3 × 3, and 5 × 5 convolutions before any downsampling, enlarging the receptive field to 17 × 17 pixels while preserving full resolution, so that 0.5-pixel-wide cracks can be captured without the edge blur inherent to large-kernel or dilated convolutions; (2) a defect-size-aware BiFPN whose fast-normalized fusion weights are re-parameterized under DIoU supervision, yielding a non-uniform weight distribution that gives micro-defect scales (≤8 px) a 2.3× higher fusion gain than background scales; and (3) a collaborative C2PSA+BRA attention pair that is jointly pruned to the 6% of spatial coefficients lying within 1 pixel of annotated crack edges, suppressing 75% of reflection-induced false positives at a cost of <0.4 GFLOPs. The resulting attention-enhanced YOLOv11 achieves 92.6% mAP at 17.3 GFLOPs, pushing micro-crack detection to 0.5 px and outperforming the YOLO-based and other detectors compared in Section 3.4, thereby providing a ready-to-deploy solution for real-time utility tunnel inspection.

2. Methods

2.1. Image Preprocessing

Because illumination inside the utility tunnel is low, the images are preprocessed with homomorphic filtering [25] before detection, which improves the accuracy of the subsequent deep learning model.

Homomorphic filtering is a frequency-domain technique that simultaneously enhances image contrast and compresses the brightness range. By attenuating low-frequency components while amplifying high-frequency ones, it reduces illumination variations and sharpens edge details. In field applications, uneven lighting often yields blurred images in which defects are barely discernible. Homomorphic filtering remaps the grayscale range, corrects non-uniform illumination, boosts details in dark regions, and yet preserves the information in bright areas without loss.

The original image $I(x, y)$ is modeled as the product of an illumination function $i(x, y)$ and a reflectance function $r(x, y)$; thus, the original image function is written as

$$I(x, y) = i(x, y) \cdot r(x, y) \tag{1}$$

To perform homomorphic filtering, the multiplicative relationship of the original image function must be converted into an additive one by taking the logarithm of the function:

$$Z(x, y) = \ln i(x, y) + \ln r(x, y) \tag{2}$$

To transform the image into the frequency domain, a Fourier transform is applied to the logarithm-processed function:

$$F(Z(x, y)) = F(\ln i(x, y)) + F(\ln r(x, y)) \tag{3}$$

Next, an appropriate transfer function $H(u, v)$ is selected. Its parameters are chosen so that low frequencies are attenuated, compressing the variation range of the illumination component, while high frequencies are boosted, enhancing the reflectance component. Applying this homomorphic filter to the Fourier transform of the logarithm of the original image yields

$$S(u, v) = H(u, v) I(u, v) + H(u, v) R(u, v) \tag{4}$$

where $I(u, v)$ and $R(u, v)$ denote the Fourier transforms of $\ln i(x, y)$ and $\ln r(x, y)$, respectively.

The result is then inverse-transformed back to the spatial domain to obtain

$$s(x, y) = F^{-1}(S(u, v)) \tag{5}$$

Finally, the exponential is taken to obtain the filtered result:

$$f'(x, y) = \exp(s(x, y)) \tag{6}$$

Applying homomorphic filtering to preprocess the image yields the result shown in Figure 1.

Figure 1a shows a utility tunnel defect image acquired under low illumination: the global grayscale dynamic range is compressed, and crack and seepage regions exhibit low contrast due to non-uniform lighting, with edge details submerged in the low-to-mid gray band and a marked decline in signal-to-noise ratio. After homomorphic filtering (Figure 1b), the illumination component is suppressed and the reflectance component is enhanced, stretching the overall image contrast by approximately 1.8×; as the local standard deviation of high-frequency defect structures such as cracks increases, the edge gradient magnitude rises. Consequently, the topological continuity and visual saliency of micro-cracks are effectively restored while high-light areas remain free from overexposure distortion.
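Equations (1)–(6) can be sketched in a few lines of NumPy. The paper does not specify the transfer function H(u, v), so the Gaussian high-emphasis form used here, the function name `homomorphic_filter`, and the parameters `gamma_l`, `gamma_h`, `c`, and `d0` with their default values are assumptions chosen for illustration, not the authors' settings. `log1p`/`expm1` stand in for ln/exp only to avoid ln(0) on black pixels.

```python
import numpy as np

def homomorphic_filter(img, gamma_l=0.5, gamma_h=1.5, c=1.0, d0=30.0):
    """Illustrative homomorphic filter for a 2-D grayscale image.

    gamma_l < 1 attenuates low frequencies (illumination);
    gamma_h > 1 boosts high frequencies (reflectance detail).
    """
    img = np.asarray(img, dtype=np.float64)
    z = np.log1p(img)                          # Eq. (2): log turns product into sum
    Z = np.fft.fftshift(np.fft.fft2(z))        # Eq. (3): spectrum with DC at the centre

    rows, cols = img.shape
    u = np.arange(rows) - rows // 2            # frequency offsets from the centre
    v = np.arange(cols) - cols // 2
    d2 = u[:, None] ** 2 + v[None, :] ** 2     # squared distance to the DC term

    # Assumed Gaussian high-emphasis transfer function H(u, v)
    H = (gamma_h - gamma_l) * (1.0 - np.exp(-c * d2 / d0 ** 2)) + gamma_l

    S = H * Z                                  # Eq. (4): filter in the frequency domain
    s = np.real(np.fft.ifft2(np.fft.ifftshift(S)))  # Eq. (5): back to the spatial domain
    return np.expm1(s)                         # Eq. (6): undo the logarithm
```

The output is not rescaled; a practical pipeline would likely normalize it back to the 8-bit range before passing it to the detector.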

2.2. Deep Learning Algorithm Structure

Figure 2 shows the overall architecture of the proposed pipeline-enhanced YOLOv11.

As illustrated in Figure 2, the backbone retains YOLOv11’s efficient C3k2 bottleneck, but its last two C3k2 blocks are replaced by the proposed Multi-Scale Feature Aggregation Module (MSFAM). MSFAM employs parallel 1 × 1, 3 × 3, and 5 × 5 convolutions to enlarge the receptive field, allowing early capture of fine-grained defects such as cracks, while residual connections mitigate gradient vanishing and prevent spatial information loss caused by downsampling. Subsequently, a cascaded BiFPN is introduced as the core of the neck to perform bidirectional, cross-scale weighted fusion of the three features. To further suppress complex background clutter, C2PSA (Partial Spatial Attention) and BRA (Bilateral Regional Attention) are embedded at the end of the neck: C2PSA adaptively recalibrates defect responses along both the channel and spatial dimensions, whereas BRA models long-range dependencies through local–global branches, maintaining high sensitivity under low-light, shadow, or reflective pipe wall conditions. The detection head keeps YOLOv11’s dynamic anchor strategy but replaces the bounding box regression loss with DIoU, accelerating convergence and improving overlap between predicted and ground-truth defect boxes, thus significantly reducing localization errors for slender cracks.
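The text adopts DIoU loss without writing it out. As a minimal sketch, assuming the standard DIoU definition (one minus IoU plus the squared centre distance divided by the squared diagonal of the smallest enclosing box), the loss for a pair of axis-aligned boxes can be computed as follows. The function name `diou_loss` and the `(x1, y1, x2, y2)` box layout are illustrative choices, and both boxes are assumed to have positive area.

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centres
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2

    return 1.0 - iou + rho2 / c2
```

Unlike plain IoU loss, the centre-distance term stays informative when the boxes do not overlap, which is the case that matters for thin, elongated cracks where a small offset can drive IoU to zero.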

2.3. Replacing C3k2 with MSFAM

In utility tunnel inspections, poor underground lighting, limited camera stand-off, and cluttered background textures yield images characterized by small-scale defects, low contrast, and poor signal-to-noise ratio. Traditional CNNs with fixed-size kernels possess restricted receptive fields, making it difficult to extract multi-granularity features within a single hierarchical level. This structural limitation leads to frequent missed alarms and false positives for micro-cracks, seepage spots, and tiny voids, seriously degrading the accuracy and reliability of structural safety assessments. To overcome this bottleneck, we introduce a Multi-Scale Feature Aggregation Module (MSFAM) to replace the last two C3k2 blocks in the YOLOv11 backbone, substantially enhancing the network’s sensitivity to minute defect targets.

In the YOLOv11 backbone [26], the C3k2 block performs local feature reuse through two parallel branches; its computation can be written compactly as

$$Y_{\mathrm{C3k2}} = X + \mathrm{Conv}_{1 \times 1}\left[\mathrm{Cat}\left(X_{1 \times 1}, \mathrm{Bottle}(X_{3 \times 3})\right)\right] \tag{7}$$

This structure relies solely on a stack of identical 3 × 3 receptive fields, which is inadequate for capturing pixel-level defects such as cracks and seepage spots in tunnel images. To address this limitation, we propose replacing C3k2 with the Multi-Scale Feature Aggregation Module (MSFAM). The core idea is to expand the original bottleneck branch into a “three parallel convolutions + residual compression” scheme, as derived below.

The input feature $X \in \mathbb{R}^{H \times W \times C}$ is first fed through a 1 × 1 convolution to reduce the channel dimension to C/4, yielding $X'$:

$$X' = \mathrm{Conv}_{1 \times 1}^{\mathrm{reduce}}(X) \tag{8}$$

Features with three receptive fields are extracted in parallel and concatenated:

$$T = \mathrm{Cat}\left[\mathrm{Conv}_{1 \times 1}(X'), \mathrm{Conv}_{3 \times 3}(X'), \mathrm{Conv}_{5 \times 5}(X')\right] \in \mathbb{R}^{H \times W \times 3C/4} \tag{9}$$

Finally, a 1 × 1 convolution compresses $T$ back to C channels and adds the identity residual:

$$Y_{\mathrm{MSFAM}} = \mathrm{Conv}_{1 \times 1}^{\mathrm{fuse}}(T) + X \tag{10}$$

This network structure is illustrated in Figure 3.

The MSFAM block adopts a four-branch parallel architecture. By simultaneously employing 1 × 1, 3 × 3, and 5 × 5 convolutions, it expands the effective receptive field without increasing computational complexity, allowing the network to capture both fine-grained details and coarse contextual cues in a single layer. This significantly improves the detection accuracy of minute defects such as hairline cracks and seepage spots. Moreover, the MSFAM avoids traditional downsampling, so no spatial resolution is lost and edge structures of flaws are preserved. Each convolution is followed by batch normalization and adaptive SiLU activation, which stabilizes the data distribution during training, mitigates gradient vanishing/explosion, and boosts the model’s expressive power and generalization. Compared with ReLU, SiLU offers stronger adaptability and non-linearity across varying data distributions, leading to faster convergence and more stable training.
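The data flow of Equations (8)–(10) can be sketched with plain NumPy. This is only a shape-level illustration under stated assumptions: a channel-last layout, stride 1 with zero "same" padding, and randomly initialized weights; batch normalization and SiLU are omitted for brevity. The names `conv2d_same`, `init_msfam`, and `msfam` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1 convolution with zero padding. x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    h, wd, _ = x.shape
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):                # accumulate one kernel tap at a time
        for j in range(k):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def init_msfam(c, rng):
    """Random weights for a block with C input channels (C divisible by 4)."""
    r = c // 4
    shapes = {"reduce": (1, 1, c, r), "k1": (1, 1, r, r), "k3": (3, 3, r, r),
              "k5": (5, 5, r, r), "fuse": (1, 1, 3 * r, c)}
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}

def msfam(x, params):
    x_r = conv2d_same(x, params["reduce"])                 # Eq. (8): C -> C/4
    t = np.concatenate([conv2d_same(x_r, params["k1"]),    # Eq. (9): three receptive
                        conv2d_same(x_r, params["k3"]),    # fields, concatenated
                        conv2d_same(x_r, params["k5"])],   # to 3C/4 channels
                       axis=-1)
    return conv2d_same(t, params["fuse"]) + x              # Eq. (10): fuse + residual
```

Because every branch uses stride 1 with "same" padding, the output keeps the H × W resolution of the input, which is the property the text relies on for preserving crack edges.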

2.4. BiFPN Feature Fusion Network

To overcome the limited cross-layer semantic interaction and insufficient utilization of multi-scale information caused by the single lateral connections in the original YOLOv11 neck, this paper introduces an efficient bidirectional weighted feature pyramid network (BiFPN) as the core fusion module [27]. Let the multi-scale feature sequence output by the backbone be denoted as

$$P = \{P_l \in \mathbb{R}^{H_l \times W_l \times C} \mid l = L_{min}, \dots, L_{max}\} \tag{11}$$

where $l$ denotes the level index. The core idea of a BiFPN is as follows: for any fusion node, its inputs come from the previous-stage feature at the same scale and from the upsampled and downsampled features of adjacent scales, and adaptive weighting is achieved through fast-normalized weights.

Specifically, the bidirectional fusion process is defined as follows:

(1) Top-down pathway

For $l = L_{max} - 1$ down to $L_{min}$,

$$\tilde{P}_l^{td} = \mathrm{Resize}(P_{l+1}^{td}), \qquad P_l^{td} = \mathrm{Conv}_{1 \times 1}\left(\frac{w_1 \cdot P_l + w_2 \cdot \tilde{P}_l^{td}}{w_1 + w_2 + \varepsilon}\right) \tag{12}$$

where Resize denotes nearest-neighbor upsampling, the weights $w_1, w_2 > 0$ are learnable scalars, and $\varepsilon$ is a small constant to ensure numerical stability.

(2) Bottom-up pathway

Initialize $P_l^{out} = P_l^{td}$, then for $l = L_{min} + 1$ to $L_{max}$,

$$\tilde{P}_{l-1}^{bu} = \mathrm{StrideConv}_{3 \times 3}(P_{l-1}^{out}), \qquad P_l^{out} = \mathrm{Conv}_{1 \times 1}\left(\frac{w_3 \cdot P_l^{td} + w_4 \cdot \tilde{P}_{l-1}^{bu}}{w_3 + w_4 + \varepsilon}\right) \tag{13}$$

where the stride-2 convolution downsamples the output of the finer level $l - 1$ to the resolution of level $l$.

(3) Unified weighted fusion

The above weights satisfy $w_i = \mathrm{ReLU}(\alpha_i)$, where $\alpha_i$ are trainable parameters. Fast-normalized fusion replaces softmax, cutting computation and improving numerical stability.

The final output feature set is

$$O = \{P_l^{out} \mid l = L_{min}, \dots, L_{max}\} \tag{14}$$

The BiFPN network architecture is illustrated in Figure 4.

As shown in Figure 4, a BiFPN constructs a bidirectional pathway (top–down and bottom-up) that fully exchanges low-level spatial details and high-level semantic features. Coupled with learnable fast-normalized weights between adjacent levels, it dynamically re-weights each scale according to its task-specific importance, suppressing redundancy while retaining informative cues. Specifically, nearest-neighbor upsampling and stride-2 convolution downsampling are first used to align resolutions, followed by element-wise weighted fusion within each scale. The module finally outputs a set of multi-scale features that simultaneously encode global context and local fine details, which are fed directly to the detection head for defect localization and classification. This mechanism notably improves multi-scale detection performance, reduces computational redundancy, and accelerates model convergence.
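A minimal NumPy sketch of one bidirectional pass with fast-normalized fusion is given below. Several simplifications are assumptions made for illustration: the 1 × 1 convolutions after each fusion are dropped, 2 × 2 average pooling stands in for the stride-2 3 × 3 convolution, the coarsest top-down feature is taken to be the input itself, and the bottom-up step fuses the downsampled output of the next finer level. The names `fast_fuse`, `upsample2`, `downsample2`, and `bifpn_layer` are hypothetical.

```python
import numpy as np

def fast_fuse(a, b, alpha_a, alpha_b, eps=1e-4):
    """Fast-normalized fusion with w_i = ReLU(alpha_i)."""
    wa, wb = max(alpha_a, 0.0), max(alpha_b, 0.0)
    return (wa * a + wb * b) / (wa + wb + eps)

def upsample2(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2(x):
    """2x2 average pooling, a stand-in for the stride-2 convolution."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def bifpn_layer(feats, alphas_td, alphas_bu, eps=1e-4):
    """feats: list of (H, W, C) maps ordered fine -> coarse (L_min ... L_max)."""
    n = len(feats)
    td = list(feats)                          # coarsest level passes through unchanged
    for l in range(n - 2, -1, -1):            # top-down: coarse to fine
        td[l] = fast_fuse(feats[l], upsample2(td[l + 1]), *alphas_td[l], eps)
    out = list(td)                            # initialize outputs with top-down features
    for l in range(1, n):                     # bottom-up: fine to coarse
        out[l] = fast_fuse(td[l], downsample2(out[l - 1]), *alphas_bu[l], eps)
    return out
```

Each entry of `alphas_td` and `alphas_bu` is a pair of raw parameters; a negative value is clipped to zero by the ReLU, which switches that input off without the exponentials a softmax would need.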

3. Experiment and Result Investigation

3.1. Experimental Environment and Dataset Partition

All experiments were conducted on a unified workstation: Windows 11, Intel Core i9-14700HX CPU (Santa Clara, CA, USA), and NVIDIA GeForce RTX 3060 GPU (Santa Clara, CA, USA). The software stack was Python 3.9.7 with CUDA 11.8 and PyTorch 1.8.1. Training was run for 200 epochs, with all images resized to 512 × 512 and a batch size of four. To ensure fairness and reproducibility, every comparative test was executed under an identical environment.

Data were collected by an inspection vehicle, as illustrated in Figure 5. The rear-mounted edge-computing unit is an NVIDIA Jetson AGX Xavier industrial module (Volta GPU, 8 TFLOPS FP16; dual NPU; 32 GB LPDDR4x at 137 GB s⁻¹; 1 TB NVMe SSD). From the recorded videos, 5000 frames containing defects were selected. The dataset was split 7:2:1, resulting in 3500 training images, 1000 validation images, and 500 test images, as illustrated in Figure 6.

The field data acquisition scene is shown in Figure 5. An inspection trolley equipped with a line-scan camera, annular LED fill lights, and an inertial navigation module moves through the tunnel at a constant speed of 5 km h⁻¹, continuously capturing 2 K resolution images. A rear-mounted edge-computing unit buffers and wirelessly streams the video in real time, ensuring that the raw defect data are complete, thus providing high-quality samples for subsequent model training and verification.
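The 7:2:1 partition described above can be reproduced with a short shuffle-and-slice routine. This is a sketch only: the paper does not say how frames were assigned to each subset, so the random shuffle, the fixed seed, and the function name `split_dataset` are assumptions.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)        # seeded for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(5000))
print(len(train), len(val), len(test))
```

For frames extracted from continuous video, splitting by recording session rather than by individual frame may be worth considering, since neighboring frames are highly correlated.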

3.2. Evaluation

To comprehensively evaluate the performance of the improved YOLOv11 network structure in utility tunnel defect detection, this paper adopts five widely recognized metrics in object detection [28]: precision, recall, mean Average Precision (mAP), Frames Per Second (FPS), and Giga Floating-Point Operations (GFLOPs).

Precision is the ratio of “correctly detected positive samples” to “all detected samples,” reflecting how trustworthy the model is when it reports a defect:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$

where TP (True Positive) is the number of real defects correctly detected and FP (False Positive) is the number of false alarms.

Recall represents the proportion of correctly detected positive samples among all actual positive samples:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$

where FN (False Negative) is the number of actual positive samples that were missed by the detector.

To comprehensively balance precision and recall across different IoU thresholds, we adopt mean Average Precision (mAP) as the overall metric:

$$\mathrm{mAP} = \frac{1}{N} \sum_{q = 1}^{N} P(q) \, \Delta R(q) \tag{17}$$

where $q$ denotes the q-th IoU threshold, $N$ is the total number of thresholds evaluated, and $P(q)$ and $\Delta R(q)$ are the precision and the increment of recall at that threshold, respectively.

FPS indicates the number of images the detector can process within one second and is used to evaluate the detection speed of the algorithm. A higher FPS value means the algorithm runs faster.

GFLOPs are a key metric for measuring the computational complexity of a model: the lower the value, the fewer calculations the model requires and the higher its computational efficiency.
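The accuracy metrics can be illustrated with a small pure-Python sketch. Precision and recall follow Equations (15) and (16) directly. For average precision, the sketch assumes the common convention in which the sum of precision times recall increment runs over confidence-ranked detections of one class at a single IoU threshold; this is an interpretation, since Equation (17) indexes the sum by IoU threshold. All counts in the usage lines are made-up illustrative values, not results from the paper.

```python
def precision(tp, fp):
    """Eq. (15): fraction of reported defects that are real."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Eq. (16): fraction of real defects that were found."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(is_tp, n_gt):
    """Sum of P * delta-R over detections sorted by descending confidence.

    is_tp: list of booleans, True if the detection matched a ground-truth box.
    n_gt:  number of ground-truth boxes for the class.
    """
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, hit in enumerate(is_tp, start=1):
        tp += hit
        r = tp / n_gt
        ap += (tp / rank) * (r - prev_recall)   # precision at this rank times recall step
        prev_recall = r
    return ap

# Hypothetical counts, for illustration only
print(precision(tp=90, fp=10), recall(tp=90, fn=30))
print(average_precision([True, False, True], n_gt=2))
```

Averaging `average_precision` over all defect classes would then give the mAP at that IoU threshold.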

3.3. Ablation Study

The paper designs an improved YOLOv11 network architecture. Specifically, it replaces the last two C3k2 modules with a proposed Multi-Scale Feature Aggregation Module (MSFAM), incorporates the BRA attention mechanism, and utilizes a BiFPN to optimize multi-scale feature fusion. To verify the rationality and effectiveness of the designed network, ablation experiments are conducted to compare the YOLOv11 baseline with various improved schemes, as detailed in Table 2 below.

The ablation table reveals a clear accuracy climb on utility tunnel defect detection as the MSFAM, BRA, and BiFPN are successively added. The original YOLOv11 baseline already offers basic competence (P 82.3%, R 81.3%, and mAP 81.5%), yet false negatives and false positives remain high. Embedding the MSFAM alone lifts all three metrics by ~1 pp (P 83.3%, R 82.4%, and mAP 82.6%), showing that fine cracks and early seepage spots are better captured. Adding BRA on top of the MSFAM yields another ~1 pp gain (P 84.6%, R 83.3%, and mAP 83.5%), confirming the benefit of spatial–channel attention. Simultaneously activating the MSFAM and BRA while retaining the static PAN-FPN increases mAP by 2.0 pp compared with Row-2, demonstrating the complementary benefits between multi-scale backbone features and spatial–channel attention. Introducing the weighted bidirectional feature pyramid without BRA raises accuracy to 86.7%, confirming that dynamic cross-scale fusion significantly reduces localization errors while also reducing computational redundancy. Keeping the original bottleneck and adding only BRA and the BiFPN yields an mAP of 88.9%, indicating that attention mechanisms and weighted fusion can still deliver substantial gains even without modifying the backbone. Finally, combining the MSFAM, BRA, and the BiFPN in the full model drives accuracy into a steep ascent: precision reaches 93.2%, recall 92.4%, and mAP 92.6% (as illustrated in Figure 7, Figure 8 and Figure 9).

Compared with the baseline, mAP improves by 11.1 absolute percentage points, cutting the relative error by roughly 75%; both precision and recall exceed 92%, firmly validating the effectiveness of the proposed enhancements.
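As a quick arithmetic check on the ablation figures, the sketch below computes the absolute mAP gain and, assuming "relative error" is read as the residual 100% − mAP (a definition the text does not state), the fractional reduction in that residual. Under this particular reading the reduction works out to about 60%, so the roughly 75% quoted above presumably rests on a different error definition.

```python
baseline_map = 81.5   # YOLOv11 baseline mAP (%), from the ablation results
improved_map = 92.6   # full model mAP (%)

absolute_gain = improved_map - baseline_map        # in percentage points

# Assumed definition: error = 100 - mAP
baseline_error = 100.0 - baseline_map
improved_error = 100.0 - improved_map
relative_reduction = (baseline_error - improved_error) / baseline_error

print(f"absolute gain: {absolute_gain:.1f} pp")
print(f"residual error: {baseline_error:.1f}% -> {improved_error:.1f}%")
print(f"relative reduction: {relative_reduction:.0%}")
```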

As shown by the curves, during the initial 0–30 epochs, mAP, recall, and precision all rise steeply, indicating that the model quickly learns low-level edge features of cracks and other defects from a random initialization. Around epoch 60 the three curves simultaneously enter a high-slope phase: the bidirectional feature pyramid network (BiFPN) and the Multi-Scale Feature Aggregation Module (MSFAM) now act in concert, fusing fine-grained and semantic information, with recall gaining the most—evidence that micro-cracks and other extremely small targets are being continuously recalled. After roughly 100 epochs all metrics level off: precision converges first, fluctuating by ≤0.3%, which verifies that DIoU loss effectively suppresses bounding box regression errors. Between epochs 100 and 120, mAP and recall improve by only 0.4% and 0.5%, respectively, showing that the network weights are near-optimal and that further training introduces no over-fitting, demonstrating the excellent generalization and robustness of the proposed algorithm.

3.4. Different Models

To further demonstrate the superiority of the proposed method, we benchmark representative detectors (Faster R-CNN, SSD, YOLOv5, YOLOv8, and YOLOv10) on the same utility tunnel defect dataset. The quantitative comparison is summarized in Table 3.

The proposed enhanced YOLOv11 model achieves 93.2% precision, 92.4% recall, and 92.6% mAP@0.5, surpassing the Faster R-CNN, SSD, and all YOLO variants (v5/v8/v10/v11). Against the runner-up YOLOv11, it gains +1.7, +2.1, and +0.8 percentage points, verifying the new architecture’s effectiveness for utility tunnel defect inspection. By integrating the Multi-Scale Feature Aggregation Module (MSFAM) and weighted BiFPN, the network robustly detects 0.5–16 pixel micro-cracks, seepage stains, and step faults under low-light, shadow, and reflective underground conditions. It yields the largest PR curve area and the tightest error bars, demonstrating outstanding cross-scene generalization. The proposed enhancement synergistically integrates lightweight multi-scale aggregation, partial attention, and fast-normalized bidirectional fusion to deliver 92.6% mAP at 17.3 GFLOPs, demonstrating that sub-20 GFLOP complexity is sufficient for real-time, micro-defect-aware inspection on resource-constrained edge devices.

3.5. Detection Results

The improved YOLOv11 algorithm proposed in this paper is evaluated on the self-collected dataset, and the test results are illustrated in Figure 10.

As shown in Figure 10, the proposed method accurately localizes and classifies cracks and holes with high confidence scores. Even under rotated and perspective-distorted augmentations, the improved YOLOv11 still retains complete detection boxes. This robustness across geometrically transformed data verifies the reliability and generalization capability of the proposed algorithm in practical underground gallery inspection scenarios.

To further evaluate the efficacy of the proposed method, we visualize its class-activation heat maps, which were generated with Grad-CAM, in Figure 11. A heat map conveys the magnitude of data in different regions through color depth: it maps the data distribution onto a two- or three-dimensional space and graphically presents its density or intensity. Red represents the highest value, green represents the lowest value, and the remaining values are displayed as a gradient between red and green.

The above heatmaps demonstrate that the proposed improvements progressively enhance both the discriminability and spatial coherence of features. Consequently, the network maintains high activation for minute defects while suppressing responses to false targets in complex underground environments, thereby offering interpretable visual evidence for why precision and recall both exceed 92%.
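For readers unfamiliar with how such heat maps are formed, the core Grad-CAM computation can be sketched as follows. The sketch assumes that the feature maps of one layer and the gradients of a class score with respect to them have already been extracted as `(H, W, K)` arrays; how they are obtained from the network is framework-specific and not shown. The function name `grad_cam` is illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's activations and class-score gradients.

    activations, gradients: arrays of shape (H, W, K).
    Returns an (H, W) map scaled to [0, 1] when it is not all zero.
    """
    # Channel weights: gradients averaged over the spatial dimensions
    weights = gradients.mean(axis=(0, 1))
    # Weighted sum of feature maps, keeping only positive evidence (ReLU)
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is usually upsampled to the input resolution and overlaid on the image with a color scale such as the red-to-green gradient described above.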

4. Conclusions

In this study, we proposed an enhanced YOLOv11 architecture specifically tailored for the automated detection of defects within underground utility tunnels. By systematically integrating a Multi-Scale Feature Aggregation Module, a bidirectional weighted BiFPN, and collaborative attention mechanisms, the model effectively addresses the challenges posed by low illumination, high clutter, and scale variations inherent in tunnel environments. Experimental results demonstrated that the improved YOLOv11 achieves 93.2% precision, 92.4% recall, and 92.6% mAP@0.5, outperforming baseline YOLOv11 and other state-of-the-art detectors such as the Faster R-CNN, SSD, YOLOv5, YOLOv8, and YOLOv10. Ablation studies confirmed that each proposed component contributes incrementally to performance gains, with the full model reducing the relative error by approximately 75% compared to the baseline. Visualization of class activation heatmaps further revealed that the network maintains high sensitivity to minute defects while suppressing false responses to background noise, underscoring its robustness and generalization capability. Through multi-scale feature aggregation, collaborative attention, and efficient fusion, the proposed framework balances accuracy and real-time performance, reaching 92.6% mAP and 56.8 FPS at an edge-friendly 17.3 GFLOPs and pushing micro-crack detection down to 0.5 px, which makes it a reliable and scalable solution for real-time defect inspection in urban infrastructure systems. Nevertheless, extreme fouling, geometric distortions, and material distribution bias still degrade recall and generalization, necessitating future integration of multimodal sensing, self-supervised learning, and model compression to further enhance robustness and universality in complex urban infrastructure.

Author Contributions

Conceptualization, Z.L., W.S., and L.S.; methodology, Z.L., W.S., and L.S.; software, Z.L., W.S., and L.S.; validation, Z.L., W.S., and L.S.; formal analysis, Z.L., W.S., and L.S.; investigation, Z.L., W.S., and L.S.; resources, Z.L., W.S., and L.S.; data curation, Z.L., W.S., and L.S.; writing—original draft preparation, Z.L., W.S., and L.S.; writing—review and editing, Z.L., W.S., and L.S.; visualization, Z.L., W.S., and L.S.; supervision, Z.L., W.S., and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Comparison before and after homomorphic filtering. (a) Utility tunnel defect image under low illumination; (b) Filtered utility tunnel defect image.

Figure 2. The network structure of YOLOv11.

Figure 3. Structure of MSFAM.

Figure 4. BiFPN module structure.

Figure 5. Image acquisition equipment.

Figure 6. Sample dataset display.

Figure 7. The precision of the proposed algorithm.

Figure 8. The recall of the proposed algorithm.

Figure 9. The mAP of the proposed algorithm.

Figure 10. Detection results.

Figure 11. Heatmaps for various defect types.

Table 1. Comparison of different inspection methods.

Method	Technical Principle	Advantages	Disadvantages
Ultrasonic Testing	Ultrasonic pulses are injected into the concrete. When a wave meets a defect, it is partially reflected, refracted, and diffracted; the instrument records the time, amplitude, and frequency of the returning signal to locate, size, and characterize internal flaws.	1. High accuracy and sensitivity; capable of detecting very small internal defects. 2. Fast scanning speed, suitable for large area surveys. 3. Completely non-destructive.	1. Requires experienced operators to correctly interpret ultrasonic signals. 2. Results are difficult to explain for complex geometries or unclear boundary conditions. 3. Flaws parallel to the sound propagation direction are easily missed.
Infrared Thermography	An external heat source creates a surface-temperature map. Defective zones conduct heat differently, so their surface temperature deviates from intact areas; the IR camera captures this deviation to reveal internal flaws.	1. Contact-free, full-field measurement; extremely fast inspection. 2. Wide coverage in a single thermal image. 3. No coupling agent needed; fully non-destructive.	1. Strongly affected by ambient temperature, sunlight, wind, humidity, etc. 2. Heat attenuation with depth makes it insensitive to deep-seated defects. 3. Mainly qualitative or semi-quantitative; difficult to obtain exact depth and size of defects.
Ground-Penetrating Radar	A high-frequency electromagnetic pulse is transmitted into the concrete. Dielectric contrasts (voids, cracks, and reinforcement) reflect part of the energy; the two-way travel time, amplitude, and phase of the reflections are used to reconstruct the internal structure.	1. Rapid continuous profiling; high efficiency. 2. Sensitive to hidden voids, delaminations, and reinforcement layout. 3. Non-contact and non-destructive.	1. Wet conditions or metallic meshes create strong clutter and reduce signal-to-noise ratio. 2. Image interpretation relies heavily on experience; complex defects are hard to identify. 3. Resolution limited by antenna frequency and material properties; small deep flaws may be missed.
CCTV	A crawler-mounted HD camera is driven through the pipe, streaming video via a multi-core cable to an operator who visually identifies cracks, joint displacements, infiltration, corrosion, etc.	1. Intuitive visual evidence; entire survey can be recorded and archived. 2. Remote zoom, pan and tilt allow detailed re-examination of suspect areas. 3. Mature technology with relatively low equipment cost.	1. Requires flow stoppage, plugging, dewatering, washing, and desilting—long preparation time. 2. Insensitive to defects below the water level, deep within the wall, or minor seepage. 3. Manual interpretation is slow and subjective; small defects are easily overlooked.

Table 2. Comparison of results from ablation studies.

No.	Baseline	MSFAM	BRA	BiFPN	Precision	Recall	mAP	FPS	GFLOPs
1	✓				0.823	0.813	0.815	40.2	18.9
2	✓	✓			0.833	0.824	0.826	43.7	18.5
3	✓		✓		0.846	0.831	0.835	41.9	19.1
4	✓	✓	✓		0.857	0.842	0.846	44.5	18.7
5	✓	✓		✓	0.876	0.864	0.867	52.3	17.8
6	✓		✓	✓	0.891	0.885	0.889	60.1	17.5
7	✓	✓	✓	✓	0.932	0.924	0.926	68.4	17.3
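
To make the relative-error comparison concrete, the sketch below recomputes an error reduction from rows 1 and 7 of Table 2. It assumes that "error" means one minus the score and that the reduction is taken relative to the baseline error; the paper does not define the term in this section, so both the definition and the helper name are assumptions.

```python
def relative_error_reduction(baseline_score, improved_score):
    """Fraction of the baseline error (1 - score) removed by the improved model."""
    baseline_error = 1.0 - baseline_score
    improved_error = 1.0 - improved_score
    return (baseline_error - improved_error) / baseline_error


# (baseline, full model) pairs copied from rows 1 and 7 of Table 2.
ablation = {
    "precision": (0.823, 0.932),
    "recall": (0.813, 0.924),
    "mAP": (0.815, 0.926),
}

reductions = {
    metric: relative_error_reduction(base, full)
    for metric, (base, full) in ablation.items()
}

for metric, value in reductions.items():
    print(f"{metric}: {value:.1%} of the baseline error removed")
```

Under this assumed definition the Table 2 values give reductions of roughly 59% to 62%, so the approximately 75% quoted in the text presumably rests on a different definition or baseline that is not given here.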

Table 3. Comparison of test results across different models.

Method	Precision	Recall	mAP	FPS	GFLOPs
Faster R-CNN	0.835	0.792	0.804	21.2	36.8
SSD	0.821	0.786	0.825	25.9	31.2
YOLOv5	0.862	0.834	0.868	35.1	28.4
YOLOv8	0.887	0.859	0.903	42.5	26.1
YOLOv10	0.905	0.881	0.892	48.6	23.9
YOLOv11	0.915	0.903	0.918	52.4	19.8
Our algorithm	0.932	0.924	0.926	56.8	17.3
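
One way to read Table 3 as an accuracy versus cost trade-off is a simple dominance check: a model is dominated if another model is at least as good on mAP, FPS, and GFLOPs and strictly better on at least one. The sketch below uses the values from Table 3; the dominance criterion and the function names are illustrative assumptions, not part of the authors' evaluation protocol.

```python
# (mAP, FPS, GFLOPs) per model, copied from Table 3.
MODELS = {
    "Faster R-CNN": (0.804, 21.2, 36.8),
    "SSD": (0.825, 25.9, 31.2),
    "YOLOv5": (0.868, 35.1, 28.4),
    "YOLOv8": (0.903, 42.5, 26.1),
    "YOLOv10": (0.892, 48.6, 23.9),
    "YOLOv11": (0.918, 52.4, 19.8),
    "Our algorithm": (0.926, 56.8, 17.3),
}


def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one.

    Higher mAP and FPS are better; lower GFLOPs is better.
    """
    a_map, a_fps, a_gflops = a
    b_map, b_fps, b_gflops = b
    no_worse = a_map >= b_map and a_fps >= b_fps and a_gflops <= b_gflops
    better = a_map > b_map or a_fps > b_fps or a_gflops < b_gflops
    return no_worse and better


def pareto_front(models):
    """Names of the models that no other model dominates."""
    return [
        name
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    ]


front = pareto_front(MODELS)
print(front)
```

With the Table 3 values, the proposed model is the only entry left on the front, since it has the highest mAP and FPS and the lowest GFLOPs of the seven detectors.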

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Z.; Shi, W.; Sun, L. Pipeline Defect Detection Based on Improved YOLOv11. Processes 2026, 14, 530. https://doi.org/10.3390/pr14030530

