1. Introduction
1.1. Context and Motivation
Urban wastewater infrastructure represents one of the most critical yet vulnerable components of modern city systems worldwide. With over 4.4 billion people living in urban areas today and this number projected to reach 6.7 billion by 2050 [1], the sustainable management of sewer networks has become paramount for public health, environmental protection, and economic stability [2]. These underground lifelines face mounting pressures from aging infrastructure, climate change, and rapid urbanization, leading to deterioration patterns that threaten service continuity and environmental safety [3,4]. Contemporary research demonstrates that approximately 56% of wastewater in major river basins leaks without treatment due to infrastructure inadequacy, while sewage overflow events have increased significantly as infrastructure fails to keep pace with demand [5,6].
Istanbul exemplifies these challenges at megacity scale, operating an extensive wastewater network of approximately 18,000 km of sewer pipelines, 1220 km of collectors, 201 km of tunnels, and 90 wastewater treatment plants, amounting to nearly 1.37 million individual sewer pipes [7]. Like metropolitan areas globally, and under increasing population density, this infrastructure is exposed to physical, chemical, and operational stresses that lead to deterioration over time. Factors such as corrosion, soil movement, overloading, and insufficient maintenance result in structural defects, including cracks, breaks, collapses, leaks, and joint displacements, as well as operational issues such as root intrusion and blockages [8,9,10,11]. These failures can reduce hydraulic capacity, cause untreated wastewater to infiltrate surface or groundwater, and ultimately create severe environmental, health, and economic consequences. Moreover, climate change and the rising frequency of extreme rainfall events place additional stress on urban drainage systems, further exacerbating infrastructure failures.
Since the construction and rehabilitation of sewer networks involve high capital investments, proactive inspection and timely maintenance are essential. Conventional inspection practices rely primarily on Closed-Circuit Television (CCTV) systems, where remotely operated cameras are deployed into sewer lines to capture video footage for manual assessment [12]. While widely used, this process is labor-intensive, time-consuming, and highly dependent on the operator’s expertise. In practice, limitations such as high operator workload, insufficient training, overlooked defects, and subjective interpretation reduce the accuracy and reliability of CCTV-based inspections [12,13,14,15,16]. Additional challenges include poor image quality due to moisture, deposits, or low lighting, as well as the complexity of detecting multiple defects simultaneously.
In response to these challenges, automated inspection systems leveraging artificial intelligence (AI), particularly deep learning (DL), have emerged as a transformative solution. The application of DL in this domain has evolved through distinct phases, each overcoming previous limitations while introducing new challenges.
1.2. Related Work
The first wave of research focused on adapting established convolutional neural network (CNN) architectures for sewer defect detection. This era was characterized by two parallel tracks: object detection models like Faster R-CNN and YOLO were deployed to localize defects [17,18,19], while semantic segmentation architectures like U-Net and its variants were explored for pixel-level delineation of cracks and other flaws [20,21]. This period also saw the emergence of more advanced instance segmentation frameworks, such as Mask R-CNN and custom solutions like PipeSolo, which combined detection and segmentation for finer-grained analysis [22,23]. While these studies proved the concept and achieved significant accuracy improvements over manual methods, they were often limited by relatively small, homogeneous datasets focusing on a narrow set of high-prevalence defects.
Driven by the need for higher accuracy and robustness, a second wave of research refined CNN-based detectors with architectural and loss-function improvements tailored to the challenging conditions of sewer CCTV imagery. For example, Spatial Pyramid Pooling (SPP) and enhanced IoU losses (DIoU/CIoU) were incorporated into YOLOv4-style pipelines to better handle scale variation and stabilize bounding-box regression [24]. At the same time, single-shot detectors were strengthened by multi-scale receptive-field modules and channel/coordinate attention mechanisms, approaches that improve small-defect sensitivity and suppress background noise [16]. Other studies combined receptive-field blocks (RFB) with focal or class-balanced losses to mitigate class imbalance and emphasize rare operational defects, while attention modules such as CBAM or ECA were used to highlight defect-related features within cluttered scenes [25]. These targeted interventions, such as multi-scale fusion, attention, and improved loss formulations, have produced notable gains in controlled evaluations (several works report mAP values exceeding ~80% on focused defect subsets), demonstrating that careful CNN engineering can substantially raise detection performance on real-world sewer data [24,26,27].
The most recent wave in sewer-inspection research is characterized by the adoption of transformer-based and hybrid CNN–transformer architectures, which exploit global attention and improved multi-scale feature extraction to handle complex, cluttered CCTV scenes. For example, DefectTR demonstrated an end-to-end DETR-based pipeline for sewer defects (avoiding anchor design and NMS) and showed competitive localization accuracy on operational footage [28]. Hybrid models that fuse convolutional encoders with transformer modules (e.g., PipeTransUNet) have been proposed for semantic segmentation and severity quantification, improving pixel-level delineation of defects [29,30]. Composite approaches that pair Swin-Transformer backbones with multi-stage detection heads (for example, Cascade R-CNN) have been successfully applied to sewer-defect detection and related small-object tasks, reporting consistent improvements in localization and mean average precision over CNN-only baselines [31,32].
To address fundamental data limitations, researchers have paired transformer adoption with data-centric strategies: GAN-/StyleGAN-based augmentation pipelines and synthetic image generation have been used to expand defect classes and balance rare categories, while few-shot or mask-guided generation approaches help produce realistic defect samples from limited examples [33,34,35]. At the same time, lightweight real-time transformer variants (e.g., RT-DETR and its enhancements) are being explored to bring transformer performance to practical, live inspection settings [36].
However, despite this rapid progression and increasing model sophistication, a critical analysis reveals persistent barriers to robust, real-world deployment:
The Data Scarcity Paradox: While large-scale datasets like Sewer-ML exist [37], their utility is limited by task design. Sewer-ML is restricted to image-level classification and does not provide bounding box annotations, making it unsuitable for training and benchmarking object detection models. Moreover, there is a pronounced focus on a narrow set of structural defects (e.g., cracks, breaks), whereas operational defects with significant maintenance implications, such as root intrusions, attached deposits, and hardened deposits, are frequently overlooked or severely underrepresented [17,38]. This underscores the need for detection-ready datasets such as the ISWDS, which explicitly address both structural and operational defects.
The Robustness Gap: Performance often drops significantly under real-world conditions not seen in the training data. Challenges like extreme occlusion, water obscuration, variable lighting, and lens distortion remain major obstacles [39].
The Translation Gap: The field is dominated by frame-by-frame image analysis, with limited research on temporal modeling for video sequences, which could leverage context across frames to improve accuracy and efficiency. Furthermore, the step from a high-performing model to a validated, user-friendly tool integrated into asset management workflows (e.g., GIS systems) is rarely taken [28,40].
1.3. Positioning and Contributions
It is against this background of advanced yet contextually limited models that our study is positioned, aiming to bridge dataset diversity, real-time applicability, and integration into operational workflows. To address these limitations, this study proposes a comprehensive deep learning-based framework for automated defect detection in sewer inspection. The contributions of this work are threefold:
Development of the Istanbul Sewer Defect Dataset: a novel dataset comprising 13,491 images covering eight major defect classes (roots, cracks, breaks, collapses, displaced joints, settled deposits, leakage, and attached deposits). In contrast to Sewer-ML, which, while large-scale, is primarily built on European standards and focuses more heavily on structural defects, the Istanbul dataset provides a more balanced coverage of both structural and operational defects. Classes such as settled deposits, leakage, and attached deposits, often underrepresented in previous datasets, are included to better reflect the practical challenges faced by sewer operators. Based on the Istanbul Water and Sewage Administration (ISKI)’s geospatial database, these eight categories account for approximately 90% of all reported defects. This makes the dataset both representative and operationally grounded.
Comparative evaluation of state-of-the-art deep learning models: We benchmark models including YOLOv8, YOLOv11, YOLOv12, and the transformer-based RT-DETR-v1 and RT-DETR-v2, on the newly introduced Istanbul Sewer Defect Dataset. To the best of our knowledge, this is the first study to systematically benchmark YOLO and RT-DETR side by side on real sewer inspection footage, offering comparative insights into their strengths across structural and operational defect types.
Practical integration for automated inspection: We discuss how the proposed models can support real-time decision-making during field operations and reduce operator dependence. Furthermore, we integrate the best-performing model into a QGIS 3.34-based graphical user interface, enabling seamless geo-referenced defect logging and providing a user-friendly tool for inspection teams, demonstrating a direct path to field deployment and proactive infrastructure asset management.
While YOLO variants and transformer-based detectors such as RT-DETR have individually been applied in various domains, no systematic benchmarking of the two families has been conducted in the context of sewer defect detection. Our work provides the first such comparative analysis, highlighting trade-offs between accuracy and computational efficiency that are essential for practical deployment. This comparison is of high practical relevance: sewer inspection requires both lightweight, real-time models suitable for edge devices and high-accuracy transformer models suitable for server-based processing. By providing the first systematic evaluation of YOLO and RT-DETR in this domain, the study delivers critical insights into model selection strategies for real-world sewer inspection workflows.
In addition, this study contributes to the literature in several ways that extend beyond conventional benchmarking. It provides the first systematic comparison of CNN-based (YOLO) and transformer-based (RT-DETR) detectors on real sewer inspection data, a dimension that has not been addressed in previous research. The analysis incorporates class-wise performance metrics together with statistical significance testing, which strengthens the robustness and reproducibility of the findings. The study also outlines practical deployment pathways by demonstrating that YOLO variants are suitable for embedded, on-site inspection devices, whereas RT-DETR achieves higher detection accuracy for centralized, server-based processing. Moreover, to foster transparency and facilitate future research, the trained models have been made publicly available on Hugging Face, enabling independent validation and extension.
2. Materials and Methods
2.1. Data Collection
2.1.1. Istanbul Sewer Defect Dataset
The Istanbul Sewer Defect Dataset (ISWDS) is a novel, expert-curated dataset developed to address the need for a comprehensive, region-specific benchmark for automated sewer defect detection. The imagery was collected and annotated according to the principles of the EN 13508-2:2003+A1:2011 standard [41], which provides a systematic coding system for the objective and comparable assessment of sewer infrastructure conditions across European member states.
The dataset comprises CCTV inspection videos from the extensive wastewater network of Istanbul, provided by ISKI. Data was collected between 2021 and 2024 across all 39 districts of the city, ensuring wide geographical and operational diversity. The videos were captured under varying conditions, including different times of day, weather, water levels, and lighting, using a variety of robotic CCTV systems (Figure 1). Approximately 95% of the images were acquired using cameras capable of recording up to 30 FPS in Full HD resolution (1920 × 1080). The raw video footage exhibited a wide range of resolutions and frame rates, reflecting the heterogeneity of real-world inspection equipment.
To construct the dataset, frames were meticulously extracted from hours of video. A key challenge addressed during curation was the variable speed of the inspection crawlers. While the recommended maximum inspection speed is 0.25 m/s, operational practices often involve faster speeds (0.35–0.50 m/s), leading to motion blur and reduced image quality. Frames with excessive blur or poor clarity were manually excluded by a team of two expert annotators to ensure the dataset’s quality.
The annotation process was conducted using the open-source AnyLabeling v0.3.3 tool over a period of approximately four months. The experts labeled images across eight critical defect classes, focusing on a single primary defect per image to ensure label clarity. The classes were selected to provide a balanced coverage of both structural and operational defects, as defined by the EN 13508-2 standard, and reflect the most common and critical failure modes in Istanbul’s infrastructure. All annotations were performed in the YOLO format, using bounding boxes to localize each defect instance. The final class distribution is presented in Table 1, and the dataset preparation workflow is shown in Figure 2.
The ISWDS is characterized by its real-world complexity, including challenges such as turbid water, occlusions, lens distortions, and uneven illumination. Unlike Sewer-ML, which is larger but limited to image-level classification and heavily focused on structural defects, ISWDS provides bounding box annotations and a more balanced representation of operational defects such as attached deposits and settled deposits, which are critical in Istanbul’s network. To ensure privacy and comply with data regulations, all text overlays and operator information within the images were automatically detected using the EasyOCR 1.7.1 library (with >95% accuracy) and blurred via a Gaussian filter implemented with OpenCV 4.12.0.88.
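As a rough illustration, the anonymization step can be sketched as follows; the EasyOCR and OpenCV calls match their public APIs, while the confidence threshold and blur kernel size shown are illustrative choices rather than the study’s exact settings:

```python
# Sketch: detect text overlays with EasyOCR and blur them with OpenCV.
import cv2
import easyocr

reader = easyocr.Reader(["en"])  # loads detection + recognition models once

def anonymize_frame(image_path, conf_thresh=0.3):
    img = cv2.imread(image_path)
    # readtext returns (corner points, recognized string, confidence) triples
    for bbox, text, conf in reader.readtext(img):
        if conf < conf_thresh:
            continue
        xs = [int(p[0]) for p in bbox]
        ys = [int(p[1]) for p in bbox]
        x1, x2, y1, y2 = min(xs), max(xs), min(ys), max(ys)
        if x2 > x1 and y2 > y1:
            # Gaussian blur applied only to the detected text region
            img[y1:y2, x1:x2] = cv2.GaussianBlur(img[y1:y2, x1:x2], (51, 51), 0)
    return img
```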
The dataset is not naturally balanced, reflecting the inherently imbalanced distribution of defects in real sewer systems. For instance, cracks and attached deposits are more prevalent than roots or infiltrations. This intentional preservation of class imbalance supports the development of models that are robust to real-world operational conditions. The final dataset consists of 13,491 high-quality, annotated images ready for model development. A sample of annotated images from the ISWDS, showcasing the variety of defects and conditions, is presented in Figure 3.
All image data is the property of ISKI, and official permission for its use in this research was obtained. Due to the sensitive nature of public infrastructure data and privacy agreements, the ISWDS is a closed dataset for internal research use and is not publicly distributed.
2.1.2. Comparison with Existing Datasets
Several datasets have been proposed for sewer defect analysis and are summarized in Table 2, ranging from small-scale collections [18,38,42,43,44,45,46,47] to the large-scale Sewer-ML dataset [37]. Most of these datasets suffer from small sample sizes, a lack of diverse defect classes, or missing annotations required for object detection. For example, Ye et al. [42] provided only 1045 defective images across seven classes, while Myrans et al. [43] reported 2260 samples distributed over 13 classes. Similarly, Chen et al. [44], Li et al. [45], Kumar et al. [18], and Xie et al. [47] produced datasets with limited class diversity and imbalance, often focusing on structural rather than operational defects.
Sewer-ML [37], released as a large-scale public dataset, contains more than 1.3 million images (609,479 defective, 690,722 normal) across 17 classes. While it represents the largest benchmark to date and supports multi-label classification, its annotations are restricted to image-level labels without bounding boxes, limiting its suitability for object detection tasks. Furthermore, although Sewer-ML covers 17 defect categories, operational defects of practical significance (e.g., deposits, leakage, root intrusion) remain underrepresented.
In contrast, the Istanbul Sewer Defect Dataset (ISWDS) introduced in this study provides 13,491 images with bounding-box annotations for eight defect classes that collectively represent approximately 90% of real-world failures recorded in Istanbul’s wastewater network. The ISWDS ensures a more balanced representation of both structural and operational defects, offering a practically grounded benchmark for object detection models and real-world deployment.
2.2. Model Training
2.2.1. Deep Learning Architectures
To evaluate the performance of the proposed Istanbul Sewer Defect Dataset (ISWDS), a comprehensive comparative analysis was conducted using state-of-the-art object detection architectures from two families. The selection criteria prioritized models renowned for their high accuracy, computational efficiency, and relevance to real-time inspection tasks. The chosen models represent the current evolution of deep learning for object detection: the latest iterations of the well-established YOLO family, known for their speed-accuracy trade-off, and a modern real-time transformer-based detector, RT-DETR, which offers an end-to-end approach. The following subsections provide a concise overview of each model’s fundamental architecture and its key innovations.
2.2.2. You Only Look Once (YOLO) Architecture Family
The YOLO (You Only Look Once) family of architectures represents a cornerstone of modern, single-stage (single-forward-pass) object detection, renowned for its exceptional balance between speed and accuracy, making it ideal for real-time applications such as sewer inspection. The core YOLO principle involves dividing an input image into an S × S grid, where each cell simultaneously predicts the probability of an object’s presence and the coordinates of its bounding box. This unified approach to classification and localization reduces computational latency and leverages contextual information more effectively than traditional two-stage detectors.
Architecturally, YOLO models are generally composed of three primary components:
Backbone: A convolutional neural network (e.g., CSPDarknet) responsible for extracting multi-scale feature maps from the input image.
Neck: A module (e.g., PANet, BiFPN) that aggregates and fuses these features from different layers to enhance the representation of objects across scales.
Head: The final component that performs the simultaneous classification of objects and regression of their bounding box coordinates.
This study employs three distinct iterations of the YOLO architecture to provide a comprehensive benchmark across model generations.
YOLOv8. Developed by Ultralytics (Frederick, MD, USA) in 2023, YOLOv8 represents a mature and widely adopted architecture that serves as our baseline model. The architecture introduces several key innovations that distinguish it from previous YOLO versions. The model implements anchor-free detection, which eliminates pre-defined anchor boxes and instead predicts object centers directly while regressing to bounding box dimensions. This approach simplifies the detection pipeline and reduces the need for hyperparameter tuning related to anchor configuration. The backbone utilizes CSPDarknet53, which incorporates Cross Stage Partial connections to improve gradient flow while reducing computational cost compared to traditional darknet architectures. For multi-scale feature fusion, YOLOv8 employs a Path Aggregation Network (PAN) neck that implements bottom-up path augmentation to enhance feature fusion across different scales, enabling better detection of objects at various sizes. The detection head uses a decoupled design that separates classification and regression tasks into distinct head branches, which improves training stability and allows for independent optimization of each task. For bounding box regression, the model employs Complete IoU (CIoU) loss, which incorporates distance, overlap, and aspect ratio considerations to achieve more accurate bounding box predictions compared to traditional IoU-based losses. The model is available in five scales (n, s, m, l, x) with parameter counts ranging from 3.2 M to 68.2 M parameters, allowing for flexible deployment across different computational constraints.
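For reference, the CIoU loss mentioned above takes its standard form, adding a center-distance term and an aspect-ratio consistency term to the IoU objective:

```latex
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU}
  + \frac{\rho^{2}(\mathbf{b}, \mathbf{b}^{gt})}{c^{2}} + \alpha v,
\qquad
v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v},
```

where ρ(·) is the Euclidean distance between the predicted and ground-truth box centers b and b^gt, and c is the diagonal length of the smallest box enclosing both.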
YOLOv11. Released in 2024 by Ultralytics, YOLOv11 introduces several architectural refinements over YOLOv8 that enhance both performance and efficiency. The architecture replaces standard C3 modules with C3k2 blocks that incorporate selective kernel mechanisms, allowing for adaptive receptive field adjustment based on input characteristics. This modification enables the network to dynamically adapt its feature extraction capabilities to different object scales and contexts. The model also improves upon the Spatial Pyramid Pooling Fast (SPPF) module by incorporating additional pooling scales, which provides better multi-scale feature extraction capabilities essential for detecting objects of varying sizes in complex sewer environments. A notable innovation is the integration of C2PSA (C2 Partial Self-Attention) modules that incorporate lightweight self-attention mechanisms in deeper network layers. These attention modules enable the capture of long-range dependencies while maintaining computational efficiency, allowing the model to better understand spatial relationships across the entire image. Additionally, YOLOv11 implements an improved data augmentation pipeline that utilizes MixUp, CutMix, and Mosaic augmentations with adaptive scheduling during training, which enhances the model’s robustness to various imaging conditions commonly encountered in sewer inspection scenarios. The architectural changes result in improved accuracy–efficiency trade-offs, particularly for small object detection, while maintaining similar parameter counts to YOLOv8.
YOLOv12. Released in early 2025, YOLOv12 represents an experimental evolution that incorporates more advanced attention mechanisms than its predecessors. The architecture implements Efficient Local Attention Networks (ELAN) that provide efficient attention modules designed to balance local and global feature interactions, enabling better contextual understanding while maintaining computational efficiency. The model incorporates RepVGG-style blocks that utilize structural re-parameterization during inference to reduce computational overhead while maintaining training-time expressiveness, allowing for a more efficient inference pipeline without sacrificing the model’s learning capacity during training. Another key innovation is the adaptive feature pyramid that dynamically adjusts feature pyramid weights based on input characteristics, enabling the network to automatically optimize feature fusion for different types of input images. YOLOv12 also includes enhanced augmentation strategies that incorporate advanced photometric and geometric augmentations specifically tuned for complex visual scenarios, which should theoretically improve performance in challenging conditions such as those encountered in sewer inspection environments. However, it is important to note that YOLOv12 showed mixed results in our experiments, with the largest variant (YOLOv12x) underperforming compared to other models in the series, suggesting that the architectural innovations may require further refinement or different training strategies to achieve their full potential.
For this study, all YOLO variants were initialized with pre-trained weights and trained on the same dataset using an identical transfer learning strategy to ensure a fair and objective comparison of their inherent architectural capabilities.
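A minimal sketch of this shared transfer-learning setup using the Ultralytics API is shown below; the dataset YAML name, model scale, and epoch count are placeholders rather than the study’s exact configuration:

```python
# Sketch: fine-tune a COCO-pretrained YOLO checkpoint on the ISWDS annotations.
from ultralytics import YOLO

model = YOLO("yolov8l.pt")   # pre-trained weights, as described above
model.train(
    data="iswds.yaml",       # hypothetical config listing paths and 8 class names
    imgsz=640,
    epochs=100,
    optimizer="AdamW",
    lr0=0.01,
    cos_lr=True,             # cosine annealing learning rate schedule
)
```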
2.2.3. Real Time Detection Transformer (RT-DETR)
RT-DETR, developed by Baidu (Beijing, China) in 2023 [48], is an optimized variant of the DETR (DEtection TRansformer) architecture, specifically designed for efficient, end-to-end object detection in real-time applications. Its most significant advantage over conventional detectors is the complete elimination of the Non-Maximum Suppression (NMS) post-processing step, which reduces computational complexity and inference latency.
RT-DETR v1. RT-DETR v1 consists of four key components:
HGNet-v2 backbone: A hybrid CNN backbone that combines depthwise separable convolutions with residual connections for efficient feature extraction.
Hybrid encoder: Implements both intra-scale and cross-scale attention mechanisms using attention-based Intrascale Feature Interaction (AIFI) and Cross-scale Feature Fusion (CCFF) modules.
Transformer decoder: Uses 6 decoder layers with learnable object queries (300 queries by default) that attend to encoder features.
Prediction heads: Separate classification and regression heads that output final predictions without requiring NMS.
The architecture incorporates several key innovations that distinguish it from traditional detection methods. RT-DETR v1 implements uncertainty-guided query selection that improves query initialization by leveraging prediction uncertainty, allowing the model to focus computational resources on the most informative regions of the image. The model also employs an IoU-aware classification loss function that combines classification confidence with localization quality, ensuring that high classification scores correspond to accurate bounding box predictions. Additionally, the architecture utilizes efficient attention mechanisms that reduce attention complexity through separable attention mechanisms, significantly improving computational efficiency while maintaining the model’s ability to capture long-range spatial dependencies across the entire image.
RT-DETR v2. RT-DETR v2, released in 2024, introduces several improvements over v1:
Enhanced backbone options: Supports both HGNet-v2 and ResNet backbones with optimized feature extraction.
Dynamic query selection: Implements learnable query initialization that adapts based on input image characteristics.
Improved multi-scale fusion: Uses deformable attention mechanisms in the encoder for better feature alignment.
Optimized training strategy: Incorporates knowledge distillation and progressive resizing during training.
The key differences between RT-DETR v1 and v2 center on several architectural and training improvements. RT-DETR v2 uses more sophisticated query selection mechanisms that better adapt to input characteristics, and it incorporates enhanced multi-scale attention in the encoder for improved feature alignment across scales. The newer version also benefits from improved training protocols with better convergence properties, delivering higher accuracy at the cost of slightly higher computational requirements. Both RT-DETR variants eliminate the need for anchor generation and NMS post-processing, making them truly end-to-end trainable and deployable for real-time applications.
RT-DETR architectures offer several distinct advantages over traditional detection methods. The models provide end-to-end optimization that allows direct optimization of final detection metrics without requiring intermediate processing steps, which simplifies the training pipeline and reduces potential sources of error. They also offer flexible inference speed capabilities, as the number of object queries can be adjusted at inference time to balance speed versus accuracy according to application requirements. The global attention mechanisms inherent in the transformer architecture enable better handling of crowded scenes by helping to detect overlapping objects that might be missed by local feature-based methods. Additionally, RT-DETR models demonstrate consistent performance and are less sensitive to hyperparameter tuning compared to anchor-based methods, making them more robust across different deployment scenarios.
However, the transformer-based architecture also introduces certain limitations that must be considered. The attention mechanisms impose higher memory requirements during training, which may limit deployment on resource-constrained hardware. RT-DETR models typically require longer training times, as transformer architectures generally need more epochs to converge than convolutional networks. Furthermore, there are currently fewer pre-trained models available for RT-DETR variants compared to the extensive ecosystem of YOLO checkpoints, which may limit transfer learning for specialized applications.
This study utilizes RT-DETR-v1 and RT-DETR-v2, which build upon these core efficiency-oriented design principles. All models, including RT-DETR, were initialized with weights pre-trained on the COCO dataset and fine-tuned on the sewer defect dataset to ensure a fair comparison and leverage the benefits of transfer learning.
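For symmetry with the YOLO sketch above, RT-DETR can also be fine-tuned through the Ultralytics wrapper as shown below; note that the experiments reported here used the PaddlePaddle implementation, so this interface is illustrative only, and the dataset YAML and epoch count remain placeholders:

```python
# Sketch: fine-tune a COCO-pretrained RT-DETR checkpoint (illustrative interface).
from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")            # COCO-pretrained RT-DETR weights
model.train(data="iswds.yaml", imgsz=640, epochs=100, lr0=0.0001)
results = model("inspection_frame.jpg")  # end-to-end inference; no NMS step needed
```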
2.3. Evaluation Metrics
The performance of all trained models was rigorously evaluated on a held-out test set using a comprehensive suite of standard object detection metrics. These metrics were calculated for each defect class individually and averaged across all classes to provide a holistic view of model performance, highlighting specific strengths and weaknesses.
The primary metrics used for comparison are defined as follows:
Precision (P) measures the model’s ability to avoid false positives, representing the proportion of correctly identified defects among all predicted defects.
Recall (R), or True Positive Rate, measures the model’s ability to find all true defects, representing the proportion of actual defects that were correctly detected.
F1-Score is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two values. It is particularly valuable for evaluating performance on imbalanced datasets.
Average Precision (AP) is calculated for each class as the area under the precision-recall curve, integrating performance across all confidence thresholds. mean Average Precision (mAP) is the primary metric for overall model accuracy. We report two variants:
mAP@0.5: The mean AP calculated at a single Intersection over Union (IoU) threshold of 0.5. This is a common benchmark but represents a loose localization criterion.
mAP@0.5:0.95: The mean AP averaged over multiple IoU thresholds, from 0.5 to 0.95 in steps of 0.05. This is a stricter, more comprehensive metric that heavily penalizes inaccurate bounding box predictions, making it the gold standard for object detection challenges like COCO.
Inference Speed is measured in Frames Per Second (FPS) to assess the model’s suitability for real-time sewer inspection video processing applications. Model Size is reported in terms of the number of parameters and file size (MB), which is critical for evaluating deployment feasibility on hardware with limited computational resources.
For a complete picture of the training process, auxiliary metrics such as training and validation losses (box loss, classification loss, distribution focal loss) were also monitored to diagnose potential overfitting and ensure convergence.
Furthermore, for all IoU-based metrics, a detection was considered correct if the predicted bounding box overlapped with the ground truth by at least 0.5 IoU. To better assess localization performance for small-scale defects such as cracks or partial pipe damage, higher IoU thresholds were also analyzed, highlighting the model’s precision in localizing fine-grained defects.
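In symbols, with TP, FP, and FN denoting true positives, false positives, and false negatives, p(r) the precision-recall curve, N the number of classes, and B_pred, B_gt the predicted and ground-truth boxes, the metrics above reduce to the standard definitions:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R},

AP = \int_{0}^{1} p(r)\, dr, \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i, \qquad
IoU = \frac{\lvert B_{pred} \cap B_{gt} \rvert}{\lvert B_{pred} \cup B_{gt} \rvert}.
```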
3. Results
This section details the experimental setup and reports the results obtained from YOLO-family models and RT-DETR on the ISWDS dataset. The results are analyzed using the evaluation metrics described earlier, with particular attention to differences across defect categories and model families.
3.1. Experimental Setup
To ensure a fair and reproducible comparison, all models were trained and evaluated under a consistent experimental framework. This subsection details the implementation environment, training configurations, dataset handling, and loss functions used, establishing the foundation for the subsequent performance analysis.
The YOLO variants were implemented using the PyTorch deep learning framework (v2.0.1) within the Ultralytics ecosystem, while RT-DETR was implemented in the PaddlePaddle framework. The choice of separate ecosystems reflects the respective maturity and optimized implementations available for each architecture. Experiments were conducted on a system running Ubuntu 22.04 with CUDA 11.7 and cuDNN 8.5.0. The hardware consisted of an Intel Core i9 processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM, and an NVIDIA GeForce RTX 3070 Ti GPU (8 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA).
The core hyperparameters for training are summarized in Table 3. All models were trained from weights pre-trained on the MS COCO dataset to leverage transfer learning. The input image size was standardized to 640 × 640 pixels for all architectures. The batch size was adjusted for each model within a range of 7 to 16 to maximize GPU memory utilization without causing out-of-memory errors. The AdamW optimizer was used for all models except YOLOv12, which was trained with SGD to align with its recommended default configuration. A cosine annealing learning rate scheduler was used, starting from an initial learning rate (lr0) of 0.01 for the YOLO models and 0.0001 for the RT-DETR models, and decaying by a final learning rate factor (lrf) of 0.01.
The Istanbul Sewer Defect Dataset (ISWDS) was randomly split into training (70%), validation (20%), and test (10%) sets. Importantly, this split was performed at the video level rather than the frame level to prevent data leakage. Since the dataset was constructed from video recordings, individual frames were extracted and used as images. Typically, only a single representative frame was selected from each video for a given defect type in order to avoid redundancy. This strategy ensured that no frames originating from the same video appeared across different subsets, thereby preserving the independence of training, validation, and testing data.
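A minimal sketch of such a video-level split is given below; the frame-naming convention used to recover the parent video ID is hypothetical, and the ratios follow the 70/20/10 split described above:

```python
# Sketch: split frames into train/val/test by parent video to avoid leakage.
import random
from collections import defaultdict

def split_by_video(image_paths, ratios=(0.7, 0.2, 0.1), seed=42):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path in image_paths:
        # frames from the same video share a prefix (assumed convention:
        # <videoID>_<frameNo>.jpg, hypothetical for illustration)
        video_id = path.rsplit("/", 1)[-1].split("_")[0]
        groups[video_id].append(path)
    videos = sorted(groups)
    rng.shuffle(videos)
    cut1 = int(ratios[0] * len(videos))
    cut2 = int((ratios[0] + ratios[1]) * len(videos))
    train = [p for v in videos[:cut1] for p in groups[v]]
    val = [p for v in videos[cut1:cut2] for p in groups[v]]
    test = [p for v in videos[cut2:] for p in groups[v]]
    return train, val, test
```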
A comprehensive set of data augmentation techniques was applied in real time during training to improve model robustness and mitigate overfitting. These augmentations simulate the wide variability of real-world sewer inspection conditions, ranging from lighting inconsistencies to occlusions; a configuration sketch follows the list. Specifically:
Geometric transformations: Horizontal flipping (probability = 0.5), rotation (±10°), and translation.
Photometric transformations: Adjustments to brightness, contrast, saturation, and hue (color jitter).
Advanced techniques: Mosaic augmentation (stitching four images together) and CutOut (randomly masking out rectangular sections of the image).
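As referenced above, these augmentations could be configured through Ultralytics training overrides roughly as follows; the HSV jitter values shown are library defaults used for illustration, and CutOut-style masking is not a built-in detection augmentation, so it would require a custom transform:

```python
# Sketch: real-time augmentation overrides corresponding to the list above.
from ultralytics import YOLO

YOLO("yolov8l.pt").train(
    data="iswds.yaml",
    fliplr=0.5,       # horizontal flip probability (as stated above)
    degrees=10.0,     # random rotation within ±10°
    translate=0.1,    # random translation fraction
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # photometric color jitter
    mosaic=1.0,       # stitch four images together every batch
)
```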
The YOLO models utilized a composite loss function consisting of bounding box regression loss (CIoU), objectness loss, and classification loss. The RT-DETR model employed a set prediction loss, which uses the Hungarian algorithm for optimal bipartite matching between predictions and ground truth. This loss combines Focal Loss for classification and a combination of L1 and Generalized IoU (GIoU) loss for bounding box regression.
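A simplified sketch of this set-prediction matching is shown below; the cost weights follow common DETR-style defaults, and plain IoU stands in for the GIoU term for brevity:

```python
# Sketch: DETR-style bipartite matching between predicted queries and ground truths.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_xyxy(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def hungarian_match(pred_boxes, pred_probs, gt_boxes, gt_labels,
                    w_cls=1.0, w_l1=5.0, w_iou=2.0):
    """Return (query index, ground-truth index) pairs minimizing total cost."""
    cost = np.zeros((len(pred_boxes), len(gt_boxes)))
    for i, (box, probs) in enumerate(zip(pred_boxes, pred_probs)):
        for j, (gt_box, gt_cls) in enumerate(zip(gt_boxes, gt_labels)):
            cost[i, j] = (w_cls * -probs[gt_cls]                # classification cost
                          + w_l1 * np.abs(box - gt_box).sum()   # L1 box cost
                          + w_iou * -iou_xyxy(box, gt_box))     # overlap cost
    rows, cols = linear_sum_assignment(cost)  # optimal bipartite assignment
    return list(zip(rows, cols))
```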
3.2. Quantitative Results and Performance Benchmarking
To enable a fair comparison across architectures, we report standard detection metrics (precision, recall, F1-score, mAP@0.5, and mAP@0.5:0.95) for all trained models on the ISWDS test set. The overall performance metrics for all models are summarized in Table 4. The results reveal a clear performance–efficiency trade-off among the architectures and their scales.
A fundamental trade-off between recall and precision was observed. RT-DETR models consistently achieved the highest recall values (v1: 0.807, v2: 0.811), indicating superior detection of true defects and fewer missed detections (false negatives). In contrast, YOLO models generally achieved higher precision (e.g., YOLOv11m: 0.796), meaning their positive predictions were more reliable but potentially missed some defects.
The F1-Score, which balances precision and recall, identifies RT-DETR v2 (0.790) as the best overall model, followed by RT-DETR v1 (0.764). Among the YOLO family, YOLOv12l achieved the highest F1-Score (0.754). This suggests that for a task where both avoiding false alarms and missing defects are important, RT-DETR provides a more balanced and superior solution.
The mAP@0.5:0.95 metric, which requires precise localization, shows that YOLOv11l (0.566) and RT-DETR v2 (0.565) achieved the highest scores, indicating they are the most accurate models when strict bounding box alignment is required. The performance across YOLO versions (v8, v11, v12) was largely comparable, with no single version dominating the others.
Computational efficiency is a critical consideration for practical deployment, particularly in real-time or resource-constrained scenarios. Table 5 summarizes key training parameters and model sizes, including the number of epochs, batch sizes, ONNX file sizes, and total training time. Despite comparable runtimes, RT-DETR’s transformer-based design incurred slightly higher computational overhead, whereas YOLO variants benefited from their streamlined convolutional backbones.
From the table, it is evident that larger models such as YOLOv12x and YOLOv8x require substantially longer training times and smaller batch sizes due to GPU memory constraints. In contrast, the nano and small variants of YOLO train much faster and can utilize larger batch sizes, making them suitable for rapid prototyping or deployment on edge devices. Some models, including YOLOv8x, YOLOv11m, and YOLOv12n, reached convergence before completing the maximum number of epochs, demonstrating that early stopping can reduce total training time while maintaining competitive performance. The RT-DETR models, while highly accurate, require longer training times of approximately 10 h and moderate batch sizes, reflecting the increased computational demand of transformer-based architectures.
A critical analysis was performed to understand model performance for each of the eight defect classes for the best-performing YOLO and RT-DETR models in terms of F1-Score (Table 6). This evaluation provides insights into how different architectures handle the diverse visual characteristics of operational and structural defects.
The analysis shows that RT-DETR v2 consistently outperforms YOLOv12l across all defect classes in terms of F1-Score, demonstrating the transformer-based architecture’s superior ability to model complex spatial dependencies and contextual information. Its advantage is particularly pronounced for classes such as Roots and Displaced joint, where F1-Scores exceed 90%, indicating highly reliable detection.
The Crack/Breaks/Collapses category remains the most challenging class, with both models yielding their lowest F1-Scores (YOLOv12l: 54.91%, RT-DETR v2: 63.17%), highlighting the difficulty of detecting thin, irregular, and low-contrast features. This challenge is further reflected in low mAP@0.5:0.95 values for both models, suggesting that precise localization of such defects remains a limitation.
3.3. Results with Statistical Analysis
To strengthen the reliability of the reported findings, we incorporated standard deviations, confidence intervals, and statistical significance tests in addition to the conventional evaluation metrics. Without retraining, measurement uncertainty was estimated via bootstrap resampling with B = 2000 resamples (sampling with replacement) over the test set. For each resample, precision, recall, F1-score, mAP@0.5, and mAP@0.5:0.95 were recomputed, and results are reported as mean ± standard deviation (sd). In line with common practice in the object detection literature, per-image metrics (precision, recall, and F1-score) are reported with sd, whereas dataset-level summary metrics (mAP) are additionally accompanied by 95% percentile confidence intervals (CI) (Table 7). The detection score threshold was set to 0.50 when computing per-image precision/recall/F1 metrics, whereas AP values were obtained in the standard threshold-free manner. To ensure reproducibility, the random seed was fixed at 42 during bootstrap resampling. The 95% confidence intervals for mAP values were relatively narrow (±1.5–1.7), reflecting the stability of model performance across the large validation set (N = 2495 images). This suggests that the reported improvements are unlikely to be due to sampling variability and can be considered statistically robust.
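The bootstrap procedure can be sketched as follows for any per-image metric; recomputing mAP per resample would invoke the full evaluator on the resampled image set instead of a simple mean:

```python
# Sketch: percentile bootstrap (B = 2000, seed 42) over per-image metric values.
import numpy as np

def bootstrap_summary(per_image_metric, n_boot=2000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    values = np.asarray(per_image_metric)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(values), size=len(values))  # resample with replacement
        stats[b] = values[idx].mean()
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stats.mean(), stats.std(ddof=1), (lo, hi)  # mean, sd, 95% percentile CI
```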
In addition, paired permutation tests with Holm correction (to adjust for multiple comparisons) were conducted to assess the significance of differences between the models. The results indicated no statistically significant difference in precision (p = 0.115), whereas RT-DETR v2 achieved significantly higher recall (p = 0.0001) and F1-score (p = 0.0001) than YOLOv12l. Although the per-image metric distributions were symmetric and yielded median differences close to zero, the permutation tests still revealed statistically significant differences in the distributions of recall and F1-score. This means that, even if the central tendency appeared identical, RT-DETR v2 consistently provided more reliable detection outcomes across images. The incorporation of confidence intervals and distributional tests thus reinforces the robustness of the reported performance and ensures that the observed improvements are not only quantitatively meaningful but also statistically robust.
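The paired permutation test can be sketched analogously; per-image scores of the two models are aligned by image, sign flips of the paired differences generate the null distribution, and Holm correction is applied across the family of metrics (the multipletests call is from statsmodels):

```python
# Sketch: paired permutation test on per-image score differences, Holm-corrected.
import numpy as np
from statsmodels.stats.multitest import multipletests

def paired_permutation_p(scores_a, scores_b, n_perm=10000, seed=42):
    rng = np.random.default_rng(seed)
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(d.mean())
    hits = sum(abs((d * rng.choice([-1, 1], size=d.size)).mean()) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# One p-value per metric (precision, recall, F1), then Holm adjustment:
# reject, p_adj, _, _ = multipletests([p_prec, p_rec, p_f1], method="holm")
```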
3.4. Qualitative Results and Error Analysis
Beyond quantitative metrics, a qualitative analysis was conducted on sample images from the test set to visually assess the detection capabilities, strengths, and failure modes of the top-performing models: YOLOv12l (best YOLO variant by F1-Score) and RT-DETR v2 (overall best model). A confidence threshold of 0.25 was used for both models to ensure a comprehensive comparison of their predictions.
Figure 4 presents example result images for all classes. The left column shows inference results obtained with YOLOv12l, whereas the right column presents the corresponding results produced by RT-DETR v2.
Both models demonstrated proficient detection capabilities across various defect types. However, key differences in their approach were observed:
As shown in Figure 5, both models correctly identified roots. RT-DETR v2 consistently produced larger and more precise bounding boxes that more completely encapsulated the entire defect structure, reflecting its superior localization accuracy (Figure 5b). Figure 5 also highlights a critical strength of the transformer architecture: RT-DETR v2 successfully detected attached deposits that were occluded or only partially visible along the pipe crown (Figure 5d). In contrast, YOLOv12l failed to detect these instances (Figure 5c), indicating RT-DETR’s enhanced ability to leverage global contextual information within the image to identify challenging, low-contrast defects.
The analysis also revealed common and model-specific failure modes, providing insight into the remaining challenges of automated sewer defect detection. A recurring error involved YOLOv12l misclassifying Pipe Surface Damage as Displaced joint (Figure 6a). This suggests the model relies on similar visual patterns (e.g., linear features, shadows) for both classes, indicating a potential need for more discriminative training examples or architectural adjustments to better separate them. The Crack/Breaks/Collapses class proved to be the most challenging, with both models occasionally failing to detect fine, thin cracks entirely (Figure 6c,d).
4. Practical Integration and GIS-Based Deployment
The choice between model architectures has direct consequences for real-world deployment, extending beyond mere accuracy metrics to practical considerations of integration and computational efficiency. If the primary goal is to minimize missed defects and ensure the highest possible detection rate across diverse and challenging conditions, RT-DETR v2 is the unequivocal choice. Its stability and high recall make it suitable for offline, server-based processing of inspection data where computational resources are not constrained. If the system must run on embedded hardware such as NVIDIA Jetson or Raspberry Pi (Sony UK Technology Centre, Pencoed, UK) within the inspection vehicle itself, or on standard office computers without specialized GPUs, the YOLO family provides a significant advantage. Their smaller model size, higher frames-per-second throughput, and lower computational demands make them the more practical and deployable option for in-field and desktop applications.
To demonstrate practical integration, a Python 3.12-based application within the QGIS 3.34 environment was developed. This tool bridges the gap between raw video analysis and actionable asset management insights. Inspection videos and their metadata, stored in a PostgreSQL 17.4 database and NextCloud Hub 9 (30.0.8) cloud storage, are queried and retrieved directly within the application. The selected pre-trained model, converted to the standardized ONNX format for framework interoperability, processes the video. Each frame is analyzed using OpenCV 4.11, with defects identified and annotated with colored bounding boxes and labels in real-time.
The application features a custom video player interface that displays the annotated video stream. A legend assigns a unique color to each defect class for easy identification. As the video plays, detection data is automatically compiled and exported to a comprehensive Excel report for further analysis and record-keeping.
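The per-frame inference loop at the core of the plugin can be sketched as follows; the model path and input size are placeholders, and decoding of the raw network output (which differs between YOLO and RT-DETR exports) is omitted:

```python
# Sketch: annotate an inspection video frame by frame with an ONNX detector.
import cv2

net = cv2.dnn.readNetFromONNX("best_model.onnx")   # exported detector
cap = cv2.VideoCapture("inspection_video.mp4")

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (640, 640), swapRB=True)
    net.setInput(blob)
    outputs = net.forward()   # raw predictions; decode to boxes/classes/scores
    # ...draw class-colored bounding boxes, append a row to the Excel report...
cap.release()
```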
Figure 7 shows the graphical user interface of the developed QGIS 3.34 plugin, with real-time defect detection on a video stream supported by a legend and controls.
This system directly addresses key industry challenges. First, computational constraints were validated empirically. While YOLO-n and -s models could run in real-time on a standard Intel i7 CPU, larger models such as YOLO-m/l/x and RT-DETR required a dedicated GPU to process 30 FPS video without lag. This justifies the recommendation of smaller YOLO variants for widespread deployment on municipal hardware. Second, the tool significantly reduces the manual workload for inspectors by automating the process and mitigating human error and fatigue associated with reviewing hours of footage. Third, the application supports the democratization of expertise by providing visual annotations and automated reports, enabling less experienced personnel to conduct thorough and accurate inspections and making expert-level assessments more accessible.
The success of this integrated application is directly proportional to the accuracy of the underlying deep learning model, underscoring the importance of the comparative analysis presented in this study. This work provides a complete pipeline, from novel dataset and model benchmarking to a functional tool ready for end-user adoption, demonstrating a direct path from research to practical infrastructure asset management.
5. Discussion
The comparative analysis of YOLO and RT-DETR models highlights both the progress and the persistent challenges in automating sewer defect detection. This section interprets the results within the broader context of automated infrastructure inspection and discusses the implications for practical deployment. The influence of pipe material on sewer system performance has been widely acknowledged in the literature. While management-oriented guidelines such as the EPA’s CMOM framework emphasize operational and maintenance practices rather than material properties [49], technical studies have highlighted the critical role of material type in pipe deterioration. Ref. [8] noted that failures in concrete, clay, and plastic pipes exhibit different modes and rates of deterioration, underlining the importance of material-specific maintenance strategies. More recently, ref. [50] conducted a state-of-the-art review and concluded that pipe material, along with age, diameter, and length, represents one of the most significant predictors of sewer pipe condition. Their review also revealed that international condition rating systems, including the WRC in the United Kingdom, PACP in the United States, NRC in Canada, and WSAA in Australia, incorporate material type as a fundamental input in deterioration models. These findings suggest that including pipe material as a variable in defect detection and condition assessment studies could enhance the robustness of predictive models and provide a more comprehensive discussion of infrastructure performance across different countries.
Building upon these insights from the literature, our experimental findings further demonstrate the comparative strengths of the evaluated models. Our experimental results reveal that although both YOLOv12l and RT-DETR v2 exhibit competitive performance, RT-DETR v2 demonstrates superior outcomes in terms of recall and mAP, underscoring its suitability for reliable defect detection in real-world scenarios. These findings align with recent studies emphasizing the advantages of transformer-based detection models over conventional CNN-based approaches, thereby reinforcing the potential of RT-DETR v2 as a robust framework for practical applications.
5.1. Model Performance and Architecture Comparison
Across all metrics, RT-DETR v2 consistently delivered superior recall and robustness, particularly in complex or visually degraded scenes. Its transformer-based encoder–decoder architecture, equipped with multi-head self-attention, enabled the modeling of global context, allowing the detection of defects that were occluded, poorly lit, or spatially spread across the image. This capability proved especially beneficial for operational defects such as roots and settled deposits, where local feature cues alone are insufficient. In contrast, YOLOv12l, the best-performing CNN-based variant, excelled in well-defined scenarios with clear structural patterns, offering cleaner outputs with fewer false positives. However, its reliance on local receptive fields limited its ability to generalize under challenging conditions, as evidenced by its lower recall rates.
As detailed in Section 4, these architectural differences carry direct deployment consequences: RT-DETR v2 is the clear choice when minimizing missed defects is the priority and server-based processing is available, whereas the YOLO family’s smaller model sizes, higher frame rates, and lower computational demands make it the more practical option for embedded in-vehicle hardware or standard office computers without dedicated GPUs.
5.2. Persistent Challenges and Limitations
Despite these advances, both model families consistently struggled with the Crack/Breaks/Collapses class. The normalized confusion matrices for YOLOv12l and RT-DETR-v2 revealed accuracies of only 0.58 and 0.60, respectively, for this category, with cracks frequently misclassified as background (Figure 8). A similar challenge arose for the Attached deposits class, with background confusion rates of 0.27 and 0.26 for YOLOv12l and RT-DETR-v2, respectively. These errors are not solely attributable to model deficiencies but instead reflect the inherent difficulty of the task. Fine cracks, attached deposits, and subtle infiltration signatures often share visual characteristics with pipe textures, stains, water streaks, or reflections, making them difficult to distinguish even for human inspectors.
To further evaluate the generalizability of our models, we also conducted experiments using the publicly available Sewer-ML dataset, which represents one of the largest collections of sewer inspection imagery to date. Since the label annotations of this Danish dataset were not made publicly available, a direct comparison with the benchmark values reported by [37] could not be performed. However, upon inspection of the dataset, we observed that several images are of very low resolution; defects such as cracks and fractures are particularly challenging to detect in such images. Moreover, the dataset would benefit from review by domain experts and the removal of certain problematic images, as in some cases defects cannot be identified even by the human eye, or the images suffer from blurriness. In addition, defects that can only be inferred indirectly, such as infiltration, pose further challenges.
As shown in Figure 9, YOLO failed to detect root intrusions, whereas RT-DETR correctly identified them. Surface damage was localized as a broader, single area by RT-DETR. Intruding sealing material, settled deposits, and attached deposits were successfully detected by both models. Infiltration, however, was consistently misclassified as surface damage in both cases. For cracks, YOLO detected only the one on the left side of the image, while RT-DETR failed to identify any of them.
5.3. Domain-Specific Challenges
Beyond architecture-specific observations, the study brings attention to domain-specific challenges of sewer inspection imagery. Poor image quality resulting from low-resolution cameras, blur, insufficient lighting, or environmental factors such as mud and condensation directly limits detection accuracy. The complex background of sewer walls, where stains, surface irregularities, and deposits overlap with actual defects, makes accurate bounding box generation difficult. Many defects are small or ambiguous, requiring models capable of fine-grained detail extraction. Moreover, high inter-class similarity, such as between attached deposits and settled deposits, exacerbates false positives. The imbalanced class distribution in the dataset further compounds these issues, with underrepresented defect types leading to unstable per-class performance.
The quality of labeling also emerged as a significant factor. Operator-level analysis indicated that human annotators achieved only 50–60% detection accuracy, partly due to the volume of video data and the inherent difficulty of distinguishing subtle defects. Mislabeling, class grouping (e.g., cracks, breaks, collapses treated as one class despite their variability), and inconsistent annotation standards reduced the models’ ability to generalize. For example, excessive grouping of defect types contributed to poor Crack/Breaks/Collapses performance, while reflection-induced false detections highlighted the need for stricter imaging protocols, such as limiting robot speed to reduce vibration and blur.
Another limitation concerns computational requirements. While YOLO-n and YOLO-s variants successfully ran in real time on CPU-only systems, larger YOLO models and all RT-DETR variants required GPU acceleration to maintain frame rates compatible with live inspection. This reflects a broader trade-off: the transformer-based models offer superior accuracy but impose higher costs in terms of training time, inference latency, and hardware demands. By contrast, the YOLO family, with its smaller model sizes and efficient inference, remains more deployable for municipalities with limited computing resources. Thus, the choice of models must balance accuracy with practical constraints, and future research should investigate methods such as model compression, quantization, and knowledge distillation to bridge this gap.
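As a concrete illustration of one such compression technique, the sketch below applies PyTorch post-training dynamic quantization to a toy detection head. This is a minimal example of the general method, not the configuration used in our experiments; production YOLO or RT-DETR deployments would typically go through format-specific export paths (ONNX, TensorRT, OpenVINO) instead.

```python
# Minimal sketch: post-training dynamic quantization with PyTorch.
# The toy head below is purely illustrative.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 9),  # e.g., 8 defect classes + background
)

# Dynamic quantization converts Linear weights to int8 and quantizes
# activations on the fly, shrinking the model and speeding up CPU inference.
quantized_head = torch.quantization.quantize_dynamic(
    head, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized_head(x).shape)  # torch.Size([1, 9])
print(quantized_head[0])        # DynamicQuantizedLinear(...)
```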
5.4. Performance Comparison with State-of-the-Art
To contextualize our results within the broader research landscape, we compared our best-performing models against recent state-of-the-art approaches on sewer defect detection for similar defect classes where AP values are reported. While direct comparisons are limited by dataset differences, several studies provide relevant benchmarks.
Table 8 summarizes reported performances across different geographic locations, model families, and defect types, highlighting the diversity of methods and results in this domain.
The comparative overview in Table 8 shows several key trends. First, there is considerable variability across reported performances, reflecting differences in datasets, labeling schemes, and evaluation protocols. For example, refs. [17,19] both report AP values above 83% for roots and settled deposits using Faster R-CNN and improved YOLOv3, respectively, though these were tested on datasets from the United States under relatively controlled conditions. More recent work using the Sewer-ML dataset from Denmark [15,16,26,53,54] consistently reports results above 90% AP for structural defects such as displaced joints and roots, suggesting that the large scale and standardized labeling of Sewer-ML may facilitate higher model performance.
Second, transformer-based architectures have begun to appear in this field, with refs. [28,31] demonstrating competitive results using DETR and Swin Transformer variants. While their AP values are lower than those of the best-performing YOLO-based models, they highlight the potential of attention mechanisms to capture contextual information.
Third, our models perform broadly in line with the ranges reported in the literature, despite being evaluated on the newly introduced ISWDS dataset, which presents the more challenging imaging conditions typical of Istanbul's sewer network. For instance, our YOLOv12l achieved 93.4% AP for roots and 90.8% for displaced joint, aligning with top results from Sewer-ML-based studies, while our RT-DETR v2 excelled in infiltration (80.5% AP) and attached deposits (90.8% AP), categories that have received comparatively less attention in prior work.
Taken together, these comparisons suggest that our dataset poses a challenging yet realistic benchmark for sewer defect detection. While absolute AP values are slightly lower than those achieved on Sewer-ML, the relative performance of YOLO and RT-DETR aligns with broader trends in the field. This reinforces the value of ISWDS as a complementary dataset to existing benchmarks and underscores the robustness of our findings across defect categories.
5.5. Broader Applicability and Future Directions
While our experimental results clearly demonstrate the effectiveness of both YOLOv12l and RT-DETR v2, their integration into real-world sewer inspection practice remains challenging. Fine-grained defects such as cracks, attached deposits, and subtle infiltration signatures proved particularly difficult to detect, primarily due to their strong visual similarity with background textures, stains, and reflections. These challenges are further compounded by the quality of image acquisition and annotation, as operator-level analysis indicated that even human inspectors achieved only 50–60% accuracy in such cases. In addition, the trade-off between accuracy and computational feasibility represents a practical barrier: RT-DETR v2 provides superior recall and robustness under degraded conditions but requires GPU acceleration, whereas YOLO models, despite being less reliable in occluded or poorly lit scenarios, remain more practical for embedded or resource-constrained environments due to their lighter architectures and higher throughput. To overcome these limitations, future research should focus on stricter imaging protocols to reduce noise and blur, domain-specific data augmentation, and improved annotation practices, along with the exploration of lightweight architectures, model compression, quantization, and hybrid human–AI workflows. These measures would not only mitigate current constraints but also enhance the applicability, scalability, and robustness of automated defect detection systems in operational environments.
Beyond model performance, we demonstrate the translational impact of our research by embedding the best-performing model into a QGIS 3.34-based interface. This integration is not only a practical tool but also a methodological contribution, showing how deep learning outputs can be geo-referenced, systematically logged, and incorporated into existing infrastructure management workflows. By validating the model in this operational context, we highlight its potential to reduce operator dependence, enable real-time decision-making, and support proactive asset management. Furthermore, to ensure reproducibility and facilitate future benchmarking, the trained models are openly shared on Hugging Face, providing a replicable framework for subsequent research on infrastructure AI applications.
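To make the integration concrete, the following sketch shows one way detections can be geo-referenced and logged inside QGIS via its Python API. The layer name, field schema, and sample coordinate are hypothetical illustrations, not the exact schema of our tool.

```python
# Minimal PyQGIS sketch: log one defect detection as a geo-referenced point
# in an in-memory layer (run inside the QGIS 3.x Python console).
# Field names and the sample coordinate are hypothetical.
from qgis.core import (QgsFeature, QgsField, QgsGeometry,
                       QgsPointXY, QgsProject, QgsVectorLayer)
from qgis.PyQt.QtCore import QVariant

layer = QgsVectorLayer("Point?crs=EPSG:4326", "sewer_defects", "memory")
provider = layer.dataProvider()
provider.addAttributes([
    QgsField("defect", QVariant.String),    # predicted class
    QgsField("conf", QVariant.Double),      # model confidence
    QgsField("video_s", QVariant.Double),   # timestamp in inspection video
])
layer.updateFields()

feat = QgsFeature(layer.fields())
feat.setGeometry(QgsGeometry.fromPointXY(QgsPointXY(28.9784, 41.0082)))
feat.setAttributes(["Roots", 0.93, 125.4])
provider.addFeatures([feat])

QgsProject.instance().addMapLayer(layer)  # detection now visible on the map
```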
The broader applicability of these methods extends beyond sewer systems. While the study focused on sewer infrastructure, the challenges encountered—small, ambiguous defects in noisy, low-quality imagery—are not unique to wastewater systems. Similar issues arise in drinking water networks, stormwater drainage pipes, industrial piping, and even underground cabling systems. Deep learning–based defect detection, particularly with transformer-enhanced architectures, therefore holds significant potential for cross-domain transferability. By enabling earlier detection, reducing human workload, and standardizing inspection outputs, these approaches contribute directly to smarter infrastructure management.
The success of the integrated application demonstrates a direct path from research to practical infrastructure asset management. This work provides a complete pipeline, from novel dataset and model benchmarking to a functional tool ready for end-user adoption. However, improving annotation practices, expanding dataset diversity, and introducing sub-classes for nuanced defect categories are essential steps toward closing the gap between model performance and real-world needs.
Future research could explore several promising directions. First, incorporating video-based analysis rather than frame-by-frame detection could improve temporal consistency and reduce false positives. Second, extending the methodology to include multi-task learning that combines detection, segmentation, and classification may yield richer insights into pipeline conditions. Third, conducting large-scale validation with imagery collected from diverse geographical and environmental contexts would enhance the generalizability of the models. Additionally, testing on larger and more balanced datasets is recommended to mitigate class imbalance effects.
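As a simple illustration of the first direction, the sketch below smooths per-frame detection confidences over a sliding window, so that an isolated one-frame spike is suppressed while persistent detections survive. The window length, threshold, and confidence values are illustrative assumptions, not tuned parameters from our experiments.

```python
# Minimal sketch: temporal smoothing of per-frame confidences for one defect
# class. A detection is confirmed only if its windowed mean stays above a
# threshold, suppressing single-frame false positives.
from collections import deque

def smooth_confidences(frame_confs, window=5, threshold=0.5):
    """Yield (frame_index, confirmed) after sliding-window mean filtering."""
    buffer = deque(maxlen=window)
    for i, conf in enumerate(frame_confs):
        buffer.append(conf)
        confirmed = sum(buffer) / len(buffer) >= threshold
        yield i, confirmed

# A spurious spike at frame 2 is suppressed; the persistent detection starting
# at frame 5 is confirmed once the window mean crosses the threshold (frame 6).
confs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.85, 0.9, 0.9]
for i, ok in smooth_confidences(confs):
    print(f"frame {i}: confirmed={ok}")
```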
6. Conclusions
This study represents one of the first comparative investigations applying YOLO and RT-DETR architectures to sewer inspection imagery using a novel dataset. The analysis evaluated the performance of both models across multiple defect classes, highlighting their respective strengths and limitations. While the models demonstrated strong performance in detecting sewer defects, addressing practical challenges such as fine-grained defect recognition, annotation quality, and computational feasibility remains essential for real-world deployment. By situating the findings within broader infrastructure contexts and outlining concrete directions for future research, this study provides a foundation for advancing automated defect detection from experimental evaluation toward practical, scalable, and cross-domain applications.
Concretely, this study introduced the Istanbul Sewer Defect Dataset (ISWDS), comprising 13,491 expert-annotated images that capture eight major defect categories representing nearly 90% of reported failures in Istanbul's wastewater network. Using this dataset, we conducted a comprehensive benchmark of state-of-the-art deep learning architectures, including CNN-based YOLO variants (v8 and v11 series) and transformer-based RT-DETR models (v1 and v2), under identical evaluation protocols.
Experimental results demonstrate that RT-DETR v2 consistently outperformed all other models, achieving an F1-score of 79.03% and a Recall of 81.10%, figures that significantly surpass those of the best-performing YOLO variant (YOLOv8l, F1: 74.20%; Recall: 70.72%). The transformer-based architecture proved particularly effective at detecting partially occluded and complex defects, highlighting its robustness in real-world inspection scenarios. In contrast, smaller YOLO variants, while yielding slightly lower accuracy (F1-scores between 72% and 74%), offered advantages in inference speed and computational efficiency, making them suitable for resource-constrained environments.
Beyond benchmarking, we developed a QGIS 3.34-based inspection tool that integrates the best-performing models into a real-time video processing and reporting pipeline. This practical contribution bridges the gap between research and operational deployment, enabling sewer authorities to enhance inspection efficiency, reduce manual labor, and improve early detection of infrastructure failures.
Overall, this work provides (i) the first large-scale sewer defect dataset for Istanbul, (ii) a rigorous comparative analysis of transformer and CNN-based detection models, and (iii) an operational GIS-based tool for automated inspection. Together, these contributions advance the state of the art in smart sewer management and support the development of resilient urban infrastructure systems.
YOLO demonstrated clear advantages in terms of inference speed and compatibility with resource-constrained devices, making it suitable for real-time field deployment. In contrast, RT-DETR achieved higher overall accuracy and robustness across most defect categories, albeit at the expense of increased computational cost. These findings emphasize the trade-off between efficiency and precision when selecting models for practical sewer inspection applications.
From a technical perspective, constraints related to dataset size, hardware availability, training duration, augmentation strategies, and the number of defect classes should be carefully addressed. Deploying more powerful GPU resources would also allow experiments with higher-capacity RT-DETR variants. Furthermore, comparative assessments with other advanced architectures, such as DETR derivatives and Faster R-CNN, could provide a broader benchmark. Finally, extending the analysis beyond imagery to include alternative data modalities, such as 3D point clouds captured from sewer environments, offers an exciting avenue for future exploration.
In conclusion, the integration of deep learning into sewer defect detection demonstrates strong potential for automating and enhancing inspection processes. By advancing towards multimodal approaches and leveraging increasingly powerful architectures, the level of automation and reliability in sewer monitoring can be significantly improved. To ensure reproducibility and foster future research, all trained models have been openly shared on Hugging Face.