AI-Driven High-Precision Model for Blockage Detection in Urban Wastewater Systems

Abstract: In artificial intelligence (AI), computer vision consists of intelligent models that interpret and recognize the visual world, similar to human vision. This technology relies on a synergy of extensive data and human expertise, meticulously structured to yield accurate results. Locating and resolving blockages within sewer systems is a significant challenge due to their diverse nature and the lack of robust techniques. This research utilizes the previously introduced "S-BIRD" dataset, a collection of frames depicting sewer blockages, as the foundational training data for a deep neural network model. To enhance the model's performance, transfer learning and fine-tuning techniques are applied to the YOLOv5 architecture using this dataset. The trained model exhibits a high accuracy rate in sewer blockage detection, thereby boosting the reliability and efficacy of the associated robotic framework for proficient removal of various blockages. Particularly noteworthy is the achieved mean average precision (mAP) score of 96.30% at a confidence threshold of 0.5, with a consistently high performance level of 79.20% across Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95. It is expected that this work contributes to advancing the applications of AI-driven solutions for modern urban sanitation systems.


Introduction
Computer vision is a field of artificial intelligence (AI) whose algorithms extract the required information from visual inputs such as photos and videos and, based on that information, perform actions or make recommendations in order to detect and identify distinct objects. Large datasets therefore tend to improve the performance of computer vision models.
Object detection techniques in computer vision locate objects in an image or video with bounding boxes and identify their classes. Initially, machine learning was mainly used for object detection tasks, but once deep neural networks, i.e., deep learning methods, emerged, they became popular because they automatically extract representative features from large training datasets [1]. Occlusion, clutter, and low resolution are some of the sub-problems that deep learning-based detection frameworks handle very efficiently [2,3]. These frameworks fall into two types: single-stage detectors, which favor inference speed and real-time use, and two-stage detectors, which favor model performance, i.e., detection accuracy. Single-stage detectors skip region of interest (ROI) extraction and proceed directly to classification and regression, whereas two-stage detectors first extract ROIs and then apply classification and regression. The YOLO detection models (YOLOv2 [4], YOLOv3 [5], YOLOv4 [6], and YOLOv5 [7]), SSD [8], CenterNet [9], and CornerNet [10] are examples of single-stage detectors. Region proposal models (R-CNN [11], Fast-RCNN [12], Faster RCNN [13], Cascade R-CNN [14], and R-FCN [15]) are two-stage detectors. Classification and localization accuracy and inference speed are the two important metrics for object detectors. In the advancement of detection models, transfer learning techniques with quality datasets meet these requirements with minimal training time [16,17]. Transfer learning harnesses prior knowledge to enhance performance on novel tasks. Through fine-tuning, pretrained deep neural models are adapted to new contexts, with certain layers preserved and others refined. This brings advantages such as quick convergence, good performance, and adaptability in real-world scenarios.
As the applications of AI evolve, such as video surveillance, military applications, security aspects, health monitoring, and critical detection tasks, the AI techniques are being enhanced to suit these needs.
Meeting application-specific needs for sensible and accurate results requires detection models to be adapted and modified, which usually carries heavy computational demands. However, approaches such as embedded vision with AI enable real-time, efficient, and intelligent visual processing directly on edge devices, which reduces dependency on cloud computing and enhances privacy and responsiveness in many applications [18,19].
Detecting various sewer blockages is a major challenge due to their complex and heterogeneous nature. Moreover, their locations in the sewer network may vary, including main lines, lateral connections, and junctions. Blockages can exhibit varying levels of severity, from partial restrictions that gradually reduce flow to complete blockages that cause sewer overflows. The dynamic and unpredictable nature of urban wastewater systems, influenced by factors such as climate, wastewater composition, and hydraulic conditions adds another layer of complexity. In this research work, transfer learning and fine-tuning techniques are utilized to achieve a high precision rate in the detection of blockages within urban wastewater systems. This approach is intended for real-time implementation on mobile devices and other environments with limited resources, with the goal of effectively removing such blockages. Our primary emphasis is on the training of the single-stage YOLOv5 model using the S-BIRD dataset [20,21], which contains representative and critical multi-class images depicting prevalent sewer blockage scenarios.
The study implements all computer vision and model training procedures using Python programming, OpenCV, PyTorch framework, and other machine learning libraries. These operations are carried out on a DGX GPU workstation system running on the Linux platform, ensuring a robust and efficient experimental environment. The results are analyzed and discussed to demonstrate the effectiveness of the methodology used.

Structural Insights of YOLOv5 Model
YOLOv5 is an anchor-based single-stage detection model, which is built on the PyTorch framework. It focuses on simplicity, model scaling, and transfer learning, making it versatile for a wide range of object detection tasks. The model's backbone is CSP Darknet-53, which incorporates Cross Stage Partial (CSP) connections to enhance information flow and feature representation.
To create feature pyramids for effective object scaling and generalization, YOLOv5 employs the Path Aggregation Network (PAN) as its neck. The head design utilizes anchor boxes to generate output vectors that contain class probabilities, objectness scores, and bounding box coordinates (center_x, center_y, height, and width). The model parameters are updated during training using the following loss function:
$$L = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}$$
where L_cls represents the Binary Cross Entropy loss for predicted classes, L_obj represents the Binary Cross Entropy loss for objectness scores, and L_loc represents the Complete Intersection over Union loss for bounding box locations. Here, λ1, λ2, and λ3 are hyperparameters controlling the contribution of each component to the overall loss.
The employed auto anchor automatically determines and generates anchor boxes based on the distribution of bounding boxes in the custom dataset using K-means clustering and a genetic learning algorithm. Within the network, the SiLU (Sigmoid Linear Unit) activation function in the hidden layers captures intricate details, and the Sigmoid activation function in the output layer serves binary classification.
As shown in Figure 1, the backbone employs Convolutional and C3 layers to extract image features, which are then combined at various levels using Conv, Upsample, Concat, and C3 layers in the head. The object detection process is facilitated by a Detect layer that uses anchor boxes and the indicated class count. In particular, each C3 (CSP-3) block consists of two parallel convolutional layers: the first channels input features through a bottleneck layer, compressing the information, and the second outputs features directly. These streams are then concatenated and processed through pooling and convolutional layers. The C3 blocks also use skip connections and attention mechanisms to enhance information flow and reduce noisy features.
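As a minimal sketch of how the three loss terms described above could be combined, the following Python snippet implements binary cross-entropy, a Complete-IoU localization loss for boxes given as (center_x, center_y, width, height), and their weighted sum. The function names (`bce`, `ciou_loss`, `total_loss`) and the unit weights are illustrative assumptions; they are not taken from the YOLOv5 code base or from the hyperparameters used in this study.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a target y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss (1 - CIoU) for boxes given as (center_x, center_y, width, height)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target

    # Overlap of the two boxes along each axis.
    inter_w = max(0.0, min(px + pw / 2, tx + tw / 2) - max(px - pw / 2, tx - tw / 2))
    inter_h = max(0.0, min(py + ph / 2, ty + th / 2) - max(py - ph / 2, ty - th / 2))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(px + pw / 2, tx + tw / 2) - min(px - pw / 2, tx - tw / 2)
    enc_h = max(py + ph / 2, ty + th / 2) - min(py - ph / 2, ty - th / 2)
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Squared distance between the box centers.
    rho2 = (px - tx) ** 2 + (py - ty) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)


def total_loss(l_cls, l_obj, l_loc, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum L = lambda1 * L_cls + lambda2 * L_obj + lambda3 * L_loc.

    The default weights are placeholders, not values reported in the paper.
    """
    lam1, lam2, lam3 = lambdas
    return lam1 * l_cls + lam2 * l_obj + lam3 * l_loc
```

For two identical boxes the localization term is approximately zero, while for non-overlapping boxes it exceeds 1 because the center-distance penalty is added to the missing overlap.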
Electronics 2023, 12, x FOR PEER REVIEW

Details of Training Instances in Critical Multi-Class S-BIRD
The dataset comprises a total of 14,765 training frames across three classes (grease, plastics, and tree roots), which are meticulously annotated with 69,061 objects as shown in Figure 2, resulting in an average of 4.7 annotations per frame. Specifically, the dataset comprises 26,847 annotations for grease, 21,553 annotations for tree roots, and 20,661 annotations for plastics. To ensure uniformity and standardization, the frames were preprocessed and augmented, resulting in an average frame size of 0.173 megapixels. The frames were resized to 416 × 416 pixels, giving a square (1:1) aspect ratio. The diagonal therefore lies at an angle of 0.785 radians (equivalent to 45 degrees) and measures 588 pixels. Regarding pixel density, the dataset exhibits a density of 12 pixels per millimeter or 290 pixels per inch. These specific computational details are vital for understanding the characteristics of the S-BIRD dataset, which plays a crucial role in effectively training the deep neural network. Figure 3 illustrates the relative distribution of the center x coordinates of objects across the different classes in the training frames. Each segment is color-coded and displays data values and percentiles, providing a clear understanding of object positions along the x-axis. Together, the dataset's dimensions, resolutions, and geometric properties inform the implementation of transfer learning and fine-tuning techniques for the deep neural detection model.
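The derived figures quoted above follow from the raw counts and the frame size by simple arithmetic. The short Python sketch below recomputes them; the helper name `dataset_summary` and the dictionary layout are illustrative choices, and only the input numbers come from the text.

```python
import math

# Raw values stated in the text.
FRAMES = 14765
ANNOTATIONS = {"grease": 26847, "tree roots": 21553, "plastics": 20661}
SIDE_PX = 416  # frames are resized to SIDE_PX x SIDE_PX


def dataset_summary(frames, annotations, side_px):
    """Recompute the derived dataset statistics from the raw counts."""
    total = sum(annotations.values())
    return {
        "total_annotations": total,
        "annotations_per_frame": total / frames,
        "megapixels": side_px * side_px / 1e6,
        "diagonal_px": math.hypot(side_px, side_px),
        "diagonal_angle_rad": math.atan2(side_px, side_px),
    }


summary = dataset_summary(FRAMES, ANNOTATIONS, SIDE_PX)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Rounded to the precision used in the text, the recomputed values agree with the reported 69,061 annotations, 4.7 annotations per frame, 0.173 megapixels, a 588-pixel diagonal, and a diagonal angle of 0.785 radians.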

Training Method and Evaluation
The training process for the YOLOv5-s model (based on PyTorch 1.10.0a0 with CUDA support) on the S-BIRD dataset involved a series of steps aimed at achieving the highest precision in detecting sewer blockages. Through the application of transfer learning and fine-tuning techniques, the model's formulation was optimized to suit the specific characteristics of the representative dataset, enabling its effective adaptation for real-world scenarios. To facilitate the training process, annotations for object classes were applied in PyTorch TXT format, as needed. Training was configured for up to 6000 epochs, using the stochastic gradient descent (SGD) optimizer with specified hyperparameters. The training process utilized the configurations listed in Table 1. A DGX-1 (using a 32 GB GPU card) available at UiT, Narvik, running a Docker container with a defined image, served as the training platform, leveraging GPU parallelization for faster computations. Overfitting was mitigated using Early Stopping with a patience of 100 epochs. The training progression concluded at 933 epochs due to a lack of improvement in the last 100 epochs. The most promising results were obtained at epoch 832, leading to the selection of the corresponding model for practical applications. The evaluation metrics are essential for quantifying the model's performance, and they are computed using the following formulas:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
Here, TP-true positive, FP-false positive, FN-false negative, and mAP-mean average precision; AP_i denotes the average precision of class i and N the number of classes.
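The early-stopping behaviour described above (patience of 100 epochs, best result at epoch 832, training ending after 933 epochs) can be illustrated with a small simulation. This is a hedged sketch of a generic patience rule, not the code used in the study: the function name `run_with_early_stopping`, the zero-based epoch indexing, and the synthetic fitness curve are all assumptions made for illustration.

```python
def run_with_early_stopping(fitness_per_epoch, patience):
    """Iterate over per-epoch fitness values and stop once `patience`
    epochs have passed without a new best value.

    Returns (best_epoch, epochs_run), with epochs indexed from 0.
    """
    best_fitness = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, fitness in enumerate(fitness_per_epoch):
        epochs_run = epoch + 1
        if fitness > best_fitness:  # strict improvement resets the counter
            best_fitness, best_epoch = fitness, epoch
        if epoch - best_epoch >= patience:
            break
    return best_epoch, epochs_run


# Synthetic curve (an assumption): fitness rises until epoch 832, then drops.
MAX_EPOCHS = 6000
curve = [float(e) if e <= 832 else 0.0 for e in range(MAX_EPOCHS)]

best_epoch, epochs_run = run_with_early_stopping(curve, patience=100)
print(best_epoch, epochs_run)
```

Under these assumptions, a best epoch of 832 with a patience of 100 ends the run after 933 epochs, which is consistent with the figures reported in the text, well before the configured maximum of 6000.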
During training, at epoch 832, the model exhibited precision (P) and recall (R) values of 94.40% and 93.90%, respectively, across all classes. Notably, Figure 4 illustrates that the developed detection model achieved average precision values of 95.90% for grease blocks, 98.40% for plastic blocks, and 94.50% for tree root blocks. These high values indicate the model's ability to accurately detect and classify instances belonging to these specific classes. The overall mean average precision (mAP) for all classes, as indicated in Table 2, is 96.30% with a confidence threshold of 0.5. This highlights the model's proficiency in making precise detections across all classes within the dataset. Moreover, the mAP calculated over Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 in increments of 0.05 was 79.20%. This demonstrates that the model maintains accurate localization of objects across a broad range of IoU thresholds. The timing results in Table 3 show that the model has efficient inference times, with an average forward time of 0.2 ms, average NMS time of 1.1 ms, and average inference time of 11 ms. These low inference times make the model suitable for real-time applications. The confusion matrix in Figure 5 provides an overview of the model's performance in correctly classifying instances of grease, plastic, and tree roots. This visualization provides a clear breakdown of correct and incorrect classifications for each category.
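The reported metrics can be related to one another with a few lines of Python. The sketch below implements precision, recall, the F1 score, the class-averaged mAP, and the list of IoU thresholds from 0.5 to 0.95 in steps of 0.05; the function names are illustrative, and the only inputs taken from the text are the per-class average precision values and the precision and recall at epoch 832.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values."""
    values = list(ap_per_class.values())
    return sum(values) / len(values)


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds used when averaging mAP over a range of overlaps."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


# Per-class average precision (in percent) as reported in the text.
AP_PERCENT = {"grease": 95.90, "plastics": 98.40, "tree roots": 94.50}

print(round(mean_average_precision(AP_PERCENT), 1))  # class-averaged mAP in percent
print(round(f1_score(0.944, 0.939), 2))              # F1 from the reported P and R
print(iou_thresholds())
```

Averaging the three per-class values gives approximately 96.3%, matching the reported mAP, and the reported precision and recall correspond to an F1 score of about 0.94. The threshold helper yields the ten IoU levels over which the 79.20% figure is averaged.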
Figure 6 shows correlation connections within the frames of the dataset, demonstrating the exact connection between instances and their labels among discrete views. It is also evident that a majority of instances in the dataset are situated towards the outer edges of both the top and bottom sides of the images. This indicates the ability of the trained model to detect and classify multiple objects in various real-world scenarios.
The scatter diagram in Figure 7 displays the instances in the dataset and their corresponding labels. This visualization helps with understanding the distribution of instances across different classes and assists with identifying potential clustering patterns.
The graph in Figure 8 illustrates the relationship between precision (P) and confidence (C), showing how the model's precision changes at different confidence levels and providing insights into the model's ability to make accurate detections at various confidence thresholds.
Figure 11 exhibits the F1 score at a 94% threshold with a confidence level […]. The F1 score considers both precision and recall, making it a valuable metric for model performance.
Figure 13 exhibits the detection outcomes obtained by deploying the trained model on Google Source frames [22-27] as input data. The outcomes include the location of objects and corresponding class labels (tree roots, grease, or plastic) predicted by the model. These results are important because they enable a thorough evaluation of the model's performance and adaptability when dealing with new and diverse data in real-world scenarios. Additionally, the model has been specifically optimized to handle multiple sewer blockages within the same frame, making it highly suitable for real-time detection in various practical situations.
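Handling several blockages in one frame relies on a post-processing step, non-maximum suppression (NMS), whose timing is reported earlier in this section. The following is a minimal pure-Python sketch of greedy, class-wise NMS, assuming detections are given as (x1, y1, x2, y2, score, label) tuples. The function names and the thresholds of 0.25 (confidence) and 0.45 (IoU) are assumptions for illustration and are not stated in the paper.

```python
def iou_xyxy(a, b):
    """IoU of two boxes whose first four entries are (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_threshold=0.25, iou_threshold=0.45):
    """Greedy class-wise NMS over (x1, y1, x2, y2, score, label) tuples.

    Low-confidence detections are dropped first; the remaining ones are
    visited in order of decreasing score, and a detection is kept only if
    it does not overlap an already kept detection of the same class by
    more than `iou_threshold`.
    """
    candidates = sorted(
        (d for d in detections if d[4] >= conf_threshold),
        key=lambda d: d[4],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if all(det[5] != k[5] or iou_xyxy(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections in a single 416 x 416 frame.
frame_detections = [
    (10, 10, 110, 110, 0.90, "grease"),
    (12, 12, 112, 112, 0.80, "grease"),      # near-duplicate of the first box
    (200, 200, 300, 300, 0.70, "plastics"),
    (10, 10, 110, 110, 0.60, "tree roots"),  # same place, different class
    (0, 0, 5, 5, 0.10, "grease"),            # below the confidence threshold
]

for det in nms(frame_detections):
    print(det)
```

In this example the near-duplicate grease box and the low-confidence box are discarded, while the overlapping tree-root box survives because suppression is applied per class, which is one way multiple blockage types can be reported in the same region of a frame.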

Comparing AI-Driven Approach to MOEAs
The AI-driven approach presented in this research offers several advantages over Multi-Objective Evolutionary Algorithms (MOEAs) [28] commonly used in wastewater system management. While MOEAs such as NSGA-II, SPEA2, MOPSO, and MODE are effective at optimizing multiple objectives, they often come with the burden of complex mathematical models and high computational requirements [29,30]. In contrast, the AI approach leverages advanced computer vision and deep learning techniques to detect sewer blockages promptly and accurately. The model achieves a remarkable mean average precision (mAP) of 96.30% at a confidence threshold of 0.5, highlighting its exceptional precision in sewer blockage detection, which in turn enhances the reliability and efficiency of wastewater management systems.
Furthermore, the AI approach relies on labelled training data and lightweight deep learning models, enhancing its efficiency and real-time capabilities. This aligns well with the urgent need to address sewer blockages swiftly and prevent disruptions and overflows. The model's accuracy, speed, and specialized focus on sewer blockage detection make it a highly promising solution for immediate and effective urban wastewater system management. In comparison, MOEAs such as the sensitivity-based adaptive procedure (SAP) [31], optimal control algorithms [32], and novel methodologies [33] have shown efficiency in various aspects of wastewater management, such as sewer rehabilitation and optimal scheduling. However, their computational demands and reliance on complex algorithms might hinder their real-time applicability. The AI-driven approach's ability to process data in real-time, coupled with its high accuracy in detection, gives it a distinct edge for addressing dynamic and critical scenarios like sewer blockages.
Overall, while both AI-driven approaches and MOEAs contribute to the advancement of wastewater management, the AI approach's ability to quickly detect and respond to sewer blockages makes it particularly well-suited for immediate, on-the-ground applications in modern urban sanitation systems.

Conclusions
This research highlights the potential of artificial intelligence, by employing the YOLOv5 single-stage detection model and transfer learning on the critical S-BIRD image dataset in sewer blockage detection. By harnessing the power of AI, we achieved a high precision rate suitable for real-time deployment on resource-constrained mobile devices.
Based on the current work, the following specific conclusions may be made.

• The developed model demonstrated high precision and recall rates, achieving 94.50%, 95.90%, and 98.40% average precision for tree roots, grease, and plastics, respectively. The mean average precision (mAP) reached 96.30% at a confidence threshold of 0.5 and remained at 79.20% across IoU thresholds ranging from 0.5 to 0.95, indicating the model's proficiency in handling different sewer blockage scenarios. The inference times were efficient, making the model suitable for real-time applications. The detection outcomes on Google Source frames further validated the model's adaptability to diverse data.
• The results emphasize the effectiveness of transfer learning and fine-tuning in reducing training time, enhancing performance, and adapting deep neural network models to new contexts.
• The presented model's ability to accurately detect sewer blockages holds promise for its application in modern wastewater management systems. The AI-driven sewer blockage detection system showcased in this research has significant implications for real-world applications, ranging from urban infrastructure management to environmental conservation.
As AI technologies continue to advance, the integration of computer vision and deep learning models will pave the way for more efficient and intelligent solutions in various new domains.

Data Availability Statement:
The research data will be made available on request.