Vision-Based Defect Inspection and Condition Assessment for Sewer Pipes: A Comprehensive Survey

Owing to their economy, safety, and efficiency, vision-based analysis techniques have advanced markedly in recent years and are now widely applied in autonomous construction. Although numerous studies on defect inspection and condition assessment in underground sewer pipelines have emerged, a thorough and comprehensive survey of the latest developments is still lacking. This survey presents a systematic taxonomy of sewer inspection algorithms, sorted into three categories: defect classification, defect detection, and defect segmentation. After reviewing the related sewer defect inspection studies from the past 22 years, the main research trends are organized and discussed in detail according to the proposed technical taxonomy. In addition, the datasets and evaluation metrics used in the cited literature are described and explained. Furthermore, the performance of the state-of-the-art methods is reported in terms of processing accuracy and speed.


Background
Underground sewerage systems (USSs) are a vital part of public infrastructure that contributes to collecting wastewater or stormwater from various sources and conveying it to storage tanks or sewer treatment facilities. A healthy USS with proper functionality can effectively prevent urban waterlogging and play a positive role in the sustainable development of water resources. However, sewer defects caused by different influence factors such as age and material directly affect the degradation of pipeline conditions. It was reported in previous studies that the conditions of USSs in some places are unsatisfactory and deteriorate over time. For example, a considerable proportion (20.8%) of Canadian sewers is graded as poor and very poor. The rehabilitation of these USSs is needed in the following decade in order to ensure normal operations and services on a continuing basis [1]. Currently, the maintenance and management of USSs have become challenging problems for municipalities worldwide due to the huge economic costs [2]. In 2019, a report in the United States of America (USA) estimated that utilities spent more than USD 3 billion on wastewater pipe replacements and repairs, which addressed 4692 miles of pipeline [3].

Defect Inspection Framework
Since it was first introduced in the 1960s [4], computer vision (CV) has become a mature technology that is used to realize promising automation for sewer inspections. In order to meet the increasing demands on USSs, a CV-based defect inspection system is required to identify, locate, or segment the varied defects prior to the rehabilitation process. As illustrated in Figure 1, an efficient defect inspection framework for underground sewer pipelines should cover five stages: (a) data acquisition based on various sensors (CCTV, sonar, or scanner), (b) processing of the collected data, (c) defect inspection with different algorithms (defect classification, detection, and segmentation), (d) risk assessment of the detected defects using image post-processing, and (e) final report generation for the condition evaluation.
In the past few decades, many defect inspection strategies and algorithms have been presented based on CCTV cameras. Manual inspections by humans are inefficient and error-prone, so several studies adopted conventional machine learning (ML) approaches to diagnose defects based on morphological, geometrical, or textural features [14][15][16]. As ML has progressed, deep learning (DL) methods have been widely applied to enhance the overall performance in recent studies on sewer inspections. Previous investigations have reviewed and summarized different kinds of inspections, which mainly include manual inspections [17,18] and automatic inspections based on conventional machine learning algorithms [15,19] and deep learning algorithms [9,20].
In the attempt to evaluate infrastructure conditions, some researchers have developed risk assessment approaches using different image post-processing algorithms [21][22][23]. For instance, a defect segmentation method was proposed to separate cracks from the background, and post-processing was subsequently used to calculate the morphological features of the cracks [22]. In another study, a method based on a fully convolutional network and post-processing was introduced to detect and measure cracks [21]. Nevertheless, the existing risk assessment methods are limited to the feature analysis of cracks only, and there is no further research and exploration of each specific category.
Table 1 lists the major contributions of five survey papers, which considered different aspects of defect inspection and condition assessment in underground sewer pipelines. In 2019, an in-depth survey was presented to analyze different inspection algorithms [24]. However, it only focused on defect detection, and defect segmentation was not involved in this study. Several surveys [7,10,20] were conducted one year later to discuss the previous studies on sewer defects. Moreover, the recent studies associated with image-based construction applications are discussed in [8]. In these relevant surveys, the authors of each paper put efforts into emphasizing a particular area. A more comprehensive review of the latest research on defect inspection and condition assessment is significant for researchers who are interested in integrating the algorithms into real-life sewer applications. In addition, detailed and well-arranged tables of the existing defect inspection methods, organized by category, are not provided in these papers.

Contributions
In order to address the above issues, this study conducts a survey that covers various methods for sewer defect inspection and condition assessment. The main contributions are as follows. This survey provides a comprehensive review of the vision-based algorithms for defect inspection and condition assessment from 2000 to the present. Moreover, we divide the existing algorithms into three categories, which include defect classification, detection, and segmentation. In addition, different datasets and evaluation metrics are summarized. Based on the investigated papers, the research focuses are also discussed.
The rest of this survey is organized as follows. Section 2 presents the methodology used in this survey. Section 3 discusses the image-based defect inspection algorithms that cover classification, detection, and segmentation. Section 4 analyzes the datasets and the evaluation metrics that have been used from 2000 onwards. In Section 5, the challenges and future needs are indicated. Conclusions of previous studies and suggestions for future research are provided in Section 6.

Survey Methodology
A thorough search of the academic studies was conducted by using the Scopus journal database. It automatically arranges the results from different publishers, which include Elsevier, Springer Link, Wiley online library, IEEE Xplore, ASCE Library, MDPI, SACG, preprint, Taylor & Francis Group, and others. Figure 2 shows the distribution of the academic journals reviewed in diverse databases. The journals in the other databases include SPIE Digital Library, Korean Science, Easy Chair, and Nature. In order to highlight the advances in vision-based defect inspection and condition assessment, the papers of these fields that were published between 2000 and 2022 are investigated. The search criterion of this survey is to use an advanced retrieval approach by selecting high-level keywords like ("vision-based sensor" OR "video" OR "image") AND ("automatic sewer inspection" OR "defect classification" OR "defect detection" OR "defect segmentation" OR "condition assessment"). Since there is no limitation on a certain specific construction material or pipe typology, the research on any sewer pipeline that can be entered and that obtained visual data is covered in this survey. Nevertheless, the papers that focus on some topics, which do not relate to the vision-based sewer inspection, are not included in this paper. For example, the quality assessment for sewer images [25], pipe reconstruction, internal pipe structure, wall thickness measurement, and sewer inspections based on other sensors such as depth sensors [26,27], laser scanners [28,29], or acoustic sensors [30,31] are considered irrelevant topics. Figure 3 represents the number of articles including journals and conference papers in different time periods from 2000 to 2022. By manually scanning the title and abstract sections, a total of 124 papers that includes both journals (95) and conferences (29) in English was selected to examine the topic's relevancy. 
In addition to these papers, four books and three websites were also used to construct this survey. After that, the filtered papers were classified in terms of the employed methods and application areas. Finally, the papers in each category were further studied by analyzing their weaknesses and strengths.
Figure 3. Publications in different time periods.
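The keyword combination described above can be written down as a reusable Boolean string. The following is a minimal sketch in Python; the function name `build_query` and the two list names are illustrative assumptions, and any database-specific field syntax is deliberately left out.

```python
def build_query(groups):
    """Join keyword groups: OR within a group, AND between groups."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Keyword groups copied from the search criterion stated in the text.
SENSOR_TERMS = ["vision-based sensor", "video", "image"]
TASK_TERMS = [
    "automatic sewer inspection",
    "defect classification",
    "defect detection",
    "defect segmentation",
    "condition assessment",
]

QUERY = build_query([SENSOR_TERMS, TASK_TERMS])
print(QUERY)
```

Keeping the two groups as separate lists makes it straightforward to add or remove a keyword and regenerate the string.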

Defect Inspection
In this section, several classic algorithms are illustrated, and the research tendency is analyzed. Figure 4 provides a brief description of the algorithms in each category. According to the literature review, the existing studies about sewer inspection are summarized in three tables. Tables 2-4 show the recent studies about defect classification (Section 3.1), detection (Section 3.2), and segmentation (Section 3.3) algorithms. In order to comprehensively analyze these studies, the publication time, title, utilized methodology, advantages, and disadvantages of each study are covered. Moreover, the specific proportion of each inspection algorithm is computed in Figure 5. It is clear that defect classification accounts for the largest share of the investigated studies.

Defect Classification
Due to the recent advancements in ML, both the scientific community and industry have attempted to apply ML-based pattern recognition in various areas, such as agriculture [32], resource management [33], and construction [34]. At present, many types of defect classification algorithms have been presented for both binary and multi-class classification tasks. The commonly used algorithms are described below.

Support Vector Machines (SVMs)
SVMs have become one of the most typical and robust ML algorithms because they are less prone to overfitting than other ML algorithms [35][36][37]. The principal objective of an SVM is to divide the training data into two or more classes by optimizing the classification hyperplane [38,39]. The classification hyperplane equation can be normalized so that a two-dimensional sample set satisfies Equation (1).
$$y_i (w \cdot x_i + b) - 1 \geq 0, \quad i = 1, \ldots, n \qquad (1)$$
where $x_i \in \mathbb{R}^2$ and $y_i \in \{+1, -1\}$ for each of the $n$ training samples; $w$ is the optimal separator and $b$ is the bias. As shown in Figure 6, the circles and triangles indicate two classes of training samples. The optimal hyperplane is represented as H, and the other two parallel hyperplanes are represented as H1 and H2. On the premise of correctly separating the samples, maximizing the margin between the two hyperplanes (H1 and H2) yields the optimal hyperplane (H).
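The normalized separation constraint and the margin between H1 and H2 can be checked numerically. The sketch below is illustrative only: the sample points, the separator `W`, and the bias `B` are made-up values, and the margin width uses the standard result that the distance between the two supporting hyperplanes is 2/||w||.

```python
import math


def functional_margins(samples, w, b):
    """Return y_i * (w . x_i + b) for every 2-D sample (x_i, y_i)."""
    return [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in samples]


def satisfies_constraint(samples, w, b, tol=1e-9):
    """True if every sample lies on or outside its supporting hyperplane."""
    return all(m >= 1.0 - tol for m in functional_margins(samples, w, b))


def margin_width(w):
    """Distance between the parallel hyperplanes H1 and H2."""
    return 2.0 / math.hypot(w[0], w[1])


# Hypothetical toy data: two positive and two negative samples.
SAMPLES = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
W = (0.5, 0.5)  # assumed separator
B = -1.0        # assumed bias

print(satisfies_constraint(SAMPLES, W, B), margin_width(W))
```

Samples whose functional margin equals exactly 1 lie on H1 or H2; these are the support vectors that determine the optimal hyperplane.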
Despite classifying various types of defects with high accuracy, the SVM algorithm cannot be applied to end-to-end classification problems [40]. As demonstrated in [41], Ye et al. established a sewer image diagnosis system in which a variety of image pre-processing algorithms, such as Hu invariant moments [42] and lateral Fourier transform [43], were used for feature extraction, and an SVM was then used as the classifier. The accuracy of the SVM classifier reached 84.1% for seven predefined classes, and the results suggested that the number of training samples is positively correlated with the final accuracy. In addition to this study, Zuo et al. applied an SVM algorithm based on a specific histogram to categorize three different cracks at the sub-class level [11]. Before the classification process, bilateral filtering [44,45] was applied in image pre-processing in order to denoise the input images while keeping the edge information. Their proposed method obtained a satisfactory average accuracy of 89.6%, although it requires a series of algorithms to acquire 2D radius angular features before classifying the defects.

Convolutional Neural Networks (CNNs)
A CNN was first proposed in 1962 [46], and it has demonstrated excellent performances in multiple domains. Due to its powerful generalization ability, CNN-based classifiers that automatically extract features from input images are superior to the classifiers that are based on the pre-engineered features [47]. Consequently, numerous researchers have applied CNNs to handle the defect classification problem in recent years. Kumar et al. presented an end-to-end classification method using several binary CNNs in order to identify the presence of three types of commonly encountered defects in sewer images [48]. In their proposed framework, the extracted frames were inputted into networks that contained two convolutional layers, two pooling layers, two fully connected layers, and one output layer. The classification results achieved high values in terms of average accuracy (0.862), precision (0.877), and recall (0.906), but this work was limited to the classification of ubiquitous defects.
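The accuracy, precision, and recall values quoted above follow from the counts of a binary confusion matrix. A minimal sketch is given below; the function name `binary_metrics` and the example counts are assumptions for illustration and are not taken from [48].

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted defects
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual defects
    return accuracy, precision, recall


# Hypothetical counts for one binary "defect present / absent" classifier.
accuracy, precision, recall = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(accuracy, precision, recall)
```

When one binary CNN is trained per defect type, as in the framework described above, these metrics would be computed per classifier and then averaged.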
Meijer et al. reimplemented the network proposed in [48] and compared the performance on a more realistic dataset introduced in [49]. They used a single CNN to deal with multi-label classification problems, and their classifier outperformed the method presented by Kumar et al. In another work, several image pre-processing approaches, including histogram equalization [50] and morphology operations [51], were used for noise removal. After that, a fine-tuned defect classification model was used to extract informative features from highly imbalanced data [52]. The model architecture was based on the VGG network, which achieved first place in the ILSVRC-2014 [53]. As illustrated in Figure 7, the first 17 layers of the model are frozen and the remaining layers are trainable; in addition, two convolutional layers and one batch normalization layer were added to enhance the robustness of the modified network.
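Histogram equalization, one of the pre-processing steps mentioned above, can be sketched in a few lines. The version below is the textbook cumulative-distribution mapping for 8-bit grayscale values stored in a flat list; it is an illustrative assumption and not the implementation used in the cited work.

```python
def equalize_histogram(pixels, levels=256):
    """Spread grayscale intensities over the full range via the CDF."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:  # constant image: nothing to stretch
        return list(pixels)

    # Look-up table mapping each old intensity to its equalized value.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (n - cdf_min))) for c in cdf]
    return [lut[p] for p in pixels]


print(equalize_histogram([10, 20, 30, 40]))
```

A real pipeline would apply the same mapping to a 2D image array; the flat list keeps the sketch dependency-free.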

Defect Detection
Unlike classification algorithms, which merely assign each defect a class type, object detection locates and classifies objects among the predefined classes using rectangular bounding boxes (BBs) and confidence scores (CSs). In recent studies, object detection technology has been increasingly applied in several fields, such as intelligent transportation [75][76][77], smart agriculture [78][79][80], and autonomous construction [81][82][83]. Generic object detectors are divided into one-stage and two-stage approaches. The classic one-stage detectors, which are based on regression, include YOLO [84], SSD [85], CornerNet [86], and RetinaNet [87]. The two-stage detectors, which are based on region proposals, include Fast R-CNN [88], Faster R-CNN [89], and R-FCN [90]. In this survey, the one-stage and two-stage methods that were employed in sewer inspection studies are both discussed and analyzed as follows.
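The output of a detector, as described above, is a set of labeled boxes with confidence scores. The sketch below shows one possible representation and a simple confidence filter; the `Detection` class, the 0.5 threshold, and the example labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str    # predicted defect class
    score: float  # confidence score (CS) in [0, 1]
    box: tuple    # bounding box (BB) as (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections, threshold=0.5):
    """Drop low-confidence detections and sort the rest by score."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical detections for a single CCTV frame.
DETECTIONS = [
    Detection("deposit", 0.61, (200, 40, 260, 120)),
    Detection("root", 0.35, (15, 15, 60, 80)),
    Detection("crack", 0.92, (10, 20, 110, 60)),
]

for det in keep_confident(DETECTIONS):
    print(det.label, det.score, det.box)
```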

You Only Look Once (YOLO)
YOLO is a one-stage algorithm that maps directly from image pixels to BBs and class probabilities. In [84], object detection was addressed as a single regression problem using a simple and unified pipeline. Due to its advantages of robustness and efficiency, an updated version of YOLO, which is called YOLOv3 [91], was explored to locate and classify defects in [9]. YOLOv3 outperformed the previous YOLO algorithms in regard to detecting the objects with small sizes because the YOLOv3 model applies a 3-scale mechanism that concatenates the feature maps of three scales [92,93]. Figure 8 illustrates how the YOLOv3 architecture implements the 3-scale prediction operation. The prediction result with a scale of 13 × 13 is obtained in the 82nd layer by down-sampling and convolution operations. Then, the result in the 79th layer is concatenated with the result of the 61st layer after up-sampling, and the prediction result with 26 × 26 is generated after several convolution operations. The result of 52 × 52 is generated at layer 106 using the same method. The predictions at different scales have different receptive fields that determine the appropriate sizes of the detection objects in the image. As a result, YOLOv3 with a 3-scale mechanism is capable of detecting more fine-grained features.
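The three grid sizes mentioned above follow directly from the down-sampling strides of the network. The sketch below assumes a 416 × 416 input, strides of 32, 16, and 8, and three anchor boxes per grid cell, as in the standard YOLOv3 configuration; the input size is not stated in the text and is inferred from the 13 × 13 grid.

```python
STRIDES = (32, 16, 8)   # down-sampling factor at each prediction scale
ANCHORS_PER_SCALE = 3   # assumed number of anchor boxes per grid cell


def grid_sizes(input_size=416):
    """Return the side length of the prediction grid at each scale."""
    if input_size % max(STRIDES) != 0:
        raise ValueError("input size must be a multiple of the largest stride")
    return [input_size // stride for stride in STRIDES]


def total_predictions(input_size=416):
    """Count the candidate boxes produced over all three scales."""
    return sum(ANCHORS_PER_SCALE * g * g for g in grid_sizes(input_size))


print(grid_sizes())         # coarse to fine grids
print(total_predictions())  # candidate boxes before filtering
```

The finest grid (52 × 52) contributes most of the candidate boxes, which is consistent with the improved sensitivity to small objects noted above.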
Based on the detection model developed by [9], a video interpretation algorithm was proposed to build an autonomous assessment framework for sewer pipelines [94]. The assessment system verified how the defect detector can be put to use in realistic infrastructure maintenance and management. A total of 3664 images extracted from 63 videos were used to train the YOLOv3 model, which achieved a high mean average precision (mAP) of 85.37% for seven defects and also obtained a fast detection speed for real-time applications.

Single Shot Multibox Detector (SSD)
Similarly, another end-to-end detector named SSD was first introduced for multiple object classes in [85]. Several experiments were conducted to analyze the detection speed and accuracy based on different public datasets. The results suggest that the SSD model (input size: 300 × 300) obtained faster speed and higher accuracy than the YOLO model (input size: 448 × 448) on the VOC2007 test. As shown in Figure 9, the SSD method first extracts features in the base network (VGG16 [53]). It then predicts the fixed-size bounding boxes and class scores for each object instance using a feed-forward CNN [95]. After that, a non-maximum suppression (NMS) algorithm [96] is used to refine the detections by removing the redundant boxes.
Moreover, the SSD method was utilized to detect defects in CCTV images in a condition assessment framework [97]. Several image pre-processing algorithms were used to enhance the input images prior to the feature extraction process. Then three state-of-the-art (SOTA) DL-based detectors (YOLOv3 [91], SSD [85], and Faster R-CNN [89]) were tested and compared on the same dataset. Finally, the defect severity was rated from different aspects in order to assess the pipe condition. Among the three experimental models, YOLOv3 obtained a relatively balanced performance between speed and accuracy. The SSD model achieved the fastest speed (33 ms per image), indicating the feasibility of real-time defect detection. However, the detection accuracy of SSD was the lowest, which was 28.6% lower than the accuracy of Faster R-CNN.
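The NMS step used by SSD to remove redundant boxes can be sketched as a greedy procedure. The version below is a minimal illustration under stated assumptions: boxes are (x_min, y_min, x_max, y_max) tuples, the overlap measure is intersection over union (IoU), and the 0.5 threshold is an arbitrary example value.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of the retained boxes, highest score first


# Hypothetical boxes: the first two cover the same defect.
BOXES = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
SCORES = [0.9, 0.8, 0.7]
print(nms(BOXES, SCORES))
```

In multi-class detection this procedure is usually run separately for each class, so that overlapping boxes of different defect types do not suppress each other.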

Faster Region-Based CNN (Faster R-CNN)
The Faster R-CNN model was introduced to first produce candidate BBs and then refine the generated BB proposals [89]. Figure 10 shows the architecture of the Faster R-CNN developed by [98] in a defect detection system. First, the multiple CNN layers in the base network were used for feature extraction. Then, the region proposal network (RPN) created numerous proposals based on the generated feature maps. Finally, these proposals were sent to the detector for further classification and localization. Compared with the one-stage frameworks, the region proposal-based methods require more time in handling different model components. However, the Faster R-CNN model that trains the RPN and the Fast R-CNN detector separately is more accurate than other end-to-end training models, such as YOLO and SSD [99]. As a result, Faster R-CNN was explored in many studies for more precise detection of sewer defects.
In [98], 3000 CCTV images were fed into the Faster R-CNN model, and the trained model was then utilized to detect four categories of defects. This research indicated that the data size, training scheme, network structure, and hyper-parameters are important factors for the final detection accuracy. The results show that the modified model achieved a high mAP of 83%, which was 3.2% higher than the original model. In another work [99], a defect tracking framework was first built by using a Faster R-CNN detector and learning discriminative features. In the defect detection process, a mAP of 77% was obtained for detecting three defects. At the same time, the metric learning model was trained to reidentify defects. Finally, the defects in CCTV videos were tracked based on detection information and learned features.
Table fragment (entry for [111]). Advantages: does not require prior knowledge; can generalize well with limited parameters. Disadvantage: the robustness and efficiency can be improved for real-world applications.
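The mAP values reported for the detectors above are averages of per-class average precision (AP). The sketch below computes AP as the area under the precision-recall curve after taking the monotone precision envelope; this interpolation scheme is one common convention and is an assumption here, since the cited studies may use a different variant. Each detection is a (score, is_true_positive) pair, and matching detections to ground truth is assumed to have been done beforehand.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class from ranked (score, is_true_positive) pairs."""
    if num_ground_truth == 0 or not detections:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """mAP over a dict: class name -> (detections, num_ground_truth)."""
    aps = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical results for two defect classes.
PER_CLASS = {
    "crack": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "root": ([(0.9, True)], 1),
}
print(mean_average_precision(PER_CLASS))
```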

Defect Segmentation
Defect segmentation algorithms can predict defect categories and pixel-level location information with exact shapes. This capability is becoming increasingly significant for research on sewer condition assessment because it allows the exact defect attributes to be recorded and the specific severity of each defect to be analyzed. Earlier segmentation methods were mainly based on mathematical morphology [112,113]. However, the morphology segmentation approaches were inefficient compared to the DL-based segmentation methods. As a result, the defect segmentation methods based on DL have been recently explored in various fields. The studies related to sewer inspection are described as follows.

Morphology Segmentation
Morphology-based defect segmentation comprises many methods, such as the closing bottom-hat operation (CBHO), the opening top-hat operation (OTHO), and morphological segmentation based on edge detection (MSED). By evaluating and comparing the segmentation performances of different methods, MSED was verified as being useful for detecting cracks, and OTHO was verified as being useful for detecting open joints [113]. The authors also indicated that removing the text on the CCTV images is necessary to further improve the detection accuracy. Similarly, MSED was applied to segment eight categories of typical defects, and it outperformed another popular approach called OTHO [112]. In addition, some morphology features, including area, axis length, and eccentricity, were also measured, which is of great significance in assisting inspectors in judging and assessing defect severity. Although the morphology segmentation methods showed good segmentation results, they need multiple image pre-processing steps before the segmentation process.
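The top-hat and bottom-hat operations named above are built from grayscale erosion and dilation. The following sketch uses a square structuring element on a small list-of-lists image; the 3 × 3 element, the clipped border handling, and the toy image are assumptions for illustration and do not reproduce the settings of [112,113].

```python
def _filter(image, reducer, radius):
    """Apply min (erosion) or max (dilation) over a clipped square window."""
    h, w = len(image), len(image[0])
    return [
        [
            reducer(
                image[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
            )
            for c in range(w)
        ]
        for r in range(h)
    ]


def erode(image, radius=1):
    return _filter(image, min, radius)


def dilate(image, radius=1):
    return _filter(image, max, radius)


def top_hat(image, radius=1):
    """Image minus its opening: keeps small bright structures."""
    opened = dilate(erode(image, radius), radius)
    return [[p - o for p, o in zip(row, orow)] for row, orow in zip(image, opened)]


def bottom_hat(image, radius=1):
    """Closing minus the image: keeps small dark structures."""
    closed = erode(dilate(image, radius), radius)
    return [[c - p for p, c in zip(row, crow)] for row, crow in zip(image, closed)]


# Hypothetical 5 x 5 patch: bright pipe wall (200) with a thin dark line (50).
IMAGE = [[200, 200, 50, 200, 200] for _ in range(5)]
for row in bottom_hat(IMAGE):
    print(row)
```

In this toy patch the bottom-hat response is non-zero only along the thin dark line, while the top-hat response is zero everywhere because the patch contains no small bright structures.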

Semantic Segmentation
Automatic localization of the shape and boundary of sewer defects was first proposed by Wang et al. using a semantic segmentation network called DilaSeg [114]. In order to improve the segmentation accuracy, an updated network named DilaSeg-CRF was introduced by combining the CNN with a dense conditional random field (CRF) [115,116]. Their updated network improved the segmentation accuracy considerably in terms of the mean intersection over union (mIoU), but its reliance on a single data feature and its complicated training process indicate that DilaSeg-CRF is not suitable for real-life applications.
Recently, the fully convolutional network (FCN) has been explored for the pixels-to-pixels segmentation task [117][118][119][120]. Meanwhile, other network architectures that are similar to an FCN have emerged in large numbers, including U-Net [121]. Pan et al. proposed a semantic segmentation network called PipeUNet, in which the U-Net was used as the backbone due to its fast convergence speed [122]. As shown in Figure 11, the encoder and decoder on both sides form a symmetrical architecture. In addition, three FRAM blocks were added before the skip connections to improve the ability of feature extraction and reuse. The focal loss was also shown to be useful for handling the imbalanced data problem (IDP). Their proposed PipeUNet achieved a high mIoU of 76.3% and a fast speed of 32 frames per second (FPS).
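To indicate why a focal loss helps with imbalanced data, the following minimal sketch evaluates the standard binary focal loss for a single pixel. It is not the PipeUNet implementation: the function names are hypothetical, and the values gamma = 2 and alpha = 0.25 are the commonly used defaults, assumed here because the settings of [122] are not given in this survey.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one pixel; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    p_t = max(p_t, 1e-7)  # avoid log(0)
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one pixel: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed defaults, not values taken from the cited work.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-7)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy pixel (confident and correct) versus a hard pixel (confident and wrong).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
# How much more the hard pixel weighs relative to the easy one under each loss.
focal_ratio = hard / easy
ce_ratio = cross_entropy(0.1, 1) / cross_entropy(0.9, 1)
```

The modulating factor (1 - p_t)^gamma shrinks the contribution of well-classified pixels, so the many easy background pixels dominate the total loss far less than under plain cross-entropy; with gamma = 0 and alpha = 0.5 the expression reduces to half of the cross-entropy.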

Dataset and Evaluation Metric
The performance of every algorithm is tested and reported on a specific dataset using specific metrics. As a result, datasets and protocols are the two primary determining factors in the algorithm evaluation process. The evaluation results are not convincing if the dataset is not representative or the metric used is poorly chosen. It is challenging to judge which method is the SOTA because the existing methods in sewer inspection utilize different datasets and protocols. Therefore, benchmark datasets and standard evaluation protocols need to be provided for future studies.

Dataset Collection
Currently, many robotic data collection systems have emerged that are capable of assisting workers with sewer inspection and spot repair. Table 5 lists the latest advanced robots along with their respective information, including the robot's name, company, pipe diameter, camera feature, country, and main strong points. Figure 12 introduces several representative robots that are widely utilized to acquire images or videos from underground infrastructures. As shown in Figure 12a, LETS 6.0 is a versatile and powerful inspection system that can be quickly set up to operate in pipes of 150 mm or larger. A representative product (Robocam 6) of the Korean company TAP Electronics is shown in Figure 12b. Robocam 6 is presented as a model that increases the inspection performance without the considerable cost of replacing the equipment. Figure 12c shows the X5-HS robot that was developed in China, which is a typical robotic crawler with a high-definition camera. In Figure 12d, Robocam 3000, sold by Japan, is the only large-scale system that is specially devised for inspecting pipes ranging from 250 mm to 3000 mm. Previously, it was unrealistic to apply a crawler in such huge pipelines in Korea.

Benchmarked Dataset
Open-source sewer defect data are necessary for academia to promote fair comparisons in automatic multi-defect classification tasks. In this survey, a publicly available benchmark dataset called Sewer-ML [125] for vision-based defect classification is introduced. The Sewer-ML dataset, acquired from Danish companies, contains 1.3 million images labeled by experienced sewer experts. Figure 13 shows some sample images from the Sewer-ML dataset, and each image includes one or more classes of defects. The recorded text in the images was redacted using a Gaussian blur kernel to protect private information. In addition, detailed information on the datasets used in recent papers is given in Table 6. This paper summarizes 32 datasets from different countries, of which the USA has 12 datasets, accounting for the largest proportion. The largest dataset contains 2,202,582 images, whereas the smallest dataset has only 32 images. Since the images were acquired by various types of equipment, the collected images have varied resolutions ranging from 64 × 64 to 4000 × 46,000.

Evaluation Metric
The reported performances are ambiguous and unreliable without a suitable metric. In order to present a comprehensive evaluation, numerous metrics have been proposed in recent studies. Detailed descriptions of the different evaluation metrics are given in Table 7. Table 8 presents the performances of the investigated algorithms on different datasets in terms of different metrics.

Table 7. Overview of the evaluation metrics in the recent studies.

| Metric | Description | Ref. |
|---|---|---|
| Precision | The proportion of positive samples in all positive prediction samples | [9] |
| Recall | The proportion of positive prediction samples in all positive samples | [48] |
| Accuracy | The proportion of correct prediction in all prediction samples | [48] |
| F1-score | Harmonic mean of precision and recall | [69] |
| FAR | False alarm rate in all prediction samples | [57] |
| True accuracy | The proportion of all predictions excluding the missed defective images among the entire actual images | [58] |
| AUROC | Area under the receiver operator characteristic (ROC) curve | [49] |
| AUPR | Area under the precision-recall curve | [49] |
| mAP | mAP first calculates the average precision values for different recall values for one class, and then takes the average of all classes | [9] |
| Detection rate | The ratio of the number of the detected defects to total number of defects | [106] |
| Error rate | The ratio of the number of mistakenly detected defects to the number of non-defects | [106] |
| PA | Pixel accuracy calculating the overall accuracy of all pixels in the image | [116] |
| mPA | The average of pixel accuracy for all categories | [116] |
| mIoU | The ratio of intersection and union between predictions and GTs | [116] |
| fwIoU | Frequency-weighted IoU measuring the mean IoU value weighing the pixel frequency of each class | [116] |

As shown in Table 8, accuracy is the most commonly used metric in the classification tasks [41,48,52,54,56–58,61,65–67,69–71,73]. In addition to this, other subsidiary metrics such as precision [11,48,67,69,73], recall [11,48,57,67,69,73], and F1-score [69,73] are also well supported. Furthermore, AUROC and AUPR are calculated in [49] to measure the classification results, and FAR is used in [57,58] to check the false alarm rate in all the predictions. In contrast to classification, mAP is a principal metric for detection tasks [9,98,99,105,110]. In another study [97], precision, recall, and F1-score are reported in conjunction to provide a comprehensive estimation for defect detection. Heo et al. [106] assessed the model performance based on the detection rate and the error rate. Kumar and Abraham [108] report the average precision (AP), which is similar to the mAP but for each class. For the segmentation tasks, the mIoU is considered as an important metric that is used in many studies [116,122]. Apart from the mIoU, the per-class pixel accuracy (PA), mean pixel accuracy (mPA), and frequency-weighted IoU (fwIoU) are applied to evaluate the segmented results at the pixel level.
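Several of the metrics in Table 7 can be computed directly from confusion counts. The sketch below is a minimal illustration under stated assumptions: the function names, the example counts, and the 2 × 2 confusion matrix are hypothetical, PA is computed as the overall pixel accuracy following the description in Table 7, and the segmentation metrics follow the usual definitions in which `conf[i][j]` counts pixels of ground-truth class i predicted as class j.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1-score from binary confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def segmentation_metrics(conf):
    """PA, mPA, mIoU, and fwIoU from a square pixel-level confusion matrix.

    conf[i][j] = number of pixels whose ground truth is class i and whose
    prediction is class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    gt = [sum(conf[i]) for i in range(n)]                          # pixels per true class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]   # pixels per predicted class
    diag = [conf[i][i] for i in range(n)]                          # correctly labeled pixels

    pa = sum(diag) / total
    # Per-class accuracy, skipping classes absent from the ground truth.
    class_acc = [diag[i] / gt[i] for i in range(n) if gt[i] > 0]
    # Per-class IoU = intersection / union, skipping classes with an empty union.
    iou = {i: diag[i] / (gt[i] + pred[i] - diag[i])
           for i in range(n) if gt[i] + pred[i] - diag[i] > 0}

    mpa = sum(class_acc) / len(class_acc)
    miou = sum(iou.values()) / len(iou)
    # Weight every class IoU by the share of ground-truth pixels of that class.
    fwiou = sum(gt[i] / total * iou[i] for i in iou)
    return {"PA": pa, "mPA": mpa, "mIoU": miou, "fwIoU": fwiou}


# Hypothetical counts: 8 true positives, 2 false positives, 2 misses, 88 true negatives.
cls = classification_metrics(tp=8, fp=2, fn=2, tn=88)
# Hypothetical two-class matrix (background, defect) over 100 pixels.
seg = segmentation_metrics([[50, 10], [5, 35]])
```

Because the background class usually dominates sewer images, PA and fwIoU tend to be higher than mPA and mIoU for the same prediction, which is one reason the mIoU is preferred when comparing segmentation models.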

Challenges and Future Work
This part first discusses the main challenges in recent studies and then indicates some potential methods to address these difficulties in future work. Since the few existing surveys have covered these limitations only partially, a more complete summary of the existing challenges and future research directions is presented in this survey.

Data Analysis
During the data acquisition process, vision-based techniques such as the traditional CCTV are the most popular because of their cost-effective characteristics. Nevertheless, it is challenging for the visual equipment to inspect all the defects whenever they are below the water level or behind pipelines. As a result, the progress in hybrid devices has provided a feasible approach to capture defects that are otherwise inaccessible [126]. For example, the SSET methods [31,127,128] have been applied to collect quality data and evaluate the detected defects that are hard to deduce based on the visual data. In addition, the existing sewer inspection studies mainly focus on concrete pipeline structures. The inspection and assessment of traditional masonry sewer systems, which are still ubiquitous in most European cities, are cumbersome for inspectors in practice. To address this issue, several automated diagnostic techniques (CCTV, laser scanning, ultrasound, etc.) for brick sewers are analyzed and compared in detail by enumerating their specific advantages and disadvantages [129,130]. Furthermore, the varied qualities of the existing datasets under distinct conditions and discontinuous backgrounds require image preprocessing prior to the inspection process to enhance the image quality and then improve the final performance [131].
Moreover, the current work concentrates on structural defects such as cracks, joints, breaks, surface damage, lateral protrusion, and deformation, whereas less attention is paid to operation and maintenance defects (roots, infiltration, deposits, debris, and barriers). As mentioned in Section 4.1.2, 32 datasets are investigated in this survey. Figure 14 shows the previous studies on sewer inspections of different classes of defects. We listed 12 classes of common defects in underground sewer pipelines. In addition to this, other defects that are rare or at the sub-class level are also included. According to the statistics for common defects, the proportion of structural defects (50.5%) is 20.3 percentage points higher than the proportion of operation and maintenance defects (30.2%), which indicates that future research needs more available data for operation and maintenance defects.

Model Analysis
Although defect severity analysis methods have been proposed in several papers in order to assess the risk of the detected cracks, approaches for the analysis of other defects are limited. For cracks, the risk levels can be assessed by measuring morphological features such as the crack length, mean width, and area to judge the corresponding severity degree. In contrast, it is difficult to comprehensively analyze the severity degrees of other defects because only the defect area is available for them. Therefore, researchers should explore more features that are closely related to the defect severity, which is significant for further condition assessment.
In addition, most defect inspection models rely on supervised learning methods, which require a time-consuming manual annotation process for training [10]. Completely automated systems that include automatic labeling tools need to be developed for more efficient sewer inspections. On the other hand, most of the inspection approaches that demand long processing times are tested only on still images, so these methods cannot be used in real-time applications for live inspections. More efforts should be made in future research to boost the inference speed on CCTV sewer videos.
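The morphological features mentioned above for crack severity can be approximated from a binary segmentation mask. The sketch below is only a rough illustration and not a method from the cited studies: the function name is hypothetical, the axis lengths are those of the ellipse with the same second central moments as the defect region, and using the major axis as the crack length and area divided by length as the mean width are crude proxies assumed here, which would be inadequate for curved or branching cracks.

```python
import math


def mask_features(mask):
    """Area, equivalent-ellipse axis lengths, eccentricity, and a mean-width proxy.

    mask is a 2D list of 0/1 values with at least one defect pixel.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    area = len(pts)
    if area == 0:
        raise ValueError("mask contains no defect pixels")

    # Centroid and second central moments of the pixel coordinates.
    cx = sum(x for x, _ in pts) / area
    cy = sum(y for _, y in pts) / area
    sxx = sum((x - cx) ** 2 for x, _ in pts) / area
    syy = sum((y - cy) ** 2 for _, y in pts) / area
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / area

    # Eigenvalues of the symmetric 2 x 2 moment matrix [[sxx, sxy], [sxy, syy]].
    mid = (sxx + syy) / 2.0
    half = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1, lam2 = mid + half, max(mid - half, 0.0)

    # A filled ellipse with semi-axis a has variance a^2 / 4 along that axis,
    # so the full axis length is 4 * sqrt(variance).
    major = 4.0 * math.sqrt(lam1)
    minor = 4.0 * math.sqrt(lam2)
    eccentricity = math.sqrt(1.0 - lam2 / lam1) if lam1 > 0 else 0.0
    # Assumed proxy: mean width = area / length, with length taken as the major axis.
    mean_width = area / major if major > 0 else float(area)
    return {"area": area, "major_axis": major, "minor_axis": minor,
            "eccentricity": eccentricity, "mean_width": mean_width}


# Hypothetical mask: a straight, one-pixel-wide horizontal crack of five pixels.
crack = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
features = mask_features(crack)
```

An elongated region such as this crack yields an eccentricity close to 1, whereas a compact region such as a deposit patch yields a value closer to 0, which suggests how shape descriptors beyond the area might help to characterize defects other than cracks.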

Conclusions
Vision-based automation in construction has attracted increasing interest from researchers in different fields, especially image processing and pattern recognition. The main outcomes of this paper include (1) an exhaustive review of diverse research approaches presented in more than 120 studies through a scientific taxonomy, (2) an analytical discussion of various algorithms, datasets, and evaluation protocols, and (3) a compendious summary of the existing challenges and future needs. Based on the current research situation, this survey outlines several suggestions that can facilitate future research on vision-based sewer inspection and condition assessment. Firstly, classification and detection have been topics of great interest in the past several decades and have attracted the attention of many researchers. Compared with them, defect segmentation at the pixel level is a more significant task for assisting sewer inspectors in evaluating the risk level of the detected defects. However, it accounts for the lowest proportion of the overall studies. Hence, automatic defect segmentation should be given greater focus considering its research significance. Secondly, we suggest that a public dataset and source code be created to support replicable research in the future. Thirdly, the evaluation metrics should be standardized for a fair performance comparison. Since this review presents clear guidelines for subsequent research by analyzing the current studies, we believe it is of value to readers and practitioners who are concerned with sewer defect inspection and condition assessment.