AI-Based Damage Risk Prediction Model Development Using Urban Heat Transport Pipeline Attribute Information

Lee, Sungyeol; Kang, Jaemo; Kim, Jinyoung; Kong, Myeongsik

doi:10.3390/app15148003

Department of Geotechnical Engineering Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 8003; https://doi.org/10.3390/app15148003

Submission received: 30 May 2025 / Revised: 1 July 2025 / Accepted: 16 July 2025 / Published: 18 July 2025

(This article belongs to the Special Issue Artificial Intelligence for Structural Health Monitoring, Inspection, Maintenance, and Rehabilitation of Civil Infrastructure)

Abstract

This study analyzed the probability of damage in heat transport pipelines buried in urban areas using pipeline attribute information and damage history data and developed an AI-based predictive model. A dataset was constructed by collecting spatial and attribute data of pipelines and defining basic units according to specific standards. Damage trends were analyzed based on pipeline attributes, and correlation analysis was performed to identify influential factors. These factors were applied to three machine learning algorithms: Random Forest, eXtreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). The model with optimal performance was selected by comparing evaluation indicators including the F2-score, accuracy, and area under the curve (AUC). The LightGBM model trained on data from pipelines in use for over 20 years showed the best performance (F2-score = 0.804, AUC = 0.837). This model was used to generate a risk map visualizing the probability of pipeline damage. The map can aid in the efficient management of urban heat transport systems by enabling preemptive maintenance in high-risk areas. Incorporating external environmental data and auxiliary facility information in future models could further enhance reliability and support the development of a more effective maintenance decision-making system.

Keywords: heat transport pipelines; damage probability prediction; risk mapping; pipeline infrastructure management; machine learning

1. Introduction

Heat transport pipelines buried in urban areas carry hot water under high pressure for district heating. A leak caused by pipeline damage reduces heating efficiency and can also lead to casualties and property damage, because the outflowing water may cause the ground above the pipeline to collapse. Efforts have therefore been made to identify pipeline damage using equipment such as infrared thermal imaging cameras [1,2,3]. However, inspections that rely on such equipment are restricted by weather conditions and geographical factors, and efficient inspection is difficult because limited equipment must cover a wide area. It is therefore necessary to prioritize maintenance in areas with high damage risk [4,5]. To support such decision-making, this study proposes an AI-based model that predicts the damage probability of heat transport pipelines using spatial, attribute, and historical damage data. The goal is to enable efficient, data-driven maintenance prioritization.

Because damage to, and leakage from, heat transport pipelines may cause loss of thermal energy and safety accidents, diagnosis and prediction technologies have been studied. Leak diagnosis systems have been developed based on various image and sensor data, including infrared thermal imaging cameras, acoustic sensors, and distributed temperature measurements [6,7]. Leak detection technologies have also been developed by analyzing changes in pipeline data, such as pressure and flow rate, through simulation models [8]. To predict damage to heat transport pipelines, it is important to select the factors that affect damage and to analyze them systematically. Heat transport pipelines with a high number of damage cases have been selected through statistical analysis and their characteristics analyzed, and aging, pressure, temperature, terrain, and installation conditions have been reported as major factors that cause damage [9]. A statistically based damage risk prediction model has also been developed using operational data from heat transport pipelines currently in service in the Republic of Korea [10]. The deterioration and aging mechanisms of pipelines have been analyzed, and methods for applying AI techniques have been proposed to predict the service life of heat transport pipelines from the factors that cause damage [11]. Studies have also used sensor data (temperature, pressure, flow rate, and vibration) collected from heat transport pipelines to predict potential anomalies through AI-based approaches [12].

As described above, various techniques have been studied to predict damage to heat transport pipelines and prevent leakage. However, owing to the nature of urban heat transport pipelines, it is difficult to secure a large amount of data. It is also difficult to develop damage prediction models using AI techniques because damage records account for a very small proportion of the data. Owing to these constraints, prior research has been largely confined to AI-based anomaly detection using installed sensor data and statistical analyses based on limited attribute information. To overcome these limitations and enhance maintenance decision-making, this study developed a machine learning-based damage probability prediction model by segmenting the heat transport pipelines currently operated in the Republic of Korea into analytical units, using their spatial and attribute data.

In this study, the spatial and attribute information of heat transport pipelines and their damage history data were collected. Based on the spatial and attribute information, the pipelines were divided into basic units according to defined standards. Damage probability by attribute was analyzed using the data for each basic unit, and the factors that affect damage were identified through correlation analysis. A dataset was constructed by selecting the influential factors derived from the correlation analysis as input data and the combined heat transport pipeline data, classified by damage status and service life (over 20, 25, and 30 years), as output data. The dataset was applied to three machine learning (ML) algorithms, Random Forest (RF), eXtreme Gradient Boosting (XGB), and Light Gradient Boosting Machine (LGBM), to develop models. The model with the best performance was selected by comparing the evaluation indicators of each model. Finally, the damage probability values predicted by the selected model were visualized for part of the target area to support efficient decision-making for heat transport pipeline maintenance.

2. Research Method and Data

2.1. Research Method

In this study, 324,495 data points composed of spatial and attribute information and 2293 data points containing damage history were collected for heat transport pipelines buried for district heating in urban areas of South Korea. The attribute information includes the operator, pipe function (supply or return pipe), pipe diameter, completion year, sensor wire condition, pressure, and auxiliary facilities. Most of these data, however, contain many missing or unreliable values. Therefore, the operator, pipe function, pipe diameter, completion year, and sensor wire condition were selected as usable variables. Because the collected pipeline data had not been constructed according to consistent standards, a method for defining basic units was established and the data were divided accordingly. With the damage history assigned to basic units, the damage tendency of heat transport pipelines was examined by analyzing damage probability for each attribute. In addition, correlation analysis was conducted to select factors that have a statistically significant relationship with damage. A dataset for AI model development was constructed by selecting the influential factors identified through this analysis as input data and heat transport pipelines with damage history as output data. The output data, i.e., pipelines with damage history, represent a very small proportion (approximately 0.75%) of the entire dataset. Thus, the output data were reinforced by adding pipelines that had been in use for over 20, 25, and 30 years, which have high damage risk.

To develop a model for predicting heat transport pipeline damage probability, the entire dataset was divided into training, validation, and test data at a ratio of 64:16:20. That is, 20% of the data were held out as test data, and the remaining 80% were split 80:20 into training data (64% of the total) and validation data (16% of the total). The ML algorithms (RF, XGB, and LGBM) were trained on the training data, the generalization performance of the models was verified using the validation data, and final performance was evaluated on the test data, which did not participate in learning. The model with the best performance was then selected after hyperparameter tuning. Accuracy and F2-score were selected as indicators for evaluating model performance. In addition, overfitting was assessed using the AUC curves of the training and test data. Finally, the importance of the variables used for heat transport pipeline damage classification was derived from the developed model, and decision-making measures for efficient maintenance were presented by visualizing the heat transport pipelines in part of the target area using the damage probability predicted by the model.
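The 64:16:20 split can be reproduced with two successive 80:20 splits. The following is a minimal sketch, assuming Scikit-learn's `train_test_split`; the placeholder data, the stratification, and the random seed are assumptions made for illustration and are not stated in the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000                                  # hypothetical number of basic units
X = rng.normal(size=(n, 4))               # four placeholder input features
y = (rng.random(n) < 0.3).astype(int)     # placeholder binary labels (1 = risk)

# Step 1: hold out 20% of all data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Step 2: split the remaining 80% again at 80:20,
# giving 64% training and 16% validation data overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=0)

sizes = (len(X_train), len(X_val), len(X_test))
```

Stratifying both splits keeps the share of the rare risk class similar across the three subsets, which matters when the positive class is small.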

2.2. Heat Transport Pipeline Data

2.2.1. Data Types

To develop a model for predicting heat transport pipeline damage probability from attribute information, the location information of heat transport pipelines, their components, and the attribute information of each component were collected from the operators in the target area. The components include the equipment history and maintenance history of facilities, as well as topographical information that interacts with the facilities. The attribute information of each component refers to data that represent its characteristics (e.g., location, length, and area). In this study, the equipment history, attribute information, and spatial information of heat transport pipelines were collected. The geometric form of the pipelines was examined using GIS location information, and individual pipelines were identified through their identification numbers. The regional characteristics of the pipelines were investigated through the operator codes. The pipe function distinguishes pipes that supply heated water for thermal energy transfer from pipes that return it. The sensor wire condition indicates the state of the detection system that monitors damage to, and leakage from, heat transport pipelines. Key management sections are sections that the operators judge to have a high likelihood of damage. Table 1 shows the characteristics of heat transport pipeline attribute information.

2.2.2. Basic Unit Setting

The collected attribute information of heat transport pipelines is classified according to each operator's own standards. To improve the efficiency of damage probability prediction, however, the dataset must be built from basic units defined by consistent standards. Therefore, basic units were set based on the attribute information of heat transport pipelines. Preprocessing was performed to prepare the data required to set the basic units. Most of the data did not contain missing or abnormal values, but data discretization was performed for attribute information that included some missing and abnormal values. The basic units were set considering priorities according to four standards. Through basic unit setting, a total of 324,494 basic units were generated from the 132,210 collected heat transport pipeline records.

Combination of Heat Transport Pipelines Based on Attribute Information

For basic unit setting, when the attribute information of connected heat transport pipelines (operator, pipe function (supply or return pipes), pipe diameter, heat supply status, installation date (period of use), sensor wire condition, and key management sections) was identical, they were combined into one basic unit. The attribute information refers to factors that are expected to have direct or indirect impacts on the structural stability of pipes and damage to them. Among the information, the attribute information included in the damage history provided by the operators (operator, pipe function, pipe diameter, and installation date) was subjected to correlation analysis subsequently.

Maximum and Minimum Length Setting

The minimum and maximum lengths of the basic units of heat transport pipelines were set to 1 m and 36 m, respectively. Pipelines shorter than 1 m in the data were combined with connected pipelines that had the same attribute information. Pipelines longer than 36 m were split into multiple basic units: the length was divided by 36, and the quotient, rounded up, gave the number of basic units. For example, as shown in Figure 1, a 50 m long pipeline was split into two basic units (50/36 ≈ 1.39, rounded up to 2). This process prevented basic units from exceeding 36 m.
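The splitting rule for long pipelines can be sketched as follows. The sketch assumes the quotient is rounded up, which is what the 50 m example implies, and that the resulting basic units have equal length; the equal-length choice and the function name `split_into_basic_units` are assumptions made for illustration.

```python
import math
from typing import List

MAX_LEN_M = 36.0  # maximum basic-unit length given in the text

def split_into_basic_units(length_m: float) -> List[float]:
    """Split one pipeline into basic units no longer than MAX_LEN_M.

    Equal-length pieces are an assumption made for this sketch.
    """
    if length_m <= MAX_LEN_M:
        return [length_m]
    # Round up so that no resulting unit exceeds 36 m.
    n_units = math.ceil(length_m / MAX_LEN_M)
    return [length_m / n_units] * n_units

units_50 = split_into_basic_units(50.0)    # the 50 m example from the text
units_100 = split_into_basic_units(100.0)  # a hypothetical longer pipeline
```

Here the 50 m pipeline yields two 25 m units, and a 100 m pipeline yields three units of about 33.3 m each.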

2.2.3. Damage History Data

The damage history of heat transport pipelines is essential for estimating their damage probability from attribute information. Therefore, information on heat transport pipelines that had been repaired or had a damage history was collected. From the collected data, pipelines that were damaged and restored owing to causes such as leakage, geothermal heat, abnormally high temperature, steam, and loss of function were selected as damage history data. A total of 2293 heat transport pipelines were found to have damage history. Based on these damage data, damage probability according to heat transport pipeline attribute information was estimated.

3. Heat Transport Pipeline Damage Probability Analysis

This study proposes an AI model to estimate the damage probability of heat transport pipelines based on collected pipeline data. To develop a highly reliable damage risk probability model, it is necessary to select factors significantly related to pipeline damage. Therefore, damage probability was analyzed using pipeline attribute information as independent variables and repair and damage history as the dependent variable. Based on this, significant factors affecting damage to heat transport pipelines were identified.

To analyze damage probability, data on the operator, pipe function (supply or return), pipe diameter, installation year, sensor wire condition, and repair and damage history were used. The dataset, collected as of 2024, included 324,494 equipment history records from a total pipeline length of 4868.66 km and 2265 damage history records. Data missing attribute information were corrected before calculating damage probability.

3.1. Period of Use

Damage probability was analyzed according to the period of use. Among the damage history data, 1761 records included period-of-use information, while 504 did not. Excluding data without clear period-of-use information could underestimate the overall damage probability; therefore, a correction was applied. The correction allocated the period of use for damage cases without this information proportionally, as shown in Equation (1).

$$D_{T, p - i}^{'} = D_{T, p - i} \, \frac{D_{i} + D_{e}}{D_{i}} \tag{1}$$

where:
$D_{T, p - i}^{'}$: corrected number of damage cases for heat transport pipelines used for T years (#).
$D_{T, p - i}$: recorded number of damage cases for heat transport pipelines used for T years (#).
$D_{i}$: number of damage cases with period-of-use information (1761).
$D_{e}$: number of damage cases without period-of-use information (504).

The damage probability for the period of use was calculated using Equation (2) after analyzing the equipment history and damage history of heat transport pipelines. In Equation (2), the damage probability by period of use was calculated by dividing the cumulative number of damage cases of the heat transport pipelines used for T years over the last N years by the cumulative pipe length of the heat transport pipelines used for T years over the last N years.

$$F_{T} = \frac{\sum_{i = 1}^{N} D_{T, p - i}}{\sum_{i = 1}^{N} L_{T, p - i}} \tag{2}$$

where:
$F_{T}$: damage probability according to the period of use (#/km/year).
$D_{T, p - i}$: number of heat transport pipeline damage cases by period of use (#).
$L_{T, p - i}$: length of heat transport pipelines by period of use (km).
$p$: relevant year (year).
$N$: damage history collection period (year).
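As a worked illustration of Equations (1) and (2), the sketch below computes the correction factor from the counts given in the text (1761 and 504) and applies it to a hypothetical period-of-use group. The per-year damage counts and pipe lengths are invented placeholders, and feeding the corrected counts into Equation (2) is an assumption of this sketch.

```python
D_I = 1761  # damage cases with period-of-use information (from the text)
D_E = 504   # damage cases without period-of-use information (from the text)

# Scale factor in Equation (1): (D_i + D_e) / D_i
correction = (D_I + D_E) / D_I

def corrected_cases(raw_cases: float) -> float:
    """Equation (1): scale the recorded count for one period-of-use group."""
    return raw_cases * correction

def damage_probability(cases_per_year, length_km_per_year) -> float:
    """Equation (2): cumulative damage cases over cumulative length (#/km/year)."""
    return sum(cases_per_year) / sum(length_km_per_year)

# Hypothetical group: pipelines in a given year of use, observed over N = 3 years.
raw = [4, 6, 5]                      # recorded damage cases per observation year
length_km = [120.0, 118.0, 121.0]    # pipe length per observation year (km)
f_t = damage_probability([corrected_cases(d) for d in raw], length_km)
```

The correction factor is about 1.286, so each recorded count is increased by roughly 29% to account for damage cases whose period of use is unknown.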

Figure 2 shows the number of damage cases and damage probability according to the period of use. It was found that damage probability tended to increase as the period of use increased. In particular, damage probability sharply increased around 20 and 30 years. This indicates that the period of use has a significant impact on damage to heat transport pipelines.

3.2. Operator

The relationship between the operator of heat transport pipelines and damage probability was examined. The operator varies by region, so the damage risk associated with regional characteristics can be identified through the operator information. The dataset included 19 operators (A to S), and operator information was available for all 2265 damage cases. The damage probability for each operator was calculated using the same method as for the period of use (Equation (2)). The results are shown in Figure 3. The probability differed significantly depending on the operator. This appears to be because the regions in which the operators run their district heating systems have different topographical conditions (altitude and slope) and soil moisture contents. This indicates that the operator (regional characteristics) is significantly related to damage to heat transport pipelines.

3.3. Pipe Function

Most heat transport pipelines comprise supply pipes, which deliver heated water under high pressure, and return pipes, through which the supplied hot water returns [11]. In the dataset of this study, 2240 damage records included information on whether the pipe was a supply or return pipe, and these records were used for the analysis. The number of damage cases was approximately 2.28 times larger for supply pipes than for return pipes. Because the total lengths of the supply and return pipes were almost identical (2454 km and 2452 km, respectively), this difference appears to result from the relatively higher pressure in the supply pipes. Therefore, pipe function is considered to be a factor that influences damage to heat transport pipelines. Table 2 shows the number of damage cases and damage probability for each pipe function.

3.4. Pipe Diameter

The pipe diameter of heat transport pipelines in the target area varied from 20 to 1200 mm. In general, steel pipes can be classified as small-diameter pipes for diameters of 200 mm or less, medium-diameter pipes for 200 to 400 mm, and large-diameter pipes for 400 mm or more. Therefore, damage probability was analyzed according to these pipe diameter categories. There were a total of 2215 data points that included attribute information on pipe diameter and could be used for analysis. Figure 4 shows the number of damage cases and damage probability by pipe diameter. In addition, Table 3 shows the number of damage cases and probability according to the pipe diameter category. There was no significant correlation between pipe diameter and damage to heat transport pipelines, but the damage probability of medium-diameter pipes was the lowest among small-, medium-, and large-diameter pipes.

3.5. Sensor Wire Condition

A damage sensor wire is attached to the exterior of a heat transport pipeline, allowing operators to detect damage. When internal hot water leaks due to steel pipe damage or moisture from surrounding soil infiltrates because of insulation damage, moisture penetrates the sensor wire, causing a drop in its insulation level [13,14,15]. The sensor wire condition data are categorized into two states: normal and disconnection. In the normal state, insulation levels range from stages 1 to 13. Disconnection indicates that the sensor wire is malfunctioning.

There were 240 damage cases with normal sensor wire condition and complete attribute data, which were used to analyze damage probability. Figure 5 presents the number of damage cases and damage probability according to insulation level. Damage probability generally decreased as insulation level increased, with a sharp decline observed from level 5 onwards. Pipelines with disconnected sensor wires totaled 1728.02 km in length, with a damage probability of 0.066.

4. Heat Transport Pipeline Damage Probability Model

4.1. Dataset

4.1.1. Input Data

Factors affecting damage to heat transport pipelines were identified through damage probability based on attribute information. Using these factors, damage risk probability models were developed. The operator, pipe function, pipe diameter, and sensor wire condition of heat transport pipelines were utilized as input data. Records with missing values were removed before applying the dataset to ML algorithms. A total of 326,788 data points were used for model development.

Data preprocessing was conducted to efficiently develop the damage probability prediction model. First, the operator and pipe function, which were character strings, were converted into integer codes using LabelEncoder. Because pipe diameter and sensor wire condition had asymmetrical (skewed) data distributions, efficient model learning was expected to be difficult [16,17]. Therefore, a log transformation was applied to reduce the skewness and better align their data scales.
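The two preprocessing steps can be sketched with Scikit-learn's `LabelEncoder` and NumPy. The record values below are placeholders, and the use of `np.log1p` (log(1 + x)) is an assumption, since the text only states that a log transformation was applied.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical records; values are placeholders, not data from the study.
operators = ["C", "A", "B", "A"]
functions = ["supply", "return", "supply", "supply"]
diameters_mm = np.array([20.0, 200.0, 400.0, 1200.0])

# LabelEncoder assigns integer codes in sorted class order.
operator_codes = LabelEncoder().fit_transform(operators)   # A=0, B=1, C=2
function_codes = LabelEncoder().fit_transform(functions)   # return=0, supply=1

# Assumption: log1p is used here; the exact log variant is not stated in the text.
diameter_log = np.log1p(diameters_mm)
```

Because the integer codes carry no natural order, they suit tree-based models such as RF, XGB, and LGBM better than models that interpret the codes as magnitudes.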

4.1.2. Output Data

Safety and risk classes were defined using damage history and installation year of the pipelines. Although data on damaged pipelines were collected, the damaged class accounted for a very small proportion (0.75%). To improve model performance, the risk class data were augmented by extracting period-of-use data based on installation year. Previous studies showed that the damage risk of underground pipelines increased with usage duration [18,19]. According to the failure history of heat transport pipelines by period of use (refer to Figure 2), the probability of pipeline failure begins to increase noticeably after 20 years of service, and a pronounced surge is observed for pipelines older than 30 years. Accordingly, pipelines with service lives of 20 years, 25 years, and 30 years or more were each incorporated as “risk” classes to enhance the output dataset. Subsequently, model performance was evaluated and compared for each scenario.
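The labeling rule for the three scenarios can be sketched as below: a basic unit is assigned to the risk class if it has a damage history or if its period of use reaches the chosen threshold. Using 2024 as the reference year follows the statement that the dataset was collected as of 2024; the function name `risk_label` and the example units are hypothetical.

```python
REFERENCE_YEAR = 2024  # the dataset was collected as of 2024

def risk_label(has_damage_history: bool, installation_year: int,
               age_threshold: int) -> int:
    """Return 1 (risk) if the unit was damaged or has been in use
    for at least age_threshold years, otherwise 0 (safety)."""
    years_in_use = REFERENCE_YEAR - installation_year
    return int(has_damage_history or years_in_use >= age_threshold)

# Hypothetical basic units: (has damage history, installation year).
units = [(True, 2015), (False, 2003), (False, 1998), (False, 1990)]

# One label list per scenario (20, 25, and 30 years or more).
labels_by_threshold = {
    t: [risk_label(damaged, year, t) for damaged, year in units]
    for t in (20, 25, 30)
}
```

Raising the threshold shrinks the risk class: the unit installed in 2003 is labeled risk only in the 20-year scenario, and the unit installed in 1998 drops out in the 30-year scenario.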

4.2. Data Correlation Analysis

To develop efficient models, statistical significance between input and output (target) data was investigated. Pearson correlation analysis was conducted to examine the linear relationship between variables and quantify the strength of associations [20]. The correlation coefficient is calculated using Equation (3), where higher absolute values indicate stronger correlation [21,22]. Although heat transport pipelines have various data types (attribute information, auxiliary facilities, land use), only a few factors are systematically managed. Thus, the statistical significance between input and output data was examined using hypothesis testing for population correlations. Data with p-values ≤ 0.05 were considered statistically significant for damage risk.

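$$\mathrm{Corr}(X, Y) = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_{x} \sigma_{y}} \tag{3}$$

where $\mathrm{Cov}(X, Y)$ is the covariance of the two variables and $\sigma_{x}$ and $\sigma_{y}$ are their standard deviations.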

Table 4 shows the correlation results. The operator had the highest correlation, followed by pipe diameter, sensor wire condition, and pipe function. Because the operator is categorical data encoded as arbitrary integer labels, the negative sign of its correlation is not considered meaningful. All input data were statistically significant, with p-values ≤ 0.05.
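Equation (3) can be evaluated directly, as in the following minimal sketch. The feature values and labels are hypothetical, and population statistics (dividing by N) are used for both the covariance and the standard deviations so that the ratio is consistent.

```python
import numpy as np

def pearson_corr(x, y) -> float:
    """Equation (3): covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return float(cov_xy / (x.std() * y.std()))          # population std (ddof=0)

# Hypothetical feature values and binary risk labels.
feature = [3.0, 5.3, 5.9, 6.2, 7.1, 4.4]
risk = [0, 0, 1, 1, 1, 0]
rho = pearson_corr(feature, risk)
```

In practice, `scipy.stats.pearsonr` returns the same coefficient together with a p-value, which is the quantity compared against the 0.05 threshold.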

4.3. ML Model

ML-based models predicting heat transport pipeline damage probability were developed using attribute information. Machine learning models demonstrate strong effectiveness when applied to structured tabular data. In addition, they offer greater flexibility in data preprocessing compared to image-based models [23]. Therefore, in this study, machine learning algorithms were employed with consideration for operational usability by field personnel. Binary classification (safety vs. risk) was performed using RF, eXtreme Gradient Boosting (XGB), and Light Gradient Boosting Machine (LGBM) algorithms. The optimal model was selected by comparing evaluation indicators.

4.3.1. RF

RF is an ensemble model based on decision trees. It is suitable for solving both regression and classification problems [24]. Many studies have developed classification models using RF [25,26,27,28,29]. Ensemble models typically achieve better results than models trained on a single algorithm by repeatedly learning multiple models and combining them through techniques such as voting and bagging. RF allows free selection of variables because the risk of overfitting is relatively low, and it performs well even with data where variables have low correlations [30].

4.3.2. XGBoost (eXtreme Gradient Boosting)

XGBoost (XGB) is a widely used boosting algorithm that sequentially trains individual tree models while updating their weights. It has been applied to both regression and classification problems and can easily address overfitting by tuning hyperparameters. Additionally, XGB efficiently handles large datasets within a short time, making it popular across various fields [31].

Decision-making of the XGB model is performed using Equation (4), where $\hat{y}_{i}$ is the raw predicted score of the i-th sample and $f_{k}(x_{i})$ is the output of the k-th tree; the outputs of all K trees are summed. For binary classification, the sigmoid function converts this score into a predicted probability $\hat{p}_{i}$, as in Equation (5) [32].

$$\hat{y}_{i} = \sum_{k = 1}^{K} f_{k}(x_{i}) \tag{4}$$

$$\hat{p}_{i} = \frac{1}{1 + e^{-\hat{y}_{i}}} \tag{5}$$

Errors remain after each tree is trained. Each boosting round therefore adds a new tree fitted to reduce the remaining error, and the prediction is updated using Equation (6). In the equation, $\hat{y}_{i}^{(t - 1)}$ is the predicted value of the previous model, $h_{t}(x_{i})$ is the output of the tree learned in the current round, and $\eta$ is the learning rate, which scales the contribution of the new tree. This procedure is repeated to reduce the error of the model [33].

$$\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t - 1)} + \eta \, h_{t}(x_{i}) \tag{6}$$
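The additive update and the sigmoid mapping in Equations (4) to (6) can be illustrated without any boosting library. In this sketch the tree outputs are hypothetical numbers rather than fitted trees, and each tree's contribution is taken as the learning rate times its output, which is an interpretation adopted for illustration.

```python
import math

def sigmoid(score: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def boosted_score(tree_outputs, learning_rate: float,
                  base_score: float = 0.0) -> float:
    """Apply the update y(t) = y(t-1) + eta * h_t(x) once per tree."""
    score = base_score
    for h_t in tree_outputs:
        score += learning_rate * h_t
    return score

# Hypothetical outputs of three trees for a single sample.
tree_outputs = [0.8, 0.5, -0.2]
raw_score = boosted_score(tree_outputs, learning_rate=0.1)
probability = sigmoid(raw_score)
```

A smaller learning rate makes each tree move the score less, so more trees are needed, but the model is usually less prone to overfitting.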

4.3.3. LightGBM (Light Gradient Boosting Machine)

LightGBM (LGBM) employs a boosting technique similar to XGB. It is a tree-based, high-performance algorithm commonly used to determine feature importance in regression and classification tasks. By utilizing a subset of the data for rapid computation and feature reduction, the model achieves improved computational efficiency and demonstrates effectiveness in handling imbalanced datasets [34,35]. It is efficient in handling large datasets.

The model uses cross entropy as its loss function, calculated using Equation (7). N is the number of samples, K is the number of classes, $y_{i, j}$ is the binary variable that indicates whether the i-th sample belongs to the j-th class, and $p_{i, j}$ is the predicted probability that the i-th sample belongs to the j-th class. LGBM learns by updating the model at each iteration so as to reduce the cross entropy of the previous model [36].

$$CE = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{K} y_{i, j} \log(p_{i, j}) \tag{7}$$
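The cross entropy can be computed directly from one-hot labels and predicted class probabilities. The sketch below uses the conventional form, with a leading negative sign and the inner sum running over the K classes, so that the loss is non-negative; the example labels and probabilities are hypothetical.

```python
import math

def cross_entropy(y_true, p_pred) -> float:
    """Mean cross entropy over N samples and K classes (one-hot y_true)."""
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y_ij, p_ij in zip(y_row, p_row):
            if y_ij:  # only the true class contributes to the sum
                total += y_ij * math.log(p_ij)
    return -total / n

# Hypothetical two-class example (columns: safety, risk).
y_true = [[1, 0], [0, 1], [0, 1]]
p_pred = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
ce = cross_entropy(y_true, p_pred)
```

Confident correct predictions contribute little to the loss, while the uncertain third sample (probability 0.5 for the true class) contributes log 2.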

4.4. Model Evaluation Indicators

In this study, the dataset was applied to the ML models, and the results were compared. The evaluation indicators selected to compare model performance were accuracy, to intuitively assess the model’s performance; F2-score, to evaluate classification performance for damaged heat transport pipelines; and the area under the curve (AUC), to assess the reliability of the model [37,38].

F2-Score is used to evaluate binary classification models where recall is highly important. It considers both precision and recall but assigns a higher weight to recall. This enables evaluation of whether the prediction model properly classified damage to heat transport pipelines [39,40].

AUC evaluates model performance based on the area under the Receiver Operating Characteristic (ROC) curve, which uses recall and specificity. Table 5 presents model performance according to AUC values. Model performance is considered good if the value is 0.8 or higher [41].

Equations (8)–(12) represent methods to calculate the model evaluation indicators. Figure 6 is the confusion matrix that shows model classification results for evaluation index calculation.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$\mathrm{Recall\ (Sensitivity)} = \frac{TP}{TP + FN} \tag{9}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{F2\ Score} = 5 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{4 \times \mathrm{Precision} + \mathrm{Recall}} \tag{11}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{12}$$
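Equations (8) to (12) translate directly into code. The following sketch computes all five indicators from confusion-matrix counts; the counts are hypothetical and are not results from the study.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Equations (8)-(12) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        # F2 weights recall four times as heavily as precision.
        "f2": 5 * precision * recall / (4 * precision + recall),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, not results from the study.
metrics = evaluate(tp=80, fp=40, fn=20, tn=60)
```

With these counts the recall is 0.8 and the precision is about 0.667, and the F2-score (about 0.769) lies closer to the recall, which reflects the heavier weight placed on finding damaged pipelines.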

4.5. Results of Models for Predicting Heat Transport Pipeline Damage Probability

Python 3.8 and the Scikit-learn library, together with related machine learning packages, were used to develop the models for predicting heat transport pipeline damage probability. The algorithms used were RF, XGB, and LGBM. Models based on other algorithms (e.g., CatBoost and AdaBoost) were also developed but showed issues such as overfitting and performance degradation. Therefore, only results from the selected algorithms are presented.

The dataset was divided into training, validation, and test sets at a ratio of 64:16:20. Only the training set was used for model training, with overfitting evaluated by comparing performance with the validation set. The test set was used to calculate evaluation indicators [42]. In addition, oversampling techniques were applied to the training set to augment the relatively scarce instances of pipeline failure.
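The article does not state which oversampling technique was used. As one possible illustration, the sketch below applies simple random oversampling, duplicating minority-class rows of the training set until both classes have equal counts; the function name `random_oversample` and the example data are hypothetical.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both
    classes (0 and 1) have equal counts. Inputs are not modified."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return rows + extra, labels + [minority] * deficit

# Hypothetical training set: 8 safe units and 2 risk units.
train_rows = [[i] for i in range(10)]
train_labels = [0] * 8 + [1] * 2
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
```

Oversampling is applied to the training set only, so the validation and test sets keep the original class proportions and the evaluation indicators remain representative.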

To improve model development, the output data (heat transport pipeline damage history) were reinforced. It is known that damage risk sharply increases after a certain period of use. Thus, pipes used for over 20, 25, and 30 years were labeled as damaged in addition to existing damage history. Three datasets were created based on this and applied to the ML algorithms. The results are shown in Table 6. Datasets A, B, and C correspond to data with pipes used over 20, 25, and 30 years added to the damage history, respectively.

The model using dataset A with the LGBM algorithm achieved the highest F2-score (accuracy: 0.737, F2-score: 0.804, AUC: 0.837), marginally ahead of the XGB model on the same dataset (F2-score: 0.803). The ROC curve for this model is shown in Figure 7; it displays an upward-sloping pattern, with a difference of only 0.01 between training and validation. This small gap suggests that the model generalizes stably and offers meaningful predictive performance.
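The selection described above amounts to ranking the rows of Table 6 by F2-score. The sketch below encodes the Table 6 values and shows that the choice of metric changes the selected model; the variable and function names are hypothetical.

```python
# (dataset, model, accuracy, F2-score, AUC), copied from Table 6.
TABLE_6 = [
    ("A", "XGB", 0.741, 0.803, 0.839),
    ("A", "LGBM", 0.737, 0.804, 0.837),
    ("A", "RF", 0.707, 0.744, 0.804),
    ("B", "XGB", 0.753, 0.776, 0.862),
    ("B", "LGBM", 0.741, 0.781, 0.861),
    ("B", "RF", 0.714, 0.726, 0.832),
    ("C", "XGB", 0.768, 0.730, 0.901),
    ("C", "LGBM", 0.770, 0.740, 0.900),
    ("C", "RF", 0.730, 0.721, 0.891),
]

ACCURACY, F2, AUC = 2, 3, 4  # column indices


def best_by(column: int):
    """Return the (dataset, model) pair with the highest value in a column."""
    row = max(TABLE_6, key=lambda r: r[column])
    return row[0], row[1]


print("best F2-score:", best_by(F2))        # ('A', 'LGBM')
print("best accuracy:", best_by(ACCURACY))  # ('C', 'LGBM')
print("best AUC:", best_by(AUC))            # ('C', 'XGB')
```

Ranking by accuracy or AUC would favor dataset C, whereas ranking by F2-score favors dataset A.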

4.6. Importance Analysis

The LGBM model, which exhibited the highest evaluation indicators, was used to analyze the importance of the input data utilized by the classifier to predict heat transport pipeline damage probability, as shown in Figure 8. Pipe diameter_log and sensor wire condition_log are variables transformed by applying the logarithm to the raw data. The most important factor for the classifier’s predictions was the operator, followed by pipe diameter, sensor wire condition, and pipe function. It is difficult to quantify how much these factors affect damage to heat transport pipelines based solely on the importance values presented. However, because the importance ranking corresponds to the correlation analysis results, it is expected that factors with high importance will be useful in decision-making processes, such as maintenance planning.

4.7. Visualization of Heat Transport Pipeline Damage Probability

A risk map was prepared using the heat transport pipeline damage probability prediction model presented in this study, as shown in Figure 9. The developed model allows calculation of the damage probability of pipelines segmented into basic units. Sections with high damage risk based on the model’s prediction were visualized in red, while those with relatively low risk were shown in blue.
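One simple way to realize such a color scheme is a linear ramp from blue (low probability) to red (high probability). The sketch below is only an assumption about how the coloring could be done; the source does not specify the color scale, and the function name and segment identifiers are hypothetical.

```python
def risk_color(probability: float) -> tuple:
    """Map a damage probability in [0, 1] to an (R, G, B) color.

    Assumption: a linear blue-to-red ramp; values outside [0, 1] are clipped.
    """
    p = min(max(probability, 0.0), 1.0)
    red = round(255 * p)
    return (red, 0, 255 - red)


# Hypothetical predicted probabilities for three basic units.
segments = {"unit-001": 0.0, "unit-002": 0.2, "unit-003": 1.0}
for unit_id, prob in segments.items():
    print(unit_id, prob, risk_color(prob))
```

In practice, the colors would be attached to the basic-unit geometries in a GIS layer; that step is omitted here because it depends on the spatial data format.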

Because the model weighted the operator (regional characteristics) heavily, similar damage probabilities were expected within the same operator’s jurisdiction. However, the visualization showed that the model predicted different probabilities even within the same operator’s jurisdiction, indicating that it also accounted for the other factors. The resulting risk map is expected to serve as basic data for prioritizing heat transport pipeline maintenance and for proactively managing hazardous areas.

5. Discussion

In this study, attribute information and damage history data of heat transport pipelines were collected to analyze the damage probability based on these attributes and to develop a risk model. The pipelines were divided into basic units following certain standards. Damage probability according to attribute information was analyzed using the segmented pipeline data, and an AI-based model for predicting heat transport pipeline damage probability was developed by selecting statistically significant factors.

Damage to heat transport pipelines was found to be influenced by the period of use, operator, pipe function, pipe diameter, and sensor wire condition. Damage probability tended to rise with the period of use owing to aging. This likely occurs because aging degrades the insulation material that protects the steel pipes and prevents heat loss; corrosion then develops as internal hot water leaks out and external groundwater infiltrates through the damaged areas [43]. Differences in sensor wire condition affected damage probability for the same reason. Previous studies reported that pipelines made of the same steel pipes as heat transport pipelines, including water supply systems, have a service life of approximately 30 years or more, although this varies with internal and external conditions. The damage probability analysis of the target area’s heat transport pipeline data similarly showed a high damage probability for pipes used for over 30 years [18,19]. Damage probability also varied significantly by operator, likely because of differences in management and maintenance guidelines and in soil moisture content, which is affected by topographical characteristics (altitude and slope) [44,45,46,47]. Heat transport pipelines carry hot water at high temperature and pressure, and damage occurred primarily in supply pipes. Pipe diameter was not closely related to damage, although mid-sized pipelines tended to exhibit lower damage probability.

An AI-based model predicting heat transport pipeline damage probability was developed. Because damage cases (output data) were scarce relative to the available attribute information (input data), the output data were reinforced with pipelines used for over 20 years, as damage probability increased sharply with age. Applying this reinforced dataset to the RF, XGB, and LGBM algorithms produced an LGBM model with an accuracy of approximately 74% and an F2-score of 0.804. This result demonstrates higher reliability compared to previous studies that relied on statistical models using single factors such as pipeline age or operator, as the proposed model incorporates multiple influencing factors in combination [10]. The high F2-score indicates good classification of damaged pipelines. In particular, the model demonstrates strong detection performance for damaged pipelines, distinguishing itself from prior studies that focused heavily on accuracy alone [12]. When pipes used for over 30 years were included instead of those used for over 20 years, accuracy and AUC improved (LGBM: 0.770 and 0.900 versus 0.737 and 0.837; Table 6), whereas the F2-score decreased (0.740 versus 0.804). Because the primary goal was to predict damaged pipelines, the F2-score was the key evaluation metric, with accuracy and AUC serving as auxiliary indicators of reliability. A risk map visualizing pipeline damage risk was created using the model, enabling prioritized maintenance and proactive management of high-risk pipes to prevent accidents.

The proposed model predicted risks using limited factors due to data collection challenges. Future research should aim to develop a more reliable model by incorporating diverse internal and external factors, including various pipeline attribute information, geotechnical data, auxiliary facilities of heat transport pipelines, and nearby excavation activities.

6. Conclusions

This study presented the damage tendency of heat transport pipelines by analyzing their attribute information and damage history. Based on this analysis, an AI-based damage probability prediction model was developed. Additionally, a risk map illustrating heat transport pipeline damage risks was created from the model’s prediction results. The period of use, operator, sensor wire condition, and pipe diameter showed significant correlations with damage. In particular, pipeline aging was found to notably affect damage occurrence. The AI-based model, using these factors, achieved a high F2-score (over 0.8), indicating effective classification of pipelines with high damage probability. The risk map derived from the model is expected to aid management authorities in efficient pipeline maintenance. However, the reliability of the model is limited by the relatively small number of attribute variables available for analysis. In addition, the data used for model development were confined to heat transport pipelines in South Korea, which may limit the generalizability of the model to other regions or contexts. The strategy of labeling all pipelines exceeding a certain service age as at-risk, adopted to address the severe class imbalance in the dataset, may also introduce labeling bias. Therefore, it will be necessary to further enrich the dataset with more comprehensive maintenance and failure records in future studies. Moreover, integrating a broader range of environmental factors—such as burial depth, soil characteristics, and information on auxiliary facilities—will be important for further improving model performance and reliability. Despite these limitations, the model development protocol proposed in this study is expected to provide a valuable foundation for data-driven decision-making and proactive risk management of underground infrastructure in urban environments.

Author Contributions

Conceptualization, J.K. (Jaemo Kang) and J.K. (Jinyoung Kim); developed the models and carried out the model simulations, S.L. and M.K.; writing—original draft preparation, S.L. and M.K.; writing—review and editing, J.K. (Jaemo Kang) and J.K. (Jinyoung Kim). All authors have read and agreed to the published version of the manuscript.

Funding

Research for this paper was carried out under the Korea Institute of Civil Engineering and Building Technology (KICT) Research Program (project no. 20250270-001 Intelligent Geotechnical Data Analysis and Underground Infrastructure Stability Assessment Based on Data Fusion), funded by the Ministry of Science and ICT.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area under the curve
LGBM	Light gradient boosting machine
LightGBM	Light gradient boosting machine
ML	Machine learning
RF	Random forest
ROC	Receiver operating characteristic

References

Zhou, S.; O’Neill, Z.; O’Neill, C. A Review of Leakage Detection Methods for District Heating Networks. Appl. Therm. Eng. 2018, 137, 567–574.
Rafati, A.; Shaker, H.R. Predictive Maintenance of District Heating Networks: A Comprehensive Review of Methods and Challenges. Therm. Sci. Eng. Prog. 2024, 53, 102722.
van Dreven, J.; Boeva, V.; Abghari, S.; Grahn, H.; Al Koussa, J.; Motoasca, E. Intelligent Approaches to Fault Detection and Diagnosis in District Heating: Current Trends, Challenges, and Opportunities. Electronics 2023, 12, 1448.
Igwenagu, U.T.I.; Debnath, R.; Ahmed, A.A.; Alam, M.J.B. An Integrated Approach for Earth Infrastructure Monitoring Using UAV and ERI: A Systematic Review. Drones 2025, 9, 225.
Ravindran, G. Evaluation of New Technologies to Support Asset Management of Metro Systems; UCL Press: London, UK, 2020.
Guan, H.; Xiao, T.; Luo, W.; Gu, J.; He, R.; Xu, P. Automatic Fault Diagnosis Algorithm for Hot Water Pipes Based on Infrared Thermal Images. Build. Environ. 2022, 218, 109111.
Adegboye, M.A.; Fung, W.-K.; Karnik, A. Recent Advances in Pipeline Monitoring and Oil Leakage Detection Technologies: Principles and Approaches. Sensors 2019, 19, 2548.
Shen, Y.; Chen, J.; Fu, Q.; Wu, H.; Wang, Y.; Lu, Y. Detection of District Heating Pipe Network Leakage Fault Using UCB Arm Selection Method. Buildings 2021, 11, 275.
Valinčius, M.; Žutautaitė, I.; Dundulis, G.; Rimkevičius, S.; Janulionis, R.; Bakas, R. Integrated Assessment of Failure Probability of the District Heating Network. Reliab. Eng. Syst. Saf. 2015, 133, 314–322.
Kong, M.; Kang, J. Methodology for Estimating the Probability of Damage to a Heat Transmission Pipe. J. Korean Geo-Environ. Soc. 2021, 22, 15–21. (In Korean)
Langroudi, P.P.; Weidlich, I. Applicable Predictive Maintenance Diagnosis Methods in Service-Life Prediction of District Heating Pipes. Environ. Clim. Technol. 2020, 24, 294–304.
Pishvaie, M.R.; Hadipoor, M.; Jafari, S.; Baghery, S. Intelligent Approaches to Fault Detection and Diagnosis in District Heating Systems: A Review. Processes 2023, 11, 2512.
Tol, H.İ.; Madessa, H.B. Enhancing District Heating System Efficiency: A Review of Return Temperature Reduction Strategies. Appl. Sci. 2025, 15, 2982.
Lee, Y.H.; Kim, S.H.; Kang, U.S.; Kim, W.C.; Kim, J.G. Evaluation of Electrochemical Properties and Life Prediction of Sensor Wire in Leak Detection Systems of Underground Heating Pipelines. J. Electrochem. Soc. 2024, 171, 103508.
Lidén, P.; Adl-Zarrabi, B.; Hagentoft, C.E. Diagnostic Protocol for Thermal Performance of District Heating Pipes in Operation, 2: Estimation of Present Thermal Conductivity in Aged Pipe Insulation. Energies 2021, 14, 5302.
Song, S.; Kim, J. Advanced Monitoring Technology for District Heating Pipelines Using Fiberoptic Cable. In Proceedings of the 15th International Symposium on District Heating and Cooling, Seoul, Republic of Korea, 4–7 September 2016; pp. 1–8.
Rahman, A. Statistics-Based Data Preprocessing Methods and Machine Learning Algorithms for Big Data Analysis. Int. J. Artif. Intell. 2019, 17, 44–65.
Bilal, M.; Ali, G.; Iqbal, M.W.; Anwar, M.; Malik, M.S.A.; Kadir, R.A. Auto-prep: Efficient and Automated Data Preprocessing Pipeline. IEEE Access 2022, 10, 107764–107784.
Khan, L.R.; Tee, K.F. Risk-Cost Optimization of Buried Pipelines Using Subset Simulation. J. Infrastruct. Syst. 2016, 22, 04016001.
Ebenuwa, A.U.; Tee, K.F. Fuzzy Reliability and Risk-Based Maintenance of Buried Pipelines Using Multiobjective Optimization. J. Infrastruct. Syst. 2020, 26, 04020008.
Asuero, A.G.; Sayago, A.; González, A.G. The Correlation Coefficient: An Overview. Crit. Rev. Anal. Chem. 2006, 36, 41–59.
Xu, H.; Deng, Y. Dependent Evidence Combination Based on Shearman Coefficient and Pearson Coefficient. IEEE Access 2018, 6, 11634–11640.
Tai, J.; Che, C. Automated Machine Learning: A Survey of Tools and Techniques. J. Ind. Eng. Appl. Sci. 2024, 2, 71–76.
Wählby, U.; Jonsson, E.N.; Karlsson, M.O. Comparison of Stepwise Covariate Model Building Strategies in Population Pharmacokinetic-Pharmacodynamic Analysis. AAPS PharmSciTech. 2002, 2002, 68–79.
Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Taylor & Francis: Oxfordshire, UK, 1984.
Pal, M. Random Forest Classifier for Remote Sensing Classification. Int. J. Remote Sens. 2005, 26, 217–222.
Park, E.J.; Park, J.H.; Kim, H.H. Mapping Species-Specific Optimal Plantation Sites Using Random Forest in Gyeongsangnam-do Province, South Korea. J. Agric. Life Sci. 2019, 53, 65–74. (In Korean)
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009; p. 745.
Tian, S.; Zhang, X.; Tian, J.; Sun, Q. Random Forest Classification of Wetland Landcovers from Multi-Sensor Data in the Arid Region of Xinjiang, China. Remote Sens. 2016, 8, 954.
Lee, S.H.; Yoon, Y.A.; Jung, J.H.; Sim, H.S.; Chang, T.W.; Kim, Y.S. A Machine Learning Model for Predicting Silica Concentrations through Time Series Analysis of Mining Data. J. Korean Soc. Qual. Manag. 2020, 48, 511–520.
Louppe, G. Understanding Random Forests; University of Liège: Liège, Belgium, 2014; p. 211.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System, KDD’16. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Zhang, Y.; Haghani, A. A Gradient Boosting Method to Improve Travel Time Prediction. Transp. Res. Part C Emerg. Technol. 2015, 58, 308–324.
Zhang, D.; Chen, H.D.; Zulfiqar, H.; Yuan, S.S.; Huang, Q.L.; Zhang, Z.Y.; Deng, K.J. iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. Comp. Math. Methods Med. 2021, 2021, 6664362.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; NeurIPS: La Jolla, CA, USA, 2017; Volume 30.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
Lv, J.; Wang, C.; Gao, W.; Zhao, Q. An Economic Forecasting Method Based on the LightGBM-Optimized LSTM and Time-Series Model. Hindawi Comput. Intell. Neurosci. 2021, 2021, 10.
Gu, Q.; Zhu, L.; Cai, Z. Evaluation Measures of the Classification Performance of Imbalanced Data Sets. In Proceedings of the ISICA 2009 – The 4th International Symposium on Computational Intelligence and Intelligent Systems, Communications in Computer and Information Science, Huangshi, China, 23–25 October 2009; Cai, Z., Li, Z., Kang, Z., Liu, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 51, pp. 461–471.
Bekkar, M.; Djemaa, H.K.; Alitouche, T.A. Evaluation Measures for Models Assessment over Imbalanced Data Sets. J. Inf. Eng. Appl. 2013, 3, 27–38.
Gietz, H.; Sharma, J.; Tyagi, M. Machine Learning for Automated Sand Transport Monitoring in a Pipeline Using Distributed Acoustic Sensor Data. IEEE Sens. J. 2024, 24, 22444–22457.
Chen, X.; Karin, T.; Jain, A. Automated Defect Identification in Electroluminescence Images of Solar Modules. Sol. Energy 2022, 242, 20–29.
Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
Cawley, G.C.; Talbot, N.L.C. On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107.
Vega, A.; Yarahmadi, N.; Jakubowicz, I. Determination of the Long-Term Performance of District Heating Pipes through Accelerated Ageing. Polym. Degrad. Stab. 2018, 153, 15–22.
Guo, X.; Fu, Q.; Hang, Y.; Lu, H.; Gao, F.; Si, J. Spatial Variability of Soil Moisture in Relation to Land Use Types and Topographic Features on Hillslopes in the Black Soil Area of Northeast China. Sustainability 2020, 12, 3552.
Kim, S.; Lee, H.; Woo, N.C.; Kim, J. Soil Moisture Monitoring on a Steep Hillside. Hydrol. Process. 2007, 21, 2910–2922.
Kim, M.S.; Onda, Y.; Kim, J.K.; Kim, S.W. Effect of Topography and Soil Parameterisation Representing Soil Thicknesses on Shallow Landslide Modelling. Quat. Int. 2015, 384, 91–106.

Figure 1. Maximum and minimum length setting method.

Figure 2. Number of damage cases and damage probability according to the period of use.

Figure 3. Number of damage cases and damage probability according to the operator.

Figure 4. Number of damage cases and damage probability according to the pipe diameter.

Figure 5. Number of damage cases and damage probability according to the sensor wire condition.

Figure 6. Confusion matrix.

Figure 7. ROC curve.

Figure 8. Analysis of important factors using models.

Figure 9. Heat transport pipeline damage probability risk map.

Table 1. Data characteristics of heat transport pipelines.

Attribute Information	Data Characteristics
GIS location information	Shape file
Equipment ID	Numeric string
Operator	Character string
Pipe function	Character string
Pipe diameter	Numeric string
Installation date	Numeric string
Sensor wire condition	Character string

Table 2. Number of damage cases and damage probability for each pipe function.

Pipe Function	Average Period of Use (Year)	Pipe Length (km)	Number of Damage Cases (#)	Damage Probability (#/km/Year)
Supply pipe	16.00	2454.00	1557	0.043
Return pipe	16.02	2452.26	683	0.019

Table 3. Number of damage cases and damage probability according to the pipe diameter category.

Classification	Pipe Length (km)	Number of Damage Cases (#)	Damage Probability (#/km/Year)
Small diameter	2228.41	1099	0.034
Medium diameter	968.40	311	0.022
Large diameter	1666.15	805	0.033

Table 4. Results of Pearson correlation.

Factor	Pearson Correlation	p-Value
Operator	−0.302	0.000
Pipe function	0.004	0.037
Pipe diameter	−0.204	0.000
Sensor wire condition	−0.080	0.000

Table 5. Model evaluation according to AUC.

AUC	Evaluation
AUC ≧ 0.9	Excellent
0.8 ≦ AUC < 0.9	Good
0.7 ≦ AUC < 0.8	Fair
AUC < 0.7	Poor

Table 6. Performance of models for predicting heat transport pipeline damage probability.

Dataset	Model	Accuracy	F2-Score	AUC
A	XGB	0.741	0.803	0.839
A	LGBM	0.737	0.804	0.837
A	RF	0.707	0.744	0.804
B	XGB	0.753	0.776	0.862
B	LGBM	0.741	0.781	0.861
B	RF	0.714	0.726	0.832
C	XGB	0.768	0.730	0.901
C	LGBM	0.770	0.740	0.900
C	RF	0.730	0.721	0.891

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
