Prediction of High-Risk Failures in Urban District Heating Pipelines Using KNN-Based Relabeling and AI Models

Lee, Sungyeol; Kang, Jaemo; Kim, Jinyoung; Kong, Myeongsik

doi:10.3390/app152011104

Department of Geotechnical Engineering Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(20), 11104; https://doi.org/10.3390/app152011104

Submission received: 22 September 2025 / Revised: 10 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025

(This article belongs to the Special Issue Advanced Diagnostics and Nondestructive Testing Technologies for Civil Structures)

Abstract

This study generated an AI (Artificial Intelligence)-based prediction model for identifying high-risk groups of failures in urban district heating pipelines using pipeline attribute information and historical failure records. A total of 324,495 records from normally operating pipelines and 2293 failure cases were collected. Because the dataset exhibited severe imbalance, a KNN (K Nearest Neighbors)-based similarity selection was applied to reclassify the top 10% of normal data most similar to failure cases as high-risk. Input variables for model development included pipe diameter, purpose, insulation level, year of burial, and burial environment, supplemented with derived variables to enhance predictive capacity. The dataset was trained using XGBoost (eXtreme Gradient Boosting) v3.0.2, LightGBM (Light Gradient-Boosting Machine) v4.5.0, and an ensemble model (XGBoost + LightGBM), and the performance metrics were compared. The XGBoost model (K = 2) achieved the best results, with an F2-score of 0.921 and an AUC of 0.993. Variable importance analysis indicated that year of burial, insulation level, and purpose were the most influential features, highlighting pipeline aging and insulation condition as key determinants of high-risk classification. The proposed approach enables prioritization of failure risk management and identification of vulnerable sections using only attribute data, even in situations where sensor installation and infrared thermography are limited. Future research should consider distance functions suitable for mixed variables, sensitivity to unit length, and SHAP (Shapley Additive exPlanations)-based interpretability analysis to further generalize the model and enhance its field applicability.

Keywords: heat transport pipelines; damage probability prediction; risk mapping; pipeline infrastructure management; machine learning

1. Introduction

Urban district heating pipelines are buried underground and transport high-temperature water under high pressure, supplying thermal energy. Failure of these pipelines can lead to significant energy resource losses due to reduced heating efficiency. In case of damage expansion that causes large-scale leakage, hot water may infiltrate the surrounding ground, resulting in surface subsidence or collapse. Such incidents can damage roads and buildings, posing serious risks to both life and property. Because these pipelines are typically buried in urban underground spaces, backfill pressure, surface loads, and repeated vehicular loading are transferred to the pipeline structure, potentially causing damage. Continuous monitoring and maintenance are thus essential. Owing to the thermal characteristics of hot water, leakage generally results in localized increases in soil temperature. Leveraging this property, infrared thermography has been widely used to detect leakage and pipeline defects [1,2,3]. However, weather conditions and geographical constraints often limit infrared equipment applicability. Moreover, the use of a limited number of inspection devices across wide areas hinders efficient failure detection. To address these challenges, technologies that can identify high-risk areas with greater precision are needed [4,5]. Statistical and AI-based techniques have been actively explored for fault detection and failure prediction in district heating pipelines. Several systems have been reported that analyze image and sensor data obtained from infrared thermography or sensors attached to pipelines to detect failures and support maintenance decision-making [6,7]. Simulation-based approaches have also been developed, where pressure and flow conditions are modeled and analyzed to identify signs of potential failure [8]. 
Research on statistical approaches has highlighted the importance of selecting and systematically analyzing influencing factors for predicting pipeline failures. Previous studies have identified aging, pressure, and temperature, along with topographic and installation conditions, as primary causes of system failure [9]. In Korea, a statistical failure risk prediction model was developed using operational data from actual district heating pipelines [10]. Other studies analyzed deterioration mechanisms to support lifecycle assessment and proposed AI-based approaches to strengthen failure prediction [11]. AI models trained on sensor data, such as temperature, pressure, flow, and vibration, have also been used to detect abnormal patterns indicative of failure [12]. Several recent studies have proposed intelligent monitoring frameworks for mechanical and structural health monitoring, which provide valuable insights for applying AI-based fault detection and diagnosis to pipeline systems [13]. Beyond these district heating–specific studies, a considerable body of research on water distribution and gas pipeline networks also informs predictive modeling. For instance, statistical and machine learning models have been applied to water mains and gas pipes to identify failure-prone sections and prioritize rehabilitation [14,15,16,17]. GIS-based approaches have also been proposed to link spatial attributes and environmental conditions to pipeline deterioration [18,19]. These studies underscore the significance of integrating spatial data and environmental factors with operational records when developing predictive models. Recent studies further indicate that data imbalance is a critical limitation in infrastructure failure prediction. Since failure events represent only a small proportion of total operational records, traditional models often suffer from biased learning [20,21]. 
To overcome this challenge, advanced data-driven approaches such as SMOTE, GAN-based augmentation, and anomaly detection frameworks have been proposed [22,23]. Moreover, deep learning methods have been applied to leakage detection in urban pipelines [24], while other research has emphasized the role of soil–pipe interaction and structural reliability in predicting the remaining life of buried infrastructure [25,26].

Despite these advances, limitations remain. For reasons of public safety, acquiring spatial and attribute data for pipelines is often restricted. Furthermore, failure history data represent only a small fraction compared with normal operational data, leading to severe class imbalance and limiting the effectiveness of AI-based models. Consequently, prior studies have largely focused on anomaly detection using sensor data or limited statistical analyses of selected attributes. In practice, long-buried pipelines face issues such as sensor degradation, which compromises the reliability of collected data. To overcome these challenges, analyzing failure risks using spatial and attribute information along with historical failure records, which are more manageable in terms of data acquisition, is necessary. In this study, we developed an AI-based high-risk prediction model for district heating pipeline failures to support more effective maintenance decision-making. Our research utilized spatial and attribute data from operational pipelines in Korea, combined with historical failure information. Pipelines were segmented into basic units for dataset construction. K-Nearest Neighbors (KNN) was applied to reclassify normal operational data by identifying sections with characteristics similar to historical failure cases, which were then incorporated as high-risk data. The constructed dataset was used to train XGBoost (XGB), LightGBM (LGBM), and an ensemble model (XGB + LGBM). The models were compared using multiple evaluation metrics, and the best-performing model was selected. The relative importance of key influencing variables was also analyzed.

2. Research Method and Basic Unit Definition

2.1. Research Method

The objective of this study was the development of an AI-based high-risk prediction model for district heating pipeline failures. To this end, 324,495 records of operational pipelines and 2293 records of historical failures were collected from Korean pipeline systems. The spatial data of the pipelines were stored as line-type *.shp files, and were mapped to the target area using QGIS v3.40.7. Attribute information included operator, purpose (supply or return), pipe diameter, year of completion, sensor wire condition, insulation level, burial depth, pressure, and accessory information. In addition, the burial environment was defined as the type of surface usage above the pipeline, which was extracted from the mapped spatial information to generate additional data. Because many of the attributes contained missing or erroneous values, only reliable variables were retained for analysis. The final input variables were pipe diameter, purpose, insulation level, year of burial, and burial environment. The line data in the spatial information had been divided without standardized quantitative criteria, making objective risk evaluation difficult. To address this, pipelines were segmented and merged under consistent rules to establish basic units. These unitized data were used for correlation analysis to identify statistically significant factors influencing failure, which then served as input variables, whereas failure status was set as the output variable in the constructed dataset.

The dataset was characterized by severe class imbalance, with failure records constituting only approximately 0.75% of the total. Such an imbalance can hinder the development of reliable models suitable for field application [27]. Furthermore, minor pipeline damage not detected during inspections could be misclassified as normal, reinforcing the imbalance. To mitigate this issue, the KNN algorithm was applied to identify normal records that shared similar characteristics with failure cases. These selected records were incorporated as augmented data, and together with the failure data, were defined as high-risk data. To enrich the input data, composite variables were generated by combining existing attributes: pipe diameter and burial environment (diameter_environment), purpose and burial environment (purpose_environment), and insulation level and burial environment (insulation_environment). The output variable was redefined into two classes: low and high risk.
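The composite variables described above can be illustrated with a minimal sketch. The record keys and the underscore-joined string encoding are assumptions made for illustration; the article names the three composite variables but does not specify how they were encoded.

```python
def add_composite_features(record):
    """Return a copy of `record` with the three composite variables added.

    The key names and the "<value>_<environment>" encoding are hypothetical.
    """
    env = record["burial_environment"]
    enriched = dict(record)
    enriched["diameter_environment"] = f'{record["diameter"]}_{env}'
    enriched["purpose_environment"] = f'{record["purpose"]}_{env}'
    enriched["insulation_environment"] = f'{record["insulation_level"]}_{env}'
    return enriched


# Hypothetical basic unit: 300 mm supply pipe under a road, insulation level 7.
example = {
    "diameter": 300,
    "purpose": "S",
    "insulation_level": 7,
    "burial_environment": "road",
}
print(add_composite_features(example))
```

Each composite value can then be treated as one categorical level, which lets a tree-based model split directly on combinations such as a large-diameter pipe under a road.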

For model development, the dataset was divided into training, validation, and testing subsets at a ratio of 75:10:15. The training set (75%) was used to develop models using machine learning algorithms, whereas the validation set (10%) was applied to confirm generalization performance. The testing set (15%), not used in training, was employed for final performance evaluation. The models were trained using XGB, LGBM, and an ensemble model (XGB + LGBM), with hyperparameters optimized for each. Model performance was assessed using the F2-score and AUC. Overfitting was considered avoided when the difference in F2-score between training and testing datasets was <0.1. The developed models were further analyzed to determine the importance of influencing factors in high-risk classification. The risk score distribution of actual failures and augmented high-risk records was evaluated to verify the appropriateness of the relabeled data. Finally, predicted risk levels were visualized for a subset of the study area to provide decision support for effective maintenance planning.
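A 75:10:15 partition that preserves the share of high-risk records in each subset might look like the following sketch. It uses only the Python standard library; the function name, the toy labels, and the per-class rounding are assumptions, and the study itself may have relied on library routines for the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.75, 0.10, 0.15), seed=42):
    """Split sample indices into train/validation/test, preserving class shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_valid = round(len(indices) * ratios[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy labels (hypothetical): 10% high risk (1), 90% low risk (0).
labels = [1] * 100 + [0] * 900
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Splitting within each class keeps the minority class represented in the validation and testing subsets, which matters when the F2-score is computed on them.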

2.2. Definition of Basic Units of Heating Pipelines

2.2.1. Data

For the development of the high-risk prediction model, spatial and attribute data of district heating pipelines in Korean urban areas were collected. Spatial information represents the physical characteristics of buried pipelines, including location, length, and area. From this information, topographical features and burial environment (surface usage) were extracted. The attribute data used in this study consisted of spatial information, pipeline facility ID, pipe diameter, purpose, year of burial, and insulation level. The facility ID identifies individual pipeline objects. Pipe diameter refers to the width of the pipeline. Purpose indicates whether a pipeline supplies heated water or returns cooled water within the district heating system. The year of burial specifies the installation date, thus providing the service age. Insulation level represents the condition of sensor wires designed to detect moisture intrusion caused by pipeline or insulation damage. A change in insulation level indicates possible pipeline failure. However, because sensor wires may already be short-circuited, the insulation level alone cannot reliably determine the pipeline condition. Disconnected sensor records were excluded from modeling features. Table 1 summarizes the characteristics of the pipeline attribute data.

2.2.2. Basic Unit Definition

The attribute information of the collected district heating pipelines had originally been classified according to arbitrary criteria set by operators. For efficient prediction of failure probabilities, establishing standardized basic units for dataset construction was necessary. The basic units were defined based on pipeline attribute information [28]. To construct the data required for this definition, preprocessing procedures were conducted. Although most of the data contained no missing or abnormal values, discretization was applied to attributes with occasional gaps or anomalies. The basic units were defined using four criteria in order of priority. Based on this process, 132,210 pipeline records were converted into a total of 324,494 basic units.

Integration via Attribute Information

As a first step in unit definition, pipelines that were physically connected and shared identical attribute information were combined into single basic units. The attributes considered for integration were operator, purpose (supply or return), diameter, heating service status, installation date (service life), sensor wire condition, and designation as a critical management section. These attributes were considered to directly or indirectly influence structural stability and failure likelihood. Among them, the attributes provided by operators with associated failure history data (operator, purpose, diameter, and installation date) were later used in correlation analysis.

Maximum and Minimum Length Criteria

Length constraints were also applied to standardize the basic units. The minimum length was set at 1 m, whereas the maximum length was set at 36 m. In cases where pipeline segments were <1 m, they were merged with adjacent segments possessing the same attribute information. Conversely, pipeline segments > 36 m were divided into multiple basic units, with the number of units obtained by dividing the segment length by 36 and rounding up to the next integer. For example, as shown in Figure 1, a 50 m pipeline was divided into two basic units. Through this process, no individual basic unit exceeded 36 m in length.
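The maximum-length rule can be sketched as follows. The sketch assumes that the number of units is rounded up, which is what makes the 50 m example yield two units and keeps every unit at or below 36 m, and that a segment is divided into units of equal length; the function name is hypothetical.

```python
import math

MAX_LEN_M = 36.0  # maximum basic-unit length stated in the text


def split_segment(length_m, max_len=MAX_LEN_M):
    """Divide a segment into equal-length basic units no longer than `max_len`."""
    n_units = max(1, math.ceil(length_m / max_len))
    return [length_m / n_units] * n_units


print(split_segment(50.0))  # [25.0, 25.0]
```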

Basic Unit Homogeneity Rule

A basic unit was defined as a contiguous pipeline segment that remains homogeneous in its governing attributes. Unit boundaries were determined whenever any of the following attributes changed: operator, purpose (supply/return), diameter, insulation level (sensor wire condition), installation date (service life), and burial environment. This rule ensured that each basic unit represents a physically and operationally consistent section of pipeline suitable for failure probability estimation.

2.2.3. Failure History Data

Estimation of failure probability based on attribute information required the inclusion of historical failure records. Therefore, data were collected on pipelines that had undergone repair operations and were associated with documented failure events. The selected failure cases included pipelines that experienced leakage, geothermal effects, abnormal high temperature, steam leakage, or functional loss, all of which resulted in failure and subsequent restoration. In total, 2293 pipeline failures were recorded. These data were used to estimate failure probabilities associated with different attribute characteristics of the district heating pipelines.

2.2.4. Reproducibility and Implementation Details

To ensure the reproducibility of the proposed approach, all experiments were implemented in Python 3.10 using the pandas, numpy, scikit-learn, XGBoost, LightGBM, imblearn, and Optuna libraries. A fixed random seed (random_state = 42) was applied throughout data splitting, resampling, and model training to maintain result consistency. The overall modeling procedure can be summarized as follows:

Input: Raw district heating pipeline dataset

1. Data preprocessing:
   - Remove missing/abnormal values
   - Standardize continuous attributes (diameter, insulation level, installation year)
   - Encode categorical attributes (purpose, burial environment)
   - Generate derived features (diameter × environment, insulation × environment, purpose × environment)

2. KNN-based relabeling:
   - Calculate Euclidean distance between normal and failure samples
   - Select top 10% of normal records most similar to failure cases and relabel as high risk

3. Data balancing:
   - Apply stratified train/validation/test split with fixed seed

4. Model development:
   - Hyperparameter optimization via Optuna for XGBoost and LightGBM
   - Determine optimal classification threshold using F2-score on validation data

5. Evaluation:
   - Report ACC, Precision, Recall, F1, F2, AUC metrics
   - Compare performance across K values to select the most balanced condition

Output: Final high-risk prediction model (XGB with K = 2)

3. Data Characteristics and Correlation Analysis

3.1. Data Characteristics

The input data used in this study consisted of spatial and attribute information of district heating pipelines, including facility ID, pipe diameter, purpose, insulation level, year of burial, and burial environment. Facility ID identified pipeline objects segmented into basic units. Pipe diameter represented the width of each pipeline. As district heating pipelines operate by supplying and returning high-pressure hot water for heat transfer, the purpose attribute indicated whether a pipeline functioned as a supply or return line.

To monitor pipeline integrity, sensor wires were attached to the outer surface of the pipelines by operators. When internal hot water leaked due to pipe rupture or when external soil moisture penetrated through damaged insulation, the insulation level decreased as moisture infiltrated the sensor wire [29,30,31]. The collected data recorded sensor status as either normal or disconnected. Under normal conditions, insulation levels were graded from 1 to 13, with disconnection indicating sensor malfunction. In this study, the insulation level does not directly indicate the actual failure of the district heating pipeline but serves as an indirect measure by reflecting the electrical condition of the moisture-detection sensor wire. While it can signal potential damage through changes caused by moisture ingress or sensor degradation, it does not provide a direct confirmation of pipeline rupture. Therefore, the insulation level should be regarded as a proxy indicator rather than a definitive measure of pipeline failure.

The year of burial represented the installation date of the pipeline and allowed for aging quantification. Burial environment described land use above the buried pipeline, identified by mapping spatial pipeline data to land use information. Environments were classified as water (rivers or streams), road, and other categories (parks, residential areas, etc.). The output variable was pipeline failure, expressed as binary values: 0 (normal) and 1 (failure). The dataset included 324,495 normal and 2293 failure records. Using the KNN algorithm, high-risk groups were redefined by incorporating normal records with characteristics similar to failure cases.

The dataset consisted of both categorical (purpose, burial environment) and numerical (diameter, insulation level, year of burial) variables. In particular, supply pipelines (S) accounted for 162,406 cases (161,087 normal; 1319 failures), whereas return pipelines (R) accounted for 162,106 cases (161,535 normal; 571 failures). Burial environment included 3223 water (3059 normal; 164 failures), 211,670 road (210,632 normal; 1038 failures), and 109,620 other (108,932 normal; 688 failures) segments. Table 2 presents descriptive statistics of the numerical variables. On average, failure data were characterized by larger pipe diameters, lower insulation levels, and older burial years compared with normal data.

3.2. Correlation Analysis

To develop the high-risk prediction model, correlation analysis was performed between the input data and pipeline failure events to examine statistical significance. Correlation analysis is a method for identifying the linear relationship between independent (inputs) and dependent (failure outcomes) variables and for quantifying the strength of these relationships [32]. In this study, Pearson correlation analysis was applied to numerical variables, whereas chi-square tests were conducted on categorical variables. Pearson correlation coefficients were calculated using Equation (1). Larger absolute values of the coefficient indicated stronger correlations between variables [33].

\mathrm{Corr}(X, Y) = ρ(X, Y) = \frac{\mathrm{Cov}(X, Y)}{σ_{x} σ_{y}}

(1)

where σ_{x} and σ_{y} are the standard deviations of X and Y.

Statistical significance was evaluated using hypothesis testing with the test statistic in Equation (2) [34].

t = r \sqrt{\frac{n - 2}{1 - r^{2}}}

(2)

where r: correlation coefficient, n: sample size
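Equations (1) and (2) can be evaluated directly, as in the sketch below. The burial years and failure flags are made-up toy values used only to show the computation; they are not taken from the study dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, Equation (1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/n factors of the covariance and standard deviations cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """Test statistic for the significance of r, Equation (2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical toy data: year of burial and failure flag (1 = failure).
burial_year = [1990, 1995, 2000, 2005, 2010, 2015]
failure = [1, 1, 0, 1, 0, 0]

r = pearson_r(burial_year, failure)
print(r, t_statistic(r, len(burial_year)))
```

In this toy sample the coefficient is negative because the older pipes fail more often, which matches the sign reported for year of burial in the study.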

For categorical variables, bivariate chi-square tests were conducted. This method assesses whether the observed distribution of a variable matches expected values. Test statistics were derived by comparing expected frequencies with observed values for each cell of a contingency table [35]. The chi-square statistic (χ^{2}) reflected the difference between observed and expected frequencies, with larger values indicating stronger statistical association [36].
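As an illustration, the chi-square statistic can be computed for the purpose variable using the supply/return counts quoted in Section 3.1. This sketch implements the plain Pearson statistic without continuity correction, which is an assumption, since the article does not state which variant was used.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: supply, return. Columns: normal, failure (counts quoted in Section 3.1).
purpose_table = [[161087, 1319], [161535, 571]]
chi2 = chi_square(purpose_table)

# Critical value of the chi-square distribution for 1 degree of freedom, alpha = 0.05.
CRITICAL_005_DF1 = 3.841
print(chi2, chi2 > CRITICAL_005_DF1)
```

A statistic above the critical value corresponds to a p-value below 0.05, the significance criterion adopted in this study.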

Given the complexity of factors influencing pipeline failures and the small proportion of failure records, it was unlikely that a single factor would show strong correlation with failure. Therefore, only variables demonstrating statistical significance were selected as model inputs. A p-value < 0.05 was considered significant.

The Pearson correlation results for numerical variables, diameter, insulation level, and year of burial, are shown in Table 3. All variables yielded p-values < 0.05, indicating statistically significant correlations with failure. Pipeline diameter showed a positive correlation with failure, whereas insulation level and year of burial demonstrated negative correlations. These results suggested that larger diameters increase failure likelihood, whereas lower insulation levels and older burial years are associated with higher failure risk. The chi-square test results for categorical variables, burial environment and purpose, are presented in Table 4. Both variables showed statistically significant relationships with pipeline failure. Based on these findings, diameter, insulation level, year of burial, burial environment, and purpose were selected as the influencing factors for model development.

4. High-Risk Data Selection for Heating Pipelines

This study collected both operational (normal) data and historical failure records with the objective of developing a high-risk prediction model for district heating pipelines. Because the dataset was highly imbalanced, constructing a reliable prediction model directly from the raw data posed limitations for field application. Identifying buried pipeline failures typically requires large-scale excavation work [37], which is difficult to conduct given constraints of budget, time, and manpower. Furthermore, even pipelines in normal operation may exhibit minor damage or leakage that remains undetected. To account for these factors, failure characteristics were analyzed and the KNN algorithm was applied to identify records within the normal dataset that shared similar attributes with failure cases. These selected records were merged with actual failure data and redefined as high-risk data, thereby alleviating data imbalance to some extent.

The KNN algorithm is a supervised learning method that classifies data based on the majority label among the K nearest neighbors [38]. With only the parameter K required, the algorithm is simple to implement, intuitively interpretable, and applicable to nonlinear data due to its distance-based approach. However, disadvantages include large memory consumption, slow prediction speed in large-scale datasets, and a risk of overfitting [39]. KNN requires minimal training effort and can classify new data points immediately upon arrival; however, as the dimensionality of the dataset increases, the efficiency of distance computation and the accuracy of classification may deteriorate due to the curse of dimensionality [40]. Figure 2 illustrates the conceptual framework of KNN classification.

In this study, the distance between samples was measured using the Euclidean distance metric after standardizing all continuous attributes. Although alternative metrics such as the Gower distance are commonly used for mixed-type datasets, Euclidean distance was selected because the majority of attributes in the present dataset were continuous (e.g., diameter, installation year, insulation level), while only a limited number were categorical. This choice enabled efficient similarity computation for a large-scale dataset while maintaining adequate representativeness.

Figure 3 provides a schematic overview of the data preparation and relabeling procedure, including unitization, distance calculation, and selection of the top 10% of normal records most similar to failure cases.

The key hyperparameter in KNN is the K value, which determines the number of neighbors considered. In this study, a K range (0–10) was used to select high-risk data, which were then used to train machine learning models. As K increased, more failure samples were referenced, broadening the similarity criterion and leading to potential averaging effects. To mitigate excessive incorporation of data and reduce the risk of overfitting, the proportion of selected high-risk data was limited to the top 10%. For each high-risk record incorporated, the distance to the nearest failure sample was calculated and defined as the pseudo mean metric. A lower pseudo mean metric indicated greater similarity between incorporated and actual failure data. In addition, the average distance between all normal and failure data was calculated, termed the all mean metric. This measure reflected the overall similarity between the normal dataset and failure cases and served as a baseline for evaluating whether incorporated data were closer to failure cases compared with the broader dataset. The comparison between pseudo mean metric and all mean metric was used to indirectly assess the quality, validity, and potential noise of the incorporated data [41,42]. Table 5 summarizes the number of incorporated high-risk records, the pseudo mean metric, and the all mean metric for different values of K.

The number of incorporated records remained at approximately 32,000 under the 10% selection threshold. In all cases, the pseudo mean metric was lower than the all mean metric, indicating that the incorporated data were consistently closer to failure cases than the full set of normal data. As K increased, the distance metrics became more similar, suggesting that the boundary between normal and failure characteristics became less distinct. Figure 4 illustrates the difference between the two metrics across varying K values. The consistently lower pseudo mean metric demonstrated that the KNN algorithm successfully identified appropriate high-risk records for inclusion.
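The relabeling procedure and the two distance metrics can be sketched in a few lines of standard-library Python. Several details are assumptions rather than the authors' implementation: normal records are ranked by their mean Euclidean distance to the K nearest failure samples, features are standardized over the pooled data, the pseudo mean metric is taken as the average nearest-failure distance of the selected records, and the toy rows are invented.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column using the mean and standard deviation of `rows`."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def knn_relabel(normal, failure, k=2, top_fraction=0.10):
    """Return indices of normal rows relabeled as high risk, plus both metrics."""
    scores, nearest, all_dists = [], [], []
    for row in normal:
        dists = sorted(math.dist(row, f) for f in failure)
        scores.append(mean(dists[:k]))  # assumed score: mean distance to K nearest failures
        nearest.append(dists[0])
        all_dists.extend(dists)
    n_select = max(1, round(len(normal) * top_fraction))
    ranked = sorted(range(len(normal)), key=scores.__getitem__)
    selected = ranked[:n_select]
    pseudo_mean = mean(nearest[i] for i in selected)  # selected records only
    all_mean = mean(all_dists)                        # every normal-failure pair
    return selected, pseudo_mean, all_mean


# Hypothetical toy rows: (diameter in mm, insulation level, year of burial).
failure_raw = [(600, 2, 1992), (500, 3, 1995)]
normal_raw = [(550, 3, 1994), (100, 12, 2018), (150, 11, 2015), (200, 10, 2012),
              (80, 13, 2020), (125, 12, 2016), (300, 9, 2008), (250, 10, 2010),
              (65, 13, 2019), (400, 8, 2005)]

scaled = standardize(failure_raw + normal_raw)
failure, normal = scaled[:len(failure_raw)], scaled[len(failure_raw):]
selected, pseudo_mean, all_mean = knn_relabel(normal, failure, k=2)
print(selected, pseudo_mean < all_mean)
```

For a dataset of the size used in this study, a brute-force loop of this kind would be slow; an indexed nearest-neighbor search would be the practical choice.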

5. Development of the High-Risk Prediction Model for Heating Pipelines

In this study, high-risk data were identified using the KNN algorithm and applied to ML models including XGB, LGBM, and their ensemble (XGB + LGBM). For each dataset generated using different K values (0–10), the ML algorithms were trained and evaluated to determine the optimal K value for high-risk prediction.

5.1. XGBoost (eXtreme Gradient Boosting, XGB)

XGB is a tree-based boosting algorithm that trains a sequence of weak learners and combines them to produce a strong predictive model. Each subsequent learner adjusts the weights of misclassified samples from the previous stage, thereby improving overall performance. XGB is widely used for regression and classification tasks and is particularly effective for large datasets. It has the advantage of handling nonlinear relationships while being less prone to overfitting compared with traditional models [43].

The prediction for the i-th sample, denoted as \hat{y_{i}}, is expressed in Equation (3), where f_{k} represents the output of the k-th decision tree. The raw score is obtained by summing the outputs of all trees [44]. For binary classification, this summed score f (x_{i}) is converted to a probability by the sigmoid function in Equation (4).

\hat{y_{i}} = \sum_{k = 1}^{K} f_{k} (x_{i})

(3)

\hat{y_{i}} = \frac{1}{1 + e^{- f (x_{i})}}

(4)

Errors from the tree models are minimized using Equation (5), where {\hat{y_{i}}}^{(t - 1)} denotes the prediction from the previous iteration, h_{t} (x_{i}) is the output of the current tree, and η represents the learning rate. By repeating this sequential process, the overall model error is gradually reduced [45].

{\hat{y_{i}}}^{(t)} = {\hat{y_{i}}}^{(t - 1)} + η h_{t} (x_{i})

(5)
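The prediction-time arithmetic of Equations (3)–(5) can be illustrated for a single sample. In this sketch each tree contributes η h_{t}(x_{i}) to a running score, and the final score is passed through the sigmoid; the tree outputs and the learning rate are arbitrary illustrative values, and no tree fitting is performed.

```python
import math


def sigmoid(z):
    """Logistic function used in Equation (4)."""
    return 1.0 / (1.0 + math.exp(-z))


def boosted_probability(tree_outputs, eta=0.1, initial_score=0.0):
    """Accumulate tree outputs for one sample and map the score to a probability."""
    score = initial_score
    for h_t in tree_outputs:
        score = score + eta * h_t  # additive update of Equation (5)
    return sigmoid(score)


# Hypothetical outputs of four successive trees for one pipeline unit.
print(boosted_probability([2.0, 1.5, 1.0, 0.5], eta=0.3))
```

A smaller η shrinks each tree's contribution, so more trees are needed to reach the same score, which is one reason the learning rate is usually tuned together with the number of trees.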

5.2. LightGBM (Light Gradient Boosting Machine, LGBM)

LGBM is another boosting algorithm similar to XGB but optimized for computational efficiency. Instead of using the entire dataset, LGBM samples subsets of data and features, thereby simplifying computation and significantly improving processing speed. The algorithm minimizes a cross-entropy loss function and is designed to scale effectively to large datasets. Because of its high performance and efficiency, LGBM has been widely adopted across diverse applications [46].

The cross-entropy loss function used in LGBM is expressed in Equation (6), where N is the number of samples, K is the number of classes, y_{i, j} is a binary variable indicating whether the i-th sample belongs to the j-th class, and p_{i, j} is the predicted probability of this assignment [47]. LGBM updates the model iteratively to minimize cross-entropy and improve predictive accuracy.

\mathrm{CE} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{K} y_{i, j} \log (p_{i, j})

(6)
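A direct evaluation of the cross-entropy loss is sketched below, using the standard sign convention (a leading minus) so that the loss is non-negative and decreases as predictions improve. The clipping constant `eps` is an implementation assumption to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean cross-entropy over N samples and K classes.

    y_true: one-hot rows (y_ij); y_prob: rows of predicted probabilities (p_ij).
    """
    total = 0.0
    for truth_row, prob_row in zip(y_true, y_prob):
        for y_ij, p_ij in zip(truth_row, prob_row):
            total += y_ij * math.log(max(p_ij, eps))
    return -total / len(y_true)


# Two-class toy example: columns are (low risk, high risk).
y_true = [[1, 0], [0, 1], [0, 1]]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
print(cross_entropy(y_true, y_prob))
```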

5.3. Model Evaluation Metrics

For imbalanced classification problems, performance metrics such as accuracy alone are insufficient, as they may provide misleading results when failure cases account for only a small fraction of the data. To address this, the primary evaluation metric in this study was the F2-score, which emphasizes recall by assigning greater weight to false negatives. This metric was particularly appropriate for assessing the efficiency of the model in identifying high-risk pipelines [48,49,50]. Accuracy and the area under the receiver operating characteristic curve (AUC) were used as supplementary evaluation metrics. To prevent overfitting, differences between F2-scores for the training and validation datasets were examined, and those <0.1 indicated overfitting avoidance.

Model performance was calculated using a confusion matrix (Table 6). From this, accuracy, recall, precision, F2-score, and specificity were derived using Equations (7)–(11) [51,52]. The reliability of the best-performing model was further validated by examining the AUC derived from the ROC (receiver operating characteristic) curve (Figure 5). The criteria for interpreting AUC are presented in Table 7 [53].

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(7)

\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}

(8)

\text{Precision} = \frac{TP}{TP + FP}

(9)

\text{F2-score} = 5 \times \frac{\text{Recall} \times \text{Precision}}{4 \times \text{Precision} + \text{Recall}}

(10)

\text{Specificity} = \frac{TN}{TN + FP}

(11)
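The metrics in Equations (7)–(11) follow directly from the four confusion-matrix counts. The sketch below computes them with a general F-beta form (beta = 2 gives the F2-score); the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn, beta=2.0):
    """Compute accuracy, recall, precision, F-beta, and specificity from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    # With beta = 2 this is 5 * P * R / (4 * P + R), i.e. the F2-score.
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f2": f_beta,
        "specificity": specificity,
    }

# Hypothetical confusion-matrix counts for illustration only.
m = classification_metrics(tp=90, tn=880, fp=20, fn=10)
```

In this example recall (0.90) is higher than precision (about 0.82), and the F2-score (about 0.88) sits closer to recall, which reflects the heavier weight placed on missed high-risk cases.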

High-risk data redefined through the KNN algorithm were applied to ML models to develop a high-risk failure prediction model for district heating pipelines. The input variables were pipe diameter, purpose, insulation level, year of burial, and burial environment, whereas the output variable was defined as a binary classification of high-risk status. To reduce skewness in numerical variables, log transformation was applied. Categorical variables were integer-encoded through label encoding. In addition, interaction variables were created by combining pipe diameter and burial environment, insulation level and burial environment, and purpose and burial environment to enrich feature diversity. The final dataset is presented in Table 8. Because the proportion of high-risk records redefined using the distance-based KNN algorithm was limited to the top 10%, data imbalance remained. To address this, SMOTE was applied. The preprocessed dataset was then trained using XGB, LGBM, and XGB + LGBM algorithms, with hyperparameters optimized through the Optuna framework.
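The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the category names are hypothetical, `log1p` is used as the log transform, and interaction variables are formed by concatenating category labels before encoding (the study does not specify how the combinations were built). Oversampling with SMOTE and the Optuna search are not shown.

```python
import math

def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def interaction(col_a, col_b):
    """Combine two columns into a single categorical 'a|b' column (assumed form)."""
    return [f"{a}|{b}" for a, b in zip(col_a, col_b)]

# Hypothetical records; the real attribute values are not given in the text.
diameter = [20, 150, 850, 1100]
purpose = ["supply", "return", "supply", "return"]
environment = ["road", "sidewalk", "road", "green"]

# Log transform to reduce skewness in a numerical variable.
log_diameter = [math.log1p(d) for d in diameter]

# Label encoding of categorical variables.
purpose_code, purpose_map = label_encode(purpose)
env_code, env_map = label_encode(environment)

# Interaction variable: purpose x burial environment, then label-encoded.
purpose_env_code, purpose_env_map = label_encode(interaction(purpose, environment))
```

Encoding the combined label as its own category lets a tree model split on a specific purpose and environment pairing directly, rather than having to discover it through two successive splits.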

Model thresholds were set to maximize the F2-score, and models were considered valid only when the difference between training and validation F2-scores was <0.1. The ensemble model (XGB + LGBM) was constructed by averaging the probability outputs of the two base models. The results for F2-score, recall, and AUC across different K values are shown in Figure 6, Figure 7 and Figure 8, respectively. The case K = 0 represents models trained without KNN-based relabeling, where only SMOTE was applied to address class imbalance. Comparing the average results across K = 1–10 with the baseline (K = 0), we found that KNN application improved the F2-score by approximately 86%. Recall and AUC were also significantly improved. At K = 1, the F2-score was approximately 0.78; however, values improved sharply for K ≥ 2, exceeding 0.92. Recall remained above 0.96, and AUC reached 0.99, indicating very high reliability. The differences between individual algorithms (XGB, LGBM, XGB + LGBM) were negligible.
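The threshold selection and probability averaging described above can be sketched as follows. The grid of candidate thresholds and all probability values are assumptions for illustration; they are not outputs of the models in this study.

```python
def f2_score(y_true, y_pred):
    """F2 = 5*TP / (5*TP + 4*FN + FP), equivalent to the precision/recall form."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

def best_f2_threshold(y_true, proba, grid=None):
    """Return the first threshold on the grid that maximizes the F2-score."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_thr, best_f2 = None, -1.0
    for thr in grid:
        pred = [1 if p >= thr else 0 for p in proba]
        score = f2_score(y_true, pred)
        if score > best_f2:
            best_thr, best_f2 = thr, score
    return best_thr, best_f2

def average_ensemble(proba_a, proba_b):
    """Simple average of two models' predicted probabilities."""
    return [(a + b) / 2 for a, b in zip(proba_a, proba_b)]

# Hypothetical validation labels and predicted probabilities.
y_val = [1, 1, 1, 0, 0, 0, 0, 0]
proba_xgb = [0.92, 0.71, 0.25, 0.50, 0.22, 0.10, 0.05, 0.24]
proba_lgbm = [0.88, 0.65, 0.35, 0.40, 0.28, 0.12, 0.07, 0.16]

proba_avg = average_ensemble(proba_xgb, proba_lgbm)
thr, f2 = best_f2_threshold(y_val, proba_avg)
```

In this example the selected threshold is low enough to capture all three positives at the cost of one false positive, which is the trade-off the F2-score is designed to favor.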

Although the KNN method successfully incorporated high-risk records with similar characteristics to failure data, these redefined records could not be assumed equivalent to actual failure cases. Accordingly, a distinction between the two groups had to be maintained. To identify the optimal K value, the median risk scores predicted by the model were compared between actual failure cases and incorporated high-risk records. A K value was considered acceptable when the median risk score of failure cases was higher than that of incorporated data. As shown in Figure 9, the distinction between groups was most evident at K = 1; however, model performance was relatively low under this condition. At K = 2, failure cases exhibited higher median scores and the model achieved an F2-score > 0.90, leading to the selection of XGB (K = 2) as the final model. Figure 10 shows the risk score distribution derived from the optimal model, in which the actual failure data are concentrated at relatively high scores.
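The selection rule can be expressed as a short sketch. It assumes that the smallest K satisfying both conditions is chosen and that "adequate performance" means an F2-score of at least 0.90; both are readings of the text rather than stated rules. The risk scores below are hypothetical.

```python
from statistics import median

def select_k(results, min_f2=0.90):
    """Pick the smallest K whose model separates the groups and performs well.

    results: {K: {"f2": float,
                  "failure_scores": [...],       # risk scores of actual failures
                  "incorporated_scores": [...]}} # risk scores of relabeled records
    Returns None if no K qualifies.
    """
    for k in sorted(results):
        r = results[k]
        separated = median(r["failure_scores"]) > median(r["incorporated_scores"])
        if separated and r["f2"] >= min_f2:
            return k
    return None

# Hypothetical scores for illustration only.
results = {
    1: {"f2": 0.78, "failure_scores": [0.90, 0.95, 0.85],
        "incorporated_scores": [0.60, 0.70, 0.65]},
    2: {"f2": 0.921, "failure_scores": [0.93, 0.90, 0.96],
        "incorporated_scores": [0.88, 0.90, 0.86]},
    3: {"f2": 0.93, "failure_scores": [0.90, 0.91, 0.92],
        "incorporated_scores": [0.93, 0.94, 0.92]},
}

chosen_k = select_k(results)
```

Here K = 1 separates the groups clearly but fails the performance requirement, while K = 3 performs well but no longer ranks actual failures above incorporated records, so K = 2 is returned.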

The performance indicators of the models (K = 2) are summarized in Table 9. Both F2-score and AUC demonstrated strong results. The importance of input variables derived from the model is shown in Figure 11, where year of burial ranked highest, followed by insulation level and purpose. Although the correlation analysis previously identified insulation level as the most significant factor, both analyses consistently highlighted year of burial and insulation level as dominant predictors of high-risk classification. The main hyperparameters are shown in Table 10. To further examine the influence of the most important variable, the monotonic relationship between the year of burial and the predicted risk score was analyzed. Figure 12 illustrates that pipelines installed in earlier years exhibit substantially higher predicted risk scores compared to recently installed ones. This trend supports the interpretation that pipeline aging is a major contributor to failure risk and provides additional confidence in the model’s variable importance analysis.

In this study, three gradient boosting–based models were compared: XGBoost, LightGBM, and a simple average ensemble of the two. All three models showed comparable predictive performance; however, XGBoost (K = 2) achieved the highest F2-score and AUC, leading to its selection as the final model. This advantage is likely due to XGBoost’s strong regularization (α, λ) and class-weighting (scale_pos_weight) capabilities, which effectively enhanced recall—a key objective of this study.

The simple average ensemble did not outperform the single XGBoost model because both base learners share similar structures and error patterns, limiting diversity and thus the potential ensemble gain. Furthermore, probability averaging slightly diluted the optimized F2 decision threshold, reducing recall-oriented performance improvements.

Given these results, XGBoost (K = 2) was selected as the final model, as it provided the best balance between high predictive accuracy, recall-oriented F2 optimization, and interpretability.

6. Discussion

This study developed a high-risk prediction model for district heating pipeline failures by combining attribute information and failure history data using KNN and machine learning techniques. Even data classified as normal may contain undetected minor damage or latent failures. To address this, normal records with characteristics similar to actual failures were redefined as high-risk data. This approach alleviated the imbalance in the dataset and improved the performance of the high-risk prediction model.
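The relabeling idea can be sketched as follows. This is not the authors' implementation: it assumes Euclidean distance on pre-scaled numerical features and uses the mean distance to the K nearest failure records as the similarity score, then relabels the closest 10% of normal records. The function name and data are hypothetical.

```python
import math

def knn_relabel(normal, failures, k=2, top_fraction=0.10):
    """Return indices of normal records to relabel as high-risk.

    Each normal record is scored by its mean Euclidean distance to its k
    nearest failure records; the fraction with the smallest scores is selected.
    """
    k = min(k, len(failures))
    scores = []
    for idx, rec in enumerate(normal):
        dists = sorted(math.dist(rec, f) for f in failures)
        scores.append((sum(dists[:k]) / k, idx))
    scores.sort()
    n_select = max(1, round(top_fraction * len(normal)))
    return [idx for _, idx in scores[:n_select]]

# Hypothetical scaled two-feature records.
normal = [(0.1 * i, 0.1 * i) for i in range(10)]
failures = [(0.95, 0.95), (1.0, 1.0), (0.9, 0.85)]

relabeled = knn_relabel(normal, failures, k=2)
relabeled_20 = knn_relabel(normal, failures, k=2, top_fraction=0.2)
```

Because the selected records are drawn from existing normal data, the positive class grows without synthesizing new samples; the choice of distance function matters for mixed numerical and categorical attributes and is left open here.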

Selecting the optimal K value was critical for incorporating data through the KNN algorithm. The results showed that model performance improved incrementally as the K value increased. However, incorporated high-risk data cannot be assumed equivalent in risk level to actual failure data. For this reason, the model should preserve a distinction between the two groups. In this study, a K value was accepted only when the median risk score of failure data was higher than that of incorporated records. Under this criterion, the XGB model with K = 2 achieved the best performance. This criterion is expected to enhance the applicability of the model when used in real-world failure risk classification.

The XGB (K = 2) model demonstrated significant improvement compared with previously developed prediction models [28]. The F2-score increased from 0.804 to 0.921, while AUC improved from 0.837 to 0.993. Prior studies often expanded high-risk groups by incorporating pipelines with uniform levels of aging as positive samples. Such approaches risk distorting data distribution and reducing reliability. In contrast, this study redefined high-risk data by selectively incorporating records with failure-like characteristics using KNN-based preprocessing. This method enhanced predictive performance and increased the reliability of the results provided to end users.

Applying KNN preprocessing to the entire dataset is not generally preferred. However, in the present case, pipelines classified as normal may still include segments with actual damage or elevated failure risk. Moreover, the incorporated records were not artificially generated but rather selected from existing normal data. If the definition of high-risk groups is formally established in consultation with pipeline management authorities, the practical utility of the model can be strengthened further. Additionally, SHAP is an explainable AI technique that attributes each feature’s contribution to the model’s output, allowing both global and local interpretation of predictions. Incorporating SHAP in future failure risk analyses could help engineers understand the influence of individual pipeline attributes on predicted risk levels, thereby improving decision-making and field inspection strategies.

Variable importance analysis revealed year of burial (0.515), insulation level (0.131), and purpose (0.074) as the most influential factors. Although correlation analysis identified insulation level as the most strongly correlated, both analyses consistently emphasized year of burial and insulation level as critical factors. Year of burial reflects pipeline aging [54], whereas insulation level reflects cases of moisture infiltrating the sensor wire either through pipe rupture or damaged insulation, lowering the recorded level [32,55]. These findings indicated that both factors directly contribute to high failure risk. Accordingly, effective risk management requires continuous monitoring of sensor insulation conditions and targeted maintenance strategies for aging pipelines.

The findings of this study may support improved risk management of district heating pipelines. Nevertheless, consensus with management authorities is required regarding the definition of high-risk data. The current high-risk dataset combines failure cases with incorporated records, which do not carry identical engineering risk. To refine this definition, further engineering analysis should be conducted that considers factors such as ground conditions, external loading, and construction methods. Incorporating these aspects would enhance the reliability of pipeline failure risk assessments and provide a robust foundation for developing advanced infrastructure safety management systems.

7. Conclusions

This study presented a high-risk failure prediction model for district heating pipelines by integrating spatial and attribute information with historical failure records.

Normal data with characteristics similar to failure cases were identified using the KNN algorithm and redefined as high-risk data, expanding the dataset used for model training without artificially generating samples. This approach enabled the construction of predictive models that incorporate latent failure risks and better reflect the real distribution of pipeline conditions.

Based on the redefined dataset, three machine learning algorithms—XGBoost, LightGBM, and their ensemble—were evaluated to classify high-risk pipelines. By jointly considering model performance metrics and the median risk scores of actual failures, the condition K = 2 was identified as the most balanced. Among the tested models, XGBoost (K = 2) achieved the best performance (F2-score = 0.921, AUC = 0.993), demonstrating its ability to effectively handle the pseudo-labeled dataset while maintaining strong generalization.

Variable analysis revealed that year of burial and insulation level were the most influential predictors of failure, followed by purpose, diameter, and burial environment. These findings highlight the critical role of pipeline aging and sensor insulation degradation in assessing failure risk and provide engineering insight into effective monitoring and maintenance strategies.

The proposed methodology offers practical potential for risk mapping and safety management of district heating networks, supporting proactive maintenance planning and prioritization of high-risk segments. Such integration of latent risk information can enhance the reliability of infrastructure management and inform field-level decision-making.

Nevertheless, this study was limited by the attributes included in the dataset, which excluded ground characteristics, external loading, and construction methods, and by the use of data from a subset of pipelines in Korea, raising concerns about generalizability. Future research should incorporate more diverse pipeline and geotechnical information, validate the model with additional real-world failure records, and perform field verification to confirm its practical applicability.

Author Contributions

Conceptualization: J.K. (Jaemo Kang) and J.K. (Jinyoung Kim); Developed the models and carried out the model simulations: S.L. and M.K.; Writing—original draft preparation: S.L. and M.K.; Writing—review and editing: J.K. (Jaemo Kang) and J.K. (Jinyoung Kim). All authors have read and agreed to the published version of the manuscript.

Funding

Research for this paper was carried out under the Korea Institute of Civil Engineering and Building Technology (KICT) Research Program (project no. 20250329-001 Development of Risk Prevention and Response Scenarios for Underground Urban Infrastructure) funded by the Ministry of Science and ICT.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhou, S.; O’Neill, Z.; O’Neill, C. A review of leakage detection methods for district heating networks. Appl. Therm. Eng. 2018, 137, 567–574.
2. Rafati, A.; Shaker, H.R. Predictive maintenance of district heating networks: A comprehensive review of methods and challenges. Therm. Sci. Eng. Prog. 2024, 53, 102722.
3. van Dreven, J.; Boeva, V.; Abghari, S.; Grahn, H. Intelligent Approaches to Fault Detection and Diagnosis in District Heating: Current Trends, Challenges, and Opportunities. Electronics 2023, 12, 1448.
4. Ihwnagu, U.T.I.; Debnath, R.; Ahmed, A.A.; Alam, M.J.B. An Integrated Approach for Earth Infrastructure Monitoring Using UAV and ERI: A Systematic Review. Drones 2025, 9, 225.
5. Ravindran, G. Evaluation of New Technologies to Support Asset Management of Metro Systems; UCL: London, UK, 2020.
6. Guan, H.; Xiao, T.; Luo, W.; Gu, J.; He, R.; Xu, P. Automatic fault diagnosis algorithm for hot water pipes based on infrared thermal images. Build. Environ. 2022, 218, 109111.
7. Adegboye, M.A.; Fung, W.-K.; Karnik, A. Recent advances in pipeline monitoring and oil leakage detection technologies: Principles and approaches. Sensors 2019, 19, 2548.
8. Shen, Y.; Chen, J.; Fu, Q.; Wu, H.; Wang, Y.; Lu, Y. Detection of district heating pipe network leakage fault using UCB arm selection method. Buildings 2021, 11, 275.
9. Valinčius, M.; Žutautaitė, I.; Dundulis, G.; Rimkevičius, S.; Janulionis, R.; Bakas, R. Integrated assessment of failure probability of the district heating network. Reliab. Eng. Syst. Saf. 2015, 133, 314–322.
10. Kong, M.; Kang, J. Methodology for Estimating the Probability of Damage to a Heat Transmission Pipe. J. Korean GEO-Environ. Soc. 2023, 22, 15–21. (In Korean)
11. Langroudi, P.P.; Weidlich, I. Applicable Predictive Maintenance Diagnosis Methods in Service-Life Prediction of District Heating Pipes. Environ. Clim. Technol. 2020, 24, 294–304.
12. Pishvaie, M.R.; Hadipoor, M.; Jafari, S.; Baghery, S. Intelligent Approaches to Fault Detection and Diagnosis in District Heating Systems: A Review. Processes 2023, 11, 2512.
13. Zhu, Q.; Zhu, L.; Wang, Z.; Zhang, X.; Li, Q.; Han, Q.; Yang, Z.; Qin, Z. Hybrid Triboelectric–Piezoelectric Nanogenerator Assisted Intelligent Condition Monitoring for Aero-Engine Pipeline System. Chem. Eng. J. 2025, 519, 165121.
14. Shirzad, M.; Vahdani, B.; Yazdi, M. A Machine Learning Approach for Failure Prediction of Water Distribution Networks. Reliab. Eng. Syst. Saf. 2021, 210, 107558.
15. Christodoulou, S.E.; Gagatsis, A.; Xanthos, S.; Aslani, P. Risk-Based Prioritization of Water Pipe Replacement Using Statistical Failure Models. Urban Water J. 2010, 7, 121–134.
16. Le Gauffre, P.; Joannis, C.; Le Gat, Y.; Breysse, D. A GIS-Based Method for Pipe Network Diagnosis and Rehabilitation. Autom. Constr. 2007, 16, 525–536.
17. Hutton, C.; Kapelan, Z.; Vamvakeridou-Lyroudia, L.; Savić, D.A. Failure Predictions in Water Distribution Pipes Using Condition Assessment and Hydraulic Models. J. Infrastruct. Syst. 2016, 22, 04016017.
18. Kleiner, Y.; Rajani, B. Comprehensive Review of Structural Deterioration of Water Mains: Statistical Models. Urban Water 2001, 3, 131–150.
19. Rajani, B.; Kleiner, Y. Protecting Critical Infrastructure: Structural Reliability of Buried Pipes. J. Infrastruct. Syst. 2001, 7, 120–128.
20. Park, J.; Jung, D.; Kim, H.; Lee, S. Spatial Analysis of District Heating Pipe Failures Using GIS Data in Korea. Appl. Sci. 2022, 12, 10345.
21. Xu, Y.; Zhang, L.; Yang, H. Application of Deep Learning in Leakage Detection of Urban Pipelines. Sensors 2020, 20, 6760.
22. Sun, L.; Wang, P.; He, X. Data-Driven Prediction of Pipeline Failures under Imbalanced Datasets. Reliab. Eng. Syst. Saf. 2022, 223, 108500.
23. Zhang, W.; Liu, G.; Zhou, J. Remaining Life Estimation of Buried Pipelines Considering Soil–Pipe Interaction. Tunn. Undergr. Space Technol. 2019, 83, 237–248.
24. Mesri, G.; Stark, T.D. Long-Term Performance of Buried Infrastructure: Lessons from Case Histories. Can. Geotech. J. 2018, 55, 1089–1102.
25. Euroheat & Power. Guidelines for District Heating Pipe System Reliability Assessment; Euroheat & Power: Brussels, Belgium, 2018.
26. CEN/TC 107. EN 253; District Heating Pipes—Preinsulated Bonded Pipe Systems for Directly Buried Hot Water Networks. European Committee for Standardization: Brussels, Belgium, 2019.
27. Lee, S.Y.; Kang, J.M.; Kim, J.Y. Prediction modeling of ground subsidence risk based on machine learning using the attribute information of underground utilities in urban areas in Korea. Appl. Sci. 2023, 13, 5566.
28. Lee, S.; Kang, J.; Kim, J.; Kong, M. AI-Based Damage Risk Prediction Model Development Using Urban Heat Transport Pipeline Attribute Information. Appl. Sci. 2025, 15, 8003.
29. Lee, Y.H.; Kim, S.H.; Kang, U.S.; Kim, W.C.; Kim, J.G. Evaluation of electrochemical properties and life prediction of sensor wire in leak detection systems of underground heating pipelines. J. Electrochem. Soc. 2024, 171, 103508.
30. Lidén, P.; Adl-Zarrabi, B.; Hagentoft, C.E. Diagnostic Protocol for Thermal Performance of District Heating Pipes in Operation. Part 2: Estimation of Present Thermal Conductivity in Aged Pipe Insulation. Energies 2021, 14, 5302.
31. Song, S.; Kim, J. Advanced monitoring technology for district heating pipelines using fiber optic cable. In Proceedings of the 15th International Symposium on District Heating and Cooling, Seoul, Republic of Korea, 4–7 September 2016; pp. 1–8.
32. Ebenuwa, A.U.; Tee, K.F. Fuzzy reliability and risk-based maintenance of buried pipelines using multiobjective optimization. J. Infrastruct. Syst. 2020, 26, 04020008.
33. Asuero, A.G.; Sayago, A.; González, A.G. The correlation coefficient: An overview. Crit. Rev. Anal. Chem. 2006, 36, 41–59.
34. Xu, H.; Deng, Y. Dependent evidence combination based on shearman coefficient and Pearson coefficient. IEEE Access 2018, 6, 11634–11640.
35. Aslam, M.; Smarandache, F. Chi-square test for imprecise data in consistency table. Front. Appl. Math. Stat. 2023, 9, 1279638.
36. Zhou, Q. Using chi-square categorical testing to analyse the survey data and find people’s attitude towards inequalities. J. Educ. Humanit. Soc. Sci. 2023, 24, 330–339.
37. Xi, D.; Lu, H.; Zou, X.; Fu, Y.; Ni, H.; Li, B. Development of trenchless rehabilitation for underground pipelines from an academic perspective. Tunn. Undergr. Space Technol. 2024, 144, 105515.
38. Ruys, W.L.; Ghafouri, A.; Chen, C.; Biros, G. Scalable k-NN graph construction for heterogeneous architectures. ACM Trans. Parallel Comput. 2025, 12, 1–35.
39. Yang, S.; Xie, J.; Liu, Y.; Yu, J.X.; Gao, X.; Wang, Q.; Peng, Y.; Cui, J. Revisiting the index construction of proximity graph-based approximate nearest neighbor search. Proc. VLDB Endow. 2025, 18, 1825–1838.
40. Halder, R.K.; Uddin, M.N.; Uddin, M.A.; Aryal, S.; Khraisat, A. Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications. J. Big Data 2024, 11, 113.
41. Bekker, J.; Davis, J. Learning from positive and unlabeled data: A survey. Mach. Learn. 2020, 109, 719–760.
42. Mensink, T.E.J.; Verbeek, J.; Perronnin, F.; Csurka, G. Distance-Based Image Classification: Generalizing to New Classes at Near-Zero Cost. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2624–2637.
43. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
44. Zhang, Y.; Haghani, A. A gradient boosting method to improve travel time prediction. Transp. Res. Part C Emerg. Technol. 2015, 58, 308–324.
45. Zhang, D.; Chen, H.D.; Zulfiqar, H.; Yuan, S.S.; Huang, Q.L.; Zhang, Z.Y.; Deng, K.J. iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. Comput. Math. Methods Med. 2021, 2021, 6664362.
46. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 3149–3157.
47. Lv, J.; Wang, C.; Gao, W.; Zhao, Q. An Economic Forecasting Method Based on the LightGBM-Optimized LSTM and Time-Series Model. Comput. Intell. Neurosci. 2021, 2021, 8128879.
48. Gu, Q.; Zhu, L.; Cai, Z. Evaluation measures of the classification performance of imbalanced data sets. In Proceedings of the ISICA 2009—The 4th International Symposium on Computational Intelligence and Intelligent Systems, Huangshi, China, 23–25 October 2009; pp. 461–471.
49. Bekkar, M.; Djemaa, H.K.; Alitouche, T.A. Evaluation measures for models assessment over imbalanced data sets. J. Inf. Eng. Appl. 2013, 3, 27–38.
50. Gietz, H.; Sharma, J.; Tyagi, M. Machine learning for automated sand transport monitoring in a pipeline using distributed acoustic sensor data. IEEE Sens. J. 2024, 24, 22444–22457.
51. Chen, X.; Karin, T.; Jain, A. Automated defect identification in electroluminescence images of solar modules. Sol. Energy 2022, 242, 20–29.
52. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
53. Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107.
54. Vega, A.; Yarahmadi, N.; Jakubowicz, I. Determination of the long-term performance of district heating pipes through accelerated ageing. Polym. Degrad. Stab. 2018, 153, 15–22.
55. Khan, L.R.; Tee, K.F. Risk-cost optimization of buried pipelines using subset simulation. J. Infrastruct. Syst. 2016, 22, 04016001.

Figure 1. Method for setting maximum and minimum lengths.

Figure 2. Conceptual diagram of KNN classification.

Figure 3. Overview of the Machine Learning Pipeline.

Figure 4. Comparison of KNN metrics.

Figure 5. Conceptual diagram of ROC Curve.

Figure 6. F2-score results by K value.

Figure 7. Recall results by K value.

Figure 8. AUC results by K value.

Figure 9. Median risk scores of failure and incorporated data by K value.

Figure 10. Risk score distribution of the XGB (K = 2) model.

Figure 11. Importance of influencing variables.

Figure 12. Monotonic relationship between Year of Installation and Predicted Risk Score (XGB, K = 2).

Table 1. Data characteristics of heating pipelines.

Attribute	Data Characteristics
Spatial information	Shp
Facility ID	Numeric
Purpose	String
Diameter	Numeric
Year of burial	Numeric
Insulation level	Numeric
Burial environment	String

Table 2. Descriptive statistics of normal and failure pipelines.

Statistic	Diameter (Normal)	Diameter (Failure)	Insulation Level (Normal)	Insulation Level (Failure)	Year of Burial (Normal)	Year of Burial (Failure)
Mean	270.43	306.61	10.30	3.27	2008.39	1997.52
Standard deviation	206.87	239.18	3.32	3.46	9.58	8.57
Min	20	20	0	0	1987	1987
Max	1100	850	13	13	2024	2023

Table 3. Pearson correlation analysis results for numerical variables.

Factor	Corr	p-Value
Diameter	0.013	0.000
Insulation level	−0.159	0.000
Year of burial	−0.086	0.000

Table 4. Chi-square test results for categorical variables.

Factor	$\chi^{2}$	p-Value
Burial environment	1165.017	0.000
Purpose	295.585	0.000

Table 5. Data characteristics by K value.

K	Incorporated Data	Pseudo Mean Metric	All Mean Metric
1	32,751	0.1011	0.6748
2	32,266	0.2218	0.9420
3	32,275	0.2972	1.0754
4	32,281	0.3511	1.1738
5	32,333	0.3928	1.2400
6	32,267	0.4311	1.2961
7	32,264	0.4640	1.3468
8	32,393	0.4932	1.3891
9	32,314	0.5201	1.4254
10	32,303	0.5446	1.4584

Table 6. Confusion Matrix.

	Predicted Negative	Predicted Positive
Reference Negative	True Negative (TN)	False Positive (FP)
Reference Positive	False Negative (FN)	True Positive (TP)

Table 7. Model evaluation according to AUC (Fawcett, 2006) [52].

AUC	Evaluation
AUC ≥ 0.9	Excellent
0.8 ≤ AUC < 0.9	Good
0.7 ≤ AUC < 0.8	Fair
AUC < 0.7	Poor

Table 8. Dataset for model development.

Input								Output
Diameter	Insulation level	Year of burial	Purpose	Burial environment	Diameter × burial environment	Insulation level × burial environment	Purpose × burial environment	High-risk of failure

Table 9. Performance metrics of models (K = 2).

Model	F2-Score	Accuracy	Recall	AUC
XGB (K = 2)	0.921	0.964	0.975	0.993
LGBM (K = 2)	0.912	0.961	0.968	0.991
XGB + LGBM (K = 2)	0.910	0.962	0.960	0.992

Table 10. Hyperparameters of the models (K = 2).

Model	n_Estimators	Learning_Rate	Max_Depth
XGB (K = 2)	120	0.0854	5
LGBM (K = 2)	117	0.0891	5

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
