A Hybrid Framework Combining Data-Driven and Catenary-Based Methods for Wide-Area Powerline Sag Estimation

Abstract: This paper is concerned with airborne-laser-data-based sag estimation for wide-area transmission lines. A systematic data processing framework is established for multi-source data collected from power lines, which is applicable to various operating conditions. A k-means-based clustering approach, selected after comprehensive performance comparisons, is then employed to handle the spatial heterogeneity and sparsity of powerline corridor data. Furthermore, a hybrid model of the catenary and XGBoost (HMCX) is proposed for sag estimation, which improves accuracy by integrating the adaptability of the catenary model with the sparsity awareness of XGBoost. Finally, the effectiveness of HMCX is verified using power data from 116 actual lines.


Introduction
The clearance distance of overhead power lines under various working conditions must meet safety requirements to prevent accidents such as electric shock, discharge, and short circuits. The clearance distance is directly affected by sag fluctuation; high-precision sag estimation is therefore of great value for maintenance and expansion in power inspection and scheduling [1][2][3][4][5][6]. The sag varies dynamically with conductor temperature and horizontal stress [7]. Changes in operating conditions and fluctuations in the surrounding environment make it difficult to accurately estimate the sag of wide-area overhead power lines. With the advancement of intelligent inspection, the problem of accurate sag monitoring and estimation has again attracted the attention of academia and industry [8][9][10][11].
In the field of sag estimation, some researchers have tried to realize on-line monitoring of sag by means of various measurement techniques and monitoring equipment. For example, Mahaj et al. [12] calculated the sag by recording the GPS signals corresponding to the physical movement of the lines. Pan et al. [13] derived the sag directly through image recognition of the conductor using an HD camera. Others calculate the sag through sensors such as tension sensors [14], temperature sensors [15], fiber Bragg grating sensors [16], and magnetic field sensor arrays [17]. Kopsidas et al. [18] developed a holistic method to calculate the sag and ampacity based on the mechanical and electrical parameters of the overall system. Among these methods, those based on tension and temperature are the most accurate, because these variables are directly related to sag variation. However, limited by equipment cost, such methods apply only to the small fraction of lines with equipment already installed, rather than generalizing to wide-area transmission lines.
With the aid of advanced sensing techniques and intelligent algorithms, numerous researchers have sought to solve the problem of sag estimation for wide-area lines. The main contributions of this paper are summarized as follows:

1. A systematic data processing framework for aircraft-based inspection is established to preprocess the multi-source heterogeneous corridor data.
2. A similarity clustering algorithm based on k-means is introduced to address the spatial heterogeneity and spatial dependence in the corridor data.
3. A novel HMCX method for sag estimation from corridor data is proposed, which combines the adaptability of the catenary model with the sparsity awareness of XGBoost.
4. The feasibility and effectiveness of HMCX are verified using power data from 116 actual lines. The proposed HMCX method outperforms the catenary model, Linear Regression, and Bayesian Ridge Regression in this study.
The remainder of the paper is structured as follows: An overview of the corridor database is presented in Section 2. The overall data analysis framework using corridor data and the proposed HMCX method for sag estimation are described in Section 3. The experimental results, including the data analysis results, the performance of HMCX, and comparisons with other algorithms, are summarized in Section 4. Finally, conclusions and future work are discussed in Section 5.
Notations: All symbols used herein are in standard form unless otherwise indicated. Table 1 shows the symbols used in this paper and their meanings. Matrices are bold uppercase.

- x = (x_1, . . . , x_n), x_i ∈ R: n × 1 dimensional column vector.
- x^T = (x_1, . . . , x_n)^T, x_i ∈ R: transpose of a column vector, i.e., a 1 × n dimensional row vector.
- X ∈ R^{m×n} or X = [x_ij]_{m×n}, x_ij ∈ R: m × n dimensional matrix, i.e., a matrix of column vectors stacked horizontally.
- B = {b_1, b_2, . . .}: a set. The set X includes m instances, each instance represented by a vector x_i of n attributes.
- Z, Z*, N: integers, positive integers, and natural numbers, respectively.
- R, R*: real numbers and positive real numbers, respectively.
- R^n: n-dimensional vector space of real numbers.

In the following, X is also treated as [x_1, x_2, . . . , x_m] to support matrix operations, and Mat(X) is used to denote the matrix form of X for distinction; this notation is not mathematically rigorous.

Description of Corridor Database
In order to obtain high-precision corridor information, electric power institutions collect point cloud models of transmission line corridors through laser scanners mounted on fixed-wing, helicopter, and multi-rotor UAVs. Limited by the length of a single aircraft operation, the lines are mostly surveyed as short line segments. This paper uses the corridor data of 36 overhead transmission lines obtained from 116 ALS jobs in Guangdong Power Grid, covering January 2018 to June 2020. The numbers of ALS operation records at 550 kV, 220 kV, and 110 kV are 76, 34, and 4, respectively. Using Python for data processing and munging, we obtained the data of 30,945 phase lines of spans.
The flight platforms used in the ALS operation of the lines studied in this paper are mainly manned/unmanned helicopters and fixed-wing UAVs. A laser scanner is mounted on the flight platform, thereby realizing the aircraft-based inspection for power transmission lines. Figure 1 shows a picture of a helicopter-borne laser scanning inspection operation. The laser scanner models used in the studied lines are RIEGL VUX-1LR and RIEGL miniVUX-3UAV. The processing software of the collected line point cloud is LiDAR360.
After data extraction, cleaning, munging, transformation, matching, and load, a corridor database containing 30,945 phase lines of spans with 20 variables or features is generated, where the variables include the weather information, conductor parameters, terrain, point clouds, etc. The variables or constants of the database are shown in Table 2.
For ALS operation in the line corridor, the following information is recorded:

1. The ambient temperature and wind speed at the take-off site, recorded at the beginning of ALS operations.
2. The classified point cloud information extracted from LiDAR, including the span length, height difference, sag value, and distance from the maximum sag point to the tower for each line span.
3. The conductor parameters, voltage, tower type, and service time recorded in the ledger.

Methodology
This section presents the construction process of the HMCX. First, a description of the data processing for corridor data is introduced, and a processing framework, including data extraction, data preprocessing, and data integration is proposed. Then, the catenary model of sag and its characteristics are described. Next, a similarity cluster analysis based on k-means is performed on the data with the catenary error. Finally, the HMCX is designed. The flow chart of this section is shown in Figure 2.

Data Processing
The data processing and analysis framework is established to solve the problem of multi-source heterogeneous data aggregation and analysis, as shown in Figure 3. The four phases of data extraction, data transformation, data integration, and database generation are mainly responsible for data processing. Various kinds of information are extracted from the data, including line information, LiDAR data, and environmental data. Table 2 lists the 20 kinds of variables or parameters that were extracted from each data source. The variables or parameters are transformed by four standard steps, followed by data integration. The specific steps are as follows:

1. Data consistency processing. This includes processing the format and content of multi-source data, unifying units and representations, and performing consistency processing to facilitate subsequent steps.
2. Data interpretation and transformation. The actual meaning of the parameters of the multi-source data is interpreted, and units are uniformly converted. Discrete variables of span type and terrain information are processed using one-hot encoding, and the wire type and its parameters are matched to the dataset. Standardization is applied to continuous variables. The date feature is converted to the number of days the line has been in operation, and the tower coordinates at both ends of a span are converted to the Euclidean distance between them.
3. Missing value, duplicate value, and outlier handling. Random forest regression is used to impute features with fewer missing values, and duplicates and outliers in the dataset are identified and eliminated.
4. Feature analysis, selection, and reduction. To eliminate the influence of redundant variables, feature selection based on gradient boosting decision trees (GBDT) is used to analyze the importance of features. Kernel principal component analysis (kernel PCA) is used to determine whether the information between features is redundant and to reduce dimensionality.
5. Data integration. This is the process of using data from different sources to construct a unified view. Data from multiple sources are linked through the mapping relationship of common parameters. The corridor database is merged and updated by matching, redundant fusion, cooperative fusion, and complementary fusion.
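The transformation steps above (step 2 in particular) can be illustrated with a minimal Python sketch. This is not the paper's implementation; the function names and conventions are illustrative assumptions.

```python
import math
from datetime import date

SPAN_TYPES = 4  # 0: double linear, 1: double-sided tension, 2: left single, 3: right single

def one_hot_span_type(p: int) -> list:
    """One-hot encode the discrete span-type variable."""
    vec = [0] * SPAN_TYPES
    vec[p] = 1
    return vec

def service_days(commission: date, survey: date) -> int:
    """Convert the commissioning date into days the line has been in operation."""
    return (survey - commission).days

def span_length(tower_a: tuple, tower_b: tuple) -> float:
    """Euclidean distance between the tower coordinates at both ends of a span."""
    return math.dist(tower_a, tower_b)

def standardize(values: list) -> list:
    """Zero-mean, unit-variance standardization of a continuous variable."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]
```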
According to the importance of features, the feature variables with low importance rankings are eliminated. After data processing, we obtained a dataset of 15 sag-related parameters. For N spans, the 15 sag-related parameters are described as follows: Span type (SPT) p = (p_1, . . . , p_N) (p_i ∈ {0, 1, 2, 3}). The span type p_i includes double linear towers, double-sided tension towers, left single tension towers, and right single tension towers, represented as 0, 1, 2, and 3 by one-hot encoding, respectively.
Conductor type (CDT) C = {c_1, . . . , c_N}, where c_i is one of the conductor types and comprises the parameters of the conductor of the i-th span. The parameters are the elastic coefficient (ELC), breaking force (BRF), resistance per kilometer (RPK), diameter of wire (DW), linear expansion coefficient (LEC), weight per unit length (WPL), and total cross-sectional area (TCSA).
The respective vector representations of the conductor parameters in the dataset are as follows: ELC is e = (e_1, . . . , e_N) (e_i ∈ R*), BRF is b = (b_1, . . . , b_N), RPK is r = (r_1, . . . , r_N), DW is d = (d_1, . . . , d_N), LEC is ι = (ι_1, . . . , ι_N), WPL is w = (w_1, . . . , w_N), and TCSA is a = (a_1, . . . , a_N) (a_i ∈ R*).
Service time (ST) s = (s 1 , . . . , s N )(s i ∈ R * ). The service time is obtained by converting the date into the number of days the line has been put into operation.
Span length (SPL) l = (l 1 , . . . , l N )(l i ∈ R * ). The span length l i is obtained by converting the coordinates of the suspension points at both ends of a span into Euclidean distance.
Height difference (HD) h = (h 1 , . . . , h N )(h i ∈ R). The height difference h i is obtained by the difference between the heights of the suspension points at both ends.
Maximum sag (MS) f = (f_1, . . . , f_N) (f_i ∈ R*). In ALS operations, point cloud quality problems such as ghost points, jump points, noise points, and missing points caused by the jitter of the aircraft fuselage often occur. A differential threshold filtering method is utilized to smooth the point cloud before the maximum sag is extracted.
Ambient temperature (AT) t = (t_1, . . . , t_N) (t_i ∈ R) and wind speed (WS) v = (v_1, . . . , v_N) (v_i ∈ R) are recorded at the take-off site at the beginning of ALS operations.
Therefore, the i-th span data can be represented as a tuple P_i = (x_i, y_i), where x_i collects the 14 input variables listed above and y_i = f_i is the maximum sag, containing 15 sag-related variables in total. Finally, the corridor dataset can be represented as P = {P_i}_{i=1}^N. (1)

Catenary Model for Sag Calculation
This section introduces the catenary model of sag calculation and the calculation method of sag difference.

Catenary-Based Sag Calculation
The catenary model is frequently used for rough line sag estimation, because its physical parameters are clear and it is highly adaptable. The line between two suspension points is usually assumed to be a catenary, i.e., a chain of flexible cables without rigidity. The mechanism model of the catenary is obtained from the force balance relationship. The classified point cloud of a span is shown in Figure 4.
The maximum sag occurs where df(x)/dx = 0, and the maximum sag for unequal suspension-point heights can be obtained from Equation (5), where L_{h=0} is the arc length of the catenary in a span whose suspension points are at the same height. In practice, this relationship is often used to calculate the sag of the line; it carries clear physical information and is not complicated to implement.
The catenary model parameters are easy to determine from the x_i data of the corridor dataset P in Section 3.1. Wind speed and conductor information are used to determine the comprehensive specific load, and the horizontal stress is determined according to the safety factor and conductor parameters. Combining the span information in x_i and substituting into (5), the calculated sag value f̂_i of span i is obtained. Thus, the calculated sag of the catenary model for the corridor dataset can be represented as F̂ = {f̂_1, . . . , f̂_N}.
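Since Equation (5) is not reproduced here, the following sketch uses the textbook catenary sag formula for equal suspension-point heights, with sigma0 the horizontal stress and gamma the comprehensive specific load. It is an illustrative stand-in for the paper's exact expression, which also handles unequal heights.

```python
import math

def catenary_sag(span_length: float, sigma0: float, gamma: float) -> float:
    """Maximum sag of a catenary span with equal suspension-point heights
    (textbook form): f = (sigma0/gamma) * (cosh(gamma*l / (2*sigma0)) - 1)."""
    a = sigma0 / gamma  # catenary parameter
    return a * (math.cosh(span_length / (2.0 * a)) - 1.0)
```

For typical span lengths this is close to the parabolic approximation f ≈ γl²/(8σ0), which is why rough catenary-based estimates are easy to compute from a handful of parameters.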

Sag Difference between the Catenary and the Extracted Sag
The catenary model is a white-box model, i.e., an accurate mathematical model based on the conductor's internal force mechanism. However, it does not account for the drift in parameter values caused by aging, which introduces great uncertainty into the calculation. Therefore, the catenary model has the obvious disadvantage of large calculation errors.
Based on x_i of the corridor dataset P obtained in the previous subsection, it is easy to obtain the calculated sag value f̂_i of span i by substituting it into (5). Therefore, the catenary model error between the calculated sag f̂_i and the sag f_i extracted from LiDAR for the i-th span, namely the sag difference, is δ_i = f_i − f̂_i. The sag differences from the catenary error corresponding to the corridor dataset are δ = (δ_1, . . . , δ_N), and the corridor dataset with errors is updated as P = {(x_i, δ_i)}_{i=1}^N. The matrix form of the inputs is Mat(X) = [p, e, b, r, d, ι, w, a, u, s, l, h, t, v], and the matrix form of the full dataset considering the catenary-based error can, from a practical point of view, be expressed as Mat(P) = [p, e, b, r, d, ι, w, a, u, s, l, h, t, v, δ].

k-Means-Based Similarity Clustering Considering Sag Difference
To reduce the impact of spatial sparsity and heterogeneity on estimation outcomes, we consider employing a clustering method to partition the dataset to improve similarity within one cluster. In order to choose an appropriate clustering method, we compared the effect of six clustering algorithms on corridor data clustering. Through the three performance indicators and the Kappa-based consistency test of the clustering results, we finally selected k-means and determined its initial value conditions. See Section 4.2 for more details.
For the dataset P with catenary error, we employ the Euclidean-distance scheme of k-means for clustering. The k-means algorithm divides a set of N samples X into k disjoint clusters, each described by the mean µ_j of the samples in the cluster (often called the centroid of the cluster). The algorithm aims to choose centroids µ_j that minimize the inertia, i.e., the within-cluster sum-of-squares criterion Σ_{j=1}^{k} Σ_{x∈C_j} ||x − µ_j||². The cluster number k needs to be determined in advance, and the positions of the k initial centroids have a great impact on the clustering results and running time. k is determined by comparing three performance indicators under different numbers of clusters.
Finally, the corridor dataset with errors P = {P_i}_{i=1}^N is grouped into k clusters. The set of k clusters is represented as C = {C_1, C_2, . . . , C_k}, where C_1 ∪ C_2 ∪ · · · ∪ C_k = P. The dataset of cluster C_i can be denoted as C_i = {P_1^(i), . . . , P_n^(i)}, where n is the number of samples in cluster C_i. The data of the k-th span in cluster C_i can be represented as P_k^(i) = (x_k^(i), δ_k^(i)).
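A minimal numpy sketch of the k-means procedure described above (Lloyd's algorithm minimizing the within-cluster sum-of-squares) is given below; the study itself uses an off-the-shelf implementation with k = 3, so this is only an illustration of the criterion.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid
    update to minimize the inertia (within-cluster sum-of-squares)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster is empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia
```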

Importance of the Features Used by the Model after Model Training
To reveal the relative importance of each feature in estimation and to provide a better understanding of the data, GBDT is introduced to estimate the relative importance of each feature. GBDT is a tree ensemble algorithm, and one advantage of tree ensemble algorithms is that they output the importance of the features used by the model after training. A tree can be formally expressed as T(x; Θ) = Σ_{j=1}^{J} γ_j I(x ∈ R_j), with parameters Θ = {R_j, γ_j}_1^J, where R_j, j = 1, 2, . . . , J are the disjoint regions of the joint predictor-variable space corresponding to the J terminal nodes of the tree; γ_j is a constant assigned to each R_j; and I(S) is the indicator function that maps elements of S to 1 and all other elements to 0.
The boosted tree is a sum of trees, f_M(x) = Σ_{m=1}^{M} T(x; Θ_m), where M is the number of trees. For a single decision tree T, the squared relative importance of variable X_l is the sum of the squared improvements over all internal nodes at which it was chosen as the splitting variable [48]: I_l²(T) = Σ_{t=1}^{J−1} τ̂_t² I(v(t) = l), where the sum is over the J − 1 internal nodes of the tree, the input variable X_{v(t)} splits the region associated with node t into two subregions, and τ̂_t² is the maximal estimated improvement in squared-error risk over that of a constant fit on the entire region.
The global importance of feature X_l is measured by the average of the relative importance over the trees: I_l² = (1/M) Σ_{m=1}^{M} I_l²(T_m). In general, importance indicates the usefulness or value of each feature in building the boosted decision trees in the model. The more a feature is used by the decision trees to make key decisions, the higher its relative importance. GBDT and XGBoost are both tree ensemble algorithms, so feature importance can be calculated and sorted for each feature when the predictive model is trained.
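The per-node quantity τ̂_t² can be illustrated with a single-split (stump) tree: for each feature, the maximal reduction in squared-error risk achieved by one split is computed and then normalized across features. This is a simplified sketch of the importance measure, not the GBDT implementation used in the paper.

```python
import numpy as np

def best_split_improvement(x: np.ndarray, y: np.ndarray) -> float:
    """Maximal improvement in squared-error risk from a single split on one
    feature -- the per-node quantity summed in the GBDT importance measure."""
    base = ((y - y.mean()) ** 2).sum()
    best = 0.0
    for s in np.unique(x)[:-1]:  # candidate thresholds (all but the maximum)
        left, right = y[x <= s], y[x > s]
        risk = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = max(best, base - risk)
    return best

def stump_importance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Relative importance of each feature for a single-split tree,
    normalized to sum to 1."""
    imp = np.array([best_split_improvement(X[:, j], y) for j in range(X.shape[1])])
    return imp / imp.sum()
```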

HMCX Method for Sag Estimation
The advantage of the catenary model is that it is highly adaptable. XGBoost, as an ensemble learning technique, has good gradient boosting performance on subtrees and also exhibits good extrapolation capability for sparse data [32]. The wide-area sag estimation problem faced in the ALS operation scenario can be addressed more effectively by combining the two.
In this paper, we propose HMCX to address the problem of sag estimation based on corridor data. The technical details are described below.

The Catenary-Based Method
The catenary model (5) given in Section 3.2.1 is used to obtain the rough calculated sag value f̂_i in Equation (7).
According to the catenary model, the calculated sag of cluster C_i is F̂_{C_i} = {f̂_1^(i), . . . , f̂_n^(i)}. The corresponding sag differences between the calculated sags and the sag values from LiDAR are Δ_{C_i} = {δ_1^(i), . . . , δ_n^(i)}, with δ_k^(i) = f_k^(i) − f̂_k^(i).

The Data-Driven Method
The ensemble learning algorithm XGBoost is used as the data-driven model in HMCX. The main principles of XGBoost are introduced in this section.
For the corridor dataset with n examples and 15 features, P_i = (x_i, δ_i) (|P| = n < N, x_i ∈ R^14, δ_i ∈ R), an ensemble model based on XGBoost is constructed [32]: δ̂_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F,
where F = {f(x) = ω_{q(x)}} (q : R^m → T, ω ∈ R^T) denotes the space of regression trees; each f_k corresponds to an independent tree structure q and leaf weights ω. Model complexity is introduced to balance accuracy against computational efficiency. The objective function of XGBoost is defined as L = Σ_i l(δ̂_i, δ_i) + Σ_k Ω(f_k), where l is a differentiable convex loss function that measures the difference between the predicted sag difference δ̂_i and the target sag difference δ_i, Ω represents the complexity of the model, and k ranges over all established trees. At step t, the objective function is transformed into L^(t) ≈ Σ_i [g_i f_t(x_i) + ½ h_i f_t²(x_i)] + Ω(f_t), where g_i and h_i are the first and second derivatives of l with respect to the prediction. L^(t) represents the quality of a tree structure: the lower the value of the objective function, the better the overall structure of the tree. After a given sample split ratio, the dataset of each cluster is divided into a training set and a test set. Then, the data-driven model is trained using the training set; the inputs and outputs of the training procedure are the training-set information x_{C_i} and the sag differences δ_{C_i} from the database P. Performing the above operations for each cluster C_i yields a trained data-driven model (DDM #i) for that cluster. Finally, data-driven models for all clusters based on sag differences are developed. After training, the data-driven models obtain their best parameters and can be used to estimate the sag difference: DDM #i predicts the sag difference δ̂_k^(i) of the data in cluster C_i.
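The second-order objective above can be illustrated with the simplest possible tree, a single leaf, under squared loss (so g_i = δ̂_i − δ_i and h_i = 1): the optimal leaf weight is w* = −Σg_i / (Σh_i + λ). The following is an assumption-laden simplification of one XGBoost boosting step, not the library itself.

```python
import numpy as np

def leaf_weight(g: np.ndarray, h: np.ndarray, lam: float = 1.0) -> float:
    """Optimal leaf weight under XGBoost's second-order objective:
    w* = -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def boost_step(pred: np.ndarray, target: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """One boosting step with a single-leaf tree under squared loss,
    where g_i = pred_i - target_i and h_i = 1."""
    g = pred - target
    h = np.ones_like(pred)
    return pred + leaf_weight(g, h, lam)
```

With only one leaf, repeated steps shrink the prediction toward the mean of the targets; real XGBoost trees split the input space so each leaf fits its own region.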
According to the data-driven model, the estimated sag differences of cluster C_i are Δ̂_{C_i} = {δ̂_1^(i), . . . , δ̂_n^(i)}.

The HMCX Method
This section describes the HMCX procedure for sag estimation. For the sag estimation of the k-th span in cluster C_i, the catenary model (5) is first used to determine the rough sag value f̂_k^(i). Then, the sag difference δ̂_k^(i) is predicted by the data-driven model of cluster C_i. Finally, the sag estimation result is determined by adding the calculated sag and the sag difference estimated by the data-driven model; the estimated sag of the k-th span in cluster C_i is f̃_k^(i) = f̂_k^(i) + δ̂_k^(i). Thus, for any span data P_i = (x_i, y_i), there is an estimated sag f̃_i, and the full set of estimated sags is F̃ = {f̃_1, . . . , f̃_N} (N = |P|).
The estimated sag set for cluster C_i is F̃_{C_i} = {f̃_1^(i), . . . , f̃_n^(i)}.

The Framework of HMCX
The HMCX framework for wide-area sag estimation is depicted in Figure 5; it combines the catenary-based model with the data-driven model. The framework consists of two parts:

1. Offline training. This phase illustrates the procedure for HMCX training. First, the corridor database is split into several clusters using the k-means method, and each cluster is divided into a training set and a test set. Next, the catenary sag of each span is calculated from the training-set information, and the sag difference from the real sag extracted from LiDAR is determined. Then, the XGBoost model is trained using the sag differences and the training-set information. Finally, data-driven models for the clusters based on sag differences are developed.
2. Hybrid model sag estimation. This phase describes the procedure for sag estimation with HMCX. First, the test-set data are clustered using the clustering model produced by offline training. Then, the catenary model is used to determine the sag value of the data, and the sag difference is predicted with the data-driven model of the corresponding cluster. Finally, the sag estimation result is determined by adding the calculated sag and the sag difference estimated by the data-driven model.
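The estimation phase can be condensed into a single function. Here `catenary_sag`, `cluster_of`, and `delta_models` are hypothetical placeholders for the fitted components described above, so this is a structural sketch rather than the paper's implementation.

```python
def hmcx_estimate(x, catenary_sag, cluster_of, delta_models):
    """HMCX sag estimate for one span: rough catenary-based sag plus the sag
    difference predicted by the data-driven model of the span's cluster."""
    i = cluster_of(x)          # assign the span to a cluster
    f_rough = catenary_sag(x)  # catenary-based rough sag
    delta = delta_models[i](x) # predicted sag difference for that cluster
    return f_rough + delta
```

Note that forcing `catenary_sag` to return 0 reduces HMCX to the pure data-driven model, which is the switching behavior discussed in Remark 1.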

Remark 1.
The conversion between the HMCX and the XGBoost can be accomplished by controlling the catenary model's output switch. When the catenary model's output is 0, the HMCX is equivalent to the XGBoost model. This approach allows for more flexible model utilization in practice.

Performance Indicators
To prove the effectiveness and feasibility of the proposed methods, the mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R²), and Theil inequality coefficient (TIC) are selected as indicators, in their standard forms:

MAE = (1/N) Σ_{i=1}^{N} |f̃_i − f_i|,
RMSE = sqrt((1/N) Σ_{i=1}^{N} (f̃_i − f_i)²),
R² = 1 − Σ_{i=1}^{N} (f_i − f̃_i)² / Σ_{i=1}^{N} (f_i − f̄)²,
TIC = RMSE / (sqrt((1/N) Σ_{i=1}^{N} f̃_i²) + sqrt((1/N) Σ_{i=1}^{N} f_i²)),

where f̃_i is the value estimated by HMCX, f_i is the sag value extracted from the LiDAR data, and f̄ is the mean of all sag values.

Remark 2.
The RMSE measures the deviation between the estimated sag and the sag value of the point cloud. R² reflects the proportion of the variance in the dependent variable explained by the model; its interval is (0, 1), and the closer R² is to 1, the higher the correlation between the estimated and actual values, with R² = 1 meaning all estimations perfectly match the real results. TIC lies between 0 and 1; the smaller the value, the smaller the difference between the fitted and true values, and the higher the estimation accuracy.
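Assuming the standard definitions of the four indicators (the paper's equations are not reproduced here), they can be computed as follows:

```python
import numpy as np

def sag_metrics(f_hat: np.ndarray, f: np.ndarray) -> dict:
    """MAE, RMSE, R^2 and TIC between estimated sags f_hat and LiDAR sags f
    (standard textbook forms, assumed to match the paper's definitions)."""
    err = f_hat - f
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r2 = 1.0 - (err ** 2).sum() / ((f - f.mean()) ** 2).sum()
    tic = rmse / (np.sqrt((f_hat ** 2).mean()) + np.sqrt((f ** 2).mean()))
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "TIC": tic}
```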

Experimental Results
In this section, the corridor database is used to verify the feasibility of the above methods and processing framework. A total of 30,944 valid line data points were processed and obtained. The data processing and analysis platform was run on a Windows 10 computer with a 3.8 GHz Intel processor and 8 GB RAM.

Results of Data Analysis
The correlation matrix shows the Pearson correlation coefficient between each pair of variables, which quantifies the degree of linear correlation between them. The correlation matrix diagrams between features are shown in Figure 6.
According to the correlation matrix diagram, the influencing factors of the maximum sag of the large crossing were ranked according to the correlation coefficient: SPL (0.
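A ranking of this kind can be produced by sorting features by the absolute Pearson correlation with the maximum sag. The data and feature names below are illustrative, not the corridor dataset.

```python
import numpy as np

def rank_by_correlation(X: np.ndarray, sag: np.ndarray, names: list) -> list:
    """Rank features (columns of X) by |Pearson correlation| with the sag."""
    corr = [abs(np.corrcoef(X[:, j], sag)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, corr), key=lambda t: -t[1])
```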

Results of Cluster Analysis
To reduce the influence of spatial sparsity and heterogeneity on the estimation results, we partition the corridor dataset with clustering methods to improve the similarity within each cluster. Six clustering algorithms (k-means, MeanShift, Ward, agglomerative clustering, DBSCAN, and Gaussian mixture) were introduced for comparative analysis. Three common performance indicators, the Calinski-Harabasz score (CHS), the Silhouette score (SS), and the Davies-Bouldin index (DBI), were selected to evaluate the six clustering methods; the specific definitions of the indicators are omitted. The clustering performance results are shown in Table 3.
It can be seen from Table 3 that when the number of clusters was 3, SS reached its relative maximum of 0.735 and DBI its relative minimum of 0.374. Although the CHS result for 10 clusters appears better, it is essentially caused by the high dispersion of the data. Therefore, a compromise was made, and the initial condition of k-means clustering was set to k = 3.
In order to compare the consistency of the results obtained by different clustering algorithms, this section introduces a Kappa-based consistency test [49]. Table 4 shows the frequency of clusters under different clustering algorithms, and Table 5 shows the Kappa consistency test results between each pair of algorithms. From the cluster frequencies in Table 4 and the Kappa values in Table 5, it can be seen that k-means, Ward, and Gaussian mixture show strong consistency in their clustering results, with roughly similar cluster frequencies. From a statistical point of view, p < 0.01, which means there is a significant correlation between the clustering results. MeanShift and agglomerative clustering are less consistent with the other algorithms; these two algorithms perform poorly on the corridor dataset and fail to segment it well.
After comprehensively considering the performance of each index and the computational complexity, the following results were obtained: k-means had a higher SS and lower DBI and outperformed the other methods on the corridor data. The poor performance of the density-based DBSCAN method on the inspection dataset further evidences the sparsity of the data. The best performance was obtained when the corridor data were divided into three clusters. Therefore, we chose k-means as the clustering method and set the number of clusters to three. Figure 7 shows the distribution of the clustering results over the span length; the frequencies of the three clusters were 11,610, 18,228, and 1106, respectively. After k-means clustering was completed, the feature importance with respect to the sag difference in the three clusters was calculated and ranked based on GBDT. The proportions of feature importance for the sag difference in the clusters are shown in Figure 8.
The three most important variables for the sag difference in Cluster #0 in Figure 8 are DW, WPL, and SCD. For Cluster #1, the three most important variables are WPL, DW, and LEC; for Cluster #2, they are ELF, BRF, and ST.
According to the actual operation and maintenance experience, the lines corresponding to these three clusters may be: large-span lines with larger diameter conductors, heavy-duty lines with larger ampacities, and aging lines with long-term service.

Estimation Result Analysis
To verify the effectiveness of HMCX, the catenary model and three data-driven models based on XGBoost, Linear Regression (LR), and Bayesian Ridge Regression (BayesRR) were introduced for comparative analysis. At the same time, to verify the clustering effect, the indicators of these methods are reported both for all data and for each cluster. To obtain a reliable, stable model and relatively objective results, 10-fold cross-validation was used to partition the dataset: the dataset was randomly divided into 10 parts, of which 9 were used for model training and the remaining 1 for testing, and this process was repeated 10 times, keeping a different part for testing each time.
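The 10-fold partitioning described above can be sketched at the index level as follows (an illustrative implementation, not the paper's code):

```python
import numpy as np

def kfold_indices(n: int, k: int = 10, seed: int = 0):
    """Randomly partition n sample indices into k folds; each fold serves once
    as the test set while the remaining k-1 folds form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```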
For all data, 100 example results from the test set were randomly extracted for display. The estimation results and estimation errors of the above methods are shown in Figures 9 and 10. The frequency distributions of the estimation error of HMCX are shown in Figure 11.
From Figures 10 and 11, it can be seen that the overall error fluctuation and error distribution of HMCX are smaller than those of the other methods for all data. The catenary model has the largest positive error fluctuation, up to 20 m, which may be caused by the high uncertainty of the field measurement of its related parameters. Although HMCX achieves smaller errors than the other methods, an abrupt outlier is clearly visible at span number 20. To see the effect of clustering more intuitively, we compared the three clusters; their results follow. The sag estimation results and errors of cluster #0 are shown in Figures 12 and 13, those of cluster #1 in Figures 14 and 15, and those of cluster #2 in Figures 16 and 17.
From Figures 12 and 13, it can be seen that the effect of HMCX on cluster #0 is significantly better than other methods. At the same time, it can be seen that both the catenary and LR have large error fluctuations.
From Figures 14 and 15, it can be seen that the error performance of LR on cluster #1 is worse than that of the catenary model, while the error fluctuations of HMCX, BayesRR, and XGBoost on cluster #1 are smaller. It can be seen from Figures 16 and 17 that the catenary model on cluster #2 presents a systematic positive error, with errors close to 20 m; the catenary model with fixed horizontal stress is not suitable for this cluster. HMCX can well compensate for the errors caused by the catenary model and effectively improve the sag estimation accuracy for the cluster. The performance of HMCX, BayesRR, and XGBoost on cluster #2 is also acceptable.
For all data and the three clusters, the evaluation indicators of the above methods are shown in Table 6. In Table 6, the weak performance of the catenary model can be seen from its R² of −0.156, which is also reflected in its TIC of 0.491. The catenary model may not be suitable for all lines due to parameter drift and variable uncertainty. HMCX significantly improves estimation performance compared to the catenary model; for all data, HMCX has the smallest RMSE. In clusters #0 and #1, HMCX achieves better performance than on the full data. In cluster #2, the RMSE of HMCX increased due to the large error of the catenary model, but the estimation performance was still better than that of LR, BayesRR, and the catenary model. In terms of time, since HMCX includes both XGBoost and the catenary model, its running time is longer but still acceptable. In terms of overall estimation performance, HMCX significantly outperforms the catenary model, LR, and BayesRR.

Conclusions
This paper uses corridor data to solve the problem of wide-area sag estimation in the power inspection field. A systematic data processing and analysis framework for aircraft-based inspection is constructed to preprocess multi-source data. The proposed HMCX method combines the adaptability of the catenary model with the sparsity awareness of XGBoost. The feasibility and effectiveness of the proposed HMCX are verified with practical data. The proposed HMCX method outperforms the catenary model, LR, and BayesRR in this study and shows promise for wide-area sag estimation in power dispatching and inspection.
According to the above analysis results, it is effective to reduce the heterogeneity of corridor data by improving the similarity between data through clustering. The accuracy of wide-area sag estimation is improved compared to the catenary model. However, from the perspective of the performance indicators, the impact of the catenary model on HMCX cannot be ignored. Therefore, future work can consider the following aspects: (1) The causes of the catenary model's errors in different clusters need to be found and eliminated through further analysis of the line clustering. (2) More suitable model parameters or models can be selected according to different cluster features, to reduce the influence of the calculation bias of the mechanism part on the estimation. (3) Recommendation algorithms can be used to impute the missing data of wide-area lines, improve the matching accuracy of span similarity, and reduce the impact of data heterogeneity. (4) Heuristic optimization algorithms can be introduced to transform subset selection into an optimization problem to find the optimal feature subset for the model.

Acknowledgments: Beyond the authors of the paper, I would like to express special thanks to my senior Jun-yi Li, my schoolmate Wenshuai Lin, and my junior Zhengganzhe Chen. They provided invaluable advice on my research and helped me through difficult times; their help greatly contributed to the smooth completion of the project. Many thanks also to my industrial advisor Huamin Zhou and all the leaders who helped me; they gave me the opportunity to participate in grid technology projects and provided valuable information. I would also like to thank my family and friends for their support.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: