Enhancement Methods of Hydropower Unit Monitoring Data Quality Based on the Hierarchical Density-Based Spatial Clustering of Applications with a Noise–Wasserstein Slim Generative Adversarial Imputation Network with a Gradient Penalty

In order to solve low-quality problems such as data anomalies and missing data in the condition monitoring data of hydropower units, this paper proposes a monitoring data quality enhancement method based on HDBSCAN-WSGAIN-GP, which improves the quality and usability of the condition monitoring data of hydropower units by combining the advantages of density clustering and a generative adversarial network. First, the monitoring data are grouped by density level with the HDBSCAN clustering method in combination with the working conditions, and the anomalies in the dataset are detected, recognized adaptively and cleaned. Then, exploiting the strength of the WSGAIN-GP model in data filling, the missing values in the cleaned data are generated automatically through unsupervised learning of the features and distribution of the real monitoring data. Validation analysis is carried out on the online monitoring dataset of an actual operating unit. Comparison experiments show that the Silhouette Coefficient Index (SCI) of the HDBSCAN-based anomaly detection model reaches 0.4935, higher than that of the other comparative models, indicating that the proposed model is superior in distinguishing valid samples from anomalous samples. The probability density distribution of the data filled by the WSGAIN-GP model is similar to that of the measured data, and the KL divergence, JS divergence and Hellinger distance between the filled data and the original data are close to 0. The filling effect is verified at different missing rates against methods such as SGAIN, GAIN and KNN, and the RMSE of WSGAIN-GP is the lowest among the comparative models at every missing rate, which demonstrates that the proposed filling model has good accuracy and generalization. The research results provide a high-quality data basis for subsequent trend prediction and state warning.


Introduction
As the significance of hydropower units in modern power systems grows, the urgency for accurate condition monitoring and prediction increases. However, the quality of monitoring data in hydropower units is often compromised by interference, abnormalities or failures in the data acquisition and transmission links, leading to problems such as data anomalies and missing data, and making it difficult to accurately address the temporal dependencies in the data series. This gap underscores the need for more advanced and nuanced approaches to data imputation.
To overcome these limitations, Generative Adversarial Networks (GANs) [47][48][49][50] have been introduced in the field of data generation; they can learn the distribution of data and generate synthetic data with the features of real data. In the field of data augmentation, many studies have shown that GANs can be used to generate additional data samples. Specifically, a GAN is trained on the original dataset to obtain a generative model, and the generative model is then used to generate further samples that augment the original dataset, increasing the number of data samples and improving the model's generalization ability. However, traditional GANs suffer from problems such as unstable training and mode collapse. To overcome these issues, Wasserstein Generative Adversarial Networks (WGANs) were introduced [51][52][53]. By using the Wasserstein distance to measure the distance between generated data and real data, WGANs improve training stability and the quality of generated data. Furthermore, to increase the diversity of generated data, a gradient penalty was introduced into the WGAN, forming the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) [54][55][56][57]. Based on the GAN and WGAN architectures, dedicated generative imputation networks were developed for data imputation, namely, the Generative Adversarial Imputation Network (GAIN) [58] and the Wasserstein Generative Adversarial Imputation Network (WGAIN) [59]. The Slim Generative Adversarial Imputation Network (SGAIN), a lightweight generative imputation network architecture without a hint matrix, was proposed as an improvement on GAIN. To address the issues of traditional GANs in SGAIN, it was further improved into the Wasserstein Slim Generative Adversarial Imputation Network (WSGAIN) and the Wasserstein Slim Generative Adversarial Imputation Network with Gradient Penalty (WSGAIN-GP) [59].
In this study, we introduce a novel approach for enhancing the quality and usability of the condition monitoring data of hydroelectric units, addressing the limitations of traditional methods, such as inadequate accuracy, disregard for temporal dependencies and obscured data distribution characteristics. Our methodology integrates two advanced data processing techniques: anomaly detection using the HDBSCAN clustering method and data imputation through the WSGAIN-GP generative model. This combination not only retains the intrinsic characteristics of the data but also significantly improves their completeness and utility. The HDBSCAN clustering method groups monitoring data according to density levels, enabling the precise identification of outliers, which is crucial for accurate data enhancement. Following this, the WSGAIN-GP generative model, utilizing unsupervised self-learning, approximates the distribution characteristics of real monitoring data, which is instrumental in generating high-quality substitutes for missing data and thereby addresses the gap left by traditional methods. To our knowledge, we are the first to apply these methods to hydropower unit condition monitoring. In doing so, we preserve the fidelity of the data while augmenting their integrity and applicability. The enhanced data quality and accuracy provided by our approach lay a solid foundation for more reliable condition monitoring and prediction of hydropower units, a step toward intelligent warnings for hydropower unit conditions that ultimately contributes to the maintenance and operational efficiency of these units. This paper details the proposed quality enhancement methodology, based on the HDBSCAN clustering method and the WSGAIN-GP generative model, and presents experimental evidence demonstrating its significant impact on data quality and accuracy in hydroelectric unit condition monitoring.

HDBSCAN Clustering Approach
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), an extension of the DBSCAN algorithm proposed by Campello et al. [60,61], is a density-based clustering method particularly effective for datasets with varying densities. Unlike DBSCAN, which relies on a uniform density threshold across the dataset, HDBSCAN identifies clusters of different densities by constructing a density tree, providing robustness against noise and outliers. The workflow of the HDBSCAN algorithm is illustrated in Figure 1.

The HDBSCAN algorithm follows these steps [32][33][34][35]:

Definition: The core distance core_k(x) represents the distance from the current point x to its k-th nearest neighbor. For a border point (BP), core_k(x) is infinite because N_D^Eps(x) < MinPts in x's Eps, making it meaningless. For a core point (CP), core_k(x) is the minimum radius that ensures exactly MinPts samples in x's Eps, i.e., the Euclidean distance between point x and the k-th nearest neighbor (N_k(x)) that satisfies N_D^Eps(x) ≥ MinPts. min_cluster_size defines the minimum number of samples for a cluster. If the number of samples in a formed cluster is below this threshold, it is considered a noise (outlier) point. min_sample represents the minimum number of samples that must be included in the neighborhood of one point when calculating the density and minimum distance. min_cluster_size and min_sample are hyperparameters that need to be set for the HDBSCAN algorithm, as shown in Equations (1) and (2).

core_k(x) = d(x, N_k(x))    (1)
Reachable Distance (RD): Point x is a CP, and p is any point. The RD between x and p, denoted as d_RD_k(x, p), is the maximum value between the core distance core_k(x) of x and the Euclidean distance between x and p, as shown in Equation (3).

d_RD_k(x, p) = max{core_k(x), d(x, p)}    (3)
Mutual Reachable Distance (MRD): MRD requires both points x and p to be CPs; otherwise, it is meaningless. It represents the maximum value between the core distances core_k(x) and core_k(p) of the two points and the Euclidean distance between them, as shown in Equation (4).

d_MRD_k(x, p) = max{core_k(x), core_k(p), d(x, p)}    (4)
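The core distance and mutual reachable distance defined above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy data, not the paper's implementation; the choice k = 1 is arbitrary:

```python
import numpy as np

def core_distance(X, k):
    """Distance from each point to its k-th nearest neighbor (Euclidean)."""
    # Pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Sort each row; column k is the k-th nearest neighbor
    # (column 0 is the point itself, at distance 0)
    return np.sort(d, axis=1)[:, k], d

# Toy 2D dataset: a dense pair and one faraway point
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
core, d = core_distance(X, k=1)

def mutual_reachable_distance(i, j):
    """Equation (4): max of the two core distances and the Euclidean distance."""
    return max(core[i], core[j], d[i, j])

print(core)                             # core distance of each point
print(mutual_reachable_distance(0, 1))  # small for the dense pair
print(mutual_reachable_distance(0, 2))  # inflated by the sparse point
```

Note how the MRD between the dense pair stays small, while any distance involving the isolated point is pushed up by its large core distance; this is the space transformation that makes sparse points easier to separate as noise.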
By transforming the dataset in space, the distances between dense points are reduced, while the distances involving sparse points are enlarged.
Step 1. Space Transformation: Utilizing density estimation to segregate low-density data from high-density data, reducing the impact of noise.
Step 2. Minimum Spanning Tree Construction: Building a tree graph from the weighted graph of transformed data points.
Step 3. Build a Hierarchical Clustering Structure: Creating a hierarchical structure by sorting and categorizing the edges in the tree.
Step 4. Tree Pruning and Compression: Limiting clusters based on min_cluster_size, refining the tree structure.
Step 5. Cluster Extraction: Selecting the most stable clusters based on local density estimates and cluster stability calculations. The goal of density-based clustering algorithms is to find the regions with the highest density. The local density estimate λ for point x can be represented by the reciprocal of core_k(x), as shown in Equation (5).

λ = 1/core_k(x)    (5)
λ_max^{C_i}(x) denotes the maximum λ value of point x when departing from cluster C_i, and λ_min^{C_i}(x) represents the minimum λ value of point x while belonging to cluster C_i. The stability σ(C_i) of cluster C_i is defined as shown in Equation (6):

σ(C_i) = Σ_{x∈C_i} (λ_max^{C_i}(x) − λ_min^{C_i}(x))    (6)

The final data clusters are selected based on stability, the clustering results are generated and outlier points are identified from the clustering results.
The data missing rate is defined as Equation (7):

δ = (N − n)/N    (7)

where n is the number of existing data samples, and N is the number of data samples that should exist based on the date-time interval and the sampling storage interval. Since it is difficult to determine directly whether a measured sample is abnormal, the Silhouette Coefficient Index (SCI) is used as a quantitative evaluation index in order to evaluate the anomaly detection effect objectively. The SCI allows a quantitative comparison of clustering results from the perspective of the data distribution when true data labels are unavailable. It is defined as Equation (8):

SCI = (1/n) Σ_{i=1}^{n} (y_i − x_i)/max(x_i, y_i)    (8)

where x_i represents the average distance between the i-th sample and the other samples in the same cluster, reflecting the cohesion of samples within the cluster, and y_i represents the average distance between the i-th sample and all samples in the nearest neighboring cluster, reflecting the separation between clusters. SCI ∈ [−1, 1], and a larger SCI value indicates a smaller intra-cluster distance and a larger inter-cluster distance, i.e., better clustering effectiveness.

WSGAIN-GP Algorithm
The development of WSGAIN-GP as an advanced tool for data estimation and imputation is grounded in the progressive evolution of Generative Adversarial Networks (GANs) and their variants. This method is an extension of the foundational GAIN model, further refined by subsequent iterations such as SGAIN and WSGAIN, culminating in a sophisticated approach to data imputation.
Originally, the GAIN network introduced a generator to create missing data and a discriminator to distinguish between real and imputed data. This adversarial training process involves the discriminator minimizing classification loss, while the generator aims to maximize the misclassification rate of the discriminator. In this framework, GAIN's discriminator receives additional data through a 'hint' mechanism, albeit at the cost of increased computational demands. SGAIN, a more streamlined version of GAIN, eliminates the Hint Generator and the associated Hint Matrix, thereby simplifying the architecture [62,63]. This approach adopts a two-layer neural network structure for both the generator and discriminator, in contrast to GAIN's three-layer setup, as detailed by Goodfellow et al. [40].
Building upon SGAIN, WSGAIN addresses challenges such as mode collapse and gradient vanishing. It does so by incorporating the Wasserstein distance to measure discrepancies between real and generated data, thereby enhancing the stability of the training process. Further improving upon WSGAIN, the WSGAIN-GP model introduces a gradient penalty technique, moving away from weight clipping. This modification, as part of the loss function, enhances the overall efficacy of the network [59]. The network architecture of WSGAIN-GP is depicted in Figure 2.

Assuming a d-dimensional spatial dataset χ, X = (X_1, . . ., X_d) is a random variable taking values in χ, and its distribution is defined as P(X). The mask vector M = (M_1, . . ., M_d) takes values in {0, 1}^d. Then, as shown in Equation (9):

X̃_i = X_i, if M_i = 1; *, otherwise    (9)

where * does not belong to any X_i and represents an unobserved value. Therefore, the mask vector M indicates which elements of X have been observed, and M can be used to recover X. Define the dataset D = {(x̃_i, m_i)}, where x̃_i is a copy of X̃ and m_i corresponds to the recovered M. The goal of data estimation imputation is to supplement every unobserved value in X̃_i based on the conditional distribution P(X | X̃ = x̃_i). Define X̄ as the output vector estimated for each {X_i}_1^d and X̂ as the final estimated result vector, as shown in Equation (10).

X̂ = M ⊙ X̃ + (1 − M) ⊙ X̄    (10)
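The masking in Equation (9) and the recombination in Equation (10) amount to element-wise selection. A minimal NumPy sketch with toy values, using NaN to stand in for the unobserved marker * and a hand-picked vector in place of a real generator output:

```python
import numpy as np

x = np.array([3.2, 1.5, 4.8, 2.0])       # complete (unknown) sample X
m = np.array([1, 0, 1, 0], dtype=float)  # mask M: 1 = observed, 0 = missing

# Equation (9): observed vector, unobserved entries marked (here as NaN)
x_tilde = np.where(m == 1, x, np.nan)

# Generator output: estimates for every component (toy values here)
x_bar = np.array([3.0, 1.4, 5.0, 2.1])

# Equation (10): keep observed values, fill missing ones from the generator
x_hat = m * np.nan_to_num(x_tilde) + (1 - m) * x_bar
print(x_hat)  # → [3.2 1.4 4.8 2.1]
```

Observed components pass through unchanged; only the positions where M is 0 receive generated values, which is what lets the discriminator focus on the plausibility of the imputed entries.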
Generator (G) Model: The generator (G) takes the observed vector X̃, the mask M and a random noise variable Z as inputs and outputs the estimated matrix X̄, as shown in Equations (11) and (12). The loss function L(G) is expressed as shown in Equation (13).

Discriminator (D) Model: The discriminator (D) is a crucial component of the adversarial game. Its task is to receive samples from the generator or from the real dataset and attempt to classify them as either real samples or fake samples (samples generated by the generator). The goal is to classify samples correctly, i.e., to accurately distinguish between real and generated samples; training improves the discriminator's ability to discern real from fake. Since WSGAIN-GP aims to eliminate the drawbacks caused by weight clipping [64], no weight clipping is applied; instead, to improve training, a gradient penalty is included as a component of the loss function L(D), as shown in Equations (14) and (15). The detailed steps of the WSGAIN-GP algorithm are shown in Algorithm 1 [59].
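The gradient penalty term in L(D) pushes the norm of the discriminator's gradient at randomly interpolated samples toward 1. For a linear critic D(x) = w·x the gradient is simply w, which makes the penalty easy to illustrate without an autograd framework. The sketch below is an illustration of the penalty term only, not the paper's network; the coefficient λ = 10 is the value commonly used for WGAN-GP, not necessarily the setting used here:

```python
import numpy as np

def gradient_penalty(w, x_real, x_fake, lam=10.0, rng=None):
    """WGAN-GP penalty: lam * (||grad D(x_int)|| - 1)^2, averaged over a batch.

    For the linear critic D(x) = w @ x, grad_x D(x) = w everywhere, so the
    interpolates only mark where the penalty is nominally evaluated.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_int = eps * x_real + (1 - eps) * x_fake  # random interpolates
    grad = np.tile(w, (x_real.shape[0], 1))    # gradient of a linear critic
    norms = np.linalg.norm(grad, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

x_real = np.ones((4, 2))
x_fake = np.zeros((4, 2))

print(gradient_penalty(np.array([1.0, 0.0]), x_real, x_fake))  # 0.0: ||w|| = 1
print(gradient_penalty(np.array([3.0, 4.0]), x_real, x_fake))  # 160.0: ||w|| = 5
```

A critic whose gradient norm is already 1 pays no penalty; one with a large gradient norm is penalized quadratically, which replaces the hard weight clipping of the original WGAN.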

Enhancement Methodology Flow of Hydropower Unit Condition Monitoring Data Based on HDBSCAN-WSGAIN-GP
A monitoring data quality enhancement method based on HDBSCAN-WSGAIN-GP improves the quality and usability of hydropower unit condition monitoring data by combining the advantages of density clustering and a generative adversarial network. The quality enhancement process of hydropower unit monitoring data based on HDBSCAN-WSGAIN-GP is shown in Figure 3. The detailed steps are as follows:

Step 1. Data pre-processing: anomaly data detection and cleaning based on HDBSCAN.
Step 2. Initialize the network parameters for the generator and discriminator.
Step 3. Define the inputs and outputs for the generator and the discriminator.
Step 4. Define the loss functions, including the losses for the generator, discriminator and gradient penalty.
Step 5. Mark missing values and construct mask vector M.
Step 6. During model training, alternate between training the generator network and the discriminator network, updating model parameters by optimizing the loss functions, which include both the generator and discriminator losses. The generator is used to generate estimated values for imputing missing data, while the discriminator evaluates the difference between generated data and real data.
Step 7. After each training epoch, evaluate the performance of the model, including the imputation effectiveness of the generator and the discrimination accuracy of the discriminator.
Step 8. Based on the evaluation results, adjust the model's hyperparameters or structure to further optimize its performance, ultimately obtaining an efficient WSGAIN-GP model for imputing missing data.
Step 9. Perform missing value imputation: using the trained WSGAIN-GP model, merge the imputed data generated by the generator with the existing data in the original dataset to obtain a complete dataset, thereby achieving the imputation of missing data.

To quantitatively assess the consistency of the filled data sequence with the original data sequence in terms of distribution and characteristics, the KL Divergence, JS Divergence and Hellinger Distance are introduced to quantify the similarity between two distributions, as shown in Equations (16)-(18):

KL(P‖Q) = Σ_i P(i) log(P(i)/Q(i))    (16)

JS(P‖Q) = (1/2) KL(P‖M′) + (1/2) KL(Q‖M′), M′ = (P + Q)/2    (17)

H(P, Q) = (1/√2) ‖√P − √Q‖_2    (18)

KL Divergence, JS Divergence and Hellinger Distance are all non-negative. KL Divergence ranges from 0 to ∞, while JS Divergence and Hellinger Distance range from 0 to 1. Smaller values of these three metrics indicate a greater similarity between two distributions, with a value of 0 indicating complete similarity.
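The three similarity measures in Equations (16)-(18) can be sketched for discrete distributions as follows, using base-2 logarithms so that the JS divergence lies in [0, 1]; the histograms below are illustrative:

```python
import numpy as np

def kl(p, q):
    """Equation (16): KL divergence, in [0, inf)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Equation (17): JS divergence with base-2 logs, in [0, 1]."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Equation (18): Hellinger distance, in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

p = [0.1, 0.4, 0.5]  # e.g., histogram of the original series
q = [0.1, 0.4, 0.5]  # identical histogram of the filled series

print(kl(p, q), js(p, q), hellinger(p, q))  # all 0: identical distributions
print(js([1.0, 0.0], [0.0, 1.0]))           # 1.0: fully disjoint support
```

In the evaluation below, p and q would be the normalized histograms of the original and filled data sequences, with all three metrics close to 0 indicating that the filled data reproduce the original distribution.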

Data Collection
In order to validate the effectiveness of the proposed method, this study conducted an analysis using the actual operational monitoring data of a mixed-flow hydropower unit at the Fengtan hydropower station in the Central China region. The data anomaly detection and data imputation methods were tested separately. The unit parameters are presented in Table 1.

The unit is equipped with an online monitoring system, which utilizes monitoring equipment to continuously collect, monitor and automatically record various state parameters such as the vibration, deflection, pressure pulsation, air gap, stator temperature, oil level, active power, water level and rotational speed during the operation of the unit. This enables real-time access to the current operating status of the unit. The time series grouping of online monitoring data for this unit is illustrated in Figure 4.
To demonstrate the process of the proposed method and validate its effectiveness, this study has chosen the swing monitoring parameters as an example; the same procedure can be applied to other parameters. Specifically, we obtained the upper guide swing deflection data (S), the corresponding head (H) and the active power (P) of the unit from 3 April 2020 to 4 August 2021 from the unit condition online monitoring system of the Fengtan hydropower station. As illustrated in Figure 5, because the online monitoring system of hydroelectric units has a high sampling frequency but a low data storage frequency, the actual time interval of the stored measured data is 15 min, amounting to a total of 26,245 samples. The overall missing rate δ of the original dataset is calculated to be 0.438. The missing data within the dataset are represented as NaN.
From Figure 5, it can be observed that the amplitude variation pattern of the upper guide swing deflection (S) is complex. This complexity is attributed to its coupling with factors such as changes in operational parameters and noise. The water head (H) exhibits long-term low-frequency fluctuation characteristics, primarily influenced by seasonal changes in the external environment. The active power output of the unit (P) displays short-term high-frequency fluctuation characteristics, mainly because the unit's output is dynamically and flexibly adjusted in real time based on the load demand from the grid side. During this adjustment process, the vibration characteristics of the unit are taken into consideration, aiming to avoid operating conditions with intense vibrations. Therefore, when conducting anomaly detection for unit parameters, it is imperative to consider a coupled analysis of the actual operational characteristics of the unit. In this study, the influence of the operational parameters H and P on the operational state parameter S is considered simultaneously. By combining these operational parameters, the measured data of the hydroelectric unit are structured into a three-dimensional dataset Ω_0 = (H, P, S), as illustrated in Figure 6.

Anomaly Detection
The original data samples were fed into the HDBSCAN model, and the computed results are depicted in Figures 7a and 8a. The HDBSCAN method adaptively identified effective data clusters and marked outliers, including singular points and anomalous points, as noise points. By removing these noise points from the original data samples, a denoised and valid dataset Ω_1 = (H_1, P_1, S_1) was obtained. This denoised dataset is further input into the WSGAIN-GP model for missing data imputation.
To validate the effectiveness of the HDBSCAN anomaly detection method, this study compared the results with those of other methods such as DBSCAN, OPTICS, LOF, HAC and K-Means. Based on the methodology principles and references [24][25][26][27][28][29][30], some of the initial parameters for these comparative methods are shown in Table 2. The performance of the different methods is evaluated in terms of the silhouette coefficient index, and the parameters are optimally tuned by a grid search method.

Different detection methods are used to detect anomalies in the 3D dataset Ω_X = (H, P, S_X), consisting of the head, active power and upper guide X-axis swing, and Ω_Y = (H, P, S_Y), consisting of the head, active power and upper guide Y-axis swing, respectively; the standardized Silhouette Coefficient Index (SCI) results are shown in Table 3. Combining the clustering effect graphs with the silhouette coefficient comparison table, the HDBSCAN method achieves the highest silhouette coefficient on both datasets, detects and recognizes valid and abnormal data better, and produces clusters that are more consistent with the actual operating conditions.

Data Filling
The noise-reduced dataset Ω_1 obtained by the HDBSCAN method, with a missing rate of δ_1 = 0.492, is further input into the trained WSGAIN-GP model for missing value filling.
The literature [56] shows that when the missing rate is greater than 50%, the filling accuracy of the WSGAIN-GP method is significantly higher than that of other methods such as KNN and spline interpolation. Based on the literature and several experiments, the network parameters of the WSGAIN-GP model, involving the generator and the discriminator, are shown in Table 4. The initialization assignments of the main parameters of the WSGAIN-GP model are shown in Table 5.
Based on the model training, incomplete data sequences were input for data filling, and the results of the WSGAIN-GP filling for the upper guide swing, head and active power are shown in Figure 9.
The relative frequency distributions of the data after filling enhancement, compared with the measured data, are shown in Figure 10.
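The closeness between the filled and measured distributions can be quantified with the KL divergence, JS divergence and Hellinger distance computed over shared histogram bins. A minimal NumPy sketch on synthetic histograms (the data here are illustrative stand-ins, not the paper's measurements):

```python
import numpy as np

def _norm(h):
    h = np.asarray(h, dtype=float) + 1e-12  # avoid division by / log of zero
    return h / h.sum()

def kl_divergence(p, q):
    p, q = _norm(p), _norm(q)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    p, q = _norm(p), _norm(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger_distance(p, q):
    p, q = _norm(p), _norm(q)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Histograms over shared bins of measured values and (here, synthetic)
# filled values; a good imputer yields metrics close to 0.
rng = np.random.default_rng(1)
measured = rng.normal(50.0, 5.0, 5000)
filled = rng.normal(50.0, 5.0, 5000)
bins = np.linspace(30, 70, 41)
p, _ = np.histogram(measured, bins=bins)
q, _ = np.histogram(filled, bins=bins)

kl, js, hd = kl_divergence(p, q), js_divergence(p, q), hellinger_distance(p, q)
print(f"KL={kl:.4f}, JS={js:.4f}, Hellinger={hd:.4f}")
```

All three metrics are zero for identical distributions, so values near zero indicate that the filled data reproduce the measured distribution well.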
Combined with the filling results in Figures 9 and 10, it can be seen that the distribution of the data filled by the WSGAIN-GP model is close to the distribution of the measured values, indicating that the filling method preserves the data characteristics and distribution well; the quantitative indexes of the differences in the data distributions are shown in Table 6. The KL divergence, JS divergence and Hellinger distance between the data distributions before and after the filling of the upper guide swing, head and active power are close to 0, which indicates that the WSGAIN-GP filling model learns the distribution and characteristics of the real data well and guarantees the accuracy of the generated data.

To further validate the superiority of the proposed data filling method, and at the same time to study the performance of different data filling methods at different missing rates, the short-term complete state sequence dataset of the hydropower units in the case study is selected for comparative analysis. For this comparison dataset, the sampling interval is 30 min and the sampling period runs from 7:00 p.m. on 12 June 2020 to 7:00 p.m. on 25 July 2020, a total length of 43 days with 2064 complete measured data samples, as shown in Figure 11. Incomplete sequences with different missing rates were generated by random missing; considering the engineering reality, the missing rates were taken as 10%, 30%, 50% and 70%, and the comparison methods were chosen as SGAIN, GAIN and KNN. The Root Mean Square Error (RMSE) was used to measure the accuracy of the data filling results, as shown in Equation (19).
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)² )    (19)

where N denotes the number of samples, ŷ_i represents the filled value output by the model and y_i denotes the measured true value.

The filling results of the different methods at different missing rates for the upper guide X-axis swing, head and active power of the hydroelectric units are shown in Figures 12-15, where the horizontal-axis serial numbers represent the serial numbers of the original samples in chronological order. To eliminate the effect of differences between the randomly generated incomplete sequences, the number of repeated trials was set to 1000; in each trial, incomplete sequences were generated according to each missing rate and filled with the different methods, the error of each experiment was recorded, and the average error over the 1000 trials was taken as the final error. The error results are shown in Table 7. The statistical plots of the filling errors of each method for the upper guide X-axis swing, upper guide Y-axis swing, head and active power of the hydropower units are shown in Figure 14.
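This evaluation protocol (random masking at several missing rates, many repeated trials, averaged RMSE) can be sketched in a few lines of NumPy. The mean-filling imputer below is only a hypothetical stand-in for WSGAIN-GP and the comparison methods, and the synthetic series and trial count are illustrative assumptions.

```python
import numpy as np

def rmse(y_hat, y):
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def mean_impute(x):
    """Hypothetical baseline imputer: fill NaNs with the observed mean."""
    return np.where(np.isnan(x), np.nanmean(x), x)

def avg_fill_error(series, missing_rate, impute, n_trials=100, seed=0):
    """Average RMSE of an imputer over repeated random-missing trials."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        mask = rng.random(series.shape) < missing_rate  # True = missing
        filled = impute(np.where(mask, np.nan, series))
        errs.append(rmse(filled[mask], series[mask]))  # error on missing slots only
    return float(np.mean(errs))

# Synthetic stand-in for the 2064-sample complete state sequence.
series = np.random.default_rng(7).normal(120.0, 3.0, 2064)
results = {r: avg_fill_error(series, r, mean_impute) for r in (0.1, 0.3, 0.5, 0.7)}
print(results)
```

Swapping `mean_impute` for each candidate model and raising `n_trials` to 1000 reproduces the structure of the comparison reported in Table 7.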
Based on the comparison with Figures 11-14 and Table 7, it is evident that the WSGAIN-GP method for missing data imputation consistently yields the lowest error across various data missing rates, with an average reduction in the root mean square error of 0.0936, thereby demonstrating the highest accuracy.This method shows commendable effectiveness in filling gaps for hydroelectric unit state parameters with a high randomness, such as the pendulum swing, as well as operational parameters like the active power and head.The WSGAIN-GP model employs a generative adversarial network structure, wherein the interplay between the generator and discriminator approximates the distribution of real data.By generating data, the model is able to complete the missing parts of the monitoring data, resulting in a more complete and continuous dataset.

Conclusions
In addressing issues such as anomalies and missing data that compromise the quality of condition monitoring datasets in hydropower units, this paper introduces a method for enhancing data quality through the integration of HDBSCAN and WSGAIN-GP. This method capitalizes on the strengths of density clustering and generative adversarial networks to enhance the reliability and utility of the condition monitoring data.
Initially, the HDBSCAN clustering method categorizes the monitoring data based on density levels, aligned with operational conditions, to adaptively detect and cleanse anomalies in the dataset. Furthermore, the WSGAIN-GP model, through its data imputation capabilities, employs unsupervised learning to understand and replicate the features and distribution patterns of actual monitoring data, thereby generating values for missing data.
The validation analysis, conducted using an online monitoring dataset from real operational units, provides compelling evidence of the method's effectiveness: (1) Comparative experiments reveal that the clustering silhouette coefficient index (SCI) of the anomaly detection model based on HDBSCAN achieves 0.4935, surpassing those of the other comparative models, thereby demonstrating its superior ability to distinguish between valid and anomalous samples.
(2) The probability density distribution of the data imputation model based on WSGAIN-GP closely mirrors that of the measured data. Notably, the Kullback-Leibler (KL) divergence, Jensen-Shannon (JS) divergence and Hellinger distance metrics, when comparing the distributions of the imputed and original data, approach values near zero, indicating a high degree of accuracy in data representation.
(3) Through comparative analyses with other filling methods, including SGAIN, GAIN and KNN, the WSGAIN-GP model demonstrates superior effectiveness in data imputation across various rates of missing data, achieving the lowest Root Mean Square Error (RMSE) at every missing rate tested. This confirms the high accuracy and generalization capability of the proposed imputation model.
The findings and methodologies presented in this study lay a robust foundation for high-quality data, crucial for subsequent trend prediction and state warnings in the context of hydropower unit monitoring.

Figure 2 .
Figure 2. Structure of the WSGAIN-GP network.

and missing values in {X_i}_{i=1}^{d} are replaced by random noise values. N = (N_1, …, N_d) represents the output of a function that samples random values from a continuous uniform distribution, commonly configured to use the interval [−0.01, +0.01], as shown in Equations (

Figure 3 .
Figure 3. Quality Enhancement Process of Hydropower Unit Monitoring Data Based on HDBSCAN-WSGAIN-GP.

Figure 4 .
Figure 4. Time series of online monitoring data for hydropower units.


Figure 6 .
Figure 6. Three-dimensional dataset of upper guide swing and operating condition parameters from Fengtan Hydropower Station Unit 2.


Figure 7 .
Figure 7. Clustering results of different anomaly detection methods for the upper guide X-axis swing.


Figure 8 .
Figure 8. Clustering results of different anomaly detection methods for the upper guide Y-axis swing.

Figure 10 .
Figure 10. Relative frequency distributions after filling in the enhancement with measured data.


Figure 11 .
Figure 11. A complete short-term state sequence dataset for hydropower units.


Figure 14 .
Figure 14. Filling results of each method for active power at different missing rates.

Table 2 .
Initialization parameters for different detection methods.


Table 3 .
Comparison table of the results of different testing methods.

Table 6 .
Difference in distribution between post-fill and measured data.


Table 7 .
Evaluation of the effectiveness of different methods for filling errors in hydropower unit monitoring data.
