4.1. Sample Construction
This section uses a four-machine two-area system integrated with wind generation. The study interfaces with the CloudPSS platform through Python 3.9 scripts for automated model modification and batch time-domain simulations. The goal is to ensure comprehensive and uniform coverage of the sample space, enhancing the generality and representativeness of the simulation results. Key factors influencing the assessment of renewable energy oscillation hosting capacity are considered, including disturbance location, frequency, amplitude, and overall system load level.
Periodic sinusoidal disturbances are applied to different generators in the system. These disturbances act either on the mechanical torque of the prime mover or on the terminal voltage of the generator, exciting wide-band oscillatory responses under various system configurations. The system load level varies between 90% and 110% of the nominal load, in increments of 5%. The disturbance amplitude ranges from 0.1 p.u. to 0.5 p.u., with steps of 0.05 p.u. The disturbance frequency spans two ranges, 10–40 Hz and 70–100 Hz, with a resolution of 0.2 Hz. These variations ensure a broad exploration of operating conditions and disturbance impacts, providing a robust sample space for oscillation hosting capacity analysis.
Python-based automation scripts invoke the CloudPSS time-domain simulation engine to generate the required samples. For each simulation case, measurements are collected from the generator terminal buses, including active power, reactive power, voltage magnitude, and current. These measurements are then analyzed to determine the system’s stability under the applied disturbances.
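As an illustrative sketch, the batch of simulation cases implied by the sweep ranges above can be enumerated as follows. The CloudPSS invocation itself is omitted, since only the sweep design is given in the text; the variable names are illustrative.

```python
from itertools import product

# Sweep ranges stated in the text:
# load level 90-110% in 5% steps, amplitude 0.1-0.5 p.u. in 0.05 p.u. steps,
# frequency 10-40 Hz and 70-100 Hz at 0.2 Hz resolution.
load_levels = [0.90 + 0.05 * i for i in range(5)]              # 5 levels
amplitudes = [round(0.10 + 0.05 * i, 2) for i in range(9)]     # 9 amplitudes
frequencies = ([round(10.0 + 0.2 * i, 1) for i in range(151)]  # 10-40 Hz
               + [round(70.0 + 0.2 * i, 1) for i in range(151)])  # 70-100 Hz

# Each tuple (load, amplitude, frequency) defines one time-domain simulation
cases = list(product(load_levels, amplitudes, frequencies))
print(len(cases))  # 5 * 9 * 302 = 13590 cases per disturbance location/type
```

Each case would then be passed to the Python automation layer that configures the model and launches the CloudPSS time-domain run.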
Time-domain simulations use a fixed time step of 1 ms, corresponding to a sampling frequency of 1000 Hz, in strict accordance with the Nyquist–Shannon sampling theorem. Given that the highest frequency component of the applied disturbances is 100 Hz, the minimum required sampling frequency is 200 Hz. The chosen rate of 1000 Hz, with a Nyquist frequency of 500 Hz, is well above this minimum, avoiding aliasing and ensuring accurate capture of the frequency content in the 10–100 Hz range. For each simulation case, 500 consecutive data points are recorded, corresponding to a 0.5 s data window. This window captures approximately 5 full cycles of a 10 Hz oscillation and 50 full cycles of a 100 Hz oscillation, providing sufficient duration for reliable modal identification and feature extraction.
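The sampling arithmetic above can be verified in a few lines; all values come directly from the text:

```python
fs = 1000.0     # sampling frequency (Hz), from the 1 ms fixed time step
f_max = 100.0   # highest disturbance frequency component (Hz)
n_points = 500  # consecutive data points recorded per case

nyquist = fs / 2.0             # Nyquist frequency: 500 Hz
window = n_points / fs         # data window length: 0.5 s
cycles_10hz = 10.0 * window    # cycles of a 10 Hz oscillation in the window
cycles_100hz = 100.0 * window  # cycles of a 100 Hz oscillation in the window

assert fs >= 2 * f_max  # Nyquist-Shannon criterion satisfied with margin
print(nyquist, window, cycles_10hz, cycles_100hz)  # 500.0 0.5 5.0 50.0
```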
The disturbance frequency bands of 10–40 Hz and 70–100 Hz are selected based on the dynamics of high-penetration renewable energy systems. The 10–40 Hz range typically encompasses sub-synchronous and super-synchronous oscillations, often associated with shaft torsional interactions, converter-grid impedance interactions, and control-induced resonances. The 70–100 Hz range targets high-frequency oscillations potentially excited by power electronic switching characteristics and associated control dynamics. For the studied four-machine two-area system with a direct-drive wind turbine, oscillation modes and interactions relevant to oscillation hosting capacity are expected to occur within these frequency ranges or be influenced by them, making the selected frequency bands both physically meaningful and relevant to the dynamics under study.
Sample labeling is performed based on the system’s operating conditions and the disturbance configuration. Specifically, the damping coefficient (σ) is extracted from the active power response at the generator terminal using the Prony method. The damping coefficient, an important parameter reflecting the oscillation characteristics and trends of the system, is obtained from the Prony signal model:

\hat{x}(t_k) = \sum_{i=1}^{p} A_i e^{\sigma_i t_k} \cos(2\pi f_i t_k + \varphi_i), \quad t_k = k T_s, \; k = 0, 1, \ldots, m-1,

where A_i is the modal amplitude, f_i is the oscillation frequency, φ_i is the initial phase, p is the number of identified modes, k is the index of the sampling point, m is the data window length in samples, T_s is the sampling period, and t_k is the current sampling time.
The damping coefficient, σ, is then computed by comparing the modal amplitudes extracted from two adjacent time windows (data segments) of the generator terminal active-power response.
Based on the calculated damping coefficient, the labeling rule is as follows: if σ < 0 (indicating positive damping), the system is stable and can “host” the disturbance, and the sample is labeled “1” (stable). If σ ≥ 0 (indicating non-positive damping), the system is unstable, and the sample is labeled “0” (unstable). This process completes the construction of the dataset, which is then used for analysis and modeling.
For example, if the damping coefficient is found to be σ < 0, the oscillation shows a decaying trend, indicating that the system is stable under the given operating conditions and disturbance, and the sample is labeled “1” (stable). Conversely, if σ ≥ 0, the system exhibits instability, and the sample is labeled “0” (unstable).
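A minimal sketch of the labeling rule, assuming the modal amplitudes of two adjacent windows have already been extracted by the Prony method; the function name and the 0.25 s window gap are illustrative, not from the paper:

```python
import math

def label_from_windows(a1, a2, dt):
    """Estimate the damping coefficient from modal amplitudes a1, a2 of two
    adjacent windows separated by dt seconds, then apply the labeling rule:
    sigma < 0 (decaying, positive damping) -> label 1 (stable),
    sigma >= 0 (non-positive damping)      -> label 0 (unstable)."""
    sigma = math.log(a2 / a1) / dt  # exponential amplitude ratio model
    return sigma, 1 if sigma < 0 else 0

# Decaying oscillation: amplitude halves over a 0.25 s gap -> stable
sigma, label = label_from_windows(1.0, 0.5, 0.25)
print(round(sigma, 3), label)
```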
4.2. Electrical Feature Dimensionality Reduction Based on Variational Autoencoder
To accelerate the convergence of neural network training and reduce the overall computational cost, data normalization is an effective preprocessing technique. In this work, a max–min normalization approach is adopted to scale the collected electrical measurements into the interval [0, 1]. The normalization process is expressed as:

\hat{x}_i(t) = \frac{x_i(t) - x_{i,\min}}{x_{i,\max} - x_{i,\min}},

where x_i(t) denotes the value of the i-th sample at time t, \hat{x}_i(t) is the corresponding normalized value, and x_{i,\max} and x_{i,\min} represent the maximum and minimum values of the i-th sample, respectively.
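The max–min normalization can be sketched in a few lines (per-sample scaling, with the extrema taken from the series itself):

```python
def minmax_normalize(series):
    """Scale a measurement series into [0, 1] using max-min normalization:
    x_hat = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(series), max(series)
    return [(x - x_min) / (x_max - x_min) for x in series]

print(minmax_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

In the leakage-free pipeline described later, the scaling parameters would be taken from the training folds only and reused on the test fold.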
The variational autoencoder (VAE) is then applied to the wide-band oscillation waveform dataset to perform dimensionality reduction. The network is trained using the Adam optimizer, where the learning rate governs the magnitude of parameter updates. A batch size of 32 is adopted, indicating that 32 samples are fed into the network during each training iteration. The total number of training epochs is set to 500, ensuring that the entire dataset is repeatedly traversed to achieve stable convergence.
To effectively extract compact yet informative features from the high-dimensional oscillation waveforms, a two-stage feature extraction and fusion process is designed.
Stage 1: Deep Feature Encoding. For each electrical quantity (active power
P, reactive power
Q, voltage magnitude
U, and current
I), its 500-point time series (spanning 0.5 s) is independently processed by a dedicated VAE encoder. The encoder maps the high-dimensional sequence into a 25-dimensional latent vector z, which compactly captures the essential oscillatory patterns and dynamic characteristics of the original waveform. This intermediate latent representation ensures the representativeness of the extracted features.
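A minimal sketch of such an encoder as a plain fully connected VAE forward pass; the intermediate layer widths (256, 64) and the random weights are illustrative placeholders, since the text only fixes the 500-to-25 reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Fully connected layer with ReLU activation
    return np.maximum(x @ w + b, 0.0)

# Illustrative weights; a trained VAE would learn these via Adam
w1, b1 = rng.standard_normal((500, 256)) * 0.05, np.zeros(256)
w2, b2 = rng.standard_normal((256, 64)) * 0.05, np.zeros(64)
w_mu = rng.standard_normal((64, 25)) * 0.05
w_lv = rng.standard_normal((64, 25)) * 0.05

def encode(x):
    """Map a 500-point waveform to a 25-dim latent vector using the VAE
    reparameterization trick: z = mu + exp(logvar / 2) * eps."""
    h = dense(dense(x, w1, b1), w2, b2)
    mu, logvar = h @ w_mu, h @ w_lv
    eps = rng.standard_normal(25)
    return mu + np.exp(0.5 * logvar) * eps

z = encode(rng.standard_normal(500))
print(z.shape)  # (25,)
```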
Figure 3 visually illustrates such a 25-dimensional latent vector generated for an exemplary electrical quantity.
Stage 2: Feature Fusion and Aggregation. To transform the information-rich 25-dimensional latent representation into a single scalar feature suitable for subsequent machine learning models (e.g., XGBoost) while preserving its core information, a feature fusion step is introduced. Specifically, the first principal component (FPC) is extracted from each 25-dimensional latent vector z, serving as the final scalar feature f for that electrical quantity. The first principal component represents the direction of maximum variance in the latent space, thereby summarizing the most significant variation pattern with a single value.
Mathematically, PCA is performed on mean-centered latent vectors. Let μ denote the mean of z over the training fold, and define z̃ = z − μ. The scalar feature is then computed as

f = w_1^{\top} \tilde{z},

where w_1 is the eigenvector corresponding to the largest eigenvalue of the covariance matrix of z̃ estimated on the training fold. The same μ and w_1 are then applied to the corresponding test fold to avoid information leakage.
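A numpy-only sketch of this fold-wise FPC procedure (fit on the training fold, reuse the fitted mean and eigenvector on the test fold); the toy latent vectors are illustrative:

```python
import numpy as np

def fit_fpc(z_train):
    """Fit the first-principal-component projector on training-fold latent
    vectors (rows are samples). Returns (mu, w1) for reuse on the test fold."""
    mu = z_train.mean(axis=0)
    cov = np.cov(z_train - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    w1 = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return mu, w1

def apply_fpc(z, mu, w1):
    """Project mean-centered latents onto w1: f = w1^T (z - mu)."""
    return (z - mu) @ w1

# Toy 2-D latents whose variance lies entirely along the first axis
z_train = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
mu, w1 = fit_fpc(z_train)
f = apply_fpc(z_train, mu, w1)
print(np.abs(f))  # [2. 0. 2.] up to the usual sign ambiguity of w1
```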
Through this two-stage process, each electrical quantity’s 500-point sequence is transformed into a single, physically meaningful scalar feature (f_P, f_Q, f_U, f_I). These four scalars are then concatenated to form a 4-dimensional feature vector for each oscillation sample. Consequently, for a dataset with M samples, the final feature matrix input to XGBoost has a shape of M × 4. This design enables a direct and interpretable assessment of the importance of the four original electrical quantities.
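Assembling the four per-quantity scalars into the M × 4 input matrix can be sketched as follows (the feature values are hypothetical):

```python
import numpy as np

# Hypothetical per-quantity scalar features for M = 3 samples
f_P = np.array([0.12, -0.40, 0.33])
f_Q = np.array([0.05, 0.21, -0.10])
f_U = np.array([0.98, 0.55, 0.72])
f_I = np.array([-0.07, 0.14, 0.02])

# Concatenate the four scalars per sample into the M x 4 matrix fed to XGBoost
X = np.column_stack([f_P, f_Q, f_U, f_I])
print(X.shape)  # (3, 4)
```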
The VAE is implemented with a symmetric encoder–decoder structure. All data normalization parameters, VAE models, and PCA-based aggregation operators are fitted exclusively using the training folds in each cross-validation iteration, and then applied to the corresponding test fold. The encoder for each electrical quantity consists of fully connected (Dense) layers that progressively reduce the dimensionality from 500 to 25. The decoder mirrors this structure, reconstructing the original 500-point sequence from the 25-dimensional latent vector. After obtaining the latent vector, an additional aggregation layer (implemented via principal component analysis) is applied to produce the final scalar feature. The detailed layer-by-layer architecture for processing a single electrical quantity sequence is summarized in
Table 1.
As illustrated in
Figure 4, the reconstructed waveform closely matches the original waveform, demonstrating that the encoder effectively captures essential oscillatory characteristics and that the decoder successfully restores waveform details. This confirms that the proposed method can reliably compress and reconstruct wide-band oscillation signals while extracting compact and meaningful latent features.
4.3. Key Feature Extraction Based on XGBoost
The feature matrix of shape M × 4 obtained from the process described in
Section 4.2—where each row is a sample’s directly interpretable feature vector (f_P, f_Q, f_U, f_I)—serves as the input to the XGBoost model. This setup allows for the evaluation of the relative contribution of each of the four original electrical quantity features to the classification of oscillation hosting capacity.
It should be noted that no independent hold-out test set is used in this study. All reported test results correspond to the test folds in the five-fold cross-validation procedure. K-fold cross-validation, originally proposed by Seymour Geisser, is a widely used method for model evaluation and selection. As illustrated in
Figure 5, this study adopts five-fold cross-validation (i.e., K = 5). Compared with smaller or larger choices of K, setting K = 5 achieves a desirable balance between bias and variance, thereby improving the accuracy and reliability of model assessment while keeping computational complexity at a reasonable level. The procedure of K-fold cross-validation can be summarized as follows:
1. The dataset is randomly partitioned into K subsets of equal size.
2. Each subset is used once as the test fold, while the remaining K − 1 subsets are combined to form the training folds.
3. Each iteration yields a model and its corresponding prediction error. The cross-validation score is then obtained by averaging the K prediction errors.
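The three steps above can be sketched without any ML library; the shuffling seed and fold construction are illustrative details:

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k equal-size folds; each fold
    serves once as the test fold while the rest form the training folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = kfold_indices(100, k=5)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 80 20
```

Averaging a model's prediction error over the five (train, test) pairs then yields the cross-validation score.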
4.3.1. Selection of the Number of Trees for Model Evaluation
To evaluate the performance of the XGBoost model, the Random Forest (RF) algorithm is used as a comparative baseline. Random Forest is also a widely adopted and powerful machine learning method within the family of ensemble learning algorithms. It is composed of multiple decision trees, and aggregates their predictions to enhance overall accuracy and generalization capability.
Five-fold cross-validation is employed to assess the model performance. The cross-validation scores of XGBoost and Random Forest under different numbers of trees (
n_estimators) are shown in
Figure 6. As illustrated in the figure, the five-fold cross-validation score of XGBoost consistently exceeds that of Random Forest. Notably, even when
n_estimators = 1, XGBoost already achieves a relatively high score of approximately 0.8. The highest cross-validation score for XGBoost occurs at
n_estimators = 18, reaching approximately 0.9826. Therefore, the number of trees in this study is set to 18.
4.3.2. Learning Curve Analysis
To further evaluate the performance of XGBoost, the learning_curve function is employed to generate the learning curves. This function directly outputs the number of training samples, the training scores, and the validation scores, illustrating how the model’s performance on both the training set and the validation set changes with increasing training data size. The learning curve provides an effective means to assess the model’s generalization ability and to determine whether it suffers from overfitting or underfitting.
The learning curves of XGBoost and Random Forest are shown in
Figure 7 and
Figure 8, respectively. The training score represents the performance obtained on each training subset, while the validation score corresponds to the five-fold cross-validation score for the same subset size. As shown in the figures, both the training and validation scores of XGBoost and Random Forest improve as the number of training samples increases. Moreover, the gap between the training and validation scores gradually narrows, indicating that increasing the training data significantly enhances the generalization capability of both models and that neither model suffers from overfitting or underfitting.
However, for any given training set size, the training and validation scores of XGBoost remain consistently higher than those of Random Forest, demonstrating that XGBoost provides superior predictive performance in this task.
4.3.3. Feature Importance Ranking
The larger the importance score of an electrical feature, the greater its contribution to the model’s predictive decision making, indicating a stronger correlation with the target variable. As shown in
Figure 9, active power exhibits the highest importance score (0.422), reflecting its crucial role in characterizing the real power exchanged during oscillations. While active power is highly informative, it should not be regarded as a complete substitute for multi-signal measurements. Other features such as voltage, current, and reactive power together provide a more comprehensive picture of the system’s behavior, making the multi-signal set preferable when available.
In contrast, line current shows the lowest importance score (0.204), possibly because variations in current exert only limited direct influence on the overall system stability; current fluctuations are often mitigated by the system’s internal control and regulation mechanisms. The importance scores of bus voltage and reactive power fall between these two extremes.
From a theoretical perspective, both reactive and active power are products of voltage and current, so their coupling inherently contains the complete voltage–current information. Nevertheless, reactive power receives a lower importance score than active power, which may be attributed to its primary role in voltage regulation rather than in directly reflecting oscillatory behavior. These observations suggest that, although both quantities arise from voltage–current coupling, the strength and nature of that coupling lead to different degrees of relevance to the oscillation dynamics, and thus to different importance scores.
Within each cross-validation iteration, feature importance estimation and threshold-based feature selection are performed using the training folds only. Since the model parameters have already been learned, the threshold serves as the criterion for determining which features are considered important—only those whose importance scores exceed the threshold are retained.
As shown in
Table 2, when the threshold is set to 0.162, all features with importance scores greater than 0.162 are selected, resulting in four retained features: voltage
U, current
I, active power
P, and reactive power
Q. When the threshold is increased to 0.422, only active power
P remains above the threshold and is therefore considered as a single-feature baseline for evaluating model performance.
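A minimal sketch of the threshold rule; only the importance scores of P (0.422) and I (0.204) are stated in the text, so the U and Q values below are illustrative placeholders:

```python
# Hypothetical importance scores: P and I are from the text, U and Q are
# illustrative values between those two extremes.
importances = {"P": 0.422, "Q": 0.190, "U": 0.184, "I": 0.204}

def select_features(scores, threshold):
    """Retain only features whose importance reaches the threshold."""
    return sorted(name for name, s in scores.items() if s >= threshold)

print(select_features(importances, 0.162))  # all four features retained
print(select_features(importances, 0.422))  # only P survives
```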
The results in
Table 2 reveal that using three features (U, P, Q) yields even higher accuracy than using all four features (U, I, P, Q). This suggests that reducing the feature set not only decreases model complexity and accelerates training, but also improves prediction accuracy by removing less informative features. When multi-signal measurements are available, the combination (U, P, Q) should therefore be prioritized, as it yields the most robust and accurate model. When resources are limited, active power P can serve as a baseline, with the explicit caveat that performance degrades when it is used alone: with only feature P selected, the model achieves an accuracy of 96.54%, approximately 1% lower than with all features. Using P alone thus incurs a measurable performance loss compared with the best-performing multi-feature input (U, P, Q), although it remains a reasonably competitive baseline.
Therefore, (U, P, Q) is preferred when multi-signal measurements are available, whereas P alone is reported only as a lightweight baseline (or a fallback option under strict measurement or communication constraints), with the associated performance loss explicitly acknowledged.
Overall, a leakage-free evaluation pipeline is adopted. For each fold of the five-fold cross-validation, the dataset is split into training and test folds. All preprocessing steps, including normalization, VAE training, latent feature aggregation, feature importance estimation, and threshold-based feature selection, are performed exclusively on the training folds. The trained models are then evaluated on the corresponding test fold. Final performance metrics are obtained by averaging the results across all folds.
4.4. Ablation Study and Analysis
This section conducts a two-level ablation study to validate (i) the effectiveness and interpretability of the proposed VAE–PCA feature engineering scheme and (ii) the independent contribution of each module in the overall pipeline. All experiments are performed on the same simulation dataset using an identical training strategy and a nested five-fold cross-validation protocol to ensure a fair comparison.
We first compare different feature construction strategies under a controlled setting where each method produces a four-dimensional feature vector, and each dimension explicitly corresponds to one electrical quantity in {P, Q, U, I}. Specifically, three feature sets are considered:
Baseline A (Mean Features): The arithmetic mean of each 500-point time series is used as the feature value, yielding (x̄_P, x̄_Q, x̄_U, x̄_I).
Baseline B (Damping-Coefficient Features): The Prony method is applied to the oscillatory response of each quantity to extract the damping coefficient of the dominant mode, yielding (σ_P, σ_Q, σ_U, σ_I).
Proposed (VAE-PCA Features): Following the two-stage procedure in
Section 4.2, we extract (f_P, f_Q, f_U, f_I), where each f is an aggregated latent descriptor that preserves a direct correspondence to the original quantity.
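The two baseline feature constructions can be sketched as follows; the toy decaying series and the window amplitudes are illustrative inputs, not results from the paper:

```python
import math

def mean_feature(series):
    """Baseline A: arithmetic mean of the 500-point time series."""
    return sum(series) / len(series)

def damping_feature(a1, a2, dt):
    """Baseline B: dominant-mode damping coefficient, sketched here from the
    modal amplitudes a1, a2 of two adjacent windows separated by dt seconds."""
    return math.log(a2 / a1) / dt

# Toy decaying waveform sampled at 1000 Hz for 500 points
series = [math.exp(-2.0 * k / 1000.0) for k in range(500)]
print(round(mean_feature(series), 3))
print(round(damping_feature(1.0, math.exp(-0.5), 0.25), 3))  # -2.0
```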
Table 3 reports the average performance. The results show that VAE–PCA features consistently outperform both intuitive statistics and physics-inspired damping descriptors, indicating that the proposed representation captures more discriminative dynamics while maintaining per-quantity interpretability.
To quantify the marginal benefit of each component—participation-factor (PF) screening, VAE-based representation learning, PCA-based aggregation, and the XGBoost-based ranking/classification stage—we construct several pipeline variants by progressively removing or replacing key modules, while keeping the dataset, training protocol, and evaluation procedure unchanged.
We consider four representative configurations: (A) an end-to-end 1D-CNN that directly ingests the raw wide-band oscillation waveforms of size 4 × 500, serving as a deep-learning baseline without physics priors or explicit feature engineering; (B) PF + XGBoost, which bypasses the VAE stage and feeds PF-screened signals into XGBoost to isolate the contribution of representation learning; (C) PF + VAE + SVM, which replaces XGBoost with SVM to assess the discriminability of the learned latent representations without XGBoost’s nonlinear ranking capability; and (D) the full proposed pipeline, PF + VAE–PCA + XGBoost. The results are summarized in
Table 4.
The results demonstrate that each module contributes measurably to the final performance. In particular, removing the VAE stage leads to a clear degradation, while replacing XGBoost with SVM also reduces performance, confirming the necessity of combining physics-guided screening with learned representations and robust nonlinear ranking/classification.
4.5. Generalization Study on the IEEE 39-Bus System
To further evaluate the generalizability of the proposed framework beyond the four-machine two-area benchmark, an additional case study is conducted on the IEEE 39-bus system. This system is widely adopted as a representative large-scale interconnected power network and exhibits more complex topology, diverse generator dynamics, and richer modal interactions, thereby providing a suitable testbed for assessing scalability to large and heterogeneous power systems.
In this study, the IEEE 39-bus system is configured to emulate a renewable-rich operating scenario by replacing a subset of conventional synchronous generators with renewable energy units interfaced via power electronic converters. Importantly, the sample construction procedure, disturbance design, sampling strategy, and labeling rule are kept exactly the same as those described in
Section 4.1 to ensure a fair and consistent evaluation across different systems. Specifically, periodic sinusoidal disturbances are injected at selected generator buses, acting either on the mechanical torque or the terminal voltage. The system load level varies from 90% to 110% of the nominal value (step: 5%), the disturbance amplitude ranges from 0.1 p.u. to 0.5 p.u. (step: 0.05 p.u.), and the disturbance frequency spans two bands (10–40 Hz and 70–100 Hz) with a resolution of 0.2 Hz.
Time-domain simulations are performed with a fixed time step of 1 ms, corresponding to a sampling frequency of 1000 Hz. For each disturbance scenario, 500 consecutive samples (a 0.5 s window) of terminal active power P, reactive power Q, voltage magnitude U, and current I are recorded. Sample labels are generated using the same Prony-based damping criterion as in the benchmark system: samples with positive damping are labeled stable (label “1”), while those with non-positive damping are labeled unstable (label “0”).
The proposed two-stage feature extraction pipeline (VAE encoding followed by PCA-based aggregation) and the XGBoost classifier are applied without any modification to the network architecture, hyperparameters, or training strategy. A leakage-free five-fold cross-validation protocol is adopted. In each fold, all preprocessing steps—including normalization, VAE training, PCA-based aggregation, feature-importance estimation, and threshold-based feature selection—are fitted exclusively on the training folds and then applied to the corresponding test fold.
Table 5 summarizes the classification performance on the IEEE 39-bus system, along with the benchmark results for comparison. The results show that the proposed method maintains high performance on a larger and more heterogeneous system. Moreover, the feature-importance ranking exhibits the same trend as that observed in the four-machine two-area system: active power remains the most influential feature, followed by voltage magnitude and reactive power, while current is relatively less informative. Although a slight performance degradation is observed due to the increased system complexity and richer modal interactions, the decrease is marginal, supporting the scalability and robustness of the proposed framework.