1. Introduction
Distribution system state estimation (DSSE) is a key component of modern power-system operation. It is critical to ensuring stable and efficient grid management [
1]. The existing state estimation (SE) techniques are widely adopted in transmission systems, but they face notable challenges when they are applied to distribution networks [
2]. First, the core challenge of DSSE lies in dynamic changes in network topology. In practice, the topology of a distribution network changes with load switching, line maintenance, and other events. Traditional methods were originally designed for simple transmission networks, so the grid states they predict become invalid when the network structure changes [
1]. Second, unlike transmission systems, which rely on a large number of redundant measurements, distribution systems often require higher accuracy and real-time performance. In distribution systems, the strongly nonlinear relationships between voltage drop and power loss make it difficult to apply traditional algorithms directly. Moreover, distribution systems have insufficient measurement data, so traditional methods produce large estimation errors under complex operating conditions [
3]. Furthermore, the problem of persistent observability in distribution systems remains unsolved. Due to cost constraints, distribution networks often deploy measurement equipment only at key nodes. Therefore, a large number of nodes must rely on pseudo-measurements derived from historical data or predictive models, which introduces further uncertainty [
4].
Consider the weighted least squares (WLS) method, the prevailing SE technique in transmission systems. Under ideal conditions—where measurement noise is purely white and Gaussian—WLS acts as an unbiased minimum-variance estimator, delivering optimal performance [
5]. However, in distribution systems, measurement noise amplifies with network uncertainty, directly compromising the observability of active distribution networks [
6,
7]. Moreover, WLS suffers from two critical drawbacks for DSSE: its iterative solution is computationally intensive, and its performance degrades drastically under noisy measurements [
8]. To mitigate these issues, researchers have proposed robust variants such as the maximum normalized residual test [
9], weighted least absolute values [
10], least median of squares [
11], and least trimmed squares [
12]. Despite these advancements, these models remain constrained by parametric dependencies and unresolved convergence/sensitivity tradeoffs, limiting their practical applicability to distribution systems with sparse measurements.
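To make the iterative nature of WLS concrete, the following is a minimal sketch of a Gauss-Newton WLS loop for a single state variable. The toy measurement model (a voltage reading plus a power reading on a hypothetical resistive load `R_LOAD`), the function name `wls_estimate`, and the noise levels are illustrative assumptions and are not taken from the cited works.

```python
# Toy Gauss-Newton WLS for one state variable v (voltage magnitude, p.u.).
# Assumed measurement model (illustrative only):
#   h1(v) = v              voltage reading
#   h2(v) = v**2 / R_LOAD  power drawn by a hypothetical resistive load

R_LOAD = 2.0  # hypothetical load resistance (p.u.)


def h(v):
    """Measurement functions evaluated at state v."""
    return [v, v * v / R_LOAD]


def jacobian(v):
    """Derivatives dh_i/dv of the two measurement functions."""
    return [1.0, 2.0 * v / R_LOAD]


def wls_estimate(z, sigma, v0=1.0, tol=1e-10, max_iter=20):
    """Minimise sum(((z_i - h_i(v)) / sigma_i)**2) with Gauss-Newton steps."""
    v = v0
    weights = [1.0 / s ** 2 for s in sigma]  # W = diag(1 / sigma_i^2)
    for iteration in range(1, max_iter + 1):
        residual = [zi - hi for zi, hi in zip(z, h(v))]
        H = jacobian(v)
        gain = sum(w * Hi * Hi for w, Hi in zip(weights, H))               # H^T W H
        rhs = sum(w * Hi * ri for w, Hi, ri in zip(weights, H, residual))  # H^T W r
        step = rhs / gain
        v += step
        if abs(step) < tol:
            return v, iteration
    raise RuntimeError("WLS did not converge")


# Noise-free measurements generated at v = 0.98 p.u.
v_hat, n_iter = wls_estimate(z=h(0.98), sigma=[0.01, 0.02])
```

Each iteration re-linearizes the measurement model and solves the weighted normal equations, which is the repeated cost referred to above; in a full network the scalar division becomes a sparse linear solve.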
The Kalman filter (KF), leveraging temporal correlation through state-space modeling, offers an alternative by incorporating prior state estimates as auxiliary information to improve accuracy and convergence speed under low-observability scenarios [
13,
14]. However, its reliance on linear system assumptions clashes with the highly non-linear nature of real-world power systems, where power-flow equations exhibit strong non-linearities that degrade robustness and estimation accuracy in practical DSSE applications [
15].
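As a minimal illustration of how the KF reuses the prior estimate, one predict/update cycle for a scalar state can be sketched as follows. The random-walk state model, the noise variances `q` and `r`, and the function name `kalman_step` are assumptions chosen for illustration only.

```python
def kalman_step(x_prev, p_prev, z, a=1.0, h=1.0, q=1e-4, r=1e-2):
    """One predict/update cycle of a scalar linear Kalman filter.

    Assumed model: x_k = a * x_{k-1} + w_k,  z_k = h * x_k + v_k,
    with process-noise variance q and measurement-noise variance r.
    """
    # Predict: propagate the previous estimate and its variance.
    x_pred = a * x_prev
    p_pred = a * p_prev * a + q
    # Update: blend the prediction with the new measurement.
    k_gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + k_gain * (z - h * x_pred)
    p_new = (1.0 - k_gain * h) * p_pred
    return x_new, p_new


# Track a slowly varying voltage magnitude from four noisy readings.
x, p = 0.98, 1.0
for z in [0.97, 0.98, 0.99, 0.98]:
    x, p = kalman_step(x, p, z)
```

The update is exact only because `a` and `h` are constants; with nonlinear power-flow equations these would have to be replaced by local linearizations, which is the source of the robustness issue noted above.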
To address the problem of system observability, transmission grids rely on dense sensor deployments for real-time state measurements, a strategy that directly fulfills observability requirements [
16]. However, this approach is infeasible for DSSE, where the limited number of installed measuring instruments leads to insufficient data redundancy. Specifically, the scarcity of measurements creates an underdetermined problem, making it challenging to iteratively derive accurate state estimates from sparse data—a fundamental observability limitation in DSSE [
17].
To mitigate this issue, pseudo-measurements are employed. They are generated through observability-driven meter placement techniques: first, the unobservability index (UI) is calculated using information entropy to identify critical measurement locations; then, instruments are installed based on UI rankings. Historical data are then leveraged to derive predictive values for unmeasured states, a strategy that significantly reduces hardware costs compared to full sensor retrofitting. Despite these advantages, the generalizability of this method remains unvalidated for two critical scenarios: unbalanced multi-phase distribution systems and multi-time instance state estimation, where temporal dynamics and phase asymmetry introduce additional complexities.
Owing to the limitations of traditional methods and the continuous advancement of artificial intelligence, a growing body of research has focused on applying deep learning to DSSE tasks. The limitations of traditional DSSE methods in handling measurement sparsity, topological dynamism, nonlinear modeling, and multi-source data fusion stem essentially from their model-driven logic, which relies on manual simplifications and assumptions about the physical laws governing power grids. However, the complexity of modern distribution networks, characterized by high-penetration distributed generation, flexible loads, and multi-source data, has exceeded the limits of manual modeling.
Centered on a data-driven paradigm, deep learning can specifically address the intractable challenges faced by traditional methods through its capabilities such as automatic feature extraction, topological adaptability, robust nonlinear fitting, and multi-source data fusion. As such, it has become an indispensable technical approach used to enhance the accuracy, efficiency, and adaptability of DSSE. Its indispensability is evident not only in its use to solve existing problems but also in that it provides an extensible technical framework for the refined state estimation of future distribution networks.
Deep learning frameworks, which are capable of adaptively learning task-specific patterns from DSSE data, have demonstrated promising performance in recent studies. Current approaches include leveraging artificial neural networks (ANNs) to model specific power-system components alongside convolutional neural networks (CNNs) [
18], multi-layer perceptrons (MLPs) [
19], generative adversarial networks (GANs) [
20], and Bayesian networks [
15,
21]. However, a critical bottleneck persists: the effectiveness of these neural networks hinges on the availability of large-scale, high-quality training datasets. For DSSE—where real-world measurements are often sparse, noisy, or incomplete—the challenge of ensuring data validity (e.g., addressing missing values, bad data, and non-stationary distributions) remains unresolved. While recent works attempt to mitigate this by incorporating inductive biases (e.g., topological priors) or physics-informed constraints [
22,
23], these hybrid methods still require substantial amounts of labeled data to achieve reliable generalization, limiting their applicability in scenarios with limited measurement resources.
Given the limitations of purely model-based or data-driven approaches, hybrid frameworks integrating physical insights with data-driven learning have emerged as a promising direction for DSSE [
13]. Zhang et al. [
24] proposed embedding physical regularization terms into deep neural networks to enforce power-system constraints, while Kumar et al. [
25] developed an artificial neural network tailored for measurements corrupted by non-Gaussian noise and bad data, although their method requires prior knowledge of precise equipment states and measurement baselines. Rui et al. [
26] introduced a Tapered DNN architecture that incorporates maximum entropy principles into DSSE, achieving accurate estimation via layer-wise unsupervised feature learning. Duan et al. [
27] adopted a hybrid DL–ML strategy, fusing CNN and random forest models to extract dynamic features from time-series measurements. Gotti et al. [
28] proposed a PCA–DBN framework: principal component analysis isolates noise-robust features, which are then fed into a deep belief network for topological structure identification; this approach demonstrates strong resilience to data loss and measurement noise. Ostrometzky et al. [
29] developed a physics-informed dynamic DSSE framework that uses power-flow equations as regularization constraints, while Wang et al. [
30] replaced the decoder of an autoencoder with a physical model to enable hybrid state estimation.
However, a common limitation of these methods is their neglect of explicit network-topology modeling, a critical shortcoming given the structural complexity of distribution grids. Graph Neural Networks (GNNs), by contrast, leverage network topology as an inductive bias, inherently addressing the curse of dimensionality and demonstrating robust performance under topological perturbations [
31,
32]. Recent advances include EleGNN, as presented by Liu et al. [
33], which improves traditional GNNs by incorporating physical connectivity and using node-edge feature propagation to model complex grid interactions. Madbhavi et al. [
34] designed a GNN-based estimator that takes measurement matrices/tensors as input, introducing feature scaling and a pseudo-measurement generation module to improve generalization. Ngo et al. [
35] further integrated physical domain knowledge with GNN architectures, allowing more effective processing of structural data to capture latent dependencies of the topological state.
In distribution-system operations, frequent topological changes [
36] pose a critical challenge: retraining models to adapt to new configurations incurs substantial time and computational costs. Moreover, relying exclusively on graph structures for modeling often results in loss of critical node-specific attributes, as distribution systems are inherently multi-source and heterogeneous—each component carries rich attribute information beyond purely structural connections [
1]. To address the limitations of traditional graph neural networks, which struggle to balance topological extraction and multi-source feature modeling in heterogeneous environments, this paper employs General Attributed Multiplex Heterogeneous Network Embedding (GATNE) [
37]. By integrating diverse node attributes and structural multiplexity, GATNE effectively captures the nonlinear dependencies within distribution systems, overcoming the information loss that occurs in purely topological modeling. This approach enables the model to dynamically adapt to topological variations without full retraining while leveraging heterogeneous attributes to enhance the accuracy and robustness of DSSE. These features are critical to handling complex, real-world scenarios in distribution grids.
To tackle the critical dependency of DSSE on data quality, this paper integrates soft power-flow equations into the loss function, enforcing physical consistency by penalizing predictions that deviate from power-flow constraints. Unlike hard-constraint methods, this approach softly regularizes outputs to lie within the feasible operating region defined by power-flow dynamics, discarding implausible solutions that violate fundamental electrical laws. This ensures not only that the model’s estimates are mathematically consistent but also that they maintain engineering viability, significantly enhancing robustness in handling noisy, incomplete, or uncertain real-world data.
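A soft power-flow penalty of this kind can be sketched as below. This is a minimal illustration, not the loss used in this paper: the mean-squared form of the penalty, the weight `lam`, the two-bus example, and the function names are all assumptions.

```python
def injections(v, y_bus):
    """Complex power injected at each bus: S_i = V_i * conj(sum_j Y_ij * V_j)."""
    n = len(v)
    result = []
    for i in range(n):
        current = sum(y_bus[i][j] * v[j] for j in range(n))
        result.append(v[i] * current.conjugate())
    return result


def power_flow_penalty(v, y_bus, s_spec, lam=1.0):
    """Soft constraint: mean squared power-balance mismatch, scaled by lam."""
    mismatch = [s_calc - s for s_calc, s in zip(injections(v, y_bus), s_spec)]
    return lam * sum(abs(m) ** 2 for m in mismatch) / len(mismatch)


# Hypothetical two-bus feeder with line impedance 0.01 + 0.05j p.u.
y_line = 1 / (0.01 + 0.05j)
y_bus = [[y_line, -y_line], [-y_line, y_line]]

v_true = [1.0 + 0.0j, 0.98 - 0.02j]
s_true = injections(v_true, y_bus)        # injections consistent with v_true

v_bad = [1.0 + 0.0j, 0.90 - 0.02j]        # estimate that violates power balance
penalty_ok = power_flow_penalty(v_true, y_bus, s_true)
penalty_bad = power_flow_penalty(v_bad, y_bus, s_true)
```

Because the penalty is added to the data-fitting loss rather than enforced exactly, an estimate such as `v_bad` is discouraged in proportion to its mismatch instead of being rejected outright.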
Additionally, a cross-modal attention module is proposed to model the intricate interactions between input measurements and edge features, explicitly capturing the latent relationships between observed data and topological connectivity. The model adaptively weights the informative characteristics in heterogeneous modalities during the state estimation by merging the measurement inputs and the topological graph G within the GATNE framework. The output embeddings are fed into a power-flow constraint layer, which acts as a physics-informed filter to refine predictions against actual grid dynamics. Key contributions of this work include the following:
(1) A GATNE-based DSSE architecture is proposed. By modeling the multi-source heterogeneous structure of distribution systems and enabling inductive learning using measurement data, it can effectively capture the nonlinear relationships between nodes, thereby improving the model’s accuracy and robustness.
(2) A cross-modal attention module is proposed to learn the correlations between model inputs and topological structure attributes; these correlations help the model mine hidden features, thereby improving its accuracy in DSSE.
(3) The power-flow equations are introduced into the neural network architecture, combining the characteristics of data-driven models and restricting admissible solutions within a certain range to ensure that the model’s output aligns with the objective laws of real-world physical scenarios, thereby enhancing the robustness and generalization capability of model predictions.
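The cross-modal attention in contribution (2) can be pictured with the following sketch. It assumes standard scaled dot-product attention in which measurement embeddings act as queries and edge-feature embeddings act as keys and values; the learned projection matrices are omitted, and all names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and mixes the value rows."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical embeddings: 2 measurement vectors (queries), 3 edge vectors (keys/values).
measurement_emb = [[1.0, 0.0], [0.5, 0.5]]
edge_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(measurement_emb, edge_emb, edge_emb)
```

Each row of `attn` is a probability distribution over edges, so every fused vector is a weighted average of edge features, with the weights determined by how well each edge matches the measurement.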
This paper introduces the proposed method in
Section 2 and
Section 3, which respectively describe the application of the GATNE model to this task.
Section 4 presents a case study that compares the performance of baseline algorithms and other data-driven models. Finally,
Section 5 concludes the work.
4. Case Study
This paper tests the proposed method on the 14-bus CIGRE MV distribution grid shown in
Figure 4a, with photovoltaic (PV) and wind distributed energy resources (DER) enabled [
39], the 179-bus Oberrhein grid (
Figure 4b), and the 70-bus Oberrhein MV/LV sub-grid (
Figure 4c) to demonstrate the effectiveness of the proposed method [
40]. A visualization of test cases is shown in
Figure 4.
4.1. Experiment Setups
To capture realistic demand dynamics, 8640 hourly load samples were collected over a representative period of one year. Each scenario comprises 24 consecutive hourly snapshots, which reflect diurnal load cycles. These scenarios were synthesized by Monte Carlo perturbation of standard load curves, incorporating a 15% uncertainty margin to emulate both forecasting and measurement errors [
41].
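The scenario synthesis described above can be sketched as follows. The paper states only the 15% margin, so the uniform sampling distribution, the sinusoidal base curve, and the function name `make_scenario` are assumptions. The count of 360 scenarios follows from 8640 hourly samples divided into 24-hour scenarios.

```python
import math
import random


def make_scenario(base_curve, rng, margin=0.15):
    """Scale each hourly value by a factor drawn uniformly from [1 - margin, 1 + margin]."""
    return [p * rng.uniform(1.0 - margin, 1.0 + margin) for p in base_curve]


# Hypothetical 24-hour standard load curve (p.u.), peaking around midday.
base_curve = [0.6 + 0.4 * math.sin(math.pi * hour / 24) for hour in range(24)]

rng = random.Random(0)                      # fixed seed for reproducibility
n_scenarios = 8640 // 24                    # 360 daily scenarios
scenarios = [make_scenario(base_curve, rng) for _ in range(n_scenarios)]
```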
All simulations assume steady-state operation, with AC power flows solved using PandaPower [
39] under Python 3.10. MATLAB has significant advantages in numerical computation and visualization, making it suitable for simulations of small-scale distribution networks, but it is less flexible than Python for training deep learning models. SPICE focuses on circuit-level simulation and adapts poorly to system-level state estimation of distribution networks. Based on these considerations, this study uses Python 3.10 and PyTorch 2.7.0 for the experiments. The experimental environment consists of an Intel Core i7-13700KF CPU, a single NVIDIA RTX 4090 GPU, 32 GB of RAM, a 2 TB SSD, and the Windows 11 operating system. Distribution-system bus injections are often dominated by spurious measurements with accuracies below 50%, so this study adopts a conservative 1% error bound for voltage readings. Additive zero-mean white noise is applied across all sensors [
42], yielding deviations of 0.5–2.0% in voltage and current measurements and 1–5% in active and reactive power injections. These modeling choices ensure that our DSSE evaluation faithfully reflects the uncertainty levels encountered in real-world distribution networks.
The dataset is divided into a training set, a validation set, and a test set in a ratio of 8:1:1. Here, z denotes the input at the measurement locations, and the complete state of the system is represented by the label y.
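An 8:1:1 split can be sketched as below. Splitting at the scenario level and shuffling before the split are assumptions, as the text does not say how samples are assigned; the function name `split_811` is hypothetical.

```python
import random


def split_811(items, seed=0):
    """Shuffle and split a collection into train/validation/test parts in an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Example: indices of 360 daily scenarios.
train_set, val_set, test_set = split_811(range(360))
```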
One metric of the experiment is Root Mean Squared Error (RMSE) [
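43]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value of the i-th observation, and n is the total number of observations.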
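The RMSE metric described above can be computed with a few lines of Python. The function name `rmse` and the sample values are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical bus voltages in p.u.: true values versus estimates.
error = rmse([1.00, 0.98, 0.97], [1.01, 0.98, 0.96])
```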
The comparative models selected in this paper include the standard WLS state estimator, a supervised ANN model, a Message Passing Neural Network (MPNN) [
44], and the GATNE algorithm proposed in this paper. All these methods are implemented in PyTorch. The hyperparameter penalty factors were set as follows: batch size, 64; dropout rate, 0.4; learning rate
, 0.003; and soft limitation
fixed. The Adamax optimizer is adopted, the grid search range is
, the layer dimension is
, and the number of layers is set as
. In GATNE, the model is initialized with the Xavier uniform distribution, the dimension is set to 40, and bus voltage magnitudes and active/reactive power injections are taken as input features. Additionally, the number of attention heads is set to four; for message aggregation, neighborhood mean pooling is adopted along with a learnable matrix. The selected hyperparameters are shown in
Table 2.
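Neighborhood mean pooling with a learnable matrix can be sketched as follows. This is a simplified illustration: the activation function is omitted, isolated nodes are assumed to receive a zero message, and the graph, features, and weight values are hypothetical.

```python
def mean_aggregate(features, neighbors, weight):
    """For every node, average the neighbor features and apply a linear map W."""
    dim_in = len(features[0])
    out = []
    for node in range(len(features)):
        nbrs = neighbors.get(node, [])
        if nbrs:
            pooled = [sum(features[j][c] for j in nbrs) / len(nbrs) for c in range(dim_in)]
        else:
            pooled = [0.0] * dim_in  # isolated node: assumed zero message
        out.append([sum(w * x for w, x in zip(row, pooled)) for row in weight])
    return out


# Hypothetical 3-bus path graph 0 - 1 - 2 with two features per bus.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weight = [[1.0, 0.0], [0.0, 2.0]]          # stand-in for the learnable matrix
aggregated = mean_aggregate(features, neighbors, weight)
```

In training, `weight` would be a parameter updated by the optimizer; here it is fixed so that the pooling step can be checked by hand.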
4.2. Comparison Experiments
To evaluate the scalability and robustness of our approach, this paper conducts case studies on the 14-bus CIGRE system and the 70-bus and 179-bus Oberrhein systems.
Table 3 summarizes these studies.
As shown in
Table 3, on the small-scale 14-bus CIGRE grid, the voltage RMSE of GATNE is
, while that of the traditional WLS algorithm is
. GATNE’s voltage RMSE is only 45.3% of WLS’s. Compared with MPNN, which has a voltage RMSE of
, GATNE achieves better performance. This may be because MPNN treats all edge features homogeneously, whereas GATNE can better represent network heterogeneity. The hand-designed ANN uses a multi-layer perceptron for prediction, which is essentially the superposition of multiple linear fitting functions. With a sufficient number of layers, an ANN can readily perform well on the voltage metric. However, on other indicators such as line-loading RMSE, the ANN scores 41.38%. This is likely because a purely data-driven ANN does not incorporate physical constraints, so its branch power estimates deviate from actual operating laws. In contrast, GATNE integrates soft power-flow residual regularization, which significantly reduces the deviation in line-loading estimation. To present the data in more detail,
Figure 5a displays the voltage RMSE for each bus and the RMSE loading for each line. It can be observed that the voltage RMSE of GATNE is below the green dashed line (0.5) for all buses, indicating that GATNE meets the qualified performance standard.
Additionally,
Figure 5b shows the line-loading RMSE for different lines. GATNE learns coupled data better than WLS and the ANN, possibly because cross-modal attention represents coupled data more effectively. However, the RMSE increases at line indices 12 and 13, likely due to oversimplified modeling of transformers, which results in accuracy loss.
In addition to the above metrics, convergence speed, accuracy, and computation time are also critical indicators, as shown in
Table 4. In the 70-bus system, the voltage RMSE of GATNE (
) is 92.6% lower than that of WLS (
) and 49.1% lower than that of MPNN (
). In the 179-bus system, the voltage RMSE of GATNE (
) is 29.9% that of WLS (
) and 41.9% lower than that of MPNN (
). In large-scale networks, GATNE’s cross-modal attention mechanism can fuse multi-source heterogeneous data such as distributed generation output and load fluctuations more efficiently. In contrast, MPNN relies solely on node message passing, making it difficult to handle information redundancy and noise interference in complex networks. In addition, in terms of computational efficiency and convergence rate, WLS copes less well as the network scale expands and is prone to iterative divergence. The computational complexity of GATNE’s graph-embedding process scales linearly with the number of nodes, making it more suitable for modern distribution-network architectures.
The proposed GATNE algorithm outperforms WLS in all metrics. To validate the accuracy of the GATNE algorithm, this paper selects measured buses 24 and 85 in the 70-bus system to estimate voltage levels under normal sampling conditions using both WLS and GATNE, as shown in
Figure 6.
As shown in
Table 3 and
Table 4, the proposed GATNE reduces the computation time by factors of 13, 4, and 24 compared to WLS on the 14-bus, 70-bus, and 179-bus systems, respectively. The GATNE model also outperforms both WLS and MPNN on the large-scale grid datasets (70-bus and 179-bus) across the reported metrics. Specifically, the voltage RMSE for GATNE is 2.91% on the 70-bus dataset and 3.28% on the 179-bus dataset, while MPNN achieves 4.36% and 5.11%, respectively. Both neural network models outperform the traditional algorithm, possibly because neural networks can fit the data effectively. In contrast, the WLS algorithm, which is sensitive to measurement redundancy and noise, performs poorly.
In addition to the RMSE metrics, the convergence rate is also an important indicator of whether the algorithm can converge to the optimal solution within a given time. As shown in the table, GATNE achieves a convergence rate of 100% on both datasets, meaning that GATNE can quickly and stably find the optimal solution in each run, ensuring accuracy and stability in computation.
In contrast, the WLS method has a convergence rate of only 25% on the 70-bus Oberrhein dataset, indicating significant convergence issues on large-scale datasets. A likely reason is that, for larger systems, WLS with the Newton–Raphson iterative method requires more iterations to converge. Although MPNN also achieves a convergence rate of 100%, GATNE demonstrates more stable convergence performance and is overall superior.
Minimum voltage and total power loss are core metrics for evaluating system operational safety and economic efficiency. Because the largest estimation errors occur in the WLS method, its calculated minimum voltage is the lowest across all three datasets, and it simultaneously yields the highest total power loss. The typical voltage safety threshold is 0.95 p.u., yet WLS results consistently fall below this critical value in all three cases.
In contrast, the GATNE-based method proposed in this work delivers state estimates closest to the true system state. Consequently, it achieves the highest minimum voltage values, aligning with physical expectations. Furthermore, WLS generates the highest total power-loss estimates, which may mislead dispatchers during optimization decision-making. Conversely, GATNE computes the lowest power loss values, significantly enhancing operational reliability.
As the network scale expands and noise inputs increase, WLS iterations become harder to converge, leading to a significant rise in computation time. Ultimately, owing to its superior accuracy in state estimation, the GATNE method derives more precise and robust operational metrics, demonstrating clear engineering advantages.
4.3. Noise Experiment
To verify the robustness of the proposed model, measurement performance under noise interference is compared on the 70-bus network. Gaussian noise with a standard deviation of
is added directly to the measured values at three noise levels. The default level applies 1% noise to the voltage and current measurements and 2% noise to the active and reactive power measurements; the low level uses 0.5% and 1%, and the high level uses 3% and 5%. Under these three conditions, the traditional WLS algorithm is compared with the GATNE algorithm. The estimates of WLS and GATNE at bus 24 are compared in
Figure 7.
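The three noise levels can be reproduced with a sketch like the one below. Interpreting each percentage as a standard deviation relative to the true value is an assumption, as are the function names and sample measurements.

```python
import random

# (voltage/current level, active/reactive power level) as stated in the text.
NOISE_LEVELS = {
    "low": (0.005, 0.01),
    "default": (0.01, 0.02),
    "high": (0.03, 0.05),
}


def add_gaussian_noise(values, rel_sigma, rng):
    """Additive zero-mean Gaussian noise with std = rel_sigma * |value| (assumed scaling)."""
    return [v + rng.gauss(0.0, rel_sigma * abs(v)) for v in values]


def corrupt(voltages, powers, level, seed=0):
    """Apply the chosen noise level to voltage and power measurements."""
    rng = random.Random(seed)
    sigma_vi, sigma_pq = NOISE_LEVELS[level]
    return (add_gaussian_noise(voltages, sigma_vi, rng),
            add_gaussian_noise(powers, sigma_pq, rng))


# Hypothetical clean measurements in p.u.
clean_v = [1.00, 0.98, 0.97]
clean_p = [0.50, -0.20, 0.35]
noisy_v, noisy_p = corrupt(clean_v, clean_p, "high")
```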
In
Figure 7, it can be seen that GATNE effectively suppresses the noise, while WLS is vulnerable to it, resulting in a large voltage deviation. In addition, the voltage RMSE and line-loading RMSE of GATNE and WLS are shown in
Figure 8. The figure shows that GATNE is more robust in the presence of noise.
4.4. Missing Values and Error Measurements Experiments
To further verify the robustness of GATNE, this paper studies the stability of different algorithms when missing values and erroneous measurements occur in the 70-bus network. The experimental procedure follows [
1]. Under the same settings, as shown in
Figure 9, GATNE performs better than WLS, which indicates that GATNE is more robust and less affected by noise, missing values, and incorrect values. This may be because the graph-aggregation algorithm of GATNE analyzes the network topology better and the attention mechanism assigns weights automatically, making GATNE more robust.
4.5. Hyperparameters Analysis
In this section, this paper analyzes the hyperparameters mentioned in
Section 2.3, which mainly include penalty term hyperparameters
and the accuracy deviation
of power-flow equations.
This paper sets the physical penalty term as follows: . This setup aims to strike a balance among physical consistency, computational efficiency, and model stability. If the hyperparameters differ significantly, certain constraints may be over-emphasized or under-emphasized, disrupting the inherent balance among the physical laws. For instance, if positive voltage constraints are set to be far stronger than negative , the model results could be distorted. Additionally, in large-scale systems like the 70-bus and 179-bus networks, tuning individual hyperparameters would significantly increase computational costs.
Then,
is set to 1. As shown in
Table 5, the performance of voltage RMSE on the 14-bus CIGRE dataset under different hyperparameters is presented. In general, adopting a value of
yields the best results. For more detailed optimization, a grid-search method should be employed to select hyperparameter combinations.
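A grid search over hyperparameter combinations can be sketched as follows. The parameter names `penalty_weight` and `tolerance`, the candidate values, and the stand-in objective `toy_objective` are hypothetical; in practice the objective would train the model and return the validation RMSE.

```python
import itertools


def grid_search(evaluate, grid):
    """Return (best_params, best_score) minimising evaluate over the Cartesian product."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Arbitrary stand-in for a validation RMSE; not a result from this paper."""
    return (params["penalty_weight"] - 1.0) ** 2 + (params["tolerance"] - 0.01) ** 2


grid = {
    "penalty_weight": [0.1, 1.0, 10.0],
    "tolerance": [0.001, 0.01, 0.1],
}
best_params, best_score = grid_search(toy_objective, grid)
```

The cost grows with the product of the candidate-list lengths, which is one reason for sharing a single value across penalty terms on the larger networks.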
Additionally, to validate the role of the soft constraints proposed in this paper, a further experiment is performed. Specifically,
Table 6 presents the performance of GATNE on the 14-bus CIGRE dataset under different values of
. As shown in the table, the value of
should be neither too high nor too low. When
is too low, the constraints on the model are weak, leading to larger result deviations. When
is too high, the constraints on the model become overly strict, which also causes a decline in performance.
4.6. Ablation Studies
To validate that the proposed model, adopted modules, and loss function have positive effects, this paper conducts ablation studies on each component. The baseline is a model built with GNN, and the specific data are shown in
Table 7.
The baseline model (with all components disabled) achieves a voltage RMSE of and a line-loading RMSE of 14.60%, reflecting the estimation accuracy when one relies solely on the basic GNN architecture. When the GATNE framework is enabled individually (the second row), voltage RMSE and line-loading RMSE decrease to and 11.23%, respectively, indicating that GATNE optimizes information aggregation between nodes through graph attention mechanisms, significantly enhancing the model’s ability to model the graph structure of power grids. Further introduction of physical soft constraints (the third row) reduces line-loading RMSE substantially to 9.47%, while voltage RMSE drops to . This demonstrates that physical constraints, by embedding prior knowledge such as Kirchhoff’s laws and power conservation, effectively regulate the physical rationality of model predictions, particularly yielding more pronounced optimization for system-level constrained indicators like line loading.
When cross-modal attention is introduced individually (the fourth row), the voltage RMSE decreases to and the line-loading RMSE drops to 10.25%, demonstrating the advantage of this module in integrating multimodal data features such as voltage, current, and line parameters. This suggests that cross-modal interaction plays a more critical role in improving the estimation accuracy of node-level states. When the GATNE framework is combined with physical soft constraints (the fifth row), line-loading RMSE further decreases to 8.06%, showing that data-driven graph modeling and physical priors complement each other on system-level constrained indicators. Conversely, the combination of GATNE and cross-modal attention (the sixth row) reduces voltage RMSE to , indicating that graph structure modeling and multi-source feature fusion achieve synergistic optimization for node-state estimation.
Notably, even without enabling the GATNE framework, the combination of physical soft constraints and cross-modal attention (the seventh row) still reduces both metrics, though performance lags behind configurations including GATNE, highlighting that GATNE serves as the foundational architecture supporting the effectiveness of other modules. When all components are enabled (the eighth row), the voltage RMSE and the line-loading RMSE reach and 7.86%. This validates the synergistic enhancement of the GATNE framework, physical soft constraints, and cross-modal attention. Specifically, GATNE enables efficient graph-structure representation learning, physical constraints ensure that predictions remain consistent with grid operation laws, and cross-modal attention strengthens the interaction between input z and node attributes , collectively yielding a high-precision and robust state estimation model.
In summary, the ablation experiments demonstrate that each component makes an indispensable positive contribution to model performance and that their combination achieves optimal estimation through mechanistic complementarity, thereby providing a reliable modular design basis for state estimation in complex power grid environments.