1. Introduction
With the increasing penetration of distributed generation, renewable resources, and electrified loads, the operating environment of distribution networks has become increasingly complex and uncertain. Photovoltaic units, wind power, and electric vehicles introduce bidirectional and highly fluctuating power flows, while frequent switching operations for routine maintenance, fault isolation, and load transfer lead to continuously changing topologies. These dynamics not only impose stricter requirements on supply reliability but also create new challenges for maintaining energy efficiency and operational health on the distribution side. Consequently, ensuring a secure power supply while improving efficiency has become a central objective in modern distribution network operation and management [
1,
2].
Among various operational indicators, line loss remains a fundamental parameter for assessing the efficiency and health of distribution networks [
3]. While normal losses are mainly caused by resistive dissipation, deviations beyond reasonable ranges indicate abnormal line losses. Unlike conventional methods that merely detect anomalies, line loss anomaly classification provides further insight into their root causes [
4,
5]. Technical categories such as infrastructure degradation, parameter or documentation errors, and metering malfunctions, as well as non-technical causes such as electricity theft, can be distinguished. This categorization is of practical importance: it guides operators toward targeted actions—maintenance, record correction, device replacement, or anti-theft enforcement—rather than generic interventions. Without classification, anomaly sources may be obscured, leading to ineffective remedies, distorted efficiency assessments, and elevated risks in settlement and regulation. Therefore, developing effective methods for line loss anomaly classification is of great significance, as it enhances energy efficiency, improves operational and managerial decision-making, ensures fair metering and settlement, and strengthens the secure and reliable operation of distribution networks [
6,
7].
In recent years, research has gradually shifted from simple anomaly detection toward more refined line loss anomaly classification, which aims not only to identify anomalies but also to reveal their underlying causes [
8,
9]. Existing approaches can be broadly categorized into several groups. Rule-based and statistical methods rely on predefined thresholds or statistical learning of line loss rates to separate technical from non-technical categories [
10,
11]. Physics-based methods employ power flow equations, DistFlow models, or impedance-based calculations to distinguish infrastructure-related issues from discrepancies caused by metering or documentation errors. Data-driven machine learning methods, such as decision trees, random forests, and support vector machines (SVM), learn discriminative patterns from historical operational data, whereas deep learning models, including convolutional networks, autoencoders, and graph neural networks (GNNs), exploit spatiotemporal correlations and topology information to infer fine-grained anomaly categories [
12,
13]. More recently, semi-supervised learning and hierarchical classification frameworks have been introduced to alleviate label scarcity and to align the classification process with practical diagnostic hierarchies in utilities [
14,
15].
Although significant progress has been made in line loss classification, existing approaches still exhibit notable limitations in practical applications. First, most methods remain constrained to a binary separation between technical and non-technical losses, lacking the capability to distinguish finer-grained categories such as infrastructure degradation, documentation errors, metering faults, and theft, which limits their operational usefulness [
16]. Second, current models typically rely on fixed topology assumptions, whereas real distribution networks undergo frequent reconfigurations due to maintenance, load transfer, and fault isolation, making static models unsuitable for dynamic environments. Third, the uncertainty of measurement data, including noise, missing values, and inconsistent device accuracy, is often insufficiently modeled, undermining the stability of classification outcomes under complex conditions [
17]. Moreover, many approaches ignore physics consistency constraints during training, relying purely on data-driven features, which may lead to predictions that contradict power balance or loss mechanisms [
18,
19]. Finally, interpretability remains a critical challenge, as most classifiers can only output labels without providing transparent “class–evidence” explanations, thereby reducing their value for practical decision-making [
20,
21,
22].
These limitations highlight the urgent need for a line loss classification framework that can achieve multi-class granularity, account for topology dynamics and data uncertainty, incorporate physics-consistent constraints, and deliver interpretable results, in order to meet the comprehensive requirements of robustness, reliability, and operational applicability in distribution networks.
To address the limitations of existing line loss classification methods—namely, their coarse binary separation between technical and non-technical losses, reliance on fixed topology assumptions, insufficient modeling of measurement uncertainty, and lack of interpretability—this paper proposes a physics-consistent spatiotemporal graph neural network framework, as shown in
Figure 1, termed PC-LossGNN. In the encoding stage, the framework integrates static priors with dynamic topology features, introduces power-flow-based physical regularization, and employs residual-guided attention to extract discriminative spatiotemporal features under noisy and incomplete measurements. In the decoding stage, a dual-path classifier is designed, combining Softmax-based probabilistic prediction with prototype embedding, to enable fine-grained classification of five categories: normal, infrastructure anomaly, documentation anomaly, metering anomaly, and theft. Furthermore, physics-informed constraints derived from power balance, voltage drop, and Ohmic loss residuals are incorporated, while a curriculum-inspired optimization strategy progressively reinforces constraint learning, thereby improving robustness and ensuring physical consistency.
The methodological novelty of PC-LossGNN does not lie in independently using adaptive graph learning, spatiotemporal graph convolution, prototype classification, or physics-informed regularization. Instead, it lies in reformulating line-loss anomaly classification as a physics-constrained edge-level diagnostic task, where topology adaptation, spatiotemporal representation learning, residual-guided attention, and class-evidence decoding are jointly coupled around the physical mechanisms of line-loss deviations. Compared with generic STGNN-based anomaly classifiers, PC-LossGNN explicitly links the learned graph representation to branch power-flow residuals, embeds physical consistency into the training objective rather than only using physics-related quantities as auxiliary inputs, and produces prototype-based class evidence for five operationally meaningful line-loss states.
The main contributions of this paper are summarized as follows:
- (1)
A physics-consistent edge-level STGNN framework is developed for five-class line-loss anomaly diagnosis. Unlike generic STGNN anomaly classifiers, the proposed PC-LossGNN embeds branch-level power balance, voltage-drop, and ohmic-loss residuals into both representation learning and model optimization, thereby aligning learned features with the physical mechanisms of line-loss deviations.
- (2)
A topology-adaptive spatiotemporal encoder is designed by fusing the static electrical topology prior with a measurement-adaptive graph. This design enables the model to capture operating-condition-dependent coupling among branches while retaining physically feasible connectivity constraints.
- (3)
A dual-path decoder combining probabilistic prediction and prototype-based class evidence is introduced for fine-grained anomaly interpretation. The decoder not only improves classification performance under severe class imbalance but also provides interpretable evidence for distinguishing Normal, Infrastructure, Documentation, Metering, and Theft states.
The remainder of this paper is organized as follows.
Section 2 establishes the preliminaries for five-class line-loss anomaly classification, including the problem formulation, notation, taxonomy, and physics-based residuals used in distribution networks.
Section 3 details the proposed PC-LossGNN methodology, covering topology priors, measurement-adaptive graph construction, residual-guided attention, physics-consistent regularization, and the training protocol.
Section 4 reports the experimental validation on real utility data and benchmark models, together with ablation studies, robustness and calibration analyses, interpretability assessments, and computational efficiency.
Section 5 concludes the paper and outlines limitations and potential avenues for future research.
3. Methodology
A physics-consistent spatiotemporal graph neural network, PC-LossGNN, is developed for five-way edge classification of line-loss states (Normal, Infrastructure, Documentation, Metering, Theft). The architecture is built upon physics-aware and dynamically adaptive encoders, in which a static-prior plus measurement-driven dynamic graph, an interactive spatiotemporal fusion encoder based on odd-even temporal splitting, and confidence-aware inputs are employed. Predictions are regularized by explicit line-loss formulations to preserve physical consistency under sparse or noisy measurements.
3.1. Task Formulation and Feature Construction
Multi-source node/edge measurements together with confidence channels are organized as shown in
Figure 4 to form the confidence-aware inputs
and
. As shown in
Figure 4, node measurements and edge measurements are organized as two parallel input tensors over the same sliding historical window from
t −
T + 1 to
t. The time axis denotes only the temporal dimension within each tensor and does not indicate any sequential dependency between node and edge measurements.
A distribution system is modeled as a graph
with
buses and
branches. At time
, node-wise and edge-wise measurements within a sliding window of length
are arranged as
Here, denotes the number of branches, T denotes the length of the sliding temporal window, and and denote the numbers of node-measurement and edge-measurement feature channels, respectively. This definition clarifies the dimensional meaning of the node and edge features before constructing the confidence-aware inputs.
Measurement heterogeneity is encoded by confidence channels (from device accuracy and pseudo-measurements) that are broadcast in time and concatenated with raw features, yielding confidence-aware inputs , .
For each branch
and time
, a label
(normal/4 anomaly types) is assigned. Edge features include static parameters and dynamic flows together with loss cues,
so that both measured and ohmic losses are available to the model. The objective is a mapping
, where
is the five-class probability vector for branch
at time
.
3.2. Physics-Consistent Spatiotemporal Graph Modeling
To reflect electrical connectivity while adapting to operating conditions, a static prior adjacency
(covering all feasible lines) is combined with a dynamic adjacency inferred from measurements. The dynamic graph is parameterized by node embeddings
obtained from
:
Here, ⊙ denotes element-wise Hadamard multiplication, SoftMax is applied row-wise, and the mask induced by is used to exclude non-electrical links. The same ⊙ notation is used in the subsequent even-odd temporal interaction and endpoint-embedding interaction equations.
It should be noted that the adaptive adjacency matrix in this study is not an unconstrained statistical correlation graph, nor is it interpreted as a direct substitute for the impedance matrix or analytical power-flow sensitivity matrix. The static topology prior and the electrical connectivity mask first restrict the feasible support of graph learning by excluding node pairs without electrical connections. Within this physically feasible support, the measurement-driven adaptive adjacency further adjusts the coupling strength between neighboring nodes or branches according to voltage, power-flow, current, and loss-related features observed in the current time window. Therefore, the learned adaptive edge weights can be understood as a data-driven approximation of operating-condition-dependent physical interactions: network topology and line impedance determine the possible coupling paths, whereas load/generation variations, voltage drops, and power-flow redistribution change the effective influence strength along these paths. In this sense, the adaptive adjacency complements the static topology by representing dynamic coupling variations that cannot be fully captured by a fixed graph, rather than replacing the electrical parameters in conventional power-flow models.
The resulting block is schematized in
Figure 5. The topology prior and the adaptive adjacency are fused into the final graph structure, where the former fixes the physically feasible connectivity support and the latter reweights coupling strengths within this support according to the current measurement window. Forward/backward multi-hop diffusion with learnable weights and residual-based gating then produces the updated node states.
Spatiotemporal encoding is performed by STHGC (spatiotemporal hybrid graph convolution) together with an interactive temporal branching module. After temporal convolution over the sliding window, diffusion-style aggregation with forward/backward walks is applied:
where the degree matrix corresponds to the graph structure used in diffusion. To capture complementary dependencies between adjacent and interleaved time steps, the temporal window is split into even and odd subsequences, each processed by STHGC and then coupled through cross-stream multiplicative interaction and recombination:
then upsampled and interleaved to yield
. Edge-time embeddings
are obtained by temporal encoding of
.
The interactive spatiotemporal fusion encoder (ISTFE) block is illustrated in
Figure 6, where even/odd temporal streams undergo STHGC, cross-stream multiplicative interaction, residual recombination, and upsampling-interleaving to produce the final node-time embeddings.
It should be noted that ISTFE is not intended to constitute a rigorous hierarchical multiresolution decomposition in the signal-processing sense. Unlike wavelet decomposition, which separates frequency bands using predefined basis functions, temporal pyramids, which construct hierarchical temporal scales through multiple layers or window lengths, or frequency-domain methods, which analyze periodic components through spectral transforms, ISTFE adopts a lightweight odd-even two-stream interaction. The purpose of this design is to enhance the modeling of interactions among adjacent samples, interleaved samples, and short-term abrupt patterns while retaining end-to-end trainability and low computational overhead. Therefore, in the revised manuscript, ISTFE is described as an interactive temporal encoding module rather than as a strict multiresolution signal decomposition method.
3.3. Edge Representation and Five-Class Decoding
A symmetric edge representation is formed by fusing end-node and edge features:
where
is an MLP with gated linear units and
is the edge-time feature from
. To encourage physical plausibility, attention weights are modulated by branch residuals (defined in
Section 3.4):
with
a scalar scorer and
.
Five-class probabilities are produced by a hybrid decoder combining a softmax head and a prototype (center) head:
where
are learnable prototypes for Normal/Infrastructure/Documentation/Metering/Theft,
is a temperature and
.
The overall decoder is schematized in
Figure 7: the symmetric edge representation in Equation (25) is gated by physics residuals in Equation (26) and passed to two parallel heads Equations (27) and (28); their outputs are mixed as in Equation (29), while an auxiliary regression toward the ohmic target
contributes to
.
3.4. Learning Objective, Regularization, and Training Protocol
Physics residuals are constructed from branch identities (active/reactive conservation, voltage drop, ohmic loss equivalence). With
,
,
,
and
,
Although and both involve the branch active-power difference and the ohmic-loss term, is used to measure branch-level active-power balance with downstream injection, whereas directly measures the consistency between the measured active-power drop and the theoretical ohmic loss.
An auxiliary scalar head predicts
from
to fit
. The physics loss is
where
reflects measurement confidence.
For five-class supervision, a class-balanced focal cross-entropy is employed to handle skewed priors (e.g., Theft being rare):
with class counts
,
,
. In a highly imbalanced dataset, standard cross-entropy is easily dominated by Normal samples. Because the Normal class accounts for the overwhelming majority of instances, optimization tends to reduce the empirical risk of the majority class first, shifting the decision boundary toward the minority anomaly classes. This behavior may preserve high overall accuracy but suppress the recall of Infrastructure, Documentation, Metering, and Theft samples. The class-balanced weights used in this study increase the effective gradient contribution of minority classes, while the focal modulation further down-weights easily classified Normal samples and emphasizes hard or rare anomaly samples. Therefore, the loss function is not only intended to improve aggregate performance, but also to mitigate decision-boundary bias caused by severe class imbalance. To encourage temporal consistency while allowing sharp transitions when physics deviates, a residual-aware smoothing is added:
where
is a normalized mixture of residuals in (30) and (33). A prototype separation term sharpens class structure in the embedding space:
The overall objective adopts uncertainty-based weighting and a curriculum on physics:
Figure 8 summarizes the end-to-end pipeline. Confidence-aware node/edge windows are prepared and a fused adjacency
is obtained from the static prior and the adaptive graph. Node embeddings
(ISTFE + STHGC) and edge embeddings
are combined to form edge representations
, which are modulated by physics-residual–aware attention. Two parallel heads (MLP softmax and prototype) are mixed to produce the five-class probability
, while an auxiliary regression toward the ohmic target
and residual channels
shape the physics loss. The total objective aggregates
,
,
,
with uncertainty-based weighting as in Equation (38).
Under this formulation, Infrastructure anomalies tend to present increased ohmic inconsistencies () and voltage-drop deviations (); Documentation anomalies are characterized by persistent topology/parameter mismatch reflected as systematic residuals despite stable temporal patterns; Metering anomalies co-locate with confidence-weighted residual spikes and local disagreement between measured and ohmic losses; Theft typically yields directional flow discrepancies (large unexplained by ) with sharp temporal onsets, which the residual-aware attention and smoothing are designed to expose.
4. Experiments
This section presents a series of comprehensive experiments designed to rigorously evaluate the performance of the proposed Physics-Consistent Spatiotemporal Graph Neural Network (PC-LossGNN) for line loss anomaly classification and categorization. The experimental exposition is structured as follows.
Section 4.1 details the dataset, implementation specifics, and the experimental environment.
Section 4.2 specifies the evaluation metrics and the baseline models employed for comparative analysis. The core performance results of our proposed model are presented and analyzed in
Section 4.3. To comprehensively benchmark the model’s efficacy,
Section 4.4 provides a comparative analysis against several baseline models. In
Section 4.5, ablation studies are conducted to validate the contributions of the model’s key components, such as the physics-consistent loss and spatiotemporal encoders.
Section 4.6 further investigates the model’s robustness under varying degrees of measurement noise. Finally,
Section 4.7 offers a visual analysis of the learned feature representations to intuitively demonstrate the effectiveness of our model.
4.1. Data Source and Implementation Details
The dataset used in this study was constructed from real-world operational records of a distribution network in China collected from May 2022 to November 2022. The network segment includes 15 primary feeders, 1175 buses, and 1210 branches, serving residential, industrial, and commercial loads. Static data were obtained from topology records and asset-management files of the utility, including the base adjacency matrix, branch connectivity, and the resistance and reactance parameters of each line. Dynamic data were collected at a 15 min resolution, including bus-side voltage magnitudes, currents, active/reactive power, and branch-related power-flow and loss quantities. Node features were constructed from bus-side measurements and confidence channels, whereas edge features were mapped to the corresponding branches by associating terminal measurements, line parameters, and power-flow/loss quantities with each branch. This procedure yields node-time and edge-time samples consistent with the model inputs.
The supervisory labels were constructed from historical maintenance records, manual inspection reports, metering anomaly work orders, and physically constrained samples generated for rare events. For real event records, we extracted the event type, affected device or branch, start and end times, and inspection outcome, and then mapped each event to the corresponding branch index and 15 min time intervals. If a branch-time instance fell within a confirmed event window, it was labeled as Infrastructure, Documentation, Metering, or Theft according to the five-class taxonomy. Instances without evidence from maintenance, inspection, or anomaly work orders and with normal physical residuals were labeled as Normal. Records with ambiguous time boundaries, conflicting event types, or insufficient evidence were not directly used as supervised labels, thereby reducing the impact of label noise on model training.
Because theft events and some metering anomalies are rare in real systems, require long verification cycles, and are difficult to label exhaustively, physically constrained anomaly injection based on real operating windows was used to supplement the training samples. The injection process does not alter the base topology or line parameters; instead, it introduces anomaly-consistent perturbations within the empirical ranges of real load, voltage, and power-flow conditions. For example, theft samples are generated by introducing unmetered load or branch-side active-power deficits, which create persistent directional discrepancies between measured power drop and expected ohmic loss. Metering samples are generated by introducing metering bias, scaling errors, or short-term reading anomalies, which create local inconsistency between measurements and physics-based residuals. The injection parameters were sampled from empirical operating ranges and screened using power-balance, voltage-drop, and ohmic-loss residual checks to avoid physically implausible samples. The final experimental dataset contains approximately 10 million time-step-branch instances, with the following class distribution: Normal 98.5%, Metering 0.8%, Infrastructure 0.4%, Documentation 0.25%, and Theft 0.05%, reflecting the severe class imbalance typical of practical anomaly detection tasks.
All experiments were conducted on a workstation equipped with an Intel Core i7-12700K CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. The PC-LossGNN model was implemented in Python 3.9 using PyTorch and the PyTorch Geometric library. The dataset was chronologically partitioned into training (first 4 months), validation (5th month), and testing (final month) sets. Key hyperparameters were optimized via grid search on the validation set, including a learning rate of 0.001, a sliding window length T of 96 steps (24 h), and a GNN layer depth of 3. To strengthen the statistical reliability of the reported results, all experiments in this study were repeated five times using distinct random seeds that control the initialization of model parameters, the order of mini-batch sampling, and the stochastic behavior of dropout layers. Unless otherwise stated, the performance metrics reported in the following subsections represent the mean and standard deviation computed across these five independent runs.
4.2. Evaluation Metrics and Baselines
Given the highly imbalanced nature of the classification task, a comprehensive suite of evaluation metrics was employed. These include Accuracy, Precision, Recall, and the F1-Score. We focus on the macro-averaged versions of these metrics, which assign equal weight to each class, thereby providing a more faithful assessment of the model’s performance on minority classes.
In this task, accuracy mainly reflects the model’s ability to identify the Normal class, because Normal instances account for 98.5% of all samples. If accuracy alone is used, a model may still obtain a high score even when most minority anomalies are missed. Therefore, this study emphasizes macro-averaged precision, recall, and F1-score, and further reports per-class F1-scores to examine whether minority anomaly recognition is genuinely improved. In particular, the Theft and Documentation classes have very limited samples and partially overlapping physical signatures with other anomalies; hence, per-class recall and F1-score are more informative than overall accuracy for evaluating practical operational value.
To benchmark our proposed PC-LossGNN model, we compare its performance against a set of representative baseline algorithms. These include a non-GNN model (XGBoost) trained on flattened time-series features; a purely temporal model (LSTM) that ignores spatial correlations; a purely spatial model (GCN) that lacks a temporal modeling component; and a well-established spatiotemporal GNN (STGCN) adapted for the edge classification task. Following the recommendation of the reviewers, two recent physics-aware spatiotemporal GNN baselines are additionally included to more rigorously position the proposed framework against the state of the art. The first is ST-RGNN [
23], a spatiotemporal recurrent graph neural network designed for fault diagnostics in power distribution systems. ST-RGNN combines GRU-based temporal encoding with graph convolution and employs a hierarchical detection, classification, and location structure that is architecturally analogous to the present task. The second is PA-STGCN [
24], a physics-aware spatiotemporal graph convolutional network that integrates power-flow equation residuals as auxiliary input features into the graph learning process. Both models were re-implemented in PyTorch Geometric using the architectures and hyperparameter settings reported in the original publications and were trained on the same dataset splits, preprocessing pipeline, and evaluation protocol as all other baselines.
4.3. Performance of the Proposed Model
The proposed PC-LossGNN model demonstrated robust performance on the test set, with the aggregated results over five independent runs summarized in
Table 2. The model achieved a mean overall accuracy of 0.9914 ± 0.0001, a metric largely driven by its reliable classification of the dominant Normal class. More informative for the imbalanced setting are the macro-averaged metrics, which assign equal weight to each of the five categories. The model attained a macro-averaged F1-Score of 0.8503 ± 0.0039, indicating a competent and well-balanced classification performance across all five classes, including the rare anomaly types. The small standard deviation confirms that this result is stable across different random initializations and is not dependent on a single favorable training trajectory. The corresponding 5 × 5 confusion matrix of the proposed model on the test set is shown in
Figure 9.
To situate the performance of PC-LossGNN within a broader context, it was benchmarked against six baseline models, including two recent physics-aware spatiotemporal GNNs. The quantitative results are presented in
Table 3, and the distributions of macro F1-Scores across five independent runs are visualized in
Figure 10 and
Figure 11.
The analysis reveals a clear performance hierarchy that provides insight into the relative importance of different modeling capabilities. The non-spatiotemporal baselines achieved macro F1-Scores of 0.6659, 0.7086, and 0.7465 for XGBoost, LSTM, and GCN, respectively. This progression confirms that explicitly modeling spatial or temporal dependencies individually yields incremental gains, but addressing either dimension in isolation is insufficient for capturing the complex dynamics of line loss anomalies.
Among the spatiotemporal models, vanilla STGCN attains a macro F1-Score of 0.8016 ± 0.0075, which already represents a substantial improvement over the single-domain baselines. The two recently proposed physics-aware GNNs build upon this foundation. ST-RGNN achieves 0.8099 ± 0.0068, an improvement of 0.8 percentage points over STGCN. This gain can be attributed to its GRU-based recurrent temporal encoding, which captures longer-range dependencies in anomaly signatures compared to the fixed convolutional kernels used in STGCN. PA-STGCN further advances the performance to 0.8212 ± 0.0048, representing a 2.0 percentage point improvement over STGCN. The additional gain of PA-STGCN over ST-RGNN demonstrates that incorporating power-flow residuals as auxiliary input features provides useful inductive bias for distinguishing anomaly types whose electrical signatures differ in their physical mechanisms.
Nonetheless, PC-LossGNN substantially outperforms both physics-aware baselines, achieving a macro F1-Score of 0.8503. This represents an improvement of 5.2 percentage points over STGCN and 2.9 percentage points over PA-STGCN. Three factors account for this additional gain. First, PC-LossGNN does not merely use physics residuals as input features but embeds physics consistency directly into the training objective through Lphys, which constrains the learned representations to satisfy power balance and ohmic loss relationships. Second, the measurement-adaptive adjacency mechanism allows the graph topology to evolve with operating conditions, whereas both ST-RGNN and PA-STGCN rely on fixed graph structures that cannot capture transient topological changes. Third, the dual-path decoder with class-balanced focal loss and synthetic theft injection provides targeted improvements for the extreme minority classes, which constitute a challenge not specifically addressed by the architectures of either baseline.
Beyond the mean performance, the variance across runs provides additional insight into model reliability. As shown in
Figure 10, PC-LossGNN exhibits the narrowest interquartile range among all models, with a standard deviation of only 0.0039 for macro F1-Score. By contrast, XGBoost and LSTM display notably wider distributions, suggesting greater sensitivity to initialization conditions. This stability can be attributed to the physics-consistent regularization in PC-LossGNN, which constrains the solution space to physically plausible regions and thereby reduces the dependence on random initialization. The consistently narrow box in
Figure 10 confirms that the performance advantage of PC-LossGNN is robust and not an artifact of a single favorable run. A detailed statistical significance analysis comparing PC-LossGNN with STGCN through McNemar’s test and bootstrap confidence intervals is provided in
Section 4.5.
Figure 12 presents the validation loss curves of PC-LossGNN and the LSTM, GCN, and STGCN baselines, where the proposed model converges faster and stabilizes at a lower validation loss.
4.4. Statistical Significance Analysis
To rigorously verify that the performance improvement of PC-LossGNN over the strongest baseline STGCN is not attributable to random variation, two complementary statistical tests were conducted on the test set predictions.
The first test is McNemar’s test, which examines whether two classifiers produce systematic differences in their sample-level predictions. A 2 × 2 contingency table was constructed by cross-tabulating the correctness of each model’s prediction for every test instance, as shown in
Table 4. Let b denote the number of instances correctly classified by STGCN but misclassified by PC-LossGNN, and let c denote the number of instances in the converse situation. With b = 4091 and c = 8044, the continuity-corrected McNemar statistic yields χ
2 = 1287.05, corresponding to
p < 10
−10. This result provides overwhelming evidence that the two models differ significantly in their classification behavior. Moreover, the ratio c/b ≈ 2.0 indicates that PC-LossGNN correctly reclassifies nearly twice as many instances that STGCN misses, compared to the reverse direction. The difference is therefore not merely statistically significant but also practically meaningful, reflecting a consistent improvement rather than a symmetric exchange of errors.
The second analysis employs stratified parametric bootstrap resampling to quantify the uncertainty of the macro-averaged performance gap. In each of 10,000 iterations, the per-class recall for both models was resampled from the asymptotic normal distribution justified by the Central Limit Theorem, and the macro-averaged difference ΔF1 was recorded.
Figure 13 presents the resulting bootstrap distribution. The mean of ΔF1 is 0.0902, and the 95% confidence interval spans from 0.0808 to 0.0995. Because the entire interval lies well above zero, the improvement of PC-LossGNN over STGCN is confirmed to be statistically robust and not sensitive to the particular composition of the test set.
Figure 14 further disaggregates the comparison to the individual class level. PC-LossGNN consistently outperforms STGCN across all five categories. The two models achieve comparable accuracy on the Normal class, which is expected given the overwhelming prevalence of normal instances. The largest absolute improvements are observed for the Documentation and Theft categories, where PC-LossGNN improves per-class accuracy by approximately 13 and 14 percentage points, respectively. These two categories are precisely the ones most likely to benefit from the physics-informed residual attention mechanism, which is designed to detect persistent discrepancies between measured and expected power flow. It is worth noting that the Theft class exhibits the widest confidence intervals for both models, which is a natural consequence of its extremely low prevalence in the test set, with only approximately 825 instances. Nevertheless, the improvement remains clearly visible even under this elevated uncertainty.
The per-class analysis in
Figure 14 provides further insight into the source of these improvements. All four spatiotemporal models achieve comparable performance on the Normal class, with F1-Scores exceeding 0.994. The differentiating factor is their ability to classify rare anomaly types. PC-LossGNN achieves per-class F1-Scores of 0.862 for Infrastructure, 0.818 for Documentation, 0.894 for Metering, and 0.683 for Theft. These values consistently exceed those of all baselines, with the most pronounced gains appearing in the Theft category. PC-LossGNN improves Theft F1-Score by 7.8 percentage points over PA-STGCN and by 15.5 percentage points over STGCN. This large margin reflects the synergy between the physics-residual attention mechanism, which is designed to detect the sustained directional power discrepancy characteristic of energy diversion, and the synthetic data augmentation and focal loss strategies that ensure the model receives sufficient and well-balanced training signal for this extremely rare category.
From the perspective of class imbalance, this result indicates that the improvement of PC-LossGNN is not dominated by the Normal class. The Normal class is already nearly saturated across all compared models, whereas the main performance differences arise from the four anomaly classes, especially the minority Documentation and Theft categories. This observation suggests that physics-residual attention, class-balanced focal loss, and rare-class sample supplementation jointly improve the separability of minority classes, allowing the model to maintain high Normal-class accuracy while avoiding excessive absorption of anomaly samples into the majority-class decision region.
4.5. Ablation Studies
The most substantial performance degradation was observed upon removal of the physics-consistent regularization term. Without this component, the macro F1-Score decreased from 0.8503 ± 0.0039 to 0.7718 ± 0.0077. This finding confirms that embedding physical laws from the branch power flow model is fundamental to the model’s success, as it enhances generalization and enables the distinction between complex anomaly types, particularly under noisy measurement conditions.
The study further validates the necessity of the integrated spatiotemporal architecture. Excising either the temporal encoder or the spatial GNN encoder resulted in significant performance drops, with macro F1-Scores falling to 0.7496 ± 0.0058 and 0.7238 ± 0.0125, respectively. This demonstrates that jointly modeling spatial and temporal dependencies is critical for comprehensively capturing the complex dynamics of line loss behavior. This ablation result supports the contribution of the interactive temporal encoder to anomaly classification, but it should not be interpreted as evidence that ISTFE performs a strict hierarchical multiresolution decomposition. Accordingly, we define its role as lightweight modeling of adjacent and interleaved temporal dependencies, while leaving more rigorous multiresolution designs such as wavelet, temporal-pyramid, or frequency-domain decomposition as possible extensions for future work.
Finally, the contribution of the adaptive adjacency mechanism was assessed. While a model variant using only the static base adjacency matrix still performed competently at 0.8264 ± 0.0037, incorporating the measurement-driven dynamic adjacency provided a discernible boost that raised the full model to 0.8503 ± 0.0039. This result also indicates that the adaptive adjacency does not merely provide additional statistical correlation, but captures operating-condition-dependent variations in physical coupling strength under the constraint of the static electrical topology. When loads, distributed generation outputs, and branch-loss distributions vary over time, such adaptive reweighting helps represent power-flow interactions and line-loss anomaly propagation patterns that are difficult to express using a fixed topology alone.
To further disentangle the contributions of the two strategies specifically designed to address class imbalance, two additional ablation experiments were conducted. In the first variant, the class-balanced focal loss defined in Equation (35) was replaced with a standard cross-entropy loss while keeping all other components unchanged. In the second variant, the synthetic theft data injection was removed from the training pipeline so that the model was trained exclusively on the original labeled theft instances. The macro-level results are reported in the last two rows of
Table 5, and the per-class breakdown is provided in
Table 6. The corresponding distributions are visualized in
Figure 15,
Figure 16 and
Figure 17.
The results reveal that the two strategies play clearly complementary roles. Removing the focal loss leads to a broad performance decline across all four anomaly categories, with the macro F1-Score falling from 0.8503 ± 0.0039 to 0.7887 ± 0.0082. The per-class analysis in
Figure 15 shows that Documentation suffers the most severe decline at 7.3 percentage points, followed by Theft at 10.8, Infrastructure at 4.3, and Metering at 3.3. This pattern is consistent with the theoretical function of the focal loss, which modulates the gradient contribution of easily classified samples. Because the Normal class accounts for 98.5% of all instances, standard cross-entropy training allows the Normal gradient to dominate the optimization landscape. The focal loss addresses this by down-weighting confidently classified Normal samples, thereby allocating more effective training signal to the underrepresented anomaly categories. Without this reweighting, the optimizer converges toward a decision boundary that favors the majority class, and the consequence is a broad suppression of minority-class recall.
In contrast, removing the synthetic theft injection produces a highly concentrated effect. As shown in
Figure 16, the F1-Score for the Theft class drops sharply from 0.682 ± 0.005 to 0.393 ± 0.019, a decline of nearly 29 percentage points, while the other four categories remain largely unaffected with changes of less than 2 percentage points. This dramatic collapse reflects the extreme scarcity of real theft labels in the training set. Electricity theft accounts for only 0.05% of all instances, corresponding to fewer than 300 real theft events across the four training months. Without synthetic augmentation, the model lacks sufficient positive examples to learn the distinctive physics-residual signature of theft, which manifests as a persistent directional discrepancy between measured power drop and expected ohmic loss. The resulting model achieves high recall on the other anomaly types but frequently misclassifies theft events as Normal or Metering anomalies. It should be emphasized that the synthetically injected samples are used to improve training coverage for rare anomalies, rather than to replace labels confirmed by real inspections and work orders. This design enables the model to learn the physics-residual pattern of theft events while keeping the annotation procedure grounded in operational records, maintenance logs, and manual verification results.
An important observation from these results is that the two mechanisms are not redundant and cannot substitute for each other. Even when the focal loss is present, it cannot compensate for the absence of theft training samples, because reweighting an empty or near-empty class gradient does not create informative gradients from nothing. Conversely, augmenting the theft data alone does not prevent the broad suppression of Infrastructure, Documentation, and Metering performance that occurs when the majority class dominates training through standard cross-entropy. Only the joint deployment of both strategies produces the synergistic effect observed in the full model. These results further indicate that class imbalance affects not only the aggregate training error but also the effective decision regions of different anomaly classes. The class-balanced focal loss mainly mitigates the global boundary bias caused by Normal-class gradient dominance, whereas synthetic theft injection mainly compensates for the insufficient intra-class representation of the extremely rare Theft category. Their joint use allows the model to improve the separability of minority anomaly classes while maintaining high recognition performance for the Normal class.
The variance across runs also reveals an interesting pattern, as illustrated in
Figure 17. The variant without the spatial encoder exhibited the largest standard deviation of 0.0125 for macro F1-Score, indicating that the absence of spatial graph convolution not only degrades average performance but also makes the model more sensitive to random initialization. By contrast, the full PC-LossGNN maintained the smallest standard deviation of 0.0039, further corroborating the stabilizing effect of jointly leveraging physics constraints and spatiotemporal encoding. The complete model produces the most compact and highest-positioned box in the plot, confirming that all architectural components contribute to both performance and consistency.
4.6. Robustness Analysis
A model’s robustness to noisy measurements is a key indicator of its practical utility. To comprehensively evaluate this aspect, we analyzed the performance degradation of PC-LossGNN against the LSTM, GCN, STGCN, ST-RGNN, and PA-STGCN baselines under varying levels of injected Gaussian noise.
Figure 18 plots the macro F1-Score for each model as a function of the noise-to-signal ratio applied to the sensor measurements in the test set.
The results compellingly demonstrate the superior robustness of the proposed architecture. While the performance of all models naturally degrades as noise intensity increases, their degradation rates differ substantially. The PC-LossGNN model exhibits the most graceful decline, maintaining a high F1-score even under significant noise levels. In contrast, the baseline models show much greater vulnerability. The GCN and LSTM models, lacking the ability to process integrated spatiotemporal information, experience a rapid drop in performance. The STGCN, while more resilient than the purely spatial or temporal models, still degrades more steeply than PC-LossGNN. This superior resilience can be largely attributed to PC-LossGNN’s physics-consistency loss (), which acts as a strong regularizer by constraining predictions to adhere to fundamental power flow equations, effectively filtering out implausible, noise-induced fluctuations and leveraging the inherent redundancy in the network’s physical structure.
4.7. Interpretability Analysis via Class-Evidence Visualization
To intuitively understand the impact of the model’s representation learning and to demonstrate the interpretability afforded by the prototype embedding head, the t-Distributed Stochastic Neighbor Embedding algorithm was employed to visualize the high-dimensional edge embeddings from the final layer.
Figure 19 presents the t-SNE projection of the learned embeddings from PC-LossGNN, with the five learnable class prototypes μ
c overlaid as star markers. Normal samples form a large, dense, and well-separated cluster in the upper-left region, while the four anomaly categories form distinct but more diffuse clusters that are clearly separated from the Normal group. Normal samples form a large, dense, and well-separated cluster in the upper-left region. The four anomaly categories form distinct but more diffuse clusters that are clearly separated from the Normal group. The prototypes are located near the geometric center of their respective clusters, confirming that the prototype separation loss in Equation (37) successfully anchors the embedding geometry to physically meaningful class structures. The partial overlap between the Infrastructure and Documentation clusters is consistent with the per-class confusion reported in the confusion matrix of
Figure 9 and reflects the genuinely overlapping electrical signatures of these two anomaly types. Both categories can produce persistent residual biases under steady-state conditions, which makes them intrinsically difficult to distinguish. The dashed lines in
Figure 19 connect several ambiguous Documentation samples to both the Documentation and Infrastructure prototypes, illustrating that their embedding positions fall between the two class regions. These are precisely the cases where the class-evidence decomposition provided by the prototype head is most valuable to human operators.
To demonstrate how the prototype similarity scores function as class-evidence in practice,
Figure 20 illustrates two representative cases. In
Figure 20a, a correctly classified Theft sample (T − 1) exhibits a clear unimodal similarity profile, with the Theft prototype receiving a similarity of approximately 0.68 and no other prototype exceeding 0.09. This sharp concentration provides unambiguous evidence for the classification decision. In contrast,
Figure 20b shows a misclassified Documentation sample (D-3) predicted as Infrastructure: the Infrastructure prototype receives 0.47 while Documentation receives 0.36. Although the prediction is incorrect, the true class retains the second-highest similarity, meaning an operator inspecting the class-evidence would immediately recognize the prediction as uncertain. This evidence decomposition directly supports the operational workflow envisioned in this paper: high-confidence predictions can be acted upon automatically, while ambiguous cases are routed to expert review.
Figure 20 further contrasts the prototype similarity profiles for a correctly classified and a misclassified sample in bar chart form. Panel (a) shows sample T − 1, where the Theft prototype dominates with a sharp peak. The physics residuals for this sample exhibit a persistent directional discrepancy between measured power drop and expected ohmic loss, which is the canonical signature of energy diversion that the Theft prototype has learned to represent. Panel (b) shows sample D-3, where the Infrastructure and Documentation prototypes are nearly tied. This flat profile reflects the fundamental physical ambiguity in distinguishing parameter-mismatch anomalies from equipment-degradation anomalies when both produce similar systematic residual biases. This type of evidence decomposition directly supports the operational workflow envisioned in this paper. High-confidence predictions, where one prototype clearly dominates, can be acted upon automatically. Ambiguous cases, where two or more prototypes share similar scores, are routed to expert review. Standard softmax-only classifiers lack this capability because their output probabilities are often poorly calibrated and do not convey how the classification evidence distributes across competing hypotheses.