A Hybrid Machine Learning Framework for Electricity Fraud Detection: Integrating Isolation Forest and XGBoost for Real-World Utility Data
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper is interesting to read, as is the integration between XGBoost and Isolation Forest. However, several fundamental issues should be addressed to improve the readability and the flow of information in the paper. Here are suggested points for improvement:
- The abstract does not reflect all the points made in the paper, for example, how the proposed framework demonstrates the robustness of the integration.
- The literature review in Section 2.2 is generic, unfocused, and not well documented. It should concentrate on work directly related to hybrid and ensemble methods for fraud detection; there are many such papers, and they must be cited properly.
- As claimed in the paper, the integration of Isolation Forest and XGBoost shows promising results. However, the integration itself is not well explained in Section 3.3. I suggest rewriting this section to emphasize how the two components work together.
- The authors used a simple 75/25 train/test split without justification. I suggest also trying 80/20, 50/50, and 70/30 splits to see whether these improve learning. Also, what about cross-validation?
- I suggest the authors elaborate on what they mean by “real-world” and “practical operational value”; how do they define these?
- The authors should also calculate and discuss the Lift, or provide a cost-benefit analysis at different thresholds. Several points are confusing here, and a what-if analysis would help.
- There is a critical discrepancy in the AUC-ROC values. For example, Table 1 reports an AUC-ROC of 0.947 for a threshold of 0.5, but the text and other rows in the same table claim it is 0.999. This inconsistency must be resolved, as it undermines the credibility of the reported results and leaves them without a solid foundation.
- The comparative analysis in Table 2 is insufficient; more detail is needed to understand the paper. The proposed model should be compared against established, simpler baselines, for example Random Forest, Logistic Regression, or a simple rule-based method. The failure of Scenario 4 (TSFRESH) is noted but not adequately explained or diagnosed, and the limitations of the main approach are not well addressed.
- Section 3.6 lists the "carefully tuned" final hyperparameters, but there is no detailed explanation. The authors must describe the tuning method used. I suggest more elaboration.
- The paper uses terms such as "electricity fraud" and "electricity theft" but does not clearly define them or address their ambiguity with real-world examples.
- The description of the SMOTETomek fallback mechanism (Section 3.5) is confusing. Please elaborate and add well-defined steps.
- I also suggest reviewing the notation again, as it was hard to follow.
- In the conclusion, please discuss the limitations of the approach and how to overcome them in future work.
Author Response
The paper is interesting to read, as is the integration between XGBoost and Isolation Forest. However, several fundamental issues should be addressed to improve the readability and the flow of information in the paper. Here are suggested points for improvement:
1. The abstract does not reflect all the points made in the paper, for example, how the proposed framework demonstrates the robustness of the integration.
Reply:
Thank you for this valuable comment. The abstract has been revised to more accurately reflect the key contributions presented in the paper, including how the proposed framework demonstrates robustness. We now explicitly state that robustness was assessed through a 5-fold cross-validation procedure, ensuring consistent performance across multiple data partitions. Additionally, we included the optimized operating threshold (0.6) and the main performance indicators (AUC-ROC = 0.999; F1-score = 0.77), which collectively highlight the effectiveness and stability of the hybrid integration.
2. The literature review in Section 2.2 is generic, unfocused, and not well documented. It should concentrate on work directly related to hybrid and ensemble methods for fraud detection; there are many such papers, and they must be cited properly.
Reply:
Thank you for this insightful comment. In response, we have substantially revised and expanded Section 2.2 to ensure a more focused and well-documented literature review. The section now includes two dedicated subsections—“Data-Driven Electricity Theft Detection” and “Hybrid and Ensemble Methods”—which directly contextualize prior work related to our proposed approach. We incorporated several recent and highly relevant studies (e.g., Badr et al., 2023; Mehdary et al., 2024; Hussain et al., 2022) that examine ensemble learning strategies, hybrid architectures, and advanced deep-learning techniques for fraud detection. These additions strengthen the theoretical grounding of the manuscript and clearly position our Isolation Forest + XGBoost framework within the landscape of state-of-the-art hybrid solutions.
3. As claimed in the paper, the integration of Isolation Forest and XGBoost shows promising results. However, the integration itself is not well explained in Section 3.3. I suggest rewriting this section to emphasize how the two components work together.
Reply:
Thank you for highlighting this important point. We agree that the integration between Isolation Forest and XGBoost required clearer explanation. Accordingly, we have revised Section 3.3 and the associated methodological background to better articulate how the two components operate together. The revised text clarifies that the Isolation Forest model is used as an unsupervised feature extractor, generating an anomaly score—derived from average path lengths—which is then incorporated as an additional high-level input feature to the supervised XGBoost classifier. This results in a stacked hybrid architecture in which XGBoost benefits from the global anomaly patterns captured by the unsupervised stage, thereby enhancing its discriminatory power for fraud detection. These revisions improve the conceptual clarity and make the workflow of the hybrid integration more transparent.
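The stacked hybrid described in this reply can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual pipeline: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and all sizes and values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest

rng = np.random.default_rng(0)
# Hypothetical consumption features: 500 normal customers plus 25
# "fraud" customers whose monthly usage is sharply lower.
X_normal = rng.normal(loc=100.0, scale=10.0, size=(500, 12))
X_fraud = rng.normal(loc=40.0, scale=10.0, size=(25, 12))
X = np.vstack([X_normal, X_fraud])
y = np.array([0] * 500 + [1] * 25)

# Stage 1 (unsupervised): Isolation Forest derives an anomaly score
# from average path lengths; score_samples is higher for normal points,
# so the sign is flipped to make higher mean more anomalous.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
anomaly_score = -iso.score_samples(X)

# Stage 2 (supervised): the anomaly score is appended as one extra
# high-level input feature for the boosted classifier.
X_hybrid = np.column_stack([X, anomaly_score])
clf = GradientBoostingClassifier(random_state=0).fit(X_hybrid, y)
```

The design choice is that the supervised stage sees both the raw features and the global anomaly pattern captured by the unsupervised stage, which is what "stacked hybrid" means here.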
4. The authors used a simple 75/25 train/test split without justification. I suggest also trying 80/20, 50/50, and 70/30 splits to see whether these improve learning. Also, what about cross-validation?
Reply:
Thank you for pointing this out. While we maintained the 75/25 hold-out split for the final independent testing to mimic a production deployment scenario, we have clarified in Section 4.1 (Experimental Setup) that during the training phase, we employed 5-fold cross-validation. This ensures that the hyperparameters were optimized robustly and that the model's performance is stable across different subsets of the training data, mitigating the bias of a single split.
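The evaluation protocol described in this reply can be sketched as below, on synthetic data; LogisticRegression stands in for the actual hybrid model, and the numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with a learnable signal in the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# 75/25 stratified hold-out: the test fold is kept aside for the final,
# deployment-style evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 5-fold cross-validation confined to the training portion, as used for
# hyperparameter selection; variance across folds indicates stability.
cv_f1 = cross_val_score(LogisticRegression(), X_tr, y_tr, cv=5, scoring="f1")
```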
5. I suggest the authors elaborate on what they mean by “real-world” and “practical operational value”; how do they define these?
Reply:
We have added a dedicated definition in the Introduction (Section 1). We define "real-world operational value" as the model's capability to be deployed within existing utility IT infrastructures with minimal latency, producing inspection lists that align with field crew capacity (limited budget), and maintaining stability against data drift over time. This distinguishes our approach from purely theoretical models that may not account for operational constraints.
6. The authors should also calculate and discuss the Lift, or provide a cost-benefit analysis at different thresholds. Several points are confusing here, and a what-if analysis would help.
Reply:
We have addressed this concern by adding Section 4.4 (Lift and Cost-Benefit Analysis) to the manuscript. In this section, we conducted a Lift analysis to quantify the operational value of the model under realistic resource constraints. Specifically, the value of 5% was established as a simulation of the utility company's inspection capacity, representing a scenario where resources allow for the verification of only the top tier of customers with the highest fraud probability scores generated by the model. Within this targeted segment, the model achieved a Capture Rate (Recall) of 80%, a result derived empirically from the Cumulative Gains Curve calculated on the test set. This performance indicates that the model successfully concentrated the vast majority of fraudulent cases within the highest-risk percentiles. Consequently, the 16x Lift metric was calculated by comparing this performance against a random baseline; since a random inspection of 5% of the population would statistically capture only 5% of the frauds, the efficiency gain is determined by the ratio of the model's capture rate to the random baseline (80/5 = 16). This result demonstrates that the proposed framework is 16 times more effective than random inspections, providing a clear and quantifiable economic advantage for utility operations.
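The lift arithmetic in this reply can be reproduced with a short helper. The function name and the toy scorer below are hypothetical, chosen only so the numbers mirror the 5% budget, 80% capture, 16x lift example.

```python
import numpy as np

def lift_at_budget(y_true, scores, budget=0.05):
    """Capture rate and lift when inspecting only the top `budget`
    fraction of customers ranked by fraud-probability score."""
    order = np.argsort(scores)[::-1]              # highest risk first
    k = max(1, int(round(budget * len(scores))))  # inspection capacity
    captured = y_true[order[:k]].sum()            # frauds inside the budget
    capture_rate = captured / y_true.sum()
    return capture_rate, capture_rate / budget    # lift vs. random checks

# Toy population: 1000 customers, 50 frauds; the scorer places 40 of the
# 50 frauds (plus 10 non-frauds) in the top 5%.
y = np.zeros(1000)
y[:50] = 1
scores = np.zeros(1000)
scores[:40] = 0.9    # 40 frauds scored as high risk
scores[50:60] = 0.8  # 10 non-frauds also scored as high risk
capture, lift = lift_at_budget(y, scores, budget=0.05)
# capture = 0.8 (80% of frauds caught), lift = 0.8 / 0.05 = 16
```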
7. There is a critical discrepancy in the AUC-ROC values. For example, Table 1 reports an AUC-ROC of 0.947 for a threshold of 0.5, but the text and other rows in the same table claim it is 0.999. This inconsistency must be resolved, as it undermines the credibility of the reported results and leaves them without a solid foundation.
Reply: We apologize for this confusion. We have thoroughly reviewed the results and corrected Table 1 to consistently reflect the accurate performance metrics derived from the final optimized model. The AUC-ROC is consistently reported as 0.999 across the evaluated thresholds, reflecting the strong separability achieved by the hybrid model on the test set. The discrepancies were due to a versioning error in the table generation which has now been resolved.
8. The comparative analysis in Table 2 is insufficient; more detail is needed to understand the paper. The proposed model should be compared against established, simpler baselines, for example Random Forest, Logistic Regression, or a simple rule-based method. The failure of Scenario 4 (TSFRESH) is noted but not adequately explained or diagnosed, and the limitations of the main approach are not well addressed.
Reply:
Thank you for this important observation. We have substantially expanded the comparative analysis to address your concern. Table 2 has been revised to include evaluations against well-established baseline models—Logistic Regression and Random Forest—providing a clearer benchmark for assessing the effectiveness of the proposed hybrid approach. The updated results show that our IF + XGBoost framework achieves superior performance (F1-score = 0.77), outperforming both Random Forest (F1-score = 0.71) and Logistic Regression (F1-score = 0.45), thereby demonstrating the added value of combining unsupervised anomaly scoring with a supervised classifier.
Regarding Scenario 4 (TSFRESH), we added a detailed explanation in Section 4.5. We clarify that its poor performance (F1 = 0.010) stems from the high dimensionality of the automatically extracted features and the severe class imbalance of the dataset. In this context, generic temporal descriptors are overshadowed by sparse and highly localized fraud patterns, while domain-informed behavioral features (e.g., abrupt consumption drops, prolonged low-usage periods, tariff changes) provide much stronger discriminatory power. This analysis also allowed us to explicitly acknowledge a limitation of automated feature-extraction approaches in NTL detection and reinforce why hybrid, domain-guided architectures remain necessary for this problem.
9. Section 3.6 lists the "carefully tuned" final hyperparameters, but there is no detailed explanation. The authors must describe the tuning method used. I suggest more elaboration.
Reply:
We have refined Section 4.1 (Experimental Setup) to explicitly state that hyperparameters were optimized using 5-fold cross-validation on the training set. This process involved systematically validating parameters to maximize the F1-score, ensuring the "carefully tuned" values are not arbitrary but the result of rigorous validation.
10. The paper uses terms such as "electricity fraud" and "electricity theft" but does not clearly define them or address their ambiguity with real-world examples.
Reply:
We have added a clarification in the Introduction. We define "electricity fraud" in the broad sense (administrative and technical irregularities) and "electricity theft" as the specific subset involving physical manipulation (e.g., bypassing meters). The paper focuses on the broader category of fraud detection using data patterns that may indicate either type of irregularity.
11. The description of the SMOTETomek fallback mechanism (Section 3.5) is confusing. Please elaborate and add well-defined steps.
Reply: We have rewritten the description in Section 3.5 to include a structured, step-by-step explanation of the fallback mechanism. We clarify that if the minority class count is too low (<6 samples) to support the k-nearest neighbors required for SMOTE, the pipeline automatically skips resampling and defaults to using the scale_pos_weight parameter in XGBoost. This ensures the pipeline remains robust and does not crash during automated retraining on smaller data subsets.
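The fallback steps in this reply can be sketched as control flow. This is a hedged illustration, not the paper's code: the function name is hypothetical, the imblearn call is indicated but commented out to keep the sketch dependency-free, and the threshold of 6 reflects SMOTE's default k_neighbors=5.

```python
import numpy as np

SMOTE_MIN_SAMPLES = 6  # SMOTE's default k_neighbors=5 needs >= 6 minority samples

def prepare_training_data(X, y):
    """Step-wise fallback as described in Section 3.5: resample with
    SMOTETomek when the minority class is large enough; otherwise skip
    resampling and rely on XGBoost's scale_pos_weight instead."""
    n_minority = int(np.sum(y == 1))
    if n_minority >= SMOTE_MIN_SAMPLES:
        # from imblearn.combine import SMOTETomek
        # X, y = SMOTETomek(random_state=0).fit_resample(X, y)
        return X, y, {"resampled": True, "scale_pos_weight": 1.0}
    # Too few minority samples for SMOTE's k-NN step: fall back to class
    # weighting so automated retraining never crashes on small subsets.
    spw = float(np.sum(y == 0)) / max(n_minority, 1)
    return X, y, {"resampled": False, "scale_pos_weight": spw}
```

With, say, 3 fraud cases among 100 customers, the function skips resampling and returns scale_pos_weight of 97/3, which is then passed to the XGBoost classifier.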
12. I also suggest reviewing the notation again, as it was hard to follow.
Reply: We have reviewed and standardized the mathematical notations in Section 2.3. We explicitly defined $N$ as the total number of samples and $n$ as the number of estimators/trees, ensuring consistency across the XGBoost and Isolation Forest equations to improve readability and mathematical rigor.
13. In the conclusion, please discuss the limitations of the approach and how to overcome them in future work.
Reply: We have included a dedicated Section 5.1 (Limitations and Future Directions). We explicitly acknowledge limitations such as the computational cost of the SMOTETomek resampling on very large datasets and the need for periodic retraining to handle concept drift. We suggest Incremental Learning and Federated Learning as specific future research directions to address scalability and privacy concerns.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The peer-reviewed (scientific) paper proposes a hybrid machine-learning framework for detecting electricity fraud within the broader context of Non-Technical Losses (NTLs) in power-distribution systems.
The problem of detecting fraud and energy theft presented in the manuscript is relevant. The authors took into account many aspects, including the risk of overfitting the model, data preparation (standardization), mitigation of class imbalance, and factors such as the seasonality of consumption.
The approach considered is quite interesting, but at the same time there are a number of suggestions and comments, so the study needs further refinement.
- There is virtually no literature review of previous research in this area. Section 2, which implies an overview, uses only five sources. Are there really no other studies? If you believe this is a very narrow area, please indicate so in the text. In addition, only 8 sources in total are referenced in the text of the manuscript.
- A limitation of the study is that only final "bona fide" consumers are considered, while theft during the transmission of energy to the consumer (upstream of the meter) is not, although it sometimes reaches large proportions. Admittedly, intermediate meters are still installed for relatively few consumers across countries and the world. Please indicate whether you plan to apply your approach to this area of energy loss (theft).
- It would be interesting to see the practical application of this approach. After all, in the analyzed dataset fraudsters were identified through random checks and therefore not exhaustively. That is, training took place partly on data labeled non-fraudulent only because the consumer, a likely fraudster, was never actually inspected.
- Include all abbreviations used in the text in the Abbreviations list, for example: SVM and CNN, ASGD and CSGD.
- Describe the variables "N" (line 72) and "n" (formulas 1 and 2) in the text.
- Line 363: "an Isolation Forest model was applied to generate an anomaly score." Describe in detail the Isolation Forest model.
- Perhaps you could indicate at least the main categories of clients (tariffs) in the manuscript?
- For clarity, it is worth adding to Table 1 the indicators at thresholds 0.65 and 0.70.
- Line 324: "Year-over-year percentage variation in FP consumption for the same month, useful for capturing seasonality." It is right that the authors took the seasonality of consumption into account, but please describe the meaning and impact of this indicator in more detail.
- In my opinion, a diagram (graph) of abnormal and average consumption would very clearly demonstrate the study.
- The reference in the text to Figure 3 appears after the figure itself.
Author Response
The peer-reviewed (scientific) paper proposes a hybrid machine-learning framework for detecting electricity fraud within the broader context of Non-Technical Losses (NTLs) in power-distribution systems.
The problem of detecting fraud and energy theft presented in the manuscript is relevant. The authors took into account many aspects, including the risk of overfitting the model, data preparation (standardization), mitigation of class imbalance, and factors such as the seasonality of consumption.
The approach considered is quite interesting, but at the same time there are a number of suggestions and comments, so the study needs further refinement.
1. There is virtually no literature review of previous research in this area. Section 2, which implies an overview, uses only five sources. Are there really no other studies? If you believe this is a very narrow area, please indicate so in the text. In addition, only 8 sources in total are referenced in the text of the manuscript.
Reply:
Thank you for this crucial observation. We have significantly expanded Section 2 ("Related Work and Theoretical Framework") to provide a more comprehensive overview of the state-of-the-art. We added specific subsections discussing "Data-Driven Electricity Theft Detection" and "Hybrid and Ensemble Methods," citing recent and relevant works (e.g., Badr et al., 2023; Mehdary et al., 2024; Javaid et al., 2025). We also included a comparison with baseline models (Logistic Regression and Random Forest) to better contextualize our contribution within existing literature.
2. A limitation of the study is that only final "bona fide" consumers are considered, while theft during the transmission of energy to the consumer (upstream of the meter) is not, although it sometimes reaches large proportions. Admittedly, intermediate meters are still installed for relatively few consumers across countries and the world. Please indicate whether you plan to apply your approach to this area of energy loss (theft).
Reply:
We agree that transmission-level theft is a significant issue. We have added a clarification in the Introduction (Section 1) explicitly stating that this work focuses on "last mile" consumption due to data availability. We acknowledged that while NTLs occur during transmission, the lack of intermediate metering infrastructure in the target region limits the immediate application of this specific data-driven approach to that segment. Our framework is designed for end-consumer smart meter data.
3. It would be interesting to see the practical application of this approach. After all, in the analyzed dataset fraudsters were identified through random checks and therefore not exhaustively. That is, training took place partly on data labeled non-fraudulent only because the consumer, a likely fraudster, was never actually inspected.
Reply:
This is a valid point regarding the nature of fraud datasets (Positive-Unlabeled learning). We have addressed this in Section 3.2 (Data Description and Preparation), acknowledging that the "Non-Fraud" class likely contains undetected fraud cases. We clarified that we rely on the utility's verified ground truth for supervised training. Furthermore, to demonstrate practical application, we expanded Section 4.3 to include a "Lift and Cost-Benefit Analysis," showing that targeting the top 5% of risky customers captures approximately 80% of fraud, demonstrating operational value despite the labeling limitations.
4. Include all abbreviations used in the text in the Abbreviations list, for example: SVM and CNN, ASGD and CSGD.
Reply:
The Abbreviations list has been updated to include all acronyms used in the text, including SVM, CNN, ASGD, CSGD, AMI, and IF (Isolation Forest), ensuring consistency throughout the manuscript.
5. Describe the variables "N" (line 72) and "n" (formulas 1 and 2) in the text.
Reply:
We have updated Section 2.3.1 (XGBoost) to explicitly define these variables. N now denotes the total number of samples in the dataset, and n denotes the number of estimators (trees) or instances, depending on the summation context, ensuring mathematical clarity in the formulas presented.
6. Line 363: "an Isolation Forest model was applied to generate an anomaly score." Describe in detail the Isolation Forest model.
Reply:
We have added a new subsection, Section 2.3.2 (Unsupervised Anomaly Detection: Isolation Forest), which details the theoretical and mathematical basis of the model. This section now explains how the anomaly score s(x, n) and the average path length c(n) are calculated, providing the necessary depth to understand how the ANOMALY_SCORE_ISO feature is generated.
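For readers of this response, the standard definitions presented in the new subsection (from Liu et al.'s original Isolation Forest formulation) are:

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}, \qquad
H(i) \approx \ln(i) + \gamma
```

where $h(x)$ is the path length of point $x$ in a single tree, $E[h(x)]$ its average over the ensemble, $n$ the sub-sampling size, and $\gamma \approx 0.5772$ the Euler–Mascheroni constant; scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.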
7. Perhaps you could indicate at least the main categories of clients (tariffs) in the manuscript?
Reply:
We have clarified in Section 3.3 (Feature Engineering) that the TARIFA variable uses One-Hot Encoding to represent the main customer categories, specifically citing "Commercial" and "Residential" as primary examples within the dataset.
8. For clarity, it is worth adding to Table 1 the indicators at thresholds 0.65 and 0.70.
Reply:
Table 1 has been updated as requested. We have added the performance metrics (Precision, Recall, F1-Score) for thresholds 0.65 and 0.70. This addition illustrates the trade-off between precision and recall, showing that higher thresholds yield higher precision but significantly lower recall, which informs the operational decision to select 0.60 as the optimal threshold.
9. Line 324: "Year-over-year percentage variation in FP consumption for the same month, useful for capturing seasonality." It is right that the authors took the seasonality of consumption into account, but please describe the meaning and impact of this indicator in more detail.
Reply:
We have expanded the description of this feature in Section 3.3. We explained that the Year-over-year percentage variation is critical for neutralizing natural annual cycles. By comparing the same month across years, we filter out expected seasonal behavior, allowing the model to highlight genuine anomalies that deviate from the user's historical seasonal pattern.
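The year-over-year feature described in this reply can be sketched in a few lines; the column names and values below are illustrative, not the paper's schema.

```python
import pandas as pd

# Two years of monthly consumption for one hypothetical customer whose
# usage collapses partway through year two.
df = pd.DataFrame({
    "month": pd.period_range("2021-01", periods=24, freq="M"),
    "kwh": [300, 280, 260, 240, 220, 210, 200, 205, 230, 250, 270, 290,
            295, 275, 255, 120, 110, 105, 100, 102, 115, 125, 135, 145],
})
# Percentage change versus the same month one year earlier: natural
# seasonal cycles cancel out, so a sustained negative value flags a
# deviation from the customer's own seasonal pattern.
df["yoy_pct"] = df["kwh"].pct_change(12) * 100
# e.g. April 2022 vs. April 2021: (120 - 240) / 240 = -50%
```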
10. In my opinion, a diagram (graph) of abnormal and average consumption would very clearly demonstrate the study.
Reply:
We have included a new figure, Figure 2, titled "Temporal comparison of average consumption: Normal vs. Fraudster (2019-2022)". This graph visually demonstrates the distinct consumption drop exhibited by fraudulent users starting in 2020-2021 compared to the stable trend of normal users, providing clear visual evidence for the behavior captured by the model.
11. The reference in the text to Figure 3 appears after the figure itself.
Reply:
We have reviewed the LaTeX code to ensure that all figures are referenced in the text prior to their placement, correcting the flow to meet the journal's formatting standards.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors addressed all comments.
Author Response
The changes were made.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript is markedly improved. The authors have made corrections and provided explanations in accordance with my comments.
There are 16 sources in References, but the text of the article only references 8.
Author Response
The references were updated.

