A Novel Feature-Engineered–NGBoost Machine-Learning Framework for Fraud Detection in Electric Power Consumption Data
Abstract
1. Introduction
Contributions of the Proposed Theft-Detection System
2. Literature Review
3. Proposed Methodology
3.1. Stage-1: Data Preprocessing
3.2. Stage-2: Data Class Balance and Feature Engineering
3.2.1. Data Class Balancing
3.2.2. Proposed Feature-Engineering Method
3.3. Stage-3: Model Training and Evaluation Stage
3.3.1. Performance Evaluation Metrics
3.3.2. NGBoost Classification Algorithm: Theoretical Background
4. Performance Evaluation of Proposed Classifier
4.1. Confusion Matrix of the Proposed Model
4.2. Outcomes Interpretability Using Tree SHAP Algorithm
5. Results and Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Guerrero-Prado, J.S.; Alfonso-Morales, W.; Caicedo-Bravo, E.F. A data analytics/big data framework for advanced metering infrastructure data. Sensors 2021, 21, 5650.
- Glauner, P.; Meira, J.A.; Valtchev, P.; State, R.; Bettinger, F. The challenge of non-technical loss detection using artificial intelligence: A survey. arXiv 2016, arXiv:1606.00626.
- Northeast Group. Electricity Theft and Non-Technical Losses: Global Markets, Solutions and Vendors. 2017. Available online: http://www.northeast-group.com/reports/Brochure-Electricity%20Theft%20&%20Non-Technical%20Losses%20-%20Northeast%20Group.pdf (accessed on 18 October 2021).
- Fei, K.; Li, Q.; Zhu, C. Non-technical losses detection using missing values’ pattern and neural architecture search. Int. J. Electr. Power Energy Syst. 2022, 134, 107410.
- Viegas, J.; Esteves, P.R.; Melicio, R.; Mendes, V.; Vieira, S.M. Solutions for detection of non-technical losses in the electricity grid: A review. Renew. Sustain. Energy Rev. 2017, 80, 1256–1268.
- Jaiswal, S.; Ballal, M.S. Fuzzy inference based electricity theft prevention system to restrict direct tapping over distribution line. J. Electr. Eng. Technol. 2020, 15, 1095–1106.
- Liao, C.; Ten, C.-W.; Hu, S. Strategic FRTU deployment considering cybersecurity in secondary distribution network. IEEE Trans. Smart Grid 2013, 4, 1264–1274.
- Hussain, S.; Mustafa, M.W.; Jumani, T.A.; Baloch, S.K.; Saeed, M.S. A novel unsupervised feature-based approach for electricity theft detection using robust PCA and outlier removal clustering algorithm. Int. Trans. Electr. Energy Syst. 2020, 30, e12572.
- Jeng, R.-S.; Kuo, C.-Y.; Ho, Y.-H.; Lee, M.-F.; Tseng, L.-W.; Fu, C.-L.; Liang, P.-F.; Chen, L.-J. Missing data handling for meter data management system. In Proceedings of the Fourth International Conference on Future Energy Systems, Berkeley, CA, USA, 21–24 May 2013; pp. 275–276.
- Roth, P.L.; Switzer, F.S. A Monte Carlo analysis of missing data techniques in a HRM setting. J. Manag. 1995, 21, 1003–1023.
- Rahman, M.G.; Islam, M.Z. Missing value imputation using a fuzzy clustering-based EM approach. Knowl. Inf. Syst. 2016, 46, 389–422.
- Jung, S.; Moon, J.; Park, S.; Rho, S.; Baik, S.W.; Hwang, E. Bagging ensemble of multilayer perceptrons for missing electricity consumption data imputation. Sensors 2020, 20, 1772.
- Efron, B. Missing data, imputation, and the bootstrap. J. Am. Stat. Assoc. 1994, 89, 463–475.
- Joenssen, D.W.; Bankhofer, U. Hot deck methods for imputing missing data. In Machine Learning and Data Mining in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2012; pp. 63–75.
- Allison, P.D. Missing Data; Sage Publications: Thousand Oaks, CA, USA, 2001.
- Glauner, P.; Boechat, A.; Dolberg, L.; State, R.; Bettinger, F.; Rangoni, Y.; Duarte, D. Large-scale detection of non-technical losses in imbalanced data sets. In Proceedings of the 2016 IEEE Power and Energy Society Innovative Smart Grid Technologies Conference (ISGT), Minneapolis, MN, USA, 6–9 September 2016; pp. 1–5.
- Hasan, N.; Toma, R.N.; Nahid, A.-A.; Islam, M.M.M.; Kim, J.-M. Electricity theft detection in smart grid systems: A CNN-LSTM based approach. Energies 2019, 12, 3310.
- Gunturi, S.K.; Sarkar, D. Ensemble machine learning models for the detection of energy theft. Electr. Power Syst. Res. 2021, 192, 106904.
- Buzau, M.M.; Tejedor-Aguilera, J.; Cruz-Romero, P.; Gomez-Exposito, A. Detection of non-technical losses using smart meter data and supervised learning. IEEE Trans. Smart Grid 2019, 10, 2661–2670.
- Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng. 2006, 30, 25–36.
- Adil, M.; Javaid, N.; Qasim, U.; Ullah, I.; Shafiq, M.; Choi, J.-G. LSTM and bat-based RUSBoost approach for electricity theft detection. Appl. Sci. 2020, 10, 4378.
- Jindal, A.; Dua, A.; Kaur, K.; Singh, M.; Kumar, N.; Mishra, S. Decision tree and SVM-based data analytics for theft detection in smart grid. IEEE Trans. Ind. Inform. 2016, 12, 1005–1016.
- Marimuthu, K.P.; Durairaj, D.; Srinivasan, S.K. Development and implementation of advanced metering infrastructure for efficient energy utilization in smart grid environment. Int. Trans. Electr. Energy Syst. 2018, 28, e2504.
- Saeed, M.S.; Mustafa, M.W.; Sheikh, U.U.; Jumani, T.A.; Mirjat, N.H. Ensemble bagged tree based classification for reducing non-technical losses in Multan electric power company of Pakistan. Electronics 2019, 8, 860.
- Yan, Z.; Wen, H. Electricity theft detection base on extreme gradient boosting in AMI. IEEE Trans. Instrum. Meas. 2021, 70, 2504909.
- Saeed, M.S.; Mustafa, M.W.; Sheikh, U.U.; Jumani, T.A.; Khan, I.; Atawneh, S.; Hamadneh, N.N. An efficient boosted C5.0 decision-tree-based classification approach for detecting non-technical losses in power utilities. Energies 2020, 13, 3242.
- Pereira, L.A.M.; Afonso, L.C.S.; Papa, J.P.; Vale, Z.A.; Ramos, C.C.O.; Gastaldello, D.S.; Souza, A.N. Multilayer perceptron neural networks training through charged system search and its application for non-technical losses detection. In Proceedings of the 2013 IEEE PES Conference on Innovative Smart Grid Technologies (ISGT Latin America), Sao Paulo, Brazil, 15–17 April 2013; pp. 1–6.
- Jokar, P.; Arianpoo, N.; Leung, V.C.M. Electricity theft detection in AMI using customers’ consumption patterns. IEEE Trans. Smart Grid 2015, 7, 216–226.
- Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377.
- Barua, S.; Islam, M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2014, 26, 405–425.
- Nagi, J.; Yap, K.S.; Tiong, S.K.; Ahmed, S.K.; Mohamad, M. Nontechnical loss detection for metered customers in power utility using support vector machines. IEEE Trans. Power Deliv. 2010, 25, 1162–1171.
- Punmiya, R.; Choe, S. Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing. IEEE Trans. Smart Grid 2019, 10, 2326–2329.
- Barandas, M.; Folgado, D.; Fernandes, L.; Santos, S.; Abreu, M.; Bota, P.; Liu, H.; Schultz, T.; Gamboa, H. TSFEL: Time series feature extraction library. SoftwareX 2020, 11, 100456.
- Razavi, R.; Gharipour, A.; Fleury, M.; Akpan, I. A practical feature-engineering framework for electricity theft detection in smart grids. Appl. Energy 2019, 238, 481–494.
- Mafarja, M.; Mirjalili, S. Whale optimization approaches for wrapper feature selection. Appl. Soft Comput. 2018, 62, 441–453.
- Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118.
- Hussain, S.; Mustafa, M.W.; Jumani, T.A.; Baloch, S.K.; Alotaibi, H.; Khan, I.; Khan, A. A novel feature engineered-CatBoost-based supervised machine learning framework for electricity theft detection. Energy Rep. 2021, 7, 4425–4436.
- Duan, T.; Avati, A.; Ding, D.Y.; Thai, K.K.; Basu, S.; Ng, A.Y.; Schuler, A. NGBoost: Natural gradient boosting for probabilistic prediction. arXiv 2020, arXiv:1910.03225.
- Seldon Technologies. Tree SHAP. 2019. Available online: https://docs.seldon.io/projects/alibi/en/stable/methods/TreeSHAP.html (accessed on 18 October 2021).
- Zheng, Z.; Yang, Y.; Niu, X.; Dai, H.-N.; Zhou, Y. Wide and deep convolutional neural networks for electricity-theft detection to secure smart grids. IEEE Trans. Ind. Inform. 2017, 14, 1606–1615.
- Sharawi, M.; Zawbaa, H.M.; Emary, E. Feature selection approach based on whale optimization algorithm. In Proceedings of the Ninth International Conference on Advanced Computational Intelligence (ICACI), Doha, Qatar, 4–6 February 2017; pp. 163–168.
- Leghari, Z.H.; Hassan, M.Y.; Said, D.M.; Memon, Z.A.; Hussain, S. An efficient framework for integrating distributed generation and capacitor units for simultaneous grid-connected and islanded network operations. Int. J. Energy Res. 2021, 45, 14920–14958.
- Leghari, Z.H.; Hassan, M.Y.; Said, D.M.; Jumani, T.A.; Memon, Z.A. A novel grid-oriented dynamic weight parameter based improved variant of Jaya algorithm. Adv. Eng. Softw. 2020, 150, 102904.
- Zhang, Y.; Li, T.; Na, G.; Li, G.; Li, Y. Optimized extreme learning machine for power system transient stability prediction using synchrophasors. Math. Probl. Eng. 2015, 2015, 529724.
- Messinis, G.; Hatziargyriou, N.D. Review of non-technical loss detection methods. Electr. Power Syst. Res. 2018, 158, 250–266.
- Pereira, J.; Saraiva, F. Convolutional neural network applied to detect electricity theft: A comparative study on unbalanced data handling techniques. Int. J. Electr. Power Energy Syst. 2021, 131, 107085.
- Asheghi, R.; Hosseini, S.A.; Saneie, M.; Shahri, A.A. Updating the neural network sediment load models using different sensitivity analysis methods: A regional application. J. Hydroinform. 2020, 22, 562–577.

| Problem Identified | Proposed Solution | 
|---|---|
| Missing and inconsistent entries in data [9,10,11,13,14,15] | Supervised ML-based random forest imputation technique [29] | 
| Data class imbalance [16,17,18,19,20] | Majority weighted minority oversampling technique algorithm [30] | 
| Irrelevant and redundant features [31,32] | Time series and statistical-technique-based novel feature extraction using TSFEL algorithm [33] | 
| High data dimensionality [26,34] | Feature selection using whale optimization algorithm [35] | 
| Model selection [34,36,37] | Natural gradient boosting trees algorithm [38] | 
| Model’s prediction interpretation | Tree SHAP additive explanations algorithm [39] | 
| Reliable evaluation | AUC, precision, recall, Matthews correlation coefficient, Cohen’s kappa |

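
The threshold-based metrics named in the last row of the table above can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the example counts are illustrative assumptions, not values or code from the paper. AUC is omitted because it requires ranked prediction scores rather than confusion-matrix counts.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from 2x2 confusion-matrix counts.

    Assumes both classes occur among the labels and the predictions,
    so that no denominator below is zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient: correlation between the
    # predicted and the true binary labels.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name:<10}{value:.3f}")
```

Unlike accuracy, MCC and kappa both account for agreement expected by chance, which is why they are commonly reported alongside precision and recall for imbalanced problems such as theft detection.
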
| Parameter Name | Description | Parameter Value |
|---|---|---|
| learning_rate | Weighting factor applied to each new tree added to the classifier at every boosting iteration. | 0.1 |
| n_estimators | The number of boosting iterations to be performed. | 100 |
| subsample | The fraction of samples used to fit each individual base learner. Tuning this parameter trades off bias against variance. | 0.5 |
| min_samples_split | The minimum number of samples required at an internal node before it can be split. This parameter helps control overfitting and underfitting. | 5 |
| min_samples_leaf | The minimum number of samples required at a leaf node. This parameter also helps control overfitting and underfitting. | 6 |
| max_depth | Sets the maximum depth of each regression tree, which shapes its structure. | 8 |
| max_features | Number of features considered when searching for the best split. | 15 |
| max_leaf_nodes | Caps the number of leaf nodes per tree; a well-chosen value helps reduce the impurity of the trees. | 6 |
| Tol | Tolerance that triggers early stopping when the loss no longer changes. | 0.20 |
| Base_learner | The base component of the multiple-classifier system. | Regression trees |
| Probability_distribution | Normal distribution for continuous output, and Bernoulli for binary output. | Bernoulli |
| Scoring_rule | Maximum likelihood or continuous ranked probability score. | Maximum likelihood estimation |

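
As a rough illustration of how the hyperparameter table above might translate into code, the sketch below maps the listed values onto the open-source `ngboost` package with a scikit-learn regression tree as base learner. The mapping itself is an assumption: the table's names do not all match `ngboost` argument names, so `subsample` is taken to correspond to `minibatch_frac`, `Tol` to `tol`, and maximum likelihood estimation to `LogScore`. The function name `build_classifier` is hypothetical, and this is not the authors' published implementation.

```python
# Tree-level settings from the table, passed to the base learner.
TREE_PARAMS = {
    "max_depth": 8,
    "min_samples_split": 5,
    "min_samples_leaf": 6,
    "max_features": 15,  # needs at least 15 input features
    "max_leaf_nodes": 6,
}

# Boosting-level settings from the table.
BOOSTING_PARAMS = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "minibatch_frac": 0.5,  # assumed counterpart of `subsample`
    "tol": 0.20,            # assumed counterpart of `Tol`
}


def build_classifier():
    """Build an NGBoost binary classifier from the settings above.

    Imports are local so the parameter dictionaries can be inspected
    without ngboost and scikit-learn installed.
    """
    from ngboost import NGBClassifier
    from ngboost.distns import Bernoulli
    from ngboost.scores import LogScore
    from sklearn.tree import DecisionTreeRegressor

    base_learner = DecisionTreeRegressor(criterion="friedman_mse", **TREE_PARAMS)
    return NGBClassifier(
        Dist=Bernoulli,   # binary output: theft vs. normal consumption
        Score=LogScore,   # negative log-likelihood, i.e. maximum likelihood
        Base=base_learner,
        **BOOSTING_PARAMS,
    )
```

With both packages installed, `build_classifier().fit(X_train, y_train)` followed by `predict_proba(X_test)` would return class probabilities, where `X_train`, `y_train`, and `X_test` are placeholder names for the engineered feature matrices and labels.
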
| Performance Metric | Fold-1 | Fold-2 | Fold-3 | Fold-4 | Fold-5 | Fold-6 | Fold-7 | Fold-8 | Fold-9 | Fold-10 | Mean | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.93 | 0.94 | 0.94 | 0.93 | 0.94 | 0.94 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 
| Recall | 0.92 | 0.91 | 0.90 | 0.92 | 0.93 | 0.93 | 0.92 | 0.90 | 0.92 | 0.91 | 0.91 | 
| Precision | 0.95 | 0.96 | 0.93 | 0.96 | 0.95 | 0.94 | 0.95 | 0.93 | 0.95 | 0.96 | 0.95 | 
| Kappa | 0.86 | 0.88 | 0.89 | 0.95 | 0.88 | 0.88 | 0.86 | 0.87 | 0.90 | 0.90 | 0.89 |
| F1-score | 0.93 | 0.91 | 0.90 | 0.89 | 0.90 | 0.91 | 0.93 | 0.94 | 0.93 | 0.94 | 0.92 |
| AUC | 0.94 | 0.96 | 0.97 | 0.97 | 0.96 | 0.97 | 0.93 | 0.96 | 0.97 | 0.98 | 0.96 | 
| MCC | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.88 | 0.95 | 0.87 | 0.86 | 0.87 | 0.88 | 
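
The Mean column of the ten-fold table above can be recomputed from the per-fold scores, and the spread across folds gives a quick view of how stable each metric is. The sketch below copies the rounded fold values from the table; the names `FOLD_SCORES` and `summarize` are illustrative assumptions.

```python
from statistics import mean, stdev

# Per-fold scores copied from the ten-fold cross-validation table.
FOLD_SCORES = {
    "Accuracy":  [0.93, 0.94, 0.94, 0.93, 0.94, 0.94, 0.93, 0.93, 0.93, 0.93],
    "Recall":    [0.92, 0.91, 0.90, 0.92, 0.93, 0.93, 0.92, 0.90, 0.92, 0.91],
    "Precision": [0.95, 0.96, 0.93, 0.96, 0.95, 0.94, 0.95, 0.93, 0.95, 0.96],
    "Kappa":     [0.86, 0.88, 0.89, 0.95, 0.88, 0.88, 0.86, 0.87, 0.90, 0.90],
    "F1-score":  [0.93, 0.91, 0.90, 0.89, 0.90, 0.91, 0.93, 0.94, 0.93, 0.94],
    "AUC":       [0.94, 0.96, 0.97, 0.97, 0.96, 0.97, 0.93, 0.96, 0.97, 0.98],
    "MCC":       [0.86, 0.87, 0.87, 0.87, 0.87, 0.88, 0.95, 0.87, 0.86, 0.87],
}


def summarize(scores: dict) -> dict:
    """Return mean, sample standard deviation, min, and max per metric."""
    return {
        name: {
            "mean": mean(values),
            "std": stdev(values),
            "min": min(values),
            "max": max(values),
        }
        for name, values in scores.items()
    }


summary = summarize(FOLD_SCORES)
for name, stats in summary.items():
    print(
        f"{name:<10} mean={stats['mean']:.3f} std={stats['std']:.3f} "
        f"range=[{stats['min']:.2f}, {stats['max']:.2f}]"
    )
```

Because the fold values in the table are already rounded to two decimals, a recomputed mean can differ from the listed one in the last digit. For example, the ten recall values average 0.916 while the table lists 0.91, which would be expected if the reported means were computed from unrounded fold scores.
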

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hussain, S.; Mustafa, M.W.; Al-Shqeerat, K.H.A.; Saeed, F.; Al-rimy, B.A.S. A Novel Feature-Engineered–NGBoost Machine-Learning Framework for Fraud Detection in Electric Power Consumption Data. Sensors 2021, 21, 8423. https://doi.org/10.3390/s21248423