# Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Methodology

### 3.1. Multidimensional Bin Reduction

### 3.2. Data Set Test

## 4. Numerical Study

### 4.1. Data Set

### 4.2. Feasibility Test

### 4.3. Neural Network

### 4.4. Training Set Generation Using MdBR

### 4.5. Training Results

### 4.6. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 3.** Plotted results of the feasibility test using the data from May 2017. The MdBs are sorted by prevalence.

**Figure 4.** Size of the reduced training sets for different numbers of bins per feature. The number of distinct MdBs per data set equals the number of samples in the training set.

**Figure 5.** Distribution of the normalized output power in the training set after reduction with the respective number of bins per feature.

**Figure 6.** Comparison of the model average accuracy (**a**) and training time (**b**) using 10 to 80 bins per feature to reduce the training data set. Each model was trained 20 times.

**Figure 7.** Comparison of the model average accuracy using 3 to 20 bins per feature to reduce the training data set. The number of distinct MdBs per data set equals the number of samples. Each model was trained 20 times.

| Features/Channels | Abbreviation |
|---|---|
| AC real power | PwrMtrP_kW |
| Outdoor ambient temperature | SEWSAmbientTemp_C |
| Module temperature | SEWSModuleTemp_C |
| Plane-of-array irradiance | SEWSPOAIrrad_Wm2 |
| Inverter heatsink temperature | InvTempHeatsink_C |
| Inverter operating status | InvOpState |

| Artifact | Value |
|---|---|
| Period | May 2017 |
| Number of samples | 2,671,113 |
| Number of bins per feature | 50 |
| Number of possible MdBs | $1.56\times 10^{10}$ |
| Number of found MdBs | 108,733 |
| Max. samples in one MdB | 13,545 |
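The core of the reduction behind these numbers can be illustrated with a short sketch: each feature is discretized into a fixed number of equal-width bins, every sample is mapped to a multidimensional bin (MdB) given by its tuple of bin indices, and one representative sample is kept per occupied MdB. This is a minimal NumPy illustration under the assumption of equal-width binning and first-sample selection; `mdb_reduce` is a hypothetical helper name, not from the paper.

```python
import numpy as np

def mdb_reduce(X, y, n_bins=50):
    """Reduce (X, y) by multidimensional binning: discretize each
    feature into n_bins equal-width bins and keep the first sample
    that falls into each occupied multidimensional bin (MdB)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Equal-width bin edges per feature over its observed range.
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0  # guard against constant features
    # Bin index of every sample in every feature dimension.
    idx = np.floor((X - mins) / spans * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)  # maxima land in the last bin
    # Keep the first sample per occupied MdB (unique bin-index tuple).
    _, keep = np.unique(idx, axis=0, return_index=True)
    keep.sort()
    return X[keep], y[keep]
```

With 50 bins and 6 features this yields at most $50^6 \approx 1.56\times 10^{10}$ possible MdBs, matching the table above; the reduced set size equals the number of occupied MdBs.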

| Artifact | Value |
|---|---|
| Number of training samples | 83,508,520 |
| NRMSE | $3.63\pm 0.12\%$ |
| Training time | 4 h 24 min ± 1 h 42 min |
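The table reports accuracy as NRMSE. The paper's exact normalization is not reproduced in this excerpt; a common convention, assumed here for illustration, normalizes the root-mean-square error by the range of the observations and reports it as a percentage:

```python
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE normalized by the range of the observations, in percent.
    (Range normalization is an assumption; other conventions divide
    by the mean or standard deviation instead.)"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / (y_true.max() - y_true.min())
```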


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wibbeke, J.; Teimourzadeh Baboli, P.; Rohjans, S. Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach. *Energies* **2022**, *15*, 3092.
https://doi.org/10.3390/en15093092
