Identifying Suitability for Data Reduction in Imbalanced Time-Series Datasets
Abstract
1. Introduction
1.1. Related Works
1.2. Research Gap and Contributions
- This paper introduces methods of data reduction for time-series data based on previously established techniques for 2D image data;
- This paper shows, through experimentation, the benefits and drawbacks of varying amounts of data reduction on time-series data;
- This paper compares data reduction applied only to the larger of the imbalanced classes with data reduction applied to the entire dataset, to identify the effects of data undersampling in conjunction with our novel data reduction strategies;
- This paper shows the correlation between class density and model performance after data reduction to show how data reduction may be suitable for a given dataset;
- This paper shows the suitability of dataset fusion for occupancy datasets, in combination with data reduction.
2. Materials and Methods
2.1. Dataset Preparation and Fusion
2.2. Centroid Distance Calculation
Algorithm 1. Centroid distance calculation
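As an illustration of what a class centroid distance can look like in code, the following is a minimal sketch. It assumes that a class centroid is the per-feature mean of the datapoints in that class and that distance is Euclidean; both choices, and the function names `class_centroids` and `centroid_distances`, are assumptions made for illustration and are not taken from Algorithm 1.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Return {label: centroid}, where a centroid is the per-feature mean of a class."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def centroid_distances(features, labels):
    """Euclidean distance from each datapoint to the centroid of its own class (assumed metric)."""
    centroids = class_centroids(features, labels)
    return [
        math.dist(row, centroids[label])
        for row, label in zip(features, labels)
    ]


# Toy example: two classes with two datapoints each.
example_features = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]]
example_labels = [0, 0, 1, 1]
example_distances = centroid_distances(example_features, example_labels)
```

In this toy example the centroids are (1, 0) for class 0 and (10, 12) for class 1, so each datapoint receives a single scalar distance that the reduction strategies can then rank or bin.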
2.3. Data Reduction Strategies
- Random exclusion—random datapoints are removed from the training set.
- Central exclusion—datapoints with the smallest class centroid distance are removed.
- Lateral exclusion—datapoints with the largest class centroid distance are removed.
- Data even—datapoints from the largest density of class centroid distances are removed. This effectively cuts the top off the tallest columns in the centroid distribution plots.
- Data squash—an amount of datapoints proportional to the density of each of 10 bins of data is removed from each bin. This effectively flattens all columns in the centroid distribution plots, proportionally to the size of each column.
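The five strategies listed above can be sketched as functions that take the class centroid distances of one class and return the indices of the datapoints to keep. This is a hedged sketch rather than the implementation used in the experiments: the function names, the equal-width binning, the random choice of which datapoint leaves a bin, and the rounding down of removal counts are all assumptions.

```python
import random


def _bin_indices(distances, n_bins=10):
    """Group datapoint indices into equal-width bins of centroid distance (assumed binning)."""
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all distances are equal
    bins = [[] for _ in range(n_bins)]
    for i, d in enumerate(distances):
        bins[min(int((d - lo) / width), n_bins - 1)].append(i)
    return bins


def random_exclusion(distances, fraction, seed=0):
    """Remove a random subset of datapoints."""
    rng = random.Random(seed)
    n = len(distances)
    removed = set(rng.sample(range(n), int(n * fraction)))
    return [i for i in range(n) if i not in removed]


def central_exclusion(distances, fraction):
    """Remove the datapoints with the smallest centroid distance."""
    k = int(len(distances) * fraction)
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return sorted(order[k:])


def lateral_exclusion(distances, fraction):
    """Remove the datapoints with the largest centroid distance."""
    k = int(len(distances) * fraction)
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return sorted(order[:len(order) - k])


def data_even(distances, fraction, n_bins=10, seed=0):
    """Remove datapoints one at a time from whichever bin is currently tallest."""
    rng = random.Random(seed)
    bins = _bin_indices(distances, n_bins)
    for members in bins:
        rng.shuffle(members)  # which datapoint leaves a bin is random (assumption)
    for _ in range(int(len(distances) * fraction)):
        max(bins, key=len).pop()
    return sorted(i for members in bins for i in members)


def data_squash(distances, fraction, n_bins=10, seed=0):
    """Remove the same fraction from every bin, so taller bins lose more datapoints."""
    rng = random.Random(seed)
    keep = []
    for members in _bin_indices(distances, n_bins):
        rng.shuffle(members)
        keep.extend(members[int(len(members) * fraction):])
    return sorted(keep)
```

Because `data_squash` rounds down within each bin, it can remove slightly fewer datapoints than the requested fraction; a production version would need to decide how to distribute that remainder.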
2.4. Class Density Calculation
2.5. Metrics and Model
- Random Forest Algorithm (RF)
  - Maximum depth: unlimited;
  - Number of estimators: 100.
- XGBoost
  - Maximum depth: unlimited;
  - Number of estimators: 100;
  - Tree method: ‘approx’.
- Convolutional Neural Network (CNN)
  - Layer configuration: 3 convolutional layers with batch normalization; 2 fully connected final layers;
  - Learning rate: 0.001;
  - Optimiser: Adam;
  - Loss function: Binary cross-entropy;
  - Data window size: 6 datapoints.
- Long Short-Term Memory Network (LSTM)
  - Number of layers: 4 LSTM layers, 1 fully connected layer;
  - Hidden layer size: 250;
  - Bidirectional: False;
  - Learning rate: 0.001;
  - Optimiser: Adam;
  - Loss function: Binary cross-entropy;
  - Data window size: 6 datapoints.
2.6. Hardware and Power Calculation
- CPU: Intel i7-11700k;
- RAM: 16 GB DDR4;
- OS: Windows 10.
3. Results
3.1. Experiments on Individual Sites
3.1.1. Experimental Benchmark
3.1.2. Site Alpha
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.268 | 0.182 | 0.616 | 0.794 | 0.329 |
10% | 0.971 | 0.255 | 0.906 | 0.652 | 0.625 |
25% | 0.74 | 0.972 | 0.688 | 0.482 | 0.673 |
50% | 0.605 | 0.731 | 0.948 | 0.341 | 0.239 |
Max | 2.25 * | 2.63 * | 1.49 * | 1.59 * | 1.18 * |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.273 | 7.04 * | 4.42 * | 5.92 | 5.56 * |
10% | 3.51 * | 0.267 | 1.44 * | 0.16 | 3.48 * |
25% | 1.71 * | 5.21 * | 1.21 * | 1.98 * | 2.83 * |
50% | 3.96 * | 2.56 * | 2.64 * | 7.09 * | 2.11 * |
Max | 1.85 * | 3.14 * | 1.25 * | 1.73 * | 3.52 * |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.707 | 0.712 | 0.670 | 0.687 | 0.704 | 1.565 | 1.565 | 1.582 | 1.565 | 1.551 |
10% | 0.739 | 0.721 | 0.722 | 0.756 | 0.740 | 1.565 | 1.562 | 1.552 | 1.511 | 1.550 |
25% | 0.812 | 0.875 | 0.848 | 0.848 | 0.882 | 1.507 | 1.485 | 1.482 | 1.389 | 1.447 |
50% | 1.112 | 1.185 | 1.176 | 1.161 | 1.133 | 1.324 | 1.308 | 1.315 | 1.112 | 1.311 |
Max | 1.782 | 1.734 | 1.725 | 1.796 | 1.621 | 1.008 | 1.010 | 0.977 | 0.680 | 0.968 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.720 | 0.706 | 0.719 | 0.643 | 0.707 | 1.549 | 1.555 | 1.579 | 1.591 | 1.572 |
10% | 0.782 | 0.669 | 0.634 | 0.704 | 0.709 | 1.565 | 1.579 | 1.598 | 1.532 | 1.557 |
25% | 0.792 | 0.752 | 0.781 | 0.635 | 0.622 | 1.562 | 1.573 | 1.580 | 1.531 | 1.597 |
50% | 0.707 | 0.716 | 0.735 | 0.595 | 0.633 | 1.571 | 1.569 | 1.582 | 1.458 | 1.579 |
Max | 0.766 | 0.804 | 0.850 | 0.554 | 0.541 | 1.556 | 1.571 | 1.622 | 1.340 | 1.594 |
3.1.3. Site Beta
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.955 | 0.684 | 0.591 | 0.258 | 0.325 |
10% | 0.929 | 0.486 | 0.32 | 0.672 | 0.376 |
25% | 4.65 * | 4.82 * | 0.297 | 0.483 | 0.251 |
Max | 3.47 * | 0.589 | 5.31 | 1.71 * | 4.05 * |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 7.27 | 0.151 | 0.349 | 0.657 | 0.368 |
10% | 0.784 | 3.04 * | 0.389 | 0.564 | 2.12 * |
25% | 0.142 | 0.111 | 7.13 | 0.247 | 0.523 |
Max | 1.80 * | 0.124 | 1.38 * | 1.00 * | 9.75 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 1.111 | 1.132 | 1.112 | 1.114 | 1.126 | 1.172 | 1.153 | 1.164 | 1.157 | 1.149 |
10% | 1.144 | 1.157 | 1.162 | 1.149 | 1.158 | 1.137 | 1.146 | 1.146 | 1.119 | 1.120 |
25% | 1.277 | 1.276 | 1.279 | 1.263 | 1.298 | 1.055 | 1.047 | 1.035 | 1.004 | 1.021 |
Max | 1.383 | 1.345 | 1.371 | 1.345 | 1.387 | 0.963 | 1.002 | 0.970 | 0.917 | 0.952 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 1.083 | 1.097 | 1.087 | 1.091 | 1.071 | 1.195 | 1.180 | 1.198 | 1.180 | 1.189 |
10% | 1.087 | 1.088 | 1.094 | 1.062 | 1.079 | 1.191 | 1.194 | 1.192 | 1.187 | 1.177 |
25% | 1.057 | 1.078 | 1.085 | 1.047 | 1.061 | 1.205 | 1.198 | 1.186 | 1.156 | 1.185 |
Max | 1.085 | 1.138 | 1.114 | 1.052 | 1.068 | 1.190 | 1.184 | 1.218 | 1.126 | 1.174 |
3.1.4. Site Charlie
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.14 | 4.96 * | 3.08 * | 1.18 * | 0.215 |
10% | 2.05 * | 7.93 * | 6.49 * | 2.93 * | 8.29 * |
25% | 2.96 * | 2.23 * | 6.12 * | 1.41 * | 8.71 * |
50% | 6.60 * | 9.01 * | 1.74 * | 2.63 * | 3.41 * |
Max | 7.02 | 0.389 | 4.31 * | 0.365 | 2.89 * |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.138 | 0.116 | 0.299 | 1.71 * | 1.23 * |
10% | 0.233 | 0.461 | 0.3 | 0.396 | 0.924 |
25% | 0.863 | 0.705 | 0.994 | 2.66 * | 0.804 |
50% | 2.42 * | 0.255 | 0.194 | 6.01 * | 2.02 * |
Max | 3.48 * | 5.56 * | 2.32 * | 9.48 * | 3.73 * |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.663 | 0.662 | 0.664 | 0.663 | 0.661 | 1.512 | 1.524 | 1.519 | 1.511 | 1.513 |
10% | 0.689 | 0.689 | 0.690 | 0.693 | 0.695 | 1.492 | 1.493 | 1.491 | 1.487 | 1.489 |
25% | 0.794 | 0.793 | 0.793 | 0.791 | 0.791 | 1.449 | 1.423 | 1.421 | 1.403 | 1.414 |
50% | 1.046 | 1.033 | 1.046 | 1.048 | 1.042 | 1.260 | 1.271 | 1.222 | 1.202 | 1.251 |
Max | 1.458 | 1.466 | 1.460 | 1.457 | 1.467 | 0.985 | 0.981 | 0.991 | 0.891 | 0.957 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.636 | 0.637 | 0.640 | 0.637 | 0.636 | 1.531 | 1.541 | 1.531 | 1.529 | 1.530 |
10% | 0.638 | 0.639 | 0.639 | 0.632 | 0.637 | 1.534 | 1.532 | 1.531 | 1.525 | 1.528 |
25% | 0.635 | 0.635 | 0.638 | 0.636 | 0.635 | 1.538 | 1.537 | 1.526 | 1.512 | 1.522 |
50% | 0.639 | 0.636 | 0.635 | 0.633 | 0.633 | 1.533 | 1.538 | 1.526 | 1.489 | 1.512 |
Max | 0.634 | 0.646 | 0.638 | 0.631 | 0.631 | 1.556 | 1.563 | 1.554 | 1.445 | 1.509 |
3.1.5. Site Delta
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.704 | 8.17 | 0.711 | 0.396 | 0.287 |
10% | 0.199 | 0.35 | 7.70 | 0.582 | 1.59 * |
25% | 0.949 | 0.168 | 4.32 * | 0.757 | 0.212 |
50% | 0.415 | 0.553 | 0.807 | 0.473 | 0.184 |
75% | 0.375 | 5.83 | 0.778 | 0.802 | 0.891 |
Max | 0.113 | 0.254 | 5.27 | 0.444 | 3.14 * |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.198 | 0.918 | 0.625 | 0.226 | 0.591 |
10% | 0.869 | 0.144 | 0.155 | 0.967 | 0.686 |
25% | 2.07 * | 3.73 * | 3.20 * | 5.53 | 2.61 * |
50% | 2.98 * | 1.00 * | 4.39 * | 1.59 * | 1.42 * |
75% | 1.70 * | 2.11 * | 4.50 * | 8.12 * | 1.02 * |
Max | 5.72 * | 5.59 * | 2.34 * | 4.75 * | 3.29 * |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.578 | 0.561 | 0.586 | 0.556 | 0.559 | 1.566 | 1.573 | 1.553 | 1.549 | 1.550 |
10% | 0.633 | 0.586 | 0.603 | 0.633 | 0.590 | 1.534 | 1.559 | 1.544 | 1.504 | 1.523 |
25% | 0.671 | 0.679 | 0.662 | 0.675 | 0.645 | 1.475 | 1.520 | 1.503 | 1.414 | 1.468 |
50% | 0.932 | 0.969 | 0.938 | 0.954 | 0.914 | 1.334 | 1.412 | 1.352 | 1.152 | 1.294 |
75% | 1.354 | 1.494 | 1.484 | 1.493 | 1.420 | 1.052 | 1.052 | 1.103 | 0.737 | 0.957 |
Max | 1.506 | 1.456 | 1.583 | 1.531 | 1.509 | 1.074 | 1.062 | 0.969 | 0.685 | 0.909 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.533 | 0.544 | 0.558 | 0.577 | 0.578 | 1.577 | 1.572 | 1.573 | 1.553 | 1.555 |
10% | 0.563 | 0.516 | 0.561 | 0.542 | 0.557 | 1.605 | 1.600 | 1.579 | 1.549 | 1.557 |
25% | 0.627 | 0.545 | 0.546 | 0.518 | 0.567 | 1.581 | 1.577 | 1.572 | 1.512 | 1.540 |
50% | 0.602 | 0.634 | 0.559 | 0.515 | 0.537 | 1.663 | 1.591 | 1.573 | 1.416 | 1.538 |
75% | 0.615 | 0.569 | 0.581 | 0.464 | 0.632 | 1.639 | 1.704 | 1.634 | 1.299 | 1.498 |
Max | 0.573 | 0.562 | 0.663 | 0.445 | 0.629 | 1.688 | 1.593 | 1.615 | 1.278 | 1.474 |
3.1.6. Site Epsilon
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.761 | 0.618 | 2.51 * | 0.376 | 0.919 |
10% | 6.61 | 0.373 | 0.541 | 0.844 | 0.522 |
25% | 0.745 | 0.765 | 0.207 | 0.173 | 0.991 |
50% | 0.186 | 1.91 * | 8.94 | 1.57 * | 0.282 |
Max | 1.03 * | 8.73 * | 1.79 * | 1.47 * | 1.37 * |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.856 | 0.513 | 0.676 | 0.513 | 0.593 |
10% | 0.639 | 0.447 | 6.03 | 0.127 | 0.997 |
25% | 3.89 * | 4.75 * | 0.203 | 0.105 | 1.51 * |
50% | 7.18 * | 2.48 * | 1.58 * | 1.21 * | 5.37 * |
Max | 2.01 * | 1.07 * | 3.55 * | 8.09 * | 5.19 * |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.862 | 0.932 | 0.878 | 0.909 | 0.907 | 1.383 | 1.378 | 1.391 | 1.363 | 1.370 |
10% | 0.944 | 0.946 | 0.956 | 0.969 | 0.954 | 1.377 | 1.341 | 1.369 | 1.325 | 1.340 |
25% | 1.061 | 1.041 | 1.079 | 1.115 | 1.050 | 1.319 | 1.302 | 1.279 | 1.207 | 1.265 |
50% | 1.406 | 1.400 | 1.458 | 1.406 | 1.456 | 1.200 | 1.124 | 1.168 | 0.963 | 1.097 |
Max | 1.834 | 1.790 | 1.813 | 1.836 | 1.814 | 1.008 | 0.977 | 0.973 | 0.692 | 0.935 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 0.906 | 0.898 | 0.876 | 0.884 | 0.871 | 1.394 | 1.386 | 1.403 | 1.382 | 1.387 |
10% | 0.850 | 0.874 | 0.873 | 0.877 | 0.877 | 1.406 | 1.386 | 1.386 | 1.372 | 1.381 |
25% | 0.886 | 0.883 | 0.882 | 0.883 | 0.875 | 1.392 | 1.379 | 1.417 | 1.335 | 1.367 |
50% | 0.858 | 0.863 | 0.883 | 0.878 | 0.874 | 1.404 | 1.443 | 1.422 | 1.262 | 1.365 |
Max | 0.864 | 0.873 | 0.883 | 0.884 | 0.876 | 1.349 | 1.446 | 1.417 | 1.183 | 1.371 |
3.1.7. Site Fazbear
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.191 | 0.297 | 0.182 | 0.101 | 3.72 * |
10% | 0.993 | 9.94 | 0.834 | 0.435 | 0.489 |
Max | 6.08 | 7.90 | 0.384 | 0.594 | 0.207 |
Reduction | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
5% | 0.534 | 0.574 | 0.138 | 0.18 | 0.568 |
10% | 0.596 | 0.476 | 6.91 | 5.22 | 0.47 |
Max | 0.727 | 0.742 | 0.669 | 0.442 | 0.929 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 1.221 | 1.216 | 1.208 | 1.190 | 1.213 | 0.918 | 0.925 | 0.926 | 0.931 | 0.918 |
10% | 1.208 | 1.172 | 1.194 | 1.158 | 1.169 | 0.945 | 0.952 | 0.944 | 0.950 | 0.949 |
Max | 1.174 | 1.163 | 1.153 | 1.154 | 1.151 | 0.956 | 0.958 | 0.966 | 0.952 | 0.959 |
| Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Data Reduced | R | C | L | E | S | R | C | L | E | S |
5% | 1.263 | 1.249 | 1.256 | 1.239 | 1.242 | 0.896 | 0.896 | 0.896 | 0.893 | 0.893 |
10% | 1.247 | 1.268 | 1.269 | 1.233 | 1.229 | 0.893 | 0.885 | 0.900 | 0.885 | 0.898 |
Max | 1.240 | 1.257 | 1.232 | 1.249 | 1.230 | 0.897 | 0.893 | 0.903 | 0.877 | 0.895 |
3.1.8. Discussion—Individual Site Datasets
3.2. Experiments on Fused Dataset
Number of Datapoints | Balanced Class Reduction | Total Dataset Reduction |
---|---|---|
2,816,518 | 55.972% | 38.864% |
Site | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
Alpha | 65.781% | 65.991% | 66.367% | 66.516% | 67.253% |
Beta | 75.323% | 75.803% | 75.014% | 74.929% | 75.320% |
Charlie | 67.267% | 67.361% | 67.207% | 67.176% | 67.469% |
Delta | 66.793% | 66.187% | 66.479% | 67.045% | 66.684% |
Epsilon | 69.228% | 68.611% | 68.387% | 68.777% | 69.090% |
Fazbear | 87.051% | 87.095% | 86.941% | 86.728% | 87.337% |
Site | Random | Central | Lateral | Even | Squash |
---|---|---|---|---|---|
Alpha | 72.403% | 72.648% | 73.022% | 73.028% | 73.546% |
Beta | 75.904% | 76.346% | 75.626% | 75.445% | 75.834% |
Charlie | 77.896% | 77.831% | 77.713% | 77.688% | 77.979% |
Delta | 74.268% | 73.250% | 74.039% | 74.539% | 74.072% |
Epsilon | 73.863% | 73.187% | 72.951% | 73.093% | 73.576% |
Fazbear | 86.914% | 86.969% | 86.807% | 86.599% | 87.220% |
Class 0 (Not Occupied) | | | | | Class 1 (Occupied) | | | | |
---|---|---|---|---|---|---|---|---|---|
R | C | L | E | S | R | C | L | E | S |
1.040 | 1.039 | 1.027 | 1.029 | 1.032 | 0.995 | 0.975 | 1.007 | 0.8869 | 0.980 |
3.3. Discussion—Fused Dataset
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
HVAC | Heating, ventilation and air conditioning |
T | Temperature |
H | Humidity |
VOC | Volatile organic compound |
ML | Machine learning |
AI | Artificial intelligence |
PCA | Principal component analysis |
AUC-ROC | Area under the receiver operating characteristic curve |
RF | Random Forest |
CNN | Convolutional Neural Network |
LSTM | Long Short-Term Memory |
KNN | K-Nearest Neighbour |
csv | Comma-Separated Value |
Site | Number of Datapoints | Original Number of Sensors/Derived Number of Features | Least Important Sensor | Class Balance Ratio (Not Occ:Occ) |
---|---|---|---|---|
Alpha | 147,750 | 5:15 | 4 | 20:80 |
Beta | 146,879 | 4:12 | N/A | 40:60 |
Charlie | 302,399 | 5:15 | 0 | 22:78 |
Delta | 146,879 | 5:15 | 4 | 21:79 |
Epsilon | 129,599 | 5:15 | 4 | 24:76 |
Fazbear | 328,319 | 4:12 | N/A | 47:53 |
Model | Accuracy | AUC-ROC |
---|---|---|
RF | 98.744% | 97.143% |
XGBoost | 95.128% | 93.054% |
CNN | 91.021% | 89.783% |
LSTM | 85.470% | 85.393% |
Site | Number of Datapoints | Class Balance (Not Occ:Occ) | Balanced Class Max Reduction | Total Dataset Reduction at Max Balancing | Class Density (Not Occ:Occ) |
---|---|---|---|---|---|
Alpha | 147,750 | 20:80 | 74.912% | 59.887% | 0.674:1.585 |
Beta | 146,879 | 40:60 | 34.599% | 20.918% | 1.068:1.201 |
Charlie | 302,399 | 22:78 | 72.111% | 56.386% | 0.639:1.532 |
Delta | 146,879 | 21:79 | 77.569% | 63.358% | 0.547:1.576 |
Epsilon | 129,599 | 24:76 | 67.755% | 51.235% | 0.886:1.392 |
Fazbear | 328,319 | 47:53 | 12.111% | 6.446% | 1.260:0.892 |
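The two reduction columns in the table above are related by simple arithmetic if, as the column names suggest, balancing means undersampling the larger class until it matches the smaller one. The sketch below works under that assumption; the function name `balancing_reductions` is hypothetical. Because it starts from the rounded class balance ratios, it reproduces the tabulated percentages only approximately.

```python
def balancing_reductions(minority_share):
    """Return (larger-class reduction, total dataset reduction) needed to balance two classes.

    minority_share is the fraction of datapoints in the smaller class, e.g. 0.20 for a 20:80 split.
    Assumes balancing removes datapoints only from the larger class until both classes are equal.
    """
    majority_share = 1.0 - minority_share
    # Fraction of the larger class that must be removed to match the smaller class.
    class_reduction = 1.0 - minority_share / majority_share
    # The same removal expressed as a fraction of the whole dataset.
    total_reduction = majority_share * class_reduction
    return class_reduction, total_reduction


# Site Alpha has a rounded class balance of 20:80.
alpha_class_reduction, alpha_total_reduction = balancing_reductions(0.20)
```

For Site Alpha this gives a 75% reduction of the larger class and a 60% reduction of the whole dataset, close to the tabulated 74.912% and 59.887%; the gap is attributable to the class balance ratio being rounded to whole percentages.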
Site | Accuracy | AUC-ROC |
---|---|---|
Alpha | 98.813% | 98.812% |
Beta | 99.613% | 99.613% |
Charlie | 99.755% | 99.612% |
Delta | 99.589% | 99.271% |
Epsilon | 99.692% | 99.574% |
Fazbear | 99.367% | 99.368% |
| Class Balancing Runtimes (s) | | | | | | No Class Balancing Runtimes (s) | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Reduction Amount | A | B | C | D | E | F | A | B | C | D | E | F |
None | 15 | 15 | 44 | 18 | 15 | 42 | 15 | 15 | 44 | 18 | 15 | 42 |
5% | 15 | 14 | 43 | 18 | 18 | 49 | 15 | 15 | 41 | 18 | 15 | 42 |
10% | 14 | 14 | 41 | 18 | 17 | 49 | 14 | 14 | 43 | 18 | 17 | 40 |
25% | 12 | 12 | 37 | 15 | 13 | - | 12 | 11 | 36 | 15 | 16 | - |
50% | 10 | - | 26 | 10 | 10 | - | 9 | - | 27 | 10 | 9 | - |
75% | - | - | - | 6 | - | - | - | - | - | 6 | - | - |
Max | 7 | 11 | 20 | 6 | 8 | 48 | 6 | 12 | 18 | 6 | 7 | 40 |
Number of Datapoints | Class Balance (Not Occ:Occ) | Class Density (Not Occ:Occ) |
---|---|---|
4,599,960 | 30:70 | 0.624:1.386 |
Site | Accuracy | AUC-ROC |
---|---|---|
Alpha | 82.576% | 59.915% |
Beta | 65.829% | 57.186% |
Charlie | 91.316% | 84.657% |
Delta | 84.504% | 61.632% |
Epsilon | 79.618% | 59.643% |
Fazbear | 69.956% | 71.670% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sanderson, D.; Kalganova, T. Identifying Suitability for Data Reduction in Imbalanced Time-Series Datasets. AI 2025, 6, 98. https://doi.org/10.3390/ai6050098