Windowing as a Sub-Sampling Method for Distributed Data Mining
Abstract
1. Introduction
Algorithm 1 Windowing.
Require: The original training set
Ensure: The induced model
1: window ← sample from the training set
2: repeat
3:   model ← induce(window)
4:   misclassified ← examples of the training set misclassified by model
5:   window ← window ∪ misclassified
6: until misclassified = ∅
7: return model
2. Materials and Methods
2.1. Windowing in JaCa-DDM
- The dataset may be distributed across different sites, instead of the traditional approach based on a single dataset in a single site.
- The loop for collecting the misclassified examples to be added to the window is performed by a set of agents using copies of the model distributed among the available sites, in a round-robin fashion.
- The initial window is a stratified sample, instead of a random one.
- An auto-adjustable stopping criterion is combined with a configurable maximum number of iterations (a minimal sketch of the resulting loop follows this list).
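The following is a minimal, single-process sketch of this loop, assuming scikit-learn: DecisionTreeClassifier stands in for Weka's J48, and the agent-based, round-robin distribution of JaCa-DDM is deliberately elided, so this illustrates the sampling behavior rather than the JaCa-DDM implementation.

```python
# Minimal windowing sketch (single process; no agent/site distribution).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def windowing(X, y, make_model=DecisionTreeClassifier,
              init_frac=0.20, max_rounds=15, seed=0):
    # Initial window: a stratified sample of the training set.
    rest_X, win_X, rest_y, win_y = train_test_split(
        X, y, test_size=init_frac, stratify=y, random_state=seed)
    model = make_model().fit(win_X, win_y)
    for _ in range(max_rounds):
        if len(rest_y) == 0:
            break
        miss = model.predict(rest_X) != rest_y
        if not miss.any():  # classic stop: nothing left misclassified
            break
        # Add the misclassified examples to the window and retrain.
        win_X = np.vstack([win_X, rest_X[miss]])
        win_y = np.concatenate([win_y, rest_y[miss]])
        rest_X, rest_y = rest_X[~miss], rest_y[~miss]
        model = make_model().fit(win_X, win_y)
    return model, (win_X, win_y)
```

With make_model swapped for other estimators, the same loop drives the generalization experiments of Section 2.3.1.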
2.2. Datasets
2.3. Experiments
2.3.1. On the Generalization of Windowing
- Naive Bayes: a probabilistic classifier based on Bayes' theorem with a strong assumption of independence among attributes [14].
- jRip: an inductive rule learner based on RIPPER that builds a set of rules while minimizing the amount of error [15].
- Multilayer-perceptron: a multi-layer perceptron trained by backpropagation with sigmoid nodes, except for numeric classes, in which case the output nodes become unthresholded linear units [16].
- SMO: an implementation of John Platt's sequential minimal optimization algorithm for training a support vector classifier [17].
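As an illustration only, these scikit-learn estimators can stand in for the Weka implementations when reusing the windowing sketch from Section 2.1; jRip has no direct scikit-learn counterpart and is omitted, and the dataset and parameters below are placeholders, not the paper's configuration.

```python
# Hypothetical scikit-learn stand-ins for the Weka inducers.
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

INDUCERS = {
    "NB": GaussianNB,
    "MP": lambda: MLPClassifier(max_iter=1000),
    "SMO": lambda: SVC(kernel="linear"),  # SVC as a rough SMO analogue
}

# Note: sklearn's breast cancer data is not identical to the UCI Breast set.
X, y = load_breast_cancer(return_X_y=True)
for name, make_model in INDUCERS.items():
    model, (win_X, win_y) = windowing(X, y, make_model=make_model)
    print(name, "final window size:", len(win_y))
```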
2.3.2. On the Properties of Samples and Models Obtained by Windowing
- The model accuracy, defined as the percentage of correctly classified instances.
- The AUC, defined as the probability that a randomly chosen instance of the positive class is ranked above a randomly chosen instance of the negative class [18]. Although this measure was conceived for binary classification problems, Provost and Domingos [19] propose an implementation for multi-class problems based on the weighted average of the AUC metrics for every class, using a one-against-all approach; the weight for every AUC is the class's appearance frequency in the data (see the sketch following this list).
- The MDL principle states that the best model to infer from a dataset is the one that minimizes the sum of the length of the model and the length of the data when encoded using the model as a predictor for the data [20]. For decision trees, Quinlan and Rivest [21] propose the following definitions:
- L(H), the number of bits needed to encode a tree; and
- L(D|H), the number of bits needed to encode the data using the decision tree (the two are combined as shown after this list).
- The Kullback–Leibler divergence $D_{KL}(P \,\|\, Q)$ [22], defined as $D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$, where $P$ and $Q$ are the probability distributions being compared, e.g., the distribution observed in a sample versus that of the full dataset.
- The measure sim1 [23], a similarity between datasets defined as $sim1(D_1, D_2) = \frac{|Item(D_1) \cap Item(D_2)|}{|Item(D_1) \cup Item(D_2)|}$, where $Item(D)$ denotes the set of attribute-value pairs occurring in $D$ (also sketched after this list).
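The MDL score reported in the Results combines the two code lengths defined above:

$$\mathrm{MDL}(H, D) = L(H) + L(D \mid H)$$

where lower values indicate a better balance between the complexity of the tree $H$ and its fit to the data $D$.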
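As an illustration (not the paper's Weka-based code), scikit-learn's roc_auc_score computes exactly this weighted one-against-all AUC; the dataset and classifier below are placeholders.

```python
# Weighted one-against-all AUC for multi-class problems: per-class AUC,
# weighted by each class's appearance frequency in the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)

# 'ovr' builds the one-against-all subproblems; 'weighted' averages the
# per-class AUCs using class frequencies as weights.
auc = roc_auc_score(y_te, probs, multi_class="ovr", average="weighted")
print(f"weighted OVA AUC: {auc:.3f}")
```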
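A small sketch of the two dataset-comparison measures, assuming plain NumPy; the function names and the smoothing constant are illustrative assumptions, not the paper's code.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)), smoothed to avoid log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def sim1(items_a, items_b):
    # Jaccard-style similarity over sets of attribute-value pairs.
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b)

# e.g., a skewed sample distribution vs. the full dataset's distribution:
print(kl_divergence([0.7, 0.3], [0.5, 0.5]))
print(sim1({("color", "red"), ("size", 1)}, {("color", "red"), ("size", 2)}))
```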
- Without sampling, using all the available data to induce the model.
- By random sampling, where any instance has the same selection probability [24].
- By stratified random sampling, where the instances are subdivided by their class into subgroups, and the number of selected instances per subgroup is proportional to the frequency of the class in the original dataset [24].
- By balanced random sampling, where, as in stratified random sampling, the instances are subdivided by their class into subgroups, but the number of selected instances per subgroup is the sample size divided by the number of subgroups, so that every class contributes the same number of instances [24]. A minimal sketch of these baselines follows this list.
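A minimal sketch of the three sampling baselines, assuming NumPy; the function names and the per-class rounding policy are assumptions, not the paper's code.

```python
import numpy as np

def random_sampling(y, n, rng):
    # Uniform selection: every instance has the same probability.
    return rng.choice(len(y), size=n, replace=False)

def stratified_sampling(y, n, rng):
    # Per-class quota proportional to the class frequency in the data.
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        k = round(n * len(members) / len(y))  # may be off by 1 after rounding
        idx.extend(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.array(idx)

def balanced_sampling(y, n, rng):
    # Same quota for every class, regardless of its frequency.
    classes = np.unique(y)
    k = n // len(classes)
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.extend(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.array(idx)

# usage: idx = stratified_sampling(y, n=100, rng=np.random.default_rng(0))
```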
3. Results
- Generalization of the behavior of windowing, i.e., high accuracy correlating with fewer training examples used to induce the model, when inductive algorithms other than J48 are adopted.
- Informational properties of the samples obtained by different methods, based on the Kullback–Leibler divergence and the attribute-value similarity.
- Properties of the models induced from the samples, in terms of their size, complexity, and data compression, which provides information about their data-fitting capacity.
- Predictive performance of the induced models in terms of accuracy and the AUC.
- Statistical tests on the significance of the gains produced by windowing in terms of the above metrics.
3.1. Windowing Generalization
3.2. Sample Properties
3.3. Model Complexity and Data Compression
3.4. Predictive Performance
3.5. Statistical Tests
4. Conclusions
- Adopting metrics for detecting relevant, noisy, and redundant instances to enhance the quality and size of the obtained samples, in order to improve the performance of the obtained models. Maillo et al. [30] review multiple metrics to describe the redundancy, complexity, and density of a problem, and also propose two big data metrics. Such metrics may be helpful to select instances that provide quality information.
- Studying the evolution of windows over time can offer more insights about the behavior of windowing. The main difficulty here is adapting some of the used metrics, e.g., MDL, to be used with models that are not decision trees.
- Dealing with datasets of higher dimensions. Melgoza-Gutiérrez et al. [31] propose an agents & artifacts-based method to distribute vertical partitions of datasets and deal with the growing time complexity when datasets have a high number of attributes. The understanding of windowing achieved here is expected to contribute to combining both approaches.
- Applying windowing to real problems. Limón et al. [10] apply windowing to the segmentation of colposcopic images presenting possible precancerous cervical lesions, exploiting it to distribute the computational cost of processing a dataset with a large number of instances and 30 attributes. The exploitation of windowing to cope with learning problems of a distributed nature remains to be explored.
Author Contributions
Funding
Conflicts of Interest
Appendix A. Results of Accuracy without Using Windowing
Dataset | J48 | NB | jRip | MP | SMO |
---|---|---|---|---|---|
Adult | 85.98 ± 0.28 | 83.24 ± 0.19 | 84.65 ± 0.16 | na | na |
Australian | 87.10 ± 0.65 | 85.45 ± 1.57 | 84.44 ± 1.78 | 83.10 ± 1.28 | 86.71 ± 1.43 |
Breast | 96.16 ± 0.38 | 97.84 ± 0.51 | 95.03 ± 0.89 | 96.84 ± 0.77 | 96.67 ± 0.40 |
Credit-g | 73.59 ± 2.11 | 75.59 ± 1.04 | 73.45 ± 1.96 | 73.10 ± 0.72 | 76.66 ± 2.87 |
Diabetes | 72.95 ± 0.77 | 75.83 ± 1.17 | 78.27 ± 1.81 | 74.51 ± 1.46 | 78.02 ± 1.79 |
Ecoli | 84.44 ± 1.32 | 83.5 ± 1.64 | 82.25 ± 3.11 | 83.69 ± 1.44 | 83.93 ± 1.31 |
German | 73.89 ± 1.59 | 76.94 ± 2.29 | 70.06 ± 0.90 | 70.26 ± 0.96 | 74.55 ± 1.76 |
Hypothyroid | 99.48 ± 0.20 | 95.72 ± 0.68 | 99.60 ± 0.15 | 94.38 ± 0.25 | 94.01 ± 0.48 |
Kr-vs-kp | 99.31 ± 0.06 | 87.68 ± 0.43 | 99.37 ± 0.29 | 99.06 ± 0.13 | 96.67 ± 0.37 |
Letter | 87.81 ± 0.10 | 64.33 ± 0.28 | 86.34 ± 0.22 | na | na |
Mushroom | 100.0 ± 0.00 | 95.9 ± 0.32 | 100.0 ± 0.00 | 100.0 ± 0.00 | 100.0 ± 0.00 |
Poker-lsn | 99.79 ± 0.00 | 59.33 ± 0.03 | na | na | na |
Segment | 96.02 ± 0.29 | 79.95 ± 0.69 | 95.25 ± 0.52 | 95.61 ± 0.91 | 92.97 ± 0.36 |
Sick | 98.88 ± 0.29 | 93.13 ± 0.43 | 98.19 ± 0.22 | 95.81 ± 0.45 | 93.70 ± 0.56 |
Splice | 93.81 ± 0.39 | 95.05 ± 0.36 | 94.19 ± 0.27 | na | 93.46 ± 0.48 |
Waveform5000 | 75.58 ± 0.37 | 80.25 ± 0.33 | 79.54 ± 0.37 | na | 86.81 ± 0.21 |
References
- Quinlan, J.R. Induction over Large Data Bases; Technical Report STAN-CS-79-739; Computer Science Department, School of Humanities and Sciences, Stanford University: Stanford, CA, USA, 1979.
- Quinlan, J.R. Learning efficient classification procedures and their application to chess end games. In Machine Learning; Michalski, R.S., Carbonell, J.G., Mitchell, T.M., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1983; Volume I, Chapter 15; pp. 463–482.
- Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
- Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, CA, USA, 1993; Volume 1.
- Quinlan, J.R. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90.
- Wirth, J.; Catlett, J. Experiments on the Costs and Benefits of Windowing in ID3. In Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI, USA, 12–14 June 1988; Laird, J.E., Ed.; Morgan Kaufmann: San Mateo, CA, USA, 1988; pp. 87–99.
- Fürnkranz, J. Integrative windowing. J. Artif. Intell. Res. 1998, 8, 129–164.
- Quinlan, J.R. Learning Logical Definitions from Relations. Mach. Learn. 1990, 5, 239–266.
- Limón, X.; Guerra-Hernández, A.; Cruz-Ramírez, N.; Grimaldo, F. Modeling and implementing distributed data mining strategies in JaCa-DDM. Knowl. Inf. Syst. 2019, 60, 99–143.
- Limón, X.; Guerra-Hernández, A.; Cruz-Ramírez, N.; Acosta-Mesa, H.G.; Grimaldo, F. A Windowing Strategy for Distributed Data Mining Optimized through GPUs. Pattern Recognit. Lett. 2017, 93, 23–30.
- Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2011.
- Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 29 June 2020).
- Bifet, A.; Holmes, G.; Kirkby, R.; Pfahringer, B. MOA: Massive Online Analysis. J. Mach. Learn. Res. 2010, 11, 1601–1604.
- John, G.H.; Langley, P. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–20 August 1995; Morgan Kaufmann: San Mateo, CA, USA, 1995; pp. 338–345.
- Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; Morgan Kaufmann: San Francisco, CA, USA, 1995; pp. 115–123.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations; MIT Press: Cambridge, MA, USA, 1986; pp. 318–362.
- Platt, J. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods: Support Vector Learning; Schoelkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1998.
- Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
- Provost, F.; Domingos, P. Well-Trained PETs: Improving Probability Estimation Trees; 2000. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.309 (accessed on 29 June 2020).
- Rissanen, J. Stochastic Complexity and Modeling. Ann. Stat. 1986, 14, 1080–1100.
- Quinlan, J.R.; Rivest, R.L. Inferring decision trees using the minimum description length principle. Inf. Comput. 1989, 80, 227–248.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Zhang, S.; Zhang, C.; Wu, X. Knowledge Discovery in Multiple Databases; Springer: London, UK, 2004.
- Ros, F.; Guillaume, S. Sampling Techniques for Supervised or Unsupervised Tasks; Springer: Cham, Switzerland, 2019.
- Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.
- Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
- Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92.
- Zar, J.H. Biostatistical Analysis, 5th ed.; Prentice-Hall: Upper Saddle River, NJ, USA, 2007.
- Iman, R.L.; Davenport, J.M. Approximations of the critical region of the Friedman statistic. Commun. Stat. Theory Methods 1980, 9, 571–595.
- Maillo, J.; Triguero, I.; Herrera, F. Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access 2020, 8, 87918–87928.
- Melgoza-Gutiérrez, J.; Guerra-Hernández, A.; Cruz-Ramírez, N. Collaborative Data Mining on a BDI Multi-Agent System over Vertically Partitioned Data. In Proceedings of the 13th Mexican International Conference on Artificial Intelligence, Tuxtla Gutiérrez, Mexico, 16–22 November 2014; Gelbukh, A., Castro-Espinoza, F., Galicia-Haro, S.N., Eds.; IEEE Computer Society: Los Alamitos, CA, USA, 2014; pp. 215–220.
Parameters of the windowing strategy.

Parameter | Value |
---|---|
Classifier | J48 |
Pruning | True |
Number of nodes | 8 |
Maximum number of rounds | 15 |
Initial percentage for the window | 0.20 |
Validation percentage for the test | 0.25 |
Change step of accuracy every round | 0.35 |
Description of the datasets.

Dataset | Instances | Attributes | Attribute Type | Missing Values | Classes |
---|---|---|---|---|---|
Adult | 48842 | 15 | Mixed | Yes | 2 |
Australian | 690 | 15 | Mixed | No | 2 |
Breast | 683 | 10 | Numeric | No | 2 |
Diabetes | 768 | 9 | Mixed | No | 2 |
Ecoli | 336 | 8 | Numeric | No | 8 |
German | 1000 | 21 | Mixed | No | 2 |
Hypothyroid | 3772 | 30 | Mixed | Yes | 4 |
Kr-vs-kp | 3196 | 37 | Numeric | No | 2 |
Letter | 20000 | 17 | Mixed | No | 26 |
Mushroom | 8124 | 23 | Nominal | Yes | 2 |
Poker-lsn | 829201 | 11 | Mixed | No | 10 |
Segment | 2310 | 20 | Numeric | No | 7 |
Sick | 3772 | 30 | Mixed | Yes | 2 |
Splice | 3190 | 61 | Nominal | No | 3 |
Waveform5000 | 5000 | 41 | Numeric | No | 3 |
Accuracy of the models induced with windowing.

Dataset | J48 | NB | jRip | MP | SMO |
---|---|---|---|---|---|
Adult | 86.17 ± 0.55 | 84.54 ± 0.62 | na | na | na |
Australian | 85.21 ± 4.77 | 85.79 ± 4.25 | 85.94 ± 3.93 | 81.74 ± 6.31 | 85.80 ± 4.77 |
Breast | 94.42 ± 3.97 | 97.21 ± 2.34 | 95.31 ± 2.75 | 95.45 ± 3.14 | 96.33 ± 3.12 |
Diabetes | 73.03 ± 3.99 | 76.03 ± 4.33 | 71.74 ± 7.67 | 72.12 ± 4.00 | 76.04 ± 3.51 |
Ecoli | 82.72 ± 6.81 | 83.93 ± 7.00 | 81.22 ± 6.63 | 82.12 ± 7.49 | 84.53 ± 4.11 |
German | 71.10 ± 5.40 | 75.20 ± 2.82 | 70.20 ± 3.85 | 69.60 ± 4.84 | 75.80 ± 3.12 |
Hypothyroid | 99.46 ± 0.17 | 95.36 ± 0.99 | 99.23 ± 0.48 | 92.26 ± 2.75 | 94.30 ± 0.53 |
Kr-vs-kp | 99.15 ± 0.66 | 96.65 ± 0.84 | 98.46 ± 0.95 | 98.72 ± 0.54 | 96.62 ± 0.75 |
Letter | 85.79 ± 1.24 | 69.28 ± 1.26 | 85.31 ± 1.06 | na | na |
Mushroom | 100.00 ± 0.00 | 99.80 ± 0.16 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.0 ± 0.00 |
Poker-lsn | 99.75 ± 0.07 | 60.02 ± 0.42 | na | na | na |
Segment | 96.53 ± 1.47 | 84.24 ± 1.91 | 95.54 ± 1.55 | 96.10 ± 1.15 | 92.42 ± 1.87 |
Sick | 98.64 ± 0.53 | 96.34 ± 1.44 | 97.93 ± 0.95 | 96.32 ± 1.04 | 96.71 ± 0.77 |
Splice | 94.04 ± 0.79 | 95.32 ± 1.07 | 92.75 ± 2.11 | na | 92.41 ± 1.34 |
Waveform5000 | 73.06 ± 2.55 | 82.36 ± 1.64 | 77.02 ± 1.59 | na | 85.94 ± 1.32 |
Proportion of the training set used by windowing to induce the model, per inducer.

Dataset | J48 | NB | jRip | MP | SMO |
---|---|---|---|---|---|
Adult | 0.30 ± 0.01 | 0.21 ± 0.00 | na | na | na |
Australian | 0.31 ± 0.02 | 0.25 ± 0.01 | 0.33 ± 0.02 | 0.39 ± 0.04 | 0.27 ± 0.01 |
Breast | 0.17 ± 0.01 | 0.06 ± 0.00 | 0.14 ± 0.01 | 0.11 ± 0.01 | 0.09 ± 0.01 |
Diabetes | 0.54 ± 0.05 | 0.40 ± 0.02 | 0.52 ± 0.04 | 0.48 ± 0.03 | 0.42 ± 0.02 |
Ecoli | 0.38 ± 0.03 | 0.27 ± 0.01 | 0.40 ± 0.03 | 0.31 ± 0.03 | 0.29 ± 0.02 |
German | 0.56 ± 0.04 | 0.43 ± 0.01 | 0.59 ± 0.02 | 0.58 ± 0.02 | 0.47 ± 0.02 |
Hypothyroid | 0.05 ± 0.00 | 0.12 ± 0.01 | 0.05 ± 0.00 | 0.24 ± 0.01 | 0.12 ± 0.01 |
Kr-vs-kp | 0.08 ± 0.01 | 0.16 ± 0.01 | 0.13 ± 0.00 | 0.08 ± 0.00 | 0.12 ± 0.00 |
Letter | 0.35 ± 0.02 | 0.38 ± 0.00 | 0.39 ± 0.01 | na | na |
Mushroom | 0.03 ± 0.00 | 0.04 ± 0.00 | 0.03 ± 0.00 | 0.02 ± 0.00 | 0.02 ± 0.00 |
Poker-lsn | 0.05 ± 0.00 | 0.59 ± 0.00 | na | na | na |
Segment | 0.16 ± 0.01 | 0.22 ± 0.01 | 0.19 ± 0.01 | 0.14 ± 0.01 | 0.18 ± 0.00 |
Sick | 0.07 ± 0.00 | 0.10 ± 0.01 | 0.08 ± 0.00 | 0.11 ± 0.01 | 0.10 ± 0.00 |
Splice | 0.26 ± 0.01 | 0.11 ± 0.00 | 0.25 ± 0.01 | na | 0.19 ± 0.00 |
Waveform5000 | 0.59 ± 0.02 | 0.22 ± 0.01 | 0.52 ± 0.00 | na | 0.26 ± 0.01 |
Dataset | Method | Instances | St. Dev. of Class Distribution | KL Divergence | Sim1 |
---|---|---|---|---|---|
Adult | Windowing | 14502.840 ± 574.266 | 0.083 ± 0.004 | 0.128 ± 0.004 | 0.386 ± 0.012 |
Adult | Full-Dataset | 43957.800 ± 0.402 | 0.369 ± 0.000 | 0.000 ± 0.000 | 0.935 ± 0.001 |
Adult | Random-sampling | 14502.840 ± 574.266 | 0.374 ± 0.049 | 0.005 ± 0.005 | 0.418 ± 0.013 |
Adult | Stratified-sampling | 14502.840 ± 574.266 | 0.369 ± 0.000 | 0.000 ± 0.000 | 0.418 ± 0.013 |
Adult | Balanced-sampling | 14502.840 ± 574.266 | 0.000 ± 0.000 | 0.206 ± 0.000 | 0.400 ± 0.013 |
Australian | Windowing | 215.440 ± 14.363 | 0.031 ± 0.020 | 0.017 ± 0.008 | 0.999 ± 0.006 |
Australian | Full-Dataset | 621.000 ± 0.000 | 0.078 ± 0.001 | 0.000 ± 0.000 | 0.999 ± 0.005 |
Australian | Random-sampling | 215.440 ± 14.363 | 0.080 ± 0.047 | 0.004 ± 0.005 | 0.986 ± 0.016 |
Australian | Stratified-sampling | 215.440 ± 14.363 | 0.078 ± 0.004 | 0.000 ± 0.000 | 0.986 ± 0.016 |
Australian | Balanced-sampling | 215.440 ± 14.363 | 0.001 ± 0.002 | 0.009 ± 0.000 | 0.987 ± 0.016 |
Breast | Windowing | 109.210 ± 14.732 | 0.043 ± 0.030 | 0.086 ± 0.031 | 1.000 ± 0.000 |
Breast | Full-Dataset | 614.700 ± 0.461 | 0.212 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
Breast | Random-sampling | 109.210 ± 14.732 | 0.224 ± 0.107 | 0.019 ± 0.017 | 1.000 ± 0.000 |
Breast | Stratified-sampling | 109.210 ± 14.732 | 0.215 ± 0.007 | 0.000 ± 0.000 | 1.000 ± 0.000 |
Breast | Balanced-sampling | 109.210 ± 14.732 | 0.003 ± 0.003 | 0.066 ± 0.003 | 1.000 ± 0.000 |
Diabetes | Windowing | 436.260 ± 27.768 | 0.087 ± 0.022 | 0.025 ± 0.009 | 0.751 ± 0.028 |
Diabetes | Full-Dataset | 691.200 ± 0.402 | 0.213 ± 0.001 | 0.000 ± 0.000 | 0.954 ± 0.004 |
Diabetes | Random-sampling | 436.260 ± 27.768 | 0.214 ± 0.021 | 0.001 ± 0.001 | 0.763 ± 0.028 |
Diabetes | Stratified-sampling | 436.260 ± 27.768 | 0.215 ± 0.002 | 0.000 ± 0.000 | 0.766 ± 0.028 |
Diabetes | Balanced-sampling | 436.260 ± 27.768 | 0.001 ± 0.001 | 0.067 ± 0.001 | 0.770 ± 0.028 |
Ecoli | Windowing | 126.640 ± 8.579 | 0.109 ± 0.005 | 0.182 ± 0.055 | 0.761 ± 0.026 |
Ecoli | Full-Dataset | 302.400 ± 0.492 | 0.145 ± 0.000 | 0.001 ± 0.001 | 0.979 ± 0.006 |
Ecoli | Random-sampling | 126.640 ± 8.579 | 0.147 ± 0.010 | 0.007 ± 0.010 | 0.763 ± 0.025 |
Ecoli | Stratified-sampling | 126.640 ± 8.579 | 0.154 ± 0.004 | 0.013 ± 0.003 | 0.758 ± 0.027 |
Ecoli | Balanced-sampling | 126.640 ± 8.579 | 0.099 ± 0.004 | 0.113 ± 0.028 | 0.781 ± 0.028 |
German | Windowing | 584.750 ± 25.308 | 0.119 ± 0.012 | 0.041 ± 0.006 | 1.000 ± 0.000 |
German | Full-Dataset | 900.000 ± 0.000 | 0.283 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
German | Random-sampling | 584.750 ± 25.308 | 0.284 ± 0.022 | 0.001 ± 0.001 | 1.000 ± 0.000 |
German | Stratified-sampling | 584.750 ± 25.308 | 0.283 ± 0.001 | 0.000 ± 0.000 | 1.000 ± 0.000 |
German | Balanced-sampling | 584.750 ± 25.308 | 0.055 ± 0.022 | 0.079 ± 0.015 | 1.000 ± 0.000 |
Hypothyroid | Windowing | 151.680 ± 9.619 | 0.293 ± 0.017 | 0.262 ± 0.047 | 0.428 ± 0.017 |
Hypothyroid | Full-Dataset | 3394.800 ± 0.402 | 0.449 ± 0.000 | 0.000 ± 0.000 | 0.979 ± 0.005 |
Hypothyroid | Random-sampling | 151.680 ± 9.619 | 0.580 ± 0.149 | 0.212 ± 0.103 | 0.387 ± 0.020 |
Hypothyroid | Stratified-sampling | 151.680 ± 9.619 | 0.516 ± 0.007 | 0.000 ± 0.001 | 0.387 ± 0.013 |
Hypothyroid | Balanced-sampling | 151.680 ± 9.619 | 0.191 ± 0.004 | 0.668 ± 0.023 | 0.435 ± 0.016 |
Kr-vs-kp | Windowing | 242.550 ± 18.425 | 0.050 ± 0.036 | 0.010 ± 0.012 | 0.998 ± 0.004 |
Kr-vs-kp | Full-Dataset | 2876.400 ± 0.492 | 0.031 ± 0.000 | 0.000 ± 0.000 | 0.999 ± 0.004 |
Kr-vs-kp | Random-sampling | 242.550 ± 18.425 | 0.221 ± 0.130 | 0.106 ± 0.099 | 0.975 ± 0.013 |
Kr-vs-kp | Stratified-sampling | 242.550 ± 18.425 | 0.032 ± 0.003 | 0.000 ± 0.000 | 0.977 ± 0.009 |
Kr-vs-kp | Balanced-sampling | 242.550 ± 18.425 | 0.001 ± 0.001 | 0.001 ± 0.000 | 0.977 ± 0.008 |
Letter | Windowing | 7390.450 ± 491.435 | 0.008 ± 0.000 | 0.037 ± 0.002 | 0.989 ± 0.006 |
Letter | Full-Dataset | 18000.000 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 | 0.999 ± 0.002 |
Letter | Random-sampling | 7390.450 ± 491.435 | 0.007 ± 0.001 | 0.022 ± 0.009 | 0.983 ± 0.008 |
Letter | Stratified-sampling | 7390.450 ± 491.435 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.985 ± 0.007 |
Letter | Balanced-sampling | 7390.450 ± 491.435 | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.984 ± 0.006 |
Mushroom | Windowing | 219.490 ± 16.871 | 0.043 ± 0.033 | 0.004 ± 0.005 | 0.968 ± 0.021 |
Mushroom | Full-Dataset | 7311.600 ± 0.492 | 0.025 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
Mushroom | Random-sampling | 219.490 ± 16.871 | 0.504 ± 0.244 | 2.083 ± 1.852 | 0.833 ± 0.072 |
Mushroom | Stratified-sampling | 219.490 ± 16.871 | 0.026 ± 0.004 | 0.000 ± 0.000 | 0.903 ± 0.032 |
Mushroom | Balanced-sampling | 219.490 ± 16.871 | 0.002 ± 0.002 | 0.001 ± 0.000 | 0.902 ± 0.033 |
Segment | Windowing | 371.280 ± 27.458 | 0.104 ± 0.008 | 0.390 ± 0.076 | 0.279 ± 0.015 |
Segment | Full-Dataset | 2079.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.938 ± 0.003 |
Segment | Random-sampling | 371.280 ± 27.458 | 0.050 ± 0.007 | 0.105 ± 0.144 | 0.310 ± 0.019 |
Segment | Stratified-sampling | 371.280 ± 27.458 | 0.002 ± 0.001 | 0.000 ± 0.000 | 0.315 ± 0.018 |
Segment | Balanced-sampling | 371.280 ± 27.458 | 0.002 ± 0.001 | 0.000 ± 0.000 | 0.315 ± 0.018 |
Sick | Windowing | 264.600 ± 17.420 | 0.305 ± 0.028 | 0.233 ± 0.032 | 0.565 ± 0.019 |
Sick | Full-Dataset | 3394.800 ± 0.402 | 0.621 ± 0.000 | 0.000 ± 0.000 | 0.979 ± 0.005 |
Sick | Random-sampling | 264.600 ± 17.420 | 0.623 ± 0.066 | 0.015 ± 0.014 | 0.483 ± 0.018 |
Sick | Stratified-sampling | 264.600 ± 17.420 | 0.623 ± 0.002 | 0.000 ± 0.000 | 0.483 ± 0.014 |
Sick | Balanced-sampling | 264.600 ± 17.420 | 0.002 ± 0.001 | 0.665 ± 0.002 | 0.495 ± 0.014 |
Splice | Windowing | 835.300 ± 29.689 | 0.072 ± 0.011 | 0.036 ± 0.009 | 0.969 ± 0.043 |
Splice | Full-Dataset | 2871.000 ± 0.000 | 0.169 ± 0.047 | 0.000 ± 0.000 | 0.987 ± 0.034 |
Splice | Random-sampling | 835.300 ± 29.689 | 0.161 ± 0.000 | 0.014 ± 0.013 | 0.890 ± 0.060 |
Splice | Stratified-sampling | 835.300 ± 29.689 | 0.161 ± 0.001 | 0.000 ± 0.000 | 0.862 ± 0.036 |
Splice | Balanced-sampling | 835.300 ± 29.689 | 0.001 ± 0.001 | 0.104 ± 0.001 | 0.871 ± 0.046 |
Waveform-5000 | Windowing | 3263.590 ± 330.000 | 0.006 ± 0.004 | 0.000 ± 0.000 | 0.940 ± 0.018 |
Waveform-5000 | Full-Dataset | 4500.000 ± 0.000 | 0.004 ± 0.000 | 0.000 ± 0.000 | 0.983 ± 0.001 |
Waveform-5000 | Random-sampling | 3263.590 ± 330.000 | 0.018 ± 0.010 | 0.002 ± 0.002 | 0.932 ± 0.019 |
Waveform-5000 | Stratified-sampling | 3263.590 ± 330.000 | 0.004 ± 0.000 | 0.000 ± 0.000 | 0.932 ± 0.019 |
Waveform-5000 | Balanced-sampling | 3263.590 ± 330.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.932 ± 0.019 |
Dataset | Method | L(H) | L(D|H) | MDL |
---|---|---|---|---|
Adult | Windowing | 1361.599 ± 465.850 | 2366.019 ± 59.709 | 3727.618 ± 483.653 |
Adult | Full-Dataset | 2077.010 ± 282.565 | 2374.002 ± 49.985 | 4451.012 ± 270.561 |
Adult | Random-sampling | 1009.386 ± 276.429 | 2420.278 ± 56.458 | 3429.664 ± 264.703 |
Adult | Stratified-sampling | 1031.172 ± 181.155 | 2410.870 ± 49.932 | 3442.042 ± 186.437 |
Adult | Balanced-sampling | 1351.736 ± 265.668 | 2423.024 ± 44.271 | 3774.759 ± 274.906 |
Australian | Windowing | 77.299 ± 29.067 | 41.284 ± 6.849 | 118.582 ± 30.088 |
Australian | Full-Dataset | 66.820 ± 16.934 | 41.044 ± 6.711 | 107.864 ± 17.430 |
Australian | Random-sampling | 45.151 ± 18.592 | 41.820 ± 6.916 | 86.971 ± 19.120 |
Australian | Stratified-sampling | 50.313 ± 22.016 | 41.836 ± 6.776 | 92.149 ± 21.220 |
Australian | Balanced-sampling | 44.603 ± 22.878 | 42.327 ± 6.764 | 86.929 ± 22.830 |
Breast | Windowing | 46.541 ± 13.199 | 25.904 ± 4.584 | 72.445 ± 12.435 |
Breast | Full-Dataset | 58.757 ± 7.942 | 25.338 ± 5.280 | 84.095 ± 8.195 |
Breast | Random-sampling | 22.301 ± 6.555 | 29.008 ± 7.229 | 51.309 ± 7.316 |
Breast | Stratified-sampling | 23.991 ± 6.915 | 28.631 ± 6.720 | 52.622 ± 8.350 |
Breast | Balanced-sampling | 22.767 ± 7.801 | 28.191 ± 5.710 | 50.959 ± 8.137 |
Diabetes | Windowing | 59.000 ± 37.207 | 65.437 ± 5.227 | 124.437 ± 37.477 |
Diabetes | Full-Dataset | 126.620 ± 46.019 | 64.383 ± 5.161 | 191.003 ± 45.988 |
Diabetes | Random-sampling | 95.960 ± 38.989 | 65.674 ± 4.884 | 161.634 ± 39.119 |
Diabetes | Stratified-sampling | 94.940 ± 39.261 | 64.354 ± 5.965 | 159.294 ± 39.505 |
Diabetes | Balanced-sampling | 104.840 ± 36.621 | 65.263 ± 5.003 | 170.103 ± 36.829 |
Ecoli | Windowing | 99.328 ± 23.152 | 29.959 ± 7.767 | 129.287 ± 23.257 |
Ecoli | Full-Dataset | 144.454 ± 19.804 | 27.648 ± 6.460 | 172.102 ± 18.623 |
Ecoli | Random-sampling | 69.348 ± 16.853 | 33.969 ± 9.853 | 103.317 ± 15.614 |
Ecoli | Stratified-sampling | 65.678 ± 16.214 | 34.174 ± 10.710 | 99.852 ± 16.457 |
Ecoli | Balanced-sampling | 83.869 ± 20.904 | 30.357 ± 7.087 | 114.226 ± 20.376 |
German | Windowing | 315.252 ± 60.182 | 82.866 ± 5.220 | 398.118 ± 60.077 |
German | Full-Dataset | 287.566 ± 54.049 | 83.857 ± 5.339 | 371.423 ± 53.413 |
German | Random-sampling | 211.627 ± 51.692 | 83.245 ± 5.156 | 294.871 ± 51.783 |
German | Stratified-sampling | 212.684 ± 54.545 | 83.006 ± 5.125 | 295.689 ± 53.830 |
German | Balanced-sampling | 238.184 ± 51.813 | 84.412 ± 5.352 | 322.596 ± 51.356 |
Hypothyroid | Windowing | 84.812 ± 19.108 | 28.291 ± 6.449 | 113.102 ± 20.727 |
Hypothyroid | Full-Dataset | 122.317 ± 10.791 | 27.105 ± 6.877 | 149.422 ± 10.562 |
Hypothyroid | Random-sampling | 15.667 ± 15.278 | 189.232 ± 110.454 | 204.899 ± 96.402 |
Hypothyroid | Stratified-sampling | 30.645 ± 6.465 | 67.493 ± 22.683 | 98.138 ± 22.336 |
Hypothyroid | Balanced-sampling | 45.353 ± 10.448 | 61.502 ± 18.798 | 106.854 ± 18.199 |
Kr-vs-kp | Windowing | 198.034 ± 14.570 | 69.919 ± 4.871 | 267.953 ± 14.944 |
Kr-vs-kp | Full-Dataset | 219.807 ± 16.870 | 69.345 ± 4.277 | 289.152 ± 17.014 |
Kr-vs-kp | Random-sampling | 64.438 ± 18.816 | 98.961 ± 21.032 | 163.399 ± 21.636 |
Kr-vs-kp | Stratified-sampling | 72.664 ± 18.341 | 92.724 ± 15.119 | 165.388 ± 15.947 |
Kr-vs-kp | Balanced-sampling | 73.848 ± 18.721 | 91.842 ± 14.262 | 165.690 ± 15.840 |
Letter | Windowing | 11862.644 ± 473.112 | 1248.697 ± 64.017 | 13111.341 ± 453.031 |
Letter | Full-Dataset | 12431.372 ± 180.896 | 1165.793 ± 38.869 | 13597.165 ± 182.617 |
Letter | Random-sampling | 7020.909 ± 385.222 | 1473.635 ± 81.356 | 8494.544 ± 358.576 |
Letter | Stratified-sampling | 7102.767 ± 358.000 | 1461.702 ± 80.161 | 8564.469 ± 328.131 |
Letter | Balanced-sampling | 7126.843 ± 381.507 | 1449.106 ± 76.567 | 8575.949 ± 354.232 |
Mushroom | Windowing | 79.249 ± 7.033 | 76.881 ± 4.163 | 156.130 ± 7.189 |
Mushroom | Full-Dataset | 77.237 ± 0.600 | 79.510 ± 1.744 | 156.747 ± 1.810 |
Mushroom | Random-sampling | 18.228 ± 19.552 | 461.838 ± 353.124 | 480.066 ± 337.153 |
Mushroom | Stratified-sampling | 31.126 ± 14.101 | 114.606 ± 23.525 | 145.732 ± 20.201 |
Mushroom | Balanced-sampling | 31.879 ± 15.063 | 113.501 ± 22.427 | 145.380 ± 17.422 |
Segment | Windowing | 348.723 ± 34.369 | 81.656 ± 10.719 | 430.379 ± 33.528 |
Segment | Full-Dataset | 365.928 ± 22.569 | 79.045 ± 9.609 | 444.973 ± 22.295 |
Segment | Random-sampling | 142.987 ± 22.538 | 135.754 ± 31.843 | 278.741 ± 31.578 |
Segment | Stratified-sampling | 142.715 ± 18.438 | 126.640 ± 24.516 | 269.356 ± 26.762 |
Segment | Balanced-sampling | 141.267 ± 17.852 | 127.325 ± 23.254 | 268.591 ± 26.010 |
Sick | Windowing | 170.530 ± 26.600 | 50.476 ± 8.212 | 221.005 ± 26.977 |
Sick | Full-Dataset | 182.701 ± 22.491 | 42.346 ± 7.910 | 225.047 ± 20.038 |
Sick | Random-sampling | 21.786 ± 16.605 | 80.715 ± 38.277 | 102.501 ± 24.810 |
Sick | Stratified-sampling | 31.126 ± 6.768 | 55.199 ± 13.736 | 86.325 ± 15.387 |
Sick | Balanced-sampling | 57.996 ± 17.446 | 60.045 ± 9.531 | 118.040 ± 18.444 |
Splice | Windowing | 725.951 ± 53.364 | 181.187 ± 11.871 | 907.139 ± 53.195 |
Splice | Full-Dataset | 745.146 ± 51.142 | 179.689 ± 11.014 | 924.834 ± 52.532 |
Splice | Random-sampling | 425.144 ± 52.153 | 187.097 ± 21.631 | 612.240 ± 47.209 |
Splice | Stratified-sampling | 443.339 ± 51.337 | 188.061 ± 19.286 | 631.400 ± 48.312 |
Splice | Balanced-sampling | 419.763 ± 41.676 | 188.473 ± 20.593 | 608.236 ± 40.687 |
Waveform-5000 | Windowing | 2418.668 ± 215.760 | 363.799 ± 56.499 | 2782.467 ± 224.433 |
Waveform-5000 | Full-Dataset | 2615.956 ± 94.305 | 415.810 ± 20.601 | 3031.766 ± 92.381 |
Waveform-5000 | Random-sampling | 1957.647 ± 203.398 | 413.447 ± 24.548 | 2371.094 ± 202.636 |
Waveform-5000 | Stratified-sampling | 1957.202 ± 199.174 | 417.104 ± 26.348 | 2374.306 ± 196.151 |
Waveform-5000 | Balanced-sampling | 1966.554 ± 193.650 | 417.152 ± 28.133 | 2383.706 ± 190.987 |
Dataset | Method | Test Acc | Test AUC |
---|---|---|---|
Adult | Windowing | 86.355 ± 0.889 | 78.227 ± 1.161 |
Adult | Full-Dataset | 86.074 ± 0.390 | 77.080 ± 0.823 |
Adult | Random-sampling | 85.516 ± 0.423 | 76.131 ± 2.021 |
Adult | Stratified-sampling | 85.677 ± 0.401 | 76.680 ± 0.885 |
Adult | Balanced-sampling | 80.489 ± 0.722 | 81.956 ± 0.580 |
Australian | Windowing | 85.710 ± 4.355 | 85.471 ± 4.411 |
Australian | Full-Dataset | 86.536 ± 3.969 | 86.239 ± 4.041 |
Australian | Random-sampling | 85.101 ± 4.375 | 84.849 ± 4.517 |
Australian | Stratified-sampling | 85.391 ± 4.164 | 85.142 ± 4.266 |
Australian | Balanced-sampling | 85.536 ± 3.925 | 85.584 ± 3.854 |
Breast | Windowing | 94.829 ± 2.804 | 94.368 ± 3.117 |
Breast | Full-Dataset | 95.533 ± 2.674 | 95.058 ± 2.830 |
Breast | Random-sampling | 92.696 ± 3.821 | 91.687 ± 4.739 |
Breast | Stratified-sampling | 92.783 ± 3.485 | 91.956 ± 3.982 |
Breast | Balanced-sampling | 92.433 ± 3.558 | 92.301 ± 3.627 |
Diabetes | Windowing | 74.161 ± 4.864 | 70.041 ± 5.654 |
Diabetes | Full-Dataset | 74.756 ± 4.661 | 71.211 ± 5.027 |
Diabetes | Random-sampling | 72.280 ± 4.520 | 68.602 ± 5.403 |
Diabetes | Stratified-sampling | 73.222 ± 5.113 | 70.254 ± 5.721 |
Diabetes | Balanced-sampling | 71.018 ± 5.222 | 71.726 ± 4.937 |
Ecoli | Windowing | 82.777 ± 6.353 | 88.848 ± 4.134 |
Ecoli | Full-Dataset | 82.822 ± 5.467 | 88.873 ± 3.567 |
Ecoli | Random-sampling | 80.059 ± 6.268 | 86.924 ± 4.218 |
Ecoli | Stratified-sampling | 79.586 ± 6.227 | 86.721 ± 4.113 |
Ecoli | Balanced-sampling | 79.405 ± 6.360 | 86.981 ± 4.034 |
German | Windowing | 71.660 ± 4.608 | 63.119 ± 5.518 |
German | Full-Dataset | 71.300 ± 3.765 | 62.605 ± 4.388 |
German | Random-sampling | 71.800 ± 3.782 | 62.867 ± 4.408 |
German | Stratified-sampling | 71.640 ± 3.799 | 62.857 ± 4.546 |
German | Balanced-sampling | 67.820 ± 4.448 | 66.833 ± 4.014 |
Hypothyroid | Windowing | 99.483 ± 0.346 | 98.880 ± 1.204 |
Hypothyroid | Full-Dataset | 99.528 ± 0.353 | 98.871 ± 1.259 |
Hypothyroid | Random-sampling | 94.340 ± 2.524 | 70.634 ± 23.378 |
Hypothyroid | Stratified-sampling | 96.877 ± 1.652 | 94.594 ± 4.769 |
Hypothyroid | Balanced-sampling | 96.236 ± 1.831 | 97.598 ± 1.421 |
Kr-vs-kp | Windowing | 99.302 ± 0.583 | 99.294 ± 0.594 |
Kr-vs-kp | Full-Dataset | 99.415 ± 0.433 | 99.412 ± 0.433 |
Kr-vs-kp | Random-sampling | 94.171 ± 2.959 | 94.139 ± 3.061 |
Kr-vs-kp | Stratified-sampling | 94.956 ± 1.766 | 94.956 ± 1.802 |
Kr-vs-kp | Balanced-sampling | 94.984 ± 1.727 | 94.996 ± 1.756 |
Letter | Windowing | 87.161 ± 2.074 | 93.324 ± 1.078 |
Letter | Full-Dataset | 87.943 ± 0.720 | 93.731 ± 0.375 |
Letter | Random-sampling | 82.216 ± 1.006 | 90.753 ± 0.523 |
Letter | Stratified-sampling | 82.376 ± 1.148 | 90.836 ± 0.597 |
Letter | Balanced-sampling | 82.430 ± 1.160 | 90.864 ± 0.603 |
Mushroom | Windowing | 100.000 ± 0.000 | 100.000 ± 0.000 |
Mushroom | Full-Dataset | 100.000 ± 0.000 | 100.000 ± 0.000 |
Mushroom | Random-sampling | 73.746 ± 23.610 | 73.625 ± 23.684 |
Mushroom | Stratified-sampling | 98.367 ± 0.813 | 98.312 ± 0.831 |
Mushroom | Balanced-sampling | 98.424 ± 0.819 | 98.376 ± 0.831 |
Segment | Windowing | 96.329 ± 1.655 | 97.859 ± 0.965 |
Segment | Full-Dataset | 96.710 ± 1.335 | 98.081 ± 0.779 |
Segment | Random-sampling | 90.719 ± 3.181 | 94.586 ± 1.855 |
Segment | Stratified-sampling | 91.515 ± 2.074 | 95.051 ± 1.210 |
Segment | Balanced-sampling | 91.455 ± 1.984 | 95.015 ± 1.157 |
Sick | Windowing | 98.688 ± 0.640 | 93.667 ± 3.370 |
Sick | Full-Dataset | 98.741 ± 0.523 | 93.662 ± 3.323 |
Sick | Random-sampling | 96.193 ± 1.887 | 75.662 ± 19.843 |
Sick | Stratified-sampling | 97.301 ± 1.051 | 86.908 ± 6.166 |
Sick | Balanced-sampling | 94.785 ± 1.855 | 94.812 ± 2.641 |
Splice | Windowing | 94.132 ± 1.682 | 95.626 ± 1.344 |
Splice | Full-Dataset | 94.216 ± 1.474 | 95.723 ± 1.125 |
Splice | Random-sampling | 89.997 ± 2.226 | 92.370 ± 1.951 |
Splice | Stratified-sampling | 90.339 ± 1.973 | 92.757 ± 1.572 |
Splice | Balanced-sampling | 89.846 ± 2.199 | 92.902 ± 1.570 |
Waveform-5000 | Windowing | 83.802 ± 9.864 | 87.848 ± 7.402 |
Waveform-5000 | Full-Dataset | 75.202 ± 1.989 | 81.396 ± 1.493 |
Waveform-5000 | Random-sampling | 75.046 ± 2.159 | 81.279 ± 1.619 |
Waveform-5000 | Stratified-sampling | 75.252 ± 1.981 | 81.431 ± 1.487 |
Waveform-5000 | Balanced-sampling | 75.514 ± 2.143 | 81.628 ± 1.609 |