# Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data


## Abstract


## 1. Introduction

`dbstats` (Boj et al., 2017 [16]).

## 2. Materials and Methods

#### 2.1. Distance-Based Linear and Logistic Regression

`dbstats` library) can be used to solve classification problems with mixed-type data (see also [18] for a mixed-features classification procedure based on a general model for the joint distribution of the features). In some preliminary simulations, we compared the performance of the logit and probit links and found no significant differences in the attained errors.
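The distance-based construction shared by DB-LM and DB-GLM can be summarized as: double-center the squared inter-distance matrix, recover latent Euclidean coordinates by spectral decomposition, and fit an ordinary linear (or generalized linear) model on those coordinates. Below is a minimal illustrative sketch in Python; the paper's actual computations use the `dbstats` R package, and the function name here is hypothetical:

```python
import numpy as np

def db_lm_fit_predict(D2, y):
    """Illustrative distance-based linear regression (not dbstats' dblm):
    double-center the squared-distance matrix, extract latent Euclidean
    coordinates from the positive eigenvalues, then run OLS on them."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ D2 @ J                 # inner-product (Gram) matrix
    w, V = np.linalg.eigh(G)              # eigenvalues ascending
    idx = np.argsort(w)[::-1]
    w, V = w[idx], V[:, idx]
    pos = w > 1e-10                       # keep numerically positive directions
    X = V[:, pos] * np.sqrt(w[pos])       # latent coordinates
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta                      # fitted values

# toy check: squared distances from 1-D points recover an exact linear fit
x = np.linspace(0.0, 1.0, 30)
y = 2 * x + 1
D2 = (x[:, None] - x[None, :]) ** 2
yhat = db_lm_fit_predict(D2, y)
print(np.max(np.abs(yhat - y)))  # essentially zero
```

With Euclidean distances on quantitative features this reproduces ordinary least squares; the interest of the method is that `D2` can come from any dissimilarity on mixed-type data.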

#### 2.2. The Choice of the Distance

`cluster` package and supported in the `dblm` and `dbglm` functions of the `dbstats` package, which we employed in the analysis of the real data sets and in the simulations.
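For mixed-type features, Gower's coefficient averages range-scaled absolute differences for numeric variables and simple matching for qualitative ones. The following toy sketch illustrates the coefficient for a single pair of records; it is not the `daisy` implementation, which additionally handles asymmetric binary variables, missing values, and weights:

```python
import numpy as np

def gower(x, y, kinds, ranges):
    """Gower dissimilarity for one pair of mixed-type records.
    kinds[j] is "num" or "cat"; ranges[j] is the observed range of
    numeric feature j (None for categorical features)."""
    s = 0.0
    for xj, yj, kind, rj in zip(x, y, kinds, ranges):
        if kind == "num":
            s += abs(xj - yj) / rj    # range-scaled difference in [0, 1]
        else:
            s += float(xj != yj)      # simple matching for categories
    return s / len(x)

a = (180.0, "red", 1)
b = (170.0, "blue", 1)
kinds = ("num", "cat", "cat")
ranges = (50.0, None, None)
print(gower(a, b, kinds, ranges))  # (10/50 + 1 + 0)/3 = 0.4
```

Each coordinate contributes a value in [0, 1], so the resulting dissimilarity is comparable across continuous and qualitative features.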

#### 2.3. Some Ensemble Regression Techniques

#### 2.3.1. Aggregation Procedures for Regression

#### Mean Aggregation and Bagging

#### Stacking

#### Magging

`quadprog` to compute these weights, as suggested by these authors.
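Magging weights minimize the squared Euclidean norm of the aggregated coefficient vector over the probability simplex. The sketch below substitutes SciPy's SLSQP solver for the `quadprog` routine used in the paper; `magging_weights` is a hypothetical helper name:

```python
import numpy as np
from scipy.optimize import minimize

def magging_weights(B):
    """Maximin-aggregation (magging) weights: minimize ||B w||^2 subject
    to w >= 0 and sum(w) = 1, where column g of B is the coefficient
    estimate fitted on subsample g. SLSQP stands in for quadprog here."""
    G = B.shape[1]
    H = B.T @ B  # the quadratic form of the objective
    res = minimize(lambda w: w @ H @ w,
                   x0=np.full(G, 1.0 / G),
                   jac=lambda w: 2 * H @ w,
                   bounds=[(0.0, 1.0)] * G,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.x

# two groups with opposing effects: magging drives the aggregate toward 0,
# reflecting its maximin (worst-case) philosophy
B = np.array([[1.0, -1.0]])
w = magging_weights(B)
print(np.round(w, 3))
```

In contrast to bagging's fixed weights 1/G and stacking's accuracy-driven weights, magging's weights depend only on the fitted coefficients, which is what makes it robust to inhomogeneous data.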

#### 2.3.2. Bagging for Classification

#### 2.4. Databases under Consideration

#### 2.4.1. Bike Sharing Demand

#### 2.4.2. King County House Sales

## 3. Results

#### 3.1. Aggregation and DB-LM on Two Real Data Sets

#### 3.1.1. Bike Sharing Demand

`ranger` with 100 trees, `mtry` = 7, `depth` = 20) the MSE is, respectively, 0.03700, 0.02168, and 0.01746. In Figure 3, we show the mean squared errors for all the regression methods. For this data set, it is clear that the number G of subsamples does not have a great influence on the MSE: this is good news, since increasing G also noticeably increases computation time and memory requirements. In Table 4, we report the subsample size m and the overall number $G\phantom{\rule{0.166667em}{0ex}}m$ of observations used as input in the ensemble predictions. From Table 3, Table 4 and Figure 3, we see that, for $G=3$ and $m=600$, even though $G\phantom{\rule{0.166667em}{0ex}}m=69\%$ of $n$ is considerably below 100%, the three aggregation procedures come very close to the MSE of DB-LM on the whole sample. Bagging and stacking actually perform very well for all values of G even with $m=300$ (only 11% of the training sample size). Conversely, for any G, magging needs a subsample size of at least $m=600$ for its MSE to approach that of bagging and stacking. A possible explanation for the worse performance of magging relative to bagging and stacking is that any inhomogeneity in the data is well described by the qualitative predictors, whose information is already incorporated into the bagging and stacking DB regression procedures via the distance matrix. Of course, it is important to choose a suitable dissimilarity for the qualitative variables.
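The subsample-and-aggregate scheme evaluated here (draw G subsamples of size m, fit a predictor on each, and mean-aggregate the predictions) can be sketched generically. In this illustration `ols_fit` is a cheap stand-in for the DB-LM fit, and all function names are hypothetical:

```python
import numpy as np

def subsample_aggregate(fit, X, y, X_new, G=3, m=600, rng=None):
    """Fit `fit` on G random subsamples of size m (without replacement)
    and average the G predictions (mean aggregation / bagging)."""
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(G):
        idx = rng.choice(len(y), size=m, replace=False)
        model = fit(X[idx], y[idx])       # model is a callable predictor
        preds.append(model(X_new))
    return np.mean(preds, axis=0)

def ols_fit(X, y):
    """Toy base learner: ordinary least squares with an intercept."""
    Xb = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(2613, 3))            # same n as the Bikeshare training set
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=2613)
yhat = subsample_aggregate(ols_fit, X, y, X[:10], G=3, m=600, rng=1)
print(np.round(yhat - y[:10], 2))         # residuals are small
```

Stacking and magging only change the final averaging step: the G per-subsample predictions are combined with data-driven weights instead of the uniform 1/G.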

`daisy`, `disttoD2`, and `dblm`. For the Capital Bikeshare data, DB-LM takes 109.35 s and occupies 247.3 MB when predicting 290 new cases based on a training sample of 2903 observations. In contrast, the LM model takes 0.03 s and uses 1 MB of memory to make these predictions. DB-LM execution times and memory usage can be considerably reduced by applying any of the ensemble methods studied (the optimization and averaging steps of the ensemble predictions take essentially 0 s, which is why the execution time refers only to the three R functions mentioned above). This can be seen in Figure 5, where we analyze the complexity of the ensemble DB-LM prediction model. We observe that memory usage increases with m but remains constant in G, while execution time increases with both m and G. These findings reinforce our conclusion that, for $G=3$ and $m=600$, the ensemble DB-LM prediction model provides a solution to the scalability problem: the complexity of the DB-LM prediction model is considerably reduced, while the MSE remains very close to that of DB-LM on the whole sample (see Figure 3 and Figure 6).
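A back-of-the-envelope calculation makes the scaling gain plausible. Assuming the distance matrix costs O(n²) memory and the spectral/fitting step O(n³) time (orders assumed here for illustration, not measurements from the paper), the ensemble's footprint for G = 3 and m = 600 relative to the full-sample DB-LM is roughly:

```python
# Rough complexity comparison under assumed O(n^2) memory / O(n^3) time costs.
n, G, m = 2903, 3, 600
mem_ratio = G * m**2 / n**2    # G small distance matrices vs. one large one
time_ratio = G * m**3 / n**3   # G small spectral steps vs. one large one
print(f"memory ~ {mem_ratio:.1%} of full DB-LM")
print(f"time   ~ {time_ratio:.1%} of full DB-LM")
```

Both ratios fall well below 100%, which is consistent with the reductions observed empirically in Figure 5 and Figure 6.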

#### 3.1.2. King County House Sales

`mtry` = 10, `depth` = 20) it was 0.04704.

#### 3.2. Bagging and DB-GLM: A Simulation Study

`Umpire` (Coombes et al., 2021 [28]). In all cases, we sampled n “vectors” $\mathbf{Z}$ of 20 mixed-type features, with the prior probabilities of the populations sampled from a Dirichlet distribution. We consider two models:

- Model 1: The same proportion (1/3) of continuous, binary, and nominal features.
- Model 2: Continuous and binary features in (approximately) equal proportions of 25% each, with the remaining 50% of the features being qualitative.
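A hypothetical generator in this spirit (an illustration, not the `Umpire` engine): population priors drawn from a Dirichlet distribution, then continuous, binary, and nominal features whose distributions depend on the population label:

```python
import numpy as np

def simulate_mixed(n, n_num, n_bin, n_cat, n_pop=3, seed=None):
    """Sketch of a mixed-type simulation: Dirichlet population priors,
    class labels, then class-dependent continuous/binary features and
    class-independent 4-level nominal features (all choices illustrative)."""
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(np.ones(n_pop))             # population priors
    pop = rng.choice(n_pop, size=n, p=probs)          # population labels
    num = rng.normal(loc=pop[:, None], size=(n, n_num))           # shifted means
    bin_ = rng.binomial(1, (pop[:, None] + 1) / (n_pop + 1),
                        size=(n, n_bin))                          # class-linked rates
    cat = rng.integers(0, 4, size=(n, n_cat))                     # nominal levels 0..3
    return pop, num, bin_, cat

# a Model 1-like split of 20 features (1/3 each, rounded to 7/7/6)
pop, num, bin_, cat = simulate_mixed(300, 7, 7, 6, seed=0)
print(num.shape, bin_.shape, cat.shape)
```

Changing the three feature counts to 5/5/10 yields a Model 2-like configuration with half the features qualitative.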

## 4. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Boj, E.; Delicado, P.; Fortiana, J. Local linear functional regression based on weighted distance-based regression. Comput. Stat. Data Anal. **2010**, 54, 429–437.
2. Wang, R.; Shan, S.; Chen, X.; Dai, Q.; Gao, W. Manifold–manifold distance and its application to face recognition with image sets. IEEE Trans. Image Process. **2012**, 21, 4466–4479.
3. Shao, M.-W.; Du, X.-J.; Wang, J.; Zhai, C.-M. Recognition of leaf image set based on manifold–manifold distance. In International Conference on Intelligent Computing (ICIC) 2014; Lecture Notes in Computer Science; Springer: Taiyuan, China, 2014; pp. 332–337.
4. Tsitsulin, A.; Munkhoeva, M.; Mottin, D.; Karras, P.; Bronstein, A.; Oseledets, I.; Müller, E. The Shape of Data: Intrinsic Distance for Data Distributions. 2020. Available online: https://arxiv.org/abs/1905.11141 (accessed on 30 August 2021).
5. Cuadras, C.M. Distance analysis in discrimination and classification using both continuous and categorical variables. In Statistical Data Analysis and Inference; Dodge, Y., Ed.; North-Holland Publishing Co.: Amsterdam, The Netherlands, 1989; pp. 459–473.
6. Cuadras, C.M.; Arenas, C. A distance-based model for prediction with mixed data. Commun. Stat. Theory Methods **1990**, 19, 2261–2279.
7. Cuadras, C.M.; Arenas, C.; Fortiana, J. Some computational aspects of a distance-based model for prediction. Commun. Stat. Simul. Comput. **1996**, 25, 593–609.
8. Boj, E.; Caballé, A.; Delicado, P.; Esteve, A.; Fortiana, J. Global and local distance-based generalized linear models. Test **2016**, 25, 170–195.
9. Mendes-Moreira, J.; Soares, C.; Jorge, A.M.; de Sousa, J.F. Ensemble approaches for regression: A survey. ACM Comput. Surv. **2012**, 45, 10.
10. Ren, Y.; Zhang, L.; Suganthan, P.N. Ensemble classification and regression—Recent developments, applications and future directions. IEEE Comput. Intell. Mag. **2016**, 11, 41–53.
11. Breiman, L. Bagging predictors. Mach. Learn. **1996**, 24, 123–140.
12. Breiman, L. Stacked regressions. Mach. Learn. **1996**, 24, 49–64.
13. Bühlmann, P. Bagging, subagging and bragging for improving some prediction algorithms. In Recent Advances and Trends in Nonparametric Statistics; Elsevier: Amsterdam, The Netherlands, 2003; pp. 19–34.
14. Bühlmann, P.; Meinshausen, N. Magging: Maximin aggregation for inhomogeneous large-scale data. Proc. IEEE **2015**, 104, 126–135.
15. McCullagh, P.; Nelder, J.A. Generalized Linear Models; Chapman and Hall: London, UK, 1989.
16. Boj, E.; Caballé, A.; Delicado, P.; Fortiana, J. dbstats: Distance-Based Statistics. R Package, Version 1.0.5. 2017. Available online: https://CRAN.R-project.org/package=dbstats (accessed on 3 June 2019).
17. Gower, J.C. Adding a point to vector diagrams in multivariate analysis. Biometrika **1968**, 55, 582–585.
18. De Leon, A.R.; Soo, A.; Williamson, T. Classification with discrete and continuous variables via general mixed-data models. J. Appl. Stat. **2011**, 38, 1021–1032.
19. Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics **1971**, 27, 857–874.
20. Grané, A.; Salini, S.; Verdolini, E. Robust multivariate analysis for mixed-type data: Novel algorithm and its practical application in socio-economic research. Socio-Econ. Plan. Sci. **2021**, 73, 100907.
21. Paradis, E. Multidimensional scaling with very large datasets. J. Comput. Graph. Stat. **2018**, 27, 935–939.
22. Grané, A.; Sow-Barry, A.A. Visualizing profiles of large datasets of weighted and mixed data. Mathematics **2021**, 9, 891.
23. Zhu, M. Use of majority votes in statistical learning. WIREs Comput. Stat. **2015**, 7, 357–371.
24. Wolpert, D. Stacked generalization. Neural Netw. **1992**, 5, 241–259.
25. Džeroski, S.; Ženko, B. Is combining classifiers with stacking better than selecting the best one? Mach. Learn. **2004**, 54, 255–273.
26. Meinshausen, N.; Bühlmann, P. Maximin effects in inhomogeneous large-scale data. Ann. Stat. **2015**, 43, 1801–1830.
27. Sakar, C.O.; Polat, S.O.; Katircioglu, M.; Kastro, Y. Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Comput. Appl. **2018**, 31, 6893–6908.
28. Coombes, C.E.; Abrams, Z.B.; Nakayiza, S.; Brock, G.; Coombes, K.R. Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning. F1000Research **2021**, 9, 1186.

**Figure 4.** Box plots of the median absolute prediction errors (MAE) for the Capital Bikeshare data. For each method, performance is evaluated with the distribution of MAE values computed in 500 runs within each scenario.

**Figure 5.** Complexity analysis (execution time and memory usage) of the ensemble DB-LM prediction model for the Capital Bikeshare data.

**Figure 6.** Percentage of complexity reduction of the ensemble DB-LM prediction model for the Capital Bikeshare data. Baseline: execution time of 109.35 s and memory usage of 243.3 MB to predict 290 new cases using the whole training sample.

**Figure 8.** Box plots of the median absolute prediction errors (MAE) for the King County house sales data.

Capital Bikeshare |
---|
Total count of daily users (both registered and not) |
Season: winter (1), spring (2), summer (3), autumn (4) |
Year, codified to 0 (=2011), 1 (=2012), 2 (=2013), …, 7 (=2018) |
Month, codified to 1, 2, …, 12 |
National holiday (1) or not (0) |
Weekday, codified to 0 (=Sunday), 1 (=Monday), …, 6 (=Saturday) |
Working day (1) or weekend day (0) |

NOAA at DCA |
---|
Average daily wind speed (miles per hour) |
Precipitation (inches to hundredths) |
Maximum temperature (in Fahrenheit) |
Minimum temperature (in Fahrenheit) |
Ceiling height dimension (in meters) |
Mean daily temperature (in Celsius) |
Sea level pressure (in hPa) |
Relative humidity (in %) |

Variable | Description |
---|---|
date | Date of the home sale |
price | Price of each home sold |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms |
sqft_living | Square footage (sqft) of the interior living space |
sqft_lot | Sqft of the land space |
floors | Number of floors |
waterfront | 1 if the house overlooks the waterfront; 0 if not |
view | Index from 0 to 4 grading the view from the property |
condition | Index from 1 to 5 on the condition of the house |
grade | Index from 1 to 13, grading quality of construction and design |
sqft_above | Sqft of the interior housing space above ground level |
sqft_basement | Sqft of the interior housing space below ground level |
yr_built | The year the house was initially built |
yr_renovated | The year of the house’s last renovation |
zipcode | Zipcode area the house is in |
lat | Latitude |
long | Longitude |
sqft_living15 | Sqft of interior living space for the nearest 15 neighbors |
sqft_lot15 | Sqft of the land lots of the nearest 15 neighbors |

**Table 3.** Average (standard deviation) of mean squared prediction errors (MSE) for the Capital Bikeshare data with the ensemble procedures. Summary statistics computed across 500 iterations.

G | m | Bagging | Stacking | Magging |
---|---|---|---|---|
3 | 100 | 0.03303 (0.005855) | 0.03223 (0.005579) | 0.04217 (0.010301) |
3 | 300 | 0.02445 (0.003641) | 0.02424 (0.003559) | 0.02740 (0.005098) |
3 | 600 | 0.02257 (0.003102) | 0.02251 (0.003080) | 0.02365 (0.003562) |
3 | 900 | 0.02226 (0.003052) | 0.02224 (0.003046) | 0.02278 (0.003216) |
5 | 100 | 0.03064 (0.005142) | 0.02940 (0.004556) | 0.04323 (0.010627) |
5 | 300 | 0.02381 (0.003535) | 0.02349 (0.003360) | 0.02793 (0.005720) |
5 | 600 | 0.02223 (0.002936) | 0.02216 (0.002902) | 0.02352 (0.003551) |
5 | 900 | 0.02185 (0.003004) | 0.02182 (0.002997) | 0.02246 (0.003151) |
10 | 100 | 0.02958 (0.004680) | 0.02779 (0.004059) | 0.04471 (0.010182) |
10 | 300 | 0.02360 (0.003440) | 0.02307 (0.003197) | 0.02909 (0.006110) |
10 | 600 | 0.02221 (0.003054) | 0.02207 (0.003012) | 0.02399 (0.003898) |
10 | 900 | 0.02187 (0.002960) | 0.02181 (0.002947) | 0.02268 (0.003196) |

**Table 4.** m and $G\phantom{\rule{0.166667em}{0ex}}m$ as percentages of the training sample size $n=2613$.

$\mathit{m}$ | $\mathit{m}$ as % of $\mathit{n}$ | $\mathit{G}\phantom{\rule{0.166667em}{0ex}}\mathit{m}$ as % of $\mathit{n}$ ($G=3$) | ($G=5$) | ($G=10$) |
---|---|---|---|---|
100 | 4% | 11% | 19% | 38% |
300 | 11% | 34% | 57% | 115% |
600 | 23% | 69% | 115% | 230% |
900 | 34% | 103% | 172% | 344% |

**Table 5.** Results of the two-sided paired-sample Wilcoxon tests (with Bonferroni correction) for the Capital Bikeshare data. The columns compare bagging (B), stacking (S), and magging (M) pairwise.

 | $\mathit{m}=100$: B-S | B-M | S-M | $\mathit{m}=300$: B-S | B-M | S-M |
---|---|---|---|---|---|---|
$G=3$ | *** | *** | *** | *** | *** | *** |
$G=5$ | *** | *** | *** | *** | *** | *** |
$G=10$ | *** | *** | *** | *** | *** | *** |

 | $\mathit{m}=600$: B-S | B-M | S-M | $\mathit{m}=900$: B-S | B-M | S-M |
---|---|---|---|---|---|---|
$G=3$ | $3.91\times {10}^{-15}$ | *** | *** | $1.16\times {10}^{-4}$ | *** | *** |
$G=5$ | $2.75\times {10}^{-12}$ | *** | *** | $8.49\times {10}^{-8}$ | *** | *** |
$G=10$ | *** | *** | *** | $3.91\times {10}^{-13}$ | *** | *** |

**Table 6.** Average (standard deviation) of prediction errors for the logarithm of sale price in the King County house sales data with the ensemble procedures. Summary statistics computed across 500 iterations.

G | m | Bagging | Stacking | Magging |
---|---|---|---|---|
3 | 500 | 0.05895 (0.00817) | 0.05907 (0.00811) | 0.06581 (0.00804) |
3 | 1000 | 0.05008 (0.00602) | 0.05018 (0.00600) | 0.05387 (0.00598) |
3 | 2000 | 0.04531 (0.00390) | 0.04536 (0.00392) | 0.04749 (0.00362) |
5 | 500 | 0.05656 (0.00756) | 0.05678 (0.00744) | 0.06454 (0.00794) |
5 | 1000 | 0.04865 (0.00547) | 0.04886 (0.00548) | 0.05309 (0.00509) |
5 | 2000 | 0.04426 (0.00343) | 0.04438 (0.00348) | 0.04667 (0.00312) |
10 | 500 | 0.05488 (0.00717) | 0.05540 (0.00699) | 0.06343 (0.00717) |
10 | 1000 | 0.04786 (0.00514) | 0.04834 (0.00523) | 0.05243 (0.00445) |
10 | 2000 | 0.04396 (0.00344) | 0.04422 (0.00357) | 0.04663 (0.00297) |

**Table 7.** m and $\mathit{G}\phantom{\rule{0.166667em}{0ex}}\mathit{m}$ as percentages of the training sample size n = 19,435.

$\mathit{m}$ | $\mathit{m}$ as % of $\mathit{n}$ | $\mathit{G}\phantom{\rule{0.166667em}{0ex}}\mathit{m}$ as % of $\mathit{n}$ ($G=3$) | ($G=5$) | ($G=10$) |
---|---|---|---|---|
500 | 3% | 8% | 13% | 26% |
1000 | 5% | 15% | 26% | 51% |
2000 | 10% | 31% | 51% | 103% |

**Table 8.** Results of the two-sided paired-sample Wilcoxon tests (with Bonferroni correction) for the King County house sales data. The columns compare bagging (B), stacking (S), and magging (M) pairwise.

 | $\mathit{m}=500$: B-S | B-M | S-M | $\mathit{m}=1000$: B-S | B-M | S-M | $\mathit{m}=2000$: B-S | B-M | S-M |
---|---|---|---|---|---|---|---|---|---|
$G=3$ | $1.72\times {10}^{-11}$ | *** | *** | *** | *** | *** | *** | *** | *** |
$G=5$ | $1.98\times {10}^{-15}$ | *** | *** | *** | *** | *** | *** | *** | *** |
$G=10$ | *** | *** | *** | *** | *** | *** | *** | *** | *** |

Link Function | Classical GLM (Quantitative Features) | DB-GLM (Whole Sample) | DB-GLM and Bagging |
---|---|---|---|
logit | 0.88940 | 0.88380 | 0.88395 |
probit | 0.88930 | 0.88465 | 0.88350 |

Classical Logistic (Quantitative Features) | DB Logistic (Whole Sample) | DB Logistic and Bagging |
---|---|---|
0.7401 | 0.9734 | 0.9780 |

 | Classical Logistic (Quantitative Features) | DB Logistic and Bagging |
---|---|---|
Model 1 | 0.8220 | 0.9244 |
Model 2 | 0.9208 | 0.9601 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Baíllo, A.; Grané, A. Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data. *Mathematics* **2021**, *9*, 2247.
https://doi.org/10.3390/math9182247
