fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search
Abstract
1. Introduction
- The Batch Forward Search. Instead of incrementing the subset used in fitting by one observation, we move from a subset of size m to one of size m + k, where k is the batch size used in our example. We use an approximation to test for outliers one observation at a time;
- A SAS version of the program, fsdaSAS (https://github.com/UniprJRC/FSDAsas accessed on 12 March 2021), which takes advantage of the file handling capabilities of SAS to increase the size of datasets that can be analysed and to decrease computation time for large problems.
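The batch idea in the first bullet can be illustrated in a few lines. The sketch below is a simplified, hypothetical forward search (not the fsdaSAS implementation, which uses deletion residuals and theoretical envelopes): the fitting subset grows by k observations per step instead of one, so the number of least-squares refits falls by roughly a factor of k.

```python
import numpy as np

def forward_search_batch(X, y, m0=None, k=1):
    """Toy batch forward search: grow the fitting subset by k cases per
    step and record the smallest absolute residual among the cases not
    yet in the subset (a crude stand-in for the deletion residual)."""
    n, p = X.shape
    m = m0 if m0 is not None else p + 1
    # initial subset: the m cases best fitted by ordinary least squares
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    subset = np.argsort(np.abs(y - X @ beta))[:m]
    min_res = []
    while m < n:
        # refit on the current subset only
        beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        res = np.abs(y - X @ beta)
        outside = np.setdiff1d(np.arange(n), subset)
        min_res.append(res[outside].min())
        m = min(m + k, n)                 # batch step: m -> m + k
        subset = np.argsort(res)[:m]      # new subset chosen from all n cases
    return np.array(min_res)
```

With k = 1 this reduces to the step-by-step search; with k > 1 only every kth subset is fitted, which is the source of the timing gains discussed later.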
2. Algebra for the Forward Search
3. Why SAS?
- When the data are at the limit of the physical memory, caching strategies become crucial to avoid the deterioration of performance. Unlike other statistical environments that only run in memory and crash when a dataset is too large to be loaded, SAS uses file-swapping to handle out-of-memory problems. The swapping is very efficient, as the SAS procedures are optimised to limit the number of files created within a procedure, avoiding unnecessary swapping steps;
- File records are stored sequentially, so that processing happens one record at a time. The SAS data step therefore reads through the data only once and applies all the commands to each line of data of interest. In this way, data movement is drastically limited and processing time is reduced;
- A data step only reads the data that it needs in the memory and leaves out the data that it does not need in the source;
- Furthermore, data are indexed to allow for faster retrieval from datasets;
- Finally, in regression and other predictive modelling methods, multi-threading is applied whenever this is appropriate for the analysis.
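As an analogy for the record-at-a-time data step described in the points above, here is a minimal Python sketch (the function name and fields are invented for illustration): each record is read once, every operation is applied to it in turn, and only the columns of interest are retained, so memory use stays constant regardless of file size.

```python
import csv
import io

def one_pass(lines, keep, transform):
    """Read records sequentially, apply all operations to each record in
    turn, and retain only the fields of interest (constant memory)."""
    for record in csv.DictReader(lines):
        row = {name: record[name] for name in keep}  # drop unused columns
        yield transform(row)

# toy file: the data-step analogue touches each line exactly once
data = io.StringIO("id,sales,cost,region\n1,10,4,N\n2,8,5,S\n3,12,3,N\n")
margins = list(one_pass(data, keep=("id", "sales", "cost"),
                        transform=lambda r: (r["id"], int(r["sales"]) - int(r["cost"]))))
```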
4. FS Analysis of the Transformed Loyalty Card Data
5. The FS Batch Procedure
6. Timing Comparisons
7. Balance Sheet Data—A Large Dataset
| Variable | Definition |
|---|---|
| y | profitability, calculated as return over sales |
| x1 | labour share: the ratio of labour cost to value added |
| x2 | the ratio of tangible fixed assets to value added |
| x3 | the ratio of intangible assets to total assets |
| x4 | the ratio of industrial equipment to total assets |
| x5 | the firm’s interest burden: the ratio of the firm’s total assets to net capital |
8. Discussion and Extensions
8.1. Three Classes of Estimator for Robust Regression
- Hard (0,1) trimming: In least trimmed squares (LTS: [17,18]) the amount of trimming is determined by the choice of the trimming parameter h, n/2 ≤ h ≤ n, which is specified in advance. The LTS estimate minimises the sum of squares of the residuals of the h observations with the smallest squared residuals. For LS, h = n. We also monitor a generalisation of least median of squares (LMS, [18]) in which the estimate of the parameters minimises the median of the h squared residuals.
- Soft trimming (downweighting): M estimation and derived methods. The intention is that observations near the centre of the distribution retain their value, but the ρ function ensures that increasingly remote observations have a weight that decreases with distance from the centre. SAS provides the ROBUSTREG procedure, where the choice of downweighting estimators includes S [23] and MM estimation [24], independently of the ρ function (Andrews, Bisquare, Cauchy, Fair, Hampel, Huber, Logistic, Median, Talworth, Welsch). Our contribution is the monitoring of these estimators and also of LTS and LMS (as described in the section below).

Many of the algorithms for finding these estimators start from very small subsets of data, typically of size p or p + 1, before moving on to the use of larger subsets. Hawkins and Olive [25] argue that, to avoid inconsistent estimators, these larger subsets need to increase in size with n. Cerioli et al. [22] prove the consistency of the FS. In addition to developing the analysis of consistency, Olive [26] discusses the approximate nature of the estimators from subset procedures and analyses the computational complexity of the exact solutions to some of these robust estimation problems.
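To make the hard-trimming objectives concrete, the following sketch (hypothetical helper names, not part of fsdaSAS) evaluates, for a given vector of residuals, the LTS objective (the sum of the h smallest squared residuals) and the generalised LMS objective monitored here (the median of the h smallest squared residuals).

```python
import numpy as np

def lts_objective(res, h):
    """Hard (0,1) trimming: sum of the h smallest squared residuals."""
    return np.sort(res ** 2)[:h].sum()

def lms_objective(res, h):
    """Generalised LMS: median of the h smallest squared residuals."""
    return np.median(np.sort(res ** 2)[:h])

res = np.array([0.1, -0.2, 0.15, 5.0, -0.05])  # one gross outlier
# with h = n (no trimming) the LTS objective is the LS sum of squares
```

Trimming with h = 4 here removes the outlier's influence entirely (objective 0.075 instead of 25.075 at h = 5), which is exactly the (0,1) weighting described above.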
8.2. Monitoring and Graphics
8.3. Programs
- FSR.sx and FSM.sx, which implement the FS approach to detect outliers, respectively, in regression and in multivariate data;
- FSRfan.sx and FSMfan.sx for identifying the best transformation parameter for the Box–Cox transformation in regression and multivariate analysis ([31], Chapter 4);
- Monitoring.sx for monitoring a number of traditional robust multivariate and regression estimators (S, MM, LTS and LMS), already present in SAS, for specific choices of breakdown point or efficiency. Riani et al. [4] introduced the monitoring of regression estimators detailed in Section 8.2; in the FSDA toolbox, however, this was only for S and MM estimators (and the FS) in MATLAB. The extension to the monitoring of LTS and LMS is a particularly powerful new feature and a novelty in the statistical literature.
- FSM.sx, the multivariate counterpart of FSR.sx, and FSMfan.sx for multivariate transformations;
- FSRms.sx for choosing the best model in regression. This function implements the procedure of Riani and Atkinson [33], which combines Mallows’ Cp [34] with the flexible trimming of the FS to yield an information-rich plot, the “Generalized Candlestick Plot”, revealing the effect of outliers on model choice;
- FSRMultipleStart.sx and FSMmultiplestart.sx for identifying observations that are divided into groups either of regression models or of multivariate normal clusters. The latter procedure is derived from the FSDA implementation of Atkinson et al. [35].
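Since FSRms.sx builds on Mallows’ Cp, a brief non-robust reminder of that statistic may help. In this sketch (function name invented for illustration), Cp = RSS_p/s² − n + 2p, with s² the residual mean square of the full model; FSRms.sx combines this quantity with the flexible trimming of the FS rather than computing it on all the data.

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """Cp = RSS_p / s^2 - n + 2p, with s^2 the residual mean square of
    the full model and p the number of parameters of the submodel."""
    n = len(y)

    def rss(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        return float(e @ e)

    s2 = rss(X_full) / (n - X_full.shape[1])
    return rss(X_sub) / s2 - n + 2 * X_sub.shape[1]
```

For the full model, Cp reduces algebraically to its number of parameters, a convenient sanity check.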
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Testing for Outliers in Regression
Appendix B. Regression Outlier Detection in the FS
- Detection of a Signal: There are four conditions, the fulfilment of any one of which leads to the detection of a signal.
- In the central part of the search, we require three consecutive values of r_min(m, n) above the 99.99% envelope or one value above the 99.999% envelope;
- In the final part of the search, we need two consecutive values of r_min(m, n) above the 99.9% envelope and one above the 99% envelope;
- r_min(n − 2, n) is above the 99.9% envelope;
- r_min(n − 1, n) is above the 99% envelope; in this case, a single outlier is detected and the procedure terminates.
The final part of the search is defined as those steps with m ≥ n − 13·(n/200)^0.5.
- Confirmation of a Signal: The purpose of the first point, in particular, is to distinguish informative peaks from random fluctuations in the centre of the search. Once a signal takes place (at m = m†), we check whether the signal is informative about the structure of the data. If r_min(m†, n) falls back below the 1% envelope, we decide the signal is not informative, increment m and return to Step 1.
- Identification of Outliers: With an informative signal, we start superimposing envelopes, taking n* = m†, m† + 1, …, until the final, penultimate or antepenultimate value of r_min(m, n*) is above the 99% threshold or, alternatively, we have a value of r_min(m, n*), for some n*, which is greater than the 99.9% threshold. Let this value be n†. We then obtain the best parameter estimates by using the sample of size n† − 1.
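The central-part rule above can be sketched as a simple scan over the curve of minimum deletion residuals. The function below is illustrative only (names invented; envelope values are taken as given, whereas the package derives them from distribution theory): it flags a signal after three consecutive exceedances of the 99.99% envelope, or a single exceedance of a more extreme envelope.

```python
def detect_signal(rmin, env_9999, env_extreme):
    """Return the first step index at which a signal is detected, or None.
    rmin: minimum deletion residuals along the search; env_9999 and
    env_extreme: precomputed 99.99% and more extreme envelopes."""
    above = [r > e for r, e in zip(rmin, env_9999)]
    for m in range(len(rmin)):
        if rmin[m] > env_extreme[m]:           # one extreme exceedance
            return m
        if m >= 2 and above[m - 2] and above[m - 1] and above[m]:
            return m                           # three consecutive exceedances
    return None
```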
Appendix C. Software for Robust Data Analysis
References
1. Perrotta, D.; Riani, M.; Torti, F. New robust dynamic plots for regression mixture detection. Adv. Data Anal. Classif. 2009, 3, 263–279.
2. Riani, M.; Perrotta, D.; Torti, F. FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemom. Intell. Lab. Syst. 2012, 116, 17–32.
3. Torti, F.; Perrotta, D.; Atkinson, A.C.; Corbellini, A.; Riani, M. Monitoring Robust Regression in SAS IML Studio: S, MM, LTS, LMS and Especially the Forward Search; Technical Report JRC121650; Publications Office of the European Union: Luxembourg, 2020.
4. Riani, M.; Cerioli, A.; Atkinson, A.C.; Perrotta, D. Monitoring Robust Regression. Electron. J. Stat. 2014, 8, 642–673.
5. Riani, M.; Atkinson, A.C.; Cerioli, A. Finding an Unknown Number of Multivariate Outliers. J. R. Stat. Soc. Ser. B 2009, 71, 447–466.
6. Atkinson, A.C.; Riani, M.; Corbellini, A. An analysis of transformations for profit-and-loss data. Appl. Stat. 2020, 69, 251–275.
7. Atkinson, A.C.; Riani, M. Distribution theory and simulations for tests of outliers in regression. J. Comput. Graph. Stat. 2006, 15, 460–476.
8. Atkinson, A.C. Testing transformations to normality. J. R. Stat. Soc. Ser. B 1973, 35, 473–479.
9. Riani, M.; Atkinson, A.C. Robust diagnostic data analysis: Transformations in regression (with discussion). Technometrics 2000, 42, 384–398.
10. Atkinson, A.C.; Riani, M. Tests in the fan plot for robust, diagnostic transformations in regression. Chemom. Intell. Lab. Syst. 2002, 60, 87–100.
11. Atkinson, A.C.; Corbellini, A.; Riani, M. Robust Bayesian Regression with the Forward Search: Theory and Data Analysis. Test 2017, 26, 869–886.
12. Cerioli, A.; Riani, M. Robust methods for the analysis of spatially autocorrelated data. Stat. Methods Appl. J. Ital. Stat. Soc. 2002, 11, 335–358.
13. Maitra, R.; Melnykov, V. Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms. J. Comput. Graph. Stat. 2010, 19, 354–376.
14. Torti, F.; Perrotta, D.; Riani, M.; Cerioli, A. Assessing Trimming Methodologies for Clustering Linear Regression Data. Adv. Data Anal. Classif. 2018.
15. Corbellini, A.; Magnani, M.; Morelli, G. Labor market analysis through transformations and robust multivariate models. Socio-Econ. Plan. Sci. 2020.
16. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and transformation (with discussion). J. Am. Stat. Assoc. 1985, 80, 580–619.
17. Hampel, F.R. Beyond location parameters: Robust concepts and methods. Bull. Int. Stat. Inst. 1975, 46, 375–382.
18. Rousseeuw, P.J. Least median of squares regression. J. Am. Stat. Assoc. 1984, 79, 871–880.
19. Atkinson, A.C.; Riani, M. Robust Diagnostic Regression Analysis; Springer: New York, NY, USA, 2000.
20. Riani, M.; Atkinson, A.C.; Perrotta, D. A parametric framework for the comparison of methods of very robust regression. Stat. Sci. 2014, 29, 128–143.
21. Atkinson, A.C.; Riani, M.; Cerioli, A. The Forward Search: Theory and data analysis (with discussion). J. Korean Stat. Soc. 2010, 39, 117–134.
22. Cerioli, A.; Farcomeni, A.; Riani, M. Strong consistency and robustness of the Forward Search estimator of multivariate location and scatter. J. Multivar. Anal. 2014, 126, 167–183.
23. Rousseeuw, P.J.; Yohai, V.J. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis: Lecture Notes in Statistics 26; Springer: New York, NY, USA, 1984; pp. 256–272.
24. Yohai, V.J.; Zamar, R.H. High breakdown-point estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413.
25. Hawkins, D.M.; Olive, D.J. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm (with discussion). J. Am. Stat. Assoc. 2002, 97, 136–159.
26. Olive, D.J. Robust Statistics. 2020. Available online: http://parker.ad.siu.edu/Olive/robbook.htm (accessed on 15 April 2021).
27. Rousseeuw, P.J.; Leroy, A.M. Robust Regression and Outlier Detection; Wiley: New York, NY, USA, 1987.
28. Riani, M.; Cerioli, A.; Torti, F. On consistency factors and efficiency of robust S-estimators. Test 2014, 23, 356–387.
29. Riani, M.; Atkinson, A.C.; Corbellini, A.; Perrotta, D. Robust regression with density power divergence: Theory, comparisons and data analysis. Entropy 2020, 22, 399.
30. Cerioli, A.; Riani, M.; Atkinson, A.C.; Corbellini, A. The power of monitoring: How to make the most of a contaminated multivariate sample (with discussion). Stat. Methods Appl. 2017.
31. Atkinson, A.C.; Riani, M.; Cerioli, A. Exploring Multivariate Data with the Forward Search; Springer: New York, NY, USA, 2004.
32. Pison, G.; Van Aelst, S.; Willems, G. Small sample corrections for LTS and MCD. Metrika 2002, 55, 111–123.
33. Riani, M.; Atkinson, A.C. Robust model selection with flexible trimming. Comput. Stat. Data Anal. 2010, 54, 3300–3312.
34. Mallows, C.L. Some comments on Cp. Technometrics 1973, 15, 661–675.
35. Atkinson, A.C.; Riani, M.; Cerioli, A. Cluster detection and clustering with random start forward searches. J. Appl. Stat. 2018, 45, 777–798.
36. Lehmann, E. Point Estimation; Wiley: New York, NY, USA, 1991.
37. Guenther, W.C. An Easy Method for Obtaining Percentage Points of Order Statistics. Technometrics 1977, 19, 319–321.
38. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions—1, 2nd ed.; Wiley: New York, NY, USA, 1994.
39. Tallis, G.M. Elliptical and Radial Truncation in Normal Samples. Ann. Math. Stat. 1963, 34, 940–944.
40. Buja, A.; Rolke, W. Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data; Technical Report; The Wharton School, University of Pennsylvania: Philadelphia, PA, USA, 2003.
41. Todorov, V.; Filzmoser, P. An Object-Oriented Framework for Robust Multivariate Analysis. J. Stat. Softw. 2009, 32, 1–47.
42. Rousseeuw, P.J.; Croux, C.; Todorov, V.; Ruckstuhl, A.; Salibian-Barrera, M.; Verbeke, T.; Maechler, M. Robustbase: Basic Robust Statistics. R Package Version 0.92-7. 2009. Available online: http://CRAN.R-project.org/package=robustbase (accessed on 15 April 2021).
43. Riani, M.; Cerioli, A.; Corbellini, A.; Perrotta, D.; Torti, F.; Sordini, E.; Todorov, V. fsdaR: Robust Data Analysis Through Monitoring and Dynamic Visualization. 2017. Available online: https://CRAN.R-project.org/package=fsdaR (accessed on 15 April 2021).
44. Hubert, M.; Debruyne, M. Minimum Covariance Determinant. Wires Comput. Stat. 2010, 2, 36–43.
45. Vanden Branden, K.; Hubert, M. Robustness properties of a robust partial least squares regression method. Anal. Chim. Acta 2005, 515, 229–241.
46. Verboven, S.; Hubert, M. Matlab library LIBRA. Wires Comput. Stat. 2010, 2, 509–515.
47. Hubert, M.; Rousseeuw, P.J.; Vanden Branden, K. ROBPCA: A new approach to robust principal component analysis. Technometrics 2005, 47, 64–79.
48. García-Escudero, L.A.; Gordaliza, A.; Matran, C.; Mayo-Iscar, A.; San Martin, R. A general trimming approach to robust cluster analysis. Ann. Stat. 2008, 36, 1324–1345.
49. García-Escudero, L.A.; Gordaliza, A.; Mayo-Iscar, A.; San Martin, R. Robust clusterwise linear regression through trimming. Comput. Stat. Data Anal. 2010, 54, 3057–3069.
50. Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: Chichester, UK, 2006.
51. Van Aelst, S.; Rousseeuw, P.J. Minimum volume ellipsoid. Wires Comput. Stat. 2009, 1, 71–82.
| | Least Squares on All Data | Standard FS | Batch FS |
|---|---|---|---|
| Number of units | 44,140 | 43,995 | 43,979 |
| Error d.f. | 44,134 | 43,989 | 43,973 |
| t values: | | | |
| Intercept | 377.0 | 383.5 | 383.9 |
| x1 | −249.3 | −253.9 | −254.2 |
| x2 | −47.4 | −48.5 | −48.5 |
| x3 | −10.2 | −10.4 | −10.3 |
| x4 | −5.0 | −4.9 | −5.0 |
| x5 | −15.2 | −15.5 | −15.5 |
| σ̂ for regression | 1.274 | 1.322 | 1.325 |
| R² | 0.591 | 0.601 | 0.600 |
Share and Cite
Torti, F.; Corbellini, A.; Atkinson, A.C. fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search. Stats 2021, 4, 327-347. https://doi.org/10.3390/stats4020022