# Evaluation of Regression Models: Model Assessment, Model Selection and Generalization Error


## Abstract


## 1. Introduction

## 2. Preprocessing of Data and Regression Models

#### 2.1. Preprocessing

#### 2.2. Ordinary Least Squares Regression and Linear Polynomial Regression

#### 2.3. Regularization: Ridge Regression

#### 2.4. R Package

## 3. Overall View on Model Diagnosis

- Training data set: ${D}_{train}$
- Validation data set: ${D}_{val}$
- Test data set: ${D}_{test}$

## 4. Model Assessment

- SST is the **sum of squares total**, also called the **total sum of squares** (TSS);
- SSR is the **sum of squares due to regression** (the variation explained by the linear model), also called the **explained sum of squares** (ESS);
- SSE is the **sum of squares due to errors** (the unexplained variation), also called the **residual sum of squares** (RSS).

For quantifying the prediction error, the **mean squared error** (MSE) is useful, given by $MSE=\frac{1}{n}{\sum}_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}$ for predictions ${\hat{y}}_{i}$ of the observed values ${y}_{i}$.
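As a minimal sketch of these quantities (on hypothetical simulated data, not the data used in this article), the following verifies the decomposition SST = SSR + SSE, which holds exactly for a least-squares fit with an intercept:

```python
import numpy as np

# Hypothetical data: fit a least-squares line and verify SST = SSR + SSE.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Ordinary least squares fit (degree-1 polynomial).
coeffs = np.polyfit(x, y, deg=1)
y_hat = np.polyval(coeffs, x)

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
mse = sse / y.size                     # mean squared error

assert np.isclose(sst, ssr + sse)
```

Note that the identity SST = SSR + SSE requires the model to contain an intercept; for regression through the origin it does not hold in general.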

#### 4.1. Bias-Variance Tradeoff

- Noise: This term measures the variability within the data, not considering any model. The noise cannot be reduced because it does not depend on the training data D or g, or any other parameter under our control; hence, it is a characteristic of the data. For this reason, this component is also called “irreducible error”.
- Variance: This term measures the model variability with respect to changing training sets. This variance can be reduced by using less complex models, g. However, this can increase the bias (underfitting).
- Bias: This term measures the inherent error that you obtain from your model, even with infinite training data. This bias can be reduced by using more complex models, g. However, this can increase the variance (overfitting).
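The three components above can be estimated numerically by refitting a model on many independently drawn training sets. The following Monte Carlo sketch (with an assumed true function and noise level, chosen only for illustration) approximates noise, squared bias, and variance for a polynomial model:

```python
import numpy as np

# Monte Carlo sketch (hypothetical setup): decompose the expected test error of
# a polynomial fit into noise, bias^2, and variance over repeated training sets.
rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function
sigma = 0.3                           # assumed noise standard deviation
x_test = np.linspace(0.05, 0.95, 20)

def repeated_fits(degree, n_train=30, n_repeats=500):
    """Fit the same model class on many training sets; return test predictions."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_test)
    return preds

preds = repeated_fits(degree=3)
bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)  # squared bias
variance = np.mean(preds.var(axis=0))                   # model variability
noise = sigma ** 2                                      # irreducible error
# Expected test error ~ noise + bias^2 + variance
```

Increasing `degree` in this sketch shrinks `bias2` while inflating `variance`, which is the tradeoff described above.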

#### 4.2. Example: Linear Polynomial Regression Model

The results of such an analysis can be summarized by **error-complexity curves**. They are important for evaluating the learning behavior of models.

**Definition 1.** **Error-complexity curves** show the training error and test error in dependence on the model complexity. The models underlying these curves are estimated from training data with a fixed sample size.
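Such curves can be estimated with a short simulation. The sketch below (hypothetical data-generating function and noise level) computes training and test error for linear polynomial regression models of increasing degree, with the training set held fixed:

```python
import numpy as np

# Sketch (assumed setup): error-complexity curves for polynomial regression.
# Training and test error as a function of the polynomial degree.
rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true model

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = make_data(30)      # fixed training set
x_test, y_test = make_data(1000)      # large test set

degrees = range(1, 11)
train_err, test_err = [], []
for d in degrees:
    coef = np.polyfit(x_train, y_train, d)
    train_err.append(np.mean((y_train - np.polyval(coef, x_train)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(coef, x_test)) ** 2))
# Plotting train_err and test_err against `degrees` gives the
# error-complexity curves: training error decreases monotonically,
# while test error eventually rises again (overfitting).
```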

#### 4.3. Idealized Error-Complexity Curves

**A model is overfitting** if its test error is higher than that of a less complex model. That means that to decide whether a model is overfitting, it is necessary to compare it with a simpler model. Hence, overfitting is detected from a comparison; it is not an absolute measure. Figure 3B shows that all models with a model complexity larger than 3.5 are overfitting with respect to the best model, which has a model complexity of ${c}_{opt}=3.5$ and leads to the lowest test error. One can formalize this by defining an overfitting model as follows.

**Definition 2** (model overfitting)**.** A model with complexity $c>{c}_{opt}$ is called **overfitting** if, for the test error of this model, the following holds: ${E}_{test}(c)>{E}_{test}({c}_{opt})$.

**A model is underfitting** if its test error is higher than that of a more complex model. In other words, to decide whether a model is underfitting, it is necessary to compare it with a more complex model. In Figure 3B, all models with a model complexity smaller than 3.5 are underfitting with respect to the best model. The formal definition can be given as follows.

**Definition 3** (model underfitting)**.** A model with complexity $c<{c}_{opt}$ is called **underfitting** if, for the test error of this model, the following holds: ${E}_{test}(c)>{E}_{test}({c}_{opt})$.

The **generalization capabilities of a model** are assessed by comparing its predictive performance on the test data with its performance on the training data. If the distance between the test error and the training error (the gap) is small, the model generalizes well from the training data to unseen data.

**Definition 4** (generalization)**.** A model **generalizes** well if the gap between its test error and its training error, ${E}_{test}-{E}_{train}$, is small.

## 5. Model Selection

#### 5.1. ${R}^{2}$ and Adjusted ${R}^{2}$
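As a brief sketch of these two criteria (using the standard formulas ${R}^{2}=1-SSE/SST$ and adjusted ${R}^{2}=1-(1-{R}^{2})\frac{n-1}{n-p-1}$ with $p$ predictors, on hypothetical data rather than the data of this article):

```python
import numpy as np

# Sketch: R^2 and adjusted R^2 for a least-squares polynomial fit.
# Adjusted R^2 penalizes the number of predictors p.
rng = np.random.default_rng(8)
n = 80
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=n)  # assumed linear truth

degree = 3                        # p = degree predictors (plus the intercept)
coef = np.polyfit(x, y, degree)
y_hat = np.polyval(coef, x)

sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - sse / sst
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - degree - 1)
# Unlike R^2, the adjusted R^2 can decrease when a useless predictor is added.
```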

#### 5.2. Mallows’ Cp Statistic

- The optimism increases with ${\sigma}^{2}$;
- The optimism increases with p;
- The optimism decreases with n.

- Adding more noise (indicated by increasing ${\sigma}^{2}$) while leaving n and p fixed makes it harder for a model to be learned;
- Increasing the complexity of the model (indicated by increasing p) while leaving ${\sigma}^{2}$ and n fixed makes it easier for a model to fit the training data but is prone to overfitting;
- Increasing the sample size (indicated by increasing n) while leaving ${\sigma}^{2}$ and p fixed reduces the chances of overfitting.
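A minimal sketch of Mallows' statistic, using the common form ${C}_{p}=SSE_{p}/{\hat{\sigma}}^{2}-n+2p$ with ${\hat{\sigma}}^{2}$ estimated from the largest candidate model (this choice of variance estimate is an assumption of the illustration, as is the simulated data):

```python
import numpy as np

# Sketch: Mallows' Cp for nested polynomial models.
# Cp = SSE_p / sigma_hat^2 - n + 2p, with p parameters including the intercept.
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def sse(degree):
    coef = np.polyfit(x, y, degree)
    return np.sum((y - np.polyval(coef, x)) ** 2)

max_degree = 10
# Variance estimate from the full (largest) model, with n - p_full df.
sigma2_hat = sse(max_degree) / (n - (max_degree + 1))

cps = {}
for d in range(1, max_degree + 1):
    p = d + 1                     # number of parameters incl. intercept
    cps[d] = sse(d) / sigma2_hat - n + 2 * p
# Models with Cp close to p are considered adequate; by construction,
# the full model always has Cp = p exactly.
```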

#### 5.3. Akaike’s Information Criterion (AIC), Schwarz’s BIC, and the Bayes Factor

The marginal likelihood $P(D\mid {\mathcal{M}}_{i})$ is called the **evidence for the model** ${\mathcal{M}}_{i}$, or simply the "evidence".

The **posterior odds** of two models ${\mathcal{M}}_{i}$ and ${\mathcal{M}}_{j}$ are given by $\frac{P({\mathcal{M}}_{i}\mid D)}{P({\mathcal{M}}_{j}\mid D)}=\frac{P(D\mid {\mathcal{M}}_{i})}{P(D\mid {\mathcal{M}}_{j})}\cdot \frac{P({\mathcal{M}}_{i})}{P({\mathcal{M}}_{j})}$; that is, the Bayes factor multiplied by the prior odds of the models.

- BIC selects smaller models (more parsimonious) than AIC and tends to perform underfitting;
- AIC selects larger models than BIC and tends to perform overfitting;
- AIC represents a frequentist point of view;
- BIC represents a Bayesian point of view;
- AIC is asymptotically efficient but not consistent;
- BIC is consistent but not asymptotically efficient;
- AIC should be used when the goal is prediction accuracy of a model;
- BIC should be used when the goal is model interpretability.
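The different behavior of the two criteria can be seen in a small sketch, using the common Gaussian-linear-model forms $AIC=n\,\mathrm{log}(SSE/n)+2p$ and $BIC=n\,\mathrm{log}(SSE/n)+p\,\mathrm{log}(n)$ (additive constants dropped; simulated data assumed for illustration):

```python
import numpy as np

# Sketch: AIC and BIC for polynomial regression models of increasing degree.
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=n)

def information_criteria(degree):
    coef = np.polyfit(x, y, degree)
    sse = np.sum((y - np.polyval(coef, x)) ** 2)
    p = degree + 1                       # parameters incl. intercept
    aic = n * np.log(sse / n) + 2 * p
    bic = n * np.log(sse / n) + p * np.log(n)
    return aic, bic

scores = {d: information_criteria(d) for d in range(1, 8)}
best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])
# BIC's heavier penalty (log n > 2 for n > 7) never selects a larger
# model than AIC does, consistent with its parsimonious tendency.
```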

#### 5.4. Best Subset Selection

Algorithm 1: Best subset selection.

Algorithm 2: Forward stepwise selection.

Algorithm 3: Backward stepwise selection.

#### 5.5. Stepwise Selection

#### 5.5.1. Forward Stepwise Selection

#### 5.5.2. Backward Stepwise Selection
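The greedy logic shared by the stepwise procedures can be sketched as follows for the forward direction (hypothetical design matrix and coefficients; the selection criterion here is the residual sum of squares, as in Algorithm 2's inner step):

```python
import numpy as np

# Sketch of forward stepwise selection: starting from the empty model,
# greedily add the predictor that most reduces the residual sum of squares.
rng = np.random.default_rng(5)
n, k = 100, 6
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def sse_of(subset):
    """SSE of the OLS fit using the predictors in `subset` plus an intercept."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

selected, remaining = [], list(range(k))
path = []
while remaining:
    best = min(remaining, key=lambda j: sse_of(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    path.append((tuple(selected), sse_of(selected)))
# `path` holds the nested sequence of models; the final choice among them
# is made with a criterion such as Cp, AIC, BIC, or cross-validation.
```

Backward stepwise selection is the mirror image: start from the full model and greedily remove the predictor whose deletion increases the SSE least.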

## 6. Cross-Validation

- Cross-validation is a computational method that is simple in its realization;
- Cross-validation makes few assumptions about the true underlying model;
- Compared with AIC, BIC, and the adjusted ${R}^{2}$, cross-validation provides a direct estimate of the prediction error;
- Every data point is used for both training and testing.

- The computation time can be long because the whole analysis needs to be repeated K times for each model;
- The number of folds (K) needs to be determined;
- For a small number of folds, the bias of the estimator will be large.
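A K-fold cross-validation for polynomial regression can be sketched as follows (simulated data assumed; note that the data are randomized once before splitting and the splits are then kept fixed for all models):

```python
import numpy as np

# Sketch of K-fold cross-validation: every point is used once for testing
# and K-1 times for training.
rng = np.random.default_rng(6)
n, K = 120, 5
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

indices = rng.permutation(n)          # randomize once, then keep fixed
folds = np.array_split(indices, K)

def cv_error(degree):
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((y[test_idx] - np.polyval(coef, x[test_idx])) ** 2))
    return np.mean(errors)

cv_scores = {d: cv_error(d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
# The CV score is a direct estimate of the prediction error of each model.
```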

## 7. Learning Curves

**Definition 5.** **Learning curves** show the training error and test error in dependence on the sample size of the training data. The models underlying these curves all have the same complexity.

#### 7.1. Learning Curves for Linear Polynomial Regression Models

- How much training data is needed?
- How much bias and variance is present?

- If the curve is still changing (increasing for training error and decreasing for test error) rapidly → need larger sample size;
- If the curve is completely flattened out → sample size is sufficient;
- If the curve is gradually changing → a much larger sample size is needed.

- A model has **high bias** if the training and test error converge to a value much larger than ${E}_{test}({c}_{opt})$. In this case, increasing the sample size of the training data will not improve the results. This indicates an underfitting of the data because the model is too simple. In order to improve this, one needs to increase the complexity of the model.
- A model has **high variance** if the training and test error are quite different from each other, with a large gap between both. Here, the gap is defined as ${E}_{test}(n)-{E}_{train}(n)$ for sample size n of the training data. In this case, the training data are fitted much better than the test data, indicating problems with the generalization capabilities of the model. In order to improve this, the sample size of the training data needs to be increased.
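Learning curves of this kind can be estimated with the following sketch (assumed data-generating function and fixed model complexity, here degree 4):

```python
import numpy as np

# Sketch: learning curves for a fixed-complexity polynomial model — training
# and test error as a function of the training sample size n.
rng = np.random.default_rng(7)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true model
x_test = rng.uniform(0, 1, 2000)
y_test = f(x_test) + rng.normal(scale=0.3, size=x_test.size)

degree = 4                            # model complexity is held fixed
sizes = [10, 20, 40, 80, 160, 320]
train_err, test_err = [], []
for n in sizes:
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    coef = np.polyfit(x, y, degree)
    train_err.append(np.mean((y - np.polyval(coef, x)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(coef, x_test)) ** 2))
# As n grows, the two curves approach each other; the gap
# E_test(n) - E_train(n) indicates the variance of the model.
```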

#### 7.2. Idealized Learning Curves

A learning curve whose training and test errors converge to a value well above ${E}_{test}({c}_{opt})$ indicates an **underfitting model**. A learning curve with a persistent gap between training and test error indicates an **overfitting model**.

## 8. Summary

- Selecting predictor variables for linear regression models;
- Selecting among different regularization models, such as ridge regression, LASSO, or elastic net;
- Selecting the best classification method from a list of candidates, such as random forests, logistic regression, support vector machines, or neural networks;
- Selecting the number of neurons and hidden layers in neural networks.

- An underfitting model: Such a model is characterized by high bias, low variance, and poor test error. In general, such a model is too simple;
- The best model: For such a model, the bias and variance are balanced, and the low test error shows that it makes good predictions;
- An overfitting model: Such a model is characterized by low bias, high variance, and poor test error. In general, such a model is too complex.

## 9. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest


**Figure 2.** Different examples for fitted linear polynomial regression models of varying degree d, ranging from 1 to 10. The model degree indicates the highest polynomial degree of the fitted model. These models correspond to different model complexities, from low-complexity ($d=1$) to high-complexity ($d=10$) models. The blue curves show the true model, the red curves show the fitted models, and the black points correspond to the training data. The shown results correspond to individual fits—that is, no averaging has been performed. For all results, the sample size of the training data was kept fixed.

**Figure 3.** Error-complexity curves showing the prediction error (training and test error) in dependence on the model complexity. (**A**,**C**,**E**,**F**) show numerical simulation results for a linear polynomial regression model. The model complexity is expressed by the degree of the highest polynomial. For this analysis, the training data set was fixed. (**B**) Idealized error curves for general statistical models. (**C**) Decomposition of the expected generalization error (test error) into noise, bias, and variance. (**D**) Idealized decomposition into bias and variance. (**E**) Percentage breakdown of the noise, bias, and variance shown in (C) relative to the polynomial degrees. (**F**) Percentage breakdown for bias and variance.

**Figure 4.** Idealized visualization of the model selection process. (**A**) Three model families are shown, and estimates of three specific models were obtained from training data. (**B**) A summary combining model selection and model assessment, emphasizing that different data sets are used for different analysis steps.

**Figure 5.** Visualization of cross-validation. Before splitting the data, the data points are randomized, but the splits are then kept fixed. Ignoring the column MA gives a standard five-fold cross-validation; including the column MA gives a five-fold cross-validation with a held-out test set.

**Figure 6.** Estimated learning curves for training and test errors for six linear polynomial regression models. The model degree indicates the highest polynomial degree of the fitted model, and the horizontal dashed red line corresponds to the optimal error ${E}_{test}({c}_{opt})$ attainable by the model family for the optimal model complexity ${c}_{opt}=4$.

**Figure 7.** Idealized learning curves. The horizontal red dashed line corresponds to the optimal attainable error ${E}_{test}({c}_{opt})$ by the model family. Shown are the following four cases. (**A**) Low bias, low variance; (**B**) high bias, low variance; (**C**) low bias, high variance; (**D**) high bias, high variance.

**Figure 8.** Summary of different model selection approaches. Here, AIC stands for any criterion considering model complexity, such as BIC or ${C}_{p}$, and regularization is any regularized regression model, such as LASSO or elastic net.

| Evidence | Δ BIC = BIC_{k} − BIC_{min} | BF_{min,k} |
|---|---|---|
| weak | 0–2 | 1–3 |
| positive | 2–6 | 3–20 |
| strong | 6–10 | 20–150 |
| very strong | >10 | >150 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Emmert-Streib, F.; Dehmer, M. Evaluation of Regression Models: Model Assessment, Model Selection and Generalization Error. *Mach. Learn. Knowl. Extr.* **2019**, *1*, 521–551. https://doi.org/10.3390/make1010032
