A New Criterion for Model Selection

Abstract: Selecting the best model from a set of candidates for a given data set is not an easy task. In this paper, we propose a new criterion that, in addition to minimizing the sum of squared errors, imposes a larger penalty when too many coefficients (or estimated parameters) are added to the model from too small a sample in the presence of too much noise. We discuss several real applications that illustrate the proposed criterion and compare its results to some existing criteria based on a simulated data set and several real data sets, including advertising budget data, newly collected heart blood pressure health data, and software failure data.


Introduction
Model selection has become an important focus in recent years in statistical learning, machine learning, and big data analytics [1-4]. Currently, there are several criteria in the literature for model selection. Many researchers [3,5-11] have studied the problem of selecting variables in regression over the past three decades. Today the problem receives much attention due to growing areas in machine learning, data mining, and data science. The mean squared error (MSE), root mean squared error (RMSE), R², adjusted R², Akaike's information criterion (AIC), the Bayesian information criterion (BIC), and AICc are among the common criteria that have been used to measure model performance and select the best model from a set of potential models. Yet choosing an appropriate criterion on which to compare the many candidate models remains a difficult task for many analysts, since some criteria penalize the number of estimated parameters while others place more emphasis on the sample size of a given data set.
In this paper, we discuss a new criterion, PIC, that can be used to select the best model among a set of candidate models. The proposed PIC imposes a larger penalty when too many coefficients are added to the model from too small a sample in the presence of too much noise. We also briefly discuss several common existing criteria, including AIC, BIC, AICc, R², adjusted R², MSE, and RMSE. To illustrate the proposed criterion, we discuss results based on a simulated data set and some real applications, including advertising budget data and recently collected heart blood pressure health data sets.

Some Criteria for Model Comparisons
Suppose there are n observations on a response variable Y that relates to a set of independent variables X_1, X_2, . . . , X_{k-1} in the form

Y = f(X_1, X_2, . . . , X_{k-1}).    (1)

The statistical significance of model comparisons can be determined based on existing goodness-of-fit criteria in the literature [12]. In this section, we first briefly discuss some existing criteria for model comparison.

New PIC
We now discuss a new criterion for selecting a model among several candidate models. Suppose there are n observations on a response variable Y and (k - 1) explanatory variables X_1, X_2, . . . , X_{k-1}. Let

y_i be the i-th response (dependent variable), i = 1, 2, . . . , n,
ŷ_i be the fitted value of y_i, and
e_i be the i-th residual, i.e., e_i = y_i - ŷ_i.

From Equation (1), the sum of squared errors can be defined as follows:

SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i - ŷ_i)².    (2)

In general, the adjusted R² attaches a small penalty for adding more variables to the model. The difference between the adjusted R² and R² is usually small unless there are too many unknown coefficients in the model to be estimated from too small a sample in the presence of too much noise. In other words, the adjusted R² penalizes the loss of degrees of freedom that results from adding independent variables to the model. Our motivation in this study is to propose a new criterion that addresses this situation. From the definitions of the adjusted R² and R², which correct for the sample size and the number of estimated coefficients, respectively, we can easily show that the function

k (1 - R²_adj) / (1 - R²),  or equivalently,  k (n - 1) / (n - k),

indicates a larger penalty for adding too many coefficients (or estimated parameters) to the model from too small a sample in the presence of too much noise, where n is the sample size and k is the number of estimated parameters.
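The behavior of this penalty term can be sketched numerically. The short Python snippet below (an illustrative sketch, not code from the paper) computes k(n - 1)/(n - k) for a large and a small sample, showing how quickly the penalty grows once k becomes large relative to n:

```python
def penalty(n: int, k: int) -> float:
    """Penalty term k * (n - 1) / (n - k) from the text.

    n: sample size; k: number of estimated parameters (k < n).
    """
    if k >= n:
        raise ValueError("requires k < n")
    return k * (n - 1) / (n - k)

# For a large sample (n = 100) the penalty stays close to k,
# but for a small sample (n = 20) it blows up as k grows.
for n in (100, 20):
    print(n, [round(penalty(n, k), 2) for k in (2, 4, 8, 16)])
```

For n = 100 and k = 4 the penalty is 4 · 99/96 ≈ 4.13, barely above k itself, while for n = 20 and k = 16 it is 16 · 19/4 = 76, which is the "too many coefficients from too small a sample" situation the criterion is designed to punish.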
Based on the above, we propose a new criterion, PIC, for selecting the best model. The PIC value of the model is as follows:

PIC = SSE + k (n - 1) / (n - k),    (3)

where n is the number of observations, k is the number of estimated parameters (corresponding to (k - 1) explanatory variables plus an intercept) in the model, and SSE is the sum of squared errors as given in Equation (2). Table 1 presents a summary of the criteria for model selection used in this study. The best model among the candidate models is the one that yields the smallest value of MSE, RMSE, AIC, BIC, AICc, or the new criterion given in Equation (3), or the largest value of R² or adjusted R². Table 1. Some criteria for model selection.

No.  Criterion    Description
1    MSE          Measures the deviation between the fitted values and the actual data observations.
2    RMSE         The square root of the MSE.
3    R²           Measures the amount of variation accounted for by the fitted model.
4    Adjusted R²  Takes into account a small penalty for adding more variables to the model.
5    AIC          Rewards goodness of fit but also increases the penalty when more parameters are added.
6    BIC          Depends on the sample size n, which determines how strongly BIC penalizes the number of parameters in the model.
7    AICc         Takes the sample size into account by increasing the relative penalty for model complexity on small data sets.
8    PIC          The new criterion; takes into account a larger penalty when too many coefficients are added to the model from too small a sample.
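As a companion to Table 1, the following Python sketch computes each criterion from the SSE, the total sum of squares SST, the sample size n, and the number of estimated parameters k. The PIC expression SSE + k(n - 1)/(n - k) follows Equation (3); the least-squares forms of AIC and BIC, and MSE defined as SSE/(n - k), are common conventions assumed here for illustration, since the paper's exact formulas are not reproduced in this excerpt.

```python
import math

def criteria(sse: float, sst: float, n: int, k: int) -> dict:
    """Model-selection criteria for a least-squares fit.

    sse: sum of squared errors; sst: total sum of squares;
    n: sample size; k: number of estimated parameters.
    """
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    mse = sse / (n - k)                      # residual mean square (assumed form)
    aic = n * math.log(sse / n) + 2 * k      # least-squares form of AIC
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": r2,
        "Adj R2": adj_r2,
        "AIC": aic,
        "BIC": n * math.log(sse / n) + k * math.log(n),
        "AICc": aic + 2 * k * (k + 1) / (n - k - 1),
        "PIC": sse + k * (n - 1) / (n - k),  # Equation (3)
    }
```

A model comparison then reduces to calling `criteria` once per candidate model and picking the smallest MSE, RMSE, AIC, BIC, AICc, or PIC, or the largest R² or adjusted R².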

Numerical Examples
In this section, we illustrate the proposed criterion based on data simulated from a multiple linear regression with three independent variables X_1, X_2, and X_3, for a set of 100 observations (Case 1) and 20 observations (Case 2). Case 1: 100 observations based on simulated data. Table 2 presents the first 10 observations from the simulated data set of 100 observations generated from a multiple linear regression function. From Table 3 we can observe that, based on the new proposed criterion, the multiple regression model including all three independent variables provides the best fit. This result is also consistent with all of the other criteria, such as MSE, AIC, AICc, BIC, RMSE, R², and adjusted R². Table 3. Criteria values of independent variables based on 100 simulated data observations consisting of three independent variables X_1, X_2, and X_3. Case 2: 20 observations. Table 4 presents a simulated data set consisting of 20 observations based on a multiple linear regression function. From Table 5 we can observe that, based on the new criterion, the multiple regression model including all three independent variables provides the best fit among all seven models (see Table 5). This result is also consistent with all of the other criteria, such as MSE, AIC, AICc, BIC, RMSE, R², and adjusted R². Table 5. Criteria values of independent variables based on 20 simulated data observations consisting of three independent variables X_1, X_2, and X_3.
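The Case 1 experiment can be reproduced in outline as follows. The snippet below is an illustrative sketch only; the coefficients, noise level, and random seed are assumptions, not the paper's actual simulation settings. It generates 100 observations from a three-variable linear model and ranks all seven candidate subsets by PIC as given in Equation (3):

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.standard_normal((n, 3))  # columns play the roles of X1, X2, X3
y = 5 + 2 * X[:, 0] + 3 * X[:, 1] + 1.5 * X[:, 2] + rng.standard_normal(n)

def pic(subset):
    """Fit OLS on the chosen columns (plus intercept) and return PIC."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    sse = float(np.sum((y - Z @ beta) ** 2))
    k = Z.shape[1]                       # estimated parameters incl. intercept
    return sse + k * (n - 1) / (n - k)   # Equation (3)

# All 7 non-empty subsets of the three candidate variables.
subsets = [s for r in (1, 2, 3) for s in itertools.combinations(range(3), r)]
best = min(subsets, key=pic)
print("best subset:", best)
```

Because every true coefficient is large relative to the noise, dropping any variable inflates the SSE far more than the roughly one-unit increase in the penalty term, so PIC selects the full three-variable model, mirroring the Table 3 result.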

Applications
In this section we demonstrate the proposed criterion with several real applications, including advertising products, heart blood pressure health, and software reliability analysis. Based on our preliminary study of the collected data, the multiple linear regression model assumption is appropriate for Applications 1 and 2 to illustrate model selection.
In this study, we use the advertising budget data set [15] to illustrate the proposed criterion, where the sales of a particular product form the dependent variable of a multiple regression and the three different media channels, TV, Radio, and Newspaper, are the independent variables. The advertising data set consists of the sales of a product in 200 different markets (200 rows), together with the advertising budgets for the product in each of those markets for the three media channels: TV, radio, and newspaper. The sales are in thousands of units and the budgets are in thousands of dollars. Table 6 shows the first few rows of the advertising budget data set. We now discuss the results of the linear regression model using this advertising data. Figures 1 and 2 present the data plot and the correlation coefficients between the pairs of variables of the advertising budget data, respectively. The pair of Sales and TV variables has the highest correlation, which implies that TV advertising has a direct positive effect on sales. The results also show a statistically significant positive effect of both TV and Radio advertising on sales. From Table 7, TV is the most significant of the three advertising channels and has the strongest impact on sales. The R² is 0.8972, so 89.72% of the variability is explained by all three media channels. From Table 8, the values of R² with all three variables and with just two variables (TV and Radio) in the model are the same. This implies that we can select the model with two variables (TV and Radio) in the regression. We can now examine the adjusted R² measure. For the regression model with the TV and Radio variables, the adjusted R² is 0.8962, while after adding the third variable (Newspaper) to the model, the adjusted R² of the full model is reduced to 0.8956.
Based on the new proposed criterion, the model with the two advertising media channels (TV and Radio) is the best model from a set of seven candidate models as shown in Table 8. This result is consistent with all criteria such as MSE, AIC, AICc, BIC, RMSE, and adjusted R 2 .
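The trade-off PIC makes here can be seen with a quick back-of-the-envelope computation. In the sketch below the SSE values are hypothetical stand-ins (the paper does not report them in this excerpt): with n = 200, moving from the TV + Radio model (k = 3) to the full model (k = 4) raises the penalty term by about 1.03, so unless adding Newspaper reduces the SSE by more than that, PIC prefers the smaller model.

```python
def pic(sse: float, n: int, k: int) -> float:
    """PIC = SSE + k * (n - 1) / (n - k), as in Equation (3)."""
    return sse + k * (n - 1) / (n - k)

n = 200
sse_two = 557.0   # hypothetical SSE for the TV + Radio model (k = 3)
sse_full = 556.8  # hypothetical SSE for the full model (k = 4)

# Adding Newspaper barely lowers the SSE, so PIC picks the two-variable model.
print(pic(sse_two, n, 3) < pic(sse_full, n, 4))
```

This mirrors the adjusted R² comparison in the text (0.8962 versus 0.8956): the small improvement in fit from Newspaper does not justify the extra estimated parameter.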
Table 8. Criteria values of independent variables (TV, Radio, Newspaper) of regression models (X_1, X_2, and X_3 denote TV, Radio, and Newspaper, respectively).

Blood pressure (BP) is one of the main risk factors for cardiovascular diseases. BP is the force of blood pushing against the artery walls as it moves through the body [16]. Abnormal BP is a serious condition that can cause strokes, heart attacks, and kidney failure, so it is important to check blood pressure on a regular basis. The author has monitored the blood pressure of an individual daily since January 2019 using a Microlife product. He measured his blood pressure each morning and evening within the same time interval and recorded the results of all three measures, Systolic Blood Pressure ("systolic"), Diastolic Blood Pressure ("diastolic"), and Heart Rate ("pulse"), each time, as shown in Table 9. The systolic BP is the pressure when the heart beats, while the heart muscle is contracting (squeezing) and pumping oxygen-rich blood into the blood vessels. The diastolic BP is the pressure on the blood vessels when the heart muscle relaxes; the diastolic pressure is always lower than the systolic pressure [17]. The pulse, or heart rate, measures the number of heart beats per minute (BPM). Table 9. Sample heart blood pressure health data set of an individual over an 86-day interval.
From Figure 3, the systolic BP and diastolic BP have the highest correlation. In this study, we decided not to include the Time variable (i.e., column 2 in Table 9) in the model analysis, since it may not necessarily reflect the health measurements. The analysis shows that the systolic blood pressure appears to be the most significant factor, with a strong impact on the heart rate measure. The R² is 0.09997, so 9.99% of the variability is explained by all three variables (Day, Systolic, Diastolic), as shown in Table 10. Based on the new proposed criterion, the model with only the systolic blood pressure variable is the best model from the set of seven candidate models, as shown in Table 10. This result stands alone compared to all the other criteria except BIC. In other words, the best model based on our proposed criterion retains only the systolic BP variable. Table 10. Criteria values of variables (day, systolic, diastolic) of regression models (X_1, X_2, and X_3 denote day, systolic, and diastolic, respectively).
In this example, we use the numerical results recently studied by Song et al. [12] to illustrate the new criterion by comparing it to some existing criteria based on two real data sets from software reliability engineering applications. Table 11 shows the numerical results of 19 different software reliability models based on four existing criteria (MSE, AIC, R², and adjusted R²) and the new criterion, called the Pham criterion, using dataset #1 [18]. In dataset #1, the week index ranges from 1 week to 21 weeks, and there are 38 cumulative failures at 14 weeks. Detailed information is recorded in Musa et al. [18]. Model 6, as shown in Table 11, provides the best fit based on the MSE, R², adjusted R², and new criteria. However, Model 1 appears to be the best fit based on the AIC. Table 11. Results for criteria based on dataset #1 [12].

Similarly, in this example we use the numerical results recently studied by Song et al. [12] to illustrate the new criterion based on a real dataset #2 [19]. In dataset #2, the time index is measured in cumulative system days, with the failures recorded over 58,633 system days. The detailed information is recorded in [19]. Table 12 presents the numerical results of 19 different software reliability models based on four existing criteria (MSE, AIC, R², and adjusted R²) and the new proposed criterion. Based on dataset #2, Model 7 (see Table 12) provides the best fit based on the AIC and new criteria, whereas Model 17 appears to be the best fit based on the MSE, R², and adjusted R².

Conclusions
In this paper we proposed a new criterion, PIC, that can be used to select the best model from a set of candidate models. The proposed criterion imposes a larger penalty when too many coefficients (or estimated parameters) are added to the model from too small a sample in the presence of too much noise, where n is the sample size and k is the number of estimated parameters.
The paper illustrates the proposed criterion with several applications based on the advertising budget data, the newly collected heart blood pressure health data set, and software failure data. Given the number of estimated parameters k and the sample size n, it is straightforward to obtain the new criterion value. Based on both the simulated data and the real-world applications discussed in Section 3, PIC performs accurately in selecting the best model among a set of candidates.