3.2. Small Sample and Multiple Correlation Problems
This paper aims to establish a mathematical model that describes the relationship between the fiber physicochemical parameters and the mechanical and electrical properties of presspaper, and thereby to predict presspaper performance. This requires multivariate data analysis methods. Possible methods include neural network analysis, multiple linear regression analysis, partial least-squares regression analysis, and structural equation modeling. In a neural network, the relationship between the input layer and the output layer is difficult to express simply, and the model itself is quite complex. Structural equation modeling is mainly used to explore the relationships between latent variables, whereas this paper studies observed variables: the physicochemical parameters of the fibers and the mechanical and electrical properties of presspaper. In the physical sense, there is no necessary correlation between the dependent variables (mechanical and breakdown characteristics). For example, during the aging of oil-paper insulation, the mechanical strength decreases significantly, but the breakdown strength changes little. Consequently, this paper establishes a separate regression model for each dependent variable, that is, a model with a single dependent variable.
Before conducting a multivariate data analysis, it is important to pay attention to the relationship between independent variables and the number of test samples. These two requirements are more stringent for multiple linear regression models, but relatively relaxed for partial least-squares regression models.
The number of sample points required to establish a multiple regression model must usually exceed the number of independent variables; a sample size of 2 to 5 times the number of independent variables is recommended.
Table 1 gives 13 independent variables, while the number of sample points used to build the model is only 16. Thus, the problem of few sample points needs to be solved first.
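The arithmetic behind this statement can be checked with a minimal sketch. The function names are hypothetical and are introduced here only for illustration; the numbers 13 and 16 are taken from the text, and the threshold of 2 is the lower end of the recommended range.

```python
def sample_to_variable_ratio(n_samples: int, n_variables: int) -> float:
    """Return the number of sample points per independent variable."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return n_samples / n_variables


def meets_rule_of_thumb(n_samples: int, n_variables: int, min_ratio: float = 2.0) -> bool:
    """Check the lower bound of the 2-to-5-times recommendation."""
    return sample_to_variable_ratio(n_samples, n_variables) >= min_ratio


# Values from the text: 16 sample points, 13 candidate independent variables.
print(round(sample_to_variable_ratio(16, 13), 2))  # ratio is well below 2
print(meets_rule_of_thumb(16, 13))                 # all 13 variables: not enough samples
print(meets_rule_of_thumb(16, 5))                  # 5 variables: ratio 3.2
```

With all 13 variables the ratio is about 1.2, whereas any model with 5 or fewer variables has at least 3.2 sample points per variable.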
For this purpose, we can either increase the number of test samples or reduce the number of variables. Because increasing the number of test samples is limited by many practical conditions, this paper focuses on reducing the number of variables. One method draws on canonical correlation analysis: it filters and extracts the information in each independent variable, recombines it into fewer variables, and establishes the relationship between these new variables and the mechanical and electrical characteristics of presspaper. This method is partial least-squares regression analysis. Another method is to select the independent variables with strong explanatory power and use them to establish a multiple linear regression model. If the number of valid independent variables is 2 to 5, the problem of an insufficient number of samples can be solved.
Figure 1a shows that DP, ash, and total charge are basically independent of each other, but this does not hold for all of the selected physicochemical parameters. Pearson correlation coefficients were used to quantify the degree of correlation between the variables, as shown in Figure 2. The correlation between two variables is generally regarded as very strong if the absolute value of the Pearson correlation coefficient exceeds 0.8; all such values are marked with “*” in Figure 2. Some variables are strongly correlated, especially holocellulose and lignin as well as holocellulose and DP. To show the correlations visually, Figure 2 also includes the scatter plots and confidence ellipses for each pair of variables. The orientation of a confidence ellipse reflects the sign of the Pearson correlation coefficient, and the ratio of its major axis to its minor axis reflects the absolute value.
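The screening described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Figure 2: the function names and the small data set are hypothetical, and the axis-ratio formula assumes that both variables are standardized, in which case the covariance matrix has eigenvalues 1 + r and 1 - r.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| > threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


def ellipse_axis_ratio(r):
    """Major/minor axis ratio of the confidence ellipse of two standardized
    variables with correlation r (requires |r| < 1)."""
    return math.sqrt((1 + abs(r)) / (1 - abs(r)))


# Hypothetical illustrative values, NOT measurements from the paper.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],  # nearly proportional to "a"
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to "a" and "b"
}
for name_a, name_b, r in strongly_correlated_pairs(data):
    print(f"{name_a}-{name_b}: r = {r:.3f} *")
```

Under the standardization assumption, an uncorrelated pair gives a circle (ratio 1), and the ellipse becomes more elongated as the absolute value of r approaches 1.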
Figure 2 indicates that the correlation between the mechanical strength and some of the fiber physicochemical parameters is not strong. For hemicellulose, for example, the corresponding Pearson correlation coefficient is only 0.38. The DC breakdown field strength is likewise only weakly correlated with some of the physicochemical parameters. It is therefore inappropriate to establish the regression equation using all the physicochemical parameters as independent variables.
Partial least-squares regression analysis is better suited to situations with serious collinearity between the variables, because the extracted components can then effectively represent the original variables. In this problem, the Pearson correlation coefficients between many physicochemical parameters are relatively low, so the recombined variables do not capture all of the information in the original independent variables. In practice, the constructed model predicted poorly and was itself fairly complicated, which is not conducive to practical application. Hence, multiple linear regression is used in this paper.
For multiple linear regression, a high correlation between independent variables makes the estimation of the regression coefficients difficult and unstable. For instance, because there is a strong correlation between holocellulose (x8), total lignin (x7), and DP (x12), x8 is eliminated first when conducting the multiple linear regression analysis.
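One common way to quantify this instability is the variance inflation factor (VIF). The paper does not state that VIF was used, so the following is only an illustrative sketch of a standard diagnostic. The function name and the synthetic data are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X (rows = samples, columns = variables).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing column j
    on all other columns. It equals the j-th diagonal element of the inverse
    of the correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic data for illustration only (not the paper's measurements).
rng = np.random.default_rng(0)
n = 16
u = rng.normal(size=n)
v = rng.normal(size=n)
X = np.column_stack([
    u,                              # variable 0
    -u + 0.1 * rng.normal(size=n),  # variable 1: strongly tied to variable 0
    v,                              # variable 2: generated independently
])
print(np.round(variance_inflation_factors(X), 1))
```

A VIF near 1 means that a variable is almost unrelated to the others, while a large VIF means that its coefficient variance is strongly inflated. Removing one variable from a strongly correlated group, as is done with x8 here, lowers the VIF of the variables that remain.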
To solve the small sample problem, two to five variables need to be selected from the remaining 12 physicochemical parameters. To select a subset that better explains the variation of the dependent variables, forward selection, backward elimination, stepwise regression, or best subset selection can be used. The first three methods fix a significance level and screen the independent variables according to whether each passes the partial F test at that level, so the resulting choice of variables depends on the given significance level. Therefore, this paper chooses the best subset selection method. This method enumerates all possible combinations for a given number of independent variables and reports the combination with the highest goodness of fit. Repeating this for different numbers of variables yields the best combination for each subset size. By weighing the number of variables, R2, adjusted R2, and the Mallows Cp value, the set of physicochemical parameters suitable for multiple linear regression modeling can be determined. As the expected number of selected variables is 2 to 5, subset sizes from 1 to 6 are examined when conducting best subset selection.
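The enumeration can be sketched as follows. This is a minimal illustration under stated assumptions rather than the software actually used in this paper. The function names and the synthetic data are hypothetical, NumPy is assumed to be available, and Mallows Cp is computed in its standard form with the error variance estimated from the model containing all candidate variables.

```python
from itertools import combinations

import numpy as np


def sse(X, y):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subsets(X, y, max_size):
    """For each subset size 1..max_size, return the subset with the highest R2,
    together with its R2, adjusted R2, and Mallows Cp."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    sigma2_full = sse(X, y) / (n - k - 1)  # error variance from the full model
    results = []
    for p in range(1, max_size + 1):
        # Lowest SSE is equivalent to highest R2 for a fixed subset size.
        best = min(combinations(range(k), p),
                   key=lambda cols: sse(X[:, list(cols)], y))
        sse_p = sse(X[:, list(best)], y)
        r2 = 1 - sse_p / sst
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
        cp = sse_p / sigma2_full - n + 2 * (p + 1)
        results.append({"size": p, "columns": best, "r2": r2,
                        "adj_r2": adj_r2, "cp": cp})
    return results


# Synthetic data for illustration only: 16 samples, 12 candidate variables,
# with the response depending on columns 1 and 11 by construction.
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 12))
y = 3 * X[:, 1] - 2 * X[:, 11] + 0.1 * rng.normal(size=16)

for row in best_subsets(X, y, max_size=6):
    print(row["size"], row["columns"], round(row["r2"], 3),
          round(row["adj_r2"], 3), round(row["cp"], 1))
```

Exhaustive enumeration is feasible here because 12 candidate variables and subset sizes up to 6 require only a few thousand least-squares fits.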
Table 4 shows the best subset results of the tensile strength model for presspaper. If only one physicochemical parameter is considered, fiber length (x1) is selected. If the number of physicochemical parameters is increased to two, the best subset is fiber width (x2) and CrI (x13). For each number of variables, there is a corresponding best subset.
Comparing the best subsets for different numbers of variables shows that the goodness of fit R2 increases with the number of variables, indicating that the fit of the regression model improves. However, once the number of independent variables reaches a certain value, a newly introduced variable improves the goodness of fit very little. When the number of variables increases from one to two, R2 increases by 32.4%; when it increases from three to four, R2 increases by only 2.4%. Adjusted R2 can partly eliminate the increase in goodness of fit that is due solely to an increase in the number of variables: it increases only when the new variable has a certain explanatory effect on the dependent variable. Because fiber morphology parameters such as fiber length and width affect the mechanical properties of the presspaper, the adjusted R2 value increases when they are introduced. However, once the number of variables reaches two, adding further variables does not lead to a significant increase in the adjusted R2 value.
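For reference, the standard definition of adjusted R2 makes this penalty explicit. The symbols are assumptions introduced here: n is the number of sample points (16 in this paper), p is the number of independent variables in the model, SSE_p is the residual sum of squares, and SST is the total sum of squares.

```latex
R^2 = 1 - \frac{\mathrm{SSE}_p}{\mathrm{SST}},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}

% Adding one variable raises the adjusted R^2 only if
\frac{\mathrm{SSE}_{p+1}}{n - p - 2} < \frac{\mathrm{SSE}_p}{n - p - 1}
```

With n = 16 and p = 2, for example, a third variable raises adjusted R2 only if it reduces the residual sum of squares by more than 1/13, or about 7.7%.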
Mallows Cp characterizes the bias and accuracy of the model. When the number of independent variables is too small, the resulting model may give biased estimates, resulting in an excessively large value of Mallows Cp; when the number of independent variables is excessive, overfitting may occur, resulting in a value that is too small. Only when the Mallows Cp value is close to the number of predictors plus one (for the constant term) can the model estimate the regression coefficients accurately and predict new observations. In Table 3, when there is only one variable, the Mallows Cp value is 30.7, which is obviously too large; when the number of variables increases to three or more, the Mallows Cp value is too small. When the number of variables is two, the Mallows Cp value is 2.5, which is close to the target value of 3. Considering that the goodness of fit and adjusted R2 value are relatively high, and that no further significant increase occurs with additional variables, the physicochemical parameters suitable for establishing the tensile strength regression model of presspaper are fiber width (x2) and crystallinity (x13). Similarly, for the AC breakdown strength regression model, three variables are selected: fiber length (x1), fines (x5), and total charge (x11). The values of adjusted R2 and Mallows Cp are 82.5% and 4.1, respectively. For the DC breakdown strength regression model, three variables are also selected: fiber length (x1), fines (x5), and total lignin (x7). The values of adjusted R2 and Mallows Cp are 89.2% and 4.0, respectively.
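The target value quoted above follows from the standard definition of Mallows Cp. The source does not state which form was used, so the expression below is an assumption, as are the symbols: n is the number of sample points, p is the number of independent variables in the subset model, k is the number of candidate variables, and the error variance is estimated from the model containing all k candidates.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2\,(p + 1),
\qquad
\hat{\sigma}^2 = \frac{\mathrm{SSE}_{\mathrm{full}}}{n - k - 1}

% If the subset model is unbiased, E[SSE_p] \approx (n - p - 1)\,\sigma^2, hence
C_p \approx (n - p - 1) - n + 2\,(p + 1) = p + 1
```

Under this definition the target is 3 for a two-variable model and 4 for a three-variable model, which is consistent with the reported values of 2.5, 4.1, and 4.0.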