Improving Innovation from Science Using Kernel Tree Methods as a Precursor to Designed Experimentation

Abstract: A key challenge in applied science when planning a designed experiment is to determine the aliasing structure of the interaction effects and to select the appropriate levels for the factors. In this study, kernel tree methods are used as precursors to identify significant interactions and levels of the factors useful for developing a designed experiment. This approach integrates data science with the applied sciences to reduce the time from innovation in research and development to the advancement of new products, a very important consideration in today's world of rapid advancement in industries such as pharmaceuticals, medicine, and aerospace. Significant interaction effects for six common independent variables were identified from industrial databases using boosted trees and random forests with k = 1000 and k = 10,000 bootstraps. The four common variables were related to speed, pressing time, pressing temperature, and fiber refining. These common variables maximized the tensile strength of medium density fiberboard (MDF) and the ultimate static load of oriented strand board (OSB), both widely used industrial products. Given the results of the kernel tree methods, four possible designs with interaction effects were developed: full factorial, fractional factorial Resolution IV, Box–Behnken, and Central Composite Designs (CCD).


Introduction
Data science is evolving rapidly, and the pressing need to move from innovation to adoption in industries such as pharmaceuticals, aerospace, and food has never been greater.
A key challenge is to shorten the time span from innovation to adoption while maintaining scientific inference. Many applied scientists rely on formal experimentation during innovation development.
Scientists face budgetary and time constraints that limit cyclical experimentation. The study outlined in this paper presents a methodology that uses data science kernel tree methods as a precursor to designed experimentation, reducing the time from innovation in the applied sciences to adoption in production. This combination of induction and deduction is aligned with data science in combining contemporary methodologies with more classical methods to enhance scientific inference.
Many designed experiments in research and development (R&D) contain two or more factors, each with two or more levels. For example, a 2^k design (low [−] and high [+] levels) with k = 3 and a replicate equates to n = 16 experimental runs. If three levels are desired, a 3^k design (low [−], medium [0], and high [+] levels) with k = 3 and a replicate requires n = 54 experimental runs. In the applied
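As a quick arithmetic check on the run counts above, the relationship between levels, factors, and replicates can be sketched in a few lines (an illustrative helper, not from the paper):

```python
from itertools import product

def n_runs(levels, k, replicates=2):
    """Runs for a replicated full factorial: (levels ** k) * replicates."""
    return (levels ** k) * replicates

def design_matrix(level_codes, k):
    """All level combinations for k factors, e.g. codes (-1, +1) for a 2^k design."""
    return list(product(level_codes, repeat=k))

# A 2^3 design with a replicate gives 16 runs; a 3^3 design with a replicate gives 54.
```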

Dataset Descriptions
Datasets for the study were obtained from medium density fiberboard (MDF) and oriented strand board (OSB) manufacturers as part of a research confidentiality agreement, and variable names in the datasets were altered to respect its terms. MDF is a nonstructural biomaterial used as a substrate for furniture, kitchen cabinets, desks, tabletops, etc. OSB is a structural biomaterial used in residential and non-residential construction of dwellings. Tensile strength is the primary strength metric (dependent variable) for MDF. Ultimate static load is an important strength metric for OSB and is the certification metric for use in the marketplace. Data from the destructive testing labs were obtained from two different manufacturers in the United States of America (USA). The MDF dataset had 408 records and 184 regressors representing destructive tests over a three-month period for a nominal product 15.88 mm in thickness. MDF destructive tests are typically taken from the production line at one-hour intervals or when the product type is changed. The tensile strength dependent variable for MDF had a normal pdf based on the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) (Table 1) [13,14].
The OSB dataset had 150 records and 98 regressors representing destructive tests over a six-month period for a nominal product 11.11 mm in thickness. OSB destructive tests are taken from the production line at four-hour intervals or at product change. The ultimate static load dependent variable for OSB had a normal pdf based on the AIC and BIC criteria (Table 1). The regressors for both MDF and OSB were taken from the process data warehouses and fused with the dependent variables from the destructive testing labs. Eighty percent of each dataset was used for training and 20% for validation; the training datasets contained 184 regressors for MDF and 98 for OSB. Ten-fold cross validation was used for both the MDF and OSB models.
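The 80/20 split and ten-fold cross validation described above can be sketched as follows (an illustrative outline of the scheme, not the authors' code; the record count matches the MDF dataset):

```python
import random

def split_and_folds(n_records, train_frac=0.8, k=10, seed=42):
    """Shuffle record indices, hold out (1 - train_frac) for validation,
    and partition the training portion into k cross-validation folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * n_records)
    train, valid = idx[:cut], idx[cut:]
    folds = [train[i::k] for i in range(k)]  # disjoint folds covering train
    return train, valid, folds
```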

Description of Predictor Variables
Process data come from sensors on the production line and are related to line speed, pressing speed, pressing pressure, pressing temperature, fiber moisture, etc. Both the MDF and OSB processes exhibit variation in all of these variables during normal manufacturing. Line speed is typically product specific, i.e., faster speeds for thinner, lower density products and slower speeds for thicker, higher density products. For example, the OSB predictor variable 'wet bin speed' is directly related to line speed, given that the rate at which the bin holding the flakes is emptied is a function of line speed. Line speed also changes because of moisture content changes in the fibers during the pressing stage. Pressing occurs under pressure and high temperature to cure the bond between fibers and adhesives. Fiber moisture may change during manufacturing due to natural variation in the feedstocks and variations in temperature during the drying process. The predictor variables reported in this manuscript are given descriptive names related to these processes, e.g., 'flake moisture content' is the moisture content of the OSB flakes, 'total press time' is the time the material resides in the pressing stage, 'mat weight' is the actual weight of the formed fiber mats before the pressing stage, etc. The MDF process differs from the OSB process primarily in the early stages, where wood is refined to small fibers with lengths of 1.34–1.84 mm. The predictor variable 'face plate position' refers to the gap between grinding plates during the wood-to-fiber refining stage, and 'face steam flow' refers to how much steam is injected into the refiners that create the fibers. Descriptions of the MDF and OSB processes are documented in [15,16].

Kernel Tree Methods
Decision trees as applied to continuous data are known as regression trees (RT) [17][18][19]. Given that documentation on the methodologies of kernel tree methods is extensive, only a summary is presented. As Hand et al. [20] noted, "linear regression is a global model, where there is a single predictive formula holding over the entire data-space. An alternative approach is to sub-divide, or partition, the space into smaller regions, where the interactions are more manageable." The regression tree approach to modeling identifies a hierarchy of interactions that was previously unknown. It creates recursive partitions, or cells, within the entire data space (i.e., terminal nodes or leaves), and the cells are modeled separately. The cells of regression trees are typically 'pruned' during the model validation phase to identify the best predictive model. One strength of this method is that data do not need to be imputed, given that partitions will be made in the presence of missing data (Georges, 2009). RTs are represented as two-dimensional graphics, which makes them easy to understand and interpret [6,7].
RTs are quite popular as an exploratory modeling technique and are commonly associated with data mining techniques [21][22][23][24][25][26][27][28]. RTs are very resistant to irrelevant regressors: given that the recursive tree-building algorithm estimates the optimal variable on which to split at each step, regressors unrelated to the response are not chosen for splitting [29] (pp. 199-215). In theory, a regression tree partitions the data space of all joint regressor values X into J disjoint regions {R_j}, j = 1, ..., J [6]. For a given set of joint regressor values X, the tree prediction Ŷ = T(X) assigns as the response estimate the value ŷ_j associated with the region containing X:

Ŷ = T(X) = Σ_{j=1}^{J} ŷ_j I(X ∈ R_j)

Given a set of regions, the optimal response value associated with each region minimizes the prediction error in that region; under squared-error loss this is the mean of the observed responses in the region:

ŷ_j = ave(Y_i | X_i ∈ R_j)

Unlike some kernel methods, RTs use the data to estimate a good partition instead of relying on a model predefined by the analyst. 'Boosted trees' (BT) rely on the philosophy that a small number of simple trees of weak learners, combined as one model, outperform the predictions of one large RT [6,30,31]. 'Boosting' builds trees sequentially such that each new tree improves the predictive power of the ensemble [32][33][34]: new trees are grown specifically to accommodate observations that the existing ensemble predicts poorly, improving the predictive performance of the final BT model. BT approximates a solution to the problem of fitting a sum of trees by adding new trees one at a time while keeping all existing trees unchanged [35]. As stated by Schapire [31] and Elith et al. [35], boosting enhances model accuracy, and its key step is to consecutively apply the algorithm to constantly modified data, i.e., it minimizes the loss function by adding a regression tree at each iteration step [35].
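A minimal sketch of these two ideas (illustrative only, not the authors' implementation): a one-variable regression stump that predicts the region mean ŷ_j on each side of a split, and a boosting loop that sequentially fits stumps to the residuals of the ensemble with a small learning rate:

```python
def fit_stump(xs, ys):
    """One-split regression tree: choose the split s minimizing within-region
    SSE; each region predicts its mean (yhat_j = mean of y in R_j)."""
    best = None
    for s in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def boost(xs, ys, rounds=20, r=0.1):
    """Gradient boosting with squared loss: each round fits a stump to the
    current residuals and adds it to the ensemble scaled by learning rate r."""
    base = sum(ys) / len(ys)
    pred = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t = fit_stump(xs, resid)
        stumps.append(t)
        pred = [p + r * t(x) for p, x in zip(pred, xs)]
    return lambda x: base + sum(r * t(x) for t in stumps)
```

A single stump separates a clean step function exactly, while the boosted ensemble approaches it gradually, one weak learner at a time.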
'Random forests' (RF), developed by Breiman [36] and as summarized by Fawagreh et al. [37], "combines Breiman's bagging sampling approach and the random selection of features, introduced independently [38,39], in order to construct a collection of decision trees with controlled variation. Each tree in the ensemble acts as a base classifier to determine the class label of an unlabeled instance" [37]. Key advantages of RF over BT are robustness to noise and less overfitting [35,[38][39][40][41][42][43].
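The bagging idea behind random forests can be illustrated with a toy sketch (an assumed one-split base learner on a single regressor; not the authors' model, which also randomizes feature selection):

```python
import random
import statistics

def bag_predict(xs, ys, x_new, k=200, seed=1):
    """Bagging: fit a base learner on each bootstrap resample and average
    the k predictions. The base learner here splits at the resample's
    median x and predicts the mean response on x_new's side of the split."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap resample
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        split = statistics.median(bx)
        side = [y for x, y in zip(bx, by) if (x < split) == (x_new < split)]
        preds.append(statistics.fmean(side if side else by))
    return statistics.fmean(preds)
```

Averaging over bootstrap resamples is what gives the ensemble its controlled variation and resistance to overfitting.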
There were five key parameters used for model calibration in this study. A minimum split size of 10 was used for both the boosted tree and random forest to prevent the program from splitting any node with fewer than the specified number of cases [44]. 'Early stopping' halted the additive boosting process if further boosting failed to improve the fit in the validation dataset. A minimum split size of five was used with a learning rate (r) of 0.1, where 0 < r ≤ 1, which cued the program to build separate boosted trees for every combination of splits. This permitted the boosted tree to try various combinations of parameters in order to find the one that maximizes the fit. The tree was grown with 50 layers, and splits per tree ranged from five to 10. An overfit penalty guarded against any cases having predicted probabilities equal to zero; higher values result in less overfitting. The probability for the ith response level at a node is

prob_i = (n_i + prior_i) / Σ_i (n_i + prior_i)

where the summation is across all response levels and n_i is the number of observations at the node for the ith response level. prior_i is the prior probability for the ith response level and is calculated as

prior_i = λ p_i + (1 − λ) P_i

where p_i is the prior_i from the parent node, P_i is the prob_i from the parent node, and λ is a weighting factor, which was set at 0.9.
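The smoothed node probability just described can be written as a small function (a sketch; the symbols follow the text's definitions, and the normalization over response levels is assumed from its description of the summation):

```python
def node_prob(counts, parent_priors, parent_probs, lam=0.9):
    """prob_i = (n_i + prior_i) / sum over levels of (n_i + prior_i),
    where prior_i = lam * p_i + (1 - lam) * P_i blends the parent node's
    prior (p_i) and probability (P_i) with weighting factor lam."""
    priors = [lam * p + (1 - lam) * P
              for p, P in zip(parent_priors, parent_probs)]
    total = sum(n + pr for n, pr in zip(counts, priors))
    return [(n + pr) / total for n, pr in zip(counts, priors)]
```

Because prior_i > 0 whenever the parent probabilities are positive, no response level at a node is ever assigned a predicted probability of exactly zero, which is the point of the overfit penalty.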

Fractional Factorials and Aliasing
Statistically designed experimentation (i.e., design of experiments or DOE) is a formal 'deductive' methodology, where independent variables (factors) are manipulated at different settings (levels) in a controlled fashion to explore optimization problems for key response variables. R.A. Fisher [45] is the father of DOE and expanded on the classic analysis of variance (ANOVA) methodology. A 'full factorial' DOE has experimental runs with replicates at all possible levels for all of the factors. Even though full factorial designs are the most informative designs, such designs are expensive.
George E.P. Box was influenced immensely by R.A. Fisher's work and studied under Egon Pearson at the University of London [46]. He developed a series of designs known as 'response surface methods' (RSM), which allow researchers to use a fraction of the total experimental runs (fractional factorials). Box's popular 'central composite' and 'Box-Behnken' RSM designs [1] minimize the number of experimental runs while sustaining inference. A key consideration for RSM and any type of fractional factorial design is the aliasing structure of the design. Resolution III designs, where main effects are aliased with two-factor interactions and two-factor interactions are aliased with each other, are typically avoided given that two-factor interactions are often significant. For example, a 1/2 fraction 2^(k−1) design with k = 4 and a replicate has n = 16 experimental runs and is a Resolution III design with aliases A = BC, B = AC, C = AB, AB = CD, BC = AD, and AC = BD. Generally, analysts select the highest possible design resolution (e.g., resolution ≥ IV) when choosing a design, to avoid confounding main effects. If the analyst conducts a fractional factorial design and does not have detailed knowledge of the phenomenon under investigation, an aliasing structure is assumed. By developing RT models as a possible first step for the factors under investigation, unknown interactions and split points may be discovered. This may accelerate innovation by choosing statistically significant interactions for a DOE, leading to lower costs of experimentation.
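For a single-generator fractional factorial, the alias partner of any effect can be computed by multiplying it into the defining relation and cancelling squared letters; a small sketch (the defining word I = ABCD below is an assumed Resolution IV example, not one of the paper's designs):

```python
def alias(effect, defining_word):
    """Alias of an effect under defining relation I = defining_word:
    multiply the two words and drop any letter appearing twice (x^2 = I),
    i.e., take the symmetric difference of their letter sets."""
    return "".join(sorted(set(effect) ^ set(defining_word)))
```

Under I = ABCD, the two-factor interactions pair up (AB = CD, AC = BD, AD = BC) while each main effect aliases only a three-factor interaction, which is why resolution ≥ IV designs keep main effects clear of two-factor interactions.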

Boosted Tree Models and Bootstrap Forest
The boosted tree (BT) and bootstrap forest (BF), with k = 1000 and k = 10,000 bootstraps [47] for each method, indicated a set of significant variables (α = 0.05). The recursive partitions using smaller additive trees, which are unique to the BT and BF methods, identified a set of predictor variables common to both modeling types. The four common predictor variables in these bootstrapped additive trees for MDF were 'face plate position', 'press position time', 'face steam flow', and 'adhesive percent' (Table 2). The R^2 in training and validation for the BT was 0.758 and 0.552, respectively. The R^2 in training and validation for the BF was 0.628 and 0.484, respectively. Even though the R^2 in validation is not high, these common variables explain a reasonable proportion of the variation influencing tensile strength. The BT and BF with k = 1000 and k = 10,000 bootstraps for the OSB data revealed a set of five common predictor variables: 'wet bin speed', 'flake moisture content', 'dryer inlet temperature', 'dryer outlet temperature', and 'wood weight' (Table 3). The R^2 in training and validation for the BT was 0.675 and 0.410, respectively. The R^2 in training and validation for the BF was 0.534 and 0.304, respectively. The means and variances of the dependent variable ultimate static load (kg) in the training and validation datasets were similar across the 10-fold cross validation, with x̄ = 213.7, s = 28.1, median = 210.6, and CV = 5.9% in the training dataset, and x̄ = 204.8, s = 27.7, median = 204.0, and CV = 6.1% in the validation dataset. Again, even though the R^2 in validation is not high, these five variables explain a reasonable proportion of the variation influencing ultimate static load. To find reasonable split points and levels for factors to be tested in a designed experiment, regression tree models were developed using the five common independent variables previously mentioned.

RT Models as a Precursor for Response Surface Methods
The following examples of designed experiments are presented to demonstrate that kernel tree results can be helpful in designing a DOE without time-consuming prescreening designs. Assuming that experimental runs are costly, several response surface designs are presented that minimize the number of runs while maintaining α = 0.05 and maximizing the power of the experiment.
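Power statements like those below can be sanity-checked with a normal-approximation calculation for detecting a mean difference between two levels of a factor (an illustrative sketch; the paper's quoted powers come from the full designed-experiment calculation, so values here will differ):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for detecting a
    mean difference delta with common standard deviation sd."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = delta / (sd * sqrt(2.0 / n_per_group))  # noncentrality parameter
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)
```

With the MDF numbers used below (sd = 95 kPa, difference = 950 kPa), the effect is ten standard deviations, so this two-sample approximation gives power near 1 at any practical run count.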
Designs for MDF from the RT model. From the RT model for MDF, a fractional factorial Resolution IV design with 2^(k−1), k = 4, and a replicate would require n = 16 runs. This design has a power of 82% for the main effects (α = 0.05), given a standard deviation of 95 kPa, with the ability to detect a difference in mean tensile strength of 950 kPa (Table 4). From the RT model results (Figure 1), split points suggest candidate factor levels. Box and Behnken [1] proposed the Box–Behnken design for cases where points at the corners are costly [48]. A Box–Behnken design with n = 27 runs is presented in Figure 3, with the run order given in Table 5. If the analyst can afford three more experimental runs and corner points may be risky for the process, a spherical circumscribed central composite design (CCD) RSM, k = 4, with four axial points is an option (Figure 3). A full factorial design would examine only the corner points and requires n = 32 runs, which includes a replicate.
Designs for OSB from the RT model. Given that the OSB RT model resulted in three significant factors, A = 'dryer inlet temperature', B = 'mat weight', and C = 'wet bin speed', at α = 0.05, a full factorial design is feasible, e.g., 2^k, k = 3, n = 16, with a power of 84% for significant main effects (assuming α = 0.05 for the experiment) (see Table 6 and Figure 2). Even though this design has no aliasing assumptions, it explores only corner points. A Box–Behnken RSM design would require two fewer runs, n = 15, with three center points (Figure 3). An alternative design would be a spherical circumscribed CCD RSM design with n = 17, k = 3, with four axial points (Figure 3), where the run order is presented in Table 7. Even though there are many other feasible designs, the aforementioned designs were developed to minimize the number of experimental runs while using the helpful results of the RT models.
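The run counts quoted above can be reproduced by enumerating the design points (a sketch using coded levels −1/0/+1; the constructions are the standard textbook ones, not taken from the paper's run-order tables):

```python
from itertools import product

def full_factorial(k, replicates=2):
    """2^k corner points at coded levels -1/+1, each run `replicates` times."""
    return [pt for pt in product((-1, 1), repeat=k) for _ in range(replicates)]

def box_behnken(k, center=3):
    """Edge midpoints (+/-1 on each pair of factors, 0 elsewhere) plus
    `center` center-point runs."""
    runs = []
    for i in range(k):
        for j in range(i + 1, k):
            for a, b in product((-1, 1), repeat=2):
                pt = [0] * k
                pt[i], pt[j] = a, b
                runs.append(tuple(pt))
    return runs + [(0,) * k] * center
```

This matches the counts in the text: a replicated 2^3 full factorial gives n = 16, the k = 3 Box–Behnken gives n = 15 with three center points, and the k = 4 Box–Behnken gives n = 27.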

Conclusions
Identifying interaction effects is important in innovation and product development. Kernel tree methods provide an accepted data science approach for enhancing the applied sciences. Such methods quickly identify and quantify undiscovered interactions among regressors. In this research, boosted tree and random forest models were constructed from two different manufacturing systems. Common significant variables in the models that affected the strength of materials were related to speed of the process, fiber refining, pressing time, and pressing pressures. Experimental designs were proposed with a resolution > III, while incorporating the aliasing structure identified from the RT models. Possible designs with four factors were: fractional factorial Resolution IV; Box-Behnken RSM; and CCD spherical circumscribed designs. These designs were selected to minimize the number of experimental runs while sustaining inference. The hierarchy of interaction effects and split points of regressors in kernel tree models may provide applied scientists with an important foundation for planning a designed experiment while minimizing costs during innovation development. If feasible, future studies will explore support vector machines, Bayesian additive regression trees (BART), etc., for comparative analyses.

Author Contributions: All authors contributed to the development of this paper. T.M.Y. did the primary statistical analysis and development of original draft manuscript. R.A.B. and T.L. were instrumental in procuring the industrial datasets and validating the results of the models. A.P.'s contribution was invaluable in advising on the methodology, results, and the review of the overall manuscript. All authors have read and agreed to the published version of the manuscript.