Article

Improving Innovation from Science Using Kernel Tree Methods as a Precursor to Designed Experimentation

Timothy M. Young, Robert A. Breyer, Terry Liles and Alexander Petutschnigg
1 Center for Renewable Carbon, The University of Tennessee, 2506 Jacob Drive, Knoxville, TN 37996-4570, USA
2 Georgia-Pacific Chemicals, 2883 Miller Rd., Decatur, GA 30035, USA
3 Huber Engineered Wood, 1442 State Rd 334, Commerce, GA 30530, USA
4 Fachhochschule Salzburg GmbH, Salzburg University of Applied Sciences, Holztechnologie & Holzbau, Markt 136a, 5431 Kuchl, Austria
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(10), 3387; https://doi.org/10.3390/app10103387
Submission received: 8 April 2020 / Revised: 4 May 2020 / Accepted: 8 May 2020 / Published: 14 May 2020

Abstract

A key challenge in applied science when planning a designed experiment is determining the aliasing structure of the interaction effects and selecting appropriate levels for the factors. In this study, kernel tree methods are used as precursors to identify significant interactions and factor levels useful for developing a designed experiment. This approach integrates data science with the applied sciences to reduce the time from innovation in research and development to the advancement of new products, an important consideration in today's world of rapid advancement in industries such as pharmaceuticals, medicine, aerospace, etc. Significant interaction effects for six common independent variables were identified from industrial databases using boosted trees and random forests with k = 1000 and k = 10,000 bootstraps. Four of the common variables were related to speed, pressing time, pressing temperature, and fiber refining. These common variables maximized the tensile strength of medium density fiberboard (MDF) and the ultimate static load of oriented strand board (OSB), both widely used industrial products. Given the results of the kernel tree methods, four possible designs with interaction effects were developed: full factorial, fractional factorial Resolution IV, Box–Behnken, and Central Composite Designs (CCD).

1. Introduction

Data science is evolving rapidly, and the pressing need to move from innovation to adoption in industries such as pharmaceuticals, aerospace, and food has never been greater. A key challenge is to shorten the time span from innovation to adoption while maintaining scientific inference. Many applied scientists rely on formal experimentation during innovation development, yet budgetary and time constraints limit cyclical experimentation. The study outlined in this paper presents a methodology that uses data science kernel tree methods as a precursor to designed experimentation in order to reduce the time from innovation in the applied sciences to adoption in product production. This combination of induction and deduction is aligned with data science by combining contemporary methodologies with more classical methods to enhance scientific inference.
Many designed experiments in research and development (R&D) contain two or more factors, each with two or more levels; e.g., a 2^k design (low [−] and high [+] levels) with k = 3 and one replication equates to n = 16 runs of experimentation. If three levels are desired, a 3^k design (low [−], medium [0], and high [+] levels) with k = 3 and one replication requires n = 54 experimental runs. In the applied sciences, experimental runs are expensive, and the total number of runs is typically restricted by budget constraints. Survey or screening designs with fractional factorials help reduce the number of experimental runs and minimize costs. However, conducting multiple screening designs requires time, labor, and cost; e.g., a 2^3 design with replication requires at least 16 runs, and a one-half fraction of this design requires eight runs with replication but is not recommended, given that main effects are aliased with two-level interactions, which are typically significant, as noted by [1].
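As a quick check on these run counts, the sketch below enumerates full factorial run lists in Python; the coded levels and replicate count are the only inputs, and the factors are generic placeholders rather than the variables studied here.

```python
from itertools import product

def full_factorial(levels_per_factor, replicates=1):
    """Enumerate every combination of factor levels, repeated for each replicate."""
    return list(product(*levels_per_factor)) * replicates

# 2^k design, k = 3, with one replication: 2**3 * 2 = 16 runs
two_level = full_factorial([[-1, 1]] * 3, replicates=2)
print(len(two_level))    # 16

# 3^k design, k = 3, with one replication: 3**3 * 2 = 54 runs
three_level = full_factorial([[-1, 0, 1]] * 3, replicates=2)
print(len(three_level))  # 54
```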
The authors propose the use of kernel tree methods (e.g., regression trees, random forests, boosted trees, etc.) to minimize the time from innovation to product adoption while sustaining scientific inference. Kernel tree methods identify unknown interactions as a hierarchy in the data [2,3,4,5]. Models from these techniques have high explanatory value [6,7,8,9,10,11]. The interactions from such models may also reveal unknown aliasing structures and help avoid aliasing highly significant factors during experimentation. Combining bootstrapping [12] with kernel tree methods identifies the most common set of recurring variables that influence a response variable. Another challenge in designed experimentation is the selection of the 'levels' of the factors. If 'levels' are too narrow, the knowledge gained from the experiment may be restricted. If 'levels' are too wide, regions of the data space may not be feasible, which may invalidate experimental runs and produce unbalanced designs. The 'split-points' of the factors in tree-based models may guide researchers to feasible but as-yet unexplored 'levels' within the data space. The objective of this study was to use kernel tree methods to quantify significant interactions as a hierarchy before conducting a formal experiment.

2. Materials and Methods

2.1. Dataset Descriptions

Datasets for the study were obtained from medium density fiberboard (MDF) and oriented strand board (OSB) manufacturers as part of a research confidentiality agreement. Variable names in the datasets were altered to respect the terms of the agreement. MDF is a nonstructural biomaterial used as a substrate for furniture, kitchen cabinets, desks, tabletops, etc. OSB is a structural biomaterial used in residential and non-residential construction of dwellings. Tensile strength is the primary strength metric (dependent variable) for MDF. Ultimate static load is an important strength metric for OSB and is the certification metric for use in the marketplace. Data from the destructive testing labs were obtained from two different manufacturers in the United States of America (USA). The MDF dataset had 408 records and 184 regressors, representing destructive tests over a three-month period for a nominal product of 15.88 mm in thickness. MDF destructive tests are typically taken from the production line at one-hour intervals or when the product type is changed. The tensile strength dependent variable for MDF followed a normal probability density function (pdf) based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (Table 1) [13,14].
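The distribution screening summarized in Table 1 can be sketched as follows. This is an illustrative Python example using scipy maximum-likelihood fits on simulated stand-in data (the actual MDF tensile-strength values are confidential), and only a few of the candidate pdfs from Table 1 are shown.

```python
import numpy as np
from scipy import stats

def aic_bic(dist, data):
    """Fit a candidate distribution by maximum likelihood and return (AIC, BIC)."""
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    k, n = len(params), len(data)           # number of estimated parameters, sample size
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

# Stand-in for the 408 MDF tensile-strength observations (kPa)
tensile = np.random.default_rng(1).normal(950, 95, 408)

for name, dist in [("normal", stats.norm), ("lognormal", stats.lognorm),
                   ("weibull", stats.weibull_min), ("logistic", stats.logistic)]:
    print(name, aic_bic(dist, tensile))     # lower AIC/BIC indicates the preferred pdf
```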
The OSB dataset had 150 records and 98 regressors, representing destructive tests over a six-month period for a nominal product of 11.11 mm in thickness. OSB destructive tests are taken from the production line at four-hour intervals or at product change. The ultimate static load dependent variable for OSB followed a normal pdf based on AIC and BIC (Table 1). The regressors for both MDF and OSB were taken from the process data warehouses and were fused with the dependent variables from the destructive testing labs. For both the OSB and MDF datasets, 80% of the data were used for training and 20% for validation; the training datasets contained 184 regressors for MDF and 98 regressors for OSB. Ten-fold cross validation was used for both the MDF and OSB models.
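A minimal sketch of this validation scheme is shown below. The file name 'mdf_fused.csv' and the column name 'tensile_strength' are placeholders for the fused dataset, not names used in the study, and the random forest is only a stand-in model for illustrating the 80/20 split and 10-fold cross validation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Placeholder for the fused process/lab dataset (408 records, 184 regressors for MDF)
df = pd.read_csv("mdf_fused.csv")
X, y = df.drop(columns=["tensile_strength"]), df["tensile_strength"]

# 80% training / 20% validation, as in the study
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)

# 10-fold cross validation on the training portion
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(n_estimators=500), X_train, y_train, cv=cv, scoring="r2")
print(scores.mean())
```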

2.2. Description of Predictor Variables

Process data are from sensors on the production line and are related to line speed, pressing speed, pressing pressure, pressing temperature, fiber moisture, etc. Both the MDF and OSB processes exhibit variation in all of these variables during normal manufacturing. Line speed typically is product specific, i.e., faster speeds for thinner, lower density products and slower speeds for thicker, higher density products. For example, the OSB predictor variable 'wet bin speed' is directly related to line speed, given that the rate at which the bin holding the flakes empties is a function of line speed. Line speed also changes because of moisture content changes in the fibers during the pressing stage. Pressing occurs under pressure and high temperature to cure the bond between fibers and adhesives. Fiber moisture may change during manufacturing due to natural variation in the feedstocks and variation in temperatures during the drying process. The predictor variables reported in this manuscript are given descriptive names related to these processes, e.g., 'flake moisture content' is the moisture content of the OSB flakes, 'total press time' is the time the material resides in the pressing stage, 'mat weight' is the actual weight of the formed fiber mats before the pressing stage, etc. The MDF process differs from the OSB process primarily in its early stages, where wood is refined to small fibers with lengths of 1.34–1.84 mm. The predictor variable 'face plate position' refers to the gap between the grinding plates during the wood-to-fiber refining stage, and 'face steam flow' refers to how much steam is injected into the refiners that create the fibers. Descriptions of the MDF and OSB processes are documented in [15,16].

2.3. Kernel Tree Methods

Decision trees as applied to continuous data are known as regression trees (RT) [17,18,19]. Given that documentation on the methodologies of kernel tree methods is extensive, only a summary is presented. As Hand et al. [20] noted, “linear regression is a global model, where there is a single predictive formula holding over the entire data-space. An alternative approach is to sub-divide, or partition, the space into smaller regions, where the interactions are more manageable.” The regression tree approach to modeling identifies a hierarchy of interactions that were previously unknown. It creates recursive partitions, or cells, within the entire data space (i.e., terminal nodes or leaves), and the cells are modeled separately. The cells of regression trees are typically ‘pruned’ during the model validation phase to identify the best predictive model. One strength of this method is that data do not need to be imputed, given that partitions are made in the presence of missing data (Georges, 2009). RTs are represented as two-dimensional graphics, which makes them easy to understand and interpret [6,7].
RTs are quite popular as an exploratory modeling technique and are commonly associated with data mining [21,22,23,24,25,26,27,28]. RTs are very resistant to irrelevant regressors: because the recursive tree-building algorithm estimates the optimal variable on which to split at each step, regressors unrelated to the response are not chosen for splitting [29] (pp. 199–215). In theory, a regression tree partitions the data space of all joint regressor values X into J disjoint regions $\{R_j\}_{1}^{J}$ [6]. For a given set of joint regressor values X, the tree prediction $\hat{Y} = T_j(X)$ assigns as the response estimate the value assigned to the region containing X:
$X \in R_j \Rightarrow T_j(X) = \hat{y}_j$
Given a set of regions, the optimal response values associated with each region minimize the prediction error in that region:
$\hat{y}_j = \arg\min_{y'} E_y\left[ L(y, y') \mid X \in R_j \right]$
Unlike some kernel methods, RTs use the data to estimate a good partition rather than relying on a model predefined by the analyst.
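The following sketch illustrates the partitioning idea with scikit-learn's regression tree on simulated data. The feature names echo two MDF regressors, but the values and split-points are synthetic and should not be read as the study's results.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
# Two illustrative regressors, e.g., 'face plate position' and 'press position time'
X = rng.uniform([2.2, 8.0], [10.5, 8.5], size=(400, 2))
# Response with an interaction: the second regressor only matters above a split-point of the first
y = 900 + 15 * X[:, 0] + 40 * (X[:, 0] > 8.3) * (X[:, 1] - 8.2) + rng.normal(0, 20, 400)

tree = DecisionTreeRegressor(max_depth=3, min_samples_split=10).fit(X, y)
# The printed rules expose the hierarchy of interactions and the split-points (candidate levels)
print(export_text(tree, feature_names=["face_plate_position", "press_position_time"]))
```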
‘Boosted trees’ (BT) rely on the philosophy that a small number of simple trees (weak learners), combined into one model, outperform the predictions of one large RT [6,30,31]. ‘Boosting’ builds trees sequentially such that each new tree improves the predictive power of the ensemble [32,33,34]. New trees are grown specifically to accommodate observations that the existing ensemble predicts poorly, improving the predictive performance of the final boosted regression tree (BRT) model. BRT approximates a solution to the problem of fitting a sum of trees by adding new trees one at a time while keeping all existing trees unchanged [35]. As stated by Schapire [31] and Elith et al. [35], boosting enhances model accuracy, and its key step is to consecutively apply the algorithm to continually modified data, i.e., it minimizes the loss function by adding a regression tree at each iteration [35].
‘Random forests’ (RF), developed by Breiman [36] and summarized by Fawagreh et al. [37], “combines Breiman’s bagging sampling approach, and the random selection of features, introduced independently [38,39] in order to construct a collection of decision trees with controlled variation. Each tree in the ensemble acts as a base classifier to determine the class label of an unlabeled instance.” Key advantages of RF over BT are robustness to noise and less overfitting [35,38,39,40,41,42,43].
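A short sketch of both ensembles using scikit-learn's gradient boosting and random forest implementations as stand-ins for the boosted tree and bootstrap forest platforms used in the study; the data and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))                                   # stand-in process regressors
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0, 0.5, 400)  # response with an interaction
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

# Boosting: many shallow "weak learner" trees added sequentially
bt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
# Random forest: trees grown on bootstrap samples with random feature selection at each split
rf = RandomForestRegressor(n_estimators=1000, max_features="sqrt", min_samples_split=10)

for model in (bt, rf):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_va, y_va), 3))  # R^2 in validation
```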
There were five key parameters used for model calibration in this study. A minimum split size of 10 was used for both the boosted tree and random forest models to prevent the program from splitting any node with fewer than the specified number of cases [44]. ‘Early stopping’ halted the additive boosting process if further boosting failed to improve the fit on the validation dataset. A minimum split size of five was used with a learning rate r of 0.1, where 0 < r ≤ 1, which cued the program to build separate boosted trees for every combination of splits. This permitted the boosted tree to try various combinations of parameters in order to find the one that maximized the fit. The tree was grown with 50 layers, and splits per tree ranged from five to 10. An overfit penalty ensured against cases with predicted probabilities equal to zero; higher penalty values result in less overfitting. The probability is:
$\mathrm{Prob}_i = \dfrac{n_i + \mathrm{prior}_i}{\sum_i \left( n_i + \mathrm{prior}_i \right)}$
where the summation is across all response levels and $n_i$ is the number of observations at the node for the ith response level. The prior probability for the ith response level, $\mathrm{prior}_i$, is calculated as follows:
$\mathrm{prior}_i = \lambda p_i + (1 - \lambda) P_i$
where $p_i$ is the $\mathrm{prior}_i$ from the parent node, $P_i$ is the $\mathrm{prob}_i$ from the parent node, and $\lambda$ is a weighting factor, set to 0.9 in this study.
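The two formulas above can be transcribed directly. The sketch below assumes the per-level counts at a node and the parent node's prior and fitted probabilities are available; it is an illustration of the formulas, not the vendor's implementation.

```python
import numpy as np

def node_probs(n, parent_prior, parent_prob, lam=0.9):
    """Overfit-penalized node probabilities.

    n            : counts per response level at the node (n_i)
    parent_prior : prior_i of the parent node (p_i in the text)
    parent_prob  : prob_i of the parent node (P_i in the text)
    lam          : weighting factor lambda (0.9 in this study)
    """
    n = np.asarray(n, dtype=float)
    prior = lam * np.asarray(parent_prior) + (1 - lam) * np.asarray(parent_prob)
    prob = (n + prior) / np.sum(n + prior)   # summation across all response levels
    return prob, prior

# Hypothetical two-level node with counts 12 and 3
prob, prior = node_probs(n=[12, 3], parent_prior=[0.5, 0.5], parent_prob=[0.7, 0.3])
print(prob)   # probabilities sum to one and cannot be exactly zero
```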

3. Fractional Factorials and Aliasing

Statistically designed experimentation (i.e., design of experiments or DOE) is a formal ‘deductive’ methodology in which independent variables (factors) are manipulated at different settings (levels) in a controlled fashion to explore optimization problems for key response variables. R.A. Fisher [45] is the father of DOE and expanded on the classic analysis of variance (ANOVA) methodology. A ‘full factorial’ DOE has experimental runs, with replicates, at all possible combinations of the factor levels. Even though full factorial designs are the most informative designs, they are expensive.
George E.P. Box was influenced immensely by R.A. Fisher’s work and studied under Egon Pearson at the University of London [46]. He developed a series of designs known as ‘response surface methods’ (RSM), which allow researchers to use a fraction of the total experimental runs (fractional factorials). Box’s popular ‘central composite’ and ‘Box–Behnken’ RSM designs [1] minimize the number of experimental runs while sustaining inference. A key consideration for RSM, and for any type of fractional factorial design, is the aliasing structure of the design. Resolution III designs, in which two-level interactions are aliased with main effects and other two-level interactions, are typically avoided given that two-level interactions are typically significant. For example, a one-half fraction 2^(k−1) design with k = 4 and a replicate has n = 16 experimental runs and is a Resolution III design with aliases A = BC, B = AC, C = AB, AB = CD, BC = AD, and AC = BD. Generally, analysts select the highest possible design resolution (e.g., Resolution ≥ IV) when choosing a design in order to avoid confounding main effects. If the analyst conducts a fractional factorial design without a detailed knowledge of the phenomenon under investigation, an aliasing structure is assumed. By developing RT models as a possible first step for the factors under investigation, unknown interactions and split-points may be discovered. This may accelerate innovation by choosing statistically significant interactions for a DOE, leading to lower costs of experimentation.
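The aliasing of a regular one-half fraction can be enumerated directly from the design generator. The sketch below builds a 2^(4−1) fraction with the generator D = ABC (the Resolution IV fraction discussed later in Section 4.2) and prints the aliased two-factor interactions; the factor labels are generic.

```python
from itertools import product, combinations
import numpy as np

# Full 2^3 factorial in A, B, C; generate D from the design generator D = ABC
base = np.array(list(product([-1, 1], repeat=3)))
design = np.column_stack([base, base[:, 0] * base[:, 1] * base[:, 2]])  # columns A, B, C, D

labels = "ABCD"
# Main-effect and two-factor-interaction columns; identical columns are aliased with each other
cols = {"".join(labels[i] for i in idx): np.prod(design[:, list(idx)], axis=1)
        for r in (1, 2) for idx in combinations(range(4), r)}

for a, b in combinations(cols, 2):
    if np.array_equal(cols[a], cols[b]):
        print(f"{a} = {b}")   # prints AB = CD, AC = BD, AD = BC for this generator
```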

4. Results

4.1. Boosted Tree Models and Bootstrap Forest

The boosted tree (BT) and bootstrap forest (BF) models, with k = 1000 and k = 10,000 bootstraps [47] for each method, indicated a set of significant variables (α = 0.05). The recursive partitions using smaller additive trees, which are unique to the BT and BF methods, identified a set of predictor variables common to both modeling types. Four common predictor variables in these bootstrapped additive trees for MDF were ‘face plate position’, ‘press position time’, ‘face steam flow’, and ‘adhesive percent’ (Table 2). The R2 in training and validation for the BT model was 0.758 and 0.552, respectively. The R2 in training and validation for the BF model was 0.628 and 0.484, respectively. Even though the R2 in validation is not high, these common variables explain a reasonable proportion of the variation influencing tensile strength; screening designs may not have led to a similar result in a short amount of time. The means and variances of the dependent variable tensile strength (kPa) in the training and validation datasets were similar across the 10-fold cross validation, with $\bar{\bar{x}} = 946.7$, $\bar{s} = 95.6$, $\bar{M} = 944.6$, and $\overline{CV} = 10.1\%$ in the training dataset, and $\bar{\bar{x}} = 946.4$, $\bar{s} = 91.2$, $\bar{M} = 945.9$, and $\overline{CV} = 9.6\%$ in the validation dataset.
The boosted tree (BT) and bootstrap forest (BF) models with k = 1000 and k = 10,000 bootstraps for the OSB data revealed a set of five common predictor variables: ‘wet bin speed’, ‘flake moisture content’, ‘dryer inlet temperature’, ‘dryer outlet temperature’, and ‘wood weight’ (Table 3). The R2 in training and validation for the BT model was 0.675 and 0.410, respectively. The R2 in training and validation for the BF model was 0.534 and 0.304, respectively. The means and variances of the dependent variable ultimate static load (kg) in the training and validation datasets were similar across the 10-fold cross validation, with $\bar{\bar{x}} = 213.7$, $\bar{s} = 28.1$, $\bar{M} = 210.6$, and $\overline{CV} = 5.9\%$ in the training dataset, and $\bar{\bar{x}} = 204.8$, $\bar{s} = 27.7$, $\bar{M} = 204.0$, and $\overline{CV} = 6.1\%$ in the validation dataset. Again, even though the R2 in validation is not high, these five variables explain a reasonable proportion of the variation influencing ultimate static load. In order to find reasonable split-points and levels for the factors to be tested in a designed experiment, regression tree models were developed using the five common independent variables previously mentioned.
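A sketch of the bootstrap screening used here: refit the boosted and forest models across bootstrap resamples, rank the regressors by importance, and keep those that recur in both model types. The data below are synthetic, and scikit-learn's impurity-based importance stands in for the sums-of-squares contributions reported in Tables 2 and 3.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

gen = np.random.default_rng(0)
X = pd.DataFrame(gen.normal(size=(300, 15)), columns=[f"x{i}" for i in range(15)])
y = pd.Series(2 * X["x0"] + X["x1"] * X["x2"] + gen.normal(0, 0.5, 300))

def recurring_variables(model, X, y, n_boot=100, top=6, seed=1):
    """Count how often each regressor ranks in the top `top` importances across bootstrap refits."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))                       # bootstrap resample
        imp = model.fit(X.iloc[idx], y.iloc[idx]).feature_importances_
        counts[np.argsort(imp)[-top:]] += 1
    return set(X.columns[np.argsort(counts)[-top:]])

common = recurring_variables(GradientBoostingRegressor(), X, y) & \
         recurring_variables(RandomForestRegressor(n_estimators=100), X, y)
print(common)   # regressors common to both ensemble types, cf. Tables 2 and 3
```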

4.2. RT Models as a Precursor for Response Surface Methods

The following examples of designed experiments are presented to demonstrate that kernel tree results can be helpful in designing a DOE without time-consuming prescreening designs. Assuming that experimental runs are costly, several response surface designs are presented that minimize the number of runs while maintaining α = 0.05 and maximizing the power of the experiment.
Designs for MDF from the RT model. From the RT model for MDF, a fractional factorial Resolution IV design with 2^(k−1), k = 4, and a replicate would require n = 16 runs. This design has a power of 82% for the main effects (α = 0.05), given a standard deviation of 95 kPa and the ability to detect a difference in the mean tensile strength of 950 kPa (Table 4). From the RT model results (Figure 1), the factor notation is: A = ‘face plate position’, B = ‘adhesive percent’, C = ‘press position time’, and D = ‘face steam flow.’ The aliasing structure for this fractional factorial design is A = BCD, B = ACD, C = ABD, D = ABC, AB = CD, AC = BD, and AD = BC. Even though this may be a feasible design for the analyst, it only explores the corner points of the data space. Box and Behnken [1] proposed the Box–Behnken design for cases where points at the corners are costly [48]. A Box–Behnken design with n = 27 runs is presented in Figure 3, with the run order given in Table 5. If the analyst can afford three more experimental runs and corner points may be risky for the process, a spherical circumscribed central composite design (CCD) RSM, k = 4, with four axial points is an option (Figure 3). A full factorial design would examine only the corner points and would require n = 32 runs, including a replicate.
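The structure of the Box–Behnken runs in Table 5 can be reproduced programmatically: each pair of factors is varied over its low/high levels while the remaining factors sit at their mid levels, plus center points. The sketch below uses the levels listed in Table 5; the run order would still need to be randomized before execution.

```python
from itertools import combinations, product

levels = {  # low, mid, high levels taken from Table 5
    "face_plate_position": (9.3, 9.5, 9.7),
    "press_position_time": (8.0, 8.2, 8.4),
    "adhesive_percent":    (7.60, 7.75, 7.90),
    "face_steam_flow":     (90.5, 95.5, 100.5),
}
names = list(levels)
center = [levels[f][1] for f in names]          # all factors at their mid levels

runs = []
for i, j in combinations(range(4), 2):          # every pair of factors
    for a, b in product((0, 2), repeat=2):      # low/high combinations for the pair
        run = center.copy()
        run[i], run[j] = levels[names[i]][a], levels[names[j]][b]
        runs.append(run)
runs += [center] * 3                            # three center points
print(len(runs))                                # 27 runs, as in Table 5
```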
Designs for OSB from the RT model. Given that the OSB RT model resulted in three significant factors, A = ‘dryer inlet temperature’, B = ‘mat weight’, and C = ‘wet bin speed’ at α = 0.05, a full factorial design is feasible, e.g., 2^k, k = 3, n = 16 with a power of 84% for significant main effects (assuming α = 0.05 for the experiment) (see Table 6 and Figure 2). Even though this design has no aliasing assumptions, it only explores corner points. A Box–Behnken RSM design would require two fewer runs, n = 15, with three center points (Figure 3). An alternative would be a spherical circumscribed CCD RSM design with n = 17, k = 3, and four axial points (Figure 3), where the run order is presented in Table 7. Even though there are many other feasible designs, the aforementioned designs were developed to minimize the number of experimental runs while using the helpful results of the RT models.
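Similarly, a circumscribed CCD of the kind behind Table 7 can be sketched in coded units as a 2^3 cube, axial points on each axis, and center points. The standard construction places two axial points per axis (six for k = 3), which is consistent with n = 17 = 8 + 6 + 3; the rotatable α and the number of center points below are illustrative choices rather than the exact settings of the study's design.

```python
from itertools import product
import numpy as np

def ccd_coded(k=3, alpha=None, n_center=3):
    """Coded-unit circumscribed central composite design."""
    alpha = alpha or (2 ** k) ** 0.25                    # rotatable alpha = (2^k)^(1/4)
    cube = np.array(list(product([-1, 1], repeat=k)))    # 2^k corner (cube) points
    axial = np.vstack([v * alpha * np.eye(k)[i]          # two axial points per axis
                       for i in range(k) for v in (-1, 1)])
    center = np.zeros((n_center, k))                     # replicated center points
    return np.vstack([cube, axial, center])

design = ccd_coded(k=3)
print(design.shape)   # (17, 3): 8 corner + 6 axial + 3 center runs, matching n = 17
```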

5. Conclusions

Identifying interaction effects is important in innovation and product development. Kernel tree methods provide an accepted data science approach for enhancing the applied sciences. Such methods quickly identify and quantify undiscovered interactions among regressors. In this research, boosted tree and random forest models were constructed from two different manufacturing systems. The common significant variables in the models that affected the strength of the materials were related to process speed, fiber refining, pressing time, and pressing pressures. Experimental designs were proposed with Resolution > III while incorporating the aliasing structure identified from the RT models. Possible designs with four factors were: fractional factorial Resolution IV, Box–Behnken RSM, and CCD spherical circumscribed designs. These designs were selected to minimize the number of experimental runs while sustaining inference. The hierarchy of interaction effects and the split-points of regressors in kernel tree models may provide applied scientists with an important foundation for planning a designed experiment while minimizing costs during innovation development. If feasible, future studies will explore support vector machines, Bayesian additive regression trees (BART), etc., for comparative analyses.

Author Contributions

All authors contributed to the development of this paper. T.M.Y. did the primary statistical analysis and development of original draft manuscript. R.A.B. and T.L. were instrumental in procuring the industrial datasets and validating the results of the models. A.P.’s contribution was invaluable in advising on the methodology, results, and the review of the overall manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Tennessee Institute of Agriculture AgResearch McIntire-Stennis project [TEN00MS-107]; the APC was funded by [1006012].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Box, G.E.P.; Behnken, D. Some new three level designs for the study of quantitative variables. Technometrics 1960, 2, 455–475. [Google Scholar] [CrossRef]
  2. Fielding, A. Binary segmentation: The automatic detector and related techniques for exploring data structure. In The Analysis of Survey Data, Exploring Data Structures; O’Muircheartaigh, C.A., Payne, C., Eds.; John Wiley and Sons, Inc.: New York, NY, USA, 1977; Volume I, pp. 221–257. [Google Scholar]
  3. Kass, G.V. Significance testing in automatic interaction detection (A.I.D.). Appl. Stat. 1975, 24, 178–189. [Google Scholar] [CrossRef]
  4. Loh, W.Y. Regression trees with unbiased variable selection and interaction detection. Stat. Sin. 2002, 12, 361–386. [Google Scholar]
  5. Morgan, J.N.; Sunquist, J.A. Problems in the analysis of survey data and a proposal. J. Am. Stat. Assoc. 1963, 58, 415–434. [Google Scholar] [CrossRef]
  6. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  7. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  8. Friedman, J.H.; Meulman, J.J. Multiple additive regression trees with application in epidemiology. Stat. Med. 2003, 22, 1365–1381. [Google Scholar] [CrossRef]
  9. Kim, H.; Loh, W.Y. Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 2001, 96, 589–604. [Google Scholar] [CrossRef] [Green Version]
  10. Kim, H.; Loh, W.Y. Classification trees with bivariate linear discriminant node models. J. Comput. Graph. Stat. 2003, 12, 512–530. [Google Scholar] [CrossRef] [Green Version]
  11. Kim, H.; Guess, F.M.; Young, T.M. Using data mining tools of decision trees in reliability applications. IIE Trans. 2011, 43, 43–54. [Google Scholar]
  12. Stoma, P.; Stoma, M.; Dudziak, A.; Caban, J. Bootstrap analysis of the production processes capability assessment. Appl. Sci. 2019, 9, 5360. [Google Scholar] [CrossRef] [Green Version]
  13. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  14. Schwarz, G.E. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  15. Adcock, T.; Wolcott, M.P. Wood: Structural Panel Processes. In Encyclopedia of Materials: Science and Technology; Buschow, K.H.J., Cahn, R.W., Flemings, M.C., Ilschner, B., Kramer, E.J., Mahajan, S., Veyssière, P., Eds.; Elsevier: Amsterdam, The Netherlands, 2001; pp. 9678–9683. [Google Scholar]
  16. Kamke, F.A. Wood: Nonstructural panel processes. In Encyclopedia of Materials: Science and Technology; Buschow, K.H.J., Cahn, R.W., Flemings, M.C., Ilschner, B., Kramer, E.J., Mahajan, S., Veyssière, P., Eds.; Elsevier: Amsterdam, The Netherlands, 2001; pp. 9673–9678. [Google Scholar]
  17. Chaudhuri, P.; Huang, M.C.; Loh, W.Y.; Yao, R. Piecewise-polynomial regression trees. Stat. Sin. 1994, 4, 143–167. [Google Scholar]
  18. De’ath, G.; Fabricius, K. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 2000, 81, 3178–3192. [Google Scholar] [CrossRef]
  19. Loh, W.Y.; Vanichsetakul, N. Tree-structured classification via generalized discriminant analysis. J. Am. Stat. Assoc. 1988, 83, 715–728. [Google Scholar] [CrossRef]
  20. Hand, D.J.; Mannila, H.; Smyth, P. Principles of Data Mining (Adaptive Computation and Machine Learning), 3rd ed.; MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
  21. André, N.; Young, T.M. Real-time process modeling of particleboard manufacture using variable selection and regression methods ensemble. Eur. J. Wood Wood Prod. 2013, 71, 361–370. [Google Scholar] [CrossRef]
  22. Carty, D.M.; Young, T.M.; Zaretzki, R.L.; Guess, F.M.; Petutschnigg, A. Predicting the strength properties of wood composites using boosted regression trees. Forest Prod. J. 2015, 65, 365–371. [Google Scholar] [CrossRef]
  23. Cherkassky, V.S.; Mulier, F. Learning from Data: Concepts, Theory, and Methods; John Wiley & Sons, Inc.: New York, NY, USA, 1998; pp. 1–536. [Google Scholar]
  24. Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. From Data Mining to Knowledge Discovery: An Overview of Advances in Knowledge Discovery and Data Mining; The MIT Press: Cambridge, MA, USA, 1996; pp. 1–34. [Google Scholar]
  25. Loh, W.Y. Classification and regression trees. WIREs Data Min. Knowl. 2011, 1, 14–23. [Google Scholar] [CrossRef]
  26. Young, T.M.; León, R.V.; Chen, C.-H.; Chen, W.; Guess, F.M.; Edwards, D.J. Robustly estimating lower percentiles when observations are costly. Qual. Eng. 2015, 27, 361–373. [Google Scholar] [CrossRef]
  27. Young, T.M.; Clapp, N.E., Jr.; Guess, F.M.; Chen, C.-H. Predicting key reliability response with limited response data. Qual. Eng. 2014, 26, 223–232. [Google Scholar] [CrossRef]
  28. Zeng, Y.; Young, T.M.; Edwards, D.J.; Guess, F.M.; Chen, C.-H. Case studies: A study of missing data imputation in predictive modeling of a wood composite manufacturing process. J. Qual. Technol. 2016, 48, 284–296. [Google Scholar]
  29. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.I. Classification and Regression Trees; Wadsworth: Belmont, CA, USA, 1984; pp. 199–215. [Google Scholar]
  30. Luna, J.M.; Gennatas, E.D.; Ungar, L.H.; Valdes, G. Building more accurate decision trees with the additive tree. Proc. Natl. Acad. Sci. USA 2019, 116, 19887–19893. [Google Scholar] [CrossRef] [Green Version]
  31. Schapire, R.E. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification; Denison, D.D., Hansen, M.H., Holmes, C., Mallick, B., Yu, B., Eds.; Springer: New York, NY, USA, 2003; pp. 113–141. [Google Scholar]
  32. Feng, J.; Yu, Y.; Zhou, Z.-H. Multi-layered gradient boosting decision trees. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 3555–3565. [Google Scholar]
  33. Khan, Z.; Gul, A.; Perperoglou, A. Ensemble of optimal trees, random forest and random projection ensemble classification. Adv. Data Anal. Cl. 2020, 14, 97–116. [Google Scholar] [CrossRef] [Green Version]
  34. Khuri, N. Mining environmental chemicals with boosted trees. In Proceedings of the 35th Annual ACM Symposium on Applied Computing, Brno, Czech Republic, 30 March–3 April 2020; pp. 1082–1089. [Google Scholar]
  35. Elith, J.; Leathwick, J.R.; Hastie, T. A working guide to boosted regression trees. J. Anim. Ecol. 2008, 77, 802–813. [Google Scholar] [CrossRef]
  36. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  37. Fawagreh, K.; Gaber, M.M.; Elyan, E. Random forests: From early developments to recent advancements. J. Syst. Sci. Syst. Eng. 2014, 2, 602–609. [Google Scholar] [CrossRef] [Green Version]
  38. Amit, Y.; Geman, D. Shape quantization and recognition with randomized trees. Neural Comput. 1997, 9, 1545–1588. [Google Scholar] [CrossRef] [Green Version]
  39. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. 1998, 20, 832–844. [Google Scholar]
  40. Boinee, P.; De Angelis, A.; Foresti, G.L. Meta random forests. Int. J. Comput. Int. Syst. 2005, 2, 138–147. [Google Scholar]
  41. Gregorutti, B.; Michel, B.; Saint-Pierre, P. Correlation and variable importance in random forests. Stat. Comput. 2017, 27, 659–678. [Google Scholar] [CrossRef] [Green Version]
  42. Jaiswal, J.K.; Samikannu, R. Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression. In Proceedings of the World Congress on Computing and Communication Technologies (WCCCT), Tiruchirappalli, India, 2–4 February 2017; pp. 65–68. [Google Scholar]
  43. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  44. Attewell, P.; Monaghan, D. Data Mining for the Social Sciences: An Introduction; University of California Press: Berkeley, CA, USA, 2015; pp. 1–264. [Google Scholar]
  45. Fisher, R.A. The Design of Experiments; Hafner Publishing Company: New York, NY, USA, 1971; pp. 23–36. [Google Scholar]
  46. Box, G.E.P. Science and statistics. J. Am. Stat. Assoc. 1976, 71, 791–799. [Google Scholar] [CrossRef]
  47. Pattengale, N.D.; Alipour, M.; Bininda-Emonds, O.R.P.; Moret, B.M.E.; Stamatakis, A. How Many Bootstrap Replicates Are Necessary; Batzoglou, S., Ed.; RECOMB, LNCS 5541; Springer: Berlin/Heidelberg, Germany, 2009; pp. 184–200. [Google Scholar]
  48. Box, G.E.P.; Draper, N.R. Empirical Model Building and Response Surfaces; John Wiley and Sons: New York, NY, USA, 1987; pp. 1–688. [Google Scholar]
Figure 1. Regression tree for MDF tensile strength derived from boosted tree and random forest models.
Figure 2. Regression tree for OSB ultimate static load derived from boosted tree and random forest models.
Figure 3. Designed experimental models for MDF and OSB using the outcomes of the boosted tree and random forest models.
Table 1. Akaike Information Criteria (AIC) and Bayesian Information Criteria (BIC) statistics for tensile strength pdfs of medium density fiberboard (MDF) and ultimate static load of oriented strand board (OSB).
Tensile Strength – Model Comparisons
Distribution             AIC          BIC
Normal                   4966.5326    4975.3464
Generalized Gamma        4967.6229    4980.8336
Log Generalized Gamma    4967.6304    4980.8411
Lognormal                4970.1264    4978.9402
Logistic                 4978.8342    4987.6480
Loglogistic              4981.3085    4990.1222
Weibull                  5028.7795    5037.5932
LEV                      5044.7598    5053.5736
SEV                      5087.5301    5096.3439
Frechet                  5110.6015    5119.4153
Exponential              7255.9066    7260.3168

Ultimate Static Load – Model Comparisons
Distribution             AIC          BIC
Normal                   −524.1410    −518.2014
Generalized Gamma        −522.2338    −513.3663
Log Generalized Gamma    −522.0893    −513.2218
Lognormal                −521.9522    −516.0125
Logistic                 −518.1157    −512.1760
Loglogistic              −516.7510    −510.8113
Weibull                  −516.6517    −510.7120
SEV                      −507.4300    −501.4904
LEV                      −504.8098    −498.8702
Frechet                  −491.4876    −485.5480
Exponential              30.1596      33.1432
Table 2. Common predictor variables from 10-fold cross validation for both the bootstrap forest and boosted tree models for MDF after k = 1000 and k = 10,000 bootstraps, ranked by highest sums of squares.
Bootstrap Forest (k = 1000 bootstraps)
Variable                 Number of Splits    Sums of Squares
Face Plate Position      25                  2771.68
Core Dust Speed          13                  1052.24
Press Position Time      12                  796.26
Total Press Time         11                  594.18
Face Steam Flow          15                  592.02
Adhesive Percent         14                  469.34

Boosted Tree (k = 1000 bootstraps)
Variable                 Number of Splits    Sums of Squares
Face Plate Position      9                   26,525.05
Press Position Time      6                   17,736.67
Core Dust Speed          7                   14,952.18
Face Steam Flow          4                   9565.28
Total Press Time         2                   8360.52
Adhesive Percent         4                   7688.29

Bootstrap Forest (k = 10,000 bootstraps)
Variable                 Number of Splits    Sums of Squares
Face Steam Flow          20                  2379.71
Adhesive Percent         29                  2142.70
Face Plate Position      87                  1571.86
Press Position Time      15                  1470.78
Swing Plate Position     45                  1319.93
Resin Temperature        20                  1211.34

Boosted Tree (k = 10,000 bootstraps)
Variable                 Number of Splits    Sums of Squares
Face Plate Position      7                   28,640.33
Swing Plate Position     7                   16,677.42
Face Steam Flow          8                   15,481.28
Adhesive Percent         4                   9565.11
Press Position Time      6                   9553.90
Face Steam Flow          6                   8170.58
Table 3. Common predictor variables for both the bootstrap forest and boosted tree models for OSB after k = 1000 and k = 10,000 bootstraps, ranked by highest sums of squares.
Bootstrap Forest (k = 1000 bootstraps)
Variable                  Number of Splits    Sum of Squares
Wet Bin Speed             7                   0.117
Dryer Inlet Temperature   2                   0.023
Flake Moisture Content    2                   0.019
Dryer Outlet Temperature  1                   0.015
Mat Weight                1                   0.012
Wood Weight               1                   0.009

Boosted Tree (k = 1000 bootstraps)
Variable                  Number of Splits    Sum of Squares
Wet Bin Speed             4                   0.050
Flake Moisture Content    4                   0.049
Dryer Inlet Temperature   3                   0.035
Dryer Outlet Temperature  2                   0.017
Mat Weight                2                   0.017
Wood Weight               2                   0.016

Bootstrap Forest (k = 10,000 bootstraps)
Variable                  Number of Splits    Sum of Squares
Dryer Outlet Temperature  5                   0.172
Dryer Inlet Temperature   5                   0.093
Wood Weight               4                   0.076
Flake Moisture Content    6                   0.035
Dry Bin Speed             2                   0.026
Wet Bin Speed             1                   0.024

Boosted Tree (k = 10,000 bootstraps)
Variable                  Number of Splits    Sum of Squares
Dryer Inlet Temperature   8                   0.062
Wet Bin Speed             5                   0.045
Wood Weight               3                   0.034
Press Closing Time        6                   0.026
Dryer Outlet Temperature  3                   0.020
Flake Moisture Content    4                   0.016
Table 4. Descriptive statistics for the dependent and independent variables in the regression tree (RT) model for MDF.
Quantiles            Tensile Strength (kPa)   Face Plate Position (mm)   Core Dust Speed (m/min)   Face Steam Flow (bar)   Adhesive Percent (%)
100.0% (maximum)     1275.575                 10.459                     24.542                    292.618                 14.743
99.5%                1226.483                 10.456                     24.195                    285.132                 14.583
97.5%                1137.675                 10.422                     22.788                    231.635                 14.089
90.0%                1067.346                 9.720                      19.928                    190.579                 13.687
75.0% (quartile)     1020.46                  9.140                      16.468                    160.880                 9.555
50.0% (median)       951.51                   8.324                      10.921                    114.143                 9.468
25.0% (quartile)     882.56                   5.492                      8.245                     69.911                  8.579
10.0%                827.4                    4.160                      7.094                     60.558                  7.662
2.5%                 767.4135                 3.135                      6.128                     51.778                  5.985
0.5%                 704.1174                 2.271                      3.539                     49.570                  5.679
0.0% (minimum)       668.815                  2.180                      2.851                     48.733                  5.671

Summary Statistics
Mean                 950.0881                 0.321                      41.069                    121.76                  9.50
Std Dev              94.6977                  0.068                      16.341                    53.10                   1.93
Std Err Mean         3.831056                 0.003                      0.772                     2.36                    0.09
Upper 95% Mean       957.6118                 0.327                      42.587                    126.39                  9.68
Lower 95% Mean       942.5644                 0.315                      39.552                    117.13                  9.31
N                    408                      408                        408                       408                     408
Table 5. Box–Behnken design for tensile strength (n = 27) for k = 4 with three center points, based on the results of the RT for MDF.
Runs  Pattern  Face Plate Position  Press Position Time  Adhesive Percent  Face Steam Flow  Random Run Order
1     −−00     9.3                  8.0                  7.75              95.5             9
2     −0−0     9.3                  8.2                  7.60              95.5             5
3     −00−     9.3                  8.2                  7.75              90.5             1
4     −00+     9.3                  8.2                  7.75              100.5            18
5     −0+0     9.3                  8.2                  7.90              95.5             23
6     −+00     9.3                  8.4                  7.75              95.5             26
7     0−−0     9.5                  8.0                  7.60              95.5             16
8     0−0−     9.5                  8.0                  7.75              90.5             17
9     0−0+     9.5                  8.0                  7.75              100.5            2
10    0−+0     9.5                  8.0                  7.90              95.5             25
11    00−−     9.5                  8.2                  7.60              90.5             21
12    00−+     9.5                  8.2                  7.60              100.5            15
13    0000     9.5                  8.2                  7.75              95.5             7
14    0000     9.5                  8.2                  7.75              95.5             11
15    0000     9.5                  8.2                  7.75              95.5             3
16    00+−     9.5                  8.2                  7.90              90.5             4
17    00++     9.5                  8.2                  7.90              100.5            27
18    0+−0     9.5                  8.4                  7.60              95.5             12
19    0+0−     9.5                  8.4                  7.75              90.5             10
20    0+0+     9.5                  8.4                  7.75              100.5            19
21    0++0     9.5                  8.4                  7.90              95.5             8
22    +−00     9.7                  8.0                  7.75              95.5             13
23    +0−0     9.7                  8.2                  7.60              95.5             24
24    +00−     9.7                  8.2                  7.75              90.5             14
25    +00+     9.7                  8.2                  7.75              100.5            6
26    +0+0     9.7                  8.2                  7.90              95.5             20
27    ++00     9.7                  8.4                  7.75              95.5             22
Table 6. Descriptive statistics for the dependent and independent variables in the RT model for OSB.
Quantiles            Ultimate Static Load (kg)   Dryer Inlet Temperature (°C)   Wood Weight (kg)   Wet Bin Speed (m/min)
100.0% (maximum)     283.04                      145.00                         19.66              9.75
99.5%                283.04                      145.00                         19.66              9.75
97.5%                269.20                      133.33                         19.66              9.45
90.0%                250.84                      127.22                         5.68               9.14
75.0% (quartile)     232.35                      125.00                         4.38               9.14
50.0% (median)       209.11                      120.28                         3.20               8.53
25.0% (quartile)     189.71                      112.22                         2.19               7.92
10.0%                174.22                      103.33                         0.10               7.32
2.5%                 168.53                      96.67                          0.10               6.40
0.5%                 147.87                      93.33                          0.10               6.10
0.0% (minimum)       147.87                      93.33                          0.10               6.10

Summary Statistics
Mean                 211.96                      117.92                         3.54               8.45
Std Dev              28.23                       8.36                           4.19               0.84
Std Err Mean         2.31                        1.38                           0.34               0.07
Upper 95% Mean       216.52                      119.44                         4.21               8.59
Lower 95% Mean       207.41                      116.40                         2.86               8.32
N                    150                         150                            150                150
Table 7. Central composite design for ultimate static load (n = 17) for k = 3 with three center points, based on the results of the RT for OSB.
Runs  Pattern  Dryer Inlet Temperature  Mat Weight  Wet Bin Speed  Random Run Order
1     −−−      105                      1.4         8.4            6
2     −−+      105                      1.4         9.4            11
3     a00      105                      1.5         8.9            15
4     −+−      105                      1.6         8.4            10
5     −++      105                      1.6         9.4            16
6     0a0      107.5                    1.4         8.9            12
7     00a      107.5                    1.5         8.4            13
8     000      107.5                    1.5         8.9            4
9     000      107.5                    1.5         8.9            7
10    00A      107.5                    1.5         9.4            17
11    0A0      107.5                    1.6         8.9            2
12    +−−      110                      1.4         8.4            1
13    +−+      110                      1.4         9.4            8
14    A00      110                      1.5         8.9            3
15    ++−      110                      1.6         8.4            14
16    +++      110                      1.6         9.4            5
17    −−−      105                      1.4         8.4            9

