The Goldilocks Approach: A Review of Employing Design of Experiments in Prokaryotic Recombinant Protein Production

The production of high yields of soluble recombinant protein is one of the main objectives of protein biotechnology. Several factors, such as expression system, vector, host, media composition and induction conditions can influence recombinant protein yield. Identifying the most important factors for optimum protein expression may involve significant investment of time and considerable cost. To address this problem, statistical models such as Design of Experiments (DoE) have been used to optimise recombinant protein production. This review examines the application of DoE in the production of recombinant proteins in prokaryotic expression systems with specific emphasis on media composition and culture conditions. The review examines the most commonly used DoE screening and optimisation designs. It provides examples of DoE applied to optimisation of media and culture conditions.


Introduction
Advances in biotechnology, including the development of genetic engineering and cloning, have provided a means for the large scale expression of heterologous proteins for different applications [1]. Currently, recombinant proteins are widely used in the biological and biomedical industries as well as in research with their market share increasing rapidly [2,3]. The production of high yields of soluble and functional recombinant protein is the ultimate goal in protein biotechnology [4]. To achieve this objective, many key aspects such as the expression system, the expression vector, the host strain, the purification tag, the media composition, the induction conditions and the purification methods need to be carefully evaluated and optimised before embarking on large scale production of a recombinant protein of interest [5][6][7].
Although both eukaryotic and prokaryotic expression systems are used for overproduction of soluble recombinant protein, choosing the right system for a given protein depends, amongst other things, on the growth rate and culturing conditions of host cells, the level of target gene expression and post-translational processing of the synthesized protein [8,9]. The most commonly used prokaryotic systems are based on expression in bacteria, including E. coli and Bacillus species [10,11]. There is no single method that is universally successful for protein expression and that will ensure the production of a desired concentration of soluble and functional protein [12][13][14]. Varying the factors that influence protein expression in a trial-and-error process to achieve optimum protein expression has proven troublesome [15]. To overcome this problem, statistical approaches have been used to evaluate the variables that have the largest influence on the production of a recombinant protein of interest in terms of yield [16,17], product quality [18], purity [19,20] and solubility [21,22]. These statistical processes include the Design of Experiments (DoE) approach [23,24]. This approach improves on the traditional one-factor-at-a-time (OFAT) method, which involves varying one factor while the other factors are held constant. The single-variable OFAT approach requires many experiments and carries a high risk of failing to identify the true optimum [25]. The DoE method provides for a significantly reduced experimental matrix [26][27][28].
There are an increasing number of published studies on the application of statistically based optimization processes in the field of protein biotechnology [18,29]. This has been matched by a corresponding increase in the application of DoE methods, such as screening and optimisation designs, to enhance protein production. This review examines the literature on the DoE methodologies commonly employed to evaluate the effect of media composition and culture conditions on recombinant protein expression. It will focus on the application of DoE to increase recombinant protein expression in prokaryotic systems, where high yields can be achieved but poor product quality remains a risk [30]. It also provides an overview of the important statistical analysis tools embedded in common DoE software. These tools facilitate the interpretation of experimental data which ultimately allows the identification of optimal factor levels for maximum yield. Finally, the review provides some thoughts on the benefits of the common DoE methods typically used in recombinant protein production in order to direct future research efforts.

Factors that Inform the Choice of Expression System
Protein purification from natural sources can require a large quantity of the source organism and may yield only a small amount of target protein after several rounds of extraction and purification [4,31]. Recombinant expression of proteins has become an indispensable tool to produce proteins to satisfactory yields [32] and to meet the demands of industry and research [1,33]. With the aid of genetic engineering, a desired gene cloned into a suitable expression vector can be overexpressed as a recombinant protein of interest [34]. Recombinant proteins can be expressed in cell cultures of bacteria [35], yeasts [36], mammalian cells [37,38], plants [39] and insects [40]. However, prokaryotic systems remain the most attractive hosts due to their low cost, high productivity and rapid production rates [30]. Prokaryotic heterologous protein expression is mainly carried out in the bacterium E. coli, although Bacillus species are increasingly being employed [41][42][43]. Drawbacks of prokaryotic expression systems include poor protein quality, due to the inability of prokaryotic cells to carry out post-translational modifications such as glycosylation, the presence of toxic cell wall pyrogens, and the formation of inclusion bodies resulting in aggregated and insoluble heterologous protein [44]. Some widely used bacterial expression systems that are commercially available are listed in Table 1. While a variety of expression vectors are commercially available, their choice is strongly based on the combination of replicons, promoters, selection markers, multiple cloning sites and fusion proteins [11]. Making an informed decision on the best expression plasmid [10,51][52][53][54] can be confusing. The most commonly used expression plasmids [22,55][56][57][58] and their key features such as promoters [59][60][61][62][63], affinity tags [64,65] and selection markers [7] have been extensively reviewed in the literature, primarily focusing on the E. coli prokaryotic expression system. Widely used Bacillus strains [66,67], vectors and promoters have also been reviewed [68][69][70].

Factors that Influence Media Composition and Culture Conditions in an Expression System
A careful selection of expression system, expression vector and host does not always guarantee the production of a large amount of target protein in soluble and active form [7]. Media composition and induction conditions have a significant influence on recombinant protein expression levels [71][72][73] and solubility [45]. For example, media containing a defined concentration of salts, peptone and yeast influences the yield of a recombinant glucosidase [47]; while media composition does not always have a major effect on protein solubility [51]. Prosthetic groups in media are known to prevent the formation of inclusion bodies [74] where required by the protein [41,75]. The most common media used in prokaryotic expression systems, along with their advantages and disadvantages, have been reviewed elsewhere [76]. Culture conditions are another set of factors that must be carefully optimised to achieve high yields of heterologous protein [14]. Factors such as cell density prior to induction, inducer concentration, induction temperature and induction duration are all known to influence yield [77][78][79][80][81].

Enhancing the Production of Recombinant Proteins in a Prokaryotic Expression System by DoE
It can be difficult to make informed decisions regarding the optimal combination of expression system, conditions and media components. This often results in an unsatisfactory and costly trial-and-error process being employed to enhance the overall production yield [64]. To address this problem, more effective, statistically supported approaches have been developed and have gained significant traction. In these approaches, a controlled model is developed defining media components, induction and expression conditions based on the recombinant protein of interest [16]. DoE, employed in this way, has provided powerful tools to screen and optimise factors affecting recombinant protein expression [82]. This is due to the ability of DoE to identify factors affecting recombinant protein production and to optimise the process with the minimum number of experiments [83]. A typical DoE workflow is depicted in diagrammatic form (see Figure 1). The desired output, or response, is a high yield of a protein of interest, and the workflow involves three main stages.
Stage 1. The first stage of the process is to compile a list of factors that can influence protein expression. These usually include induction temperature, induction duration, pH and media components (carbon source, nitrogen source, micronutrients).
Stage 2. At this stage, a suitable software package such as MINITAB, JMP or Design Experts is selected for the statistical analysis. The second stage of DoE aims to reduce the number of factors to a smaller subset containing the most important factors (i.e., those with the greatest impact on expression). This process is known as screening. Having a smaller set of significant factors greatly simplifies the statistical process. Sometimes, if the number of factors is small (between 2 and 4), there is no need to carry out the screening stage. When looking at a factor that influences protein expression, the concept of levels is important: temperature, for example, may be examined between 20 °C and 40 °C. These two temperatures represent the lowest and highest "levels" of this parameter to be examined. For the purposes of modelling, these two levels are input into the model for this factor. Similarly, the upper and lower levels are input for all other relevant parameters. It is important to note that the levels are input into the DoE package as +1 (highest value of a parameter) and −1 (lowest value of a parameter). This "coding" places all factors on a common dimensionless scale and so avoids mixing different measurement units for parameters such as pH and temperature. The software will then suggest a minimal set of experiments to explore the significance of each factor. The design of the experimental matrix can be selected from a range of choices such as Full Factorial Design, Plackett-Burman Design or indeed a custom design. The objective is to assess the "main effect" of a factor (its direct effect on a response) as well as its "interaction effects" (the extent to which its effect depends on the levels of other factors). The suggested experiments are carried out and the results are used to inform the next stage of the process, optimisation.
Stage 3. The final stage of the process is optimisation, which is typically carried out with a set of three to four factors. An experimental RSM (Response Surface Methodology) design strategy is selected and experiments are run as for the screening stage. The optimisation process expresses the response surface as a polynomial and uses the input data to estimate its coefficients. The derivative of this polynomial is set to zero to obtain the stationary points corresponding to maxima or minima in the model. The model can be evaluated by looking at the goodness of fit between the model and the experimental data. Finally, experiments using the optimum conditions predicted by the model are carried out to validate the model.
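The optimisation step, in which the derivative of the fitted polynomial is set to zero, can be sketched for two coded factors as follows. This is a minimal illustration that assumes a second-order model y = b0 + b1·x1 + b2·x2 + b11·x1² + b22·x2² + b12·x1·x2 has already been fitted; the function names and the coefficient values are hypothetical and are not taken from any of the studies reviewed here.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Solve grad(y) = 0 for y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2.

    Setting both partial derivatives to zero gives the linear system
        2*b11*x1 +   b12*x2 = -b1
          b12*x1 + 2*b22*x2 = -b2
    which is solved here with Cramer's rule.
    """
    det = 4.0 * b11 * b22 - b12 ** 2
    if det == 0:
        raise ValueError("no unique stationary point")
    x1 = (-2.0 * b22 * b1 + b12 * b2) / det
    x2 = (-2.0 * b11 * b2 + b12 * b1) / det
    return x1, x2


def classify(b11, b22, b12):
    """Classify the stationary point from the determinant of the Hessian."""
    det = 4.0 * b11 * b22 - b12 ** 2
    if det > 0:
        return "maximum" if b11 < 0 else "minimum"
    if det < 0:
        return "saddle"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical fitted coefficients (coded units), for illustration only.
    coeffs = dict(b1=1.5, b2=0.0, b11=-1.0, b22=-1.0, b12=1.0)
    print(stationary_point(**coeffs))
    print(classify(coeffs["b11"], coeffs["b22"], coeffs["b12"]))
```

A stationary point is only useful as a predicted optimum when it is a maximum and lies inside the coded experimental region (between −1 and +1 for each factor); otherwise the model is extrapolating and the validation experiments described above become even more important.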

Figure 1. A typical DoE workflow in protein production. Case study A illustrates the optimization of recombinant lipase KV1 expression in E. coli [84] where a screening process was not required since the number of factors affecting this enzyme is not large (four factors). The four factors (A, B, C, D), therefore, underwent optimisation by Central Composite Design (CCD) under Response Surface Methodology (RSM), which resulted in a 3.1-fold increase in protein expression yield. Case study B describes the optimisation process for high yield production of recombinant human interferon-γ [85]. In this case, the number of factors involved is large (nine factors) and they were subjected to a screening process before optimisation. Four factors (X1, X2, X3, X7) out of nine were identified by Plackett-Burman Design (PBD) based screening to be the most influential and were subsequently used for further optimisation. A Box-Behnken Design (BBD), also under RSM, was selected to optimize the screened factors and increased the production of human interferon-γ up to 5.1-fold. Further details of these two case studies can be found in the references provided and similar cases are found in Tables 4 and 7.

DoE; a Brief Overview
DoE is a statistical technique used to plan experiments and analyse data using a controlled set of tests designed to model and explore the relationship between factors and observed responses [14]. This technique allows the researcher to use the minimum number of experiments, in which the experimental parameters can be varied simultaneously, to make evidence-based decisions [86]. It uses a mathematical model to analyse the process data, such as protein expression levels [87]. The model allows a researcher to understand the influence of the experimental parameters (inputs) on the response (outputs) and to identify a process optimum [88]. Furthermore, DoE software uses three-dimensional surface and contour plots to visualise the relationship between factors and responses [55,89]. In recombinant protein production, a DoE approach can significantly improve the efficiency of screening for the most influential experimental parameters (e.g., media composition and culture conditions) and of determining optimal experimental conditions [90].
The mathematical models employed in DoE define the process under study [91]. Screening designs such as Plackett-Burman Design are based on a first-order model [92], as shown in Equation (1):
$$Y = \beta_0 + \sum_{i} \beta_i X_i \qquad (1)$$
where Y is the response, β0 is the model intercept, βi is the linear coefficient and Xi is the level of the independent variable. A significance level of 5% (p-value threshold of 0.05) is commonly used to identify the most influential factors. The significance (p-value) of each variable is based on its effect on the response and is calculated using Student's t-test [85], as shown in Equation (2):
$$t(X_i) = \frac{E(X_i)}{S.E.} \qquad (2)$$
where E(Xi) is the effect of variable Xi and S.E. is the associated standard error. Factors with p-value < 0.05 are statistically significant while factors with p-value > 0.05 are not statistically significant (see Table 5 for more details). Statistically significant factors are subjected to further optimisation by Response Surface Methodology. A second-order polynomial equation is used at this stage, and the independent variables are coded using Equation (3) before being input into the model (see Section 5.4):
$$x_i = \frac{X_i - X_{cp}}{\Delta X_i} \qquad (3)$$
where xi is the dimensionless (coded) value of an independent variable; Xi is the real value of that variable; Xcp is the real value of the variable at the design centre point; and ∆Xi is the step change in the real value of variable i [93]. Replicates at the central point are required to check for the absence of bias between sets of experiments. The fit of the model is then evaluated through analysis of variance (ANOVA), which determines the significance of each term in the equation and estimates the goodness of fit in each case [94] (see Figure 5 and Table 9 for more details).
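The coding of factor levels and the effect-based t statistic described above can be sketched in a few lines. This is an illustrative example only: it assumes the coded value is (real value − centre point value) / step change and that the t statistic is the effect divided by its standard error; the function names, the yields and the standard error are hypothetical.

```python
def code_level(real_value, centre, step):
    """Dimensionless coded level: (X - X_cp) / delta_X."""
    return (real_value - centre) / step


def main_effect(coded_column, responses):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for x, y in zip(coded_column, responses) if x > 0]
    low = [y for x, y in zip(coded_column, responses) if x < 0]
    return sum(high) / len(high) - sum(low) / len(low)


def t_statistic(effect, standard_error):
    """t value for a factor: its effect divided by the associated standard error."""
    return effect / standard_error


if __name__ == "__main__":
    # Induction temperature examined between 20 and 40 degrees C (centre 30, step 10).
    temperatures = [20, 40, 20, 40]
    coded = [code_level(t, centre=30, step=10) for t in temperatures]

    # Hypothetical yields for the four runs (arbitrary units).
    yields = [10.0, 14.0, 12.0, 16.0]
    effect = main_effect(coded, yields)

    # Hypothetical standard error; in practice it comes from replicates or the model fit.
    print(coded, effect, t_statistic(effect, standard_error=0.5))
```

The p-value is then obtained by comparing the t statistic with the t distribution for the appropriate degrees of freedom, a step that DoE packages perform automatically.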

DoE Versus One-Factor-At-a-Time (OFAT)
DoE improves on the traditional OFAT approach. OFAT fails to account for variables interacting with, and influencing, each other, and it also requires significantly more experiments to converge on an optimum, all of which increases cost and time [95]. Figure 2 provides a brief comparison between DoE and OFAT.
In recombinant protein expression, where various independent variables do not always act in isolation, it is likely that their interaction effects can significantly influence protein production [96]. Therefore, it is necessary to use a controlled set of tests that can examine the effects of many interacting factors to achieve optimal expression [97].
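The effect of an interaction on an OFAT search can be illustrated with a toy example. The sketch below assumes a hypothetical yield function with a strong interaction between two coded factors; it is not a model of any real expression system, and the function names are illustrative. In a case this small OFAT actually uses one run fewer than the factorial design, so the example only demonstrates the missed optimum and the interaction estimate, not the saving in experiments.

```python
from itertools import product


def response(x1, x2):
    # Hypothetical yield model with a strong x1*x2 interaction (illustration only).
    return 5 - x1 - x2 + 4 * x1 * x2


def ofat(start=(1, 1)):
    """Flip one coded factor at a time, keeping a change only if the yield improves."""
    best = list(start)
    for i in range(len(best)):
        trial = list(best)
        trial[i] = -trial[i]
        if response(*trial) > response(*best):
            best = trial
    return tuple(best), response(*best)


def full_factorial():
    """Run all 2^2 level combinations and estimate the interaction effect."""
    runs = {point: response(*point) for point in product((-1, 1), repeat=2)}
    best = max(runs, key=runs.get)
    interaction = (runs[(1, 1)] + runs[(-1, -1)] - runs[(1, -1)] - runs[(-1, 1)]) / 2
    return best, runs[best], interaction


if __name__ == "__main__":
    print("OFAT:", ofat())
    print("Full factorial:", full_factorial())
```

Starting from the (+1, +1) baseline, each single-factor change lowers the yield, so OFAT stays at the baseline, whereas the factorial design also tests the (−1, −1) combination and quantifies the interaction.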

Figure 2. Comparison between Design of Experiments (DoE) and One-Factor-at-a-Time (OFAT) by examining the effect of two parameters, P1 (Parameter 1) and P2 (Parameter 2). (a) OFAT is performed using more experiments than DoE (each black dot represents an experiment) and does not identify the true optimum (indicated as a red oval). However, with the DoE approach (b) fewer experiments are used and the likelihood of finding the optimum conditions (in red) for the process being studied is high. With DoE the combined or interaction effect of P1 and P2 on the response can be identified and measured. The ovals indicate production yields: blue indicates the lowest yields, whereas red indicates the highest yields, where the optimum is found. The DoE approach also identifies a pathway to the optimum response (indicated by the arrow).

Defining a DoE Workflow to Optimise Recombinant Protein Production
Employing DoE to optimise the production of a recombinant protein can be divided into two main work packages, initial screening and subsequent optimisation. To evaluate all the factors that influence a production process, a wide-ranging experimental screening is required first. This first screening step will identify all factors that significantly influence recombinant protein production [98]. The second step in the workflow is to use a DoE optimisation design to achieve optimum production, focusing only on the factors identified through the initial screening design. A variety of DoE software packages such as MINITAB (Minitab Ltd., State College, PA, USA), JMP (SAS Institute, Cary, NC, USA) and Design Experts (Science Plus Group, Groningen, the Netherlands) are commercially available and provide a variety of factorial designs depending upon the objective of the experiment. Regardless of the statistical package used, the main steps of a typical DoE workflow include planning the test, screening and optimisation (detailed schematically in Figure 3).

Planning the Test; Selection of Factors and Associated Levels Influencing Recombinant Protein Production
The DoE workflow in protein production, as in any other DoE process optimisation, starts with planning the test [99]. This involves defining the objective of the study and identifying the factors involved and their associated levels (i.e., high, central and low). Preliminary experiments are recommended when knowledge of the effects of the factors on the experiment is not sufficient to set levels. The factors are input parameters that can be modified in the experiment and are referred to as the controllable factors. The levels of factors are fixed based on their working limits [82]. The most popular experimental designs are two-level designs, although more levels can be used depending upon the type of design and the objective of the study. Table 2 depicts a two-level experimental design.
Figure 3 (caption, steps 3 to 6 only). (3) The experimental screening design is selected based on the objectives of the study and the number of factors involved. (4) A mathematical model is built with certain conditions to meet the desired objectives (e.g., measurement of all the desired responses, process stability and accurate approximation by polynomial models). (5) The response data are analysed and visualised using plots for ease of data interpretation. At this stage, a reduced number of factors (i.e., the most influential) are retained for the subsequent optimisation phase. (6) Further optimisation can be carried out (via an optimisation DoE design).

Table 2. An example of a two-level experimental design having nine factors that are known to influence recombinant protein expression. In this case the nine factors relate to two experimental components: media composition and induction conditions. When planning the screening phase, the selected factors (yeast extract, tryptone, glycerol, NaCl, inoculum size, IPTG concentration, induction temperature, incubation time and pH, labelled X1 to X9 respectively) and associated levels (high, defined as +1, and low, defined as −1) are selected to cover the intended experimental space (i.e., to cover the productive range). The levels are defined as the range between the known working limits.

Screening Designs to Identify Factors that Significantly Affect Recombinant Protein Expression
Screening designs are used to devise a matrix using factors and levels as formulated in the planning stage [105]. By employing the statistical tools embedded in the DoE software, screening designs establish the relationships between variables and responses. The interaction effects between variables on a given response are also investigated [106]. In protein biotechnology, screening designs are mainly utilised to identify media composition and culture condition factors that significantly influence protein production [107]. Various researchers have explored the effects of both media components [94,107][108][109][110] and culture conditions [111,112] on protein expression. There are many different types of screening designs and their choice depends upon the nature of the experiment and the objective of the study. The classical screening designs include Full Factorial Designs, Fractional Factorial Designs and Plackett-Burman Designs. Current DoE software, such as JMP from the SAS Institute, provides additional screening designs such as Definitive Screening Designs and Custom Designs. The most common screening designs are compared in Table 3.

Full Factorial Design
When little is known about the effects of the factors on a response, a full factorial design is recommended. This design includes all combinations of all factor levels and provides a predictive model that includes the main effects and all possible interactions [113]. This design consists of two, or more, levels with experimental runs that encompass all possible combinations of these levels, across all factors. In a two-level full factorial design with k factors, 2^k experiments are required. Similar to other screening designs, Full Factorial Design can include centre points, randomisation and blocking variables to improve the efficiency of the design [14]. This approach has been used to screen for the most influential factors affecting recombinant protein production for a variety of proteins [114,115] (see Table 4).
Table 3. A comparison of DoE screening designs commonly used in optimizing recombinant protein production. The table lists the types of screening designs and the effects explained by the model, along with the number of factors and the associated number of runs (a run refers to an experiment). It should be noted that extra runs (such as those related to central points) can be added when required. Custom design is more flexible and allows the designer to select the number of experimental runs.
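The 2^k run count of a two-level full factorial design can be made concrete with a short sketch. The factor names below are illustrative assumptions, not the factors of any specific study cited here, and the function name is hypothetical.

```python
from itertools import product


def full_factorial_design(factor_names, levels=(-1, 1)):
    """Return every combination of coded levels, one dictionary per run."""
    return [
        dict(zip(factor_names, combination))
        for combination in product(levels, repeat=len(factor_names))
    ]


if __name__ == "__main__":
    # Three illustrative factors give 2^3 = 8 runs.
    design = full_factorial_design(["temperature", "IPTG", "pH"])
    for run_number, run in enumerate(design, start=1):
        print(run_number, run)
```

Because the number of runs doubles with every added factor, nine two-level factors would already require 2^9 = 512 runs, which is why reduced designs are used for screening when many factors are involved.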

Fractional Factorial Design (FFD)
FFD is a recommended screening design when a large number of factors are involved. This design runs only a fraction of the full factorial matrix, in order to reduce the initially large number of potential factors to a subset of the most effective ones, and is represented using the following notation:
$$2_R^{k-p}$$
where 2 represents the number of levels, k the number of factors, p the number of extra columns (generators) required and R the resolution of the method. The method resolution describes the degree to which the estimated main effects are aliased (confounded) with the estimated interactions [22,116,117].
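The notion of resolution can be illustrated with a half fraction of a four-factor design. The sketch below assumes the commonly used generator D = A·B·C, which gives a resolution IV design with 8 rather than 16 runs; the function names are hypothetical.

```python
from itertools import product


def half_fraction():
    """2^(4-1) design: full factorial in A, B, C with D generated as D = A*B*C."""
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]


def aliased(design, columns_1, columns_2):
    """True if the two effect columns (products of factor columns) match in every run."""
    def effect_column(run, indices):
        value = 1
        for index in indices:
            value *= run[index]
        return value

    return all(
        effect_column(run, columns_1) == effect_column(run, columns_2)
        for run in design
    )


if __name__ == "__main__":
    design = half_fraction()
    print(len(design), "runs")
    # Column indices: A=0, B=1, C=2, D=3
    print("AB aliased with CD:", aliased(design, (0, 1), (2, 3)))
    print("A aliased with BCD:", aliased(design, (0,), (1, 2, 3)))
    print("A aliased with B:", aliased(design, (0,), (1,)))
```

In this design the main effects are confounded only with three-factor interactions, while two-factor interactions are confounded with each other, which is the trade-off accepted in exchange for halving the number of runs.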

Plackett-Burman Designs (PBD)
PBD is often used as an alternative to fractional and full factorial designs because of its potential to reduce the gaps found in fractional designs and to strengthen the estimation of the main effects, which may be disregarded when full factorial designs are used [118][119][120][121][122].
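As an illustration of how compact a Plackett-Burman matrix is, the sketch below builds the classical 12-run design from cyclic shifts of the standard generator row (+ + − + + + − − − + −) followed by a final row of low levels, and checks that its columns are balanced and mutually orthogonal. The function names are hypothetical, and commercial DoE packages may order or randomise the runs differently.

```python
# Standard generator row for the 12-run Plackett-Burman design (coded levels).
PB12_GENERATOR = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]


def plackett_burman_12():
    """Eleven cyclic shifts of the generator row plus a final row of all -1."""
    n = len(PB12_GENERATOR)
    rows = [[PB12_GENERATOR[(j - i) % n] for j in range(n)] for i in range(n)]
    rows.append([-1] * n)
    return rows


def columns_orthogonal(design):
    """True if every pair of distinct columns has a zero dot product."""
    k = len(design[0])
    return all(
        sum(row[a] * row[b] for row in design) == 0
        for a in range(k)
        for b in range(a + 1, k)
    )


if __name__ == "__main__":
    design = plackett_burman_12()
    print(len(design), "runs for up to", len(design[0]), "factors")
    print("orthogonal columns:", columns_orthogonal(design))
    # For a nine-factor screen, the first nine columns would be assigned to the
    # factors and the remaining columns left as unassigned (dummy) columns.
    nine_factor_design = [row[:9] for row in design]
    print(nine_factor_design[0])
```

Twelve runs for nine factors compares with 2^9 = 512 runs for the corresponding full factorial design, at the cost of estimating main effects only.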

Definitive Screening Design (DSD) and Custom Design (CD)
DSD and CD are classes of screening designs that have potential applications in recombinant protein expression for assessing the impact of a large number of factors on a given response. DSD has recently been reported to be particularly advantageous as it allows the estimation not only of the main effects of individual factors but also of the interactions between factors and of non-linear effects such as quadratic effects (a term in which a factor interacts with itself), all with the minimum number of experimental runs [123]. CD enables a design to be tailored whilst simultaneously minimising resource usage: it is highly flexible and more cost-effective than other screening designs. It allows for the best use of the experimental budget and tackles a wide range of challenges, with the capability to model effects including centre points and replicates. However, in most cases this design allows for the estimation of main effects only. Table 4 summarises the most common screening designs, along with their roles in identifying the most influential independent factors, in recombinant protein production.
Table 5. Identification of the statistically significant factors during a screening process using a Fractional Factorial Design. The table depicts the effect, positive or negative, and the p-value for the seven factors examined (labelled X1 to X7 respectively). The effect of each factor, positive (+) or negative (−), is identified during the analysis stage using the statistical formula embedded in the DoE software used (JMP in this example). Interaction effects are also identified (e.g., X5*X1 and X3*X7, where * indicates an interaction). The p-value of each factor is also shown, at the significance level of 0.05. In this example, the highlighted factors (X3, X6, X1) were identified as the most influential based on their high effects (−1.11273, 0.2252, 0.17492) and p-values < 0.05 (0.001, 0.0143, 0.0296). Thus, only factors X3, X6 and X1 are statistically significant at the level of 0.05, with X3 having a negative effect while X6 and X1 have positive effects. The other factors (X2, X4, X5, X7) and the interactions X5*X1 and X3*X7 are not statistically significant.
The screening process identifies the most influential factors on the process under investigation (e.g., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on finding the variable levels that result in an optimal yield [132,133]. Figure 4 describes the benefit of

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on finding the variable levels that result in an optimal yield [132,133]. Figure 4, describes the benefit of  The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on finding the variable levels that result in an optimal yield [132,133]. Figure 4, describes the benefit of  Table 5. Identification of the statistically significant factors during a screening process using a Fractional Factorial Design. The table depicts the effect, positive or negative and p-value for seven factors examined (labelled X1 to X7 respectively). The effect of each factor, positive (+) or negative (−) is identified during the analysis stage using the statistical formula imbedded in DoE software used (JMP in this example). Interaction effects are also identified (e.g., X5*X1 and X3*X7; where * indicates an interaction). The p-value of each factor is also shown, at the significance level of 0.05. In this example, the highlighted factors, (X3, X6, X1), were identified as the most influential based on their high effects (− The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on  Table 5. Identification of the statistically significant factors during a screening process using a Fractional Factorial Design. The table depicts the effect, positive or negative and p-value for seven factors examined (labelled X1 to X7 respectively). The effect of each factor, positive (+) or negative (−) is identified during the analysis stage using the statistical formula imbedded in DoE software used (JMP in this example). Interaction effects are also identified (e.g., X5*X1 and X3*X7; where * indicates an interaction). The p-value of each factor is also shown, at the significance level of 0.05. In this example, the highlighted factors, (X3, X6, X1), were identified as the most influential based on their high effects (− The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on  Table 5. Identification of the statistically significant factors during a screening process using a Fractional Factorial Design. The table depicts the effect, positive or negative and p-value for seven factors examined (labelled X1 to X7 respectively). The effect of each factor, positive (+) or negative (−) is identified during the analysis stage using the statistical formula imbedded in DoE software used (JMP in this example). Interaction effects are also identified (e.g., X5*X1 and X3*X7; where * indicates an interaction). The p-value of each factor is also shown, at the significance level of 0.05. In this example, the highlighted factors, (X3, X6, X1), were identified as the most influential based on their high effects (− The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation  The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation  The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation  The screening process identifies most influential factors on the process under investigation (i.e., X1 and X6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation The rationale of screening designs lies in identifying the variables that are statistically significant in influencing protein production among a large number of potentially important variables [128,129]. Table 5 illustrates how screening analysis identifies statistically significant factors based on their effect and probability values.
The screening process identifies most influential factors on the process under investigation (i.e., X 1 and X 6 in the example shown in Table 5) and thus paves the way for effective optimisation by reducing the number of factors to be optimised in the third work package of the DoE workflow [130].
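The selection rule described for Table 5 (keep the terms whose p-value is below the significance level, then rank them by the size of their effect) can be sketched in a few lines of Python. This is only an illustration, not the procedure used by JMP: the effects and p-values for X3, X6 and X1 are those quoted in the Table 5 caption, whereas every other value, and the names `screening_results` and `significant_factors`, are hypothetical placeholders introduced here.

```python
# Illustrative screening summary in the style of Table 5.
# Values for X3, X6 and X1 are those quoted in the Table 5 caption;
# every entry marked "hypothetical" is a placeholder, not data from the review.
ALPHA = 0.05  # significance level used in the example

screening_results = {
    "X3": (-1.11273, 0.0010),
    "X6": (0.22520, 0.0143),
    "X1": (0.17492, 0.0296),
    "X2": (0.05000, 0.4100),      # hypothetical
    "X4": (-0.04000, 0.5200),     # hypothetical
    "X5": (0.03000, 0.6300),      # hypothetical
    "X7": (-0.02000, 0.7400),     # hypothetical
    "X5*X1": (0.01500, 0.8000),   # hypothetical
    "X3*X7": (-0.01000, 0.8800),  # hypothetical
}


def significant_factors(results, alpha=ALPHA):
    """Return (name, effect, sign) for terms with p < alpha, largest |effect| first."""
    selected = [
        (name, effect, "+" if effect > 0 else "-")
        for name, (effect, p_value) in results.items()
        if p_value < alpha
    ]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for name, effect, sign in significant_factors(screening_results):
        print(f"{name}: effect {effect:+.5f} ({sign})")
```

Only the terms returned by `significant_factors` would be carried forward to the optimisation stage; in practice the effects and p-values come from the model fitted by the DoE software.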

Optimisation Designs to Maximise Recombinant Protein Production in Prokaryotic Systems
As a collection of statistical design and numerical optimisation techniques [131], optimisation uses the reduced number of variables identified in the previous screening process and focuses on finding the variable levels that result in an optimal yield [132,133]. Figure 4 describes the benefit of carrying out an optimisation process after a screening process has identified a small number of key variables.
Response Surface Methodology (RSM) is the most popular optimisation method [134]. It consists of mathematical and statistical techniques used to build empirical models capable of exploring the process space and studying the relationship between the response and the process variables in order to find the optimal response [99,133,135]. In general, for a given number of factors, RSM requires more runs than screening designs; the number of factors to consider should therefore first be reduced through an appropriate screening process. Central composite designs (CCD) and Box-Behnken designs (BBD) are the two major response surface designs commonly used in recombinant protein optimisation [136].

Central Composite Design (CCD)
CCDs are favoured in process optimisation because they allow the coefficients of a second-degree polynomial to be determined, fitting a full quadratic model during response surface analysis [127]. CCD has been widely used in optimising protein production processes, specifically with the aim of increasing productivity and solubility [137]. There are different types of central composite design, such as uniform precision and orthogonal/block designs. However, a common standard characteristic is the number of runs per design [138], which depends on the number of factors (see Table 6). Central composite uniform precision designs are used to provide protection against bias in the regression coefficients, while central composite orthogonal designs can be used to avoid correlations between the coefficients of variables [139]. CCD has been extensively used to optimise the production of recombinant proteins (see Table 7).
Table 6. Common CCD components and the possible total number of runs. Factorial, axial and central points are the main components of a typical CCD and the total number of runs is dictated by the number of factors being tested. As the number of factors increases, the number of component points increases and so does the total number of runs. In some cases, CCDs do not contain axial points, especially when the variance of model prediction is not suspected [140].

Box-Behnken Design (BBD)
BBDs are also a class of response surface designs; however, they differ from CCDs in their design structure and generally require fewer runs. For example, a CCD with 4 factors requires 31 runs (experiments), whereas a BBD has only 27 runs for the same number of factors. For 5 factors, a CCD has 52 runs while a BBD has 46 runs. Fewer runs can result in significant time and cost savings in an optimisation process.
In optimisation experiments, BBD is widely used as a good design for fitting the quadratic model with fewer experiments [141]. Several studies show that BBDs have contributed to production increases for recombinant proteins (see Table 7).
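The run counts quoted above can be checked with simple arithmetic. The sketch below assumes the usual full-factorial CCD count, N = 2^k + 2k + Cp (factorial, axial and centre points), and the standard Box-Behnken edge-point counts for 3 to 5 factors. The centre-point numbers passed in the example are not stated in the text; they are inferred here by subtraction from the quoted totals, and the function names are hypothetical.

```python
def ccd_runs(k, centre_points):
    """Runs in a full-factorial CCD: 2**k factorial + 2*k axial + centre points."""
    if k < 2:
        raise ValueError("a CCD needs at least two factors")
    return 2 ** k + 2 * k + centre_points


# Non-centre (edge midpoint) runs of the standard Box-Behnken designs.
# Only 3 to 5 factors are tabulated here; larger designs follow other constructions.
BBD_EDGE_RUNS = {3: 12, 4: 24, 5: 40}


def bbd_runs(k, centre_points):
    """Runs in a standard Box-Behnken design with 3, 4 or 5 factors."""
    if k not in BBD_EDGE_RUNS:
        raise ValueError("this sketch only tabulates BBDs for 3 to 5 factors")
    return BBD_EDGE_RUNS[k] + centre_points


if __name__ == "__main__":
    # Centre-point counts below are assumptions inferred from the quoted totals.
    print("CCD, 4 factors:", ccd_runs(4, centre_points=7))
    print("BBD, 4 factors:", bbd_runs(4, centre_points=3))
    print("CCD, 5 factors:", ccd_runs(5, centre_points=10))
    print("BBD, 5 factors:", bbd_runs(5, centre_points=6))
```

Because the totals depend on how many centre points are replicated (and, for a CCD, on whether a fractional factorial core is used), the saving offered by a BBD should be judged for the specific design generated by the software.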

Summary and Choice of Optimisation Methods
Both CCD and BBD optimisation methods are widely used; the choice depends on the number of factors and the objectives of the study (see Figure 1). A characteristic common to all response surface designs is the use of a second-order polynomial model to describe the process, which is needed where interaction terms introduce curvature into the response function and a first-order equation is inadequate to fit the model [159,160]. CCD is the most preferred RSM design [16,161] because it contains full factorial or fractional factorial points, with the potential to add central points to evaluate the experimental error and axial points to check the variance of the model [14,140]. The number of runs (N) in CCD is calculated using Equation (4).
$$N = 2^{k} + 2k + C_{p} \tag{4}$$
where k is the number of factors and Cp the number of centre points [162]. Table 8 is an example of a two-level CCD with two centre point replicates, along with responses such as actual, predicted and residual values.
Table 8. Central Composite Design of four independent factors (labelled X1, X2, X3, X4) studied at two levels (+1 and −1), including two central point replicates (0 and 0). The table also shows the different types of response commonly found in an optimisation process: (1) actual data refers to experimental results; (2) predicted data are generated by software based on the design and actual results; (3) the residuals are the difference between actual and predicted data.

Table 8 column headings: Runs; Coded Values (X1, X2, X3, X4); Responses (Actual: experimental response; Predicted: predicted response data; Residuals: residual data).
Response data (actual, predicted and residual values) are utilised during the optimisation analysis to evaluate the validity of the model and determine the optimum.

Analysis and Interpretation of Optimisation Data
Regardless of the DoE design employed, the goal is to provide a methodology for conducting controlled experiments with the aim of identifying the vital process inputs and investigating interactions between them [163]. At a screening level, after the experimental data are entered, the DoE software generates a variety of graphs that are used to interpret the results obtained. These may be scatter plots, histograms, bar charts and Pareto charts that allow the researcher to identify the distribution of the data and statistical significance of the variables tested [85]. Different screening analysis methods have been used in the field of protein production [77,92,112,164]. Figure 5 illustrates a typical DoE data analysis and interpretation route from data visualisation, through experiment validation to conclusion.

Figure 5. A typical DoE analysis route from initial Experiments to validation and conclusions. The rationale for data analysis is to evaluate the effects of variables on response. Graphical Representation shows how the data are distributed. The Statistical Analysis and Probability stage identifies variables that are statistically significant. This will identify variables that are important to bring forward to the subsequent optimisation step based on their statistical significance. The Visualization and Interpretation stage will focus on representational analysis that identifies optimal levels.

Evaluation of Experimental Design and Predictive Model Validation
For RSM analysis, the goals are to (i) develop a predictive model that describes how the process inputs influence the process output and (ii) determine the optimal settings of the inputs [165,166]. Following the completion of the optimisation experiments, the results are used to fit a second-order polynomial equation (Equation (5)) [85].
$$Y_i = \beta_0 + \sum \beta_i X_i + \sum \beta_{ii} X_i^2 + \sum \beta_{ij} X_i X_j \tag{5}$$
where Yi is the predicted response, Xi and Xj are the independent variables, and β0, βi, βii and βij are the regression coefficients for the intercept, the first-order (linear) terms, the quadratic terms and the interaction terms, respectively [167,168]. The fit of the model is then evaluated through analysis of variance (ANOVA, Table 9), which compares the variation due to the change in the combination of variable levels with the variation due to random errors [14,169].
The coefficient of determination (R2) indicates how well the model fits the data. The closer R2 is to 1, the better the model describes the experimental data [21]. The adjusted R2 is used to check the adequacy of the model by measuring the amount of variation about the mean explained by the model; the closer the value is to 1, the better the model [130]. For example, in Table 9, R2 = 0.9971 indicates that the regression of the fitting equation is significant and that discrimination is adequate, with only 0.29% of the total variation not explained by the fitting equation [142]. When the R2, Adj-R2 and Pred-R2 values (99.71%, 99.63% and 99.48%, respectively, in Table 9) are in good agreement with each other, this provides confidence in the accuracy of the model [156].
Additionally, the p-value and signal-to-noise ratio are used to estimate the quality of the model. For a significant model, a p-value < 0.05 is desirable [170]. Adequate precision measures the signal-to-noise ratio; a ratio greater than 4 indicates an adequate model [171], and this criterion is commonly used in protein production optimisation [172,173]. Furthermore, the lack-of-fit p-value and the plot of observed versus predicted values are used to estimate the quality of the model. With a good model, the lack-of-fit p-value should be > 0.05 [168], as shown in Table 9. Finally, the data points should fall close to the straight line on the observed versus predicted plot [145], as shown in Figure 6.
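The R2 and adjusted R2 statistics discussed above can be computed directly from the actual and predicted responses. The following is a minimal sketch using the standard definitions; the response values are hypothetical and are not taken from Table 9, and the function names are introduced here for illustration only. Predicted R2, which requires leave-one-out (PRESS) residuals, is not covered.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_runs, n_terms):
    """Adjusted R2; n_terms counts the model terms excluding the intercept."""
    return 1.0 - (1.0 - r2) * (n_runs - 1) / (n_runs - n_terms - 1)


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    r2 = r_squared(actual, predicted)
    print(f"R2 = {r2:.4f}")
    print(f"Adj-R2 = {adjusted_r_squared(r2, n_runs=4, n_terms=1):.4f}")
    # An R2 of 0.9971 leaves 1 - 0.9971 = 0.29% of the variation unexplained.
    print(f"Unexplained at R2 = 0.9971: {(1 - 0.9971) * 100:.2f}%")
```

Adjusted R2 falls below R2 as terms are added without a matching gain in fit, which is why close agreement between the two is read as a sign that the model is not over-parameterised.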

Optimum Determination
Once the predictive model has been validated, it can be used to determine the optimised parameters. The statistical tools embedded in DoE software are used to generate 3D graphs, called surface and contour plots, that visually describe the relationship between the variables and the response [174,175]. The 3D surface and contour graphs are generated for a combination of two test variables, with the others maintained at their respective zero levels [176] (see Figure 7). Surface, contour and residual plots, along with ANOVA, are the main optimisation analysis tools commonly used to determine optimum levels for high yields of recombinant protein [20,[177][178][179].

Figure 7 [163]. The figure depicts the two-factor interaction (in this case the two factors explored are glucose and culturing temperature) where one factor influences the response of another factor. It also shows the visualisation of optimum levels. The colour scale indicates the level of lipase activity (IU/mL), where red indicates the region of optimal yield, yellow indicates medium yield, and green indicates low yield. In this case, the optimal enzyme activity (33 IU/mL) was achieved at a culture temperature between 30 °C and 34 °C and a glucose concentration between 40 g/mL and 50 g/mL. Image used with permission.
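For two factors, the optimum that a surface or contour plot shows graphically can also be located analytically from the fitted second-order model, by setting both partial derivatives to zero. The sketch below assumes a model of the form y = b0 + b1·x1 + b2·x2 + b11·x1² + b22·x2² + b12·x1·x2 in coded units; the coefficients are hypothetical and do not come from any study cited here, and DoE software typically uses its own numerical optimiser rather than this closed form.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Coded (x1, x2) where the gradient of the two-factor quadratic model is zero.

    Solves  2*b11*x1 + b12*x2 = -b1  and  b12*x1 + 2*b22*x2 = -b2.
    """
    det = 4.0 * b11 * b22 - b12 ** 2
    if det == 0:
        raise ValueError("no unique stationary point (singular system)")
    x1 = (b12 * b2 - 2.0 * b22 * b1) / det
    x2 = (b12 * b1 - 2.0 * b11 * b2) / det
    return x1, x2


def is_maximum(b11, b22, b12):
    """True when the fitted surface is concave, so the stationary point is a maximum."""
    return b11 < 0 and 4.0 * b11 * b22 - b12 ** 2 > 0


def predict(b0, b1, b2, b11, b22, b12, x1, x2):
    """Predicted response of the two-factor second-order model."""
    return b0 + b1 * x1 + b2 * x2 + b11 * x1 ** 2 + b22 * x2 ** 2 + b12 * x1 * x2


if __name__ == "__main__":
    # Hypothetical fitted coefficients in coded units, for illustration only.
    coeffs = dict(b0=30.0, b1=2.0, b2=1.0, b11=-4.0, b22=-2.0, b12=1.0)
    x1, x2 = stationary_point(2.0, 1.0, -4.0, -2.0, 1.0)
    print(f"stationary point (coded): x1 = {x1:.4f}, x2 = {x2:.4f}")
    print("maximum:", is_maximum(-4.0, -2.0, 1.0))
    print(f"predicted response: {predict(x1=x1, x2=x2, **coeffs):.4f}")
```

A stationary point is only useful if it is a maximum and lies inside the experimental region (roughly −1 to +1 in coded units, or within the axial distance for a CCD); otherwise the best setting lies on the boundary and further experiments may be needed.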

Conclusions; Getting It 'Just Right'
DoE offers many choices of screening and optimisation designs that improve on traditional optimisation methodologies, such as one-factor-at-a-time. The statistical approach offered by DoE has proven to be applicable in protein biotechnology, effectively investigating media composition and culture condition factors in recombinant protein production. DoE's ability to identify the most influential factors in recombinant protein expression through screening designs, and to identify the factor levels that give the maximum yield, has considerably enhanced the production of soluble, active recombinant protein. With the recent development of more flexible screening and optimisation designs and enhancements in computational processing, DoE will continue to find applications in biotechnology, in recombinant protein production and beyond.

Conflicts of Interest:
The authors declare no conflict of interest.