
Generation of Dissimilar Alternative Product Formulations Using Graphs

Fernando P. Bernardo
GEPSI-PSE Group, Chemical Process Engineering and Forest Products Research Centre (CIEPQPF), Department of Chemical Engineering, University of Coimbra, Rua Sílvio Lima, Pólo II, Pinhal de Marrocos, 3030-790 Coimbra, Portugal
Processes 2023, 11(11), 3152
Submission received: 5 October 2023 / Revised: 24 October 2023 / Accepted: 1 November 2023 / Published: 4 November 2023
(This article belongs to the Section Chemical Processes and Systems)


Abstract

In this work, alternative product formulations are represented as a bipartite graph, and mixed-integer programming is used to find a graph partitioning that identifies the most dissimilar formulations within a potentially large design space. The approach fully integrates available models for product properties and known heuristic rules for product formulation. The set of dissimilar alternative formulations thus found constitutes the best exploratory plan of experiments given the available knowledge. A cosmetic emulsion example is explored, in which an unspecified number of ingredients has to be chosen from a pool of 32 candidates, giving around half a million possible ingredient combinations. With the tools proposed herein, one can identify, in a few minutes of computational time, small sets of alternative formulations that adhere to available models and known heuristic rules, have maximum dissimilarity, and are optimal regarding a specific product design objective (cost or other).

1. Introduction

Formulated products are important across several economic sectors (e.g., the food, cosmetics and personal care, pharmaceutical, and agrochemical sectors), and their design is well recognized as one of the important topics within the broader field of chemical product design [1,2,3], one that still poses several open problems and needs more systematic methodologies. One of the key hurdles in this regard is the lack of reliable quantitative property models relating product composition and microstructure to the physico-chemical properties valued by customers [1,4,5,6,7]. This hinders a systematic search across the design space (ingredients, their amounts, and manufacturing process) of the kind that is possible across a molecular domain defined by fundamental units (often groups of atoms), known as computer-aided molecular design (CAMD) [8,9,10]. It is therefore important to consider how other, less structured knowledge, in the form of heuristic rules (e.g., typical combinations of ingredients and their amounts) and experimental databases (regarding pure ingredients and previous successful formulations), can be integrated with available property models.
Zhang et al. [11] proposed such a framework but without systematically integrating property models and heuristic-based procedures, which may lead to suboptimal solutions (for instance, designing a solvent mixture (small molecules) using CAMD tools and then, on top of this, selecting an adequate surfactant (larger molecules) using heuristic rules). Later, we proposed a fully integrated approach [12] that uses propositional logic to convert heuristic rules into algebraic constraints, which are then incorporated side by side with quantitative property models in a single mixed-integer optimization formulation. The method is valid for formulations with any number $n$ of ingredients to be chosen from a list of $m$ possible ingredients ($m > n$). Yet, the overall product model (property models + heuristics) is often uncertain and may even be incomplete, describing only a subset of the important performance–composition relationships. Hundreds or thousands of alternative product formulations may therefore comply with such an incomplete model. In other words, the feasible design space, although reduced by the available property models and heuristic rules, is still very large, making it difficult to identify which smaller set of formulations should proceed to testing and refinement.
In this study, we extend the method to incorporate the selection of a relatively small set of the most dissimilar formulations taken from a potentially large feasible design space. Dissimilarity is used as a selection criterion since one wants to determine a first exploratory plan of experiments (herein, “dissimilarity” corresponds to “space filling” in classical Design of Experiments (DoE) methods). For now, we adopt a simple measure of dissimilarity solely based on the number of ingredients shared by a set of alternative formulations, independently of ingredient concentration. To calculate such a measure, we represent a set of alternatives as a bipartite graph and use graph partitioning tools to evaluate the number of external edges. These are edges whose removal transforms the original graph into a set of disconnected subgraphs, with each one representing an alternative product formulation. The number of external edges is then the adopted measure of similarity (the higher the number of external edges, the more similar is the set of formulations). Then, the overall problem is to find the graph with the fewest external edges within the feasible design space. This will be formulated as a single mixed-integer optimization problem.
In a complete DoE programme, results from a first exploratory plan should be fed back to the front of the process and used to improve the product model. A new set of experiments may then be generated, and the process is repeated until a certain level of knowledge/optimality is attained (see, e.g., [13] for a review on adaptive DoE using Bayesian optimization). In this work, we only focus on generating the first set of experimental points, given the initial available product model. Herein, the overall process is not discussed, nor is the way the additional experimental information is fed back to the product model.
The design of space-filling sets for product formulation (with $n$ ingredients chosen from a pool of $m$ available ingredients) is still considered a challenging problem since the set is generated explicitly in the space $\{0,1\}^m$; thus, set size increases exponentially with $m$. The problem is even more complicated if one wants to investigate different concentrations of ingredients [14]. For instance, with $m = 30$ and $n = 7$, which is a common problem size in product formulation, a full factorial design has $C_1^{30} + C_2^{30} + \cdots + C_7^{30} \approx 3$ million possible formulations. In our method, this dimensionality problem is partially avoided since implicit enumeration is used when solving the mixed-integer optimization problem and the search space is reduced by the available product models. A thorough comparison between efficient space-filling methods (e.g., sampling based on low-discrepancy sequences) and our methodology (which may be seen as an implicit space-filling design restricted by available product models) has yet to be performed.
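The size of the full factorial design quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `count_formulations` is ours and does not appear in the paper.

```python
from math import comb

def count_formulations(m: int, n_max: int) -> int:
    """Number of ingredient subsets of size 1..n_max drawn from a pool of m."""
    return sum(comb(m, k) for k in range(1, n_max + 1))

total = count_formulations(30, 7)
print(total)  # 2804011, i.e. roughly 3 million
```

Each term `comb(30, k)` counts the formulations with exactly `k` ingredients, ignoring ingredient concentrations.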
Graphs have long been used in chemistry and chemical engineering and for different purposes (molecular modelling, chemical reaction networks, heat exchanger networks, process synthesis) [15,16]. As far as we know, this work is the first of its kind to use graphs to represent a set of alternative product formulations, or, more simply, to represent different combinations of n ingredients taken from a pool of m available ingredients.
This paper is an extended version of a shorter paper presented at ESCAPE 29 [17], including new aspects such as the simultaneous handling of dissimilarity of alternatives and an overall product design objective. The remainder of this paper is organized as follows. First, the problem of generating alternative formulations within a feasible design space is formulated (Section 2). Then, the graph (and matrix) representation of those alternatives is presented (Section 3). Next, the core problem of generating dissimilar formulations is formulated (Section 4). Finally, a cosmetic emulsion example is provided (Section 5).

2. Generation of Alternative Product Formulations

Let $C = \{c_1, \ldots, c_{N_C}\}$ be an ordered set containing all available ingredients, from which a subset is to be selected for the product formulation. Then, any product formulation may be described by a vector $y$ of binary variables (encoding the presence or absence of each ingredient) and a corresponding vector $x$ of mass fractions. Both $x$ and $y$ are ordered vectors with $N_C$ elements, in the same order as in set $C$.
Let $p$ be the vector of product performance metrics with target values $p^*$, meaning that product quality is evaluated in terms of the deviation of $p$ from the target $p^*$. Property models are any relationships between metrics $p$ and product composition (herein represented by the set of equations $h(x, y, p) = 0$). Less structured knowledge is often available in the form of heuristic rules for product formulation, which may be simple rules regarding which components should be selected and in what amount for a given desired effect, or more complex rules involving logical conditions. In any case, these heuristic rules can be modelled using propositional logic and additional binary variables $z$ [12], resulting in a set of (often linear) algebraic restrictions. Let $g(x, y, p, z) \le 0$ represent these heuristic-related restrictions and other problem-specific restrictions. Finally, let $O$ be an objective function to be minimized, accounting for both product quality and cost.
The problem of optimal product formulation may then be stated as the following optimization problem, labelled as Problem (P1):
$$\min_{x,\,y} \; O(x, y, p, p^*) \qquad (1)$$
$$\text{s.t.} \quad h(x, y, p) = 0 \qquad (2)$$
$$g(x, y, p, z) \le 0 \qquad (3)$$
Due to model incompleteness and uncertainties, one wants to find not a single “optimal” solution but instead an ordered set F of solutions with increasing value of O . This may be obtained by successively solving (P1) to global optimality and imposing cuts in vector y to prohibit previous solutions [18]. However, the set F thus generated may include very similar solutions, differing for instance in the existence of only one component. Moreover, it is not certain what should be the size of this set so that significantly different alternatives are captured. In the next section, a graph (and matrix) representation of any set of alternative formulations is proposed, and then, in Section 4, a graph partitioning technique is used to expand formulation (P1) to generate a set F of formulations with maximum (or close to maximum) dissimilarity.
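The ranking procedure with cuts can be sketched on a toy problem. The sketch below is not the MILP implementation used in the paper: it replaces the solver with brute-force enumeration, the ingredient costs and the feasibility rule are made up, and it assumes the cut takes the usual integer-cut form, $\sum_{c \in S} y_c - \sum_{c \notin S} y_c \le |S| - 1$, where $S$ is the set of ingredients in a previous solution.

```python
from itertools import product

# Hypothetical toy data: cost per ingredient (not from the paper).
COST = [3.0, 1.0, 2.2, 4.5]

def feasible(y):
    """Toy stand-in for h = 0, g <= 0: at least two ingredients are chosen."""
    return sum(y) >= 2

def violates_cut(y, previous):
    """Integer cut that excludes exactly the vector `previous`."""
    support = sum(previous)
    lhs = sum(yc if pc == 1 else -yc for yc, pc in zip(y, previous))
    return lhs > support - 1

def ranked_solutions(n_solutions):
    """Re-solve the toy problem repeatedly, cutting off earlier optima."""
    found = []
    for _ in range(n_solutions):
        best = None
        for y in product((0, 1), repeat=len(COST)):
            if not feasible(y) or any(violates_cut(y, p) for p in found):
                continue
            cost = sum(c * yc for c, yc in zip(COST, y))
            if best is None or cost < best[0]:
                best = (cost, y)
        if best is None:  # no feasible solution left
            break
        found.append(best[1])
    return found

ranking = ranked_solutions(3)
print(ranking)  # [(0, 1, 1, 0), (1, 1, 0, 0), (1, 0, 1, 0)]
```

Note how the second and third solutions each share an ingredient with the cheapest one, which is the kind of near-duplicate ranking the text warns about.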

3. Matrix and Graphical Representations of a Set of Product Formulations

Let $F = \{f_1, \ldots, f_{N_F}\}$ be a set of $N_F$ alternative product formulations, differing in the components they contain and not considering differences in component mass fractions. Set $F$ may then be represented by a list of vectors $y$ organized in a matrix $Y = (y_{fc})_{N_F \times N_C}$, with $y_{fc} = 1$ if component $c$ is present in formulation $f$ and $y_{fc} = 0$ if that component is absent, for all $c \in C$ and all $f \in F$. This matrix representation is illustrated on the left side of Figure 1 for a set of 7 formulations using a total of 18 components.
Matrix $Y$ has an equivalent representation as a bipartite graph $G = (C, F, E)$, with vertex sets $C$ and $F$ and edge set $E$. Edge $fc$ exists if and only if the corresponding matrix entry $y_{fc} = 1$; thus, the number of edges of graph $G$ is the number of nonzeros in $Y$ [19]. In graph theory, matrix $Y$ is called the biadjacency matrix of the bipartite graph $G$. The right side of Figure 1 shows the bipartite graph equivalent to the matrix $Y$ on the left.
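The correspondence between the matrix and the bipartite graph can be made concrete with a small sketch. The three formulations below are a made-up toy set (not the one in Figure 1), and the helper names `biadjacency` and `edges` are ours.

```python
def biadjacency(formulations, components):
    """Return the 0-1 matrix Y (rows: formulations, columns: components)."""
    index = {c: j for j, c in enumerate(components)}
    Y = [[0] * len(components) for _ in formulations]
    for i, ingredients in enumerate(formulations):
        for c in ingredients:
            Y[i][index[c]] = 1
    return Y

def edges(Y, components):
    """Edge list of the bipartite graph: one edge (f, c) per nonzero y_fc."""
    return [(f"f{i + 1}", components[j])
            for i, row in enumerate(Y)
            for j, y in enumerate(row) if y == 1]

components = ["c1", "c2", "c3", "c4", "c5"]
formulations = [{"c1", "c2"}, {"c2", "c3", "c4"}, {"c5"}]  # hypothetical toy set
Y = biadjacency(formulations, components)
E = edges(Y, components)
print(Y)       # [[1, 1, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 1]]
print(len(E))  # 6 edges, one per nonzero of Y
```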

4. Generation of Sets of Dissimilar Product Formulations

Let $F$ be a set of product formulations represented by a 0–1 matrix $Y$, as defined above. The problem of partitioning $F$ into $N_P$ clusters may be stated as follows: “find row and column permutations of $Y$ that yield $N_P$ diagonal blocks while minimizing the number of external elements ‘1’ outside the diagonal blocks”.
The matrix on the left side of Figure 1 is an example of the result of such a matrix rearrangement, with three diagonal blocks corresponding to three identified clusters, as well as five external elements outside the diagonal blocks. The second cluster, for instance, contains formulations $f_4$ and $f_5$ and components $c_9$ to $c_{13}$. Component $c_3$ does belong to formulation $f_5$ but is an external element since it does not belong to the second cluster. In general, each external element corresponds to a component that is shared or, more precisely, to a component from one cluster that is used in a formulation belonging to a second cluster. The number of external elements is thus a measure of similarity between clusters.
The right side of Figure 1 shows the same clustering but in graph format. The external elements of matrix $Y$ here correspond to external edges, which are those connecting vertices located in different clusters. External edges are also called cut edges, since their removal decomposes the graph into $N_P$ disconnected subgraphs. For this reason, the set of external edges is called an edge separator.
In graph language, the above-stated problem of partitioning $F$ is as follows: “find the minimum number of edges whose removal decomposes the bipartite graph $G = (C, F, E)$ into $N_P$ disconnected subgraphs (also called partitions or clusters)”. In the graph literature, this is known as the problem of graph partitioning by edge separator (GPES).
Sparse matrix rearrangement is a less studied problem than graph partitioning. In addition, the former is often solved by first translating the matrix into an equivalent graph and then applying graph partitioning methods [19]. Therefore, graph partitioning is the technique used in this work to find dissimilar formulations, although the matrix representation is perhaps clearer and thus will be used to present results.
The problem of graph partitioning is an NP-complete combinatorial optimization problem that has been well studied in multiple contexts, including parallel computing, sparse matrix computations, integrated circuit design, biological and social networks, and data mining.
For small graphs, exact solutions of the graph partitioning problem can be obtained using integer programming. Above some hundreds of vertices, computational time becomes prohibitive; thus, it is preferable to use heuristic algorithms. There are several of these available, each with different performances in terms of computational time versus quality of solutions (multilevel algorithms are perhaps the most well known [20,21,22]; Markov cluster algorithms, based on random walks in graphs, are also an interesting alternative [23,24]). In the case studied here, the number of vertices is at most $N_C + N_F$, which is a modest graph size that can, in principle, be easily handled by integer programming. This is therefore the technique we will use here, specifically the standard 0–1 formulation of Boulle [25], transposed to the particular (and simpler) case of a bipartite graph.
It is now time to pose our central problem: “given a set $C$ of available components, find $N_F$ dissimilar formulations obeying the product design restrictions of the above-listed Problem (P1) ($h = 0$ and $g \le 0$)”. In graph language, this is the problem of (i) constructing the bipartite graph $G = (C, F, E)$, with edges $fc$ still to be decided, and (ii) then partitioning the graph via an edge separator into $N_F$ clusters, with each cluster having one and only one formulation ($N_P = N_F$). Two sets of binary variables are thus needed, $y_{fc}$ and $v_{fc}$, for all $c \in C$ and all $f \in F$. In the first case, $y_{fc} = 1$ if component $c$ is chosen to be in formulation $f$, while in the second case, $v_{fc} = 1$ if component $c$ belongs to cluster $f$ (but not necessarily to formulation $f$).
The total number of edges is thus $N_E = \sum_{(f,c)} y_{fc}$, and the number of internal edges is $N_{IE} = \sum_{(f,c)} e_{fc}$, with $e_{fc} = y_{fc} \times v_{fc}$ (in Figure 1, $f_2 c_3$ is an example of an internal edge). The number of external edges, which is the adopted measure of similarity, is $N_{EE} = N_E - N_{IE}$. Then, the graph construction and partition problem to find the $N_F$ most dissimilar formulations is as follows:
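$$\min_{y_{fc},\, v_{fc}} \; N_{EE} = \sum_{(f,c)} y_{fc} - \sum_{(f,c)} e_{fc} \qquad (4)$$
$$\text{s.t.} \quad e_{fc} \le y_{fc}, \quad e_{fc} \le v_{fc}, \quad e_{fc} \ge y_{fc} + v_{fc} - 1, \quad \forall (f, c) \qquad (5)$$
$$\sum_{f} v_{fc} = 1, \quad \forall c \qquad (6)$$
$$v_{fc} = 0, \quad \forall c, \; \mathrm{ord}(f) \ge \mathrm{ord}(c) + 1 \qquad (7)$$
$$y_{fc}, v_{fc} \in \{0, 1\}, \quad 0 \le e_{fc} \le 1, \quad \forall (f, c) \qquad (8)$$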
The components of Restriction (5) are a linear formulation of $e_{fc} = y_{fc} \times v_{fc}$; Equation (6) imposes that each component belongs to only one cluster; and the components of Equation (7) are anti-degeneracy constraints ($c_1$ must belong to cluster $f_1$, $c_2$ must belong either to $f_1$ or to $f_2$, and, in general, $c_k$ must belong to one of the first $k$ clusters). Although $e_{fc}$ are binary variables, they may be treated as continuous, given Restriction (5). The total number of binary variables ($y_{fc}$ and $v_{fc}$) is therefore $2 N_F \times N_C$.
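The claim that the edge variables may be treated as continuous can be verified by enumeration. The sketch below assumes Restriction (5) consists of the three standard linearization inequalities, $e \le y$, $e \le v$, and $e \ge y + v - 1$, together with $0 \le e \le 1$; the helper name `e_interval` is ours.

```python
from itertools import product

def e_interval(y: int, v: int):
    """Feasible interval for e under e <= y, e <= v, e >= y + v - 1, 0 <= e <= 1."""
    lower = max(0, y + v - 1)
    upper = min(1, y, v)
    return lower, upper

# For binary y and v the interval collapses to the single point y * v,
# so a continuous e is automatically 0 or 1.
table = {(y, v): e_interval(y, v) for y, v in product((0, 1), repeat=2)}
for (y, v), (lo, hi) in table.items():
    print(y, v, lo, hi)
```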
For a graph with $N_E$ edges, $N_{EE}$ is an absolute measure of the partitioning quality and, equivalently, of the similarity between formulations. When comparing solutions for graphs of different dimensions, the fraction of external edges should be used instead: $F_{EE} = N_{EE} / N_E$.
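As an illustration of these measures, the following sketch evaluates $N_E$, $N_{IE}$, $N_{EE}$, and $F_{EE}$ for a given pair of 0–1 matrices. The matrices `Y` and `V` are a made-up toy case with three formulations and six components, not data from the paper, and `partition_metrics` is a hypothetical helper name.

```python
def partition_metrics(Y, V):
    """Return (N_E, N_IE, N_EE, F_EE) for formulation matrix Y and cluster matrix V."""
    n_edges = sum(sum(row) for row in Y)
    n_internal = sum(y * v
                     for y_row, v_row in zip(Y, V)
                     for y, v in zip(y_row, v_row))
    n_external = n_edges - n_internal
    return n_edges, n_internal, n_external, n_external / n_edges

# Rows are formulations f1..f3, columns are components c1..c6.
Y = [[1, 1, 1, 0, 0, 0],   # f1 uses c1, c2, c3
     [0, 0, 1, 1, 1, 0],   # f2 uses c3, c4, c5
     [0, 0, 0, 0, 1, 1]]   # f3 uses c5, c6
V = [[1, 1, 1, 0, 0, 0],   # c1, c2, c3 assigned to cluster f1
     [0, 0, 0, 1, 1, 0],   # c4, c5 assigned to cluster f2
     [0, 0, 0, 0, 0, 1]]   # c6 assigned to cluster f3

print(partition_metrics(Y, V))  # (8, 6, 2, 0.25)
```

The two external edges are $f_2 c_3$ and $f_3 c_5$, the two cases where a formulation borrows a component assigned to another cluster.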
In addition to the dissimilarity criterion (Equations (4)–(8) above), all generated product formulations must obey the design restrictions of Problem (P1), which are now written as follows:
$$h(x_{fc}, y_{fc}, p_f) = 0, \quad \forall f \qquad (9)$$
$$g(x_{fc}, y_{fc}, p_f, z_f) \le 0, \quad \forall f \qquad (10)$$
Here, mass fractions $x_{fc}$ (of each component $c$ in formulation $f$) are continuous variables, and $z_f$ are additional binary variables used to describe heuristic rules. Therefore, the $N_F$ most dissimilar formulations adhering to design restrictions $h = 0$ and $g \le 0$, without considering the design objective $O$, are the solutions of Equations (4)–(10) (from now on referred to as Problem (P2)). If $h$ and $g$ are linear functions, Problem (P2) is a MILP problem.
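A toy version of Problem (P2) can be solved by exhaustive search to see what the optimization is doing. The sketch below is not the MILP model of the paper: the ingredient pool and the design rule (two emollients plus one surfactant per formulation) are invented, and the solver is replaced by enumeration. It also relies on an observation of ours: since each component belongs to exactly one cluster, at most one of its edges can be internal, so for a fixed $Y$ the best partition gives $N_{EE} = N_E$ minus the number of distinct components used.

```python
from itertools import combinations

EMOLLIENTS = ["e1", "e2", "e3", "e4"]   # hypothetical pool
SURFACTANTS = ["s1", "s2"]

def feasible_formulations():
    """Toy stand-in for h = 0, g <= 0: two emollients plus one surfactant."""
    return [frozenset(pair) | {s}
            for pair in combinations(EMOLLIENTS, 2)
            for s in SURFACTANTS]

def external_edges(formulation_set):
    """N_EE under the best cluster assignment for this set of formulations."""
    n_edges = sum(len(f) for f in formulation_set)
    used = set().union(*formulation_set)
    return n_edges - len(used)

def most_dissimilar(n_f):
    """Exhaustive search over all sets of n_f distinct feasible formulations."""
    best = min(combinations(feasible_formulations(), n_f), key=external_edges)
    return external_edges(best), best

nee2, set2 = most_dissimilar(2)
nee3, set3 = most_dissimilar(3)
print(nee2, nee3)  # 0 3
```

With two formulations the pool is large enough for fully disjoint recipes; asking for a third forces ingredient sharing, so the minimum number of external edges rises.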
In order to include objective $O$, a multiobjective approach is needed, for instance, defining a global objective as a weighted sum of $N_{EE}$ and $O$ (or, more precisely, a measure of $O$ evaluated over $F$, such as the mean value of $O$). Due to the combinatorial nature of Problem (P2), there may be quite a few solutions for the same minimum value of $N_{EE}$ (herein referred to as $N_{EE}^*$). One may then first solve Problem (P2) to find $N_{EE}^*$ and, in a second stage, solve the following problem, Problem (P3), where the mean value of the design objective (or another appropriate measure) is minimized subject to $N_{EE} \le N_{EE}^*$:
$$\min_{y_{fc},\, v_{fc},\, x_{fc},\, z_f} \; \frac{1}{N_F} \sum_{f} O(x_{fc}, y_{fc}, p_f, p_f^*) \qquad (11)$$
$$\text{s.t.} \quad (5)\text{–}(10)$$
$$N_{EE} \le N_{EE}^* \qquad (12)$$
This way, one can find the set of $N_F$ formulations that are simultaneously the most dissimilar and the optimal ones in terms of a given design objective. If $h$, $g$, and $O$ are linear functions, Problem (P3) is a MILP problem.
If the solution of Problem (P3) results in product formulations having unsatisfactory values of the design objective $O$, Restriction (12) may be relaxed, thus allowing for the generation of less dissimilar formulations with a better average performance (lower mean value of $O$). Successive relaxations of Restriction (12) will produce a Pareto curve of the two competing objectives ($O$ versus $N_{EE}$).
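The two-stage procedure and its successive relaxation can be sketched as a sweep over the bound on $N_{EE}$. In the sketch below, each candidate set of formulations is summarized by a pair ($N_{EE}$, mean cost); the numbers are invented for illustration, and a real implementation would re-solve Problem (P3) for each bound instead of scanning a list.

```python
def pareto_sweep(candidates, max_relaxation):
    """candidates: (n_ee, mean_cost) pairs, one per feasible set of formulations.

    Stage 1 finds the minimum N_EE; stage 2 minimizes mean cost subject to
    N_EE <= bound, for bound = N_EE*, N_EE* + 1, ..., N_EE* + max_relaxation.
    """
    nee_star = min(n for n, _ in candidates)
    curve = []
    for bound in range(nee_star, nee_star + max_relaxation + 1):
        best_cost = min(cost for n, cost in candidates if n <= bound)
        curve.append((bound, best_cost))
    return curve

# Made-up (N_EE, mean cost) pairs, not results from the paper.
candidates = [(5, 1.40), (5, 1.25), (6, 1.10), (7, 1.12), (8, 0.95)]
curve = pareto_sweep(candidates, 3)
print(curve)  # [(5, 1.25), (6, 1.1), (7, 1.1), (8, 0.95)]
```

The first point of the curve is the solution of Problem (P3) with the unrelaxed bound; later points trade dissimilarity for lower mean cost.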

5. Example of a Cosmetic Emulsion

The above-proposed optimization tools are now applied to an example of a cosmetic emulsion that has already been explored in [12]. The problem is to formulate a rinse-off hair conditioner, which is an o/w emulsion, selecting ingredients from a set of 32 possible ingredients organized in 6 subsets: emollients of type $i$ ($i_1$ to $i_9$), emollients of type $j$ ($j_1$ to $j_8$), emollients of type $k$ ($k_1$ to $k_7$), fatty alcohols ($m_1$, $m_2$), thickening polymers ($n_1$, $n_2$, $n_3$), and cationic surfactants ($r_1$, $r_2$, $r_3$). The design variables are vectors $y$ (choice of ingredients) and $x$ (mass fractions), both with dimension 32. The formulation also includes five mandatory ingredients (water, glycerol, disodium EDTA, propylparaben, and perfume) in fixed amounts (except for water, whose mass fraction is such that the mass fractions of all components sum to 1).
The available product model is presented in Table 1, including design variables, product performance specifications, property models, and heuristic rules. The model is incomplete, covering neither all product attributes valued by consumers nor their interactions [26]. Only three product performance metrics are fully quantified with models estimating them as a function of product composition: initial viscosity ($\mu_1$), final viscosity ($\mu_2$), and greasiness value ($\gamma$). These models have been validated previously [12] (in the case of $\mu_1$ and $\mu_2$, validation included the case of Heuristic 3.3; in the case of $\gamma$, validation was partial).
Given model incompleteness and uncertainty, the goal is to find a small set of alternative formulations to proceed to experimental testing. These alternatives are thus generated using relatively loose specifications for the three quantitative metrics: $\mu_1$, $\mu_2$, and $\gamma$.

5.1. Modelling of Heuristic Rules

The simplest heuristic rules are recommended or regulatory limits for a particular ingredient $i$, which are easily modelled as linear constraints of the type $L y_i \le x_i \le U y_i$. If ingredient $i$ is chosen ($y_i = 1$), then the desired limits $L$ and $U$ are imposed. Otherwise, if ingredient $i$ is not chosen ($y_i = 0$), then constraint $L y_i \le x_i \le U y_i$ reduces to $x_i = 0$. If no heuristic limits are known, one simply writes $0 \le x_i \le y_i$.
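The behaviour of this semi-continuous bound is easy to check numerically. In the sketch below, the limits 0.01 and 0.05 are hypothetical values chosen for illustration, and `within_limits` is our own helper name.

```python
def within_limits(x_i: float, y_i: int, lower: float, upper: float) -> bool:
    """Check the constraint L * y_i <= x_i <= U * y_i."""
    return lower * y_i <= x_i <= upper * y_i

L, U = 0.01, 0.05  # hypothetical mass-fraction limits

print(within_limits(0.03, 1, L, U))  # True: chosen, amount inside [L, U]
print(within_limits(0.06, 1, L, U))  # False: chosen, amount above U
print(within_limits(0.03, 0, L, U))  # False: not chosen, so x must be 0
print(within_limits(0.0, 0, L, U))   # True: not chosen and absent
```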
Heuristics expressed as logical expressions may also be translated to linear algebraic restrictions using additional binary variables and propositional logic [12], as is the case with Heuristics 3.2 and 3.3, which are modelled as follows.
Heuristic 3.2. Let $z \in \{0, 1\}$ and $z = 1 \Leftrightarrow y_n = 0, \forall n$ (no thickening polymers are used). Then, the logical condition to be modelled is as follows:
$$z = 1 \;\Rightarrow\; \sum_m \frac{x_m}{M_m} \ge 2 \sum_r \frac{x_r}{M_r}$$
The right-hand side of the implication is equivalent to the following:
$$g(x) \le 0, \quad \text{with} \quad g(x) = 2 \sum_r \frac{x_r}{M_r} - \sum_m \frac{x_m}{M_m}$$
The implication may then be written as follows: $z = 1 \Rightarrow g(x) \le 0$. This is modelled by the algebraic constraint $g(x) \le U_g (1 - z)$, where $U_g$ is a non-attainable upper bound for $g(x)$ (in this case, $U_g = 2 \times 0.05 / 300$ is an adequate value).
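The effect of this big-M style constraint can be checked on a small numerical case. The sketch assumes $g(x) = 2 \sum_r x_r / M_r - \sum_m x_m / M_m$, with $m$ indexing fatty alcohols and $r$ cationic surfactants; the mass fractions and molar masses used below are hypothetical, and only the value of $U_g$ is taken from the text.

```python
def g(x_m, M_m, x_r, M_r):
    """g(x) = 2 * sum_r x_r / M_r - sum_m x_m / M_m."""
    surfactant_term = sum(x / M for x, M in zip(x_r, M_r))
    alcohol_term = sum(x / M for x, M in zip(x_m, M_m))
    return 2 * surfactant_term - alcohol_term

U_G = 2 * 0.05 / 300  # upper bound quoted in the text

def heuristic_ok(g_value: float, z: int) -> bool:
    """Constraint g(x) <= U_g * (1 - z), with a small numerical tolerance."""
    return g_value <= U_G * (1 - z) + 1e-12

# Hypothetical compositions: one fatty alcohol and one surfactant.
g_excess = g([0.04], [270.0], [0.02], [350.0])  # alcohol in excess, g < 0
g_short = g([0.01], [270.0], [0.02], [350.0])   # too little alcohol, g > 0

print(heuristic_ok(g_excess, 1))  # True: rule satisfied without polymers
print(heuristic_ok(g_short, 1))   # False: z = 1 demands g <= 0
print(heuristic_ok(g_short, 0))   # True: with z = 0 the constraint is slack
```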
Finally, the logical equivalence $z = 1 \Leftrightarrow y_n = 0, \forall n$ may be modelled as the following set of linear constraints:
$$z \ge 1 - (y_{n_1} + y_{n_2} + y_{n_3}), \quad z \le 1 - y_{n_1}, \quad z \le 1 - y_{n_2} \quad \text{and} \quad z \le 1 - y_{n_3}$$
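That these linear constraints reproduce the logical equivalence can be confirmed by enumerating all eight polymer selections. The sketch assumes the constraints read $z \ge 1 - (y_{n_1} + y_{n_2} + y_{n_3})$ and $z \le 1 - y_n$ for each polymer $n$; the helper name `feasible_z` is ours.

```python
from itertools import product

def feasible_z(y):
    """Binary z values satisfying z >= 1 - sum(y) and z <= 1 - y_n for each n."""
    return [z for z in (0, 1)
            if z >= 1 - sum(y) and all(z <= 1 - yn for yn in y)]

truth_table = {y: feasible_z(y) for y in product((0, 1), repeat=3)}
for y, zs in truth_table.items():
    print(y, zs)  # z is forced to 1 only when no polymer is selected
```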
Heuristic 3.3. In order to model Heuristic 3.3, the viscosity model has to be reformulated in such a way that, if $z = 1$, the viscosity limits are obeyed. Let $l\mu_1 = \log \mu_1$ and $l\mu_1^c = \sum_n (a_n x_n + b_n y_n)$. Then, the following two restrictions describe both the viscosity model, $l\mu_1 = l\mu_1^c + c\,\phi$, and Heuristic 3.3:
$$l_1 z \le l\mu_1 - (l\mu_1^c + c\,\phi) \le u_1 z$$
$$-r_1 (1 - z) \le l\mu_1 - r_1 \le (u_1 - r_1)(1 - z)$$
If $z = 0$, the first restriction reduces to $l\mu_1 = l\mu_1^c + c\,\phi$ and the second restriction reduces to $l\mu_1 \le u_1$. This last one is non-active if $u_1$ is a sufficiently large constant. On the other hand, if $z = 1$, the first restriction is non-active (with $l_1$ being a sufficiently low constant) and the second restriction reduces to $l\mu_1 = r_1$, with $r_1$ being any value between $\log 1350$ and $\log 5000$. Hence, viscosity $\mu_1$ is “forced” to be within the specifications stated by Heuristic 3.3.
A similar formulation is required for the viscosity $\mu_2$:
$$l_2 z \le l\mu_2 - (l\mu_2^c + f\,\phi) \le u_2 z$$
$$-r_2 (1 - z) \le l\mu_2 - r_2 \le (u_2 - r_2)(1 - z),$$
with constants $l_2$, $u_2$, and $r_2$ having similar meanings. The following numerical values were adopted for the six constants: $l_1 = -8$, $u_1 = 7$, $r_1 = \log 1350$, $l_2 = -5$, $u_2 = 3$, $r_2 = \log 1.0$.

5.2. Sets of Dissimilar Alternative Formulations

Taken as a whole, the product model consists of a set of linear equations and inequalities that correspond to the generic design restrictions $h = 0$ and $g \le 0$ in Problem (P1) (see Section 2). One then formulates and solves Problems (P1), (P2), and (P3), presented in Section 2 and Section 4, for this particular case of a hair conditioner. All of them are MILP problems and were solved using GAMS/CPLEX [35] on a laptop with an Intel® Core™ i7-1065G7 processor.
Problem (P1) is formulated with $O$ as the cost of the formulation (excluding fixed ingredients and processing costs). Binary cuts are used to generate a ranking of 50 formulations with increasing cost. These 50 alternative formulations use 15 different components and have a cost ranging between 0.884 and 0.952 USD/kg. This set is clustered using the heuristic multilevel algorithm hMeTiS, a publicly available tool [21], with a CPU time below 1 s. The result for three clusters (and a minimum of three components per cluster) is poor, with almost half of the edges being external edges ($F_{EE} = 0.46$). The quality of the partitioning is even worse with a higher number of clusters. One then concludes that significantly different subsets of alternative formulations cannot be found within this set $F$ of 50 formulations. In other words, $F$ lacks diversity in terms of formulations using significantly different sets of components; it is thus a relatively poor set from which to extract a small number of different alternatives to test.
Next, Problem (P2) is solved for different input values of $N_F$, resulting in each case in a set of $N_F$ formulations with maximum dissimilarity. Afterwards, Problem (P3) is solved, resulting in $N_F$ formulations that are simultaneously the most dissimilar and the ones with the lowest average cost. In all cases, the average cost is not considered to be excessive, and as such, the goal of maximum dissimilarity is not relaxed. Results are shown in Figure 2 and Table 3 below. In Figure 2, each set of formulations is represented by a matrix, and the squares are coloured using a grayscale according to the mass fraction of the selected ingredients.
With $N_F = 3$, one obtains set $F_1$, which uses 14 different components and has a total of 19 edges (non-zero elements in the matrix representation). Of these 19 edges, only 5 are external edges; they correspond to components $m_2$, $r_2$, and $r_3$, which are used in more than one of the three formulations. The fraction of external edges is thus $F_{EE} = 5/19 = 0.263$. If the budget for experimental tests allows for, at most, three candidate formulations, then set $F_1$ is a good plan of experiments, with both maximum dissimilarity between experimental points and minimum average cost (given the available knowledge, expressed in the form of the product model in Table 1).
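The figures reported for set $F_1$ can be cross-checked with simple arithmetic. The sketch below uses an observation of ours, not a statement from the paper: since each component belongs to exactly one cluster, in an optimal partition each used component contributes one internal edge, so the number of external edges equals the number of edges minus the number of distinct components.

```python
n_edges = 19        # nonzeros in the matrix of set F1 (from the text)
n_components = 14   # distinct components used by F1 (from the text)

n_external = n_edges - n_components   # one internal edge per used component
f_ee = n_external / n_edges

print(n_external, round(f_ee, 3))  # 5 0.263
```

Both values agree with the 5 external edges and $F_{EE} = 0.263$ quoted for $F_1$.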
For higher values of $N_F$, one obtains sets of formulations that use more components, have a lower level of dissimilarity (higher values of $F_{EE}$), and have a higher average cost. Examples are sets $F_2$ and $F_3$ (also shown in Figure 2), which have 5 and 7 formulations, respectively. These sets clearly cover a larger design space that includes new and more expensive ingredients. In fact, in order to find a larger number of dissimilar formulations, the algorithm is forced to construct formulations with more ingredients, including more expensive ones.
Regarding computational time, Problem (P2) is solved in less than 0.3 s in all cases, while the time to solve Problem (P3) increases substantially as $N_F$ increases. This indicates that, for large values of $N_F$, there is room to test alternative graph partitioning algorithms (both classical multilevel algorithms and Markov cluster algorithms), which are faster than integer programming but do not guarantee global optimality. A comparison between these heuristic partitioning methods and the approach adopted in this paper (MILP Problems (P2) and (P3)), including the trade-off between CPU time and quality of partitioning, has yet to be performed. Furthermore, a prior problem remains to be solved, namely how to integrate the product design restrictions (Equations (9) and (10)) with those efficient graph partitioning algorithms. Using integer programming, this integration is straightforward and was applied in Problems (P2) and (P3).
In the solutions presented thus far (sets $F_1$, $F_2$, and $F_3$), no thickening polymers were selected; instead, the cheaper solution of using fatty alcohols in excess was adopted, in accordance with Heuristic 3.2 of Table 1. Still, one may want to deliberately include formulations with thickening polymers in the experimental set. To do so, one only has to solve Problems (P2) and (P3) with explicit restrictions on the binary variables $z_f$, which control this alternative. With $N_F = 7$ and the restriction $\sum_f z_f \le 5$, at least two of the generated formulations will use a thickening polymer. The solution thus obtained (not shown in Figure 2) has the same number of external edges as that of set $F_3$ ($N_{EE} = 18$) and uses the same 25 ingredients as set $F_3$ plus polymers $n_1$ and $n_2$ (separately, in two of the seven alternative formulations). The average cost increases from 1.36 to 1.46 EUR/kg.

6. Conclusions

We have proposed mixed-integer optimization and graph partitioning tools that are able to generate the most dissimilar product formulations within a large space of design alternatives; these tools are therefore useful for designing an exploratory set of experiments. In particular, the sequential solution of Problems (P2) and (P3) can identify the set of formulations that are simultaneously the most dissimilar and the optimal ones regarding a given design objective. These features were illustrated by a cosmetic emulsion example, in which the formulation is built from a list of 32 possible ingredients and an incomplete model relating composition to product performance. Using the proposed tools, one is able to generate sets of alternative product formulations with prescribed size, with the larger ones including more components so as to achieve the goal of maximum dissimilarity. These are the optimal plans of the exploratory experiments to be performed, given the existing knowledge in the form of an incomplete product model. Therefore, the proposed tools provide a systematic methodology to acquire extra information, increase knowledge about the product performance–composition relationships, and thus sustain a more rational product development process.


Funding

This research was funded by Fundação para a Ciência e Tecnologia (FCT, Portugal), project reference UIDB/00102/2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request.


Acknowledgments

The author thanks Javier A. Arrieta-Escobar for his contribution to the cosmetic emulsion example and also expresses thanks for the financial support received from Fundação para a Ciência e Tecnologia (FCT, Portugal), project reference UIDB/00102/2020.

Conflicts of Interest

The author declares no conflict of interest.


  1. Gani, R.; Ng, K.M. Product design—Molecules, devices, functional products, and formulated products. Comput. Chem. Eng. 2015, 81, 70–79.
  2. Charpentier, J.-C.; Costa, R.; Uhlemann, J. Product design and engineering—Past, present, future trends in teaching, research and practices: Academic and industry points of view. Curr. Opin. Chem. Eng. 2019, 42, 2258–2274.
  3. Gil, J.L.R.; Serna, J.; Arrieta-Escobar, J.A.; Rincón, P.C.N.; Boly, V.; Falk, V. Triggers for chemical product design: A systematic literature review. AIChE J. 2022, 68, e17563.
  4. Bernardo, F.P.; Saraiva, P.M. A conceptual model for chemical product design. AIChE J. 2015, 61, 802–815.
  5. Zhang, L.; Babi, D.K.; Gani, R. New Vistas in Chemical Product and Process Design. Annu. Rev. Chem. Biomol. Eng. 2016, 7, 557–582.
  6. Zhang, L.; Fung, K.Y.; Wibowo, C.; Gani, R. Advances in chemical product design. Rev. Chem. Eng. 2018, 34, 319–340.
  7. Zhang, L.; Mao, H.; Liu, Q.; Gani, R. Chemical product design—Recent advances and perspectives. Curr. Opin. Chem. Eng. 2020, 27, 22–34.
  8. Achenie, L.E.K.; Venkatasubramanian, V.; Gani, R. (Eds.) Computer Aided Molecular Design: Theory and Practice; Volume 12 of Computer-Aided Chemical Engineering; Elsevier: Amsterdam, The Netherlands, 2003.
  9. Gani, R. Chemical product design: Challenges and opportunities. Comput. Chem. Eng. 2004, 28, 2441–2457.
  10. Austin, N.D.; Sahinidis, N.V.; Trahan, D.W. Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques. Chem. Eng. Res. Des. 2016, 116, 2–26.
  11. Zhang, L.; Fung, K.Y.; Zhang, X.; Fung, H.K.; Ng, K.M. An integrated framework for designing formulated products. Comput. Chem. Eng. 2017, 107, 61–76.
  12. Arrieta-Escobar, J.A.; Bernardo, F.P.; Orjuela, A.; Camargo, M.; Morel, L. Incorporation of heuristic knowledge in the optimal design of formulated products: Application to a cosmetic emulsion. Comput. Chem. Eng. 2019, 122, 265–274.
  13. Greenhill, S.; Rana, S.; Gupta, S.; Vellanki, P.; Venkatesh, S. Bayesian Optimization for Adaptive Experimental Design: A Review. IEEE Access 2020, 8, 13937–13948.
  14. Cao, L.; Russo, D.; Matthews, E.; Lapkin, A.; Woods, D. Computer-aided design of formulated products: A bridge design of experiments for ingredient selection. Comput. Chem. Eng. 2023, 169, 108083.
  15. Mah, R.S.H. Application of Graph Theory to Process Design and Analysis. Comput. Chem. Eng. 1983, 7, 239–257.
  16. Balaban, A. Applications of Graph Theory in Chemistry. J. Chem. Inf. Comput. Sci. 1985, 25, 334–343.
  17. Bernardo, F.P.; Arrieta-Escobar, J.A. Clustering alternative product formulations using graphs. In ESCAPE-29, Computer-Aided Chemical Engineering 46; Kiss, A.A., Zondervan, E., Lakerveld, R., Özkan, L., Eds.; Elsevier: Amsterdam, The Netherlands, 2019; pp. 511–516.
  18. Tsai, J.F.; Lin, M.H.; Hu, Y.C. Finding multiple solutions to general integer linear programs. Eur. J. Oper. Res. 2008, 184, 802–809.
  19. Aykanat, C.; Pinar, A.; Çatalyürek, Ü.V. Permuting sparse rectangular matrices into block diagonal form. SIAM J. Sci. Comput. 2004, 25, 1860–1879.
  20. Karypis, G.; Kumar, V. Analysis of Multilevel Graph Partitioning; Department of Computer Science & Engineering, University of Minnesota: Minneapolis, MN, USA, 1995.
  21. Karypis, G.; Kumar, V. hMeTiS, a Hypergraph Partitioning Package, Version 1.5.3; Department of Computer Science & Engineering, University of Minnesota: Minneapolis, MN, USA, 1998.
  22. Karypis, G.; Kumar, V. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 1998, 20, 359–392.
  23. van Dongen, S. Graph Clustering by Flow Simulation. Ph.D. Thesis, University of Utrecht, Utrecht, The Netherlands, 2000.
  24. van Dongen, S. Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 2008, 30, 121–141.
  25. Boulle, M. Compact Mathematical Formulation for Graph Partitioning. Optim. Eng. 2004, 5, 315–333.
  26. Arrieta-Escobar, J.A.; Camargo, M.; Morel, L.; Bernardo, F.P.; Orjuela, A.; Wendling, L. Design of formulated products integrating heuristic knowledge and consumer assessment. AIChE J. 2020, 67, e17117.
  27. Brummer, R.; Godersky, S. Rheological studies to objectify sensations occurring when cosmetic emulsions are applied to the skin. Colloids Surf. A Physicochem. Eng. Asp. 1999, 152, 89–94.
  28. Bagajewicz, M.J.; Hill, S.; Robben, A.; Lopez, H.; Sanders, M.; Sposato, E.; Baade, C.; Manora, S.; Coradin, J.H. Product design in price-competitive markets: A case study of a skin moisturizing lotion. AIChE J. 2011, 57, 160–177.
  29. Prospector®. Available online: (accessed on 13 September 2017).
  30. Pal, R. Evaluation of theoretical viscosity models for concentrated emulsions at low capillary numbers. Chem. Eng. J. 2001, 81, 15–21.
  31. Mentel, M.; Wiechers, S.; Howe, A.; Biehl, P.; Meyer, J. Senses—A Scientific Tool for the Selection of The Right Emollient. SOFW 2014, 140, 8–15.
  32. Iwata, H.; Shimada, K. Formulas, Ingredients and Production of Cosmetics; Springer: Tokyo, Japan, 2013.
  33. Nakarapanich, J.; Barameesangpet, T.; Suksamranchit, S.; Sirivat, A.; Jamieson, A.M. Rheological properties and structures of cationic surfactants and fatty alcohol emulsions: Effect of surfactant chain length and concentration. Colloid Polym. Sci. 2001, 279, 671–677.
  34. Ansmann, A.; Busch, P.; Hensen, H.; Hill, K.; Krächter, H.-U.; Müller, M. Personal Care Formulations. In Handbook of Detergents, Part D: Formulation; Showell, M., Ed.; Surfactant Science; CRC Press: Boca Raton, FL, USA, 2006; pp. 207–260.
  35. GAMS. Version 32.2.0, Released 26 August 2020. Available online: (accessed on 10 September 2023).
Figure 1. Matrix and graph representations of 7 product formulations ($f_1$ to $f_7$), each one with components chosen from a pool of 18 possible components ($c_1$ to $c_{18}$). A partition into three clusters (A, B, and C) is also shown. In the matrix, these clusters correspond to diagonal blocks; in the graph, the clusters are delineated by dashed lines.
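The matrix and graph views described in the Figure 1 caption can be illustrated with a minimal Python sketch. The formulation and component labels, the incidence data, and the cluster assignment below are hypothetical and do not reproduce Figure 1; the sketch only shows how a formulation-component incidence structure defines the edges of a bipartite graph, and how a partition separates edges that stay inside a cluster from edges that cross clusters.

```python
# Hypothetical data: each formulation maps to the set of components it uses.
formulations = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c2", "c3", "c4"},
    "f3": {"c5", "c6"},
    "f4": {"c6", "c7", "c1"},
}

# Hypothetical partition: every formulation and every component is assigned
# to exactly one cluster.
cluster = {
    "f1": "A", "f2": "A", "c1": "A", "c2": "A", "c3": "A", "c4": "A",
    "f3": "B", "f4": "B", "c5": "B", "c6": "B", "c7": "B",
}


def bipartite_edges(formulations):
    """Return one (formulation, component) edge per non-zero matrix entry."""
    return sorted((f, c) for f, comps in formulations.items() for c in comps)


def split_edges(edges, cluster):
    """Separate edges inside a cluster from edges that cross clusters."""
    internal = [e for e in edges if cluster[e[0]] == cluster[e[1]]]
    external = [e for e in edges if cluster[e[0]] != cluster[e[1]]]
    return internal, external


edges = bipartite_edges(formulations)
internal, external = split_edges(edges, cluster)
print(len(edges), len(internal), len(external))
print(external)
```

In the matrix picture, an external edge such as the one between `f4` and `c1` in this toy data is an entry that falls outside the diagonal blocks.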
Figure 2. Sets of formulations with maximum dissimilarity and minimum average cost. Set $F_1$ was obtained with $N_F = 3$, set $F_2$ with $N_F = 5$, and set $F_3$ with $N_F = 7$. Each set is represented in matrix form, with each selected ingredient depicted as a square; darker squares correspond to higher mass fraction values, based on the grayscale shown at the bottom.
Table 1. Hair conditioner model.
Design variables: Choice of ingredients (binary variables $y$) and respective mass fractions $x$; one pair $\{y, x\}$ for each of six subsets of ingredients: $i$, $j$, $k$, $m$, $n$, and $r$.
Product performance specifications
1. Initial viscosity $\mu_1$ (perceived at the onset of flow of the product) between 1350 and 5000 Pa·s [27]
2. Final viscosity $\mu_2$ (at ~500 s, perceived during product application on hair) between 0.023 and 1.0 Pa·s [27]
3. Greasiness value ($\gamma$) between 2.0 and 2.4 [28]
Property models
1. Initial viscosity: $\log \mu_1 = \sum_n (a_n x_n + b_n y_n) + c\,\phi$; final viscosity: $\log \mu_2 = \sum_n (d_n x_n + e_n y_n) + f\,\phi$, with $\log$ being the decimal logarithm. Since at most one thickener is used, only one term of the sum $\sum_n$ is non-zero. Parameters $a_n$, $b_n$, $d_n$, and $e_n$ are estimated from experimental data for aqueous solutions of $n$ [29] (see Table 2 below); $c$ and $f$ (effect of the dispersed phase) are estimated from the Yaron and Gal-Or theoretical model [30]: $c = 1$; $f = 2$.
2. Greasiness value $\gamma$ given by a linear mixing rule using known $\gamma$ values for each emollient [31].
Heuristic rules 1
1. Oil phase mass fraction: $\phi = \sum_q x_q \le 0.25$, $q = \{i, j, k, m, r\}$
2. Cationic surfactants
2.1 Heuristic for “moist and soft” product [32]: no $r_1$; $2\,x_{r_3} \le x_{r_2} \le 4\,x_{r_3}$
2.2 General limits: $x_{r_1} \le 0.01$, $x_{r_2} \le 0.01$, $x_{r_1} + x_{r_2} \le 0.03$
2.3 Cationic surfactants at about 20% of the oil phase stabilize the emulsion: $0.16\,\phi \le \sum_r x_r \le 0.24\,\phi$
3. Fatty alcohols
3.1 Use at least one; 3 to 8% each; total lower than 8%: $\sum_m y_m \ge 1$; $0.03\,y_m \le x_m \le 0.08\,y_m$; $\sum_m x_m \le 0.08$
3.2 When no thickening polymers are used, the concentration of fatty alcohols is at least twice that of cationic surfactants on a molar basis [33] (mathematical formulation given in the main text).
3.3 If Heuristic 3.2 holds, then $\mu_1$ and $\mu_2$ are expected to be within specifications (mathematical formulation in the main text).
4. Emollients
4.1 Use at least one emollient of each type [34]: $\sum_i y_i \ge 1$; $\sum_j y_j \ge 1$; $\sum_k y_k \ge 1$
4.2 Use a minimum of 1% of each; total greater than 6%: $0.03\,y_i \le x_i \le y_i$, idem for $j$ and $k$; $\sum_i x_i + \sum_j x_j + \sum_k x_k \ge 0.06$
5. Thickening polymers
5.1 Use only one: $\sum_n y_n \le 1$
5.2 Limits: $0.0015\,y_{n_1} \le x_{n_1} \le 0.03\,y_{n_1}$, $0.005\,y_{n_2} \le x_{n_2} \le 0.02\,y_{n_2}$, $0.004\,y_{n_3} \le x_{n_3} \le 0.03\,y_{n_3}$
1 Unreferenced heuristics result from the experience of the author and co-workers.
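As an illustration of how the heuristic rules of Table 1 act as constraints on a candidate formulation, the Python sketch below checks Heuristics 3.1 and 5.1 only. The ingredient labels (`m1`, `m2`, `n1`, `n2`) and the candidate mass fractions are hypothetical. The inequality directions follow the verbal statements of the rules ("at least one", "3 to 8% each", "total lower than 8%"), and "use only one" thickener is read here as "at most one", in line with the remark in the property models that at most one thickener is used. This is a feasibility check for illustration, not the mixed-integer formulation used in the paper.

```python
def check_fatty_alcohols(x_m, tol=1e-9):
    """Heuristic 3.1: at least one fatty alcohol, 3 to 8% each, total at most 8%.

    x_m maps each fatty alcohol label to its mass fraction; a zero or missing
    entry is treated as "not selected" (y_m = 0).
    """
    selected = {m: x for m, x in x_m.items() if x > 0}
    at_least_one = len(selected) >= 1
    each_in_range = all(0.03 - tol <= x <= 0.08 + tol for x in selected.values())
    total_ok = sum(selected.values()) <= 0.08 + tol
    return at_least_one and each_in_range and total_ok


def check_thickeners(x_n):
    """Heuristic 5.1, read as: at most one thickening polymer is selected."""
    return sum(1 for x in x_n.values() if x > 0) <= 1


# Hypothetical candidates.
ok_alcohols = check_fatty_alcohols({"m1": 0.04, "m2": 0.04})
too_much = check_fatty_alcohols({"m1": 0.05, "m2": 0.04})
none_used = check_fatty_alcohols({})
one_thickener = check_thickeners({"n1": 0.002, "n2": 0.0})
two_thickeners = check_thickeners({"n1": 0.002, "n2": 0.005})
print(ok_alcohols, too_much, none_used, one_thickener, two_thickeners)
```

Note that the per-ingredient lower bound of 3% combined with the 8% cap on the total implies that at most two fatty alcohols can be selected at once.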
Table 2. Viscosity model parameters.
         $a_n$     $b_n$     $d_n$     $e_n$
$n_1$    223.42    2.4839    77.181    0.7636
$n_2$    271.13    2.0017    78.351    1.5196
$n_3$    109.27    3.4731    174.74    0.2526
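The viscosity model of Table 1 can be evaluated numerically with the parameters of Table 2. The Python sketch below is a minimal illustration under stated assumptions: the parameter values are taken as read from Table 2 in the column order $a_n$, $b_n$, $d_n$, $e_n$; $c = 1$ and $f = 2$ are used as stated in Table 1; exactly one thickener is selected ($y_n = 1$), so the sum over $n$ reduces to a single term. The thickener mass fraction and oil phase fraction in the example call are hypothetical.

```python
# Parameters (a_n, b_n, d_n, e_n) as read from Table 2 for thickeners n1 to n3.
PARAMS = {
    "n1": (223.42, 2.4839, 77.181, 0.7636),
    "n2": (271.13, 2.0017, 78.351, 1.5196),
    "n3": (109.27, 3.4731, 174.74, 0.2526),
}
C_PHI = 1.0  # coefficient c of the oil phase fraction (Table 1)
F_PHI = 2.0  # coefficient f of the oil phase fraction (Table 1)


def log10_viscosities(thickener, x_n, phi):
    """Return (log10 mu_1, log10 mu_2) for one selected thickener (y_n = 1).

    thickener: key of PARAMS; x_n: thickener mass fraction;
    phi: oil phase mass fraction.
    """
    a, b, d, e = PARAMS[thickener]
    log_mu1 = a * x_n + b + C_PHI * phi
    log_mu2 = d * x_n + e + F_PHI * phi
    return log_mu1, log_mu2


# Hypothetical operating point: 0.3% of thickener n1 and a 15% oil phase.
log_mu1, log_mu2 = log10_viscosities("n1", x_n=0.003, phi=0.15)
print(log_mu1, log_mu2)
```

Because the model is linear in $x_n$ and $\phi$ on a logarithmic scale, the viscosity specifications of Table 1 translate into linear constraints once expressed as bounds on $\log \mu_1$ and $\log \mu_2$.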
Table 3. Data regarding the sets of formulations of Figure 2. $N_C$ is the number of selected components (out of a total of 32); $\bar{C}$ is the average cost in EUR/kg; CPU time is for Problem (P3) (Problem (P2) requires less than 0.3 s in all cases).
Set      $N_F$   $N_C$   $N_E$   $N_{EE}$   $F_{EE}$   $\bar{C}$   CPU (s)
$F_1$    3       14      19      5          0.263      1.05        0.3
$F_2$    5       20      31      11         0.355      1.21        3.0
$F_3$    7       25      43      18         0.419      1.36        395.8
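A quick arithmetic check on Table 3: the reported values are numerically consistent with $F_{EE}$ being the ratio $N_{EE}/N_E$. This reading is an assumption made here for illustration, since the symbols are defined in the main text rather than in the table caption. The Python sketch below recomputes the ratio from the tabulated counts.

```python
# Counts and reported fractions as read from Table 3.
table3 = {
    "F1": {"N_E": 19, "N_EE": 5, "F_EE": 0.263},
    "F2": {"N_E": 31, "N_EE": 11, "F_EE": 0.355},
    "F3": {"N_E": 43, "N_EE": 18, "F_EE": 0.419},
}


def recomputed_fraction(row):
    """Ratio N_EE / N_E (assumed here to be the definition of F_EE)."""
    return row["N_EE"] / row["N_E"]


# Largest absolute gap between the recomputed ratio and the reported value.
max_gap = max(abs(recomputed_fraction(r) - r["F_EE"]) for r in table3.values())
print(max_gap)
```

The gap stays below the rounding resolution of the table (three decimals) for all three sets.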

Bernardo, F.P. Generation of Dissimilar Alternative Product Formulations Using Graphs. Processes 2023, 11, 3152.
