Evolutionary Algorithms in Modeling Aerodynamic Properties of Spray-Dried Microparticulate Systems

: Spray drying is a single step process in which solutions or dispersions are converted into dry particles. It is widely used in pharmaceutical technology to produce inhalable particles. Dry particle behavior during inhalation, described as the emitted dose (ED) and ﬁne particle fraction (FPF), is determined in vitro by standardized procedures. A large number of factors inﬂuencing the spray drying process and particle interaction makes it di ﬃ cult to predict the ﬁnal product properties in advance. This work presents the development of predictive models based on experimental data obtained by aerodynamic assessment of respirable dry powders. Developed models were tested according to the 10-fold cross-validation procedure and yielded good predictive ability. Both models were characterized by normalized root-mean-square error (NRMSE) below 8.50% and coe ﬃ cient of determination ( R 2 ) above 0.90. Moreover, models were analyzed to establish a relationship between spray drying process parameters and the ﬁnal product quality measures. Presented work describes the strategy of implementing the evolutionary algorithms in empirical model’s development. Obtained models can be applied as an expert system during pharmaceutical formulation development. The models have the potential for product optimization and a knowledge extraction to improve ﬁnal quality of the drug.


Introduction
Current trends in pharmaceutical formulation development, reflected in guidelines for industry [1,2], indicate the need for a better understanding of manufacturing processes. The quality of pharmaceutical formulation depends on the qualitative and quantitative composition, applied production process, and its parameters. In-depth knowledge of production processes is crucial for quality assurance of the final product. Complexity, nonlinearity, and lack of full understanding of the processes are the main obstacles to the high-quality models development. Application of the artificial intelligence (AI) tools, such as evolutionary algorithms (EA), in the knowledge discovery of the manufacturing process, is beneficial for both the optimization of pharmaceutical formulations

Materials and Methods
The results of laboratory experiments were used to construct the database. Dry powders were prepared by spray drying using Nano Spray Dryer B-90 (Büchi Labortechnik AG, Flawil, Switzerland). Feed solutions contained a model active pharmaceutical ingredient (API) and an excipient dissolved in the water-ethanol mixture. During spray drying, six factors were controlled and modified: API to excipient ratio [m/m%], feed solution concentration [m/V%], ethanol to water ratio [V/V%], inlet air temperature [ • C], air flow [L/min], and pressure [mbar] inside the spray dryer. Particle aerodynamic performance was assessed by the Twin-stage Glass Impinger (Apparatus A, Copley Scientific, Colwick, UK) coupled with HandiHaler ® inhaler. Assays were carried out according to the 10th Ph. Eur. monograph on aerodynamic assessment of fine particles for powder inhalers (2.9.18) [10]. The shape and size of obtained particles were analyzed based on Scanning Electron Microscopy photographs using ImageJ software (1.52a, website: https://imagej.net/) [15]. The particles were spherical and their median size for a number distribution (Dn50) differed from 250 nm to 760 nm between formulations. The complete database, applied in further computational experiments, is shown in Table 1.

Models Development and Assessment
The computational experiment was designed to develop multiple input single output (MISO) models. ED and FPF models were developed separately without any interaction of these two variables. Model performance was evaluated according to the 10-fold cross validation and expressed by three goodness of fit metrics: root-mean-square error (RMSE, Equation (1)), normalized root-mean-square error (NRMSE, Equation (2)), and coefficient of determination (R 2 , Equation (3)). NRMSE was calculated according to experimental FPF and ED ranges. (1) where: obs i , pred i = observed and predicted values of i-th record, n = total number of records.
where: RMSE = root-mean-square error calculated for the model, X max = maximum value of the observed results, X min = minimum value of the observed results. where: R 2 = coefficient of determination, obs i , pred i = observed and predicted values, respectively, obs mean = arithmetic mean of observed values. Models were built and assessed using in-house software developed in R environment [16]. Additional packages delivering evolutionary algorithms, fuzzy logic, and genetic programming methods were used. Computations were executed using 29 workstations (542 threads) working under Linux openSUSE Tumbleweed operating system. All tools applied in the study including source code and examples are available on SourceForge website [17][18][19].

Evolutionary Algorithms
Evolutionary algorithms gained research focus through the work of John Holland and his group in the 1970s, who introduced framework for genetic algorithm (GA) development [20]. Evolutionary algorithms are classified as heuristic search methods inspired by evolution process. The concept is based on the mechanism of natural selection where within the population only the fittest subjects, the ones which can adapt quickly to the solution, are selected for reproduction. The offspring inherits the characteristics of both parents, and if they better adapt to the observed solution (better fitness), they have a better chance to survive [21].
The given problem is defined by the environment, where a population of candidate solutions are present. Every candidate has a chromosome which is used to decode its information to the real solution. The computational space in the population is represented by the genotype where the phenotype represents its solution in the real world. In other words, chromosomes describe the proposed real solution of the problem, and cost function evaluates its goodness of fit. The evolution process usually starts by the random initialization of the population. Then, it iterates over three critical steps: (1) selection of the best-fitted individuals, (2) application of the genetic operators such as cross-over, mutation, and (3) reproduction-the next generation is created. These three steps are repeated until a termination condition is reached [22]. The process of model development and its selection was adapted from Polak et al. [23], where the authors used a similar methodology to explore the physiological parameters influencing QT interval, indicating the broad area of application of evolutionary algorithms. In this work, in-house software implementing two machine learning frameworks (fugeR and rgp) delivering computational methods based on evolutionary algorithms was employed.

fugeR
The first framework named fugeR [24] delivers solutions to develop fuzzy logic (FL) models by applying evolutionary algorithms fundamentals. Fuzzy logic was introduced by Zadeh [25], who claimed intermediate values between absolute true and absolute false, so called many-valued logic. The fugeR uses a genetic algorithm to evolve a population of the fuzzy logic models. In general, solution development starts by generating a random population of fuzzy systems, test them with available data, and sort them based on the goodness of fit metric. The following generations are created as decedents of the best performing models by applying cross-over function and random mutations. At the end of the process, the fuzzy system with the best performance is returned. The fugeR framework delivers enclosed functions to develop FL models using EA. However, to make model development process efficient and in line with the adopted standards, a wrapper software was created [18]. The system was built using Bash and R languages and works under Linux OS. It allows generating computational tasks with grid-like generated parameters including: learn and test dataset files, factors pass to fugeR.run() function for model development and scheme of models creation and assessment. Such approach allows testing developed models according to k-fold cross-validation standards and returns model performance as structured reports. Moreover, parallelization of the computational tasks speeds up the model development process and allows to find global problem solution. The general pipeline presenting the process of model development and testing process using the software is shown in Figure 1. During computational experiments, both model architectures and optimization settings were tuned. The maximum number of rules differed from 10 to 50 with 1 to 20 variables within every fuzzy rule. The number of fuzzy membership functions (singletons) varied from 5 to 15. The learning strategy was controlled by population size (from 50 to 500) and generations number (from 100 to 500). advances and results. It is also possible to promptly use obtained models. The software is suited for parallelization of the processes by specifying a set of different parameters within the user interface. Moreover, computational tasks are run independently to speed up the model development process. The general scheme of the software workflow is shown in Figure 2. In the presented work, symbolic regression was used to develop a mathematical formula directly on the data provided. The size of the chromosome, representing the maximum length of the equation varied from 50 to 200. The population size was set to 200, and the total length of the evolution process was set to 100 million steps divided into 500 testing stages. Moreover, the GP tools were used to perform feature selection based on the principle that variables related to the evolution of the solution are connected with the defined problem. Figure 1. The workflow of in-house developed software implementing fugeR framework to model development and assessment system using fuzzy logic (FL) coupled with evolutionary algorithms. Software controls: data feed and model development process returning results in structured reports [18].

rgp
The second framework applied in presented work was rgp [26] developed by Olivier Flash, which delivers genetic programming methods. GP allows generating a computer program capable to solve a defined problem using EA. Moreover, using a symbolic regression module, the solution of a given problem can be represented as formal mathematical expression, which is easy to read and understand by humans. Similar to the previously described tool, at the first stage, the population of random equations is generated and tested. Next generations are created as descendants based on the best performing models by the exchange of a genetic material (cross-over function) and additional random modifications (mutations). To make the model development process efficient and in line with adopted methodology, in-house software for Linux OS was developed [19]. Source code was written in R and Bash programming languages. Software control parameters such as: data feed for the core function of model development through GP methods, flow of the evolution process (stop, resume), and performance metric the k-fold cross-validation (k-CV). Results are returned in structured report files. Model development and testing routine is continuous, which allows the user to follow recent advances and results. It is also possible to promptly use obtained models. The software is suited for parallelization of the processes by specifying a set of different parameters within the user interface. Moreover, computational tasks are run independently to speed up the model development process. The general scheme of the software workflow is shown in Figure 2. In the presented work, symbolic regression was used to develop a mathematical formula directly on the data provided. The size of the chromosome, representing the maximum length of the equation varied from 50 to 200. The population size was set to 200, and the total length of the evolution process was set to 100 million steps divided into 500 testing stages. Moreover, the GP tools were used to perform feature selection based on the principle that variables related to the evolution of the solution are connected with the defined problem.
Appl. Sci. 2020, 10, x 6 of 14 Figure 2. The workflow of in-house developed software implementing rgp framework to model development and assessment system using genetic programming methods. The first stage is based on a symbolic regression function allowing the development of formal mathematical equation fitting defined data. Later, the process is stopped, models are tested according to the k-fold cross-validation (k-CV) scheme, and the evolution process is resumed. Parameters affecting the evolutionary algorithms and model development process can be modified in the text user interface [19].

Results and Discussion
Prior to applying GP tools, linear regression models were developed and tested according to the 10-fold cross-validation scheme. Standard lm() function of R environment was applied, yet obtained results were far from being satisfactory. In the case of FPF prediction, RMSE was 12.03 (NRMSE = 20.44%), and in the case of ED predictions, RMSE was 11.30 (NRMSE = 32.2%).

Models for Prediction of FPF
The FPF prediction model performance is summarized in Table 2. Both fugeR and rgp models exhibited good FPF predictive ability. Fuzzy systems resulted in a slightly worse performance in comparison to genetic programming, as their NRMSE was 8.86 with R 2 below 0.85. The best model was developed using rgp package and it is presented as Equation (4) (NRMSE = 8.29%, and R 2 = 0.91). The model uses four variables: API to excipient ratio (X1), feedstock solution concentration (X2), ethanol to water ratio (X3), inlet air temperature (X4), and five constants (C1-C5). During the evolution of the solution, two variables describing flow rate and pressure inside the spray drying chamber were excluded. Comparison of FPF values predicted by the model and those obtained in laboratory experiments is presented in Figure 3. The workflow of in-house developed software implementing rgp framework to model development and assessment system using genetic programming methods. The first stage is based on a symbolic regression function allowing the development of formal mathematical equation fitting defined data. Later, the process is stopped, models are tested according to the k-fold cross-validation (k-CV) scheme, and the evolution process is resumed. Parameters affecting the evolutionary algorithms and model development process can be modified in the text user interface [19].

Results and Discussion
Prior to applying GP tools, linear regression models were developed and tested according to the 10-fold cross-validation scheme. Standard lm() function of R environment was applied, yet obtained results were far from being satisfactory. In the case of FPF prediction, RMSE was 12.03 (NRMSE = 20.44%), and in the case of ED predictions, RMSE was 11.30 (NRMSE = 32.2%).

Models for Prediction of FPF
The FPF prediction model performance is summarized in Table 2. Both fugeR and rgp models exhibited good FPF predictive ability. Fuzzy systems resulted in a slightly worse performance in comparison to genetic programming, as their NRMSE was 8.86 with R 2 below 0.85. The best model was developed using rgp package and it is presented as Equation (4) (NRMSE = 8.29%, and R 2 = 0.91).
The model uses four variables: API to excipient ratio (X 1 ), feedstock solution concentration (X 2 ), ethanol to water ratio (X 3 ), inlet air temperature (X 4 ), and five constants (C 1 -C 5 ). During the evolution of the solution, two variables describing flow rate and pressure inside the spray drying chamber were excluded. Comparison of FPF values predicted by the model and those obtained in laboratory experiments is presented in Figure 3.

Models for Prediction of ED
Results obtained by the best fuzzy systems and GP models are presented in Table 3. Both methods resulted in a significantly lower error in comparison to linear regression. The GP models yielded almost twice-fold lower NRMSE (8.14%) than fuzzy systems (below 15%). The best model is given in Equation (5). ED values calculated according to Equation (5) and observed in laboratory experiments are shown in Figure 4. The model uses all six variables (X1-X6) available in the database and two constants (C1 and C2). where:

Models for Prediction of ED
Results obtained by the best fuzzy systems and GP models are presented in Table 3. Both methods resulted in a significantly lower error in comparison to linear regression. The GP models yielded almost twice-fold lower NRMSE (8.14%) than fuzzy systems (below 15%). The best model is given in Equation (5). ED values calculated according to Equation (5) and observed in laboratory experiments are shown in Figure 4. The model uses all six variables (X 1 -X 6 ) available in the database and two constants (C 1 and C 2 ).

Model-Based Problem Analysis-Single Variable Impact
The predictive models were applied to explore the influence of input variables on predicted FPF and ED endpoints. Relationships between input and output variables are presented in Figure 5. Analysis of Figure 5A shows that changing the ratio of API to excipient content strongly impacts ED and FPF values. Increasing the API content in the liquid feed decreases both the FPF and ED. The steepest descent of ED is observed when API content is in the range of 50 to 72.5% ( Figure 5A). Figure  5B presents the influence of the total concentration of compounds dissolved in feed solution on the final particle characteristics. It is observed that lower concentrations of feed solution yield higher FPF and ED values. Ethanol to water ratio in solvent applied in the process has a small impact on FPF value while predicted ED is significantly lower for feed solvents with lower ethanol content ( Figure  5C). The temperature of the air entering spray drying chamber does not affect ED at all ( Figure 5D). On the contrary, low and high temperatures positively impact FPF values ( Figure 5D). ED(%) = e e (sin (X 6 +X 5 +sin ( √ (X 4 −C 1 ))+e (sin (X 3 )) )) + e e (sin (X 6 +X 2 )) + X 6 + e e (sin (sin (X 3 +X 2 −X 2 ))) + e e (sin (X 2 + √ X 1 )) + X 2 (5) C 1 -C 2 : constants C 1 = −5.029; C 2 = 8.417, X 1 : API to excipient ratio [m/m%], X 2 : concentration of feed solution [m/V%], X 3 : ethanol to water ratio in solvent applied in the process [V/V%], X 4 : inlet air temperature during spray drying process [ • C], X 5 : air flow during spray drying process [L/min], X 6 : pressure inside spray dryer during the process [mbar].

Model-Based Problem Analysis-Single Variable Impact
The predictive models were applied to explore the influence of input variables on predicted FPF and ED endpoints. Relationships between input and output variables are presented in Figure 5. Figure 5A shows that changing the ratio of API to excipient content strongly impacts ED and FPF values. Increasing the API content in the liquid feed decreases both the FPF and ED. The steepest descent of ED is observed when API content is in the range of 50 to 72.5% ( Figure 5A). Figure 5B presents the influence of the total concentration of compounds dissolved in feed solution on the final particle characteristics. It is observed that lower concentrations of feed solution yield higher FPF and ED values. Ethanol to water ratio in solvent applied in the process has a small impact on FPF value while predicted ED is significantly lower for feed solvents with lower ethanol content ( Figure 5C). The temperature of the air entering spray drying chamber does not affect ED at all ( Figure 5D). On the contrary, low and high temperatures positively impact FPF values ( Figure 5D).

Analysis of
Appl. Sci. 2020, 10, x 9 of 14 Figure 5. The relationship between ED and FPF predicted by models and numerical value of input variables: (A) API to excipient ratio in feed solution; (B) the total concentration of substances dissolved in feed solution; (C) ethanol to water ratio in solvent applied in the process; (D) the temperature of the air entering the spray drying chamber. FPF-fine particle fraction; ED-emitted dose.

Model-Based Problem Analysis-Multi-Variables Impact
Influence of the two input factors on the endpoints (FPF, ED) was analyzed by plotting threedimensional graphs. Of a plethora of possible combinations, we have selected the most significant ones. Figure 6A presents the relationship between ethanol content in feed solvent, the total concentration of the solution, and predicted FPF value. It was observed that both low concentration of the feed solution and low ethanol content positively influence FPF values. Figure 6B shows the relationship between FPF predicted by the model, inlet air temperature during spray drying process, and API to excipient ratio. It can be noted that FPF value is higher for formulations containing lower API to excipient ratio. At the same time, inlet air temperature influences FPF in a non-linear manner. It appears that higher inlet air temperature is beneficial in case of formulations with high API excipient content while lower inlet air temperature results in higher FPF in case of formulations with lower API content. The influence of individual variables on the predicted ED value by the model is shown in Figure 7. It can be observed that decreasing API to excipient ratio positively influence predicted ED, and this trend is strong in the case of spray drying solution with a lower content of dissolved ingredients ( Figure 7A). Figure 7B shows the influence of the inlet air temperature and solution concentration on the predicted ED. Decreasing amount of feed solution concentration positively influences ED, whereas high and low inlet air temperatures seem to have a small positive impact on predicted ED.

Model-Based Problem Analysis-Multi-Variables Impact
Influence of the two input factors on the endpoints (FPF, ED) was analyzed by plotting three-dimensional graphs. Of a plethora of possible combinations, we have selected the most significant ones. Figure 6A presents the relationship between ethanol content in feed solvent, the total concentration of the solution, and predicted FPF value. It was observed that both low concentration of the feed solution and low ethanol content positively influence FPF values. Figure 6B shows the relationship between FPF predicted by the model, inlet air temperature during spray drying process, and API to excipient ratio. It can be noted that FPF value is higher for formulations containing lower API to excipient ratio. At the same time, inlet air temperature influences FPF in a non-linear manner. It appears that higher inlet air temperature is beneficial in case of formulations with high API excipient content while lower inlet air temperature results in higher FPF in case of formulations with lower API content. The influence of individual variables on the predicted ED value by the model is shown in Figure 7. It can be observed that decreasing API to excipient ratio positively influence predicted ED, and this trend is strong in the case of spray drying solution with a lower content of dissolved ingredients ( Figure 7A). Figure 7B shows the influence of the inlet air temperature and solution concentration on the predicted ED. Decreasing amount of feed solution concentration positively influences ED, whereas high and low inlet air temperatures seem to have a small positive impact on predicted ED.

Discussion
The methodology applied during the development of evolutionary models stays in accordance with the Cross Industry Standard Process for Data Mining (CRISP-DM) protocol [27]. Therefore, the high quality of the models is assured. The obtained errors, below 8.3% for FPF and 8.2% for ED, in the 10-fold cross validation protocol suggests that the developed models exhibit significant extrapolation ability. With the latter confirmed, the models for ED and FPF could be used for prospective development of pulmonary drug delivery dosage forms, with models acting as decision support systems (DSS). This research demonstrates that the above presented models perfectly balance predictability with transparency [28,29]. Moreover, the GP derived models are transparent and easily validated (Equations (4) and (5)).
The model analysis suggests the highest impact of API to excipient ratio ( Figure 5A) on the final aerosolization properties. These findings are in accordance with work of Bosquilion et al. [30], who observed higher ED values for spray-dried powders with lower API content. Moreover, positive impact on the FPF value with increasing amount of excipient was reported by Corrigan et al. [31] for spray-dried particles composed of salbutamol sulfate and lactose. It was also noted that decreasing total feed concentration resulted in the higher FPF. Chew et al. [32] observed influence of feedstock concentration on particles surface characteristic and powders aerodynamic properties. Low feedstock concentration led to more corrugated particles and resulted in the higher FPF, which corresponds with our analysis (Figures 5B and 6A). Ethanol content in the spray drying feed rate was also reported as a factor influencing particles morphology and ED [33], which confirms the validity of the selection of this variable by evolutionary algorithm (Equations (4) and (5), Figures 5C and 6A).
The limitations and the possibility to extend the presented model should also be pointed out. First of all, the discussed effects are valid for the substances applied in the experimental setup; therefore, in the case of other drugs and excipients it may differ in magnitude and direction. There is a possibility that increasing the other API concentration may result in higher ED or FPF experimental value, because feedstock and concentration effects are dependent on chemical-physical properties of the API, such as cohesiveness, hygroscopicity, and solubility. The same is with different spray drying setup including the type of apparatus, nozzle, and mechanism of aerosol generation, they may modify temperature impact on ED and FPF. Despite some limitations on the direct use of proposed models, it seems that the presented methodology could help in the development of optimal pharmaceutical formulation and understand underlying phenomena. Presented models could be improved and expanded in the future whereas methodology and workflow may remain unchanged.

Conclusions
Aerosolization properties of inhalable formulations are crucial for therapy effectiveness. Development of inhalation dry powders is currently based on a trial-and-error approach. The current study presents a way of developing models applying genetic algorithms, capable of predicting final powder aerosolization behavior based on initial spray drying conditions. Developed models were used for analysis of variable influence on the properties of the particles and delivered few insights about possible mechanisms governing selected endpoints. Moreover, the models can be applied as a predictive tool in terms of optimization or process control. Considering current trends in pharmaceutical technology reflected in the guidelines for industry, the presented approach can fulfill the requirements for better processes understanding and in silico design of the final product quality at an early development stage.