Data-Driven Modeling Methods and Techniques for Pharmaceutical Processes

Dong, Yachao; Yang, Ting; Xing, Yafeng; Du, Jian; Meng, Qingwei

doi:10.3390/pr11072096

Open Access Review

by Yachao Dong ¹,*, Ting Yang ¹, Yafeng Xing ¹,², Jian Du ¹ and Qingwei Meng ²

¹ Institute of Chemical Process Systems Engineering, School of Chemical Engineering, Dalian University of Technology, Dalian 116024, China

² State Key Laboratory of Fine Chemicals, School of Pharmaceutical Science and Technology, Dalian University of Technology, Dalian 116024, China

* Author to whom correspondence should be addressed.

Processes 2023, 11(7), 2096; https://doi.org/10.3390/pr11072096

Submission received: 6 June 2023 / Revised: 10 July 2023 / Accepted: 12 July 2023 / Published: 13 July 2023

(This article belongs to the Special Issue Machine Learning and Data-Driven Techniques for Complex Industrial Processes)

Abstract

As one of the most influential industries in public health and the global economy, the pharmaceutical industry is facing multiple challenges in drug research, development and manufacturing. With recent developments in artificial intelligence and machine learning, data-driven modeling methods and techniques have enabled fast and accurate modeling for drug molecular design, retrosynthetic analysis, chemical reaction outcome prediction, manufacturing process optimization, and many other aspects in the pharmaceutical industry. This article provides a review of data-driven methods applied in pharmaceutical processes, based on the mathematical and algorithmic principles behind the modeling methods. Different statistical tools, such as multivariate tools and Bayesian inference, and machine learning approaches, i.e., unsupervised learning, supervised learning (including deep learning) and reinforcement learning, are presented. Various applications in pharmaceutical processes, as well as the connections between statistics and machine learning methods, are discussed while introducing the different types of data-driven models. Afterwards, two case studies, including dynamic reaction data modeling and catalyst-kinetics prediction of cross-coupling reactions, are presented to illustrate the power and advantages of different data-driven models. We also discuss current challenges and future perspectives of data-driven modeling methods, emphasizing the integration of data-driven and mechanistic models, as well as multi-scale modeling.

Keywords: data-driven modeling; machine learning; multivariate tools; pharmaceutical processes; process systems engineering; modeling and optimization

1. Introduction

The pharmaceutical industry is one of the most influential industries in public health and the global economy. A flowsheet of the drug manufacturing process [1] is shown in Figure 1. The general pharmaceutical production process comprises the conversion of materials in different stages: sequentially from raw materials, to active pharmaceutical ingredients (APIs), to formulation, and finally the target drug product. In most cases, several production routes are available for manufacturing the same product. For the essential API and excipient production, multiple reaction and separation steps might be necessary. Traditionally, the development of drugs and their manufacturing processes has been a long and risky task. However, several global pandemic outbreaks and advances in artificial intelligence technologies, such as chatbots, communication techniques, and high-speed algorithms, have pushed the industry toward a more efficient development paradigm.

In recent years, increased computational power and continued advances in automation hardware have rapidly improved laboratory efficiency and the ability to quickly access a wide range of data in the pharmaceutical process [2,3]. On this basis, various high-throughput automation technologies have been developed for use in different stages of the pharmaceutical process, particularly for finding routes for the synthesis of new drug molecules. Santanilla et al. [4] reported automation-friendly palladium-catalyzed cross-coupling reactions in dimethyl sulfoxide at room temperature. They achieved nanomole-scale modeling for several continuous variables by combining robotics with mass spectrometry-based high-throughput analysis techniques to perform more than 1500 chemical experiments in a day. Perera et al. [5] developed an automated flow-based synthesis platform capable of nanomole-scale reaction screening and micromole-scale synthesis. Such automated workflows speed up the design of drug molecules and the development of robust synthetic routes. Burger et al. [6] invented a mobile robot to find suitable photocatalysts for hydrogen production reactions in water, and this robot autonomously performed 688 experiments in 8 days. These works extend the development paradigm from the traditional instrumentation automation to researcher automation. The high-quality datasets generated by these high-throughput techniques and automated experimental platforms provide the basis for subsequent computational and modeling analysis.

In the pharmaceutical process, the development of detailed knowledge-driven models is expensive and time-consuming. To overcome this challenge, data-driven models based on datasets of high quality have been utilized to understand the internal mechanisms of the process, such as optimizing operating conditions and quantifying the impact of different process parameters. Design of experiments (DoE) methodology, one of the most fundamental and widely used data-driven methods, is capable of obtaining information while minimizing the number of experiments, and the companion response surface methodology (RSM) is a modeling approach based on mathematical and statistical techniques for fitting experimental data to polynomial equations [7,8]. Wang et al. [9] developed a generic modeling workflow, using DoE methodology combined with a data-driven model. By developing these statistical models, researchers achieved prediction of the operating window period, reaction optimization, and exploration of the effect of process parameters on reaction performance. The pharmaceutical industry is now expected to deliver products more consistently and at a lower cost, so efficient and stable process monitoring and analysis systems are playing an increasingly important role. On the side of process monitoring and operation, process analytical technology (PAT) has been used for the control of product quality in the pharmaceutical industry in recent decades. Singh et al. [10] proposed a model-based framework for the rapid design and evaluation of PAT systems to quantify the non-linear relationship between process parameters and final product quality. Chemical imaging has found various pharmaceutical applications of quality control on an industrial level [11]. Applying PAT to the pharmaceutical process can help improve process understanding, reduce waste and scrap, shorten production cycles, and enable real-time batch release.

With the rapid advances in machine learning algorithms and the increasing availability of large datasets, the paradigm of machine learning methods combined with big data has led to many new laboratorial and industrial applications [12]. Although many of these methods for research and development in the pharmaceutical and biotechnology industries are still in their early stages, the revolutionary advances brought about by machine learning methods have been demonstrated in a wide range of areas, such as drug molecular design [13,14], retrosynthetic analysis [15,16] and chemical reaction outcome prediction [17].

In this article, we present an overview of the data-driven modeling approaches in pharmaceutical processes. These modeling methods in the literature can be broadly categorized from statistical and machine learning perspectives, and, therefore, the paper is organized accordingly. For the statistical modeling methods, the fundamental idea and pharmaceutical usage of DoE tools, Bayesian inference, Monte Carlo simulation, hidden Markov process, etc., are discussed. Following a brief description of the linkage between statistics and machine learning methods, we visit different types of machine learning methods and their pharmaceutical development applications, including unsupervised learning, supervised learning, deep learning and reinforcement learning. Two cases are presented to show the application of multiple data-driven methods for dynamic modeling of homogeneous organic reactions, catalyst prediction and rate constant prediction. Finally, we discuss the current challenges and future perspectives of data-driven methods for pharmaceutical processes.

2. Process Modeling Tools Based on Statistics

The adoption of a systematic and scientific approach in pharmaceutical development and manufacturing is crucial for producing drugs with high quality, flexibility and efficiency [18]. The demand for robust pharmaceutical processes has increased in recent years due to the growing need for cost-effective production, the increased complexity of drug molecules, and the stringent regulatory requirements [19]. This necessitates a better scientific understanding of the underlying mechanisms of the drug synthesis reaction and improved optimization of process configurations and parameters, with a strong emphasis on ensuring the quality, consistency, and reliability of the final product [20]. In this context, different statistical models have become powerful tools for supporting the design, optimization, and control of pharmaceutical processes. While no model can perfectly represent the real process, some mathematical methods based on observable characteristics are good approximations of reality. Process modeling uncertainty arises from both the lack or imperfection of knowledge about the process (often referred to as epistemic uncertainty) and inherent process variability (also known as aleatory uncertainty) [21]. To quantitatively characterize such uncertainty, researchers use statistical methods to understand and analyze complex processes, leading to improved development and production efficiency [22,23].

2.1. DoE and Multivariate Tools

In pharmaceutical manufacturing, understanding the effect of various factors and optimizing them are essential for improving efficiency. Univariate or trial-and-error methods can be used to find the optimal level for each factor, but these approaches have several limitations, such as high resource and time consumption [24,25]. In contrast, multivariate approaches can be used to analyze multiple variables simultaneously, and various multivariate tools are available for pharmaceutical process development and innovation [26,27]. The design of experiments (DoE) approach is considered an efficient way to understand the relationship between factors and response variables [28]. DoE enables systematic determination of the operational parameters for a set of experiments while ensuring that the maximum amount of information is obtained. DoE can be used to optimize the levels of key variables, contributing to efficient scale-up and to process validation for process stability.

DoE methods have found broad applications in many pharmaceutical processes. Gardner et al. [29] developed a high-throughput platform for selecting different pharmaceutical candidates and products. They used experimental design methods to test different conditions and determine the combinatorial testing conditions for robotic dispensing and sample harvesting. Patel et al. [30] developed a solid self-nanoemulsifying drug delivery system (S-SNEDDS) that optimizes the three-component system using DoE. Their optimization led to an increase in the oral bioavailability of the poorly soluble antiretroviral protease inhibitor, Lopinavir. Hsueh et al. [31] carried out DoE in a predicted model to solve difficulties in employing full-scale experiments with multiple variables in process characterization of recombinant protein production.

The selection of design methodology is an important decision for chemical researchers to make, so that the requirements of factor-level experiments are satisfied and experimentation efficiency is improved. Full factorial design is utilized to examine main and interactive effects when there are a limited number of factors involved [32,33]. When a large number of factors need to be studied, fractional factorial design [34] is commonly implemented for determining the impactful factors at an early stage. Furthermore, the Plackett–Burman design and Taguchi design enable researchers to identify the most significant variables, which are utilized to optimize manufacturing processes such as tablet compression and drug dissolution [35,36].
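As a minimal illustration of the designs named above, the following Python sketch enumerates a full factorial design and then extracts a half fraction of a two-level, three-factor design using the defining relation I = ABC. The factor names and levels are hypothetical and serve only as an example; the sketch is not taken from the cited studies.

```python
from itertools import product


def full_factorial(levels):
    """Return every combination of the given factor levels as a list of dicts."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]


def half_fraction_three_factors():
    """Two-level half fraction (2^(3-1)) in coded units, keeping runs with A*B*C = +1."""
    return [run for run in product((-1, 1), repeat=3)
            if run[0] * run[1] * run[2] == 1]


# Hypothetical factors and levels, for illustration only.
levels = {
    "temperature_C": [40, 60],
    "catalyst_mol_pct": [1.0, 2.5],
    "residence_time_min": [5, 10, 20],
}
design = full_factorial(levels)
fraction = half_fraction_three_factors()
print(len(design), "full factorial runs")
print(len(fraction), "runs in the half fraction")
```

The half fraction halves the number of runs at the cost of aliasing each main effect with a two-factor interaction, which is one reason such designs are typically reserved for early screening.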

The classical DoE methods are effective when combined with statistical estimation and polynomial regression models, a combination termed response surface methodology (RSM) [7,8,37]. Central composite design (CCD), a frequently used design in RSM, is particularly useful for fitting full quadratic models to data and has been applied to optimize fermentation medium [38], separate bioderived acids from rejected stream [39] and improve chemical modification technique [40]. Box–Behnken design (BBD), though more restrictive than CCD, is more efficient in terms of reducing the number of experiments required. It has been effectively utilized in pharmaceutical processes to optimize both process parameters and formulation variables [41,42,43]. Designs that cover the factor space only sparsely may fall short when building more complex models, such as artificial neural networks (ANNs), because the factor ranges are insufficiently sampled. To overcome this issue, space-filling designs have been developed, which distribute design points uniformly across the entire range of each factor. Latin hypercube sampling (LHS) [44], uniform design (UD) [45], and Hammersley sequence sampling (HSS) [46] are among the most commonly utilized space-filling designs. These methods have demonstrated high effectiveness in addressing the limitations of earlier approaches, although they often require more experiments. To deal with experimental conditions and output values that change with time, DoE and RSM have each been extended in methodology, as shown in Figure 2. Klebanov and Georgakis [47] introduced the design of dynamic experiments (DoDE) methodology, an effective way to incorporate time-dependent input variables. Since then, a new and powerful data-driven model structure, dynamic response surface methodology (DRSM) [48,49], has been developed, which captures the intricate relationship between process inputs (whether time-variant or time-invariant) and time-resolved output variables. For instance, DRSM can model the concentration profiles acquired from chemical reaction experiments sampled at different time intervals. By incorporating time-varying inputs and outputs in these models, researchers can obtain a more comprehensive understanding of the complex dynamics of the system [9,50,51,52,53,54].
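To make the idea of a space-filling design more concrete, the sketch below implements a basic Latin hypercube sample: each factor range is cut into as many equal strata as there are runs, and every stratum is used exactly once per factor. The factor ranges and the function name are hypothetical, and this is only one simple variant of LHS, not the specific procedure of the cited works.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each factor has one point in every stratum."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across factors
        width = (high - low) / n_samples
        # One uniformly placed point inside each stratum of this factor.
        columns.append([low + (s + rng.random()) * width for s in strata])
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


# Hypothetical ranges: temperature in degrees Celsius and catalyst loading in mol%.
bounds = [(20.0, 80.0), (0.5, 5.0)]
points = latin_hypercube(5, bounds)
for point in points:
    print(point)
```

Unlike a factorial grid, the number of runs here is chosen freely and does not grow with the number of factors, which is convenient when the data are later used to train flexible models.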

2.2. Bayesian Inferences

Bayesian inference is an important tool in statistics that has recently been revisited frequently and found its way into the field of pharmaceutical research [55]. Different from tools like DoE, Bayesian inference aims to describe variability with limited samples by incorporating prior knowledge about the parameters and updating these parameters as new data become available. Although Bayesian inference employs the same statistical models as many other analysis methods, it represents a powerful tool for tackling various questions in drug synthesis and process production (Table 1). This is especially important given the vast amount of sequence data generated from long-standing and complex inquiries in the pharmaceutical process. Here, we describe the statistical theorem of Bayesian inference and illustrate applications for prediction, optimization and process monitoring in the field.

The fundamental principle of Bayesian inference is Bayes’ theorem [65], which states that the posterior probability of a hypothesis (H) given some observed data (D), P(H|D), equals the likelihood of the data given the hypothesis, P(D|H), multiplied by the prior probability of the hypothesis, P(H), and divided by the marginal likelihood of the data, P(D):

P(H|D) = P(D|H) × P(H)/P(D)

Bayes’ theorem is a powerful tool for updating prior beliefs into posterior beliefs about the parameters of a statistical model, based on observed data. The prior beliefs are typically expressed as probability distributions, and the process of updating these beliefs using Bayes’ theorem is known as Bayesian inference. This approach allows for the incorporation of prior knowledge or beliefs about the model parameters into the analysis and provides a flexible and robust framework for addressing a wide range of statistical problems. The principle of Bayesian inference is illustrated in Figure 3.
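A small numerical example may help to fix the roles of the prior, likelihood and marginal likelihood. The sketch below applies Bayes' theorem to a discrete set of hypotheses; the scenario (a batch that is either in or out of specification, and an assay that raises a flag) and all probabilities are invented for illustration.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a discrete set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}  # P(D|H) * P(H)
    evidence = sum(joint.values())                           # P(D)
    return {h: joint[h] / evidence for h in joint}           # P(H|D)


# Hypothetical numbers: prior belief about a batch and P(assay flag | H).
priors = {"in_spec": 0.95, "out_of_spec": 0.05}
likelihoods = {"in_spec": 0.02, "out_of_spec": 0.90}

post = posterior(priors, likelihoods)
print(post)
```

Although the prior strongly favors an in-specification batch, a single flagged assay shifts most of the posterior probability to the out-of-specification hypothesis, because the flag is far more likely under that hypothesis.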

The Bayesian approach is widely used in various areas of the pharmaceutical industry due to its ability to quantify uncertainty. This approach has several advantages especially when dealing with noisy, complex, or difficult-to-evaluate objective functions. In virtual screening, Bayesian inferences have been used to predict drug solubility [66] and protein–protein binding sites [56] and optimize the selection of target inhibitors from large libraries of potential drug candidates [57,58]. Peterson et al. utilized the Bayesian approach to derive the design space (defined by ICH Q8) by incorporating the uncertainty associated with the model’s unknown parameters and leveraging the posterior predictive distribution to incorporate the correlation structure of the data [19,57]. Considering the large number of complex organic reactions involved in drug synthesis, the Bayesian approach is used in the reaction mechanism inference. Li et al. [59] developed a Bayesian chemical reaction neural network (B-CRNN) to reconstruct chemical kinetic models from data and perform uncertainty quantification on the identified reaction pathways. Cohen and Vlachos [61] developed a software library (Chemical Kinetics Bayesian Inference Toolbox, CKBIT) that is made available for users to estimate dynamic parameters and quantify uncertainties by combining it with other Python open-source packages. Bayesian inferences have been employed to optimize the selection of reaction routes to maximize the information gained from the screen while minimizing the number of experiments. When identifying impurities, Melanson et al. [63] combined the Bayesian statistical approach with LC-MS/MS, qNMR, and mass balance methods to improve the accuracy of peptide measurements. Using mass balance results as prior knowledge, they validated a candidate reference material for angiotensin II to achieve excellent agreement with the final purity value.

Nevertheless, Bayesian inferences can be susceptible to overfitting when the model is overly complex or the data are limited [67]. A cautious selection of priors, model structure, and computational resources is necessary to ensure robust and reliable results. Furthermore, the practical application of Bayesian optimization requires a thorough understanding of the theoretical basis of the Bayesian framework, the optimization algorithm used, and the specific problem domain to achieve optimal results.

2.3. Other Statistical Tools

Besides DoE and Bayesian inferences, there are other statistical tools that have been adopted by the pharmaceutical research and development community. For example, semiparametric regression models with splines or other functions have been found very useful for reaction analysis [9]. Here, we further review two additional statistical tools, Monte Carlo simulation and the hidden Markov model.

In some complex multivariate problems, Monte Carlo methods are utilized to obtain solutions for a variety of useful response surface models. A common approach involves formulating a distinct model structure for each type of response and sampling from a posterior predictive distribution to optimize the expected value of a quantity of interest. Mashayekhi et al. [68] modeled Silymarin and its ethylene glycol derivative compounds in the gas phase by applying DFT and calculated solvation free energies and association free energies by Monte Carlo simulation and perturbation methods. The application of ethylene glycol derivatives was explored to enhance the solubility of Silymarin, which is beneficial to its development as a medication for hepatoprotection. Bodnarchuk et al. [69] used the grand canonical Monte Carlo method to solve the problem of multiple interacting waters fluctuating during a simulation. When the probability distribution of interest is not easily sampled, Markov chain Monte Carlo (MCMC) [70] can be a powerful tool. This technique involves using a Markov chain to generate a sequence of random samples from the probability distribution. After generating a large number of samples, the distribution of these samples approximates the desired distribution. MCMC can be combined with Bayesian inference, with recent applications in the pharmaceutical industry [71,72,73]. The Hamiltonian Monte Carlo (HMC) method has been proposed to reduce the random walking behaviors observed in earlier variants of MCMC and facilitate the analysis of drug-resistance mutations (DRMs) [74].
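The MCMC idea described above can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC variants. The target used here, an unnormalized Gaussian with mean 2.0 and standard deviation 0.5, is an arbitrary stand-in for a posterior distribution; the step size, chain length and burn-in are likewise assumptions made for the example.

```python
import math
import random
import statistics


def metropolis(log_target, x0, n_samples, step, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_target(x):
    """Unnormalized log density of a Gaussian with mean 2.0 and sd 0.5 (assumed)."""
    return -0.5 * ((x - 2.0) / 0.5) ** 2


samples = metropolis(log_target, x0=0.0, n_samples=20000, step=0.5)
kept = samples[2000:]  # discard the burn-in portion of the chain
print(statistics.mean(kept), statistics.stdev(kept))
```

Only ratios of the target density enter the acceptance step, so the normalizing constant, which corresponds to the marginal likelihood in a Bayesian setting, never has to be computed.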

Hidden Markov models (HMMs) are statistical models that describe the underlying process generating a sequence of observations, and they are useful for expressing the dynamics of a process in terms of hidden states and observed outputs. Zhang et al. [75] presented a new online monitoring framework, constructing an independent hidden Markov model for each experimental condition to facilitate modifications. HMMs have also been applied in molecular dynamics simulations to determine the relative conformational sampling and protein dynamics of inhibitors within enzymes for the design of new antimalarials, as demonstrated by Yang et al. [76]. Emdadi and Eslahchi introduced a new approach for feature selection using an HMM and a multinomial mixture model to analyze single nucleotide mutation data [77]. Compared to the ensemble feature selection method (EFS), their proposed method showed superior performance in drug response prediction. These applications demonstrate the versatility and utility of HMMs in the field of pharmaceutics and chemistry.
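As an illustration of how hidden states and observed outputs interact, the following sketch computes the likelihood of an observation sequence with the forward algorithm. The two hidden states ("normal" and "fault"), the two observation symbols and all probabilities are hypothetical and are not drawn from the cited monitoring work.

```python
def forward_likelihood(observations, start, trans, emit):
    """Probability of an observation sequence under an HMM (forward algorithm)."""
    states = list(start)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


# Hypothetical process with a hidden operating state and a coarse sensor reading.
start = {"normal": 0.9, "fault": 0.1}
trans = {"normal": {"normal": 0.95, "fault": 0.05},
         "fault": {"normal": 0.10, "fault": 0.90}}
emit = {"normal": {"low": 0.8, "high": 0.2},
        "fault": {"low": 0.3, "high": 0.7}}

likelihood = forward_likelihood(["low", "high"], start, trans, emit)
print(likelihood)
```

The recursion sums over hidden-state paths implicitly, so its cost grows linearly with the sequence length rather than exponentially.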

2.4. From Statistical Tools to Machine Learning

The advent of modern machine learning (ML) techniques has significantly broadened the repertoire of tools available to chemists and engineers, allowing for data-driven modeling that substantially reduces the reliance on expert analysis or chemical intuition. Machine learning algorithms, such as support vector machine (SVM) [78,79,80] and random forest (RF) classifiers [81,82], are partially based on statistical theory; however, they do not rely heavily on traditional statistical analysis, such as confidence levels and intervals, p-values, power, etc. Instead, machine learning tools aim to identify the most impactful features and develop a model that can identify relationships and patterns for either classification or regression purposes. Statistical methods typically rely on well-defined mathematical assumptions and probability distributions, which can limit their flexibility. ML models, in contrast, are more flexible and can handle complex relationships without strong assumptions. Indicators such as the area under the curve (AUC) of the receiver operating characteristic (ROC) curve are more frequently used to test the predictive ability of ML methods. These machine learning methods are potentially useful for various purposes, especially for facilitating a more balanced and improved identification of potential drug targets and prediction of drug toxicity and effectiveness. Unlike traditional linear and nonlinear regression techniques, ANNs can capture complex interactions and hidden non-linearities in data, making them more suitable for modeling the multi-dimensional relationships that exist in drug processes [83]. In addition, the ability to combine machine learning with traditional statistical analysis, in order to provide more explicit explanations for model outputs, has been found important for pharmaceutical chemists and engineers. For example, Bayesian inference has been integrated with neural networks and has been used for uncertainty quantification of kinetic models. Thus, the boundaries between statistical tools and ML models are becoming more and more blurred in practice.
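Since the AUC is mentioned above as a common indicator of predictive ability, a short sketch of its computation may be useful. It uses the rank interpretation of the ROC AUC, namely the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made up for the example.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier output: 1 = active compound, 0 = inactive compound.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of samples, so it is meant for clarity rather than for large datasets, where a sorting-based implementation would be preferable.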

One of the most studied topics in pharmaceutical processes, where ML techniques are used to provide explicit revelations, is the understanding and optimization of chemical reaction networks in organic synthesis [84,85,86,87]. With the curation of extensive reaction databases such as the United States Patent and Trademark Office (USPTO), Reaxys, and SciFinder, researchers now have access to millions of tabulated examples of chemical reactions. Leveraging the power of ML, these vast data repositories can be mined to yield valuable insights into the underlying chemical reaction conditions, pathways, and dynamics of reaction systems. For example, Granda et al. [88] controlled a reaction system by a machine learning algorithm to explore the space of chemical reactions, and they assessed the reaction in real time using nuclear magnetic resonance and infrared spectroscopy. While ML is transforming the field of drug organic synthesis, there are several remaining challenges and opportunities [89]. In the next section, we will review the machine learning tools applied in the pharmaceutical industry in detail.

3. Process Modeling Tools Based on Machine Learning

As one of the fastest-growing disciplines in artificial intelligence, machine learning has made breakthroughs in processing various types of complex data [12]. The combination of big data and machine learning has led to a number of new applications in research and industry. In pharmaceutical manufacturing, where running large numbers of experimental reactions is time-consuming and resource-intensive, various machine learning-based modeling tools have shortened production cycles, reduced research costs, lowered labor requirements, and enabled product quality improvement. Machine learning methods can be broadly divided into unsupervised learning, supervised learning, and reinforcement learning, whose applications in pharmaceutical development will be discussed in this section; deep learning, as a special case of supervised learning with complex network structure, will be reviewed separately due to its extensive applications.

3.1. Unsupervised Learning

Generally speaking, unsupervised learning is a type of machine learning method based solely on feature data (which can be viewed as the independent variables in traditional statistical analysis), without the use of labels (which can be regarded as the dependent variables). For many pharmaceutical and biopharmaceutical reactions, it is difficult to obtain large amounts of accurately labeled data due to the high time and labor costs [90]. In this context, unsupervised learning methods, in which the input data are not labeled, are essential, especially in the early stage of development.

In response to the problem of sparsely labeled data, unsupervised pre-training models can capture generic features in large amounts of unlabeled data, thereby improving the accuracy of downstream tasks [91]. Gómez-Bombarelli et al. [92] proposed a framework for de novo molecular design, using a variational autoencoder based on unsupervised methods to transform discrete molecular SMILES (Simplified Molecular Input Line Entry System) strings into multidimensional continuous representations, thereby generating new molecules through continuous optimization. Singhal et al. [93] proposed an improved K-means clustering algorithm based on PCA and Mahalanobis distance for clustering multivariate time-series data from both batch and continuous chemical systems. Zheng et al. [94] proposed a new unsupervised data mining method to construct a fault diagnosis model. With this data mining method, the different states of normal operation and faults in a chemical process can be effectively separated to create a database of markers. Winter et al. [95] extracted important molecular descriptors through unsupervised training on large chemical molecular structure datasets. In this way, their model captures the full range of information contained in discrete chemical molecules of varying sizes in the dataset. Zhang et al. [96] proposed a multi-task learning framework, called BMTL-BERT, which is pre-trained on label-free molecular data through large-scale unsupervised learning to mine the information in SMILES strings for downstream tasks such as drug molecule property prediction. Schwaller et al. [97] demonstrated that Transformer neural networks can learn atom-mapping information between products and reactants without supervision or human labeling. Based on the Transformer’s attention weights, they proposed a reaction mapper to extract organic chemistry grammar from unlabeled reaction sets. The excellent performance of unsupervised pre-trained models on unlabeled datasets in molecular property prediction tasks has made them one of the most popular new trends [98,99].
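To show what learning without labels looks like in its simplest form, the sketch below runs plain K-means clustering with Euclidean distance. It is deliberately simpler than the PCA- and Mahalanobis-based variant cited above, and the two-dimensional data points and the farthest-first initialization are assumptions chosen to keep the example deterministic.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, n_iter=100):
    """Plain K-means with a deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old centroid if a cluster loses all of its members.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical two-dimensional feature vectors forming two visible groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

No label information is used at any step; the grouping emerges only from distances between feature vectors.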

3.2. Supervised Learning

Supervised learning methods use a set of labeled data to learn the mapping relationship from input to output variables, and then apply this mapping relationship to unknown data for the purpose of classification or regression. With the flexible structure and the outstanding capability, random forests and support vector machines (concepts shown in Figure 4) are currently two of the most widely used classes of supervised learning methods, and therefore, we will focus on these two methods in this subsection. Deep learning methods, with complex neural network structures, also belong to the supervised learning category, and they will be visited in the next subsection.

Random forest is an ensemble method that combines multiple decision trees to train on and predict samples, following the idea of ensemble learning [100]. In a pioneering work, Ahneman et al. [101] demonstrated that the random forest algorithm outperformed linear regression analysis for the task of predicting the yield of C-N cross-coupling reactions. They used computationally derived molecular, atomic and vibrational descriptors, together with a few other feature descriptors, and significantly improved the performance of their model. Sandfort et al. [5] developed a multiple fingerprint feature (MFF) as the input to the model. For subsequent tasks such as predicting reaction yield, the random forest model performed more robustly on different datasets and was less prone to overfitting. Marcou et al. [102] modeled the condition-specific feasibility of a Lewis acid catalyst and hydrophobic solvent for the Michael reaction. In their work, random forest and SVM models were shown to perform better than naive Bayes.

First used as a generalized linear classifier for binary classification problems, SVM uses a shallow linear separation model. When the data are not linearly separable in a low-dimensional space, SVM greatly extends its applicability by implicitly mapping the data into a high-dimensional feature space using a kernel function [103]. In recent work, Haywood et al. [104] demonstrated the performance of support vector regression models for the task of predicting chemical reaction yields. Based on their models, they showed that structure-based descriptors have better applicability than quantum chemistry-based descriptors. SVM is also one of the most popular methods for predicting molecular properties. Fröhlich et al. [105] reported an SVM model using kernel functions that take molecular structure into account and applied it to tasks such as predicting human intestinal absorption (HIA). Applications of SVM methods also include prediction of molecular physicochemical properties [106,107], toxicity prediction [108,109] and many other aspects in the pharmaceutical industry.
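The implicit mapping performed by a kernel function can be checked numerically for a simple case. The sketch below compares a degree-2 polynomial kernel, evaluated directly on two-dimensional inputs, with the ordinary dot product of an explicit three-dimensional feature map. The kernel choice and the test vectors are assumptions made for illustration; practical SVM applications, including those cited above, may use other kernels.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_features(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Arbitrary example vectors.
x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, z)
explicit = dot(poly2_features(x), poly2_features(z))
print(implicit, explicit)
```

Both routes give the same value, but the kernel never constructs the higher-dimensional vectors, which is what keeps the approach tractable when the feature space is very large.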

3.3. Deep Learning

Unlike traditional machine learning methods, deep learning models use artificial neural networks as their basic architecture and stack multiple layers of non-linear processing units to learn data representations [110]. Deep neural networks come in a variety of architectures, such as recurrent neural networks and convolutional neural networks (concepts shown in Figure 5). Their multi-layer structure allows deep learning models to use more diverse descriptors and process larger datasets [12]. With powerful learning and data processing capabilities, deep learning methods are making a significant impact in areas such as reaction outcome prediction and retrosynthetic route planning.

Considerable effort has been devoted to reaction condition optimization. Gao et al. [111] developed a multi-layer neural network model trained on a dataset of approximately 10 million reactions. The model was able to accurately predict or recommend reaction conditions, such as reagents, catalysts and temperatures, for common reactions with the same functionality. For smaller datasets, more accurate models for predicting reaction conditions have also been developed. For cross-coupling reactions, Maser et al. [112] used graph convolutional networks and gradient boosting machines to build multi-label classification models that predict reaction conditions. Furthermore, Angello et al. [113] selected an uncertainty-minimizing ensemble of Gaussian processes supplemented with a neural-network kernel component, and used this data-guided method to construct a spatial matrix of reaction conditions. They then used robots to perform automated experiments aimed at optimizing the general reaction conditions of the Suzuki–Miyaura coupling reaction.

In organic synthesis, correct prediction of reaction products enables researchers to obtain target compounds faster. Jin et al. [114] encoded the graph structure using a graph convolutional neural network to predict atom or bond configuration changes in the reaction. By enumerating all of the possible changes that result from the configuration changes in these atoms and bonds, a list of candidate products can be generated. Coley et al. [115] proposed a graph convolutional neural network for the prediction of organic reaction products. By training on a dataset of about 100,000 reactions, this model was able to correctly predict the main products with the best accuracy of 0.856 in only 100 ms. Reaction yield prediction can enable chemists to evaluate the overall yield of complex reaction routes and select the most appropriate reaction route. Schwaller et al. [116] proposed a new network architecture, using an encoder transformer model [117] combined with a regression layer to predict chemical reaction yields starting from SMILES. Different forms of deep learning models have been adopted and integrated for chemical simulation and fluid analysis, which can be used for drug design [118,119,120].
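To make the graph convolution idea concrete, the sketch below performs one message-passing step on a small molecular graph. It is a generic illustration under simplifying assumptions, not the architecture of the cited models: each atom simply averages its own feature vector with those of its bonded neighbors, and no learned weights or non-linearities are applied. The heavy-atom graph of ethanol (C-C-O) and the one-hot atom features are chosen only for this example.

```python
def message_passing_step(features, neighbors):
    """Replace each atom's features by the mean over itself and its neighbors."""
    updated = []
    for atom, feat in enumerate(features):
        group = [feat] + [features[n] for n in neighbors[atom]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated

# Heavy atoms of ethanol: atom 0 = C, atom 1 = C, atom 2 = O.
# Features are one-hot encoded as [is_carbon, is_oxygen].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

hidden = message_passing_step(features, neighbors)
```

After one step the two carbons are no longer identical: the carbon bonded to oxygen carries part of the oxygen signal, which is how such networks encode the local chemical environment that determines reactivity.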

In drug discovery, computer-assisted synthetic planning (CASP) can significantly improve development efficiency and save research costs by reducing the probability of failure in synthetic route discovery. Retrosynthetic analysis is the canonical technique used to plan the synthesis of small organic molecules. Segler et al. [17] developed a deep neural network-based template recommendation algorithm. By learning the retrosynthetic templates extracted from a chemical reaction dataset, the model establishes a mapping from product molecules to associated templates and, for a given target molecule, generates the precursors corresponding to the recommended templates. The algorithm has subsequently been widely used in multi-step retrosynthesis models due to its fast execution speed and low computational complexity. On this basis, Segler et al. [121] made a breakthrough by applying the Monte Carlo tree search (MCTS) to multi-step retrosynthesis prediction. To address the poor interpretability caused by the black-box nature of deep learning methods, Ishida et al. [122] proposed a retrosynthetic reaction prediction framework based on graph convolutional neural networks (GCN) and integrated gradients visualization. In this work, the GCN model had better predictive performance than the model based on extended-connectivity fingerprints (ECFP).

3.4. Reinforcement Learning

Reinforcement learning is an important branch of machine learning that differs from other machine learning methods in that it learns the optimal strategy through trial-and-error interaction with the environment, with the aim of maximizing the cumulative reward and achieving a specific target. Its main strength is its ability to balance the exploration of uncharted space with the exploitation of knowledge learned in earlier steps.
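The exploration-exploitation balance can be illustrated with an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. This is a hedged sketch: the three "arms" with success probabilities 0.1, 0.5 and 0.9, the value of epsilon and the function name are all assumptions for illustration, and real applications use far richer state and action spaces.

```python
import random

def epsilon_greedy_bandit(true_means, steps=3000, epsilon=0.1, seed=0):
    """Learn reward estimates for each arm by trial and error."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm, even if it currently looks poor.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total_reward = epsilon_greedy_bandit([0.1, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
```

With epsilon set to zero the agent can lock onto the first arm that ever pays off; the small amount of random exploration is what lets it discover the better option.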

Reinforcement learning has found various applications in pharmaceutical research and development, including organic synthesis planning and synthetic biological design. The evaluation of compounds in CASP search trees can be facilitated by reinforcement learning. Schreck et al. [123] incorporated reinforcement learning strategies into an existing retrosynthesis planning framework and demonstrated the applicability of reinforcement learning to the design and evaluation of synthesis paths. Wang et al. [124] combined conditional prediction models with reinforcement learning methods to propose an improved MCTS. In this work, value networks trained by MCTS reinforcement were used to evaluate how easy or difficult each compound is to synthesize, resulting in shorter path lengths and strategies with higher success rates. Koch et al. [125] applied MCTS reinforcement learning algorithms to biochemistry and metabolic engineering. They integrated rule ranking with MCTS reinforcement learning to predict synthetic routes from target products, microbial strains and reaction templates. Reinforcement learning used in synthetic biology can improve planning ability and sample efficiency, while increasing design diversity. Angermueller et al. [126] proposed a proximal policy optimization (PPO)-based reinforcement learning approach to biological sequence design that outperforms existing methods for tasks such as the design of DNA transcription factor binding sites and the design of antimicrobial proteins.
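The selection step at the heart of MCTS can be sketched with the standard UCT (upper confidence bound for trees) score. This is the generic textbook rule, not necessarily the exact variant used in the cited works, which may add policy priors or learned value networks. The candidate names and statistics below are hypothetical.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Mean value (exploitation) plus a bonus for rarely visited nodes."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children, parent_visits, c=1.4):
    """children maps a name to (total_value, visits); return the best name."""
    return max(
        children,
        key=lambda name: uct_score(*children[name], parent_visits, c),
    )

# Hypothetical disconnections of a target molecule with accumulated statistics.
children = {"route_a": (3.0, 4), "route_b": (0.5, 1), "route_c": (0.0, 0)}
first_choice = select_child(children, parent_visits=5)

visited_only = {"route_a": (3.0, 4), "route_b": (0.5, 1)}
second_choice = select_child(visited_only, parent_visits=5)
```

In this toy example the rule prefers the less-visited `route_b` over `route_a` even though its average value is lower, because the exploration bonus dominates while visit counts are small.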

4. Case Studies for Pharmaceutical Processes

In this section, we show the application of multiple data-driven methods to the modeling of homogeneous organic reactions. First, to investigate the dynamic modeling of complex pharmaceutical reactions, three data-driven methods, ANN, DRSM and SVM, are compared in detail on a pharmaceutical reaction case. Second, we illustrate a workflow based on a convolutional neural network to predict suitable catalysts and the rate constants of the Suzuki–Miyaura cross-coupling reaction.

4.1. Dynamic Reaction Data Modeling

Due to the difficulty of analyzing complex reaction processes and intermediate species, traditional expert analysis and manual calculations have severely limited the efficiency of drug development. In the chemical and biopharmaceutical industries, machine learning methods can be applied to high-throughput data analysis to build reaction networks, with which drug development efficiency will be significantly improved. We conducted a comparative study of three data-driven modeling approaches for the dynamic output data in organic synthesis reactions using artificial neural networks (ANNs), dynamic response surface methodology (DRSM) and support vector machine (SVM), which have been widely used in recent years.

The dataset was derived from a modeling study by Dong et al. [53]. The reaction network contains 11 substances, eight reactions, and 17 sets of experiments, each containing 13 data points. The three experimental factors are the reaction temperature, the initial concentration of substance B and the initial concentration of substance D, and the 11 output values are the concentrations of each substance, expressed as the ratio to the initial concentration of substance A (equiv.). After tuning the number of layers and neurons, the ANN model consists of three layers: an input layer of 11 neurons, a hidden layer of five neurons and an output layer of 11 neurons. In the SVM model, we used a Gaussian function as the kernel function and the iterative single data algorithm as the solver. For the DRSM model, we used an exponential time transformation, and the number of polynomials and the time constants for the different species were determined by the Bayesian information criterion. The DRSM model was developed from the case study of Dong et al. [53], while the other two models were built in this work. Due to the limited length of this article, only some of the modeling results are illustrated here, using species 6 as an example.
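The ingredients of the DRSM setup named above can be sketched as follows. This is an illustration under stated assumptions rather than the model of [53]: we assume the common DRSM form in which time is transformed as θ = 1 − exp(−t/t_c), the time dependence is expanded in shifted Legendre polynomials of θ, and the Bayesian information criterion is written in its Gaussian-error form, BIC = n ln(SSE/n) + k ln(n). The function names and the numerical values are hypothetical.

```python
import math

def exp_time_transform(t, t_c):
    """Map time t >= 0 onto theta in [0, 1) using a time constant t_c."""
    return 1.0 - math.exp(-t / t_c)

def shifted_legendre(n, theta):
    """Shifted Legendre polynomial of degree n on [0, 1], via the recurrence."""
    x = 2.0 * theta - 1.0  # map [0, 1] onto the standard interval [-1, 1]
    p_prev, p_curr = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr

def bic(sse, n_points, n_params):
    """Gaussian-error BIC: the fit term plus a penalty on model size."""
    return n_points * math.log(sse / n_points) + n_params * math.log(n_points)

# Basis values at a transformed time point (t = 1, t_c = 2, both hypothetical).
theta = exp_time_transform(1.0, 2.0)
basis = [shifted_legendre(i, theta) for i in range(4)]

# Two candidate models with the same residual error but different sizes.
bic_small = bic(sse=1.0, n_points=13, n_params=3)
bic_large = bic(sse=1.0, n_points=13, n_params=5)
```

In a selection procedure of this kind, candidate combinations of polynomial count and time constant would be fitted for each species and the combination with the lowest BIC retained, so that extra polynomials are kept only when they reduce the error enough to justify the added parameters.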

Figure 6 shows the fit for species 6 as an example. For this species, the DRSM model achieved high accuracy on all datasets, including good predictions for the test set; the SVM model fitted the training and validation sets well, but its prediction accuracy on the test set was lower. In the overall regression results, the predictions of the ANN and SVM models were more scattered, while those of the DRSM model were more stable and accurate.

Table 2 shows the running time, number of parameters and coefficients of determination (R²) of the three models in this case. As one can observe, with the relatively small dataset of this case study, DRSM had significantly better accuracy. In the dynamic modeling of pharmaceutical reactions, the reactants, intermediates and target products change dynamically with the reaction time, and the effects of time and of the other input variables on the modeling results should be treated differently. The results in Table 2 show that the ANN model could not effectively deal with time-dependent reaction data and was not suitable for the dynamic modeling of pharmaceutical reactions. The DRSM model required a longer solution time than the ANN, but it provided higher accuracy in modeling reaction dynamics. We also tested the case with more data points. With the significant increase in data volume, the running time of the DRSM model increased significantly, while the ANN and SVM models could still complete the modeling quickly, showing the computational speed advantage of the neural network and SVM models over the statistics-based DRSM approach.
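For reference, the accuracy metric used in this comparison, the coefficient of determination R², compares the residual error of a model with the error of simply predicting the mean. A minimal sketch with made-up numbers (not data from Table 2):

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]
perfect_fit = r_squared(observed, [1.0, 2.0, 3.0])    # model reproduces the data
mean_only_fit = r_squared(observed, [2.0, 2.0, 2.0])  # model predicts the mean
```

A value of 1 indicates a perfect fit and a value of 0 indicates a model no better than the mean; on held-out test data R² can even become negative, which is one reason to report it separately for training and test sets.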

The resulting data-driven model can be used to optimize the reaction conditions and identify reaction networks, as has been reported in the literature [51,54,87,127]. In addition to point predictions, statistical analysis-based models, including the DRSM model, can also provide upper and lower bounds (i.e., confidence intervals for the prediction) for a given confidence level. Due to the algorithmic limitations of machine learning models such as ANN and SVM, it is difficult to obtain confidence interval information from them. Therefore, for the case with a small dataset, statistics-based data-driven models have some advantages in interpretability and statistical analysis. On the other hand, for the case with a large dataset (with a large number of data points and data features), machine learning models have well-known advantages, which will be illustrated in the next case study.

4.2. Catalyst Kinetics Prediction of Cross-Coupling Reaction Based on Convolutional Neural Network

Transition metal-catalyzed cross-coupling reactions are one of the most efficient ways to construct carbon-carbon and carbon-heteroatom bonds. The Suzuki–Miyaura cross-coupling reaction takes place between organoboron reagents and halides or halogen-like compounds (pseudohalides) under palladium catalysis in the presence of a base. It has the advantages of mild reaction conditions and readily available starting materials, and is widely used in drug synthesis. Here, we show the development of a convolutional neural network-based model (Figure 7) for catalyst prediction and rate constant prediction for the Suzuki–Miyaura cross-coupling reaction, based on an organic reaction database. These results are newly presented in this article.

The Suzuki–Miyaura dataset was obtained using a keyword search in the Reaxys database. For data processing, we screened the information on reactant and product amounts, as well as yields and reaction times, and calculated the reaction rate coefficient k according to the information provided by the database.
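The text does not specify which rate law was used to convert the database entries into a rate coefficient. Purely as an illustration of how such a conversion could look, the sketch below assumes pseudo-first-order kinetics in the limiting reactant and treats the reported yield as the conversion X, so that k = −ln(1 − X)/t. Both the rate law and the function name are assumptions and should not be read as the procedure used in this case study.

```python
import math

def first_order_rate_constant(yield_fraction, time_h):
    """Assumed pseudo-first-order estimate: k = -ln(1 - X) / t, in 1/h."""
    if not 0.0 <= yield_fraction < 1.0:
        raise ValueError("yield_fraction must lie in [0, 1)")
    if time_h <= 0.0:
        raise ValueError("time_h must be positive")
    return -math.log(1.0 - yield_fraction) / time_h

# Hypothetical entry: 50% yield after 1 h corresponds to k = ln(2) per hour.
k_example = first_order_rate_constant(0.5, 1.0)
```

Under this assumption a quantitative (100%) yield carries no usable kinetic information, so such entries would have to be filtered out or treated separately.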

First, we built a catalyst prediction model that proposes the most promising types of catalysts (including the ligands) for a reaction from the ECFP4 fingerprints of the reactants and products. The top three catalysts were taken as the final prediction, and a prediction was counted as accurate if the catalyst reported in the corresponding literature appeared among these three. After screening, the Suzuki–Miyaura dataset contained a total of 4448 entries and 85 types of catalysts. Data augmentation was then used to increase the data volume to 8896 entries by switching the order of the two reactants. For each dataset, an 80%/10%/10% train/validation/test split was used in modeling. Table 3 shows the modeling results and parameters of the catalyst prediction model. Model 1 was the final model with the best prediction performance, and Model 2 was obtained using a slightly different CNN structure. With a top-three accuracy of 0.957 on the training set and 0.85 on the testing set, this convolutional neural network model can recommend the common catalysts for the Suzuki–Miyaura cross-coupling reaction. As mentioned earlier, the scarcity of labeled drug reaction data leads to some overfitting of the machine learning model, as the gap between training and testing accuracy indicates, but this remains within an acceptable range.
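Two steps of this workflow, the augmentation by swapping the two reactants and the top-three accuracy criterion, can be written down compactly. The sketch below is illustrative only: the record layout, the placeholder reaction strings and the score matrix are hypothetical and do not come from the dataset used here.

```python
def augment_by_swapping(records):
    """Duplicate each record with the order of the two reactants exchanged."""
    augmented = []
    for reactant_1, reactant_2, product, catalyst in records:
        augmented.append((reactant_1, reactant_2, product, catalyst))
        augmented.append((reactant_2, reactant_1, product, catalyst))
    return augmented

def top_k_accuracy(scores, true_labels, k=3):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, true_label in zip(scores, true_labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# One hypothetical record: (reactant 1, reactant 2, product, catalyst label).
records = [("aryl_halide", "boronic_acid", "biaryl", "catalyst_7")]
augmented = augment_by_swapping(records)

# Hypothetical model scores over four catalyst classes for three reactions.
scores = [
    [0.10, 0.60, 0.25, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.15, 0.20, 0.60],
]
true_labels = [1, 3, 2]
accuracy = top_k_accuracy(scores, true_labels, k=3)
```

Swapping exactly doubles the number of entries, in line with the increase from 4448 to 8896 reported above, and it encourages the model to be insensitive to the arbitrary order in which the reactants are listed.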

After obtaining the catalysts recommended by the model, the next step was building the rate constant prediction model. We first investigated the similarity of different reactions. Using the K-means method based on catalyst structure features, we performed reaction clustering for the Suzuki–Miyaura cross-coupling reactions. After filtering for catalysts and data augmentation, the 6100 entries were divided into three clusters. We then developed a rate constant prediction model for each of the three clusters and compared the results with the prediction model developed for the full dataset. As can be seen from the results in Table 4, the R² values of the models built on each cluster were all significantly higher than those of the model built on the entire dataset without clustering, demonstrating that the reaction clustering method achieved a significant improvement in the prediction of Suzuki–Miyaura cross-coupling reaction rate constants. This also inspired us to investigate the effect of the method on the kinetic prediction of other coupling reactions.
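The clustering step can be sketched with a plain implementation of the K-means (Lloyd) iteration. This is a minimal illustration under stated assumptions: the two-dimensional points stand in for catalyst structure features, the initial centroids are supplied by hand, and none of the numbers come from the dataset of this case study.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2D feature vectors forming three well-separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels, centroids = kmeans(points, [(0, 0), (5, 5), (10, 0)])
```

Once the labels are available, a separate regression model can be trained on the entries of each cluster, which is the strategy compared against a single global model in Table 4.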

5. Challenges and Future Perspectives

Data-driven modeling methods and techniques play critical roles in many aspects of pharmaceutical processes, including drug discovery, process development, and product manufacturing [128,129]. However, there are still several challenges and open questions that researchers and practitioners face in implementing and advancing these approaches, mainly in the following aspects:

Data availability and quality. Obtaining high-quality data is important for building a reliable and accurate model. In the practice of pharmaceutical processes, data might be scarce due to the difficulty in collection, or varying in terms of structure and quality. With the expanding usage of high-throughput experimentation systems, more data with high quality could be envisioned for utilization in future pharmaceutical development [2]. More effort is needed for the research community, in academia and industry, to work together and provide collective databases that could be adopted in different works.
Model generalization and interpretability. With the development of different data-driven methods, model validation within a given database itself is a well-answered question. Nevertheless, the ability to generalize well beyond the training data is crucial for real-world applications. This is especially true for pharmaceutical processes, which involve complex and nonlinear relationships. Ideally, the developed models should be robust enough when generalized to different process conditions or external factors for reliable predictions and recommendations, but that achievement might be difficult due to the nature of data-driven models. One possible solution is to increase the interpretability of the data-driven models [130]. By interpreting the complex relationships learned by machine-learning models, researchers might gain useful insights into the underlying mechanisms.
Scalability and online performance. Scaling data-driven models to handle large-scale pharmaceutical processes is still a challenge. The development of an accurate data-driven modeling framework is not complete if the model cannot be solved quickly enough for the application time frame. For drug development purposes, the model of a complex (bio)chemical reaction network can sometimes be difficult to solve. In practice, online performance for monitoring and control purposes is even more reliant on efficient and scalable computational methods. Therefore, developing parallel computing techniques, highly efficient algorithms and distributed architectures is essential for intelligent computational tools to be more widely adopted in pharmaceutical manufacturing [131].
Regulatory compliance. As a relatively new approach, the application of machine learning and other data-driven modeling raises regulatory considerations in the pharmaceutical industry. Traditional approaches, including statistical analysis and mechanistic models, provide transparency and validation for the safety and efficacy of drugs. The establishment of guidelines and standards for data-driven techniques in the pharmaceutical industry is a challenge yet to be addressed, concerning aspects of model transparency, validation and interpretability. Improving the interpretability and explaining the predictions of data-driven models are important for regulatory compliance and trust.

These four points summarize some of the main challenges and future work directions essential to the further advance of data-driven modeling methods in pharmaceutical processes. As mentioned above, collaborative efforts among academic researchers, industry experts, and regulatory bodies are necessary to realize their full potential. Among various endeavors worth making, two research directions have drawn substantial attention in the academic community, namely, the integration of data-driven models and mechanistic models, and multi-scale modeling. We will discuss these two aspects, respectively.

Integrating data-driven and mechanistic models can provide a comprehensive understanding of the studied pharmaceutical system, enable model generalization, and facilitate robust control and optimization. A simple illustrative case is feature engineering. Feature selection and engineering are critical steps in data-driven modeling and can be complex for some pharmaceutical processes. Therefore, they require domain expertise to carry out meaningful feature extraction and transformation. For example, hybrid modeling with machine learning and domain knowledge of lipidomics has been developed for metabolic detection of pancreatic diseases [132]. More generally speaking, mechanistic models based on fundamental principles and knowledge of the underlying phenomena are often time-consuming to develop and validate; combining them with data-driven models can leverage the advantages of both model types. The mechanistic part of the model provides domain knowledge and interpretability, while the data-driven part enables accuracy and flexibility. There are some pioneering works on such integration for various purposes, such as parameter estimation, model initialization and adaptation, hybrid model development, model validation, etc. [133,134,135,136], and they might be adopted in pharmaceutical processes after appropriate modifications. One of the most promising approaches is to build hybrid models for pharmaceutical process development, with a mechanistic core that captures the basic principles and data-driven modules built around the core that handle interaction with the experimental data, non-linear relationships, process variations and uncertain parameters that are challenging to model mechanistically.
Additionally, data-driven models are trained using the available data to fine-tune parameters and improve accuracy, and the two models can also work together the other way around, i.e., a mechanistic model can provide a good starting point for data-driven models, speeding up convergence and improving computational performance. In the control of pharmaceutical processes (e.g., a biochemical fermentation process), real-time dynamic data are constantly provided; in such cases, a mechanistic module allows for robustness, while data-driven models can account for changing system behavior.
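The hybrid structure described above, a mechanistic core with a data-driven module around it, can be sketched in its simplest form as a residual correction. Everything in this example is an assumption made for illustration: the mechanistic core is first-order decay with hypothetical parameters, the data-driven part is a straight line fitted to the residuals by least squares, and the "measurements" are synthetic.

```python
import math

def mechanistic(t, c0=1.0, k=0.5):
    """Mechanistic core: first-order decay C(t) = C0 * exp(-k * t)."""
    return c0 * math.exp(-k * t)

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_hybrid(times, measured):
    """Fit the data-driven part to what the mechanistic core cannot explain."""
    residuals = [y - mechanistic(t) for t, y in zip(times, measured)]
    a, b = fit_linear(times, residuals)
    return lambda t: mechanistic(t) + a + b * t

# Synthetic measurements: mechanistic behavior plus a small systematic drift.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
measured = [mechanistic(t) + 0.05 - 0.01 * t for t in times]
hybrid = fit_hybrid(times, measured)
```

Because the data-driven module only has to learn the mismatch, it can stay small and needs fewer data than a purely data-driven model of the full response, while the mechanistic core keeps the prediction physically meaningful outside the sampled range.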

Multi-scale modeling is also critical for pharmaceutical processes, as it allows for a holistic understanding and optimization. Pharmaceutical processes involve multiple scales, ranging from molecular and cellular levels to unit operation and full-scale manufacturing [137,138,139]. At the molecular and cellular scale, data-driven models can be used to analyze and predict drug–target interactions, pharmacokinetics, and pharmacodynamics. These models can leverage large-scale genomic, proteomic, and metabolomic data to identify potential drug targets, understand drug mechanisms, and optimize drug design and dosage regimens. At the unit operation scale, data-driven models can be used to analyze and optimize mixing, reaction, separation, formulation and other unit operations. This is especially true in the scale-up optimization from laboratory to commercial production. Data-driven models, such as reinforcement learning, may capture the scaling effects and predict the behavior of the process at different scales, utilizing historical data from different scales, which reduces the need for extensive and costly experimentation during process scale-up. On the full manufacturing scale, supply chain management, enterprise resource planning and manufacturing execution systems can all benefit from leveraging historical data and market trends to make informed decisions for production plans and schedules, batch sizes, distribution plans, etc. It is more and more clear that integrating data-driven models across different scales is an important yet complicated task. Data compatibility, model transferability, and the contextual differences between scales need to be addressed and aligned in such an integration process. To achieve successful integration, two aspects are crucial. First, optimized decisions based on data-driven models need to be calibrated, validated and adapted based on experimental data. 
Second, dynamic behaviors are often ubiquitous in pharmaceutical processes, and advances in the development of data-driven models that handle time-dependent variations are required.

6. Conclusions

In this article, we summarized the data-driven approaches used in pharmaceutical processes, based on the algorithmic principles of the methods. They are grouped broadly into the two categories of statistics and machine learning. The former has been adopted for a long time and, with recent advances in computational ability, has been revived in fields such as multivariate analysis and Bayesian inference. With a close connection to statistics, machine learning methods have enabled fast and high-performing feature extraction and outcome prediction for larger datasets. Such data-driven methods and techniques have been widely adopted in different applications in pharmaceutical processes, including drug discovery, retrosynthesis process design, reaction modeling and condition optimization, separation process design and optimization, dynamic operation control, etc., and they have guided drug development and manufacturing toward more automated, efficient and intelligent processes. There are still challenges in several aspects, such as data availability and quality, model generalization and interpretability, scalability and online performance, as well as regulatory compliance, that require collaborative efforts from academia, pharmaceutical companies, and governmental and regulatory organizations to overcome.

Author Contributions

Conceptualization, Y.D. and J.D.; methodology, Y.D., T.Y., Y.X. and Q.M.; software, T.Y. and Y.X.; validation, Y.D., T.Y. and Y.X.; formal analysis, Y.D., T.Y. and Y.X.; investigation, Y.D., T.Y. and Y.X.; resources, Q.M., Y.D. and J.D.; data curation, T.Y. and Y.X.; writing—original draft preparation, Y.D., T.Y. and Y.X.; writing—review and editing, Q.M., Y.D. and J.D.; visualization, T.Y. and Y.X.; supervision, Q.M., Y.D. and J.D.; project administration, Q.M., Y.D. and J.D.; funding acquisition, Q.M., Y.D. and J.D. All authors have read and agreed to the published version of the manuscript.

Funding

The authors appreciate funding from Fundamental Research Funds for China Central Universities DUT20RC (3) 070 and DUT22LAB608, the National Natural Science Foundation of China (U20A20143) and Liaoning Province “Xingliao Talent Program” Project (XLYC1902086).

Data Availability Statement

Partial data supporting the results of this study may be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

Gernaey, K.V.; Cervera-Padrell, A.E.; Woodley, J.M. A Perspective on PSE in Pharmaceutical Process Development and Innovation. Comput. Chem. Eng. 2012, 42, 15–29. [Google Scholar] [CrossRef]
Selekman, J.A.; Qiu, J.; Tran, K.; Stevens, J.; Rosso, V.; Simmons, E.; Xiao, Y.; Janey, J. High-Throughput Automation in Chemical Process Development. Annu. Rev. Chem. Biomol. Eng. 2017, 8, 525–547. [Google Scholar] [CrossRef]
Coley, C.W.; Eyke, N.S.; Jensen, K.F. Autonomous Discovery in the Chemical Sciences Part I: Progress. Angew. Chem. Int. Ed. 2020, 59, 22858–22893. [Google Scholar] [CrossRef]
Buitrago Santanilla, A.; Regalado, E.L.; Pereira, T.; Shevlin, M.; Bateman, K.; Campeau, L.-C.; Schneeweis, J.; Berritt, S.; Shi, Z.-C.; Nantermet, P.; et al. Nanomole-Scale High-Throughput Chemistry for the Synthesis of Complex Molecules. Science 2015, 347, 49–53. [Google Scholar] [CrossRef]
Perera, D.; Tucker, J.W.; Brahmbhatt, S.; Helal, C.J.; Chong, A.; Farrell, W.; Richardson, P.; Sach, N.W. A Platform for Automated Nanomole-Scale Reaction Screening and Micromole-Scale Synthesis in Flow. Science 2018, 359, 429–434. [Google Scholar] [CrossRef]
Burger, B.; Maffettone, P.M.; Gusev, V.V.; Aitchison, C.M.; Bai, Y.; Wang, X.; Li, X.; Alston, B.M.; Li, B.; Clowes, R.; et al. A Mobile Robotic Chemist. Nature 2020, 583, 237–241. [Google Scholar] [CrossRef] [PubMed]
Bezerra, M.A.; Santelli, R.E.; Oliveira, E.P.; Villar, L.S.; Escaleira, L.A. Response Surface Methodology (RSM) as a Tool for Optimization in Analytical Chemistry. Talanta 2008, 76, 965–977. [Google Scholar] [CrossRef] [PubMed]
Hanrahan, G.; Lu, K. Application of Factorial and Response Surface Methodology in Modern Experimental Design and Optimization. Crit. Rev. Anal. Chem. 2006, 36, 141–151. [Google Scholar] [CrossRef]
Wang, K.; Han, L.; Mustakis, J.; Li, B.; Magano, J.; Damon, D.B.; Dion, A.; Maloney, M.T.; Post, R.; Li, R. Kinetic and Data-Driven Reaction Analysis for Pharmaceutical Process Development. Ind. Eng. Chem. Res. 2020, 59, 2409–2421. [Google Scholar] [CrossRef]
Singh, R.; Gernaey, K.V.; Gani, R. Model-Based Computer-Aided Framework for Design of Process Monitoring and Analysis Systems. Comput. Chem. Eng. 2009, 33, 22–42. [Google Scholar] [CrossRef]
Liu, L.; Qu, H. Recent Advancement of Chemical Imaging in Pharmaceutical Quality Control: From Final Product Testing to Industrial Utilization. J. Innov. Opt. Health Sci. 2020, 13, 1930014. [Google Scholar] [CrossRef]
Panteleev, J.; Gao, H.; Jia, L. Recent Applications of Machine Learning in Medicinal Chemistry. Bioorg. Med. Chem. Lett. 2018, 28, 2807–2815. [Google Scholar] [CrossRef]
Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular De-Novo Design through Deep Reinforcement Learning. J. Cheminform. 2017, 9, 48. [Google Scholar] [CrossRef] [PubMed]
Gupta, A.; Müller, A.T.; Huisman, B.J.H.; Fuchs, J.A.; Schneider, P.; Schneider, G. Generative Recurrent Networks for De Novo Drug Design. Mol. Inform. 2018, 37, 1700111. [Google Scholar] [CrossRef] [PubMed]
Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103–1113. [Google Scholar] [CrossRef] [PubMed]
Segler, M.H.S.; Waller, M.P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem. Eur. J. 2017, 23, 5966–5971. [Google Scholar] [CrossRef]
Mann, V.; Venkatasubramanian, V. Predicting Chemical Reaction Outcomes: A Grammar Ontology-based Transformer Framework. AIChE J. 2021, 67, e17190. [Google Scholar] [CrossRef]
Yu, L.X.; Kopcha, M. The Future of Pharmaceutical Quality and the Path to Get There. Int. J. Pharm. 2017, 528, 354–359. [Google Scholar] [CrossRef]
Peterson, J.J. A Bayesian Approach to the ICH Q8 Definition of Design Space. J. Biopharm. Stat. 2008, 18, 959–975. [Google Scholar] [CrossRef]
Grossmann, I.E.; Morari, M. Operability, resiliency and flexibility—Process design objectives for a changing world. In Proceedings of the 2nd International Conference on Foundations of Computer-Aided Process Design, Snowmass, CO, USA, 19–24 June 1983; Westerberg, A.W., Chien, H.H., Eds.; 1984; p. 931. [Google Scholar]
Tabora, J.E.; Lora Gonzalez, F.; Tom, J.W. Bayesian Probabilistic Modeling in Pharmaceutical Process Development. AIChE J. 2019, 65, e16744.
Halemane, K.P.; Grossmann, I.E. Optimal Process Design under Uncertainty. AIChE J. 1983, 29, 425–433.
Swaney, R.E.; Grossmann, I.E. An Index for Operational Flexibility in Chemical Process Design. Part I: Formulation and Theory. AIChE J. 1985, 31, 621–630.
Design of Experiments: Basic Concepts and Its Application in Pharmaceutical Product Development. In Pharmaceutical Product Development; Patravale, V.B., Disouza, J.I., Rustomjee, M., Eds.; CRC Press: Boca Raton, FL, USA, 2016; pp. 132–177. ISBN 978-0-429-18881-7.
Zhang, L.; Mao, S. Application of Quality by Design in the Current Drug Development. Asian J. Pharm. Sci. 2017, 12, 1–8.
Sangshetti, J.N.; Deshpande, M.; Zaheer, Z.; Shinde, D.B.; Arote, R. Quality by Design Approach: Regulatory Need. Arab J. Chem. 2017, 10, S3412–S3425.
Yue, W.; Chen, X.; Gui, W.; Xie, Y.; Zhang, H. A Knowledge Reasoning Fuzzy-Bayesian Network for Root Cause Analysis of Abnormal Aluminum Electrolysis Cell Condition. Front. Chem. Sci. Eng. 2017, 11, 414–428.
Montgomery, D.C. Design and Analysis of Experiments, 8th ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013; ISBN 978-1-118-14692-7.
Gardner, C.R.; Almarsson, O.; Chen, H.; Morissette, S.; Peterson, M.; Zhang, Z.; Wang, S.; Lemmo, A.; Gonzalez-Zugasti, J.; Monagle, J.; et al. Application of High Throughput Technologies to Drug Substance and Drug Product Development. Comput. Chem. Eng. 2004, 28, 943–953.
Patel, G.; Shelat, P.; Lalwani, A. Statistical Modeling, Optimization and Characterization of Solid Self-Nanoemulsifying Drug Delivery System of Lopinavir Using Design of Experiment. Drug Deliv. 2016, 23, 3027–3042.
Hsueh, K.-L.; Lin, T.-Y.; Lee, M.-T.; Hsiao, Y.-Y.; Gu, Y. Design of Experiments for Modeling of Fermentation Process Characterization in Biological Drug Production. Processes 2022, 10, 237.
Kumar, P.M.; Ghosh, A. Development and Evaluation of Silver Sulfadiazine Loaded Microsponge Based Gel for Partial Thickness (Second Degree) Burn Wounds. Eur. J. Pharm. Sci. 2017, 96, 243–254.
Kanojia, G.; Willems, G.-J.; Frijlink, H.W.; Kersten, G.F.A.; Soema, P.C.; Amorij, J.-P. A Design of Experiment Approach to Predict Product and Process Parameters for a Spray Dried Influenza Vaccine. Int. J. Pharm. 2016, 511, 1098–1111.
Badawi, M.A.; El-Khordagui, L.K. A Quality by Design Approach to Optimization of Emulsions for Electrospinning Using Factorial and D-Optimal Designs. Eur. J. Pharm. Sci. 2014, 58, 44–54.
Yu, S.; Bu, H.; Dong, W.; Jiang, Z.; Zhang, L.; Xia, Y. Calibration of Physical Characteristic Parameters of Granular Fungal Fertilizer Based on Discrete Element Method. Processes 2022, 10, 1564.
Barman, S.; Chakraborty, R. Kinetics of Combined Noncatalytic and Catalytic Hydrolysis of Jute Fiber under Ultrasonic–Far Infrared Energy Synergy. AIChE J. 2019, 65, e16677.
Myers, R.H.; Montgomery, D.C. Response Surface Methodology: Process and Product Optimization Using Designed Experiments; Wiley Series in Probability and Statistics; Wiley: New York, NY, USA, 1995; ISBN 978-0-471-58100-0.
Ibrahim, H.M.; Yusoff, W.M.W.; Hamid, A.A.; Illias, R.M.d.; Hassan, O.; Omar, O. Optimization of Medium for the Production of β-Cyclodextrin Glucanotransferase Using Central Composite Design (CCD). Process Biochem. 2005, 40, 753–758.
Kumar, A.; Shende, D.; Wasewar, K. Central Composite Design Approach for Optimization of Levulinic Acid Separation by Reactive Components. Ind. Eng. Chem. Res. 2021, 60, 13692–13700.
Santinon, C.; Beppu, M.M.; Vieira, M.G.A. Optimization of Kappa-Carrageenan Cationization Using Experimental Design for Model-Drug Release and Investigation of Biological Properties. Carbohydr. Polym. 2023, 308, 120645.
Gupta, B.; Poudel, B.K.; Pathak, S.; Tak, J.W.; Lee, H.H.; Jeong, J.-H.; Choi, H.-G.; Yong, C.S.; Kim, J.O. Effects of Formulation Variables on the Particle Size and Drug Encapsulation of Imatinib-Loaded Solid Lipid Nanoparticles. AAPS PharmSciTech 2016, 17, 652–662.
Bayat, M.; Javanbakht, V.; Esmaili, J. Synthesis of Zeolite/Nickel Ferrite/Sodium Alginate Bionanocomposite via a Co-Precipitation Technique for Efficient Removal of Water-Soluble Methylene Blue Dye. Int. J. Biol. Macromol. 2018, 116, 607–619.
Pereira, R.R.; Testi, M.; Rossi, F.; Silva Junior, J.O.C.; Ribeiro-Costa, R.M.; Bettini, R.; Santi, P.; Padula, C.; Sonvico, F. Ucuùba (Virola Surinamensis) Fat-Based Nanostructured Lipid Carriers for Nail Drug Delivery of Ketoconazole: Development and Optimization Using Box-Behnken Design. Pharmaceutics 2019, 11, 284.
McKay, M.D.; Beckman, R.J.; Conover, W.J. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 1979, 21, 239–245.
Fang, K.-T.; Wang, Y.; Bentler, P.M. Some Applications of Number-Theoretic Methods in Statistics. Stat. Sci. 1994, 9, 416–428.
Kalagnanam, J.R.; Diwekar, U.M. An Efficient Sampling Technique for Off-Line Quality Control. Technometrics 1997, 39, 308–319.
Georgakis, C. Design of Dynamic Experiments: A Data-Driven Methodology for the Optimization of Time-Varying Processes. Ind. Eng. Chem. Res. 2013, 52, 12369–12382.
Wang, Z.; Georgakis, C. A Dynamic Response Surface Model for Polymer Grade Transitions in Industrial Plants. Ind. Eng. Chem. Res. 2019, 58, 11187–11198.
Klebanov, N.; Georgakis, C. Dynamic Response Surface Models: A Data-Driven Approach for the Analysis of Time-Varying Process Outputs. Ind. Eng. Chem. Res. 2016, 55, 4022–4034.
Dong, Y.; Georgakis, C.; Mustakis, J.; Hawkins, J.M.; Han, L.; Wang, K.; McMullen, J.P.; Grosser, S.T.; Stone, K. Constrained Version of the Dynamic Response Surface Methodology for Modeling Pharmaceutical Reactions. Ind. Eng. Chem. Res. 2019, 58, 13611–13621.
Dong, Y.; Georgakis, C.; Mustakis, J.; Han, L.; McMullen, J.P. Optimization of Pharmaceutical Reactions Using the Dynamic Response Surface Methodology. Comput. Chem. Eng. 2020, 135, 106778.
Dong, Y.; Georgakis, C.; Mustakis, J.; McMullen, J.P. New Time Sampling Strategy for the Estimation of the Parameters in DRSM Models. Ind. Eng. Chem. Res. 2020, 59, 12792–12800.
Dong, Y.; Georgakis, C.; Santos-Marques, J.; Du, J. Dynamic Response Surface Methodology Using Lasso Regression for Organic Pharmaceutical Synthesis. Front. Chem. Sci. Eng. 2022, 16, 221–236.
Xing, Y.; Dong, Y.; Georgakis, C.; Zhuang, Y.; Zhang, L.; Du, J.; Meng, Q. Automatic Data-driven Stoichiometry Identification and Kinetic Modeling Framework for Homogeneous Organic Reactions. AIChE J. 2022, 68, e17713.
Peterson, J.J.; Miró-Quesada, G.; del Castillo, E. A Bayesian Reliability Approach to Multiple Response Optimization with Seemingly Unrelated Regression Models. Qual. Technol. Quant. Manag. 2009, 6, 353–369.
Bradford, J.R.; Needham, C.J.; Bulpitt, A.J.; Westhead, D.R. Insights into Protein–Protein Interfaces Using a Bayesian Network Prediction Method. J. Mol. Biol. 2006, 362, 365–386.
Kang, D.; Pang, X.; Lian, W.; Xu, L.; Wang, J.; Jia, H.; Zhang, B.; Liu, A.-L.; Du, G.-H. Discovery of VEGFR2 Inhibitors by Integrating Naïve Bayesian Classification, Molecular Docking and Drug Screening Approaches. RSC Adv. 2018, 8, 5286–5297.
Liao, Y.; Cao, P.; Luo, L. Identification of Novel Arachidonic Acid 15-Lipoxygenase Inhibitors Based on the Bayesian Classifier Model and Computer-Aided High-Throughput Virtual Screening. Pharmaceuticals 2022, 15, 1440.
Peterson, J.J.; Yahyah, M. A Bayesian Design Space Approach to Robustness and System Suitability for Pharmaceutical Assays and Other Processes. Stat. Biopharm. Res. 2009, 1, 441–449.
Li, Q.; Chen, H.; Koenig, B.C.; Deng, S. Bayesian Chemical Reaction Neural Network for Autonomous Kinetic Uncertainty Quantification. Phys. Chem. Chem. Phys. 2023, 25, 3707–3717.
Cohen, M.; Vlachos, D.G. Chemical Kinetics Bayesian Inference Toolbox (CKBIT). Comput. Phys. Commun. 2021, 265, 107989.
Li, Y.F.; Venkatasubramanian, V. Leveraging Bayesian Approach to Predict Drug Manufacturing Performance. J. Pharm. Innov. 2016, 11, 331–338.
Melanson, J.E.; Thibeault, M.-P.; Stocks, B.B.; Leek, D.M.; McRae, G.; Meija, J. Purity Assignment for Peptide Certified Reference Materials by Combining QNMR and LC-MS/MS Amino Acid Analysis Results: Application to Angiotensin II. Anal. Bioanal. Chem. 2018, 410, 6719–6731.
Wang, C.; Gheyas, F. Sampling Strategies for Detecting Rare Impurities: An Application in Gene Therapy Products. J. Biopharm. Stat. 2005, 15, 241–252.
Donovan, T.M.; Mickey, R.M. Bayesian Statistics for Beginners; Oxford University Press. Available online: https://global.oup.com/ukhe/product/bayesian-statistics-for-beginners-9780198841302 (accessed on 29 May 2019).
Abdelbasset, W.K.; Elkholi, S.M.; Ahmed Ismail, K.; Alalwani, T.A.A.M.; Hachem, K.; Mohamed, A.; Agustiono Kurniawan, T.; Andreevna Rushchitc, A. Modeling and Computational Study on Prediction of Pharmaceutical Solubility in Supercritical CO₂ for Manufacture of Nanomedicine for Enhanced Bioavailability. J. Mol. Liq. 2022, 359, 119306.
Katakami, S.; Sakamoto, H.; Okada, M. Bayesian Hyperparameter Estimation Using Gaussian Process and Bayesian Optimization. J. Phys. Soc. Jpn. 2019, 88, 074001.
Mashayekhi, M.; Ketabi, S.; Qomi, M.; Sadroleslami, S. Hydration Study of Silymarin and Its Ethylene Glycol Derivatives Compounds by Monte Carlo Simulation Method. Struct. Chem. 2023, 1–12.
Bodnarchuk, M.S.; Packer, M.J.; Haywood, A. Utilizing Grand Canonical Monte Carlo Methods in Drug Discovery. ACS Med. Chem. Lett. 2020, 11, 77–82.
Gasparini, M. Markov Chain Monte Carlo in Practice. Technometrics 1997, 39, 338.
Earl, D.J.; Deem, M.W. Markov Chains of Infinite Order and Asymptotic Satisfaction of Balance: Application to the Adaptive Integration Method. J. Phys. Chem. B 2005, 109, 6701–6704.
Endo, A.; van Leeuwen, E.; Baguelin, M. Introduction to Particle Markov-Chain Monte Carlo for Disease Dynamics Modellers. Epidemics 2019, 29, 100363.
Lewicki, M.P.; Lewicka-Szczebak, D.; Skrzypek, G. FRAME—Monte Carlo Model for Evaluation of the Stable Isotope Mixing and Fractionation. PLoS ONE 2022, 17, e0277204.
Choudhuri, I.; Biswas, A.; Haldane, A.; Levy, R.M. Contingency and Entrenchment of Drug-Resistance Mutations in HIV Viral Proteins. J. Phys. Chem. B 2022, 126, 10622–10636.
Zhang, H.; Jiang, Z.; Pi, J.Y.; Xu, H.K.; Du, R. On-Line Monitoring of Pharmaceutical Production Processes Using Hidden Markov Model. J. Pharm. Sci. 2009, 98, 1487–1498.
Yang, W.; Riley, B.T.; Lei, X.; Porebski, B.T.; Kass, I.; Buckle, A.M.; McGowan, S. Mapping the Pathway and Dynamics of Bestatin Inhibition of the Plasmodium Falciparum M1 Aminopeptidase PfA-M1. ChemMedChem 2018, 13, 2504–2513.
Emdadi, A.; Eslahchi, C. Auto-HMM-LMF: Feature Selection Based Method for Prediction of Drug Response via Autoencoder and Hidden Markov Model. BMC Bioinform. 2021, 22, 33.
Heikamp, K.; Bajorath, J. Prediction of Compounds with Closely Related Activity Profiles Using Weighted Support Vector Machine Linear Combinations. J. Chem. Inf. Model. 2013, 53, 791–801.
Jasial, S.; Balfer, J.; Vogt, M.; Bajorath, J. Determination of Meta-Parameters for Support Vector Machine Linear Combinations. Mol. Inform. 2015, 34, 127–133.
Li, H.; Yap, C.W.; Ung, C.Y.; Xue, Y.; Cao, Z.W.; Chen, Y.Z. Effect of Selection of Molecular Descriptors on the Prediction of Blood–Brain Barrier Penetrating and Nonpenetrating Agents by Statistical Learning Methods. J. Chem. Inf. Model. 2005, 45, 1376–1384.
Jia, J.; Liu, Z.; Xiao, X.; Liu, B.; Chou, K.-C. PSuc-Lys: Predict Lysine Succinylation Sites in Proteins with PseAAC and Ensemble Random Forest Approach. J. Theor. Biol. 2016, 394, 223–230.
Lenhof, K.; Eckhart, L.; Gerstner, N.; Kehl, T.; Lenhof, H.-P. Simultaneous Regression and Classification for Drug Sensitivity Prediction Using an Advanced Random Forest Method. Sci. Rep. 2022, 12, 13458.
Wang, S.; Di, J.; Wang, D.; Dai, X.; Hua, Y.; Gao, X.; Zheng, A.; Gao, J. State-of-the-Art Review of Artificial Neural Networks to Predict, Characterize and Optimize Pharmaceutical Formulation. Pharmaceutics 2022, 14, 183.
Zhao, Y.; Liu, Q.; Wu, X.; Zhang, L.; Du, J.; Meng, Q. De Novo Drug Design Framework Based on Mathematical Programming Method and Deep Learning Model. AIChE J. 2022, 68, e17748.
Baylon, J.L.; Cilfone, N.A.; Gulcher, J.R.; Chittenden, T.W. Enhancing Retrosynthetic Reaction Prediction with Deep Learning Using Multiscale Reaction Classification. J. Chem. Inf. Model. 2019, 59, 673–688.
Miyazato, I.; Nishimura, S.; Takahashi, L.; Ohyama, J.; Takahashi, K. Data-Driven Identification of the Reaction Network in Oxidative Coupling of the Methane Reaction via Experimental Data. J. Phys. Chem. Lett. 2020, 11, 787–795.
Xing, Y.; Dong, Y.; Zhou, W.; Du, J.; Meng, Q. Optimization-Based Simultaneous Modelling of Stoichiometries and Kinetics in Complex Organic Reaction System. Chem. Eng. Sci. 2023, 276, 118758.
Granda, J.M.; Donina, L.; Dragone, V.; Long, D.-L.; Cronin, L. Controlling an Organic Synthesis Robot with Machine Learning to Search for New Reactivity. Nature 2018, 559, 377–381.
Coley, C.W.; Green, W.H.; Jensen, K.F. Machine Learning in Computer-Aided Synthesis Planning. Acc. Chem. Res. 2018, 51, 1281–1289.
Walters, W.P.; Barzilay, R. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction. Acc. Chem. Res. 2021, 54, 263–270.
Yu, L.; Su, Y.; Liu, Y.; Zeng, X. Review of Unsupervised Pretraining Strategies for Molecules Representation. Brief. Funct. Genom. 2021, 20, 323–332.
Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4, 268–276.
Singhal, A.; Seborg, D.E. Clustering Multivariate Time-Series Data. J. Chemom. 2005, 19, 427–438.
Zheng, S.; Zhao, J. A New Unsupervised Data Mining Method Based on the Stacked Autoencoder for Chemical Process Fault Diagnosis. Comput. Chem. Eng. 2020, 135, 106755.
Winter, R.; Montanari, F.; Noé, F.; Clevert, D.-A. Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations. Chem. Sci. 2019, 10, 1692–1701.
Zhang, X.-C.; Wu, C.-K.; Yi, J.-C.; Zeng, X.-X.; Yang, C.-Q.; Lu, A.-P.; Hou, T.-J.; Cao, D.-S. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. Research 2022, 2022, 0004.
Schwaller, P.; Hoover, B.; Reymond, J.-L.; Strobelt, H.; Laino, T. Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions. Sci. Adv. 2021, 7, eabe4166.
Zhang, X.-C.; Wu, C.-K.; Yang, Z.-J.; Wu, Z.-X.; Yi, J.-C.; Hsieh, C.-Y.; Hou, T.-J.; Cao, D.-S. MG-BERT: Leveraging Unsupervised Atomic Representation Learning for Molecular Property Prediction. Brief. Bioinform. 2021, 22, bbab152.
Honda, S.; Shi, S.; Ueda, H.R. SMILES Transformer: Pre-Trained Molecular Fingerprint for Low Data Drug Discovery. arXiv 2019, arXiv:1911.04738.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Ahneman, D.T.; Estrada, J.G.; Lin, S.; Dreher, S.D.; Doyle, A.G. Predicting Reaction Performance in C–N Cross-Coupling Using Machine Learning. Science 2018, 360, 186–190.
Marcou, G.; Aires de Sousa, J.; Latino, D.A.R.S.; de Luca, A.; Horvath, D.; Rietsch, V.; Varnek, A. Expert System for Predicting Reaction Conditions: The Michael Reaction Case. J. Chem. Inf. Model. 2015, 55, 239–250.
Smola, A.J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199–222.
Haywood, A.L.; Redshaw, J.; Hanson-Heine, M.W.D.; Taylor, A.; Brown, A.; Mason, A.M.; Gärtner, T.; Hirst, J.D. Kernel Methods for Predicting Yields of Chemical Reactions. J. Chem. Inf. Model. 2022, 62, 2077–2092.
Fröhlich, H.; Wegner, J.K.; Sieker, F.; Zell, A. Kernel Functions for Attributed Molecular Graphs—A New Similarity-Based Approach to ADME Prediction in Classification and Regression. QSAR Comb. Sci. 2006, 25, 317–326.
Harding, A.P.; Wedge, D.C.; Popelier, P.L.A. pKa Prediction from “Quantum Chemical Topology” Descriptors. J. Chem. Inf. Model. 2009, 49, 1914–1924.
Hughes, L.D.; Palmer, D.S.; Nigsch, F.; Mitchell, J.B.O. Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P. J. Chem. Inf. Model. 2008, 48, 220–232.
Doddareddy, M.R.; Klaasse, E.C.; Shagufta; IJzerman, A.P.; Bender, A. Prospective Validation of a Comprehensive In Silico HERG Model and Its Applications to Commercial Compound and Drug Databases. ChemMedChem 2010, 5, 716–729.
Sun, H.; Shahane, S.; Xia, M.; Austin, C.P.; Huang, R. Structure Based Model for the Prediction of Phospholipidosis Induction Potential of Small Molecules. J. Chem. Inf. Model. 2012, 52, 1798–1805.
Hinton, G.E.; Osindero, S.; Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554.
Gao, H.; Struble, T.J.; Coley, C.W.; Wang, Y.; Green, W.H.; Jensen, K.F. Using Machine Learning to Predict Suitable Conditions for Organic Reactions. ACS Cent. Sci. 2018, 4, 1465–1476.
Maser, M.R.; Cui, A.Y.; Ryou, S.; DeLano, T.J.; Yue, Y.; Reisman, S.E. Multilabel Classification Models for the Prediction of Cross-Coupling Reaction Conditions. J. Chem. Inf. Model. 2021, 61, 156–166.
Angello, N.H.; Rathore, V.; Beker, W.; Wołos, A.; Jira, E.R.; Roszak, R.; Wu, T.C.; Schroeder, C.M.; Aspuru-Guzik, A.; Grzybowski, B.A.; et al. Closed-Loop Optimization of General Reaction Conditions for Heteroaryl Suzuki-Miyaura Coupling. Science 2022, 378, 399–405.
Jin, W.; Coley, C.W.; Barzilay, R.; Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. arXiv 2017, arXiv:1709.04555v3.
Coley, C.W.; Jin, W.; Rogers, L.; Jamison, T.F.; Jaakkola, T.S.; Green, W.H.; Barzilay, R.; Jensen, K.F. A Graph-Convolutional Neural Network Model for the Prediction of Chemical Reactivity. Chem. Sci. 2019, 10, 370–377.
Schwaller, P.; Vaucher, A.C.; Laino, T.; Reymond, J.-L. Prediction of Chemical Reaction Yields Using Deep Learning. Mach. Learn. Sci. Technol. 2021, 2, 015016.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
Santos, J.E.; Yin, Y.; Jo, H.; Pan, W.; Kang, Q.; Viswanathan, H.S.; Prodanović, M.; Pyrcz, M.J.; Lubbers, N. Computationally efficient multiscale neural networks applied to fluid flow in complex 3D porous media. Transp. Porous Media 2021, 140, 241–272.
Marcato, A.; Santos, J.E.; Boccardo, G.; Viswanathan, H.; Marchisio, D.; Prodanović, M. Prediction of local concentration fields in porous media with chemical reaction using a multi scale convolutional neural network. Chem. Eng. J. 2023, 455, 140367.
Di Pasquale, N.; Finney, A.R.; Elliott, J.D.; Carbone, P.; Salvalaglio, M. Constant chemical potential–quantum mechanical–molecular dynamics simulations of the graphene–electrolyte double layer. J. Chem. Phys. 2023, 158, 134714.
Segler, M.H.S.; Preuss, M.; Waller, M.P. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555, 604–610.
Ishida, S.; Terayama, K.; Kojima, R.; Takasu, K.; Okuno, Y. Prediction and Interpretable Visualization of Retrosynthetic Reactions Using Graph Convolutional Networks. J. Chem. Inf. Model. 2019, 59, 5026–5033.
Schreck, J.S.; Coley, C.W.; Bishop, K.J.M. Learning Retrosynthetic Planning through Simulated Experience. ACS Cent. Sci. 2019, 5, 970–981.
Zhang, L.; Liang, D.; Wang, Y.; Li, D.; Zhang, J.; Wu, L.; Feng, M.; Yi, F.; Xu, L.; Lei, L.; et al. Caged Circular SiRNAs for Photomodulation of Gene Expression in Cells and Mice. Chem. Sci. 2018, 9, 44–51.
Koch, M.; Duigou, T.; Faulon, J.-L. Reinforcement Learning for Bioretrosynthesis. ACS Synth. Biol. 2020, 9, 157–168.
Angermueller, C.; Belanger, D.; Murphy, K.; Dohan, D.; Deshpande, R.; Colwell, L. Model-based reinforcement learning for biological sequence design. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
Dong, Y.; Georgakis, C.; Mustakis, J.; Hawkins, J.M.; Han, L.; Wang, K.; McMullen, J.P.; Grosser, S.T.; Stone, K. Stoichiometry Identification of Pharmaceutical Reactions Using the Constrained Dynamic Response Surface Methodology. AIChE J. 2019, 65, e16726.
Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M.; et al. Applications of Machine Learning in Drug Discovery and Development. Nat. Rev. Drug Discov. 2019, 18, 463–477.
Liang, G.; Fan, W.; Luo, H.; Zhu, X. The Emerging Roles of Artificial Intelligence in Cancer Drug Development and Precision Therapy. Biomed. Pharmacother. 2020, 128, 110255.
Yin, J.; Li, J.; Karimi, I.A.; Wang, X. Generalized Reactor Neural ODE for Dynamic Reaction Process Modeling with Physical Interpretability. Chem. Eng. J. 2023, 452, 139487.
Feinstein, W.; Brylinski, M. Structure-Based Drug Discovery Accelerated by Many-Core Devices. Curr. Drug Targets 2016, 17, 1595–1609.
Wang, G.; Yao, H.; Gong, Y.; Lu, Z.; Pang, R.; Li, Y.; Yuan, Y.; Song, H.; Liu, J.; Jin, Y.; et al. Metabolic Detection and Systems Analyses of Pancreatic Ductal Adenocarcinoma through Machine Learning, Lipidomics, and Multi-Omics. Sci. Adv. 2021, 7, eabh2724.
Zhou, T.; Gani, R.; Sundmacher, K. Hybrid Data-Driven and Mechanistic Modeling Approaches for Multiscale Material and Process Design. Engineering 2021, 7, 1231–1238.
Alshehri, A.S.; Gani, R.; You, F. Deep Learning and Knowledge-Based Methods for Computer-Aided Molecular Design—Toward a Unified Approach: State-of-the-Art and Future Directions. Comput. Chem. Eng. 2020, 141, 107005.
Bradley, W.; Kim, J.; Kilwein, Z.; Blakely, L.; Eydenberg, M.; Jalvin, J.; Laird, C.; Boukouvala, F. Perspectives on the integration between first-principles and data-driven modeling. Comput. Chem. Eng. 2022, 166, 107898.
Marcato, A.; Marchisio, D.; Boccardo, G. Reconciling deep learning and first-principle modelling for the investigation of transport phenomena in chemical engineering. Can. J. Chem. Eng. 2023, 101, 3013–3018.
Wang, W.; Ye, Z.; Gao, H.; Ouyang, D. Computational Pharmaceutics—A New Paradigm of Drug Delivery. J. Control. Release 2021, 338, 119–136.
Colvin, M.; Maravelias, C.T. Modeling Methods and a Branch and Cut Algorithm for Pharmaceutical Clinical Trial Planning Using Stochastic Programming. Eur. J. Oper. Res. 2010, 203, 205–215.
Poozesh, S.; Bilgili, E. Scale-up of Pharmaceutical Spray Drying Using Scale-up Rules: A Review. Int. J. Pharm. 2019, 562, 271–292.

Figure 1. Typical flowsheet for a pharmaceutical manufacturing process.

Figure 2. The connections between DoE, RSM and their dynamic counterparts.

Figure 3. The main components of Bayesian inference.

Figure 4. Concepts of random forest and support vector machines.

Figure 5. Deep neural network concepts.

Figure 6. Fitting results for species 6 on the training, validation and testing sets.

Figure 7. A convolutional neural network-based framework.

Table 1. The Bayesian approach to problems in pharmaceutical processes.

Problem	Key Steps Using the Bayesian Approach	References
Drug property prediction	Select a suitable statistical model that captures the relationship between the drug’s properties and available data; estimate the posterior distribution.	[56,57,58]
Robust design of operational parameters	Construct a design space with sufficiently large reliability for manufacturing; change conditional parameters within the design space.	[19,59]
Reaction mechanism inference and synthetic route optimization	Train ordinary differential equations or other models with prior knowledge; perform analysis on different routes by the probability that a certain route is correct.	[60,61,62]
Impurity identification	Collect data on the impurities using analytical techniques; handle missing or incomplete data; specify prior probabilities for relevant variables; calculate the posterior distribution to evaluate the sources and types of impurities in the sample.	[63,64]
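
The "specify prior probabilities, then calculate the posterior distribution" steps listed in Table 1 can be illustrated with a minimal sketch. It assumes a Beta prior on the fraction of samples that contain a given impurity and a binomial likelihood, a conjugate pair chosen here only because the update has a closed form. The class `BetaBelief`, the function `update`, and the counts are hypothetical and are not taken from the cited works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief about an unknown fraction in [0, 1]."""
    alpha: float
    beta: float

    def mean(self) -> float:
        # Expected value of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)


def update(prior: BetaBelief, n_flagged: int, n_tested: int) -> BetaBelief:
    """Conjugate Bayesian update for a binomial likelihood.

    n_flagged samples out of n_tested showed the impurity (illustrative data).
    """
    if not 0 <= n_flagged <= n_tested:
        raise ValueError("n_flagged must lie between 0 and n_tested")
    return BetaBelief(
        alpha=prior.alpha + n_flagged,
        beta=prior.beta + (n_tested - n_flagged),
    )


# Hypothetical example: uniform prior, then 2 flagged samples out of 20 tested.
prior = BetaBelief(alpha=1.0, beta=1.0)
posterior = update(prior, n_flagged=2, n_tested=20)
print(f"prior mean = {prior.mean():.3f}, posterior mean = {posterior.mean():.3f}")
```

Models without a conjugate form, such as the kinetic models mentioned in the table, generally need numerical sampling of the posterior instead of this closed-form update.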

Table 2. Calculation results of three models.

Model	Data Set	$R^{2}$	Number of Parameters	Run Time (s)
ANN	Training Set	0.742	91	1.777
	Validation Set	0.664
	Testing Set	0.621
DRSM	Training Set	0.944	138	40.036
	Validation Set	0.957
	Testing Set	0.908
SVM	Training Set	0.842	98	0.018
	Validation Set	0.842
	Testing Set	0.581

Table 3. The results and parameters of the catalyst prediction model.

Calculation result of Model 1
Metric	Training Set	Validation Set	Testing Set
Top-3 Accuracy	0.957	0.843	0.85
Parameters of Model 1
Batch size	Convolutional layers	Fully connected layers	Kernel size
300	3	2	$1 \times 5, 3 \times 3$
Calculation result of Model 2
Metric	Training Set	Validation Set	Testing Set
Top-3 Accuracy	0.932	0.803	0.838
Parameters of Model 2
Batch size	Convolutional layers	Fully connected layers	Kernel size
300	4	3	$3 \times 3$
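
Top-3 accuracy, the metric reported in Table 3, counts a prediction as correct when the true class is among the three highest-scoring classes. The sketch below shows one common way to compute it; the function name `top_k_accuracy` and the toy scores are illustrative assumptions and do not reproduce the models or data behind the table.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 3,
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 4 samples scored over 5 hypothetical catalyst classes.
toy_scores = [
    [0.10, 0.50, 0.20, 0.15, 0.05],
    [0.40, 0.30, 0.20, 0.06, 0.04],
    [0.05, 0.04, 0.11, 0.70, 0.10],
    [0.30, 0.25, 0.05, 0.22, 0.18],
]
toy_labels = [2, 4, 3, 0]

print("top-1:", top_k_accuracy(toy_scores, toy_labels, k=1))
print("top-3:", top_k_accuracy(toy_scores, toy_labels, k=3))
```

In this toy data the second sample's true class ranks last, so it is missed under both settings, while the first sample is recovered only when three candidates are allowed.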

Table 4. Rate constant prediction model results.

Data Subset	Data Volume	$R^{2}$ (Training Set)	$R^{2}$ (Validation Set)	$R^{2}$ (Testing Set)
Cluster 1	988	0.891	0.745	0.721
Cluster 2	3384	0.956	0.832	0.855
Cluster 3	1728	0.959	0.826	0.864
Average value	/	0.935	0.801	0.813
All data without clustering	6100	0.805	0.631	0.622
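
The "Average value" row of Table 4 appears to be the unweighted mean of the three cluster scores. The short check below recomputes it from the tabulated numbers under that assumption and, for comparison, also prints a mean weighted by data volume. The variable and function names are illustrative; the weighted figure is only a rough summary, since $R^{2}$ values from separate models do not combine exactly and the split of each cluster into training, validation and testing sets is not given in the table.

```python
# Values copied from Table 4: (data volume, R2 train, R2 validation, R2 test).
clusters = {
    "Cluster 1": (988, 0.891, 0.745, 0.721),
    "Cluster 2": (3384, 0.956, 0.832, 0.855),
    "Cluster 3": (1728, 0.959, 0.826, 0.864),
}

volumes = [row[0] for row in clusters.values()]
total_volume = sum(volumes)


def unweighted_mean(column: int) -> float:
    """Plain average of one R2 column across the clusters."""
    values = [row[column] for row in clusters.values()]
    return sum(values) / len(values)


def volume_weighted_mean(column: int) -> float:
    """Average of one R2 column weighted by each cluster's data volume."""
    return sum(row[0] * row[column] for row in clusters.values()) / total_volume


for name, column in (("training", 1), ("validation", 2), ("testing", 3)):
    print(
        f"{name}: unweighted = {unweighted_mean(column):.3f}, "
        f"volume-weighted = {volume_weighted_mean(column):.3f}"
    )
print("total data volume:", total_volume)
```

With the tabulated values, the unweighted means round to 0.935, 0.801 and 0.813, matching the "Average value" row, and the cluster volumes sum to the 6100 points listed for the unclustered data.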

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

