Article

Predicting Pharmaceutical Particle Size Distributions Using Kernel Mean Embedding

1 BIOMATH—Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
2 Laboratory of Pharmaceutical Process Analytical Technology—Department of Pharmaceutical Analysis, Ghent University, Ottergemsesteenweg 460, 9000 Gent, Belgium
3 KERMIT—Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
* Author to whom correspondence should be addressed.
Pharmaceutics 2020, 12(3), 271; https://doi.org/10.3390/pharmaceutics12030271
Submission received: 27 January 2020 / Revised: 26 February 2020 / Accepted: 9 March 2020 / Published: 16 March 2020
(This article belongs to the Special Issue Continuous Pharmaceutical Manufacturing)

Abstract:
In the pharmaceutical industry, the transition to continuous manufacturing of solid dosage forms is being adopted by more and more companies. For these continuous processes, high-quality process models are needed. In pharmaceutical wet granulation, a unit operation in the ConsiGma™-25 continuous powder-to-tablet system (GEA Pharma systems, Collette, Wommelgem, Belgium), the product under study presents itself as a collection of particles that differ in shape and size. The measurement of this collection results in a particle size distribution. However, the theoretical basis to describe the physical phenomena leading to changes in this particle size distribution is lacking. It is essential to understand how the particle size distribution changes as a function of the unit operation’s process settings, as it has a profound effect on the behavior of the fluid bed dryer. Therefore, we suggest a data-driven modeling framework that links the machine settings of the wet granulation unit operation and the output distribution of granules. We do this without making any assumptions on the nature of the distributions under study. A simulation of the granule size distribution could act as a soft sensor when in-line measurements are challenging to perform. The method of this work is a two-step procedure: first, the measured distributions are transformed into a high-dimensional feature space, where the relation between the machine settings and the distributions can be learnt. Second, the inverse transformation is performed, allowing an interpretation of the results in the original measurement space. Further, a comparison is made with previous work, which employs a more mechanistic framework for describing the granules. A reliable prediction of the granule size is vital in the assurance of quality in the production line, and is needed in the assessment of upstream (feeding) and downstream (drying, milling, and tableting) issues. Now that a validated data-driven framework for predicting pharmaceutical particle size distributions is available, it can be applied in settings such as model-based experimental design and, due to its fast computation, there is potential in real-time model predictive control.

1. Introduction

In pharmaceutical twin-screw wet granulation (TSWG), a dry powder is granulated with a liquid to form wet granules. The resulting collection of granules differs in shape and size, and its measurement is called a particle size distribution (PSD). Measuring this PSD is important for determining the input settings of the TSWG. A fine balance is sought between aggregating the dry powder enough so that its flow properties are improved, but not so much that problems arise in the next unit operations. Too little aggregation results in a dry powder that does not have good flow properties and could be blown out to the filters in the fluid bed dryer. Too much aggregation could lead to particles that are too large, requiring longer drying times.
For the rapid development of a new formulation on the powder-to-tablet line under study (ConsiGma™-25), it is essential to have a predictive model linking the settings of the TSWG to the PSD. One possible approach is to use a population balance model (PBM), where the dynamic changes in particle size are described by making assumptions on how particles can aggregate and break in the TSWG [1].
This work takes a different look at the problem, using a data-driven approach to directly link the TSWG settings and the resulting PSD at the end of the granulator. The benefit of this approach is that there is no need to make assumptions about the nature of the PSD itself. This is in contrast to the PBM framework, where thorough knowledge of the dynamics of aggregation and breakage of particles is essential. In Section 2, a general overview of the theory used in this work is given before diving into the mathematical descriptions of all the ideas in Section 3. In Section 4, the experimental set up and data collection are described. Section 5 describes the calibration procedure of the data-driven model, for which the results are presented and discussed in Section 6. General conclusions on this work are drawn in Section 7, and Section 8 lists some potential future research topics.

2. General Principles

In this section, a high-level overview of the general principles behind the methodology of this paper is presented, giving the reader a view of the whole approach before diving into the details.
Supervised machine learning models learn a mapping from an arbitrary input to an arbitrary output space. How does one make a predictive model of a distribution of particle size as a function of the machine settings? One possible approach would be to aggregate the information contained in the whole distribution into a mean particle size, or an indication of the size of the largest or smallest particles. Typically, $d_{10}$, $d_{50}$, and $d_{90}$ are used to describe the particle size in experimental papers in this application field, as in the work of Verstraeten et al. [2]. This approach is sensible if the underlying distribution is known. For instance, if the particle size distribution is adequately approximated by a Gaussian distribution, then information on the mean particle size and the standard deviation is enough to fully characterize the distribution. However, in this application there is no knowledge of an underlying theoretical distribution to describe particle size. Hence, we need a framework that is agnostic with respect to the potential distribution types and which can link these with the process parameters.
To model distributions, we have to be able to construct manageable numerical representations. For example, we can compute all the moments of the distribution. In the jargon of the machine learning field, this would be called “feature generation” or a “feature map”. Knowing all the moments of the distribution, the distribution itself is completely characterized. However, it is not convenient to work with an infinite number of moments when linking them to process settings.
Fortunately, there is a way to translate a distribution into a point in an implicit feature space in such a way that all information is retained. This procedure is explained in Section 3.2. This translation into a new space does not require the explicit calculation of a large number of features. Instead, it expresses all operations in terms of an inner product between pairs of data points in a feature space. These inner products are calculated using a class of functions called kernel functions. This approach of bypassing the generation of a large number of features and only using inner products is the core of a class of algorithms called kernel methods. A brief introduction to learning with kernels is given in Section 3.1. The transformation should be interpreted as a mapping to a mean in that feature space, hence the name of this technique: kernel mean embedding (KME).
Next, this theory is extended to conditional distributions (i.e., a PSD given some TSWG input settings) in Section 3.3. Once conditional distributions can be handled, the next logical step is to derive a learning framework—more specifically, structured output prediction. This framework makes it possible to learn the relationship between the mean embedding of a distribution in a feature space and the TSWG settings. To test whether this approach generalizes within the design space, a leave-one-out cross-validation is performed. The details of the learning theory are derived in Section 3.4.
This work presents a way to manipulate distributions such that all information is maintained, and to put this into a framework where the relation between the process parameters and the distributions can be learnt. The upside is that this framework allows for easy implementation of cross-validation, so that the quality of the learnt relation can be assessed. However, the distributions are still expressed in that new feature space, which is not convenient to interpret. Section 3.5 deals with the inverse operation from that feature space to a PSD, that is, recovering the function from the embedding.
In summary, the theory makes it possible to learn the (cross-validated) relationship between an input space and distributions and still maintain interpretability by allowing the inverse transformation to a distribution at the end. For a visual cue, this whole procedure is summarized in Figure 1.

3. Theoretical Background

In the field of machine learning, the KME of distributions is a class of non-parametric methods in which a probability distribution is mapped to an element of a reproducing kernel Hilbert space (RKHS). These methods are a generalization of the classical feature mapping in kernel methods, which uses individual data points. The learning framework is general in the sense that it can be applied to arbitrary distributions over any space on which a sensible kernel function can be defined. The focus of this work is on representing distributions over real-valued scalars (i.e., particle sizes). However, various kernels have been proposed for other data types, such as strings, graphs, manifolds, and dynamical systems [3,4].
This part consists of five subsections: first, a general introduction to kernel methods is given. Next, the Hilbert space embeddings of marginal and conditional distributions are discussed, which leads to the formulation of a framework for learning on distributional data. Finally, a method to recover distributions from Gaussian RKHS embeddings is discussed. We only discuss the background information relevant to understanding the applications in this work. For an in-depth review of KME, its properties, and its applications, the reader is referred to the review article by Muandet et al. [5]. More details on the backbone of KME (RKHSs, learning with kernels, and probability theory) can be found in Hofmann et al. [4], Schölkopf et al. [6], and Berlinet and Thomas-Agnan [7], respectively.

3.1. Learning with Kernels

The classical machine learning algorithms perceptron [8], support vector machine [9], and principal component analysis [10,11] consider the data $x, x' \in \mathcal{X}$, with $\mathcal{X}$ a non-empty set, through their inner product $\langle x, x' \rangle$. This inner product can be interpreted as a similarity measure between the elements of $\mathcal{X}$. This class of linear functions may be too restrictive for many applications if more complex relations between input and output data are sought. The core of kernel methods is to replace the inner product $\langle x, x' \rangle$ with another (non-linear) similarity measure. As an example, one can explicitly apply a non-linear transformation:
$$\phi : \mathcal{X} \to \mathcal{F}, \qquad x \mapsto \phi(x) \tag{1}$$
from $\mathcal{X}$ to the high-dimensional feature space $\mathcal{F}$ and evaluate the inner product in the newly constructed space:
$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}, \tag{2}$$
where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ is the inner product of $\mathcal{F}$, $\phi$ is the feature map, and $k$ is the kernel function which defines a non-linear similarity measure between $x$ and $x'$. Given a learning algorithm that operates on the data through the inner product $\langle x, x' \rangle$, a non-linear extension of the algorithm can be made by substituting $\langle x, x' \rangle$ with $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. The principles of the algorithms do not change, only the space in which the algorithms operate. The complexity of the algorithm is controlled by the complexity of the non-linear transformation $\phi$. The evaluation of Equation (2) requires two steps: explicitly constructing the feature maps $\phi(x)$ and subsequently evaluating the inner product $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. Issues can arise when $\phi(x)$ defines a transformation to a high-dimensional feature space. However, it is possible to evaluate $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$ directly without explicitly constructing the feature maps. This is an essential part of kernel methods, and in the machine learning community this is called “the kernel trick”. A visual aid for this kernel trick is shown in Figure 2.
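As a concrete illustration of the kernel trick (not taken from the article, and with illustrative function names), the following Julia sketch compares the explicit degree-2 polynomial feature map on $\mathbb{R}^2$ with the corresponding kernel evaluation; both yield the same inner product, but the kernel never constructs the feature vector.

```julia
# Hedged sketch: explicit feature map vs. kernel evaluation for the
# inhomogeneous polynomial kernel of degree 2 on R^2.
using LinearAlgebra

# Explicit feature map phi: R^2 -> R^6 with <phi(x), phi(y)> = (<x, y> + 1)^2
phi2(x) = [1.0, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2]]

# Kernel function: same similarity measure, no explicit feature construction
kpoly2(x, y) = (dot(x, y) + 1)^2

x, y = [0.3, -1.2], [2.0, 0.7]
dot(phi2(x), phi2(y)) ≈ kpoly2(x, y)   # true: the two routes agree
```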
This kernel trick can only be applied if $k$ is positive definite. The positive definite kernel function $k(x, x')$ is central to the successful application of KME. This kernel function initially arises as a way to perform an inner product $\langle x, x' \rangle$ in a high-dimensional feature space $\mathcal{H}$ for some data points $x, x' \in \mathcal{X}$. The collection of all pairwise inner products within the set of data vectors $x$ is called the $n \times n$ Gram or kernel matrix $K_{ij} := k(x_i, x_j)$. In general, a symmetric function $k$ is called a positive definite kernel on $\mathcal{X}$ if the Gram matrix is positive definite, that is,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0, \qquad x_i \in \mathcal{X}. \tag{3}$$
Equation (3) holds for any $n \in \mathbb{N}$, all finite sequences of points $x_1, \ldots, x_n$ in $\mathcal{X}$, and all choices of real-valued coefficients $c_1, \ldots, c_n \in \mathbb{R}$ [12]. The positive definiteness of the kernel guarantees the existence of a dot product space $\mathcal{F}$ and a feature map $\phi : \mathcal{X} \to \mathcal{F}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$ [13], without needing to compute $\phi$ explicitly [6,9,14,15]. Moreover, a positive definite kernel induces a space of functions from $\mathcal{X}$ to $\mathbb{R}$ called an RKHS $\mathcal{H}$; hence it is also called a reproducing kernel [13]. An RKHS has two important properties: first, for any $x \in \mathcal{X}$, the function $k(x, \cdot) : y \mapsto k(x, y)$ is an element of $\mathcal{H}$. That is, whenever the kernel $k$ is used, the feature space $\mathcal{F}$ is essentially the RKHS $\mathcal{H}$ associated with this kernel, and it can be interpreted as a canonical feature map:
$$k : \mathcal{X} \to \mathcal{H} \subset \mathbb{R}^{\mathcal{X}}, \qquad x \mapsto k(x, \cdot), \tag{4}$$
where $\mathbb{R}^{\mathcal{X}}$ denotes the vector space of functions from $\mathcal{X}$ to $\mathbb{R}$. The second property is that the inner product in $\mathcal{H}$ satisfies the reproducing property, i.e., for all functions $f \in \mathcal{H}$ and $x \in \mathcal{X}$,
$$f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}. \tag{5}$$
In particular, $k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}$. Further details on how $\phi(x) = k(x, \cdot)$ can be derived directly from the kernel $k$ can be found in Schölkopf et al. [6]. The kernel used in this work is a Gaussian kernel, which is part of a class of kernels with interesting properties called radial basis functions (RBFs):
$$k_{\mathrm{RBF}}(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2 \sigma^2} \right), \tag{6}$$
with $\sigma > 0$ a bandwidth parameter. For $\sigma \to \infty$, the Gram matrix of this kernel becomes a matrix of ones; for $\sigma \to 0$, it becomes an identity matrix. The former situation implies that all instances are the same, whereas the latter implies that they are all completely unique. The RBF kernel is a stationary kernel: it can be described as a function of the difference of its inputs. The RBF kernel is also called a universal kernel because any smooth function can be represented with a high degree of accuracy, assuming a suitable value of the bandwidth can be found. More details on different classes of kernel functions and their application domains can be found in Genton [16]. For further details on the properties of the RKHS and important theorems such as Mercer’s and Bochner’s theorems, the reader is referred to Muandet et al. [5], Mercer [12], and Bochner [17], respectively.
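A small Julia sketch of Equation (6) (with illustrative helper names, not from the article) shows the limiting behaviour of the bandwidth: a very large $\sigma$ drives the Gram matrix towards a matrix of ones, a very small $\sigma$ towards the identity.

```julia
# Hedged sketch: RBF kernel, its Gram matrix, and the two bandwidth limits.
using LinearAlgebra

k_rbf(x, y, σ) = exp(-norm(x - y)^2 / (2σ^2))
gram(xs, σ) = [k_rbf(xi, xj, σ) for xi in xs, xj in xs]

xs = [randn(3) for _ in 1:5]     # five toy data points in R^3

K_mid    = gram(xs, 1.0)         # a typical Gram matrix, positive definite
K_wide   = gram(xs, 1e6)         # σ -> ∞: all entries close to 1 (all points look alike)
K_narrow = gram(xs, 1e-6)        # σ -> 0: close to the identity (all points look unique)
isposdef(K_mid + 1e-10I)         # true up to numerical round-off
```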

3.2. Hilbert Space Embedding of Marginal Distributions

In KME, the concept of a feature map $\phi$ is extended to the space of probability distributions through the mapping $\mu$, which defines the representer in $\mathcal{H}$ of any distribution $\mathbb{P}$:
$$\mu : \mathcal{M}_+^1(\mathcal{X}) \to \mathcal{H}, \qquad \mathbb{P} \mapsto \int_{\mathcal{X}} k(x, \cdot) \, \mathrm{d}\mathbb{P}(x), \tag{7}$$
with $\mathcal{M}_+^1(\mathcal{X})$ the space of probability measures over a measurable space $\mathcal{X}$ [7,18]. The above mapping is the kernel mean embedding that is considered in this work:
$$\phi(\mathbb{P}) = \mu_{\mathbb{P}} := \mathbb{E}_{X \sim \mathbb{P}}\left[ k(X, \cdot) \right] = \int_{\mathcal{X}} k(x, \cdot) \, \mathrm{d}\mathbb{P}(x). \tag{8}$$
A visual representation of this mean embedding is given in Figure 3. In essence, the distribution $\mathbb{P}$ is transformed into an element of the feature space $\mathcal{H}$, which is just the RKHS corresponding to the positive definite kernel $k$. This element (i.e., the mean embedding $\mu_{\mathbb{P}}$) is the expected value in that feature space. Since $\mathbb{P}$ is a probability distribution, the expected value can be written as an integral. The derivation, proof, and properties of Equation (8) can be found in Muandet et al. [5], Berlinet and Thomas-Agnan [7], and Smola et al. [18].
Through Equation (8), most RKHS methods can therefore be extended to probability measures. When embedding a distribution in another space, it is crucial to understand what information of the distribution is retained by the kernel mean embedding. Consider the class of inhomogeneous polynomial kernels of order $p \in \mathbb{N}$:
$$k_{\mathrm{poly}}(x, x') = \left( \langle x, x' \rangle + 1 \right)^p = 1 + \binom{p}{1} \langle x, x' \rangle + \binom{p}{2} \langle x, x' \rangle^2 + \cdots + \binom{p}{p} \langle x, x' \rangle^p. \tag{9}$$
Polynomial kernels of order $p$ allow for learning a $p$-th order polynomial model w.r.t. the features. For our purposes, a polynomial kernel would model the first $p$ moments of a distribution when used in KME. For a linear kernel, which is equal to computing the inner product, $\mu_{\mathbb{P}}$ equals the first moment of $\mathbb{P}$, whereas the polynomial kernel of order 2 allows the mean map to retain information on both the first and the second moments of $\mathbb{P}$. Generally speaking, the mean map using the inhomogeneous polynomial kernel of order $p$ captures information up to the $p$-th moment of $\mathbb{P}$. Other explicit examples for some kernels can be found in Smola et al. [18], Fukumizu et al. [19], Sriperumbudur et al. [20], Gretton et al. [21], and Schölkopf et al. [22]. There exists a class of kernel functions known as characteristic kernels for which the kernel mean representation captures all information about the distribution $\mathbb{P}$, with the Gaussian kernel used in this work as an example [23,24]. It follows that the RKHS endowed with the kernel $k$ should contain a sufficiently rich class of functions to represent all higher-order moments of $\mathbb{P}$ [24]. The map $\mathbb{P} \mapsto \mu_{\mathbb{P}}$ is injective, implying that $\lVert \mu_{\mathbb{P}} - \mu_{\mathbb{Q}} \rVert_{\mathcal{H}} = 0$ if and only if $\mathbb{P} = \mathbb{Q}$, that is, $\mathbb{P}$ and $\mathbb{Q}$ are the same distribution. Injectivity of the map $\mathbb{P} \mapsto \mu_{\mathbb{P}}$ makes the RKHS embedding suitable for regression problems, since this map is inherently structurally identifiable (i.e., each element in the feature space corresponds to one unique distribution in the original space). Lastly, it is necessary to point out that, in practice, access to the true distribution $\mathbb{P}$ is often lacking, and thereby the mean embedding $\mu_{\mathbb{P}}$ cannot be computed. Instead, often only an independent and identically distributed (iid) sample $\{x_1, \ldots, x_n\}$ of the distribution is available. The standard estimator $\hat{\mu}_{\mathbb{P}}$ of the kernel mean $\mu_{\mathbb{P}}$ is an empirical average:
$$\hat{\mu}_{\mathbb{P}} := \frac{1}{n} \sum_{i=1}^{n} k(x_i, \cdot), \tag{10}$$
with $\hat{\mu}_{\mathbb{P}}$ an unbiased estimate of $\mu_{\mathbb{P}}$. By the weak law of large numbers, $\hat{\mu}_{\mathbb{P}}$ converges to $\mu_{\mathbb{P}}$ as $n \to \infty$ [25]. In this work, the data should be interpreted as a probability mass distribution associated with the sample $X$, for example $\hat{\mathbb{P}} := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, with $\delta_x$ the Dirac measure defined for $x \in \mathcal{X}$, such that the mean embedding takes the form of a weighted sum of feature vectors:
$$\hat{\mu}_{\mathbb{P}} := \sum_{i=1}^{n} w_i \, k(x_i, \cdot), \tag{11}$$
with $\mathbf{w} = (w_i) \in \Delta^{n-1}$, that is, a histogram with weights $w_i > 0$ subject to the constraint $\sum_{i=1}^{n} w_i = 1$ [26]. A comparison of Equations (8), (10), and (11) is visualized in Figure 4.
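The two empirical estimators above translate directly into code. The Julia sketch below (helper names are illustrative, not from the article) builds the embedded function of Equation (10) for an iid sample and of Equation (11) for a histogram of bin weights; by the reproducing property, each embedding can be evaluated at any point of the original space.

```julia
# Hedged sketch of the empirical kernel mean embeddings (Equations (10) and (11)).

k_rbf(x, y, σ) = exp(-(x - y)^2 / (2σ^2))       # scalar Gaussian kernel

# Equation (10): uniform weights over an iid sample
embed_sample(sample, σ) = t -> sum(k_rbf(x, t, σ) for x in sample) / length(sample)

# Equation (11): weighted sum for a histogram with bin centres and weights w, sum(w) = 1
embed_hist(centres, w, σ) = t -> sum(wi * k_rbf(xi, t, σ) for (xi, wi) in zip(centres, w))

μP = embed_hist([1.0, 2.0, 4.0], [0.2, 0.5, 0.3], 0.8)
μP(2.5)    # value of the embedded distribution, <μ_P, k(2.5, ·)>, at t = 2.5
```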
In summary, the described framework allows the transformation of marginal distributions into a rich feature space without making assumptions about the underlying distribution (e.g., belonging to a particular class of distributions) by using an extension of previously established kernel methods. By carefully choosing the nature of this transformation, all information on this distribution is retained. Next, it can be proven that this map is identifiable, which makes this representation suitable for regression problems. Finally, compared to density estimation approaches, the kernel mean representation is less prone to the curse of dimensionality [27,28,29].

3.3. Hilbert Space Embedding of Conditional Distributions

In the previous subsection, the fundamentals of the mean map for marginal distributions were laid out. In this subsection, the extension of kernel mean embedding to the conditional distributions $\mathbb{P}(Y \mid X)$ and $\mathbb{P}(Y \mid X = x)$ for some $x \in \mathcal{X}$ is discussed [26,30]. The conditional distribution captures the functional relationship between the two random variables $X$ and $Y$. Conditional mean embedding thus extends the capability of kernel mean embedding to model more complex dependence.
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be positive definite kernels for the domains of $X$ and $Y$, respectively. The RKHSs associated with these kernels are $\mathcal{H}$ and $\mathcal{G}$. The conditional mean embeddings of the conditional distributions $\mathbb{P}(Y \mid X)$ and $\mathbb{P}(Y \mid X = x)$ can be written as $\mathcal{U}_{Y|X} : \mathcal{H} \to \mathcal{G}$ and $\mathcal{U}_{Y|x} \in \mathcal{G}$, such that they satisfy:
$$\mathcal{U}_{Y|x} = \mathbb{E}_{Y|x}\left[ \varphi(Y) \mid X = x \right] = \mathcal{U}_{Y|X} \, k(x, \cdot) \tag{12}$$
$$\mathbb{E}_{Y|x}\left[ g(Y) \mid X = x \right] = \langle g, \mathcal{U}_{Y|x} \rangle_{\mathcal{G}}, \qquad \forall g \in \mathcal{G}. \tag{13}$$
$\mathcal{U}_{Y|X}$ is an operator from one RKHS $\mathcal{H}$ to the other RKHS $\mathcal{G}$, and $\mathcal{U}_{Y|x}$ is an element of $\mathcal{G}$. Equation (12) states that the conditional mean embedding of the conditional distribution $\mathbb{P}(Y \mid X = x)$ corresponds to the conditional expectation of the feature map of $Y$ given that $X = x$. The operator $\mathcal{U}_{Y|X}$ is the conditioning operation that, when applied to $\phi(x) \in \mathcal{H}$, yields the conditional mean embedding $\mathcal{U}_{Y|x}$. Equation (13) describes the reproducing property of $\mathcal{U}_{Y|x}$, that is, it should be a representer of conditional expectation in $\mathcal{G}$ w.r.t. $\mathbb{P}(Y \mid X = x)$. Using the definition of Song et al. [26,30]: let $\mathcal{C}_{XX} : \mathcal{H} \to \mathcal{H}$ and $\mathcal{C}_{XY} : \mathcal{H} \to \mathcal{G}$ be the covariance operator on $X$ and the cross-covariance operator from $X$ to $Y$, respectively. Then, the conditional mean embeddings $\mathcal{U}_{Y|X}$ and $\mathcal{U}_{Y|x}$ are defined as:
$$\mathcal{U}_{Y|X} := \mathcal{C}_{XY} \, \mathcal{C}_{XX}^{-1} \tag{14}$$
$$\mathcal{U}_{Y|x} := \mathcal{C}_{XY} \, \mathcal{C}_{XX}^{-1} \, k(x, \cdot). \tag{15}$$
A visual explanation of these concepts can be found in Figure 5.
Further, Fukumizu et al. [23,31] state that if $\mathbb{E}_{Y|X}\left[ g(Y) \mid X = \cdot \right] \in \mathcal{H}$ for any $g \in \mathcal{G}$, then
$$\mathcal{C}_{XX} \, \mathbb{E}_{Y|X}\left[ g(Y) \mid X = \cdot \right] = \mathcal{C}_{XY} \, g. \tag{16}$$
For some $x \in \mathcal{X}$, by virtue of the reproducing property, we have that
$$\mathbb{E}_{Y|x}\left[ g(Y) \mid X = x \right] = \left\langle \mathbb{E}_{Y|X}\left[ g(Y) \mid X = \cdot \right], k(x, \cdot) \right\rangle_{\mathcal{H}}. \tag{17}$$
Combining Equations (16) and (17) and taking the conjugate transpose of $\mathcal{C}_{XX}^{-1} \mathcal{C}_{XY}$ yields
$$\mathbb{E}_{Y|x}\left[ g(Y) \mid X = x \right] = \left\langle g, \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} k(x, \cdot) \right\rangle_{\mathcal{G}} = \langle g, \mathcal{U}_{Y|x} \rangle_{\mathcal{G}}. \tag{18}$$
It is important to note that the operator $\mathcal{C}_{YX} \mathcal{C}_{XX}^{-1}$ may not exist in the continuous domain, because the assumption that $\mathbb{E}_{Y|X}\left[ g(Y) \mid X = \cdot \right] \in \mathcal{H}$ for all $g \in \mathcal{G}$ may not hold in general [23,30]. To ensure existence, a regularised version of Equation (13) can be used, that is, $\mathcal{C}_{YX} \left( \mathcal{C}_{XX} + \lambda I \right)^{-1} k(x, \cdot)$, where $\lambda > 0$ is a regularization parameter and $I$ is the identity operator in $\mathcal{H}$. Fukumizu et al. [31] showed that, under mild conditions, its empirical estimator is a consistent estimator of $\mathbb{E}_{Y|x}\left[ g(Y) \mid X = x \right]$.
In practice, some technical issues arise: since the joint distribution $\mathbb{P}(X, Y)$ is unknown, $\mathcal{C}_{XX}$ and $\mathcal{C}_{YX}$ cannot be computed directly. The solution is to rely on the iid sample $(x_1, y_1), \ldots, (x_n, y_n)$ from $\mathbb{P}(X, Y)$. Let $\Upsilon := \left( \phi(x_1), \ldots, \phi(x_n) \right)$ and $\Phi := \left( \varphi(y_1), \ldots, \varphi(y_n) \right)$, where $\phi : \mathcal{X} \to \mathcal{H}$ and $\varphi : \mathcal{Y} \to \mathcal{G}$ are the feature maps associated with the kernels $k$ and $l$, respectively. The corresponding Gram matrices are defined as $K = \Upsilon^T \Upsilon$ and $L = \Phi^T \Phi$. Using the former definitions, the empirical estimator of the conditional mean embedding is given by
$$\hat{\mathcal{C}}_{YX} \left( \hat{\mathcal{C}}_{XX} + \lambda I \right)^{-1} k(x, \cdot) = \tfrac{1}{n} \Phi \Upsilon^T \left( \tfrac{1}{n} \Upsilon \Upsilon^T + \lambda I \right)^{-1} k(x, \cdot) = \Phi \Upsilon^T \left( \Upsilon \Upsilon^T + n \lambda I \right)^{-1} k(x, \cdot) = \Phi \left( \Upsilon^T \Upsilon + n \lambda I_n \right)^{-1} \Upsilon^T k(x, \cdot) = \Phi \left( K + n \lambda I_n \right)^{-1} \mathbf{k}_x. \tag{19}$$
Here, $\mathbf{k}_x := \left( k(x_1, x), \ldots, k(x_n, x) \right)^T$ denotes the vector of kernel evaluations between the training inputs and the query $x$. As derived by Song et al. [30], the conditional mean embedding $\mu_{Y|x}$ can be estimated using
$$\hat{\mu}_{Y|x} = \Phi \left( K + n \lambda I_n \right)^{-1} \mathbf{k}_x. \tag{20}$$
Let $\hat{\boldsymbol{\beta}}_\lambda := \left( K + n \lambda I_n \right)^{-1} \mathbf{k}_x \in \mathbb{R}^n$; then Equation (20) can be written as $\hat{\mu}_{Y|x} = \Phi \hat{\boldsymbol{\beta}}_\lambda = \sum_{i=1}^{n} (\hat{\beta}_\lambda)_i \, \varphi(y_i)$. It should be noted that this last equation is in a form similar to Equation (11). So, in conclusion, $x$ determines the weights for the embedding of $\mathbb{P}(Y \mid x)$.
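In matrix form, Equation (20) amounts to solving one regularized linear system per query point. The Julia sketch below (function and variable names are illustrative, not from the article) computes the weight vector $\hat{\boldsymbol{\beta}}_\lambda$ that defines the embedding of $\mathbb{P}(Y \mid x)$ as a weighted combination of the training feature vectors $\varphi(y_i)$.

```julia
# Hedged sketch of the empirical conditional mean embedding weights (Equation (20)).
using LinearAlgebra

"Weights β̂_λ = (K + nλI)⁻¹ k_x of the conditional mean embedding for one query x."
function cme_weights(K::AbstractMatrix, kx::AbstractVector, λ::Real)
    n = size(K, 1)
    return (K + n*λ*I) \ kx
end

# Toy usage: four training inputs on the real line, RBF kernel with unit bandwidth
k(x, y) = exp(-(x - y)^2 / 2)
xs = [0.0, 0.5, 1.0, 2.0]
K  = [k(a, b) for a in xs, b in xs]
kx = [k(a, 0.8) for a in xs]           # kernel vector of the query x = 0.8
β  = cme_weights(K, kx, 1e-2)          # weights of φ(y_i) in μ̂_{Y|x}
```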

3.4. Learning on Distributional Data

Zhang et al. [32] and Grünewälder et al. [33] observed that the conditional mean embedding has a natural interpretation as the solution to a vector-valued regression problem. Recall that the conditional mean embedding is defined via $\mathbb{E}\left[ g(Y) \mid X = x \right] = \langle g, \hat{\mu}_{Y|x} \rangle_{\mathcal{G}}$. That is, for every $x \in \mathcal{X}$, $\hat{\mu}_{Y|x}$ is a function on $\mathcal{Y}$ and thereby defines a mapping from $\mathcal{X}$ to $\mathcal{G}$. Furthermore, the empirical estimator in Equation (20) can be expressed as $\hat{\mu}_{Y|x} = \Phi \left( K + n \lambda I_n \right)^{-1} \mathbf{k}_x$, which already suggests that the conditional mean embedding is the solution to an underlying regression problem. Given an iid sample $(x_1, z_1), \ldots, (x_n, z_n) \in \mathcal{X} \times \mathcal{G}$, a vector-valued regression problem can be formulated as:
$$\hat{\mathcal{E}}_\lambda(f) = \sum_{i=1}^{n} \left\lVert z_i - f(x_i) \right\rVert_{\mathcal{G}}^2 + \lambda \left\lVert f \right\rVert_{\mathcal{H}_\Gamma}^2, \tag{21}$$
where $\mathcal{G}$ is a Hilbert space, $\mathcal{H}_\Gamma$ denotes an RKHS of vector-valued functions from $\mathcal{X}$ to $\mathcal{G}$, and $\hat{\mathcal{E}}_\lambda$ is the error associated with this regression problem [34]. Grünewälder et al. [33] show that $\hat{\mu}_{Y|X}$ can be obtained as a minimizer of Equation (21). A natural optimization problem for the conditional mean embedding is to find a function $\mu : \mathcal{X} \to \mathcal{G}$ that minimizes an objective. Grünewälder et al. [33] show that this objective can be bounded from above by a surrogate loss function, whose empirical counterpart is
$$\hat{\mathcal{E}}_s(\mu) = \sum_{i=1}^{n} \left\lVert l(y_i, \cdot) - \mu(x_i) \right\rVert_{\mathcal{G}}^2 + \lambda \left\lVert \mu \right\rVert_{\mathcal{H}_\Gamma}^2, \tag{22}$$
with an added regularization term to provide a well-posed problem and prevent overfitting. This vector-valued regression interpretation of the conditional mean embedding has the advantage that a cross-validation procedure for parameter or model selection can be used, because the loss function is well defined. Since the analysis is done under the assumption that $\mathcal{G}$ is finite-dimensional, the conditional mean embedding is simply the ridge regression of feature vectors. Given $\hat{\boldsymbol{\beta}}_\lambda := \left( K + n \lambda I_n \right)^{-1} \mathbf{k}_x$, the hat matrix $H_\lambda$ in the ridge regression context is defined as:
$$H_\lambda \mathbf{k}_x = \hat{\mathbf{k}}_x = \Phi \hat{\boldsymbol{\beta}}_\lambda \tag{23}$$
$$H_\lambda = K \left( K + \lambda I \right)^{-1}. \tag{24}$$
The estimated conditional embedding using leave-one-out cross-validation (LOOCV) is then defined as [35]:
$$\hat{\mu}_{Y|x}^{\mathrm{LOOCV}} = \left( I - \mathrm{diag}(H_\lambda) \right)^{-1} \left( H_\lambda - \mathrm{diag}(H_\lambda) \right) \Phi, \tag{25}$$
with $\mathrm{diag}(\cdot)$ denoting the matrix containing only the diagonal of its argument. Note that, using Equation (25), it is possible to calculate all LOOCV conditional embeddings at once using matrix multiplications. For the interpretability of the results, the underlying distribution needs to be recovered from the LOOCV conditional mean embedding. This is described in the next subsection.
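Because the loss is a ridge regression loss, the LOOCV shortcut of Equation (25) needs nothing more than the Gram matrix. A possible Julia sketch (names are illustrative, not the article's code) returns a matrix whose $i$-th row holds the weights of the LOOCV embedding $\hat{\mu}_{Y|x_i}^{\mathrm{LOOCV}}$ over the training feature vectors $\{\varphi(y_j)\}$.

```julia
# Hedged sketch of the LOOCV conditional mean embeddings (Equations (24)–(25)).
using LinearAlgebra

"Hat matrix H_λ = K (K + λI)⁻¹ of the kernel ridge regression (Equation (24))."
hat_matrix(K::AbstractMatrix, λ::Real) = K / (K + λ*I)

"Row i: weights of the LOOCV embedding for experiment i over the training targets."
function loocv_weights(K::AbstractMatrix, λ::Real)
    H = hat_matrix(K, λ)
    D = Diagonal(diag(H))          # diag(H_λ) as a diagonal matrix
    return (I - D) \ (H - D)       # (I − diag H)⁻¹ (H − diag H)
end
```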

3.5. Recovering Distributions from RKHS Embeddings

Given a kernel mean embedding $\mu_{\mathbb{P}}$, is it possible to recover essential properties of $\mathbb{P}$ from $\mu_{\mathbb{P}}$? This problem is known in the literature as the distributional pre-image problem [36,37,38]. It is important to note that there is a distinction with the classical pre-image problem, which does not involve probability distributions [5]. In this problem, objects in the input space are sought which correspond to a specific KME in the feature space. In this way, meaningful information about an underlying distribution can be recovered from an estimate of its embedding. Let $\mathbb{P}_\theta$ be an arbitrary distribution parametrized by $\theta$ and let $\mu_{\mathbb{P}_\theta}$ be its mean embedding in $\mathcal{H}$. $\mathbb{P}_\theta$ can be found by the following minimization problem
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \left\lVert \hat{\mu}_Y - \mu_{\mathbb{P}_\theta} \right\rVert_{\mathcal{H}}^2 = \arg\min_{\theta \in \Theta} \left[ \langle \hat{\mu}_Y, \hat{\mu}_Y \rangle - 2 \langle \hat{\mu}_Y, \mu_{\mathbb{P}_\theta} \rangle + \langle \mu_{\mathbb{P}_\theta}, \mu_{\mathbb{P}_\theta} \rangle \right], \tag{26}$$
subject to appropriate constraints on the parameter vector $\theta$. Equation (26) minimizes the maximum mean discrepancy (MMD), which is based on the idea of representing distances between distributions as distances between mean embeddings of features. Applied to this work, $\hat{\mu}_Y$ should be interpreted as the estimated conditional embedding using LOOCV, defined by Equation (25). The term $\langle \hat{\mu}_Y, \hat{\mu}_Y \rangle$ is only a function of the estimated conditional embedding; it is thus constant and is left out of the minimization. Assume that $\mu_{\mathbb{P}_\theta} = \sum_{i=1}^{n} \alpha_i \varphi(y_i)$ for some $\boldsymbol{\alpha} \in \Delta^{n-1}$, or in words: $\mathbb{P}_\theta$ is a histogram. It follows that $\langle \mu_{\mathbb{P}_\theta}, \mu_{\mathbb{P}_\theta} \rangle = \boldsymbol{\alpha}^T \left( L + \lambda I \right) \boldsymbol{\alpha}$ with $L_{ij} = l(y_i, y_j)$. The addition of a regularizing term $\lambda$ allows us to cast the optimization as a standard quadratic programming problem. Finally, $\langle \hat{\mu}_Y, \mu_{\mathbb{P}_\theta} \rangle$ is then equal to the dot product of $\mathbb{P}_\theta$ and the conditional embedding of Equation (25). The optimization in Equation (26) can thus be written as
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha} \in \Delta^{n-1}} \; \boldsymbol{\alpha}^T \left( L + \lambda I \right) \boldsymbol{\alpha} - 2 \, \boldsymbol{\alpha} \cdot \hat{\mu}_{Y|x}^{\mathrm{LOOCV}}. \tag{27}$$
Although it is possible to solve Equation (27) and find a distributional pre-image, it is not immediately clear what kind of information about $\mathbb{P}$ this pre-image represents. Kanagawa and Fukumizu [38] consider the recovery of the information of a distribution from an estimate of the kernel mean when the Gaussian RBF kernel on Euclidean space is used. They show that, in certain situations, statistics of $\mathbb{P}$, namely its moments and measures on intervals, can be recovered from $\hat{\mu}_{\mathbb{P}}$, and that the density of $\mathbb{P}$ can be estimated from $\hat{\mu}_{\mathbb{P}}$ without any parametric assumption on $\mathbb{P}$ (Kanagawa and Fukumizu [38], Theorem 2).
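Equation (27) is a quadratic program over the simplex. The article casts it as a standard quadratic programming problem; the Julia sketch below substitutes a simple projected gradient descent with a Euclidean projection onto the simplex, purely for illustration (all names, the step size, and the iteration count are assumptions, and a dedicated QP solver would be the more robust choice).

```julia
# Hedged sketch of the distributional pre-image (Equation (27)) via projected gradient.
using LinearAlgebra

"Euclidean projection of a vector onto the probability simplex."
function project_simplex(v::AbstractVector)
    n = length(v)
    u = sort(v, rev=true)
    css = cumsum(u)
    ρ = findlast(j -> u[j] + (1 - css[j]) / j > 0, 1:n)
    θ = (1 - css[ρ]) / ρ
    return max.(v .+ θ, 0.0)
end

"Minimize αᵀ(L + λI)α − 2 α·μ over the simplex; step size may need tuning."
function preimage_weights(L::AbstractMatrix, μ::AbstractVector; λ=1e-3, η=1e-3, steps=5_000)
    Q = L + λ*I
    α = fill(1/length(μ), length(μ))        # start from the uniform histogram
    for _ in 1:steps
        g = 2 .* (Q*α) .- 2 .* μ            # gradient of the quadratic objective
        α = project_simplex(α .- η .* g)
    end
    return α
end
```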

4. Experimental Set Up and Data Collection

The application field of this work is pharmaceutical manufacturing. More specifically, the data gathered for this study originate from the high-shear TSWG unit operation in the ConsiGma™-25 system (GEA Pharma systems, Collette, Wommelgem, Belgium) continuous powder-to-tablet line. A schematic representation of the production line is shown in Figure 6. A more in-depth depiction of the TSWG can be found in Figure 7, displaying the input and output data of the system. The experimental set up of this paper is described in the work of Verstraeten et al. [2]. Here, only a summary with the details relevant to this work is given. For the full details, the reader should refer to the aforementioned paper. The TSWG comprises two 25 mm diameter self-wiping, co-rotating screws with a length-to-diameter ratio of 20:1. The preblend and the granulation liquid (demineralized water) are introduced into the system by a gravimetric twin-screw loss-in-weight feeder (KT20, K-Tron Soder, Niederlenz, Switzerland) and two out-of-phase peristaltic pumps located on top of the granulator (Watson Marlow, Cornwall, UK), connected to 1.6 mm nozzles. In this work, the data for the hydrophobic model drug were used. The preblend for this model drug contains 60% (w/w) hydrochlorothiazide (UTAG, Almere, The Netherlands), 16% (w/w) lactose monohydrate (Lactochem® Regular, DFE Pharma, Goch, Germany), 16% (w/w) microcrystalline cellulose (Avicel® PH 101, FMC, Philadelphia, PA, USA), 3% (w/w) hydroxypropylcellulose (Klucel® EXF, Ashland, Covington, KY, USA), and 5% (w/w) croscarmellose sodium (Ac-Di-Sol®, FMC, Philadelphia, PA, USA). A three-level full-factorial experimental design was used to study the influence of the granulation process parameters screw speed (450, 675, and 900 rpm), material throughput (5, 12.5, and 25 kg/h), and liquid-to-solid ratio (0.3, 0.45, and 0.6). An overview of the process conditions of this experimental design is listed in Table 1. This experiment was performed with a fixed screw configuration: two kneading compartments, each comprised of six kneading elements (length = diameter/6 for each element) with a 60° stagger angle, separated by a conveying element with a length equal to 1.5 times the diameter. The barrel’s jacket temperature was set at 25 °C. The samples were collected at four locations inside the barrel; however, only the measurements at the end of the granulator are considered in this work. After collection, the samples were oven-dried before the measurement of the PSD and other properties. The size and shape distributions of the collected, oven-dried granule samples were analyzed using a QICPIC particle size analyzer with WINDOX 5.4.1.0 software (Sympatec GmbH, Clausthal-Zellerfeld, Germany). The number of bins in the data was taken from previous work on population balance models [1]. The bins were chosen such that the experimental PSD data from the QICPIC could be loaded directly without the need for interpolation. The grid comprised a total of 35 bins, logarithmically spaced between 8.46 µm and 6765.36 µm.

5. Calibration Procedure

First, the process parameters $X$ are standardized by removing the mean and scaling to unit variance. Next, some hyperparameters need to be defined: the bandwidth parameters $\sigma$ of the RBF kernels and the regularization parameter $\lambda$. For the kernel on the grid of the distributions, $\sigma$ is chosen via the median heuristic [39]: $\sigma^2 = \mathrm{median}\left\{ \lVert \log_{10}(x_i) - \log_{10}(x_j) \rVert^2 : i, j = 1, \ldots, n \right\}$. Note that the logarithm of the grid values is taken, as the grid spans more than three orders of magnitude. This brings the kernel values more closely together and gives more realistic results. For the kernel on the process parameters, the bandwidth is chosen as $\sigma^2 = 0.1$, which is approximately $1/10$ of the length scale. Finally, for numerical stability reasons (especially for the pre-imaging problem), a bias is added to the diagonal of the Gram matrix of both kernels: 0.05 for $k$ and 0.1 for $l$. For a visual cue, the heatmaps of the two Gram matrices are given in Figure 8. The regularization parameter $\lambda$ is estimated via LOOCV: its value is altered so that the squared error between the mean embedding of the measured distributions and the estimated distributions via LOOCV is minimized:
$$\lambda = \arg\min_{\lambda} \left\lVert \mu_{\mathbb{P}(Y|X)} - \hat{\mu}_{\mathbb{P}(Y|X)}^{\mathrm{LOOCV}} \right\rVert_{\mathcal{G}}^2. \tag{28}$$
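A Julia sketch of this preprocessing (illustrative names; the numeric choices simply mirror the values quoted above) standardizes the process parameters, applies the median heuristic to the $\log_{10}$-transformed size grid, and adds the small diagonal biases to the Gram matrices.

```julia
# Hedged sketch of the calibration preprocessing described in this section.
using LinearAlgebra, Statistics

standardize(X) = (X .- mean(X, dims=1)) ./ std(X, dims=1)   # columns = process parameters

"Median heuristic on the log10-transformed grid of particle sizes."
function median_heuristic(grid)
    g = log10.(grid)
    return median([(gi - gj)^2 for gi in g, gj in g if gi != gj])
end

rbf(x, y, σ²) = exp(-sum(abs2, x .- y) / (2σ²))
gram(points, σ²; bias = 0.0) = [rbf(p, q, σ²) for p in points, q in points] + bias*I

grid = exp10.(range(log10(8.46), log10(6765.36), length=35))   # 35 logarithmic bins (μm)
L = gram(log10.(grid), median_heuristic(grid); bias = 0.1)     # kernel l on the grid
# K = gram(eachrow(standardize(X)), 0.1; bias = 0.05)          # kernel k on the settings
```

The regularization parameter $\lambda$ itself would then be tuned by minimizing Equation (28) over the LOOCV embeddings obtained with these Gram matrices; that search is not shown in the sketch.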
To assess the model quality, three different measures are used. In the following equations, $\mathbb{P}$ is the measured distribution and $\hat{\mathbb{P}}$ is the estimated distribution using LOOCV KME. The MMD is calculated as shown before in Equation (26):
$$D_{\mathrm{MMD}} = \left\lVert \mu_{\mathbb{P}} - \mu_{\hat{\mathbb{P}}} \right\rVert_{\mathcal{G}}^2. \tag{29}$$
The root mean squared error (RMSE), or $L_2$-norm, is defined as:
$$D_{\mathrm{RMSE}} = \left\lVert \boldsymbol{\alpha} - \hat{\boldsymbol{\alpha}} \right\rVert, \tag{30}$$
with $\boldsymbol{\alpha} \in \Delta^{n-1}$. Last, the Kullback–Leibler (KL) divergence is calculated as:
$$D_{\mathrm{KL}} = \sum_{i=1}^{n} \alpha_i \log \frac{\alpha_i}{\hat{\alpha}_i}. \tag{31}$$
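For distributions represented as weight vectors over a common grid (so that $\mu_{\mathbb{P}} = \sum_i \alpha_i \varphi(y_i)$), the three measures reduce to simple vector expressions, as in the hedged Julia sketch below (helper names are illustrative; the small $\varepsilon$ guarding the KL logarithm against empty bins is an added assumption).

```julia
# Hedged sketch of the goodness-of-fit measures of Equations (29)–(31).
using LinearAlgebra

# MMD between two weighted embeddings sharing the same atoms, with Gram matrix L on the grid
d_mmd(α, αhat, L) = (α - αhat)' * L * (α - αhat)

# L2 distance between the weight vectors ("RMSE" in the text)
d_rmse(α, αhat) = norm(α - αhat)

# Kullback–Leibler divergence between the two histograms
d_kl(α, αhat; ε = 1e-12) = sum(α .* log.((α .+ ε) ./ (αhat .+ ε)))
```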

6. Results and Discussion

In Figure 9, results for four different experiments are visualized: the measured distribution and the predicted distribution using LOOCV. For the two figures on the left, the calibrated distributions using a PBM from Van Hauwermeiren et al. [1] are plotted as well. Calibrated distributions using the PBM are not available for the experiments in the two right figures, so only the predicted distributions using KME are plotted there. More figures with results can be found in Appendix A. These figures show that a good prediction can be achieved over a wide range of distribution shapes (monomodal, monomodal with high skewness, and bimodal) without any assumption on the nature of the true underlying distribution. Some peculiarities occur in the predictions: in the first and last bins, the KME model always predicts non-zero values. This might be due to the formulation of the ridge-regression-like problem or to incorrect retrieval of the distribution in the pre-imaging problem. A switch to a lasso problem, as described by Grünewälder et al. [33], could potentially alleviate this issue. Alternatively, the entropic regularization in Equation (27) might not be chosen in an optimal fashion. A closer look at experiments 9, 12, and 14 shows that the KME predicts the location of the peaks and the skewness of the distributions accurately. Experiments 9 and 14 have lower values of MMD, RMSE, and KL compared to the previous work using PBMs. Considering experiment 20, the model correctly ignores the slight bimodality in the measured distribution, which is attributed to measurement error. Overall, when ignoring the peculiarities in the first and last bins, the model attains better results than previous work using population balance models (PBMs) [1], while at the same time lowering the computational cost of calibration and validation and the number of parameters required to describe the model. The main difference in the resulting distributions between KME and PBM is the amount of separation between the modes in bimodal distributions. In other words, the calibrated distributions using PBM have more separated peaks, which results in a recurrent underprediction between the modes in the middle of the distribution and an overprediction at the modes. In terms of objective function value, the PBM work quantified the distance between simulation and measurement using the root mean squared error. In Van Hauwermeiren et al. [1], the average RMSE value is 0.0803; in this work, the average RMSE value is 0.0876. For the MMD, the average values for the KME are higher than for the PBM approach. According to the KL distance, the KME approach gave better results than the PBM work. Note that in the aforementioned work, the published results are calibrations, whereas, in this work, one extra level of complexity is added: predictions of unseen data. Thus, the results presented in this work using KME achieve a similar goodness-of-fit while solving a more complex problem. An overview of all goodness-of-fit values can be found in Table 2. It should be noted that, to the authors’ best knowledge, this work is the first to compare PBM with KME.

7. Conclusions

The KME of distributions is an interesting data-driven framework to learn relations between a certain input space (in this application, TSWG process settings) and measured distributions. The problem can be written in a form that allows it to be treated as a kernel ridge regression problem. Using this framework, kernel mean embeddings of distributions can be predicted for given inputs. With the kernel pre-image problem, the prediction can be translated from the high-dimensional Hilbert space back into its original space. This allows the interpretation and evaluation of the framework in its original space. The benefits of using KME are fast calculation (a couple of seconds for the given problem), analytical short-cuts for LOOCV, high-quality predictions for a wide variety of distribution shapes without making any assumption about those distributions, and a small number of parameters. The model only has five hyperparameters, of which only the regularization parameter $\lambda$ was estimated.
This work shows an intuitive data-driven approach for which the whole workflow can be written in less than 30 lines of code (see Appendix B). This compactness combined with the fact that only one parameter needs to be estimated makes it an attractive choice for practitioners with limited programming knowledge.
The whole calculation of the LOOCV KME and pre-image problem takes only a couple of seconds. This is in stark contrast with the PBM calculation from previous work [1], where a single calculation of a PBM already takes a couple of seconds. Performing the whole parameter estimation to yield the results that were also shown in the previous section takes orders of magnitude more time than our data-driven approach.
In conclusion, the proposed approach to predict PSDs in TSWG is fast, does not make any assumptions about the shape of the data, and most importantly, yields high-quality cross-validated results.

8. Prospects

Further improvements of this work could include an extension to learn relationships from distribution to distribution. In this application field, this could be applied to assessing the effect of a change in pre-blend composition on the resulting PSD at the end of the granulator. Further, instead of working with one unit operation, the whole ConsiGma™-25 production line could be studied. The whole transformation of distributions in the unit operations (feeder, blender, granulator, dryer, mill, and tablet press) could be mapped.
The data for this work was gathered using the off-line measurement device QICPIC. It might be interesting to investigate if the model can be trained with a similar predictive power using an in-line particle size measurement device. In that way, the time-consuming preprocessing steps could be bypassed and the results could be gathered more swiftly. If the same experimental design was used, all the data could be gathered in a matter of hours. However, in-line measurement is more prone to noise and has a lower resolution. The effect of this on the model predictions needs to be investigated.
This method is applied here to particle size distributions, but it could be extended to other types of distribution-like data. This work focused on a distribution of particle size, but, for instance, the moisture distribution in a collection of granules or the hardness of a representative set of final tablets are also possible applications. One other example seems obvious: the prediction of mixtures. In this sense, we could answer questions like “what is the behavior of a mixture of powders starting from the attributes of its components?” This is a hot topic in pharmaceutical manufacturing, as generating an adequate mixture that has the desired properties is mostly done using expert knowledge. The method described in this work could help in creating a model-based design for mixtures. To the best of our knowledge, we see no hurdles in applying the same methodology to other data-driven problems with distributed data.

Author Contributions

Data curation: D.V.H.; validation: D.V.H.; writing—original draft, D.V.H.; conceptualization, M.S.; methodology, M.S.; software, D.V.H. and M.S.; writing—review and editing, M.S.; supervision, T.D.B. and I.N. All authors have read and agreed to the published version of the manuscript.

Funding

M.S. is supported by the Research Foundation - Flanders (FWO17/PDO/067).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DoE    design of experiments
iid    independent and identically distributed
KME    kernel mean embedding
MFR    mass flow rate
KL     Kullback–Leibler divergence
LOOCV  leave-one-out cross-validation
L/S    liquid-to-solid ratio
MMD    maximum mean discrepancy
PSD    particle size distribution
RBF    radial basis function
RKHS   reproducing kernel Hilbert space
RMSE   root mean squared error
TSWG   twin-screw wet granulation

Appendix A. Additional Figures

Figure A1. Measurements and leave-one-out cross-validation (LOOCV) predictions of $\mathbb{P}(Y \mid X)$. For the two figures on the left, the calibrated distributions using a PBM from Van Hauwermeiren et al. [1] are plotted as well. From top left to bottom right: (a) Experiment 1. (b) Experiment 2. (c) Experiment 3. (d) Experiment 4. (e) Experiment 5. (f) Experiment 6. (g) Experiment 7. (h) Experiment 8. Note that the population balance model (PBM) calibration was not performed for experiments (b,d–f,h). KME: kernel mean embedding; MFR: mass flow rate.
Figure A2. Measurements and leave-one-out cross-validation (LOOCV) predictions of $\mathbb{P}(Y \mid X)$. For the two figures on the left, the calibrated distributions using a PBM from Van Hauwermeiren et al. [1] are plotted as well. From top left to bottom right: (a) Experiment 9. (b) Experiment 10. (c) Experiment 11. (d) Experiment 12. (e) Experiment 13. (f) Experiment 14. (g) Experiment 15. (h) Experiment 16. Note that the population balance model (PBM) calibration was not performed for experiments (b–e,g,h). KME: kernel mean embedding; MFR: mass flow rate.
Figure A3. Measurements and leave-one-out cross-validation (LOOCV) predictions of $\mathbb{P}(Y \mid X)$. For the two figures on the left, the calibrated distributions using a PBM from Van Hauwermeiren et al. [1] are plotted as well. From top left to bottom right: (a) Experiment 17. (b) Experiment 18. (c) Experiment 19. (d) Experiment 20. (e) Experiment 21. (f) Experiment 22. (g) Experiment 23. (h) Experiment 24. Note that the population balance model (PBM) calibration was not performed for experiments (a,d–f,h). KME: kernel mean embedding; MFR: mass flow rate.
Figure A4. Measurements and leave-one-out cross-validation (LOOCV) predictions of $\mathbb{P}(Y \mid X)$. For the two figures on the left, the calibrated distributions using a PBM from Van Hauwermeiren et al. [1] are plotted as well. From top left to bottom right: (a) Experiment 25. (b) Experiment 26. (c) Experiment 27. (d) Experiment 28. (e) Experiment 29. Note that the population balance model (PBM) calibration was not performed for experiments (b–e). KME: kernel mean embedding; MFR: mass flow rate.

Appendix B. Julia Code

[The Julia code listing of Appendix B is provided as an image (Pharmaceutics 12 00271 i001) in the published article.]
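Since the published listing is only available as an image, it is not reproduced here. The following is a hedged re-sketch of the workflow it describes (standardize the inputs, build the two Gram matrices, compute the LOOCV conditional mean embeddings, and score the fit), run on synthetic toy data; all names and the toy data are assumptions, not the authors' original code.

```julia
# Hedged re-sketch of the overall workflow on synthetic data (not the original listing).
using LinearAlgebra, Statistics

rbf(x, y, σ²) = exp(-sum(abs2, x .- y) / (2σ²))
gram(pts, σ²; bias = 0.0) = [rbf(p, q, σ²) for p in pts, q in pts] + bias*I

# Toy data: 10 "experiments", 3 process settings, 35-bin measured PSDs (rows of weights)
n, nbins = 10, 35
X = randn(n, 3)                                    # pretend these are already standardized
grid = exp10.(range(log10(8.46), log10(6765.36), length=nbins))
W = rand(n, nbins); W ./= sum(W, dims=2)

# Gram matrices: k on the process settings, l on the log10 size grid (median heuristic)
K = gram(eachrow(X), 0.1; bias = 0.05)
σ²_grid = median([(a - b)^2 for a in log10.(grid), b in log10.(grid) if a != b])
L = gram(log10.(grid), σ²_grid; bias = 0.1)

# LOOCV conditional mean embeddings (Equations (24)–(25)) as weights over experiments
λ = 1e-2
H = K / (K + λ*I)
D = Diagonal(diag(H))
B = (I - D) \ (H - D)

# Predicted PSD weights (before the pre-image step) and the MMD per experiment
Wpred = B * W
mmd = [(W[i, :] - Wpred[i, :])' * L * (W[i, :] - Wpred[i, :]) for i in 1:n]
```

The pre-image step of Equation (27) would then map each row of the predicted weights back onto the probability simplex, for example with the projected-gradient sketch given after Section 3.5, before computing the RMSE and KL measures.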

References

1. Van Hauwermeiren, D.; Verstraeten, M.; Doshi, P.; am Ende, M.T.; Turnbull, N.; Lee, K.; De Beer, T.; Nopens, I. On the modelling of granule size distributions in twin-screw wet granulation: Calibration of a novel compartmental population balance model. Powder Technol. 2019, 341, 116–125.
2. Verstraeten, M.; Van Hauwermeiren, D.; Lee, K.; Turnbull, N.; Wilsdon, D.; am Ende, M.; Doshi, P.; Vervaet, C.; Brouckaert, D.; Mortier, S.T.; et al. In-depth experimental analysis of pharmaceutical twin-screw wet granulation in view of detailed process understanding. Int. J. Pharm. 2017, 529, 678–693.
3. Gärtner, T. A survey of kernels for structured data. ACM SIGKDD Explor. Newsl. 2003, 5, 49.
4. Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Stat. 2008, 36, 1171–1220.
5. Muandet, K.; Fukumizu, K.; Sriperumbudur, B.; Schölkopf, B. Kernel Mean Embedding of Distributions: A Review and Beyond. Found. Trends Mach. Learn. 2016.
6. Schölkopf, B.; Smola, A.J.; Bach, F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
7. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer: Boston, MA, USA, 2004.
8. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408.
9. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
10. Pearson, K. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
11. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441.
12. Mercer, J. Functions of positive and negative type, and their connection with the theory of integral equations. Proc. R. Soc. A Math. Phys. Eng. Sci. 1909, 83, 69–70.
13. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337.
14. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory—COLT ’92, Pittsburgh, PA, USA, 27–29 July 1992; ACM Press: New York, NY, USA, 1992; pp. 144–152.
15. Vapnik, V. The Nature of Statistical Learning Theory, 2nd ed.; Springer: Berlin, Germany, 2000; p. 314.
16. Genton, M.G. Classes of kernels for machine learning: A statistics perspective. J. Mach. Learn. Res. 2002, 2, 299–312.
17. Bochner, S. Monotone funktionen, stieltjessche integrale und harmonische analyse. Math. Ann. 1933, 108, 378–410.
18. Smola, A.; Gretton, A.; Song, L.; Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), Sendai, Japan, 1–4 October 2007; pp. 13–31.
19. Fukumizu, K.; Gretton, A.; Sun, X.; Schölkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20: 21st Annual Conference on Neural Information Processing Systems 2007; Biologische Kybernetik, Curran: Red Hook, NY, USA, 2008; pp. 489–496.
20. Sriperumbudur, B.K.; Gretton, A.; Fukumizu, K.; Schölkopf, B.; Lanckriet, G.R.G. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 2010, 11, 1517–1561.
21. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A.J. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
22. Schölkopf, B.; Muandet, K.; Fukumizu, K.; Harmeling, S.; Peters, J. Computing functions of random variables via reproducing kernel Hilbert space representations. Stat. Comput. 2015, 25, 755–766.
23. Fukumizu, K.; Bach, F.R.; Jordan, M.I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 2004, 5, 73–99.
24. Sriperumbudur, B.K.; Gretton, A.; Fukumizu, K.; Lanckriet, G.; Schölkopf, B. Injective Hilbert space embeddings of probability measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), Helsinki, Finland, 9–12 July 2008; Biologische Kybernetik, Omnipress: Madison, WI, USA, 2008; pp. 111–122.
25. Sriperumbudur, B.K.; Fukumizu, K.; Gretton, A.; Schölkopf, B.; Lanckriet, G.R.G. On the empirical estimation of integral probability metrics. Electron. J. Stat. 2012, 6, 1550–1599.
26. Song, L.; Fukumizu, K.; Gretton, A. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 2013, 30, 98–111.
27. Poczos, B.; Xiong, L.; Schneider, J. Nonparametric divergence estimation with applications to machine learning on distributions. CoRR 2012, arXiv:1202.3758.
28. Oliva, J.B.; Póczos, B.; Schneider, J.G. Distribution to distribution regression. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1049–1057.
29. Oliva, J.B.; Poczos, B.; Schneider, J. Fast distribution to real regression. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, 22–25 April 2014; pp. 706–714.
30. Song, L.; Huang, J.; Smola, A.; Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (ICML), Montreal, QC, Canada, 14–18 June 2009; pp. 961–968.
31. Fukumizu, K.; Song, L.; Gretton, A. Kernel Bayes’ rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 2013, 14, 3753–3783.
32. Zhang, K.; Peters, J.; Janzing, D.; Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain, 14–17 July 2011; pp. 804–813.
33. Grünewälder, S.; Lever, G.; Gretton, A.; Baldassarre, L.; Patterson, S.; Pontil, M. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 26 June–1 July 2012; pp. 1823–1830.
34. Micchelli, C.A.; Pontil, M. On Learning Vector-Valued Functions. Neural Comput. 2005, 17, 177–204.
35. Wahba, G. Spline Models for Observational Data; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1990.
36. Kwok, J.T.Y.; Tsang, I.W.H. The pre-image problem in kernel methods. IEEE Trans. Neural Netw. 2004, 15, 1517–1525.
37. Song, L.; Zhang, X.; Smola, A.; Gretton, A.; Schölkopf, B. Tailoring Density Estimation via Reproducing Kernel Moment Matching. In Proceedings of the 25th Annual International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 992–999.
38. Kanagawa, M.; Fukumizu, K. Recovering distributions from Gaussian RKHS embeddings. AISTATS 2014, 33, 457–465.
39. Gretton, A.; Herbrich, R.; Smola, A.J.; Bousquet, O.; Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res. 2005, 6, 2075–2129.
Figure 1. Visual representation of the general modeling principle in this paper. We start at the top left with our data: a measured particle size distribution. This distribution is translated into a new feature space through a kernel function $\varphi$. On the top right of the figure, the measured distribution is represented as a mean of features (large red dot), which might be slightly different from the embedding of the true distribution (large black dot). From this embedding we can perform inference or regression with the machine input settings. RKHS: reproducing kernel Hilbert space.
Figure 2. Visual representation of the kernel trick. The value of the kernel function of a pair of objects (denoted in red) in object space $\mathcal{X}$ is identical to an inner product of the representations of the objects in the implied Hilbert space $\mathcal{F}$.
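The kernel trick of Figure 2 can be verified numerically for kernels whose feature map is finite-dimensional. The sketch below uses a degree-2 homogeneous polynomial kernel purely for illustration (the Gaussian kernel used elsewhere in this work has an infinite-dimensional feature map, so its φ cannot be written out explicitly).

```python
import numpy as np

def k(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return (x @ y) ** 2

def phi(z):
    """Explicit feature map of k for 2-D inputs: phi(z) = (z1^2, z2^2, sqrt(2) z1 z2)."""
    return np.array([z[0] ** 2, z[1] ** 2, np.sqrt(2) * z[0] * z[1]])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# kernel trick: evaluating k in object space equals an inner product in feature space
assert np.isclose(k(x, y), phi(x) @ phi(y))
```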
Figure 3. Embedding of marginal distributions P and Q into the RKHS H yielding μ P and μ Q . Figure based on Muandet et al. [5].
Figure 4. A comparison of the embedding of marginal distributions P into the RKHS H: (top) for a continuous distribution, the embedding is the integral of the feature map weighted by the density; (middle) for a sample, it is the arithmetic mean of the embeddings of the individual samples; (bottom) for a probability mass function, it is the weighted average of the individual embeddings.
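The bottom case of Figure 4 is the one relevant to binned particle size distributions. A minimal sketch, assuming a Gaussian kernel on a rescaled size grid and a random toy PMF (neither is the paper's calibrated choice), shows that the mean embedding is just a weighted average of kernel features, and that, evaluated on the grid itself, it reduces to a Gram matrix times the weight vector.

```python
import numpy as np

def l(a, b, gamma=50.0):
    """Gaussian kernel on (rescaled) particle sizes; the bandwidth is illustrative."""
    return np.exp(-gamma * (a - b) ** 2)

grid = np.linspace(0.0, 1.0, 30)                      # common particle-size grid
w = np.random.default_rng(1).dirichlet(np.ones(30))   # a PMF on that grid (toy data)

def mu(t):
    """Mean embedding of the PMF evaluated at t: a weighted average of features."""
    return np.sum(w * l(t, grid))

# evaluated on the grid itself, the embedding is simply Gram matrix times weights
L = l(grid[:, None], grid[None, :])
mu_on_grid = L @ w
assert np.isclose(mu_on_grid[0], mu(grid[0]))
```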
Figure 5. The embedding of conditional distribution P ( Y | X ) is not a single element in the RKHS. Instead, it may be viewed as a family of Hilbert space embeddings of the conditional distributions P ( Y | X = x ) indexed by the conditioning variable X. In other words, the conditional mean embedding can be viewed as an operator mapping from the RKHS H for features to RKHS G for distributions. Figure based on Muandet et al. [5].
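The following sketch illustrates the standard conditional mean embedding estimator behind Figure 5 as a kernel ridge regressor from process settings to embedded distributions; the toy data, RBF kernel, bandwidth, regularization parameter, and the crude clip-and-renormalize step are assumptions for illustration and not the paper's calibrated implementation.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """RBF Gram matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
n, m = 25, 40                                   # toy numbers of experiments / grid points
X = rng.normal(size=(n, 3))                     # standardized process settings
W = rng.dirichlet(np.ones(m), size=n)           # measured PMFs on a common size grid

gamma_x, lam = 0.5, 1e-3                        # assumed bandwidth and regularization
K = rbf_gram(X, X, gamma_x)                     # Gram matrix of kernel k on the inputs

def predict_pmf(x_new):
    """Conditional mean embedding estimate of P(Y | X = x_new), written directly as
    a weighted combination of the training PMFs on the common grid."""
    k_star = rbf_gram(x_new[None, :], X, gamma_x).ravel()
    beta = np.linalg.solve(K + n * lam * np.eye(n), k_star)
    w_hat = beta @ W
    w_hat = np.clip(w_hat, 0.0, None)           # crude projection back onto valid PMFs
    return w_hat / w_hat.sum()

w_pred = predict_pmf(X[0])                      # predicted granule size PMF at x = X[0]
```

In this finite representation, predicting a new distribution amounts to computing the weights β once per query, which is why the approach is fast enough to be considered for real-time use.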
Figure 6. Schematic representation of the ConsiGma™-25 system (GEA Pharma systems, Collette, Wommelgem, Belgium) continuous powder-to-tablet line.
Figure 7. Schematic representation of the input and output data of the twin-screw wet granulation (TSWG), part of the ConsiGma™-25 system (GEA Pharma systems, Collette, Wommelgem, Belgium) continuous powder-to-tablet line.
Figure 8. Visualization of the Gram matrices. (a) Gram matrix of kernel k on the process settings X. (b) Gram matrix of kernel l on the grid.
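A minimal sketch of how the two Gram matrices of Figure 8 could be assembled, using an RBF kernel, a handful of the process settings listed in Table 1 below, and a rescaled size grid; the kernel type, bandwidths, standardization, and grid scaling are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """RBF Gram matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# a few process settings from Table 1: throughput (kg/h), screw speed (rpm), L/S ratio
X_raw = np.array([
    [5.0,  450.0, 0.30],    # N1
    [12.5, 675.0, 0.45],    # N14
    [25.0, 900.0, 0.60],    # N23
])
# standardize each setting so that no single unit dominates the distances
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
K = rbf_gram(X, X, gamma=0.5)                   # Gram matrix of kernel k (cf. panel a)

grid = np.linspace(0.0, 1.0, 100)[:, None]      # rescaled particle-size grid
L = rbf_gram(grid, grid, gamma=100.0)           # Gram matrix of kernel l (cf. panel b)
```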
Figure 9. Measurements and leave-one-out cross-validation (LOOCV) predictions of P(Y|X). For the two figures on the left, the calibrated distributions using the population balance model (PBM) from Van Hauwermeiren et al. [1] are plotted as well. From top left to bottom right: (a) Experiment 9; (b) Experiment 12; (c) Experiment 14; (d) Experiment 20. Note that the PBM calibration was not performed for experiments (b,d). KME: kernel mean embedding; MFR: mass flow rate.
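For readers unfamiliar with LOOCV as used in Figure 9, the skeleton below holds out each experiment once, fits on the remaining ones, and predicts the held-out distribution from its process settings. The callable `fit_predict` is a hypothetical placeholder for any regressor, for instance one along the lines of the conditional mean embedding sketch after Figure 5.

```python
import numpy as np

def loocv_predictions(X, W, fit_predict):
    """Leave-one-out cross-validation: hold out each experiment in turn, fit on the
    remaining ones, and predict the held-out particle size distribution.

    fit_predict(X_train, W_train, x_test) -> predicted PMF on the common size grid.
    """
    n = len(X)
    predictions = []
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask excluding experiment i
        predictions.append(fit_predict(X[keep], W[keep], X[i]))
    return np.vstack(predictions)
```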
Table 1. Process parameters of the experimental conditions. L/S: liquid-to-solid ratio. 
Experiment    Throughput (kg/h)    Screw Speed (rpm)    L/S Ratio (–)
N1            5                    450                  0.3
N2            5                    675                  0.3
N3            5                    900                  0.3
N4            12.5                 450                  0.3
N5            12.5                 675                  0.3
N6            12.5                 900                  0.3
N7            25                   450                  0.3
N8            25                   675                  0.3
N9            25                   900                  0.3
N10           5                    450                  0.45
N11           5                    675                  0.45
N12           5                    900                  0.45
N13           12.5                 450                  0.45
N14           12.5                 675                  0.45
N15           12.5                 900                  0.45
N16           25                   900                  0.45
N17           25                   450                  0.45
N18           5                    900                  0.6
N19           5                    450                  0.6
N20           12.5                 450                  0.6
N21           12.5                 675                  0.6
N22           12.5                 900                  0.6
N23           25                   900                  0.6
N24           25                   675                  0.6
N25           25                   450                  0.6
N26           12.5                 675                  0.375
N27           12.5                 675                  0.525
N28           12.5                 675                  0.337
N29           12.5                 675                  0.563
Table 2. Overview of the quality of the LOOCV KME prediction and the calibration of the previous PBM model [1] for each experiment expressed in three distance functions: maximum mean discrepancy (MMD), root mean square error (RMSE), and Kullback–Leibler divergence (KL). At the bottom, the mean of each column is added. The columns of the PBM are missing some values because the calibration could not be performed for those experiments due to missing data in the wetting zone. For more information, see Van Hauwermeiren et al. [1].
Experiment    MMD KME        MMD PBM        RMSE KME       RMSE PBM       KL KME         KL PBM
1             8.10 × 10⁻⁴    3.28 × 10⁻³    5.05 × 10⁻²    5.92 × 10⁻²    3.82 × 10⁻¹    3.02 × 10⁻¹
2             4.73 × 10⁻³    -              6.22 × 10⁻²    -              5.44 × 10⁻¹    -
3             9.59 × 10⁻³    1.57 × 10⁻³    1.00 × 10⁻¹    7.49 × 10⁻²    1.84 × 10⁰     7.61 × 10⁻¹
4             5.22 × 10⁻³    -              5.06 × 10⁻²    -              5.99 × 10⁻¹    -
5             2.28 × 10⁻³    -              6.10 × 10⁻²    -              3.10 × 10⁻²    -
6             1.94 × 10⁻⁴    -              2.48 × 10⁻²    -              1.36 × 10⁻¹    -
7             7.04 × 10⁻³    8.05 × 10⁻⁴    9.56 × 10⁻²    7.41 × 10⁻²    1.93 × 10⁰     1.80 × 10⁰
8             3.10 × 10⁻³    -              4.42 × 10⁻²    -              3.30 × 10⁻²    -
9             7.58 × 10⁻⁴    4.26 × 10⁻³    3.16 × 10⁻²    6.49 × 10⁻²    5.87 × 10⁻¹    1.04 × 10⁰
10            2.10 × 10⁻³    -              7.78 × 10⁻²    -              2.30 × 10⁰     -
11            4.39 × 10⁻³    -              5.78 × 10⁻²    -              5.15 × 10⁻¹    -
12            1.16 × 10⁻³    -              5.04 × 10⁻²    -              1.06 × 10⁻¹    -
13            4.02 × 10⁻²    -              1.16 × 10⁻¹    -              4.72 × 10⁻¹    -
14            8.60 × 10⁻⁴    2.71 × 10⁻³    4.62 × 10⁻²    6.50 × 10⁻²    3.60 × 10⁻²    2.28 × 10⁻¹
15            5.46 × 10⁻⁴    -              6.22 × 10⁻²    -              4.22 × 10⁻¹    -
16            1.15 × 10⁻²    -              7.74 × 10⁻²    -              4.76 × 10⁻¹    -
17            5.27 × 10⁻²    -              1.55 × 10⁻¹    -              3.53 × 10⁰     -
18            3.73 × 10⁻²    7.26 × 10⁻⁴    1.63 × 10⁻¹    1.10 × 10⁻¹    9.09 × 10⁻¹    1.92 × 10⁰
19            1.39 × 10⁻³    1.07 × 10⁻³    6.00 × 10⁻²    5.76 × 10⁻²    9.35 × 10⁻¹    7.98 × 10⁻¹
20            4.90 × 10⁻³    -              9.76 × 10⁻²    -              2.34 × 10⁰     -
21            6.51 × 10⁻³    -              8.01 × 10⁻²    -              5.39 × 10⁻¹    -
22            4.99 × 10⁻²    -              1.55 × 10⁻¹    -              4.45 × 10⁻¹    -
23            4.58 × 10⁻²    1.52 × 10⁻³    1.38 × 10⁻¹    6.83 × 10⁻²    9.43 × 10⁻¹    7.34 × 10⁻¹
24            6.31 × 10⁻³    -              9.74 × 10⁻²    -              9.75 × 10⁻²    -
25            5.53 × 10⁻²    3.52 × 10⁻³    3.01 × 10⁻¹    1.48 × 10⁻¹    9.82 × 10⁻¹    3.21 × 10⁰
26            2.61 × 10⁻³    -              3.70 × 10⁻²    -              3.12 × 10⁻²    -
27            8.45 × 10⁻³    -              1.16 × 10⁻¹    -              7.64 × 10⁻²    -
28            8.80 × 10⁻³    -              6.52 × 10⁻²    -              5.45 × 10⁻²    -
29            4.31 × 10⁻³    -              6.64 × 10⁻²    -              9.46 × 10⁻²    -
mean          1.31 × 10⁻²    2.16 × 10⁻³    8.76 × 10⁻²    8.03 × 10⁻²    7.38 × 10⁻¹    1.20 × 10⁰
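As a reading aid for Table 2, the sketch below shows one way the three distance measures could be computed between a measured PMF p and a predicted PMF q defined on the same size grid. The exact conventions (squared versus unsquared MMD, the direction of the KL divergence, and the handling of empty bins) are assumptions here and may differ from the implementation used in the paper.

```python
import numpy as np

def mmd(p, q, L):
    """Maximum mean discrepancy between two PMFs on the same grid,
    where L is the Gram matrix of the kernel on that grid."""
    d = p - q
    return float(np.sqrt(max(d @ L @ d, 0.0)))

def rmse(p, q):
    """Root mean square error between the two PMF vectors."""
    return float(np.sqrt(np.mean((p - q) ** 2)))

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), with a small floor to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))
```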
