For completeness, the basic idea of the cross-entropy method is briefly introduced. Denote $X=\{X_{1},X_{2},...,X_{k}\}$ as a $k$-dimensional random variable taking values in $\mathcal{X}$ and having a probability distribution $f(x;\theta)$, where $\theta$ is the distribution parameter. Define a real-valued function $h(x)$; the probability of the event $h(x)\ge \gamma$ is of interest. This probability can be expressed as

$$l_{\theta}=\mathbb{P}_{\theta}\left(h(X)\ge \gamma \right)=\mathbb{E}_{f}\left[\mathbf{1}\left(h(X)\ge \gamma \right)\right], \tag{1}$$

where $\mathbf{1}\left(h(x)\ge \gamma \right)$ is the indicator function taking the value 1 if $h(x)\ge \gamma$, and 0 otherwise, and $\mathbb{E}_{f}\left[\cdot\right]$ is the expectation operator under the distribution $f(x;\theta)$. For realistic problems, $l_{\theta}$ is usually estimated using simulation-based methods, such as the Monte Carlo (MC) method or its variants. According to the Law of Large Numbers and the Central Limit Theorem, the MC estimator has a relative error of

$$\epsilon \approx \frac{1}{\sqrt{N\,l_{\theta}}}, \tag{2}$$

where $N$ is the number of simulation samples. If $l_{\theta}$ is very small (a rare event), for example $l_{\theta}=10^{-6}$ and $\epsilon =0.01$, the estimator needs $N\approx 10^{10}$ samples, which is computationally infeasible. The importance sampling (IS) method instead draws samples from a different distribution $g(x)$, and $l_{\theta}$ can be estimated as

$$\hat{l}_{\theta}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)\frac{f(X^{(i)};\theta)}{g(X^{(i)})},\qquad X^{(i)}\sim g(x). \tag{3}$$
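To illustrate the sample-size issue and the change of measure, the sketch below estimates a rare tail probability of a standard normal by crude MC and by IS. The event $X\ge 4$, the shifted proposal $g(x)=\mathcal{N}(4,1)$, and all variable names are illustrative assumptions, not from the original work:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)

# Hypothetical rare event: l = P(X >= 4) for X ~ N(0, 1); exact value ~3.17e-5.
gamma, n = 4.0, 10_000
exact = 0.5 * erfc(gamma / sqrt(2.0))

# Crude Monte Carlo: with only 10^4 samples, a typical run sees almost no hits.
mc_est = np.mean(rng.normal(0.0, 1.0, n) >= gamma)

# Importance sampling with g(x) = N(gamma, 1); the likelihood ratio is
# f(x)/g(x) = exp(gamma**2/2 - gamma*x).
x = rng.normal(gamma, 1.0, n)
w = np.exp(gamma**2 / 2.0 - gamma * x)
is_est = np.mean((x >= gamma) * w)
```

With the same budget of $10^4$ samples, the IS estimate typically lands within a few percent of the exact value, while the crude MC estimate is dominated by zero counts.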

It is seen that the ideal $g(x)$, denoted as $g_{*}(x)$, is

$$g_{*}(x)=\frac{\mathbf{1}\left(h(x)\ge \gamma \right)f(x;\theta)}{l_{\theta}}. \tag{4}$$

By using the change of measure in Equation (4), the following result is guaranteed:

$$\hat{l}_{\theta}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)\frac{f(X^{(i)};\theta)}{g_{*}(X^{(i)})}=l_{\theta}, \tag{5}$$

where $X^{(i)}$ is the $i$th random sample. This means the estimator has zero variance, and only $N=1$ sample is needed. However, $g_{*}(x)$ is difficult to obtain because $l_{\theta}$ is unknown. Therefore, it is desirable to find a $g(x)$ that is as close to the ideal distribution of Equation (4) as possible. In practice, $g(x)$ can be chosen as the same type of distribution as $f(x;\theta)$ but with a different parameter $\tilde{\theta}$, i.e., $g(x):=f(x;\tilde{\theta})$. One convenient measure of the “distance” between two distributions is the Kullback–Leibler (KL) distance [43], which is also referred to as cross-entropy in engineering and is defined as

$$D\left(p_{1},p_{2}\right)=\mathbb{E}_{p_{1}}\left[\ln \frac{p_{1}(X)}{p_{2}(X)}\right]=\sum_{x}p_{1}(x)\ln p_{1}(x)-\sum_{x}p_{1}(x)\ln p_{2}(x), \tag{6}$$

where $p_{1}(x)$ and $p_{2}(x)$ are two distributions. Minimizing Equation (6) between $g_{*}(x)$ and $g(x):=f(x;\tilde{\theta})$ leads to minimizing

$$D\left(g_{*},f(\cdot;\tilde{\theta})\right)=\sum_{x}g_{*}(x)\ln g_{*}(x)-\sum_{x}g_{*}(x)\ln f(x;\tilde{\theta}), \tag{7}$$

which is equivalent to maximizing $\sum_{x}g_{*}(x)\ln f(x;\tilde{\theta})$, due to the fact that $\sum_{x}g_{*}(x)\ln g_{*}(x)$ is a constant. Substitution of $g_{*}(x)$ from Equation (4) into Equation (7) casts the problem as finding

$$\theta_{*}=\underset{\tilde{\theta}}{\arg\max}\;\mathbb{E}_{\theta}\left[\mathbf{1}\left(h(X)\ge \gamma \right)\ln f(X;\tilde{\theta})\right]. \tag{8}$$
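A minimal numerical illustration of the KL distance of Equation (6) for discrete distributions (the two three-point distributions below are made-up values for illustration only):

```python
import numpy as np

def kl_distance(p1, p2):
    """KL distance D(p1, p2) = sum_x p1(x) * ln(p1(x) / p2(x))."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return float(np.sum(p1 * np.log(p1 / p2)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
```

Note that $D(p,p)=0$ while $D(p,q)\ne D(q,p)$ in general; the KL distance is asymmetric and thus a "distance" only in a loose sense.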

Given an initial sampling distribution $f(x;\theta_{0})$, Equation (8) is equivalent to

$$\theta_{*}=\underset{\tilde{\theta}}{\arg\max}\;\mathbb{E}_{\theta_{0}}\left[\mathbf{1}\left(h(X)\ge \gamma \right)W(X;\theta,\theta_{0})\ln f(X;\tilde{\theta})\right], \tag{9}$$

where $W(x;\theta,\theta_{0})=\frac{f(x;\theta)}{f(x;\theta_{0})}$ is the likelihood ratio. Using discrete samples, $\theta_{*}$ can be estimated by

$$\hat{\theta}_{*}=\underset{\tilde{\theta}}{\arg\max}\;\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)W(X^{(i)};\theta,\theta_{0})\ln f(X^{(i)};\tilde{\theta}), \tag{10}$$

where $X^{(1)},X^{(2)},...,X^{(N)}$ are random samples from $f(x;\theta_{0})$.

$\hat{\theta}_{*}$ can be obtained by solving the following equations:

$$\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)W(X^{(i)};\theta,\theta_{0})\nabla \ln f(X^{(i)};\tilde{\theta})=0. \tag{11}$$

Given that the distribution $f(x;\tilde{\theta})$ belongs to a natural exponential family, $\hat{\theta}_{*}$ can be calculated analytically. An iterative procedure for solving for $\theta_{*}$ is presented as Algorithm 1 [40].
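For instance, under the assumption (made here for illustration, not taken from the original work) that $f(x;\mu)$ is a Gaussian density with unknown mean $\mu$ and fixed standard deviation $\sigma$, Equation (11) has a closed-form root:

$$\nabla_{\mu}\ln f(x;\mu)=\frac{x-\mu}{\sigma^{2}}
\quad\Longrightarrow\quad
\hat{\mu}=\frac{\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)W(X^{(i)};\theta,\theta_{0})\,X^{(i)}}{\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\ge \gamma \right)W(X^{(i)};\theta,\theta_{0})},$$

i.e., the updated mean is the likelihood-ratio-weighted average of the elite samples.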

**Algorithm 1** Iterative Procedure of Estimation for $\theta_*$

$t\leftarrow 0$; initialize $\theta_0$, e.g., $\theta_0=\theta$.
**repeat**
(1) Draw $N$ random samples $x^{(1)},x^{(2)},...,x^{(N)}$ from $f(x;\theta_t)$.
(2) Calculate $h_i=h(x^{(i)})$ for all $i$; sort $h_i$ from largest to smallest.
(3) $\gamma_t\leftarrow h_M$, where $1\le M\le N$.
(4) Calculate $\theta_{t+1}$ by solving $\sum_{i=1}^{N}\mathbf{1}(h(X^{(i)})\ge \gamma_t)W(X^{(i)};\theta,\theta_t)\nabla \ln f(X^{(i)};\theta_{t+1})=0$.
(5) $t\leftarrow t+1$.
**until** $\gamma_t\ge \gamma$.
$\theta_*\leftarrow \theta_{t+1}$.
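A minimal sketch of Algorithm 1 in Python, assuming a one-dimensional Gaussian family $f(x;\mu)$ with fixed $\sigma$, $h(x)=x$, and a target level $\gamma=4$; the function name, the elite fraction `rho`, and the test problem are illustrative assumptions, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ce_rare_event(gamma, mu0=0.0, sigma=1.0, n=10_000, rho=0.01, max_iter=50):
    """Sketch of Algorithm 1: estimate l = P(h(X) >= gamma) for h(x) = x, X ~ N(mu0, sigma^2)."""
    mu_t = mu0
    for _ in range(max_iter):
        x = rng.normal(mu_t, sigma, n)          # (1) draw N samples from f(x; theta_t)
        gamma_t = np.quantile(x, 1.0 - rho)     # (2)-(3) elite threshold gamma_t = h_M
        gamma_t = min(gamma_t, gamma)           # do not raise the level past the target
        elite = x[x >= gamma_t]
        # (4) Gaussian case: the root of the weighted log-likelihood equation is the
        # likelihood-ratio-weighted mean of the elite samples, W = f(x; mu0)/f(x; mu_t)
        w = np.exp(((elite - mu_t) ** 2 - (elite - mu0) ** 2) / (2.0 * sigma**2))
        mu_t = np.sum(w * elite) / np.sum(w)
        if gamma_t >= gamma:                    # until gamma_t >= gamma
            break
    # final importance-sampling estimate of l using the tilted density f(x; mu_t)
    x = rng.normal(mu_t, sigma, n)
    w = np.exp(((x - mu_t) ** 2 - (x - mu0) ** 2) / (2.0 * sigma**2))
    return float(np.mean((x >= gamma) * w))

l_hat = ce_rare_event(4.0)
```

For this test problem the level reaches $\gamma$ within a couple of iterations, and the final IS estimate is typically within a few percent of the exact tail probability $\approx 3.17\times 10^{-5}$.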

For optimization problems, Algorithm 1 needs slight modifications because the objective is no longer finding an optimized importance-sampling distribution parameter $\theta_{*}$. The idea is to convert an optimization problem into its associated stochastic problem. Consider a general minimization problem with a real-valued performance function $h(x)$ and a state variable $x\in \mathcal{X}$. The minimum of $h(x)$ is denoted as $\gamma_{*}$ and is attained when $x$ is equal to $x_{*}$. Define a threshold value $\gamma \in \mathbb{R}$ and an indicator function $\mathbf{1}\left(h(x)\le \gamma \right)$. This setting allows one to estimate the probability of $h(x)\le \gamma$ as

$$l_{\theta}(\gamma)=\mathbb{P}_{\theta}\left(h(X)\le \gamma \right)=\mathbb{E}_{f}\left[\mathbf{1}\left(h(X)\le \gamma \right)\right], \tag{12}$$

where $f(x;\theta)$ is defined as before. It is seen that if $\gamma$ is close to $\gamma_{*}$, i.e., $\gamma =\gamma_{*}+\epsilon$, where $\epsilon$ is a very small positive number, $h(x)\le \gamma$ can be treated as a rare event. In that case, $l_{\theta}(\gamma_{*}+\epsilon)$ will be a small quantity, and thus importance sampling can be used to estimate it efficiently. The ideal parameter $\theta_{*}$ of the importance sampling distribution $f(x;\theta_{*})$ can be estimated following Equations (8)–(11) as

$$\hat{\theta}_{*}=\underset{\tilde{\theta}}{\arg\max}\;\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(h(X^{(i)})\le \gamma_{t}\right)\ln f(X^{(i)};\tilde{\theta}). \tag{13}$$

The term $W(X^{(i)};\theta,\theta_{0})$ is dropped from Equation (13) because, in each iteration, $\gamma_{t}$ is assigned a new value based on the samples of the current iteration. As a result, the corresponding $\theta$ is equal to $\theta_{t}$, and $W(X;\theta,\theta_{t})=1$. The connection between the estimation of $l_{\theta}(\gamma_{*}+\epsilon)$ and finding the solution to the optimization problem (i.e., $x_{*}$) lies in the following fact: it is plausible that $f(x;\theta_{*})$ assigns most of its probability mass close to $x_{*}$ if $\gamma$ is close to $\gamma_{*}$; therefore, the resulting simulation samples from $f(x;\theta_{*})$ can be used to estimate $x_{*}$. The estimation of $x_{*}$ now involves a two-stage procedure, and one possible implementation is described by Algorithm 2 [40]:

**Algorithm 2** Iterative Procedure of Estimation for $x_*$ in a General Minimization Problem

$t\leftarrow 0$; initialize $\theta_0$ and $\gamma$, e.g., let $\theta_0=\theta$ and assign $\gamma$ a large value.
**repeat**
(1) Draw $N$ random samples $x^{(1)},x^{(2)},...,x^{(N)}$ from $f(x;\theta_t)$.
(2) Calculate $h_i=h(x^{(i)})$ for all $i$; sort $h_i$ from smallest to largest.
(3) $\gamma_t\leftarrow h_M$, where $1\le M\le N$.
(4) Calculate $\theta_{t+1}$ by solving $\sum_{i=1}^{N}\mathbf{1}(h(X^{(i)})\le \gamma_t)\nabla \ln f(X^{(i)};\theta_{t+1})=0$.
(5) $t\leftarrow t+1$; $\gamma \leftarrow \gamma_t$.
**until** convergence is reached.
Draw samples $x^{(1)},x^{(2)},...,x^{(N)}$ from $f(x;\theta_t)$; $x_*$ is estimated by $\frac{1}{N}\sum_{i=1}^{N}x^{(i)}$.
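A minimal sketch of Algorithm 2, assuming a Gaussian sampling density whose mean and standard deviation are both updated analytically from the elite samples, applied to a made-up test function $h(x)=(x-2)^2$; all names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ce_minimize(h, mu0=0.0, sigma0=5.0, n=500, rho=0.1, iters=60):
    """Sketch of Algorithm 2: CE minimization with sampling density N(mu, sigma^2)."""
    mu, sigma = mu0, sigma0
    for _ in range(iters):
        x = rng.normal(mu, sigma, n)               # (1) draw N samples from f(x; theta_t)
        hx = h(x)                                  # (2) evaluate the performance function
        elite = x[np.argsort(hx)[: int(rho * n)]]  # (2)-(3) keep the M best samples
        mu, sigma = elite.mean(), elite.std()      # (4) analytic Gaussian parameter update
    # final estimate of x_*: mean of samples drawn from the converged density
    return float(rng.normal(mu, sigma, n).mean())

x_star = ce_minimize(lambda x: (x - 2.0) ** 2)
```

Here the sampling density concentrates around the minimizer $x_*=2$ as $\sigma$ shrinks. In practice, a smoothing parameter is often applied to the updates of $\mu$ and $\sigma$ to guard against premature shrinkage of the sampling density.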