Data-Driven Yield Estimation and Maximization Using Bayesian Optimization Under Uncertainty

Sano, Kei; Kawahito, Daiki; Saito, Yukiya; Moki, Hironori; Djurdjanovic, Dragan

doi:10.3390/app16073213

Open Access Article

by Kei Sano ¹,*, Daiki Kawahito ¹, Yukiya Saito ¹, Hironori Moki ¹ and Dragan Djurdjanovic ²

¹ Tokyo Electron Ltd., Tokyo 107-6325, Japan

² NSF Engineering Research Center, The University of Texas at Austin, Austin, TX 78712, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3213; https://doi.org/10.3390/app16073213

Submission received: 23 December 2025 / Revised: 10 March 2026 / Accepted: 11 March 2026 / Published: 26 March 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Featured Application

This paper applies yield estimation and optimization with minimal experiments and stringent specifications in the field of quality control.

Abstract

In this paper, we propose a novel method which utilizes samples of measured product quality characteristics to efficiently estimate the probabilities of those quality characteristics being within the desired specifications and, consequently, the process yield. Specifically, when dealing with 1D Gaussian distributions, we formally prove that the proposed yield estimator asymptotically gives a lower Mean Squared Error compared to the best unbiased estimator. In order to enable maximization of yield, this novel estimator is incorporated into the framework of Bayesian Optimization which iteratively seeks controllable tool parameters under which the outgoing product yield is maximized. The newly proposed yield maximization method is demonstrated in an application involving high-fidelity simulations of a reactive ion etch chamber, a tool component commonly used in semiconductor manufacturing. The aim of these simulations was to rapidly and reliably determine tool parameters that maximize the probability of delivering desired plasma density characteristics under stochastic variations in chamber conditions. The novel yield estimation and optimization methods show superiority when the number of experimental observations is limited and the distributions of outgoing product characteristics can be approximated well by a Gaussian distribution.

Keywords:

Bayes procedures; optimization methods; probability; simulation; statistics

1. Introduction

In advanced manufacturing, product specifications have become increasingly stringent, and relevant tools must be meticulously regulated to ensure that the outgoing product quality meets these specifications. However, even when a tool operates under identical control parameter settings, the quality of fabricated products usually still varies. As a result, some portion of the outgoing products may potentially be defective. This variability arises because tools are influenced by numerous environmental factors that cannot be observed or controlled. To minimize defective quality and maximize yields, it is crucial to efficiently address the uncertainties arising from these uncontrollable and unobservable factors.

Problems in yield assessment vary across industry sectors, including additive manufacturing [1], semiconductor fabrication [2,3], and the pharmaceutical industry [4,5]. A good yield is qualitatively defined as having quality characteristics of outgoing products that match the desired values with minimal variation. On a quantitative level, various formulations for defining yield have been considered, such as the sum of bias and variance [6,7] or the probability that the quality characteristics meet the specified constraints [8,9,10,11,12]. In cases where quality characteristics are multivariate, multi-criteria frameworks may be employed to define yield, as in [13,14]. In this manuscript, we define yield as the probability that the quality characteristic falls within the predefined specification limits (it should be noted that even when quality characteristics are multidimensional, yield defined as above is a scalar value), as discussed in [9,10].

A common approach is to treat yield as a Bernoulli parameter and estimate it via the maximum likelihood as in [9,11]. These studies regard the maximum likelihood estimation as a method using traditional Monte Carlo techniques. They propose the use of efficient sampling methods, such as importance sampling [11] or the cross-entropy method [9], each aiming for the highest possible accuracy using as few samples as possible. Nevertheless, improvement in sampling methods is only applicable when the probability density function of the quality characteristics is known. An alternative strategy for efficiently estimating yields is to assume a prior distribution for the parameter of a Bernoulli distribution and use samples to iteratively update that distribution. The study reported in [12] suggests employing a Beta distribution as the prior distribution of the Bernoulli parameter. While this approach can achieve high accuracy when provided with a precise initial yield estimation, it is often unreasonable to assume that such accurate yield estimations are available, especially without significant gathering of prior observations of the quality characteristics. Therefore, while leveraging prior knowledge can enhance accuracy with a limited number of samples, it is often preferable for estimation methods to operate without any assumptions about the true yield value, especially when one is developing a new process—i.e., when such knowledge is limited or does not exist.

To address this need, this article proposes estimating the yield by assuming and exploiting knowledge of the form of the underlying distribution of the product quality, with the standard assumption of independent and identically distributed (i.i.d.) samples. The main idea is that when one does have knowledge about the distribution form, one can first estimate its parameters and then subsequently calculate the corresponding probability of product quality falling within the desired specifications. Note that the estimator based on the Bernoulli distribution is the best unbiased estimator whose variance attains the Cramér–Rao bound [15]; however, it does not utilize any knowledge about the underlying distribution of the quality characteristic. Although the proposed estimator is generally biased, it can achieve a lower Mean Squared Error (MSE) due to the trade-off between bias and variance. In this article, we will formally prove that, under the assumptions that a scalar (univariate) quality output follows a Gaussian distribution and the desired quality specification area is an interval, the estimator based on the Gaussian distribution is asymptotically superior to the Bernoulli-based estimator in terms of the MSE metric.

Besides estimating yield from product quality measurements, a complementary, high-impact problem is to use those measurements to adjust the controllable parameters of the relevant manufacturing tool and thus maximize the outgoing product yield. Such sampling-driven optimization of yield is a classical problem of optimization under uncertainty and has been examined in numerous manufacturing sectors, including pharmaceutical manufacturing [16,17], the petrochemical industry [18,19], circuit design [10,20], and material design [21], to name a few.

Regardless of the application area, the success metric for any sampling-driven yield optimization approach is the ability to determine controllable parameter settings that maximize the outgoing yield, while using as few samples of the outgoing product quality as possible. One highly promising approach for addressing this classical problem of optimization under uncertainty is Bayesian Optimization (BO). In the field of BO, various forms of uncertainties and objective functions have been considered, as discussed in [22,23,24,25,26,27]. Similar problem formulations to the one considered in this article are proposed in [28,29], where the authors proposed a method based on a criterion they named probability threshold robustness measure, with the underlying assumption that the environmental factors are observable. However, in many situations, the environmental factors are not observable, which necessitates a different approach; this need will be addressed in this paper. More specifically, to maximize the probability that the quality characteristic falls within the desired specifications, we propose incorporating the novel yield estimator introduced in this paper into a BO framework. Unlike previous approaches modeling empirical probabilities directly such as [27], our method employs Gaussian Process Regression (GPR) to model the parameters of the quality distribution. To address noise effects caused by the unobservable and uncontrollable environmental factors, a repeated experimentation strategy will be employed, similar to how BO was performed in [25,27].

In summary, the contributions of this article are as follows:

The introduction of a yield estimator based on the estimates of the mean and variance parameters of the distribution characterizing the outgoing product quality which are then used to estimate the probability that the outgoing quality falls within a desired specification interval.
A mathematical proof of asymptotic superiority in terms of MSE of the aforementioned yield estimator compared to the traditionally utilized yield estimator based on estimating the Bernoulli distribution parameter under the assumption that the outgoing product quality variables follow an uncorrelated Gaussian distribution.
The introduction of a BO algorithm which uses the aforementioned mean and variance parameter-based estimator of yield to tune controllable process parameters in a way that maximizes the probability that the random variable characterizing the outgoing product quality falls within the desired specifications.

The newly proposed approaches to yield estimation and optimization were evaluated in simulations of a plasma chamber reactor, which is a highly complex system commonly used in semiconductor manufacturing. The aim was to use the novel methods to iteratively drive the controllable tool parameters towards a setting at which key plasma parameters could be kept within the desired specifications with maximum probability, given the inherent uncertainties present in the plasma chamber reactor.

The remainder of this article is structured as follows: In Section 2.1, we introduce a yield estimator based on estimating the mean and variance of the distribution of the outgoing product quality, and we introduce several theorems characterizing that estimator and proving its superiority over the Bernoulli-based yield estimator when the outgoing product quality characteristics follow an uncorrelated Gaussian distribution. In Section 2.2, we formalize the optimization problem for sampling-driven maximization of yield and propose two BO algorithms to solve it—one using the newly proposed yield estimator, and the other employing the traditional yield estimator. In Section 3.1, we comparatively evaluate the performance of the yield estimators. First, we examine problems where quality characteristics follow Gaussian distributions and empirically validate the theorems presented in Section 2.1. Second, we estimate the probability of plasma parameters falling within the specifications in the presence of random condition variations in a plasma reactor, with plasma behaviors being simulated using a commercial high-fidelity simulator. Similar to Section 3.1, in Section 3.2, we assess the effectiveness of the proposed BO-based yield maximization algorithms for scenarios where the outgoing quality characteristics follow Gaussian distributions, as well as for a scenario in which that is not the case. Finally, Section 4 concludes this article and offers suggestions for possible future work.

2. Materials and Methods

2.1. Estimation of Yields

A. Problem formulation

Let $y \in \mathbb{R}$ be a random variable representing a quality characteristic and let $S = [a, b]$ denote the desired specification interval for $y$. The yield is defined as $p_* = \Pr(y \in [a, b])$, i.e., as the probability that the quality characteristic falls within the desired specification limits. Suppose $y$ follows a Gaussian distribution with unknown mean $\mu_*$ and variance $\sigma_*^2$. In addition, let us observe $N$ independent samples $\{y_1, \dots, y_N\}$ of the outgoing quality, where

$$y_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu_*, \sigma_*^2), \quad n \in \{1, 2, \dots, N\}. \tag{1}$$

The objective is to accurately estimate the probability $p_*$ via some estimator $\hat{p}$ from the samples $\{y_1, \dots, y_N\}$. The accuracy of the estimator will be assessed via its MSE, defined as

$$\mathbb{E}_{\{y_1, \dots, y_N\}}\left[(p_* - \hat{p})^2\right]. \tag{2}$$

Obviously, estimators with a smaller MSE are generally regarded as better estimators. Note that our formulation of yield differs from outage error probability discussed in estimation theory as in [30]. Outage error probability is the probability that the error of the estimate exceeds a certain threshold. In our formulation, yield itself is the estimation target expressed as a probability.

B. Yield estimator based on estimating Bernoulli distribution parameter

By focusing on whether a sample $y_n$ satisfies the specifications or not, yield estimation can be formulated as the problem of estimating the parameter of a Bernoulli distribution, which describes the yield as in [31]. The maximum likelihood estimator is given by

$$\hat{p}_1 = \frac{1}{N} \sum_{n=1}^{N} 1_S(y_n), \tag{3}$$

where $1_S$ is the indicator function defined as

$$1_S(y) = \begin{cases} 1, & y \in S \\ 0, & y \notin S. \end{cases} \tag{4}$$

Since $N \hat{p}_1$ is binomially distributed with parameters $N$ and $p_*$, the MSE of $\hat{p}_1$ is easily derived as

$$\mathbb{E}\left[(\hat{p}_1 - p_*)^2\right] = \frac{1}{N} p_* (1 - p_*). \tag{5}$$

Note that $\hat{p}_1$ is the best unbiased estimator of the Bernoulli parameter in the sense that it is unbiased and its variance, which for an unbiased estimator equals its MSE, attains the Cramér–Rao bound [15].

C. Yield estimator based on estimates of mean and variance parameters

Under the assumption that $y$ follows a Gaussian distribution, we can estimate the parameters of that distribution and subsequently estimate the required probability $p_*$. Let us use the unbiased estimators for the mean and variance of Gaussian distributions, which are, respectively,

$$\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (y_n - \hat{\mu})^2. \tag{6}$$

Using those estimators, we can construct the estimator of the probability $p_*$ as follows:

$$\hat{p}_2 = \Pr\left(y \in [a, b] \mid y \sim \mathcal{N}(\hat{\mu}, \hat{\sigma}^2)\right) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \int_a^b \exp\left(-\frac{(z - \hat{\mu})^2}{2\hat{\sigma}^2}\right) dz. \tag{7}$$

Here, let us define a function $F(a, b, \mu, V)$ as

$$F(a, b, \mu, V) = \frac{1}{2}\left(\operatorname{erf}\left(\frac{b - \mu}{\sqrt{2V}}\right) - \operatorname{erf}\left(\frac{a - \mu}{\sqrt{2V}}\right)\right), \tag{8}$$

where $\operatorname{erf}(\cdot)$ is the error function defined by

$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt. \tag{9}$$

Then, the estimator $\hat{p}_2$ can be expressed as

$$\hat{p}_2 = F(a, b, \hat{\mu}, \hat{\sigma}^2). \tag{10}$$
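As an illustration of Equations (3), (6), (8) and (10), the two estimators can be sketched in a few lines of Python using only the standard library. The function names, the sample values and the specification limits below are hypothetical and are not taken from the article; the sketch also assumes that the samples are not all identical, since a zero sample variance would make the plug-in formula undefined.

```python
import math
import statistics


def gaussian_interval_prob(a, b, mu, var):
    """F(a, b, mu, V) from Eq. (8): P(a <= y <= b) for y ~ N(mu, var)."""
    s = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def yield_bernoulli(samples, a, b):
    """p_hat_1 from Eq. (3): fraction of samples inside [a, b]."""
    return sum(a <= y <= b for y in samples) / len(samples)


def yield_gaussian(samples, a, b):
    """p_hat_2 from Eq. (10): plug the estimates of Eq. (6) into F."""
    mu_hat = statistics.fmean(samples)
    # statistics.variance divides by N - 1, matching Eq. (6).
    var_hat = statistics.variance(samples, mu_hat)
    return gaussian_interval_prob(a, b, mu_hat, var_hat)


# Hypothetical measurements and specification limits, for illustration only.
samples = [9.8, 10.1, 10.4, 9.6, 10.0, 10.9, 9.9, 10.2]
print("p_hat_1:", yield_bernoulli(samples, 9.5, 10.5))
print("p_hat_2:", yield_gaussian(samples, 9.5, 10.5))
```

With only eight samples, the Bernoulli-based estimate can take just nine distinct values, whereas the Gaussian-based estimate varies continuously with the sample mean and variance.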

In order to calculate the MSE of $\hat{p}_2$, we use the fact that $\hat{\mu}$ and $\hat{\sigma}^2$ are independent and distributed as $\mathcal{N}\left(\mu_*, \frac{\sigma_*^2}{N}\right)$ and $\frac{\sigma_*^2}{N-1} \chi_{N-1}^2$, respectively (e.g., see Chapter 6 of [32]). Then, the MSE of the estimator $\hat{p}_2$ is calculated as follows:

$$\mathbb{E}\left[(\hat{p}_2 - p_*)^2\right] = \mathbb{E}_{Y,Z}\left[F(a, b, Z, Y)^2\right] - 2 p_* \mathbb{E}_{Y,Z}\left[F(a, b, Z, Y)\right] + p_*^2, \tag{11}$$

where the distributions of $Y$ and $Z$ satisfy

$$\frac{N-1}{\sigma_*^2}\, Y \sim \chi_{N-1}^2; \qquad Z \sim \mathcal{N}\left(\mu_*, \frac{\sigma_*^2}{N}\right). \tag{12}$$

Although the MSE of the estimates is mainly discussed in the main text, confidence intervals of the estimates are also important in applications. The derivation of the confidence intervals is explained in Appendix A.
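The expectations in (11) have no simple closed form, but both MSEs can be approximated by simulation. The following sketch, which is an illustration rather than the procedure used in the article, draws repeated Gaussian samples of size $N$ and averages the squared errors of both estimators. The parameter values ($\mu_* = 0$, $\sigma_* = 1$, $[a, b] = [-1, 1]$, $N = 20$), the number of trials and the random seed are assumptions chosen only for the example.

```python
import math
import random
import statistics


def F(a, b, mu, var):
    """Eq. (8): probability of [a, b] under N(mu, var)."""
    s = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def simulate_mse(a, b, mu, sigma, n, trials=5000, seed=0):
    """Monte Carlo approximation of the MSEs of p_hat_1 and p_hat_2."""
    rng = random.Random(seed)
    p_true = F(a, b, mu, sigma ** 2)
    se1 = se2 = 0.0
    for _ in range(trials):
        ys = [rng.gauss(mu, sigma) for _ in range(n)]
        p1 = sum(a <= y <= b for y in ys) / n                         # Eq. (3)
        p2 = F(a, b, statistics.fmean(ys), statistics.variance(ys))   # Eq. (10)
        se1 += (p1 - p_true) ** 2
        se2 += (p2 - p_true) ** 2
    return se1 / trials, se2 / trials, p_true


n = 20
mse1, mse2, p_true = simulate_mse(a=-1.0, b=1.0, mu=0.0, sigma=1.0, n=n)
print(f"Monte Carlo MSE of p_hat_1: {mse1:.5f}")
print(f"Eq. (5) value:              {p_true * (1.0 - p_true) / n:.5f}")
print(f"Monte Carlo MSE of p_hat_2: {mse2:.5f}")
```

The simulated MSE of $\hat{p}_1$ should agree with (5) up to Monte Carlo error, which gives a quick sanity check on the simulation itself.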

D. Comparison of MSEs of two estimators

If $[a, b]$, $\mu_*$, $\sigma_*^2$ and $N$ are known, then the performance of the estimators can be compared directly by evaluating (5) and (11). However, in practice, $\mu_*$ and $\sigma_*^2$ are unknown. Here, we formulate and prove a theorem showing that, when the number of samples $N$ is sufficiently large, metric (11) is no larger than (5) for any $[a, b]$, $\mu_*$ and $\sigma_*^2$.

First, let us apply the delta method (described in numerous statistics textbooks; see, e.g., [33] or [34]) to the MSE of the estimator $\hat{p}_2$ in order to derive its asymptotic behavior in the following lemma.

Lemma 1 (Asymptotic MSE of $\hat{p}_2$).

Suppose that we are given the specification interval $[a, b]$ and $N$ independent samples $y_1, \dots, y_N$, where $y_n \sim \mathcal{N}(\mu_*, \sigma_*^2)$, $n \in \{1, 2, \dots, N\}$. Let $p_*$ denote the target probability that $y \in [a, b]$ if $y \sim \mathcal{N}(\mu_*, \sigma_*^2)$, i.e., let

$$p_* = \Pr\left(y \in [a, b] \mid y \sim \mathcal{N}(\mu_*, \sigma_*^2)\right). \tag{13}$$

Then, for the Gaussian parameter-based estimator $\hat{p}_2$ defined in (7), the corresponding MSE can be expressed as follows:

$$\begin{aligned} \mathbb{E}\left[(\hat{p}_2 - p_*)^2\right] ={}& \frac{1}{4\pi\sigma_*^2 N}\, e^{-\frac{(a - \mu_*)^2}{\sigma_*^2}} \left(2\sigma_*^2 + (a - \mu_*)^2\right) \\ &+ \frac{1}{4\pi\sigma_*^2 N}\, e^{-\frac{(b - \mu_*)^2}{\sigma_*^2}} \left(2\sigma_*^2 + (b - \mu_*)^2\right) \\ &- \frac{1}{4\pi\sigma_*^2 N}\, e^{-\frac{(a - \mu_*)^2 + (b - \mu_*)^2}{2\sigma_*^2}} \left(4\sigma_*^2 + 2(a - \mu_*)(b - \mu_*)\right) \\ &+ O\left(\frac{1}{N^2}\right). \end{aligned} \tag{14}$$

For details of the derivation, please refer to Appendix B.
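For a concrete feel for Lemma 1, the leading $1/N$ term of (14) can be evaluated directly and compared with the exact MSE (5) of the Bernoulli-based estimator. The sketch below is illustrative only: the function names and the numerical values of $a$, $b$, $\mu_*$, $\sigma_*^2$ and $N$ are assumptions, and the $O(1/N^2)$ remainder is ignored.

```python
import math


def asymptotic_mse_gaussian(a, b, mu, var, n):
    """Leading 1/N term of Eq. (14); the O(1/N^2) remainder is dropped."""
    da, db = a - mu, b - mu
    c = 1.0 / (4.0 * math.pi * var * n)
    term_a = math.exp(-da ** 2 / var) * (2.0 * var + da ** 2)
    term_b = math.exp(-db ** 2 / var) * (2.0 * var + db ** 2)
    cross = math.exp(-(da ** 2 + db ** 2) / (2.0 * var)) * (4.0 * var + 2.0 * da * db)
    return c * (term_a + term_b - cross)


def mse_bernoulli(a, b, mu, var, n):
    """Exact MSE of p_hat_1 from Eq. (5)."""
    s = math.sqrt(2.0 * var)
    p = 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))
    return p * (1.0 - p) / n


# Hypothetical settings: a centered and an off-center specification interval.
for a, b in [(-1.0, 1.0), (0.5, 2.0)]:
    m2 = asymptotic_mse_gaussian(a, b, mu=0.0, var=1.0, n=20)
    m1 = mse_bernoulli(a, b, mu=0.0, var=1.0, n=20)
    print(f"[a, b] = [{a}, {b}]: Eq. (5) = {m1:.5f}, Eq. (14) leading term = {m2:.5f}")
```

For the centered interval $[-1, 1]$ with unit variance, the first two terms of (14) coincide and the expression reduces to $e^{-1}/(\pi N)$, which is a convenient hand check.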

Theorem 1 (comparison of asymptotic MSEs).

Under the same premises as in Lemma 1, we have

$$\lim_{N \to \infty} \frac{\mathbb{E}\left[(\hat{p}_1 - p_*)^2\right] - \mathbb{E}\left[(\hat{p}_2 - p_*)^2\right]}{1/N} \geq 0. \tag{15}$$

The strategy for the proof of this theorem is to regard the difference between the MSE terms of order $1/N$ as a function of the parameters $a$ and $b$ and to compute the extrema of this function. The details of the proof are given in Appendix C.

The following corollary can be easily derived based on the inequality (15) in Theorem 1.

Corollary of Theorem 1 (comparison of asymptotic MSEs).

Asymptotically, the MSE of $\hat{p}_1$ is equal to or greater than that of $\hat{p}_2$.

It should be noted that in the process of proving Theorem 1, the following proposition is derived.

Proposition 1 (the maximum difference between MSEs of $\hat{p}_1$ and $\hat{p}_2$ for a centered manufacturing process).

Consider the case where the true mean parameter corresponds to the center of the specification interval, i.e., $[a, b] = \left[\mu_* - w\sqrt{\sigma_*^2},\ \mu_* + w\sqrt{\sigma_*^2}\right]$. Under the same premises as Lemma 1 and Theorem 1, the difference in asymptotic MSEs between yield estimators $\hat{p}_1$ and $\hat{p}_2$ reaches its maximum when $[a, b] = \left[\mu_* - w_*\sqrt{\sigma_*^2},\ \mu_* + w_*\sqrt{\sigma_*^2}\right]$, where $w_*$ is the solution of

$$\sqrt{2\pi}\left(1 - 2\operatorname{erf}\left(\frac{w_*}{\sqrt{2}}\right)\right) - \left(2w_* - 2w_*^3\right) e^{-\frac{w_*^2}{2}} = 0. \tag{16}$$

That solution can be determined to be $w_* \approx 0.48$ and, in that case, the maximum value can be found to be

$$\lim_{N \to \infty} \frac{\mathbb{E}\left[(\hat{p}_1 - p_*)^2\right] - \mathbb{E}\left[(\hat{p}_2 - p_*)^2\right]}{1/N} \approx 0.17.$$
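The numerical values in Proposition 1 can be reproduced with a short root search. The sketch below assumes the centered case of the proposition, for which $p_* = \operatorname{erf}(w/\sqrt{2})$ and the leading term of (14) reduces to $(w^2/\pi)\, e^{-w^2}/N$; it solves (16) by plain bisection on a bracket chosen by inspection. The function names and the bracket $[0.3, 0.6]$ are assumptions made for this illustration.

```python
import math


def stationarity(w):
    """Left-hand side of Eq. (16)."""
    return (math.sqrt(2.0 * math.pi) * (1.0 - 2.0 * math.erf(w / math.sqrt(2.0)))
            - (2.0 * w - 2.0 * w ** 3) * math.exp(-w ** 2 / 2.0))


def mse_gap(w):
    """Leading-order N * (MSE of p_hat_1 - MSE of p_hat_2) for [a, b] = mu +/- w * sigma."""
    p = math.erf(w / math.sqrt(2.0))              # true yield for half-width w
    return p * (1.0 - p) - (w ** 2 / math.pi) * math.exp(-w ** 2)


def bisect(f, lo, hi, tol=1e-10):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_lo > 0.0) == (f_mid > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


w_star = bisect(stationarity, 0.3, 0.6)
print(f"w_* = {w_star:.3f}, maximum gap = {mse_gap(w_star):.3f}")
```

The gap vanishes both for very narrow and for very wide specification intervals, so the advantage of the Gaussian-based estimator is largest at intermediate yields.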

E. Yield estimator based on parameters of uncorrelated multivariate Gaussian distribution

Theorem 1 can be easily extended to the case of multivariate quality characterization, with quality variable vectors following an uncorrelated multivariate Gaussian distribution. Suppose that a vector of quality characteristics $y \in \mathbb{R}^D$ follows an uncorrelated multivariate Gaussian distribution,

$$y \sim \mathcal{N}\left(\mu_*, \operatorname{diag}(\sigma_{*,1}^2, \sigma_{*,2}^2, \dots, \sigma_{*,D}^2)\right). \tag{17}$$

The yield estimator based on the Bernoulli distribution parameter is the same as (3), though the specification area $S$ is now defined as the Cartesian product of the per-dimension specification intervals, i.e.,

$$S = [a_1, b_1] \times [a_2, b_2] \times \dots \times [a_D, b_D]. \tag{18}$$

On the other hand, the yield estimator based on the estimates of the mean and variance components of the quality variable vector $y$ can be expressed as

$$\hat{p}_2 = \Pr\left(y \in S \mid y \sim \mathcal{N}\left(\hat{\mu}, \operatorname{diag}(\hat{\sigma}_1^2, \dots, \hat{\sigma}_D^2)\right)\right), \tag{19}$$

where

$$\hat{\mu}_d = \frac{1}{N} \sum_{n=1}^{N} y_{n,d}, \qquad \hat{\sigma}_d^2 = \frac{1}{N-1} \sum_{n=1}^{N} (y_{n,d} - \hat{\mu}_d)^2. \tag{20}$$

The extension of Theorem 1 to this case is given below.

Theorem 2 (comparison of asymptotic MSEs for yield estimators based on samples of quality characteristics drawn from a multivariate uncorrelated Gaussian distribution).

Let the desired specification area be $S = [a_1, b_1] \times [a_2, b_2] \times \dots \times [a_D, b_D]$ and let us observe $N$ independent vectorial samples $\{y_1, \dots, y_N\}$ drawn from the multivariate uncorrelated Gaussian distribution $\mathcal{N}\left(\mu_*, \operatorname{diag}(\sigma_{*,1}^2, \sigma_{*,2}^2, \dots, \sigma_{*,D}^2)\right)$. Let $p_*$ denote the yield we are trying to estimate, i.e.,

$$p_* = \Pr\left(y \in S \mid y \sim \mathcal{N}\left(\mu_*, \operatorname{diag}(\sigma_{*,1}^2, \sigma_{*,2}^2, \dots, \sigma_{*,D}^2)\right)\right). \tag{21}$$

Then, the Bernoulli parameter-based yield estimator $\hat{p}_1$ defined by (3) and the Gaussian parameter-based estimator $\hat{p}_2$ defined by (19) satisfy the following:

$$\lim_{N \to \infty} \frac{\mathbb{E}\left[(\hat{p}_1 - p_*)^2\right] - \mathbb{E}\left[(\hat{p}_2 - p_*)^2\right]}{1/N} \geq 0.$$

The approach to the proof is to independently consider $D$ estimators, one for each dimension of the quality variable vector, and to utilize the result of Theorem 1 for each of those dimensions.
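Because the components of an uncorrelated Gaussian vector are independent, the probability in (19) factorizes into a product of one-dimensional interval probabilities of the form (8). A minimal sketch of both multivariate estimators under that assumption is given below; the function names, the two-dimensional sample values and the specification box are hypothetical.

```python
import math
import statistics


def F(a, b, mu, var):
    """Eq. (8): probability of [a, b] under N(mu, var)."""
    s = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def yield_gaussian_uncorrelated(samples, spec):
    """Eq. (19) as a product over dimensions, using the estimates of Eq. (20).

    samples: list of D-dimensional tuples; spec: list of (a_d, b_d) pairs.
    """
    prob = 1.0
    for d, (a, b) in enumerate(spec):
        column = [y[d] for y in samples]
        prob *= F(a, b, statistics.fmean(column), statistics.variance(column))
    return prob


def yield_bernoulli_box(samples, spec):
    """Eq. (3) with S given by the box of Eq. (18)."""
    inside = sum(all(a <= y[d] <= b for d, (a, b) in enumerate(spec)) for y in samples)
    return inside / len(samples)


# Hypothetical two-dimensional quality measurements and specification box.
samples = [(1.0, 5.2), (1.2, 4.9), (0.9, 5.0), (1.1, 5.6), (1.0, 4.8)]
spec = [(0.8, 1.3), (4.7, 5.3)]
print("p_hat_1:", yield_bernoulli_box(samples, spec))
print("p_hat_2:", yield_gaussian_uncorrelated(samples, spec))
```

If the quality characteristics were correlated, the product form would no longer hold and the box probability would have to be computed from the full covariance matrix instead.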

F. Discussion

The proposed methodology for estimating yield based on estimating the parameters of the underlying distribution from the available samples is applicable to any distribution form, provided that a parametric distribution can be assumed and that the probability over the specification region can be computed. Nevertheless, formally proving the benefits of exploiting the distribution form and estimating its parameters may be challenging for correlated multivariate Gaussian distributions or for non-Gaussian distributions. In particular, when the distribution is heavy-tailed, estimating its parameters is difficult, and parameter estimation accuracy strongly affects yield estimation. A future challenge is to improve and provide guarantees for yield estimation by leveraging theoretical guarantees on parameter estimation (e.g., [35]).

In this article, we formally analyze only the case in which the underlying distribution of the outgoing quality is of a multivariate uncorrelated Gaussian form. However, there are application areas, such as semiconductor manufacturing processes, where the random behavior of outgoing quality could significantly deviate from a Gaussian distribution form. In these cases, the superiority or inferiority of $\hat{p}_1$ and $\hat{p}_2$ depends on the sample size $N$, the actual underlying distribution form, and the specification interval.

Generally, $\hat{p}_2$ tends to be superior to $\hat{p}_1$ for small sample sizes $N$. This is because, when the sample size is small, the pass/fail outcomes used by the Bernoulli-based estimator provide insufficient empirical information for estimating the yield. In such cases, the assumption of Gaussianity, although not entirely accurate, helps $\hat{p}_2$ compensate for the scarcity of samples. On the other hand, when $N$ is large, $\hat{p}_1$ will, in general, be superior to $\hat{p}_2$, because $\hat{p}_1$ is a consistent estimator [36], while $\hat{p}_2$ is not consistent when the Gaussian assumption is violated. To select the preferred estimator, the sample size $N$ at which the superiority of $\hat{p}_2$ over $\hat{p}_1$ flips is an important threshold. Unfortunately, derivation of that threshold is practically impossible without knowledge of the true distribution. Nevertheless, based on Theorem 1, it is reasonable to claim that the more closely the distribution of the outgoing quality resembles a Gaussian distribution, the more pronounced the superiority of $\hat{p}_2$ over $\hat{p}_1$ remains for larger $N$. Note that if Gaussianity is strongly violated and the true distribution is multimodal, the relative performance of $\hat{p}_2$ and $\hat{p}_1$ is expected to be reversed for very small sample sizes (e.g., $N = 2$).

From a practical viewpoint, selecting a subset of important dimensions to measure, rather than reducing the sample size $N$, can reduce measurement cost while maintaining estimation accuracy. Simple heuristics based on collinearity or formal subset selection frameworks [37] can be used. Combining these selection methods with the newly proposed yield estimator may improve the trade-off between measurement cost and estimation accuracy, particularly when the selected features can be regarded as approximately uncorrelated.

2.2. Bayesian Optimization for Maximizing Yields

A classical problem in manufacturing is determining the settings of controllable parameters of the tool which maximize the outgoing yield. In this section, we propose a yield optimization method which incorporates the yield estimators described in the previous section into a Bayesian Optimization (BO) framework.

A. Problem formulation

We consider the optimization problem

$$\max_{x \in \mathcal{X}} \Pr\left(y \in S \mid y \sim \mathcal{D}(x)\right), \tag{22}$$

where $x$ is a vector of control parameters, $\mathcal{D}(x)$ denotes the control parameter-dependent distribution of the vector of quality characteristics $y$, $\mathcal{X}$ is the domain of control parameters and $S$ is the desired multivariate specification area.

For a given candidate solution $x_i$ of the vector of control parameters, let us assume that we can observe $N$ independent samples $y_1, y_2, \dots, y_N$ of the vector of quality characteristics, where $y_n \sim \mathcal{D}(x_i)$. The objective is to iteratively move towards a solution $\hat{x}$ such that $\Pr\left(y \in S \mid y \sim \mathcal{D}(\hat{x})\right)$ becomes as high as possible and to do that in as few iterations as possible (i.e., using as few experiments as possible). Algorithm 1 shown below summarizes the proposed optimization procedure. In each iteration, it encompasses two major functionalities: (i) suggestion of the next iteration's control parameter (step 3 in Algorithm 1), and (ii) estimation of the control parameter settings which maximize the outgoing yield based on observations available at that iteration (step 6 in Algorithm 1). Details of the implementation of this procedure will be discussed below.

Algorithm 1 Bayesian procedure for solving the optimization problem (22)

Input: An algorithm $A$, initial observations $H_{init}$, the number of observations per iteration $N$, and the maximum number of iterations $I_{\max}$.

Output: Estimated best control parameters for each iteration, $\hat{X} = (\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{I_{\max}})$.

1: $H \leftarrow H_{init}$
2: for $i = 1$ to $I_{\max}$ do
3: Algorithm $A(H)$ suggests the next control parameter $x_i$.
4: Observe the $N$ outputs for $x_i$, $(x_i, (y_i^{(1)}, y_i^{(2)}, \dots, y_i^{(N)}))$.
5: Update the history of observations, $H \leftarrow H \cup \{(x_i, (y_i^{(1)}, y_i^{(2)}, \dots, y_i^{(N)}))\}$.
6: Algorithm $A(H)$ suggests the best control parameter, $\hat{x}_i$.
7: $\hat{X} \leftarrow \hat{X} \cup \hat{x}_i$.
8: end for
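The control flow of Algorithm 1 can be sketched as a generic loop that is independent of how the algorithm $A$ makes its suggestions. In the sketch below, the three callables (`suggest_next`, `suggest_best`, `observe`) are hypothetical hooks introduced only for illustration. The toy stand-ins that follow use random search and the empirical yield, not Bayesian Optimization, and the process model $y \sim \mathcal{N}(x, 0.1^2)$ with specification interval $[0.9, 1.1]$ is likewise an assumption.

```python
import random


def run_yield_optimization(suggest_next, suggest_best, observe, h_init, n_obs, i_max):
    """Loop structure of Algorithm 1; the three callables are hypothetical hooks."""
    history = list(h_init)                                   # step 1
    best_per_iteration = []
    for _ in range(i_max):                                   # step 2
        x_i = suggest_next(history)                          # step 3
        ys = [observe(x_i) for _ in range(n_obs)]            # step 4
        history.append((x_i, ys))                            # step 5
        best_per_iteration.append(suggest_best(history))     # steps 6 and 7
    return best_per_iteration


# Toy stand-ins (NOT Bayesian Optimization): random search over x in [0, 2],
# quality y ~ N(x, 0.1^2), specification interval [0.9, 1.1].
rng = random.Random(0)
SPEC_LOW, SPEC_HIGH = 0.9, 1.1


def observe(x):
    return rng.gauss(x, 0.1)


def suggest_next(history):
    return rng.uniform(0.0, 2.0)


def empirical_yield(ys):
    return sum(SPEC_LOW <= y <= SPEC_HIGH for y in ys) / len(ys)


def suggest_best(history):
    # Pick the visited setting with the highest empirical yield.
    return max(history, key=lambda record: empirical_yield(record[1]))[0]


best = run_yield_optimization(suggest_next, suggest_best, observe,
                              h_init=[], n_obs=10, i_max=30)
print("estimated best control parameter after 30 iterations:", best[-1])
```

Keeping the suggestion logic behind callables makes it straightforward to swap the random-search placeholders for a model-based algorithm without changing the loop.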

B. Bayesian Optimization using GPR modeling empirical yields

In this subsection, we describe the BO algorithm for solving (22) which relies on the GPR estimator of the yield, with yields being modeled from experiments using the empirical estimator of the Bernoulli distribution parameter, as described in Section 2.1-B. Let us define $f_*(x) = \Pr\left(y \in S \mid y \sim \mathcal{D}(x)\right)$. Suppose that we have an available record of iterations of control parameter vectors $x$ and the corresponding observations of the outgoing quality vectors $y$ in the form

$$H = \left\{ \left(x_1, (y_1^{(1)}, y_1^{(2)}, \dots, y_1^{(N_1)})\right), \dots, \left(x_i, (y_i^{(1)}, y_i^{(2)}, \dots, y_i^{(N_i)})\right), \dots, \left(x_I, (y_I^{(1)}, y_I^{(2)}, \dots, y_I^{(N_I)})\right) \right\}. \tag{23}$$

For each setting of control parameters $x_i$, we can use quality observations $y_i^{(k)}$, $k = 1, 2, \dots, N_i$, to empirically estimate the yield under those control settings using the simple formulation

$$\overline{f}(x_i) = \frac{1}{N_i} \sum_{n=1}^{N_i} 1_S\left(y_i^{(n)}\right). \tag{24}$$

Then, one can model the yield as it depends on the control parameter vector $x$ by training a GPR to model $f_*(x)$ using control vector–yield estimate pairs $(x_1, \overline{f}(x_1)), (x_2, \overline{f}(x_2)), \dots, (x_I, \overline{f}(x_I))$ obtained from the available data $H$. Let $\mathrm{GPR}_f(x \mid f)$ denote the GPR function modeling the expected value and variance of the yield $f(x)$, given observations $f = [f(x_1), f(x_2), \dots, f(x_I)]^\top$ of the yield at $x_i$, $i = 1, 2, \dots, I$. Let us denote $\overline{f} = [\overline{f}(x_1), \overline{f}(x_2), \dots, \overline{f}(x_I)]^\top$. From the set of control parameters $X_I = \{x_1, x_2, \dots, x_I\}$, let us select the vector $\hat{x}$ which maximizes the posterior expectation of the yield, as modeled by $\mathrm{GPR}_f(x \mid \overline{f})$, i.e.,

$$\hat{x} = \underset{x \in X_I}{\operatorname{argmax}}\ \mathbb{E}_f\left[f \mid f \sim \mathrm{GPR}_f(x \mid \overline{f})\right], \tag{25}$$

and let $\hat{f}_*$ be the expected yield, as modeled by $\mathrm{GPR}_f(\hat{x} \mid \overline{f})$, i.e.,

$$\hat{f}_* = \mathbb{E}_f\left[f \mid f \sim \mathrm{GPR}_f(\hat{x} \mid \overline{f})\right]. \tag{26}$$

Now, as in any BO procedure, the vector of control parameters

x

in the next iteration needs to be pursued via a trade-off between exploration and exploitation enabled by that iteration [38]. Following the approach of [16,21], one can use Expected Improvement (EI) as the acquisition function and obtain the next candidate vector of controllable parameters

x_{I + 1}

as

\begin{matrix} \begin{array}{r} \begin{array}{r} x_{I + 1} & = \underset{x \in X}{argmax} EI (x) \end{array}, \end{array} \end{matrix}

(27)

where

\begin{matrix} E I (x) = E_{f} [m a x (f - {\hat{f}}_{*}, 0) | f \sim {GPR}_{f} (x | \overline{f})] \end{matrix}

(28)

is the acquisition function which, for a given vector of control parameters

x

, expresses the expected improvement in the yield relative to the current best expected yield

{\hat{f}}_{*}

.
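Because the GPR posterior at a given x is Gaussian, the expectation in (28) has a well-known closed form. The sketch below evaluates it for a posterior mean and standard deviation that are assumed to be supplied by an already trained GPR; the function name and the numerical values are illustrative assumptions, not the implementation used in this paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its pdf and cdf


def expected_improvement(mean: float, std: float, best: float) -> float:
    """E[max(f - best, 0)] for f ~ N(mean, std^2)."""
    if std <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _STD_NORMAL.cdf(z) + std * _STD_NORMAL.pdf(z)


# Hypothetical posterior at a candidate x and current best expected yield.
print(expected_improvement(mean=0.55, std=0.10, best=0.60))
```

The first term rewards candidates whose posterior mean already exceeds the current best (exploitation), while the second term rewards posterior uncertainty (exploration).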

Nevertheless, the EI acquisition function (28) does not account for the uncertainties of the observations

\overline{f} .

In order to address this problem, following [39], in this paper we employ the so-called Noisy Expected Improvement (NEI) acquisition function which incorporates the uncertainty of observations

\overline{f}

into the selection of the next candidate solution

x_{I + 1}

, which generally outperforms the EI acquisition function-based iterations described by (27). For completeness, the NEI acquisition function-based selection of the next candidate solution

x_{I + 1}

is given below.

Let us introduce the EI acquisition function

α_{EI} (x | f)

for any set of observations

f

of yield for control parameter vectors

x_{1}, x_{2}, \dots, x_{I}

, i.e.,

\begin{matrix} α_{EI} (x | f) = E_{f} [\max (f - {\hat{f}}_{*}, 0) | f \sim {GPR}_{f} (x | f)] . \end{matrix}

(29)

Obviously,

EI (x) = α_{EI} (x | \overline{f}) .

Let us note that as modeled by

{GPR}_{f} (x | \overline{f})

, the vector of yields for control vectors

x_{1}, x_{2}, \dots, x_{I}

can be modeled as a random vector

f_{GPR} = {[f_{GPR} (x_{1}), f_{GPR} (x_{2}), \dots, f_{GPR} (x_{I})]}^{⊤}

which follows a multivariate normal distribution:

\begin{matrix} [\begin{matrix} f_{GPR} (x_{1}) \\ ⋮ \\ f_{GPR} (x_{I}) \end{matrix}] & \sim N (m^{f_{GPR}}, S^{f_{GPR}}) \\ m_{i}^{f_{GPR}} & = E [{GPR}_{f} (x_{i} | \overline{f})], i = 1,2, \dots, I \\ S_{i j}^{f_{GPR}} & = Cov [{GPR}_{f} (x_{i} | \overline{f}), {GPR}_{f} (x_{j} | \overline{f})], i, j = 1,2, \dots, I \end{matrix}

(30)

Then, the NEI acquisition function can be defined as

\begin{matrix} NEI (x) = E_{f_{GPR}} [α_{EI} (x | f_{GPR})] . \end{matrix}

(31)

Following [39], Monte Carlo sampling needs to be utilized in order to evaluate the expectation over

f_{GPR}

in the

NEI (x)

acquisition function described by (31). Finally, selection of the next candidate vector of controllable parameters is accomplished by

\begin{matrix} \begin{array}{r} x_{I + 1} & = \underset{x \in X}{argmax} NEI (x) . \end{array} \end{matrix}

(32)
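One simple way to carry out the Monte Carlo evaluation of (31) is a plain sample average. The expression below is a sketch under the assumption of K independent draws from the distribution in (30); the symbols K and k are introduced here for illustration only, and [39] should be consulted for the sampling scheme actually recommended there.

```latex
\begin{aligned}
f_{GPR}^{(k)} &\sim N (m^{f_{GPR}}, S^{f_{GPR}}), \quad k = 1, 2, \dots, K, \\
NEI (x) &\approx \frac{1}{K} \sum_{k = 1}^{K} α_{EI} (x | f_{GPR}^{(k)}) .
\end{aligned}
```

Each term in the sum is an ordinary EI evaluation of the form (29) for a GPR conditioned on one sampled vector of yields, so with independent draws the standard error of the estimate decreases at the usual rate proportional to 1 / \sqrt{K}.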

C. Bayesian Optimization using GPR modeling parameters of the distribution of quality characteristics

Let us consider the situation where for a given setting of control parameters

x

, the outgoing quality characteristics

y

follow a normal distribution

N (μ_{*} (x), σ_{*}^{2} (x))

. (For simplicity, in this section, we consider the case in which the quality output

y

is a scalar. Following the discussion in Section 2.1-E, one can straightforwardly extend the approach described in this section to the multivariate case in which the vector of quality characteristics

y

follows an uncorrelated multivariate Gaussian distribution.) In that case, a control parameter vector

x

determines the mean and variance of the distribution

D (x)

, and the yield for that control parameter setting can be written as

\begin{matrix} \begin{array}{r} Pr (y \in S | y \sim N (μ_{*} (x), σ_{*}^{2} (x))) . \end{array} \end{matrix}

(33)
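When S is an interval, the probability in (33) reduces to a difference of two Gaussian cumulative distribution function values. The following sketch illustrates this under that interval assumption; the function name `gaussian_yield` and the numbers are illustrative only.

```python
from statistics import NormalDist


def gaussian_yield(mean: float, variance: float, lower: float, upper: float) -> float:
    """Pr(lower <= y <= upper) for y ~ N(mean, variance), as in (33) with S = [lower, upper]."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    return dist.cdf(upper) - dist.cdf(lower)


# Hypothetical distribution parameters at one control setting.
print(gaussian_yield(mean=0.0, variance=1.0, lower=-1.0, upper=1.0))
```

Unlike the indicator-based estimator in (24), this expression varies smoothly with the mean and the variance, which is what allows nearby control settings to share information through the GPR models introduced next.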

Based on the results from Section 2.1, which show that the yield estimator relying on the estimation of Gaussian distribution parameters outperforms that relying on the empirical estimation of the Bernoulli distribution parameter, let us apply the GPR framework to model the mean and variance parameters of the distribution of the outgoing product quality with the available data

H

, i.e.,

\begin{matrix} \begin{matrix} \overline{μ} (x_{i}) & = \frac{1}{N_{i}} \sum_{n = 1}^{N_{i}} y_{i}^{(n)} \\ {\overline{σ}}^{2} (x_{i}) & = \frac{1}{N_{i} - 1} \sum_{n = 1}^{N_{i}} {(y_{i}^{(n)} - \overline{μ} (x_{i}))}^{2} . \end{matrix} \end{matrix}

(34)

Following [40], variances

{\overline{σ}}^{2} (x_{i})

are transformed using the natural logarithm in order to ensure that the range of the GPR model estimating those transformed variances comprises the entire set of real numbers. Hence, the GPRs modeling the mean and variance parameters will be trained on the pairs of control vectors and estimated distribution parameters as below,

\begin{matrix} (x_{1}, (\overline{μ} (x_{1}), \log [{\overline{σ}}^{2} (x_{1})])), (x_{2}, (\overline{μ} (x_{2}), \log [{\overline{σ}}^{2} (x_{2})])), \dots, \\ (x_{I}, (\overline{μ} (x_{I}), \log [{\overline{σ}}^{2} (x_{I})])) . \end{matrix}

(35)

The variance of the estimator

\overline{μ} (x_{i})

can be expressed and approximated as

\begin{matrix} VAR [\overline{μ} (x_{i})] & = VAR [\frac{1}{N_{i}} \sum_{n = 1}^{N_{i}} y_{i}^{(n)}] \\ & = \frac{1}{N_{i}} σ_{*}^{2} (x_{i}) ≒ \frac{1}{N_{i}} {\overline{σ}}^{2} (x_{i}), \end{matrix}

(36)

while the variance of the estimator

\log [{\overline{σ}}^{2} (x_{i})]

can be approximated using the delta method as follows [34]:

\begin{matrix} \begin{array}{r} VAR [\log [{\overline{σ}}^{2} (x_{i})]] ≒ \frac{VAR [{\overline{σ}}^{2} (x_{i})]}{{(E [{\overline{σ}}^{2} (x_{i})])}^{2}} = \frac{2}{N_{i} - 1} . \end{array} \end{matrix}

(37)
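The delta-method approximation in (37) can be checked numerically. The sketch below is an illustrative simulation, not part of the original study: it repeatedly draws N Gaussian samples, computes the logarithm of the unbiased sample variance as in (34), and compares the simulated variance of that quantity with 2 / (N - 1). The sample size, the number of replications and the seed are arbitrary choices.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance, matching the second line of (34)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def simulated_var_log_s2(n, replications, seed=0):
    """Simulated variance of log(sample variance) for N(0, 1) data."""
    rng = random.Random(seed)
    logs = []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        logs.append(math.log(sample_variance(sample)))
    return sample_variance(logs)


n = 50
simulated = simulated_var_log_s2(n, replications=4000)
approx = 2.0 / (n - 1)  # right-hand side of (37)
print(simulated, approx)
```

Since the delta method is a first-order expansion, the agreement is expected to be good for moderately large N and coarser for very small N.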

Let

{GPR}_{μ} (x | μ)

denote the GPR function modeling the mean parameter

μ (x)

, given the observations of the estimated parameters

μ = [μ (x_{1}), μ (x_{2}), \dots, μ (x_{I})]

. Similarly, let

{GPR}_{\log [σ^{2}]} (x | θ)

be the GPR function modeling the logarithm of the true variance parameter

θ (x) = \log [σ^{2} (x)]

, given the observations of the estimated parameters

θ = {[\log [σ^{2} (x_{1})], \log [σ^{2} (x_{2})], \dots, \log [σ^{2} (x_{I})]]}^{⊤}

.

Then, let us denote with

F (x | μ, θ)

the distribution of the estimated yield, which can be expressed as

\begin{matrix} F (x | μ, θ) & = Pr (Y \in S | Y \sim N (\tilde{μ}, {\tilde{σ}}^{2})) \\ \tilde{μ} & \sim {GPR}_{μ} (x | μ) \\ \log [{\tilde{σ}}^{2}] & \sim {GPR}_{\log [σ^{2}]} (x | θ), \end{matrix}

(38)

and let

\overline{μ} = {[\overline{μ} (x_{1}), \overline{μ} (x_{2}), \dots, \overline{μ} (x_{I})]}^{⊤}

and

\overline{θ} = {[\log [{\overline{σ}}^{2} (x_{1})], \log [{\overline{σ}}^{2} (x_{2})], \dots, \log [{\overline{σ}}^{2} (x_{I})]]}^{⊤}

. Now, from the set of control parameters

X_{I} = \{x_{1}, x_{2}, \dots, x_{I}\}

, let us select vector

\hat{x}

which maximizes the posterior expectation of the yield modeled by

F (x | \overline{μ}, \overline{θ})

, i.e.,

\begin{matrix} \begin{array}{r} \hat{x} = \underset{x \in X_{I}}{argmax} E_{f} [f | f \sim F (x | \overline{μ}, \overline{θ})], \end{array} \end{matrix}

(39)

and let

{\hat{f}}_{* *}

be the expected value as per model

F (\hat{x} | \overline{μ}, \overline{θ}),

i.e.,

\begin{matrix} \begin{array}{r} {\hat{f}}_{* *} = E_{f} [f | f \sim F (\hat{x} | \overline{μ}, \overline{θ})] . \end{array} \end{matrix}

(40)

The next iteration

x_{I + 1}

of the candidate solution for the vector of control parameters is determined as

\begin{matrix} \begin{array}{r} x_{I + 1} & = \underset{x \in X}{argmax} {EI}_{F} (x), \end{array} \end{matrix}

(41)

where

\begin{matrix} \begin{array}{r} {EI}_{F} (x) & = α_{{EI}_{F}} (x | \overline{μ}, \overline{θ}) \end{array} \end{matrix}

(42)

is the acquisition function which, for a given vector of control parameters

x,

expresses the expected improvement in the yield relative to the inferred best expected yield

{\hat{f}}_{* *}

.

A major challenge associated with the calculation of (39) and (41) is that

F (x | \overline{μ}, \overline{θ})

is not an analytically tractable Gaussian distribution. Therefore, following [40], Monte Carlo sampling from the distribution

F (x | \overline{μ}, \overline{θ})

is proposed for evaluating the expected value of the yield in (39) and the acquisition function

{EI}_{F} (x)

for any vector

x

of the control parameters.
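The Monte Carlo evaluation described above can be sketched as follows for a single candidate x. The sketch assumes that the two GPRs provide Gaussian posterior means and standard deviations for the mean parameter and for the log-variance at x, that draws from the two posteriors are independent, and that S is an interval. All names and numerical values are hypothetical and are not taken from the Supplementary Codebase.

```python
import math
import random
from statistics import NormalDist


def sample_yields(m_mu, s_mu, m_theta, s_theta, lower, upper, draws, seed=0):
    """Draw yields from F(x | mu, theta) as defined in (38) for one candidate x."""
    rng = random.Random(seed)
    yields = []
    for _ in range(draws):
        mu = rng.gauss(m_mu, s_mu)                            # draw of the mean
        sigma = math.exp(0.5 * rng.gauss(m_theta, s_theta))   # exp(log-variance / 2)
        dist = NormalDist(mu, sigma)
        yields.append(dist.cdf(upper) - dist.cdf(lower))      # yield given the draw
    return yields


def mc_expected_yield(yields):
    """Monte Carlo estimate of the expected yield used in (39) and (40)."""
    return sum(yields) / len(yields)


def mc_expected_improvement(yields, best):
    """Monte Carlo estimate of E[max(f - best, 0)] used in (41)."""
    return sum(max(f - best, 0.0) for f in yields) / len(yields)


# Hypothetical posterior moments at one candidate x.
draws = sample_yields(0.3, 0.2, math.log(0.5), 0.3, -2.0, 2.0, draws=2000)
print(mc_expected_yield(draws), mc_expected_improvement(draws, best=0.95))
```

Working with the log-variance keeps every sampled variance strictly positive, which is the purpose of the logarithmic transformation adopted from [40].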

Let us now introduce

α_{{EI}_{F}} (x | μ, θ)

for any set of observed estimated parameters

μ

and

θ

for control parameter vectors

x_{1}, x_{2}, \dots, x_{I}

, i.e.,

\begin{matrix} \begin{array}{r} α_{{EI}_{F}} (x | μ, θ) & = E_{f} [\max (f - {\hat{f}}_{* *}, 0) | f \sim F (x | μ, θ)] \end{array} . \end{matrix}

(43)

Obviously,

{EI}_{F} (x) = α_{{EI}_{F}} (x | \overline{μ}, \overline{θ})

. It can be noted that the mean parameters for control vectors

x_{1}, x_{2}, \dots, x_{I}

can be modeled as a random vector

{[μ_{GPR} (x_{1}), μ_{GPR} (x_{2}), \dots, μ_{GPR} (x_{I})]}^{⊤}

as below:

\begin{matrix} [\begin{matrix} μ_{GPR} (x_{1}) \\ ⋮ \\ μ_{GPR} (x_{I}) \end{matrix}] & \sim N (m^{μ_{GPR}}, S^{μ_{GPR}}) \\ m_{i}^{μ_{GPR}} & = E [{GPR}_{μ} (x_{i} | \overline{μ})], i = 1,2, \dots, I \\ S_{i j}^{μ_{GPR}} & = Cov [{GPR}_{μ} (x_{i} | \overline{μ}), {GPR}_{μ} (x_{j} | \overline{μ})], i, j = 1,2, \dots, I \end{matrix}

(44)

Similarly, the logarithms of the variance parameters for control vectors

x_{1}, x_{2}, \dots, x_{I}

can be modeled as a random vector

{[θ_{GPR} (x_{1}), θ_{GPR} (x_{2}), \dots, θ_{GPR} (x_{I})]}^{⊤}

following the distribution

\begin{matrix} [\begin{matrix} θ_{GPR} (x_{1}) \\ ⋮ \\ θ_{GPR} (x_{I}) \end{matrix}] & \sim N (m^{θ_{GPR}}, S^{θ_{GPR}}) \\ m_{i}^{θ_{GPR}} & = E [{GPR}_{\log [σ^{2}]} (x_{i} | \overline{θ})], i = 1,2, \dots, I \\ S_{i j}^{θ_{GPR}} & = Cov [{GPR}_{\log [σ^{2}]} (x_{i} | \overline{θ}), {GPR}_{\log [σ^{2}]} (x_{j} | \overline{θ})], i, j = 1,2, \dots, I . \end{matrix}

(45)

If we denote

μ_{GPR} = {[μ_{GPR} (x_{1}), μ_{GPR} (x_{2}), \dots, μ_{GPR} (x_{I})]}^{⊤}

and

θ_{GPR} = {[θ_{GPR} (x_{1}), θ_{GPR} (x_{2}), \dots, θ_{GPR} (x_{I})]}^{⊤}

, then the NEI acquisition function is defined as

\begin{matrix} \begin{array}{r} {NEI}_{F} (x) & = E_{μ_{GPR}, θ_{GPR}} [α_{{EI}_{F}} (x | μ_{GPR}, θ_{GPR})] \end{array} . \end{matrix}

(46)

Finally, the selection of

x_{I + 1}

is accomplished as

\begin{matrix} \begin{array}{r} x_{I + 1} & = \underset{x \in X}{argmax} {NEI}_{F} (x) . \end{array} \end{matrix}

(47)

In order to calculate

{NEI}_{F} (x)

, Monte Carlo sampling is required for evaluating the expectation of

α_{{EI}_{F}} (x | μ_{GPR}, θ_{GPR})

over

μ_{GPR}

and

θ_{GPR}

, which makes NEI more expensive to compute than EI. Nevertheless, the Monte Carlo samples are mutually independent, so the computation parallelizes naturally and the additional cost can be offset by distributing the samples across multiple cores.

D. Discussion

In this section, we discuss the suitability of the BO algorithms described in Section 2.2-B,C. For problems where control parameters determine means and variances of Gaussian distributions, we deduce from Theorem 1 that applying GPRs to the parameters of Gaussian distributions is better than applying the GPR concept to the empirical, Bernoulli distribution-based estimators of yield. It is difficult to discuss the suitability and performance of the proposed approaches for the cases where the distributions of the outgoing quality are not Gaussian. Nevertheless, drawing an analogy with the discussion in Section 2.1-F, when the sample size

N

and the number of iterations are large, applying GPR to empirical probabilities is preferable due to the consistency of the corresponding estimator. On the other hand, when the sample size

N

and the number of iterations are small, applying GPRs to the parameters of the Gaussian distributions may be preferable, as the assumption of the Gaussian form of the outgoing quality characteristics compensates for the scarcity of samples.

Furthermore, when the specification area is narrow, applying GPRs to parameters of Gaussian distributions is expected to outperform applying the GPR concept to Bernoulli distribution-based estimators of yield, especially when the number of iterations is small. Essentially, if quality characteristics corresponding to a certain set of control parameters fall near the boundaries of the specification areas, these control parameters should be regarded as somewhat promising, even if the empirical probability that the quality characteristics satisfy the specification is zero. Applying GPRs to distribution parameters incorporates this intuitive insight. On the other hand, when GPR is applied to empirical probabilities, the algorithm cannot exploit the proximity of the quality characteristics to the specification area.

Finally, we comment on computational cost. Applying GPRs to the parameters of Gaussian distributions requires additional computation because Monte Carlo sampling is needed (see Section 2.2-C). The cost of these Monte Carlo calculations grows roughly with the number of samples and with the dimensionality of the characteristic space; so, high-dimensional problems or large sample sizes can become computationally demanding. When a single workstation is insufficient, the embarrassingly parallel nature of Monte Carlo sampling allows for straightforward distribution of independent samples (or independent control parameter evaluations) across multiple workstations.

3. Results

3.1. Experiments with Various Methods for Estimating Yields

In this section, we evaluate the performance of the yield estimators discussed in Section 2.1. In Section 3.1-A, we analyze the performance of yield estimators when the distribution of the outgoing quality is exactly Gaussian, while in Section 3.1-B, we consider the distribution of the outgoing quality not exactly following the Gaussian distribution form.

A. Estimation of yield when quality characteristic follows Gaussian distribution

Let us assume that a quality characteristic follows the normal distribution

N (0, 1)

and that the desired interval for that characteristic is

[- w_{*}, + w_{*}]

, where

w_{*}

is the solution of

\begin{matrix} \sqrt{2 π} (1 - 2 \operatorname{erf} (\frac{w_{*}}{\sqrt{2}})) - (2 w_{*} - 2 w_{*}^{3}) e^{\frac{- w_{*}^{2}}{2}} = 0, \end{matrix}

(48)

as described in Proposition 1. This situation is illustrated in Figure 1. MSEs for the classical yield estimator based on Bernoulli distribution

{\hat{p}}_{1}

and the newly proposed yield estimator

{\hat{p}}_{2}

for different sample sizes

N

are shown in Figure 2 and Table 1. We see that the theoretical expressions for relevant MSEs, namely (5) and (11), coincide with the empirical results.

The advantage of the newly proposed yield estimator is statistically supported by one-sided hypothesis testing using the bootstrap method, as shown in Table 2, for both

N = 2

and

N = 128

. In both cases, sufficiently small p-values are obtained. In addition, the portion of MSE related to the

1 / N

factor, as characterized by (14), predicts the MSE of

{\hat{p}}_{2}

when

N

is large, and the MSE of

{\hat{p}}_{2}

is smaller than that of

{\hat{p}}_{1}

, as stipulated by Theorem 1.
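A comparison of this kind can be reproduced in spirit with a short simulation. The sketch below makes several assumptions that should be checked against Section 2.1: the classical estimator is taken to be the in-specification fraction, the proposed estimator is taken to be the Gaussian plug-in estimator built from the sample mean and the unbiased sample variance, and the half-width w = 1 is an arbitrary illustrative value rather than the solution of (48).

```python
import random
from statistics import NormalDist


def mse_of_estimators(n, w, replications, seed=0):
    """Empirical MSEs of the two yield estimators for y ~ N(0, 1), S = [-w, w]."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    true_yield = std_normal.cdf(w) - std_normal.cdf(-w)
    se1 = se2 = 0.0
    for _ in range(replications):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Classical estimator: fraction of samples inside the specification.
        p1 = sum(1 for v in y if -w <= v <= w) / n
        # Plug-in estimator: fit a Gaussian, then integrate it over S.
        mean = sum(y) / n
        var = sum((v - mean) ** 2 for v in y) / (n - 1)
        fitted = NormalDist(mean, var ** 0.5)
        p2 = fitted.cdf(w) - fitted.cdf(-w)
        se1 += (p1 - true_yield) ** 2
        se2 += (p2 - true_yield) ** 2
    return se1 / replications, se2 / replications


mse1, mse2 = mse_of_estimators(n=16, w=1.0, replications=4000)
print(mse1, mse2)
```

For the classical estimator, the simulated MSE can be compared with the Bernoulli variance p (1 - p) / N, which provides a quick sanity check of the simulation itself.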

B. Estimation of Yield Expressed as Probabilities of Electron Densities and Temperatures Falling Within Specifications

In this subsection, we evaluate the performance of the classical yield estimator

{\hat{p}}_{1}

and the newly proposed yield estimator

{\hat{p}}_{2}

when the distribution of the outgoing process quality is not exactly Gaussian. This study will be performed through examination of the problem of ensuring plasma stability in an Inductively Coupled Plasma (ICP) chamber, which is a system widely utilized in semiconductor manufacturing.

ICP chambers use coils to generate a strong electric field within a vacuum chamber and thus generate plasmas. In the simulations considered in this paper, Ar and

O_{2}

gases are ionized to create plasmas, where Ar ions mainly provide mechanical energy for sputtering, while

O_{2}

ions enhance chemical reactions. Maintaining key plasma parameters within desired specification limits is crucial for ascertaining the desired quality of outgoing wafers, as those parameters have significant effects on the surface modifications of wafers. Specifically, following [41], this article focuses on the electron density and temperature among plasma parameters.

(1) Configuration of simulations:

The COMSOL software package (version 6.2) is used to perform the simulations, with the model details available in [42]. A schematic sketch of the chamber considered in this paper is shown in Figure 3. It is assumed that the powers of the four coils can be set to distinct values, thus generating an electric field within the chamber. From the gas inlet, Ar and

O_{2}

gases flow into the chamber and leave the chamber via the outlet. Random variations in the Ar/

O_{2}

concentration ratios occur due to the effects of residual gases inside the chamber, while random fluctuations in pressure arise from uncertainties in the control actions of pressure controllers. Given these inputs, COMSOL simulations provide the stationary spatial fields of electron densities and temperatures across the chamber, based on which measurements of electron densities and temperatures are assumed to be obtained via a plasma absorption probe installed at a specific location in the chamber. A plasma absorption probe is a device capable of measuring local absolute electron density [43] and can be used to estimate electron temperature [44]. The radius of the dielectric tube used in such probes is typically a few millimeters [43,45]. Therefore, for this simulation, probe measurements were modeled as the spatial average of the plasma parameters over an area of 1 cm × 1 cm, with the aim of keeping these measurements within some a priori specified intervals. The objective of the numerical experiments in this section is to accurately estimate the probability that the measurements of electron densities and temperatures will fall within the desired specification intervals.

(2) Probability of electron density falling within specification limits:

In this subsection, we compare the performance of the classical yield estimator

{\hat{p}}_{1}

and the newly proposed yield estimator

{\hat{p}}_{2}

in estimating the probability that the sensed electron density falls within the desired specification interval. Stochastic variations in chamber conditions are simulated by randomly disturbing inputs of the simulator, and an empirical distribution of electron densities is acquired by running simulation replications with these disturbed inputs. The settings used in the simulations are summarized in Table 3. In order to ensure that the parameters stay within the valid range, we clip the values of xO2 to fall within

[0, 1]

and ensure that the values of p0 are positive after sampling from the Gaussian distribution. A total of 1000 simulation replications were run and, from the valid simulation runs, the spatial average of electron density was calculated for the area assumed to be occupied by the probe.

Figure 4 shows the histogram of electron densities obtained using the above-described simulations, with the desired specification interval superimposed over the true distribution. While the distribution illustrated in Figure 4 appears to be similar to a Gaussian distribution in terms of its mono-modality, the Shapiro–Wilk test [46] implies that it does not follow a Gaussian form (

p = 2.64 \times 10^{- 12}

). This is consistent with the fact that, in semiconductor manufacturing, nonlinear effects and substantial noise often lead to non-Gaussian distributions of quality characteristics. Nevertheless, we will still attempt to estimate the probability that electron density falls within the desired specification interval by estimating Gaussian distribution parameters from

N

independent, identically distributed samples, as in (7).

In Figure 5 and Table 4, the MSEs of yield estimators

{\hat{p}}_{1}

and

{\hat{p}}_{2}

for the above-mentioned probability are shown. As discussed in Section 2.1-F, the MSE of the proposed estimator

{\hat{p}}_{2}

is smaller than that of the classical estimator

{\hat{p}}_{1}

until a certain threshold value of the sample size

N

is reached, beyond which the superiority flips and the MSE of

{\hat{p}}_{1}

becomes smaller. This finding is supported by one-sided hypothesis tests using the bootstrap method for

N = 2

and

N = 128,

as in Table 5. For

N = 2

, the test indicates that the empirical squared errors of

{\hat{p}}_{2}

are significantly smaller than those of

{\hat{p}}_{1}

. Conversely, for

N = 128

, the test indicates that the empirical squared errors of

{\hat{p}}_{1}

are significantly smaller than those of

{\hat{p}}_{2}

.
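The exact bootstrap procedure behind these tests is not spelled out in this section. As one simple percentile-style variant, the sketch below resamples the paired differences of squared errors and reports the fraction of resamples in which the mean difference is non-negative, which serves as an approximate one-sided p-value for the alternative that the first estimator has the smaller mean squared error. The function name, the number of resamples and the toy data are assumptions made for illustration.

```python
import random


def bootstrap_one_sided_p_value(errors_a, errors_b, resamples=2000, seed=0):
    """Approximate p-value for H1: mean(errors_a) < mean(errors_b), paired data."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    not_smaller = 0
    for _ in range(resamples):
        # Resample the paired differences with replacement.
        resample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if resample_mean >= 0.0:
            not_smaller += 1
    return not_smaller / resamples


# Toy squared errors in which the first estimator is uniformly better.
errors_p2 = [0.01, 0.02, 0.015, 0.03, 0.005, 0.02]
errors_p1 = [0.05, 0.04, 0.060, 0.07, 0.030, 0.05]
print(bootstrap_one_sided_p_value(errors_p2, errors_p1))
```

Swapping the two arguments tests the opposite alternative, which corresponds to the reversed conclusion reported for the larger sample size.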

(3) Probability of electron densities and temperatures falling within specifications:

This subsection evaluates the performance of the classical estimator

{\hat{p}}_{1}

and the newly proposed estimator

{\hat{p}}_{2}

in the simulations when the outgoing quality characteristics are multivariate.

Suppose that the desired specification intervals for electron density and temperature are given. As mentioned before, both electron density and temperature are measured as averages over the areas covered by the probes schematically shown in Figure 3. As in the previous subsection, we simulate the variability in chamber conditions by randomly disturbing inputs of the simulator, which yields the true distribution of output variables. Figure 6 depicts the true joint distribution of electron densities and temperatures, along with the desired specification area. Simple visual inspection indicates that, though mono-modal, the true distribution does not follow a Gaussian form and is also correlated, which does not match the assumptions from Theorem 2 in Section 2.1-E.

A comparison of yield estimators

{\hat{p}}_{1}

and

{\hat{p}}_{2}

is shown in Figure 7 and Table 6. Similar to what we observed in the scalar case, the MSE of the proposed yield estimator

{\hat{p}}_{2}

is smaller than that of the classical yield estimator

{\hat{p}}_{1}

until a certain threshold sample size

N

is reached, after which the performance of

{\hat{p}}_{1}

and

{\hat{p}}_{2}

flips, as discussed in Section 2.1-F. As in the scalar case, the hypothesis test using the bootstrap method indicates that the proposed estimator

{\hat{p}}_{2}

is significantly preferable for

N = 2

, whereas the classical yield estimator

{\hat{p}}_{1}

is significantly preferable for

N = 128

, as shown in Table 7.

3.2. Numerical Experiments for Maximizing Yield

In this section, we compare the performance of BO algorithms for maximizing the yield of the outgoing product quality. Specifically, we focus on the difference between applying the GPR paradigm to directly model the empirical yields, as described in Section 2.2-B, versus applying GPRs to model parameters of the Gaussian distributions of the outgoing quality, as described in Section 2.2-C.

In Section 3.2-A, we compare the performance of the yield estimation and optimization algorithms for a synthetic problem in which the outgoing quality follows a Gaussian distribution. In Section 3.2-B, we consider yield estimation and optimization in ICP chamber simulations. Finally, in Section 3.2-C, we use ICP chamber simulations to perform a set of sensitivity studies regarding the preferability of different estimation and optimization algorithms for that specific problem.

A. Optimization of control parameters which determine parameters of Gaussian distribution

In this subsection, we consider a synthetic problem of determining the control input for a process which will result in a maximal outgoing yield under the assumption that the outgoing quality follows a Gaussian distribution. More precisely, the problem is defined as follows. The domain of the control input parameter is

X = [- 3, + 3]

and is discretized into

1000

points. The distribution of the outgoing quality

y

is assumed to be distributed as

N (μ (x), σ^{2} (x))

, where

\begin{matrix} \begin{array}{r} μ (x) = 2 x^{2} \end{array} \end{matrix}

(49)

and

\begin{matrix} \begin{array}{r} σ^{2} (x) & = \{\begin{array}{ll} - 2 x^{2} + 10 - 4 x & (0.1 \leq - 2 x^{2} + 10 - 4 x) \\ 0.1 & (- 2 x^{2} + 10 - 4 x < 0.1) \end{array} . \end{array} \end{matrix}

(50)

The desired specification interval for the outgoing quality variable

y

is assumed to be

S = [- 2, + 2]

. Figure 8 illustrates the dependency of the Gaussian distribution parameters and the corresponding yield on the control parameter

x

. It can be noted that though the mean

μ (x)

of the outgoing quality takes the value of 0 when

x = 0

, which corresponds to the center of the specification interval

S

, the variance

σ^{2} (x)

of the outgoing quality is relatively large when

x = 0

. Therefore, the maximum yield is realized when

x ≒ 0.72

, for which, even though the expected value of the outgoing quality is not centered in the middle of the specification interval, the variance

σ^{2} (x)

is relatively small. This synthetic problem is an example that demonstrates that it is inadequate to only focus on the mean parameter for optimization of the yield.
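The location of the optimum can be checked directly from (49) and (50). The sketch below assumes that the variance in (50) is the quadratic expression floored at 0.1 (an interpretation chosen because it agrees with the behavior described above) and that the 1000 grid points are evenly spaced on X; both are assumptions rather than details confirmed by the paper.

```python
from statistics import NormalDist


def mu(x):
    """Mean of the outgoing quality, as in (49)."""
    return 2.0 * x ** 2


def sigma2(x):
    """Variance of the outgoing quality: quadratic in x, floored at 0.1 (assumed)."""
    return max(-2.0 * x ** 2 + 10.0 - 4.0 * x, 0.1)


def true_yield(x, lower=-2.0, upper=2.0):
    """Probability that y ~ N(mu(x), sigma2(x)) falls inside S = [lower, upper]."""
    dist = NormalDist(mu(x), sigma2(x) ** 0.5)
    return dist.cdf(upper) - dist.cdf(lower)


# Evenly spaced grid of 1000 points on X = [-3, 3] (spacing is an assumption).
grid = [-3.0 + 6.0 * k / 999 for k in range(1000)]
x_best = max(grid, key=true_yield)
print(x_best, true_yield(x_best), true_yield(0.0))
```

Under these assumptions, the yield at the centered setting x = 0 is lower than at the maximizer near x ≒ 0.72, which is the point made in the paragraph above.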

Two versions of the BO algorithm are used for solving the problem described above. In one version, labeled in the remainder of the paper as BO PROB, GPR is employed to model the empirical yields obtained via estimation of the relevant Bernoulli parameter, as described in Section 2.2-B. In the other BO version, labeled as BO DIST in the remainder of this paper, GPRs are employed to model the mean and variance parameters of the outgoing product quality, as described in Section 2.2-C. Both BO PROB and BO DIST use the NEI acquisition functions, as discussed in Section 2.2. In our experiments, we use RBF (squared-exponential) kernels for BO PROB and BO DIST. As new observations are collected, the kernel length-scale parameters are optimized by maximizing the marginal log-likelihood. We impose a log-normal prior on each length-scale with

\ln ℓ \sim N (2 + 0.5 d, σ^{2} = 3)

, where d denotes the number of control parameters. Detailed configurations are available in the Supplementary Codebase.

A set of four initial guesses for the control parameter x were randomly generated following a Sobol Sequence [47], and

N = 8

samples of the outgoing quality

y

were generated for each of those control parameters. Starting from that, BO PROB and BO DIST iterative procedures were run following Algorithm 1. In order to evaluate the stochastic behavior of the resulting BO methods, we performed 16 trials of the optimization procedures, with the progression of resulting quantiles of the yield suggested by the algorithms depicted in Figure 9 and Table 8.

As a general trend, the yields obtained using the suggested control parameters increase as the iterations progress. However, this trend is not monotone due to the random nature of the problem and the BO procedure, as well as the limited number of trials. Inspection of the results corresponding to the 75th percentile shows little difference between BO DIST and BO PROB because both methods realize a yield close to the highest possible value. However, the progressions of the 25th percentiles show the superiority of BO DIST in the sense that it realizes a relatively high yield very early in the iteration process. As per the discussion in Section 2.2-D, the superiority of the BO DIST method is expected because the outgoing quality y in this synthetic problem actually follows a Gaussian distribution.

We present additional experiments in Appendix D that examine the effects of kernel selection and the differences between the EI and NEI acquisition functions.

B. Optimization of control parameters to maximize probability that electron densities and temperatures are within specifications

In this subsection, we study an optimization problem in which the objective is to maximize the yield defined as the probability that electron densities and temperatures in an ICP chamber fall within the desired specification limits. Similarly to what was performed in the previous subsection, we compare the performance of the BO PROB and BO DIST algorithms.

For this purpose, we use the simulation environment described in Section 3.1-B, with the basic simulation settings depicted in Figure 3 and discussed in Section 3.1-B-1. In order to derive the empirical distributions of electron densities and temperatures for each setting of control parameters, one needs to conduct a prohibitively large number of simulations. To address this challenge, we used the simulation environment to construct a Machine Learning (ML)-based surrogate model capable of predicting measurements of electron densities and temperatures based on a limited number of simulation results. Specifically, a Random Forest regression with z-score normalization [48] was employed because of its well-documented ability to achieve high accuracy even with a limited number of training samples [49]. This model was trained using the results of 2000 COMSOL simulation runs conducted for input parameters sampled using the Sobol Sequence method from the domain described in Table 9, thus pursuing an informative training set via a limited number of simulations.

Table 10 summarizes the input settings for the problem. The parameters labeled as Control are iteratively selected through the optimization algorithms, while the parameters labeled as Randomly Varied correspond to the uncontrollable inputs whose variations lead to variability in chamber conditions and, consequently, randomness in the measured electron densities and temperatures. As for the parameters labeled as Fixed, their values are not varied as inputs into the ML surrogate model. The domain of each control parameter, i.e., the domains of values for each of the two controllable power coils, is discretized into 20 points. The goal of the BO procedures is to determine control parameters that would maximize the probability of electron densities and temperatures measured in the probe areas illustrated in Figure 3 being within desired specification limits. Specification limits for the electron densities and temperatures are summarized in Table 11.

To clarify the relationship between surrogate-based evaluations and real experiments, we summarize their correspondences in Table 12. While the surrogate model and simulations are designed to mimic the measurements obtained from the physical system and therefore produce data that can be treated with the same evaluation metrics, real experiments require repeated measurements at fixed settings to directly observe variability arising from environmental factors and measurement noise. A typical real experiment loop consists of performing repeated measurements at a chosen setting, training a Gaussian Process (GP) on the acquired data, computing the acquisition function, and selecting the next setting sequentially. Because real experiments incur substantial time, material, and operational costs, a pragmatic workflow is to first screen and evaluate promising methods using surrogate models and then validate selected candidates on the actual equipment.

Figure 10 illustrates the relationship between control parameters and the probabilities that electron densities and temperatures fall within the specification intervals, as determined by the surrogate ML model with empirical samples from the domain described in Table 10.

A set of four initial guesses for the pair of coil powers were randomly selected following a Sobol Sequence, and

N = 8

samples of electron densities and temperatures were produced by the surrogate ML model for each of those pairs. Starting from that, BO PROB and BO DIST iterative procedures were run following Algorithm 1 in order to optimize the pair of controllable coil powers. In order to evaluate the stochastic behavior of the resulting BO methods, we ran 16 trials of the optimization procedures, with the progression of the 25th, 50th and 75th percentiles of the yields obtained by the BO algorithms depicted in Figure 11 and Table 13. One trial took approximately 10 s for BO PROB and 80 s for BO DIST in this evaluation setting on a standard workstation.

One can see that when the number of iterations is small, the BO DIST approach, which relies on estimating the means and variances of the underlying distributions, achieves higher yields than the BO PROB method, which relies on directly modeling empirical yields. As discussed in Section 2.2-D, this may be because the assumption of a Gaussian form compensates for the scarcity of experiments. Looking at the progression of the 75th percentiles, BO PROB improves its performance as iterations progress, while the BO DIST method deteriorates in the later iterations. This may be attributed to the Gaussianity assumption not holding exactly, which leads to the BO DIST method being informed by incorrect yield estimates, as discussed in Section 2.1-F.

C. Sensitivity studies for BO performance in ICP chamber simulations

As discussed in Section 2.2-D, the preferability of the newly proposed BO DIST algorithm, which relies on GPRs modeling the Gaussian distribution parameters, depends on the properties of the underlying problem. Here, we perform two sensitivity studies in order to illustrate the following:

When the desired specification area is narrower, the newly proposed algorithm works more favorably.
When the number of experiments is smaller, the newly proposed algorithm works more favorably.

Table 14 shows the specification intervals used in numerical experiments investigating the influence of the strictness of the desired specification on BO performance. In the case labeled as STRICT, the specified intervals of electron density and temperature are narrow, and the maximal possible yield is approximately 0.52. In the case labeled as LENIENT, the specification intervals are wider, with the maximal possible yield being approximately 0.71.

The number of experiments per iteration $N$ was set to 8 and the initial settings of control parameters were randomly selected, after which the BO PROB and BO DIST iterative procedures were run following Algorithm 1 in order to maximize the yield via optimization of the pair of controllable coil powers. The performance of each algorithm was investigated using the ratio of the yield determined by that algorithm to the actual maximal possible yield for the problem. For each BO algorithm, we performed 32 trials of the optimization procedure and calculated the medians of the aforementioned yield ratios, which are depicted in Figure 12.

It can be seen that when the specifications are strict, the BO DIST algorithm, which relies on the newly proposed concept of GPRs modeling the mean and variance parameters of the outgoing quality, outperforms the classical approach relying on the GPR modeling empirical yields. On the other hand, when the specifications are lenient, the classical approach achieves high yields in relatively fewer iterations. As discussed in Section 2.2-D, these observations may be attributed to the fact that the newly proposed algorithm takes into consideration the proximity of the observed characteristics to the desired specifications, thus facilitating better yield estimates from a smaller number of samples.
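The role of proximity to the specification can be illustrated with a minimal sketch for a single quality characteristic. The function names, the eight sample values, and the interval $[0.99, 1.01]$ are hypothetical and chosen only for illustration; the plug-in form follows ${\hat{p}}_{2} = F_{a, b} (\hat{μ}, {\hat{σ}}^{2})$ as used in this paper.

```python
import math

def empirical_yield(samples, a, b):
    """Bernoulli-type estimator: fraction of samples inside [a, b]."""
    return sum(a <= y <= b for y in samples) / len(samples)

def gaussian_yield(samples, a, b):
    """Gaussian plug-in estimator built from the sample mean and unbiased variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((y - mu) ** 2 for y in samples) / (n - 1)
    scale = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mu) / scale) - math.erf((a - mu) / scale))

# Hypothetical N = 8 measurements scattered around a strict specification.
samples = [1.02, 1.05, 0.97, 1.08, 1.04, 0.95, 1.10, 1.03]
a, b = 0.99, 1.01

p1 = empirical_yield(samples, a, b)   # no sample falls inside [a, b]
p2 = gaussian_yield(samples, a, b)    # still reflects how close the samples are
print(p1, p2)
```

In this illustration the empirical estimate is zero and can only take multiples of $1/8$, whereas the plug-in estimate is strictly positive and changes smoothly as the samples move toward or away from the interval.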

Under the same input parameter settings and specification limits as those considered in the previous subsection (input parameter settings listed in Table 10 and specification limits summarized in Table 11), we evaluated the performance of the optimization algorithms for different sample sizes $N$ available at each iteration in the BO algorithms. A small sample size of $N = 4$ and a large sample size of $N = 256$ were considered, with 32 random trials of the BO procedures conducted. Figure 13 reports the progression of the medians of the resulting ratios of the yields produced by the BO PROB and BO DIST methods to the maximal possible yield.

One can see that when the sample size $N$ is small, the BO DIST approach, which relies on the newly proposed concept of GPRs modeling the parameters of the Gaussian distribution, performs better. On the other hand, when $N$ is large, the classical approach relying on the GPR modeling the empirical yield becomes superior, except at the beginning of the iterations. As discussed in Section 2.2-D, the effectiveness of the proposed algorithm for small $N$ can be attributed to the ability of the Gaussian distribution assumption to compensate for a limited number of experiments, even when the Gaussianity assumption does not hold exactly. In contrast, the classical BO PROB algorithm shows its strength for large $N$ because the Bernoulli distribution-based yield estimator is consistent, a property that the yield estimator proposed in this paper does not possess in general, that is, when the Gaussianity assumption is violated.
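The consistency argument can be made concrete with a small numerical sketch. The exponential distribution, the interval $[0, 1]$, the seed, and all function names below are assumptions chosen as a deliberately non-Gaussian toy case; they are not the plasma simulation studied in this paper.

```python
import math
import random

def empirical_yield(samples, a, b):
    """Bernoulli-type estimator: fraction of samples inside [a, b]."""
    return sum(a <= y <= b for y in samples) / len(samples)

def gaussian_yield(samples, a, b):
    """Gaussian plug-in estimator built from the sample mean and unbiased variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((y - mu) ** 2 for y in samples) / (n - 1)
    scale = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mu) / scale) - math.erf((a - mu) / scale))

def compare(n, a=0.0, b=1.0, seed=0):
    """Draw n samples from Exp(1) (toy non-Gaussian data) and apply both estimators."""
    rng = random.Random(seed)
    samples = [rng.expovariate(1.0) for _ in range(n)]
    return empirical_yield(samples, a, b), gaussian_yield(samples, a, b)

# True probability that an Exp(1) variable falls in [0, 1].
true_yield = 1.0 - math.exp(-1.0)

for n in (4, 256, 100_000):
    p1, p2 = compare(n)
    print(n, round(p1, 3), round(p2, 3), round(true_yield, 3))
```

For large $N$ the empirical estimate approaches the true yield of this toy distribution, while the plug-in estimate settles at the value implied by a Gaussian with the same mean and variance, which differs from the true yield.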

4. Conclusions

In this paper, we introduced novel methodologies for estimating and maximizing yield in a manufacturing process, with yield being defined as the probability that quality characteristics fall within the desired specifications. The newly proposed yield estimator relies on estimating the expected value and variance of the outgoing product quality, based on which a Gaussian distribution form is assumed to calculate the yield. It was mathematically proven that if the outgoing product quality characteristics follow an uncorrelated multivariate Gaussian distribution, the newly proposed yield estimator achieves asymptotically lower MSE values compared to the classical yield estimator, which directly estimates the relevant Bernoulli distribution parameters. Additionally, we integrated this yield estimator into a BO framework in order to effectively optimize yield in the presence of uncertainties in the relevant manufacturing processes.

Theoretical insights regarding the properties of the novel yield estimator were verified in a synthetic example of yield estimation and optimization. In addition, the BO-based yield optimization procedures using the novel and traditional yield estimators were evaluated through high-fidelity simulations of a reactive ion etch chamber. The objective of these simulations was to determine tool parameters that maximize the probability of delivering desired electron density and temperature characteristics in plasma. BO using the newly proposed yield estimation algorithm showed advantages when the number of experiments was limited and when the desired specifications were strict, with advantages being more prevalent when the underlying distribution of the outgoing quality characteristics can be approximated well with an uncorrelated multivariate Gaussian distribution.

Looking ahead, future research should emphasize applications of the proposed method to experiments performed with physical equipment. Furthermore, the proposed method can be applied to estimate and maximize yields of various outgoing characteristics in different fields of manufacturing. From the perspective of theoretical developments, extending our methodologies to account for distributions beyond uncorrelated multivariate Gaussian distributions is essential. It will enable one to handle the critical question of when and to what degree the incorporation of prior knowledge about the underlying distribution form leads to a reduction in MSE when the distribution form differs from the uncorrelated multivariate Gaussian distributions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16073213/s1.

Author Contributions

Conceptualization, K.S. and D.K.; methodology, K.S. and D.K.; software, K.S.; validation, K.S., D.K., and D.D.; formal analysis, K.S. and D.K.; resources, D.K., Y.S. and H.M.; writing—original draft preparation, K.S.; writing—review and editing, K.S., D.K. and D.D.; project administration, Y.S., H.M. and D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated during the current study are available from the corresponding author on reasonable request. Please note that, as this research was conducted within a corporate setting, access to the data and scripts may require a Non-Disclosure Agreement (NDA) or other contractual arrangements. The script generated during the current study is provided in the Supplementary Materials.

Acknowledgments

We thank Takahito Matsuzawa and Masato Kazui for their assistance and support. We also gratefully acknowledge the helpful comments and feedback on the manuscript from members of Dragan Djurdjanovic’s laboratory. Finally, we thank our colleagues for their support during this work. During the preparation of this manuscript, the authors used GPT-4.1, GPT-4.1 mini, GPT-4o, and GPT-5 mini for the purposes of text editing and proofreading.

Conflicts of Interest

Authors Kei Sano, Daiki Kawahito, Yukiya Saito, and Hironori Moki were employed by Tokyo Electron Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Notations and Abbreviations

Symbol	Description
$x$	Scalar value
$\mathbf{x}$	Vector (column vector)
$\mathbf{X}$	Matrix
$x_{i}$	$i$-th element of vector $\mathbf{x}$
$X_{i j}$	$(i, j)$-th element of matrix $\mathbf{X}$

Abbreviation/Symbol	Full Term/Meaning
BO	Bayesian Optimization
MSE	Mean Squared Error
GPR	Gaussian Process Regression
EI	Expected Improvement (an acquisition function)
NEI	Noisy Expected Improvement (an acquisition function)
ICP	Inductively Coupled Plasma
ML	Machine Learning
$y$	Quality characteristics
$S = [a, b]$	Specification interval
$p_{*}$	True yield
$μ_{*}$	True mean
$σ_{*}^{2}$	True variance
$N$	Number of samples
${\hat{p}}_{1}$	Estimator based on Bernoulli distribution
${\hat{p}}_{2}$	Proposed estimator
$\hat{μ}$	Unbiased estimator of mean
${\hat{σ}}^{2}$	Unbiased estimator of variance
$\mathrm{erf}$	Error function
$N (\cdot, \cdot)$	Gaussian distribution
$χ_{N - 1}^{2}$	Chi-square distribution with $N - 1$ degrees of freedom
$\Pr (\cdot)$	Probability operator
$E [\cdot]$	Expectation operator
$VAR [\cdot]$	Variance operator
$\mathbf{x}$	Control parameter vector
$\overline{f} (\mathbf{x})$	Empirical estimate of yield at $\mathbf{x}$
$α_{\mathrm{EI}} (\mathbf{x} \mid \overline{f})$	Acquisition function based on EI given observations $\overline{f}$
$\mathrm{GPR}_{f} (\mathbf{x} \mid f)$	Distribution of $f$ modeled by Gaussian Process Regression given observations $f$

Appendix A. Derivation of the Confidence Interval for the Newly Proposed Yield Estimator

While the main text of our paper discusses the accuracy of the newly proposed yield estimator, its confidence interval is also important from a practical application perspective. In this appendix, we present the derivation of the confidence interval for the estimated yield using the newly proposed method. This derivation is supported with numerical experiments as well.

Appendix A.1. Confidence Interval for Newly Proposed Yield Estimator

Let $V$ denote a random variable describing the proposed estimator under the assumptions outlined in Section 2.1-C, where the quality characteristic $y$ follows a Gaussian distribution. Then, $V$ is expressed by (10), i.e.,

V = F (a, b, Y, Z),

(A1)

where

Z \sim N (μ_{*}, \frac{σ_{*}^{2}}{N}), \qquad \frac{Y}{σ_{*}^{2} (N - 1)} \sim χ_{N - 1}^{2} .

(A2)

By substituting $μ_{*}$ and $σ_{*}^{2}$ with the corresponding estimates $\hat{μ}$ and ${\hat{σ}}^{2}$, respectively, we can approximate the random variable $V$ with $V'$ as

V' = F (a, b, Y', Z'),

(A3)

where

Z' \sim N (\hat{μ}, \frac{{\hat{σ}}^{2}}{N}), \qquad \frac{Y'}{{\hat{σ}}^{2} (N - 1)} \sim χ_{N - 1}^{2} .

(A4)

Let us denote by $\hat{p}' (y)$ the probability density function of $V'$. Note that Monte Carlo sampling is necessary to evaluate $\hat{p}' (y)$, as its analytical form is intractable. Let $α$ denote the significance level. Then, the confidence interval can be defined as

[{\hat{p}}_{2} - h_{α}, {\hat{p}}_{2} + h_{α}],

(A5)

where

h_{α} = \min_{h > 0} \{ h \mid \int_{{\hat{p}}_{2} - h}^{{\hat{p}}_{2} + h} \hat{p}' (y) \, d y \geq 1 - α \} .

(A6)
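The procedure in (A3)–(A6) can be sketched with Monte Carlo sampling as follows. This is a minimal illustration under stated assumptions, not the script from the Supplementary Materials: the function names are hypothetical, $F$ is taken to be the Gaussian interval probability $F_{a, b}$ with the sampled mean and variance as arguments, and the variance draw is assumed to follow the usual sampling distribution ${\hat{σ}}^{2} χ_{N - 1}^{2} / (N - 1)$ of the unbiased variance estimator.

```python
import math
import random

def gaussian_interval_prob(a, b, mean, var):
    """Probability that a N(mean, var) variable falls within [a, b]."""
    scale = math.sqrt(2.0 * var)
    return 0.5 * (math.erf((b - mean) / scale) - math.erf((a - mean) / scale))

def yield_confidence_interval(samples, a, b, alpha=0.05, n_mc=1000, seed=0):
    """Monte Carlo sketch of the symmetric interval [p2 - h, p2 + h] in (A5)-(A6)."""
    rng = random.Random(seed)
    n = len(samples)
    mu_hat = sum(samples) / n
    var_hat = sum((y - mu_hat) ** 2 for y in samples) / (n - 1)
    p2 = gaussian_interval_prob(a, b, mu_hat, var_hat)

    deviations = []
    for _ in range(n_mc):
        # Assumed sampling distributions of the mean and variance estimates.
        z = rng.gauss(mu_hat, math.sqrt(var_hat / n))
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # chi-square with n - 1 d.o.f.
        v = var_hat * chi2 / (n - 1)
        deviations.append(abs(gaussian_interval_prob(a, b, z, v) - p2))

    # Smallest sampled h whose symmetric window covers at least a 1 - alpha share.
    deviations.sort()
    k = min(max(math.ceil((1.0 - alpha) * n_mc) - 1, 0), n_mc - 1)
    h = deviations[k]
    return p2, p2 - h, p2 + h

# Hypothetical measurements and specification interval.
samples = [1.02, 1.05, 0.97, 1.08, 1.04, 0.95, 1.10, 1.03]
p2, lower, upper = yield_confidence_interval(samples, 0.99, 1.01)
print(p2, lower, upper)
```

The half-width is obtained as an empirical quantile of the absolute deviations from ${\hat{p}}_{2}$, which is the sample counterpart of the minimization in (A6). The resulting bounds are not clipped to $[0, 1]$, in line with (A5).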

Appendix A.2. Numerical Evaluation of the Confidence Interval

We now examine the confidence intervals of the yield estimates. The experimental settings are the same as those described in Section 3.1-A,B-2 of the main text. For the confidence interval of ${\hat{p}}_{1}$, the Wald interval [50] is used, while for the confidence interval of ${\hat{p}}_{2}$, the method described above is employed, with the number of Monte Carlo samples used to evaluate $\hat{p}' (y)$ set to $10^{3}$. The significance level is set to $α = 0.05$. The confidence intervals for the settings in Section 3.1-A,B-2 are shown in Figure A1 and Figure A2, respectively.

Figure A1. The confidence intervals of the classical estimator ${\hat{p}}_{1}$ and the proposed estimator ${\hat{p}}_{2}$ for the evaluation setting in Section 3.1-A of the main text, for sample sizes $[2, 4, 8, 16, 32, 64, 128]$ and significance level $α = 0.05$. We repeated the experiment $10^{4}$ times, created the box plots, and averaged the upper and lower bounds of the confidence intervals.

Figure A2. The confidence intervals of the classical estimator ${\hat{p}}_{1}$ and the proposed estimator ${\hat{p}}_{2}$ for the evaluation setting in Section 3.1-B-1 of the main text, for sample sizes $[2, 4, 8, 16, 32, 64, 128]$ and significance level $α = 0.05$. We repeated the experiment $10^{4}$ times, created the box plots, and averaged the upper and lower bounds of the confidence intervals.

In the setting of Figure A1, the quality characteristics follow a Gaussian distribution. In this case, the superiority of the proposed estimator ${\hat{p}}_{2}$ is clearly visible. When the sample size $N$ is small, the estimates are less accurate and the confidence intervals are wider.

In particular, the confidence interval of the classical estimator is unstable when $N = 2$ and $N = 4$, since the Wald interval relies on the normal approximation from the central limit theorem, which is inaccurate for small $N$. As the sample size $N$ increases, we observe that both the confidence intervals and the estimation variability of the proposed estimator ${\hat{p}}_{2}$ are smaller than those of the classical estimator ${\hat{p}}_{1}$.

In the setting of Figure A2, the quality characteristics do not follow a Gaussian distribution. When the sample size $N$ is small, the estimates of the proposed estimator ${\hat{p}}_{2}$ are closer to the true yield than those of the classical estimator ${\hat{p}}_{1}$; however, the confidence intervals of the classical estimator ${\hat{p}}_{1}$ are clearly superior to those of the newly proposed estimator ${\hat{p}}_{2}$, especially when $N$ is sufficiently large. As the sample size $N$ increases, we observe that the confidence intervals and estimation variability of both estimators become smaller. Nevertheless, the confidence interval and the estimates of the proposed estimator ${\hat{p}}_{2}$ deviate from the true yield, since the assumption of Gaussianity does not hold. On the other hand, the confidence interval and estimates of the classical estimator ${\hat{p}}_{1}$ converge to the true yield as $N$ increases.

Appendix B. Proof of Lemma 1

In this appendix, we derive the following lemma needed for proving Lemma 1.

Lemma A1 (asymptotic expectation and variance of ${\hat{p}}_{2}$).

Under the same premises as in Lemma 1, the mean and variance of ${\hat{p}}_{2}$ are represented as follows:

\begin{aligned} E [{\hat{p}}_{2}] = {} & p_{*} + \frac{σ_{*}^{- 1}}{4 N \sqrt{2 π}} e^{- \frac{1}{2 σ_{*}^{2}} {(a - μ_{*})}^{2}} (σ_{*}^{- 2} {(a - μ_{*})}^{3} - (a - μ_{*})) \\ & - \frac{σ_{*}^{- 1}}{4 N \sqrt{2 π}} e^{- \frac{1}{2 σ_{*}^{2}} {(b - μ_{*})}^{2}} (σ_{*}^{- 2} {(b - μ_{*})}^{3} - (b - μ_{*})) + O (\frac{1}{N^{2}}) \end{aligned}

(A7)

\begin{aligned} VAR [{\hat{p}}_{2}] = {} & \frac{1}{4 π σ_{*}^{2} N} e^{- \frac{1}{σ_{*}^{2}} {(a - μ_{*})}^{2}} (2 σ_{*}^{2} + {(a - μ_{*})}^{2}) + \frac{1}{4 π σ_{*}^{2} N} e^{- \frac{1}{σ_{*}^{2}} {(b - μ_{*})}^{2}} (2 σ_{*}^{2} + {(b - μ_{*})}^{2}) \\ & - \frac{1}{4 π σ_{*}^{2} N} e^{- \frac{1}{2 σ_{*}^{2}} ({(a - μ_{*})}^{2} + {(b - μ_{*})}^{2})} (4 σ_{*}^{2} + 2 (a - μ_{*}) (b - μ_{*})) + O (\frac{1}{N^{2}}) \end{aligned}

(A8)

Proof of Lemma A1.

Let $F_{a, b} (x, y) : \mathbb{R} \times \mathbb{R}^{+} \to [0, 1]$ be the function whose value is the probability that a random variable $Y \sim N (x, y)$ falls within the interval $[a, b]$. The function $F_{a, b} (x, y)$ is expressed as

F_{a, b} (x, y) = \frac{1}{2} (\mathrm{erf} (\frac{b - x}{\sqrt{2 y}}) - \mathrm{erf} (\frac{a - x}{\sqrt{2 y}})) .

Note that ${\hat{p}}_{2} = F_{a, b} (\hat{μ}, {\hat{σ}}^{2})$. Let us now apply a Taylor expansion to ${\hat{p}}_{2}$ around $(μ_{*}, σ_{*}^{2})$, noting that since $\mathrm{erf} (\cdot)$ is analytic, as is $y^{- \frac{1}{2}}$ for $y \in \mathbb{R}^{+}$, the Taylor series expansion of $F_{a, b} (\hat{μ}, {\hat{σ}}^{2})$ around $(μ_{*}, σ_{*}^{2})$ converges pointwise to $F_{a, b}$ for any $(x, y) \in \mathbb{R} \times \mathbb{R}^{+}$.

The partial derivatives of $F_{a, b}$ are as follows:

\frac{\partial F_{a, b} (x, y)}{\partial x} = \frac{y^{- \frac{1}{2}}}{\sqrt{2 π}} (e^{- \frac{1}{2 y} {(a - x)}^{2}} - e^{- \frac{1}{2 y} {(b - x)}^{2}})

(A9)

\frac{\partial F_{a, b} (x, y)}{\partial y} = \frac{y^{- \frac{3}{2}}}{2 \sqrt{2 π}} ((a - x) e^{- \frac{1}{2 y} {(a - x)}^{2}} - (b - x) e^{- \frac{1}{2 y} {(b - x)}^{2}})

(A10)

\frac{\partial^{2} F_{a, b} (x, y)}{\partial x^{2}} = \frac{y^{- \frac{3}{2}}}{\sqrt{2 π}} ((a - x) e^{- \frac{1}{2 y} {(a - x)}^{2}} - (b - x) e^{- \frac{1}{2 y} {(b - x)}^{2}})

(A11)

\frac{\partial^{2} F_{a, b} (x, y)}{\partial y^{2}} = \frac{y^{- \frac{7}{2}}}{4 \sqrt{2 π}} (({(a - x)}^{3} - 3 (a - x) y) e^{- \frac{1}{2 y} {(a - x)}^{2}} - ({(b - x)}^{3} - 3 (b - x) y) e^{- \frac{1}{2 y} {(b - x)}^{2}})

(A12)
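As an informal check of (A9)–(A12), the closed-form derivatives can be compared against central finite differences of $F_{a, b}$. The evaluation point, the interval, the step size, and the function names in the sketch below are arbitrary illustrative choices.

```python
import math

def F(x, y, a, b):
    """F_{a,b}(x, y): probability that N(x, y) falls within [a, b]."""
    s = math.sqrt(2.0 * y)
    return 0.5 * (math.erf((b - x) / s) - math.erf((a - x) / s))

def analytic_derivatives(x, y, a, b):
    """Closed forms (A9)-(A12)."""
    ea = math.exp(-(a - x) ** 2 / (2.0 * y))
    eb = math.exp(-(b - x) ** 2 / (2.0 * y))
    c = math.sqrt(2.0 * math.pi)
    fx = y ** -0.5 / c * (ea - eb)                                  # (A9)
    fy = y ** -1.5 / (2.0 * c) * ((a - x) * ea - (b - x) * eb)      # (A10)
    fxx = y ** -1.5 / c * ((a - x) * ea - (b - x) * eb)             # (A11)
    fyy = y ** -3.5 / (4.0 * c) * (
        ((a - x) ** 3 - 3.0 * (a - x) * y) * ea
        - ((b - x) ** 3 - 3.0 * (b - x) * y) * eb
    )                                                               # (A12)
    return fx, fy, fxx, fyy

def numeric_derivatives(x, y, a, b, h=1e-3):
    """Central finite differences of F in x and y."""
    f0 = F(x, y, a, b)
    fx = (F(x + h, y, a, b) - F(x - h, y, a, b)) / (2.0 * h)
    fy = (F(x, y + h, a, b) - F(x, y - h, a, b)) / (2.0 * h)
    fxx = (F(x + h, y, a, b) - 2.0 * f0 + F(x - h, y, a, b)) / h ** 2
    fyy = (F(x, y + h, a, b) - 2.0 * f0 + F(x, y - h, a, b)) / h ** 2
    return fx, fy, fxx, fyy

# Arbitrary test point (x, y) and interval [a, b].
exact = analytic_derivatives(0.3, 1.3, -0.5, 1.5)
approx = numeric_derivatives(0.3, 1.3, -0.5, 1.5)
max_gap = max(abs(e - n) for e, n in zip(exact, approx))
print(max_gap)
```

Note that (A10) and (A11) differ only by a factor of two, which the sketch also reproduces.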

Since $\hat{μ} \sim N (μ_{*}, \frac{σ_{*}^{2}}{N})$, the moments of $\hat{μ}$ are as follows:

E [\hat{μ}] = μ_{*}; \quad VAR [\hat{μ}] = \frac{σ_{*}^{2}}{N}; \quad E [{(\hat{μ} - μ_{*})}^{2 m + 1}] = 0; \quad E [{(\hat{μ} - μ_{*})}^{2 m}] = O (\frac{1}{N^{m}})

(A13)

Furthermore, since $\frac{(N - 1) {\hat{σ}}^{2}}{σ_{*}^{2}} \sim χ_{N - 1}^{2}$, the central moments of ${\hat{σ}}^{2}$ are as follows:

E [{({\hat{σ}}^{2} - σ_{*}^{2})}^{m}] = {(\frac{σ_{*}^{2}}{N - 1})}^{m} E [{(V - E [V])}^{m}]

(A14)

where $V$ is a random variable which follows a $χ_{N - 1}^{2}$ distribution. From the recurrence relation of the central moments of $χ_{N - 1}^{2}$ [51], the central moments of $V$ satisfy

E [{(V - E [V])}^{2}] = 2 (N - 1); \quad E [{(V - E [V])}^{2 m}] = O (N^{m}); \quad E [{(V - E [V])}^{2 m + 1}] = O (N^{m}) .

Therefore,

E [{\hat{σ}}^{2} - σ_{*}^{2}] = 0; \quad E [{({\hat{σ}}^{2} - σ_{*}^{2})}^{2}] = \frac{2 σ_{*}^{4}}{N - 1} = \frac{2 σ_{*}^{4}}{N} + O (\frac{1}{N^{2}}),

(A15)

and for $s \geq 1$,

E [{({\hat{σ}}^{2} - σ_{*}^{2})}^{2 s}] = O (\frac{1}{N^{s}}); \quad E [{({\hat{σ}}^{2} - σ_{*}^{2})}^{2 s + 1}] = O (\frac{1}{N^{s}}) .

(A16)

Since $\hat{μ}$ and ${\hat{σ}}^{2}$ are independent, for any $s, t \geq 1$, we also have

E [{(\hat{μ} - μ_{*})}^{2 s - 1} {({\hat{σ}}^{2} - σ_{*}^{2})}^{t}] = 0

(A17)

E [{(\hat{μ} - μ_{*})}^{2 s} {({\hat{σ}}^{2} - σ_{*}^{2})}^{2 t}] = O (\frac{1}{N^{s + t}})

(A18)

E [{(\hat{μ} - μ_{*})}^{2 s} {({\hat{σ}}^{2} - σ_{*}^{2})}^{2 t + 1}] = O (\frac{1}{N^{s + t}}) .

(A19)

Focusing on the coefficients of the factor $\frac{1}{N}$, (A13)–(A19) lead to the following expression for the expectation of the Taylor expansion of ${\hat{p}}_{2} = F_{a, b} (\hat{μ}, {\hat{σ}}^{2})$ around $(μ_{*}, σ_{*}^{2})$:

\begin{aligned} E [{\hat{p}}_{2}] = {} & F_{a, b} (μ_{*}, σ_{*}^{2}) + \frac{1}{2} E [{(\hat{μ} - μ_{*})}^{2}] {[\frac{\partial^{2} F_{a, b} (x, y)}{\partial x^{2}}]}_{x = μ_{*}, y = σ_{*}^{2}} + \frac{1}{2} E [{({\hat{σ}}^{2} - σ_{*}^{2})}^{2}] {[\frac{\partial^{2} F_{a, b} (x, y)}{\partial y^{2}}]}_{x = μ_{*}, y = σ_{*}^{2}} + O (\frac{1}{N^{2}}) \\ = {} & F_{a, b} (μ_{*}, σ_{*}^{2}) + \frac{1}{2} \frac{σ_{*}^{2}}{N} {[\frac{\partial^{2} F_{a, b} (x, y)}{\partial x^{2}}]}_{x = μ_{*}, y = σ_{*}^{2}} + \frac{1}{2} \frac{2 σ_{*}^{4}}{N} {[\frac{\partial^{2} F_{a, b} (x, y)}{\partial y^{2}}]}_{x = μ_{*}, y = σ_{*}^{2}} + O (\frac{1}{N^{2}}) \end{aligned}

(A20)

Similarly,

\begin{aligned} VAR [{\hat{p}}_{2}] = {} & VAR [\hat{μ}] {({[\frac{\partial F_{a, b} (x, y)}{\partial x}]}_{x = μ_{*}, y = σ_{*}^{2}})}^{2} + VAR [{\hat{σ}}^{2}] {({[\frac{\partial F_{a, b} (x, y)}{\partial y}]}_{x = μ_{*}, y = σ_{*}^{2}})}^{2} + O (\frac{1}{N^{2}}) \\ = {} & \frac{σ_{*}^{2}}{N} {({[\frac{\partial F_{a, b} (x, y)}{\partial x}]}_{x = μ_{*}, y = σ_{*}^{2}})}^{2} + \frac{2 σ_{*}^{4}}{N} {({[\frac{\partial F_{a, b} (x, y)}{\partial y}]}_{x = μ_{*}, y = σ_{*}^{2}})}^{2} + O (\frac{1}{N^{2}}) . \end{aligned}

(A21)

By substituting (A9)–(A12) into (A20) and (A21), we obtain Equations (A7) and (A8), thus deriving Lemma A1. ◻
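The leading $\frac{1}{N}$ term of (A8) can be compared with a Monte Carlo estimate of $VAR [{\hat{p}}_{2}]$. The sketch below is an informal check under Gaussian data: rather than simulating raw samples, it draws $\hat{μ}$ and ${\hat{σ}}^{2}$ directly from their sampling distributions, and the parameter values, seed, and function names are arbitrary assumptions.

```python
import math
import random

def F(x, y, a, b):
    """F_{a,b}(x, y): probability that N(x, y) falls within [a, b]."""
    s = math.sqrt(2.0 * y)
    return 0.5 * (math.erf((b - x) / s) - math.erf((a - x) / s))

def asymptotic_variance(a, b, mu, var, n):
    """Leading 1/N term of VAR[p2_hat] as given in (A8)."""
    da, db = a - mu, b - mu
    pref = 1.0 / (4.0 * math.pi * var * n)
    return pref * (
        math.exp(-da ** 2 / var) * (2.0 * var + da ** 2)
        + math.exp(-db ** 2 / var) * (2.0 * var + db ** 2)
        - math.exp(-(da ** 2 + db ** 2) / (2.0 * var)) * (4.0 * var + 2.0 * da * db)
    )

def monte_carlo_variance(a, b, mu, var, n, reps=100_000, seed=0):
    """Sample variance of p2_hat over repeated draws of (mu_hat, var_hat)."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        mu_hat = rng.gauss(mu, math.sqrt(var / n))
        # (n - 1) * var_hat / var follows a chi-square law with n - 1 d.o.f.
        var_hat = var * rng.gammavariate((n - 1) / 2.0, 2.0) / (n - 1)
        vals.append(F(mu_hat, var_hat, a, b))
    mean = sum(vals) / reps
    return sum((v - mean) ** 2 for v in vals) / (reps - 1)

# Arbitrary setting: standard normal quality characteristic, N = 400.
a, b, mu, var, n = -0.5, 1.5, 0.0, 1.0, 400
approx = asymptotic_variance(a, b, mu, var, n)
simulated = monte_carlo_variance(a, b, mu, var, n)
print(approx, simulated)
```

The two values are expected to agree up to Monte Carlo noise and the neglected $O (\frac{1}{N^{2}})$ terms, so the agreement should tighten as $N$ grows.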

Note regarding Lemma A1.

Since $E [{({\hat{p}}_{2} - p_{*})}^{2}] = {(E [{\hat{p}}_{2}] - p_{*})}^{2} + VAR [{\hat{p}}_{2}]$, Lemma 1 is easily derived from Lemma A1.

Appendix C. Proof of Theorem 1 and Proposition 1

Since any Gaussian random variable can be transformed into a standard normal random variable, it is sufficient, without loss of generality, to consider Theorem 1 only for the case in which $y$ follows the standard normal distribution. Let us denote $S_{1} = \{(a, b) \in \mathbb{R}^{2} \mid 0 \leq a \leq b\}$ and $S_{2} = \{(a, b) \in \mathbb{R}^{2} \mid a \leq 0, - a \leq b\}$. Let $\partial S$ and $S^{o}$, respectively, be the boundary and interior of $S_{1} \cup S_{2}$, i.e., $\partial S = \{(a, b) \in \mathbb{R}^{2} \mid a \leq 0, - a = b\} \cup \{(a, b) \in \mathbb{R}^{2} \mid 0 \leq a, a = b\}$ and $S^{o} = \{(a, b) \in \mathbb{R}^{2} \mid 0 \leq a < b\} \cup \{(a, b) \in \mathbb{R}^{2} \mid a \leq 0, - a < b\}$.

Let us define the function $G (a, b) : S_{1} \cup S_{2} \to \mathbb{R}$ as follows:

G (a, b) = \lim_{N \to \infty} \frac{E [{({\hat{p}}_{1} - p_{*})}^{2}] - E [{({\hat{p}}_{2} - p_{*})}^{2}]}{1 / N}

The function $G (a, b)$ is expressed as follows:

G (a, b) = q (a, b) (1 - q (a, b)) - \frac{1}{4 π} ((2 + a^{2}) e^{- a^{2}} + (2 + b^{2}) e^{- b^{2}} - (4 + 2 a b) e^{- \frac{a^{2} + b^{2}}{2}})

where

q (a, b) = \frac{1}{2} (\mathrm{erf} (\frac{b}{\sqrt{2}}) - \mathrm{erf} (\frac{a}{\sqrt{2}}))
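As an informal sanity check of the closed form of $G (a, b)$, and not a substitute for the proof that follows, the function can be evaluated on a grid restricted to $S_{1} \cup S_{2}$. The grid bounds, the step, and the function names below are arbitrary illustrative choices.

```python
import math

def q(a, b):
    """Probability that a standard normal variable falls within [a, b]."""
    return 0.5 * (math.erf(b / math.sqrt(2.0)) - math.erf(a / math.sqrt(2.0)))

def G(a, b):
    """Closed form of the limiting scaled MSE difference between the two estimators."""
    qa = q(a, b)
    var_term = (
        (2.0 + a * a) * math.exp(-a * a)
        + (2.0 + b * b) * math.exp(-b * b)
        - (4.0 + 2.0 * a * b) * math.exp(-(a * a + b * b) / 2.0)
    ) / (4.0 * math.pi)
    return qa * (1.0 - qa) - var_term

def in_domain(a, b):
    """Membership test for S1 union S2."""
    return (0.0 <= a <= b) or (a <= 0.0 and -a <= b)

def grid_minimum(lo=-4.0, hi=6.0, step=0.05):
    """Smallest value of G over grid points lying in S1 union S2."""
    n = int(round((hi - lo) / step))
    best = (float("inf"), None, None)
    for i in range(n + 1):
        a = lo + i * step
        for j in range(n + 1):
            b = lo + j * step
            if in_domain(a, b):
                g = G(a, b)
                if g < best[0]:
                    best = (g, a, b)
    return best

g_min, a_min, b_min = grid_minimum()
print(g_min, a_min, b_min)
```

On such a grid the smallest values occur on the degenerate boundary $a = b$, where the specification interval has zero width and both terms of $G$ vanish.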

We aim to show that $G (a, b) \geq 0$ for all $(a, b) \in S_{1} \cup S_{2}$. In order to do that, let us analyze the behavior of $G (a, b)$ via its partial derivatives, which are as follows:

\frac{\partial G (a, b)}{\partial a} = \frac{e^{- \frac{a^{2}}{2}}}{2 π} (- \sqrt{2 π} (1 - 2 q (a, b)) - (2 a + a^{2} b - b) e^{- \frac{b^{2}}{2}} + (a + a^{3}) e^{- \frac{a^{2}}{2}})

\frac{\partial G (a, b)}{\partial b} = \frac{e^{- \frac{b^{2}}{2}}}{2 π} (\sqrt{2 π} (1 - 2 q (a, b)) - (2 b + a b^{2} - a) e^{- \frac{a^{2}}{2}} + (b + b^{3}) e^{- \frac{b^{2}}{2}})

Let us introduce $g (a, b)$ and $h (a, b)$ as simpler functions whose signs coincide with those of $\frac{\partial G (a, b)}{\partial a}$ and $\frac{\partial G (a, b)}{\partial b}$, respectively. Namely, we define $g (a, b)$ and $h (a, b)$ as

g (a, b) = - \sqrt{2 π} (1 - 2 q (a, b)) - (2 a + a^{2} b - b) e^{- \frac{b^{2}}{2}} + (a + a^{3}) e^{- \frac{a^{2}}{2}},

h (a, b) = \sqrt{2 π} (1 - 2 q (a, b)) - (2 b + a b^{2} - a) e^{- \frac{a^{2}}{2}} + (b + b^{3}) e^{- \frac{b^{2}}{2}} .

The partial derivatives of $g (a, b)$ and $h (a, b)$ are as follows:

\frac{\partial g (a, b)}{\partial a} = - ((a^{4} - 2 a^{2} + 1) e^{- \frac{a^{2}}{2}} + 2 (1 + a b) e^{- \frac{b^{2}}{2}})

\frac{\partial g (a, b)}{\partial b} = (a^{2} b^{2} - {(b - a)}^{2} + 3) e^{- \frac{b^{2}}{2}}

\frac{\partial h (a, b)}{\partial a} = (a^{2} b^{2} - {(b - a)}^{2} + 3) e^{- \frac{a^{2}}{2}}

\frac{\partial h (a, b)}{\partial b} = - ((b^{4} - 2 b^{2} + 1) e^{- \frac{b^{2}}{2}} + 2 (1 + a b) e^{- \frac{a^{2}}{2}})

Let us use $g_{b = \infty} (a)$ to denote $\lim_{b \to \infty} g (a, b)$. The function $g_{b = \infty} (a)$ and its derivative are expressed as below.

g_{b = \infty} (a) = \lim_{b \to \infty} g (a, b) = - \sqrt{2 π} \, \mathrm{erf} (\frac{a}{\sqrt{2}}) + (a + a^{3}) e^{- \frac{a^{2}}{2}}

\frac{d}{d a} g_{b = \infty} (a) = - (a^{4} - 2 a^{2} + 1) e^{- \frac{a^{2}}{2}}

Since $a^{4} - 2 a^{2} + 1 = {(a^{2} - 1)}^{2} \geq 0$, it follows that $\frac{d}{d a} g_{b = \infty} (a) \leq 0$.

In order to prove that $0 \leq G (a, b)$ over $S_{1} \cup S_{2}$, we will prove the following statements.

(i) The function $G (a, b)$ is non-negative over the boundary $\partial S$ of $S_{1} \cup S_{2}$.
(ii) $\lim_{b \to + \infty} G (a, b) \geq 0$ for all $a \in \mathbb{R}$.
(iii) There exist no local minima of $G (a, b)$ over the interior $S^{o}$ of $S_{1} \cup S_{2}$.

Statements (i) and (ii) will be shown via Lemma A2 and Lemma A3, respectively, after which statement (iii) will be proven.

Lemma A2.

Non-negativity of $G (a, b)$ over $\partial S$: $\forall (a, b) \in \partial S, \; G (a, b) \geq 0$.

Proof of Lemma A2.

The proof of this lemma will be conducted by showing that

\lim_{b \to a} G (a, b) = 0; \quad \min_{0 \leq b} G (- b, b) = 0,

and that

\max_{0 \leq b} G (- b, b) = G (- b_{*}, b_{*}) > 0,

where $b_{*}$ is the unique solution that satisfies $\sqrt{2 π} (1 - 2 \, \mathrm{erf} (\frac{b_{*}}{\sqrt{2}})) - (2 b_{*} - 2 b_{*}^{3}) e^{- \frac{b_{*}^{2}}{2}} = 0$.

One can easily confirm that $\lim_{b \to a} G (a, b) = 0$. Next, we show that $\min_{0 \leq b} G (- b, b) = 0$ and that $\max_{0 \leq b} G (- b, b) > 0$. One can obtain the following:

G (- b, b) = (1 - \mathrm{erf} (\frac{b}{\sqrt{2}})) \, \mathrm{erf} (\frac{b}{\sqrt{2}}) - \frac{1}{π} b^{2} e^{- b^{2}}

\frac{d G (- b, b)}{d b} = \frac{e^{- \frac{b^{2}}{2}}}{π} (\sqrt{2 π} (1 - 2 \, \mathrm{erf} (\frac{b}{\sqrt{2}})) - (2 b - 2 b^{3}) e^{- \frac{b^{2}}{2}})

Let us denote $l (b) = \sqrt{2 π} (1 - 2 \, \mathrm{erf} (\frac{b}{\sqrt{2}})) - (2 b - 2 b^{3}) e^{- \frac{b^{2}}{2}}$. Note that the sign of $l (b)$ coincides with the sign of $\frac{d G (- b, b)}{d b}$, and that $\frac{d}{d b} l (b) = - 2 (b^{2} - 1) (b^{2} - 3) e^{- b^{2} / 2}$. It is easy to confirm that

l (0) = \sqrt{2 π} > 0

l (1) = \sqrt{2 π} (1 - 2 \, \mathrm{erf} (\frac{1}{\sqrt{2}})) < 0

l (\sqrt{3}) = \sqrt{2 π} (1 - 2 \, \mathrm{erf} (\sqrt{\frac{3}{2}})) + 4 \sqrt{3} e^{- \frac{3}{2}} < 0 .

Therefore, a unique $b_{*}$ exists in $(0, 1)$ such that $G (- b, b)$ increases on the interval $[0, b_{*}]$ and decreases on the interval $[b_{*}, \infty)$. Since

G (0, 0) = 0 \quad \text{and} \quad \lim_{b \to \infty} G (- b, b) = 0,

$G (- b, b) \geq 0$ holds. Furthermore, $\max_{0 \leq b} G (- b, b) = G (- b_{*}, b_{*})$ and $b_{*}$ satisfies

\sqrt{2 π} (1 - 2 \, \mathrm{erf} (\frac{b_{*}}{\sqrt{2}})) - (2 b_{*} - 2 b_{*}^{3}) e^{- \frac{b_{*}^{2}}{2}} = 0 .

◻
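The location of $b_{*}$ can be approximated numerically by bisection on $l (b)$ over $(0, 1)$, using the sign change between $l (0) > 0$ and $l (1) < 0$ established in the proof. The sketch below is illustrative only; the function names and tolerance are arbitrary choices.

```python
import math

def l(b):
    """Function l(b) whose sign matches that of dG(-b, b)/db."""
    return (
        math.sqrt(2.0 * math.pi) * (1.0 - 2.0 * math.erf(b / math.sqrt(2.0)))
        - (2.0 * b - 2.0 * b ** 3) * math.exp(-b * b / 2.0)
    )

def G_sym(b):
    """G(-b, b) on the symmetric part of the boundary."""
    e = math.erf(b / math.sqrt(2.0))
    return (1.0 - e) * e - b * b * math.exp(-b * b) / math.pi

def bisect_root(f, lo, hi, tol=1e-12):
    """Bisection for a root of f on [lo, hi], assuming f(lo) > 0 > f(hi)."""
    assert f(lo) > 0.0 > f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b_star = bisect_root(l, 0.0, 1.0)
print(b_star, G_sym(b_star))
```

Evaluating `G_sym` at `b_star` and at neighboring points gives a quick check that the root of $l$ corresponds to a maximum of $G (- b, b)$ with a positive value.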

Lemma A3.

Non-negativity of $G (a, b)$ at infinity:

\lim_{b \to + \infty} G (a, b) \geq 0 \quad \forall a \in \mathbb{R} .

Proof of Lemma A3.

Let $G_{b = \infty} (a)$ denote $\lim_{b \to \infty} G (a, b)$. $G_{b = \infty} (a)$ is expressed as

G_{b = \infty} (a) = \frac{1}{4} (1 - \mathrm{erf}^{2} (\frac{a}{\sqrt{2}})) - \frac{1}{4 π} (2 + a^{2}) e^{- a^{2}} .

The derivative $\frac{d}{d a} G_{b = \infty} (a)$ is expressed as

\frac{d}{d a} G_{b = \infty} (a) = \frac{e^{- \frac{a^{2}}{2}}}{2 π} ((a^{3} + a) e^{- \frac{a^{2}}{2}} - \sqrt{2 π} \, \mathrm{erf} (\frac{a}{\sqrt{2}})) .

It is easy to see that

\frac{d}{d a} ((a^{3} + a) e^{- \frac{a^{2}}{2}} - \sqrt{2 π} \, \mathrm{erf} (\frac{a}{\sqrt{2}})) = - e^{- \frac{a^{2}}{2}} {(a^{2} - 1)}^{2} \leq 0 .

Also, it is easy to see that $\frac{d}{d a} G_{b = \infty} (0) = 0$. Therefore,

\begin{cases} \frac{d}{d a} G_{b = \infty} (a) > 0 & (a < 0) \\ \frac{d}{d a} G_{b = \infty} (a) < 0 & (0 < a) \end{cases} .

In other words, $G_{b = \infty} (a)$ has its maximum at $a = 0$. It is easy to confirm that

G_{b = \infty} (0) = \frac{1}{4} - \frac{1}{4} \cdot \frac{2}{π} > 0

and

\lim_{a \to \infty} G_{b = \infty} (a) = \lim_{a \to - \infty} G_{b = \infty} (a) = 0 .

Hence, $G_{b = \infty} (a) \geq 0$ for all $a$. ◻

To prove statement (iii), one should note that the necessary conditions for $(a, b)$ to be a local minimum are as follows:

g (a, b) = 0; \quad \frac{\partial g (a, b)}{\partial a} \geq 0

h (a, b) = 0; \quad \frac{\partial h (a, b)}{\partial b} \geq 0

Let us now introduce the following three sets: $S_{3} = \{(a, b) \in \mathbb{R}^{2} \mid a \leq - \sqrt{3}, - a < b\}$, $S_{4} = \{(a, b) \in \mathbb{R}^{2} \mid - \sqrt{3} \leq a < - 1, - a < b\}$, and $S_{5} = \{(a, b) \in \mathbb{R}^{2} \mid - 1 \leq a < 0, - \frac{1}{a} \leq b\}$. We will first prove that $g (a, b) > 0$ on $S_{3}$, $S_{4}$, and $S_{5}$ in Lemma A4, Lemma A5, and Lemma A6, respectively.

Lemma A4.

g (a, b) > 0

for

(a, b) \in S_{3} = {(a, b) \in R^{2} | a \leq - \sqrt{3}, - a < b}

.

Proof of Lemma A4.

Firstly,

\begin{array}{r} g (a, - a) = - \sqrt{2 π} (1 - 2 q (a, - a)) + (2 a^{3} - 2 a) e^{- \frac{a^{2}}{2}}, \end{array}

and it is easy to confirm that

g (- \sqrt{3}, \sqrt{3}) = 2 \sqrt{2 π} e r f (\sqrt{\frac{3}{2}}) - \sqrt{2 π} - 4 \sqrt{3} e^{- 3 / 2} > 0

. Since

\frac{\partial}{\partial a} g (a, - a) = - 2 (a^{2} - 3) (a^{2} - 1) e^{- a^{2} / 2}

, it is shown that

g (a, - a) \geq 0

for

\forall a \leq - \sqrt{3}

. In addition,

\frac{\partial g (a, b)}{\partial b} (a, b) \geq 0

for

(a, b) \in S_{3}

holds since

\begin{array}{r} \begin{array}{r} {\tilde{g}}_{b} (a, - a) = (a^{2} - 1) (a^{2} - 3) \geq 0 \end{array} \end{array}

and

\begin{array}{r} \frac{\partial}{\partial b} {\tilde{g}}_{b} (a, b) = 2 (a^{2} - 1) b + 2 a \geq 2 (a^{2} - 1) b - 2 b = 2 (a^{2} - 2) b > 0 . \end{array}

Therefore,

g (a, b) > 0

for

(a, b) \in S_{3}

. ◻
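
The first half of this proof can be illustrated numerically. The sketch below assumes that q(a, -a) for a < 0 is the standard normal probability of the interval [a, -a], that is, erf(-a / sqrt(2)); this reading is inferred from the stated value of g(-sqrt(3), sqrt(3)) and is not spelled out in this appendix. All function names are illustrative.

```python
import math

SQRT_2PI = math.sqrt(2 * math.pi)

def q_symmetric(a: float) -> float:
    """Assumed q(a, -a) for a < 0: standard normal mass of the interval [a, -a]."""
    return math.erf(-a / math.sqrt(2))

def g_diagonal(a: float) -> float:
    """g(a, -a) as written in the proof of Lemma A4."""
    return -SQRT_2PI * (1 - 2 * q_symmetric(a)) + (2 * a**3 - 2 * a) * math.exp(-a**2 / 2)

def g_diagonal_derivative(a: float) -> float:
    """Stated derivative of g(a, -a) with respect to a."""
    return -2 * (a**2 - 3) * (a**2 - 1) * math.exp(-a**2 / 2)

# Value at the corner a = -sqrt(3), computed two ways.
corner = g_diagonal(-math.sqrt(3))
stated = (2 * SQRT_2PI * math.erf(math.sqrt(1.5)) - SQRT_2PI
          - 4 * math.sqrt(3) * math.exp(-1.5))

# Illustrative grid covering a <= -sqrt(3), down to roughly a = -6.7.
grid = [-math.sqrt(3) - 0.05 * k for k in range(100)]
smallest_value = min(g_diagonal(a) for a in grid)
largest_slope = max(g_diagonal_derivative(a) for a in grid)
print(corner, smallest_value, largest_slope)
```

On this grid the smallest value of g(a, -a) occurs at a = -sqrt(3), which matches the monotonicity argument: the derivative is non-positive for a ≤ -sqrt(3), so g(a, -a) can only grow as a decreases.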

Lemma A5.

g (a, b) > 0

for

(a, b) \in S_{4} = {(a, b) \in R^{2} | - \sqrt{3} \leq a < - 1, - a \leq b}

.

Proof of Lemma A5.

For each fixed

a \in [- \sqrt{3}, - 1)

,

g (a, b)

attains the minimum at

b = b_{c} (a)

, where

\begin{array}{r} b_{c} (a) = \frac{- a + \sqrt{a^{4} - 3 a^{2} + 3}}{a^{2} - 1} . \end{array}

Note that when

a \in [- \sqrt{3}, - 1)

,

\begin{array}{r} \sqrt{3} \leq b_{c} (a) \leq \frac{- 2 a}{a^{2} - 1} \end{array}

holds. Subsequently,

\begin{array}{r} 2 \sqrt{2 π} q (a, b_{c} (a)) \geq 2 \sqrt{2 π} q (a, \sqrt{3}) \end{array}

\begin{array}{r} - (2 a + a^{2} b_{c} (a) - b_{c} (a)) \geq 0 \end{array} .

Therefore, for

(a, b) \in S_{4}

,

\begin{array}{r} g (a, b) \geq g (a, b_{c} (a)) \geq - \sqrt{2 π} (1 - 2 q (a, \sqrt{3})) + (a + a^{3}) e^{- \frac{a^{2}}{2}} . \end{array}

Here,

\frac{d}{d a} (- \sqrt{2 π} (1 - 2 q (a, \sqrt{3})) + (a + a^{3}) e^{- a^{2} / 2}) = - {(a^{2} - 1)}^{2} e^{- a^{2} / 2} \leq 0

.

Hence,

\begin{array}{rl} g (a, b) & \geq g (a, b_{c} (a)) \\ & \geq - \sqrt{2 π} (1 - 2 q (a, \sqrt{3})) + (a + a^{3}) e^{- \frac{a^{2}}{2}} \\ & \geq - \sqrt{2 π} (1 - 2 q (- 1, \sqrt{3})) - 2 e^{- \frac{1}{2}} \\ & > 0 . \end{array}

◻
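
The bounds on b_c(a) used in this proof can be checked on a grid. The sketch below evaluates b_c(a) from its stated formula and compares it with the two bounds. It also checks that b_c(a) is a root of the quadratic (a^2 - 1) b^2 + 2 a b + 3 - a^2; that quadratic is an inferred form of the auxiliary function \tilde{g}_b, reconstructed here from the two relations quoted in the proof of Lemma A4, and should be read as an assumption. Names and grid are illustrative.

```python
import math

SQRT3 = math.sqrt(3)

def b_c(a: float) -> float:
    """Stated minimizer b_c(a) for a in [-sqrt(3), -1)."""
    return (-a + math.sqrt(a**4 - 3 * a**2 + 3)) / (a**2 - 1)

def upper_bound(a: float) -> float:
    """Stated upper bound -2a / (a^2 - 1)."""
    return -2 * a / (a**2 - 1)

def inferred_quadratic(a: float, b: float) -> float:
    """Assumed form of the auxiliary function (inferred, not given explicitly here)."""
    return (a**2 - 1) * b**2 + 2 * a * b + 3 - a**2

# Grid on [-sqrt(3), -1), excluding the endpoint a = -1 where b_c diverges.
grid = [-SQRT3 + k * (SQRT3 - 1) / 200 for k in range(200)]

bounds_hold = all(SQRT3 - 1e-9 <= b_c(a) <= upper_bound(a) + 1e-9 for a in grid)
max_residual = max(abs(inferred_quadratic(a, b_c(a))) for a in grid)
print(bounds_hold, max_residual)
```

At a = -sqrt(3) both bounds coincide with b_c(a) = sqrt(3), which is why a small tolerance is used in the comparison.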

Lemma A6.

g (a, b) > 0

for

(a, b) \in S_{5} = {(a, b) \in R^{2} | - 1 \leq a < 0, - \frac{1}{a} \leq b}

.

Proof of Lemma A6.

When

b \geq - \frac{1}{a}

, the inequality

\frac{\partial g (a, b)}{\partial b} \leq 0

holds because

\begin{array}{r} {\tilde{g}}_{b} (a, - \frac{1}{a}) = 4 - {(a + \frac{1}{a})}^{2} \leq 0 \end{array}

and

\begin{array}{r} \frac{\partial}{\partial b} {\tilde{g}}_{b} (a, b) = 2 ((a^{2} - 1) b + a) \leq 2 ((a^{2} - 1) (- \frac{1}{a}) + a) = \frac{2}{a} < 0 . \end{array}

Therefore,

\begin{array}{r} \begin{array}{r} g (a, b) \geq g_{b = \infty} (a) > g_{b = \infty} (0) = 0 \end{array} . \end{array}

◻

In order to prove statement (iii), let us first note that

\frac{\partial g (a, b)}{\partial a} < 0

when

a b > - 1 .

Therefore, condition

a b \leq - 1

is necessary for a local minimum of

G (a, b)

. It is easy to confirm that the set

S^{o} \cap {(a, b) \in R^{2} | a b \leq - 1}

is a subset of

S_{3} \cup S_{4} \cup S_{5}

, which means that a local minimum of

G (a, b)

would need to be somewhere in the set

S_{3} \cup S_{4} \cup S_{5}

. However, Lemmas A4–A6 show that no point

(a, b) \in S_{3} \cup S_{4} \cup S_{5}

can be a local minimum of

G (a, b)

. Therefore, statement (iii) holds.

Proof of Theorem 1.

Statement (iii) shows that no point

(a, b)

in

S^{o}

can be the minimum of

G (a, b)

. Thus, the minimum of

G (a, b)

exists only on

\partial S

, or at points at infinity. To that end, statements (i) and (ii) show that

G (a, b) \geq 0

on

\partial S

and at points at infinity, respectively. Hence,

G (a, b)

is non-negative over

S_{1} \cup S_{2}

, which completes the proof of Theorem 1. ◻

Proof of Proposition 1.

When the true mean of the outgoing product quality is in the center of the specification interval, the asymptotic difference of

{\hat{p}}_{1}

and

{\hat{p}}_{2}

is represented as

G (- b, b), b \geq 0

. Lemma A2 shows that

\underset{0 \leq b}{\max} G (- b, b) = G (- b_{*}, b_{*})

, where

b_{*}

is the unique solution that satisfies

\sqrt{2 π} (1 - 2 \operatorname{erf} (\frac{b_{*}}{\sqrt{2}})) - (2 b_{*} - 2 b_{*}^{3}) e^{- \frac{b_{*}^{2}}{2}} = 0 .

This completes the proof of Proposition 1. ◻

Appendix D. Ablation Study of BO Configurations with Yield Estimators

We conduct an ablation study on the problem introduced in Section 3.1 to isolate the impact of our proposed yield estimator from other BO design choices. We compare Bayesian Optimization using the proposed estimator versus the classical estimator across different acquisition functions and kernel choices. For each configuration, we report quantiles of performance at each iteration. Although acquisition and kernel choices affect absolute performance, the results show that the proposed estimator consistently improves sample efficiency and final yields across the tested configurations. Detailed experimental settings and code are provided in the Supplemental Codebase.
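
The per-iteration quantile summary mentioned above can be sketched as follows. This is an illustration only: the data are randomly generated placeholders rather than results from the paper, the linear-interpolation (inclusive) percentile rule is an assumption, and `percentile_trajectories` is a hypothetical helper name.

```python
import random
from statistics import quantiles

def percentile_trajectories(yields_by_replication):
    """Return per-iteration (25th, 50th, 75th) percentiles across replications.

    yields_by_replication[r][t] is the yield reached by replication r at iteration t.
    """
    n_iterations = len(yields_by_replication[0])
    trajectories = []
    for t in range(n_iterations):
        column = [run[t] for run in yields_by_replication]
        q25, q50, q75 = quantiles(column, n=4, method="inclusive")
        trajectories.append((q25, q50, q75))
    return trajectories

# Placeholder data: 16 replications x 50 iterations of made-up yields in [0.40, 0.55].
rng = random.Random(0)
fake_yields = [[rng.uniform(0.40, 0.55) for _ in range(50)] for _ in range(16)]

trajectories = percentile_trajectories(fake_yields)
print(trajectories[0])
```

Each tuple in `trajectories` corresponds to one point on the three curves of a progression plot such as the ones reported in this appendix.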

Appendix D.1. Difference in Acquisition Functions

In this subsection, we investigate the effect of the acquisition function. We repeated the experiments from Section 3.1 using Expected Improvement (EI) as the acquisition function; detailed settings are available in the Supplemental Codebase. We consider four GPR configurations: DIST-EI, DIST-NEI, PROB-EI, and PROB-NEI. The prefixes DIST and PROB denote the GPR target as in the main text, while the suffixes EI and NEI denote the acquisition functions—Expected Improvement and Noisy Expected Improvement. Hence, DIST-EI uses estimated distribution parameters with EI; DIST-NEI uses estimated distribution parameters with NEI; PROB-EI uses empirical yield with EI; and PROB-NEI uses empirical yield with NEI. The results are shown in Figure A3 and Table A1. Focusing on acquisition function differences, Noisy EI outperforms standard EI in this problem setting. This is expected because the task inherently includes observation noise, and Noisy EI explicitly accounts for that noise when evaluating improvement.
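
For reference, the standard closed-form Expected Improvement for a Gaussian posterior (maximization) can be sketched as below. This is a textbook formula written with hypothetical names, not the implementation used in the experiments. Noisy EI is not shown: it has no comparable closed form because the incumbent value itself is uncertain under observation noise, so it is typically approximated by Monte Carlo sampling.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()

def expected_improvement(mean: float, std: float, best: float) -> float:
    """Closed-form EI for maximization under a Gaussian posterior N(mean, std^2).

    `best` is the incumbent (best objective value observed so far), treated as
    noise-free; this is the assumption that Noisy EI relaxes.
    """
    if std <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _STANDARD_NORMAL.cdf(z) + std * _STANDARD_NORMAL.pdf(z)

# Illustrative candidates: same posterior mean, different posterior uncertainty.
ei_confident = expected_improvement(mean=0.50, std=0.01, best=0.52)
ei_uncertain = expected_improvement(mean=0.50, std=0.05, best=0.52)
print(ei_confident, ei_uncertain)
```

With the posterior mean below the incumbent, the more uncertain candidate receives the larger EI, which is the exploration behavior the acquisition function is designed to produce.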

Figure A3. The progression of the 25th, 50th and 75th percentiles of yields obtained for the control parameters suggested over 16 random replications of each algorithm. The black dashed line corresponds to the highest possible yield in this synthetic problem.

Table A2 reports the computation times for the experiments, measured on a standard workstation. As noted in Section 2.2-C, NEI requires more computation than EI because NEI relies on Monte Carlo simulation to evaluate expected improvement. Similarly, the proposed estimator is more computationally demanding than the classical estimator because it uses Monte Carlo methods to compute posterior distributions. In these experiments, NEI required roughly 1.3× the computation time of EI (31.102 vs. 23.447 for DIST and 6.148 vs. 4.670 for PROB in Table A2), while GPRs using the proposed yield estimator required about 5× the computation time of those using the classical yield estimator.

Table A1. Yields suggested by BO methods as iterations progressed.

Percentile	25				50
Method	BO DIST-EI	BO DIST-NEI	BO PROB-EI	BO PROB-NEI	BO DIST-EI	BO DIST-NEI	BO PROB-EI	BO PROB-NEI
Iteration
4	0.461	0.501	0.447	0.429	0.496	0.528	0.472	0.501
9	0.454	0.536	0.448	0.482	0.495	0.542	0.488	0.511
14	0.466	0.529	0.447	0.476	0.531	0.539	0.485	0.496
19	0.473	0.525	0.441	0.484	0.532	0.538	0.484	0.500
24	0.481	0.534	0.444	0.482	0.534	0.540	0.488	0.516
29	0.479	0.531	0.449	0.479	0.536	0.540	0.481	0.498
34	0.514	0.534	0.448	0.497	0.536	0.540	0.488	0.526
39	0.479	0.536	0.448	0.500	0.527	0.540	0.487	0.522
44	0.520	0.531	0.448	0.497	0.536	0.540	0.487	0.522
49	0.514	0.535	0.448	0.497	0.529	0.542	0.487	0.525
Percentile	75
Method	BO DIST-EI	BO DIST-NEI	BO PROB-EI	BO PROB-NEI
4	0.538	0.539	0.517	0.524
9	0.533	0.543	0.518	0.524
14	0.535	0.541	0.521	0.524
19	0.537	0.541	0.517	0.528
24	0.538	0.541	0.517	0.534
29	0.539	0.542	0.517	0.534
34	0.538	0.542	0.517	0.534
39	0.537	0.542	0.518	0.539
44	0.539	0.542	0.518	0.541
49	0.537	0.543	0.518	0.535

Table A2. Computation time for one trial (50 iterations), mean ± STD.

BO DIST-EI	BO DIST-NEI	BO PROB-EI	BO PROB-NEI
23.447 ± 3.705	31.102 ± 3.105	4.670 ± 0.638	6.148 ± 0.609

Appendix D.2. Difference in Kernel Functions

To illustrate differences between kernel functions, we compare the RBF kernel and the exponential kernel. Both are special cases of the Matérn kernel: the exponential kernel corresponds to Matérn with ν = 1/2, while the RBF kernel is obtained in the limit ν → ∞. Because ν controls smoothness, this comparison provides an overview of how performance depends on the assumed kernel smoothness.
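
The smoothness difference between these kernels can be seen directly from their standard formulas. The sketch below uses unit signal variance and a hypothetical length scale of 1; the Matérn 5/2 kernel is included only as an intermediate reference point and is not one of the kernels compared in this appendix.

```python
import math

def exponential_kernel(r: float, length_scale: float = 1.0) -> float:
    """Matern kernel with nu = 1/2: exp(-|r| / l). Sample paths are not differentiable."""
    return math.exp(-abs(r) / length_scale)

def matern52_kernel(r: float, length_scale: float = 1.0) -> float:
    """Matern kernel with nu = 5/2, shown as an intermediate smoothness level."""
    s = math.sqrt(5) * abs(r) / length_scale
    return (1 + s + s * s / 3) * math.exp(-s)

def rbf_kernel(r: float, length_scale: float = 1.0) -> float:
    """Limit nu -> infinity: exp(-r^2 / (2 l^2)). Sample paths are infinitely smooth."""
    return math.exp(-(r * r) / (2 * length_scale**2))

# Correlation lost over a short distance: linear in r for the exponential kernel,
# quadratic in r for the RBF kernel.
r_small = 0.01
drop_exponential = 1 - exponential_kernel(r_small)
drop_rbf = 1 - rbf_kernel(r_small)
print(drop_exponential, drop_rbf)
```

The exponential kernel decorrelates nearby inputs much faster than the RBF kernel, so a GPR with the exponential kernel assumes a rougher response surface.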

We consider four GPR configurations: DIST-EXP, DIST-RBF, PROB-EXP, and PROB-RBF. Here, the prefixes DIST and PROB denote the GPR target as in the main text. The suffix EXP denotes the exponential kernel and RBF denotes the radial basis function kernel. Thus, DIST-EXP uses estimated distribution parameters with an exponential kernel, DIST-RBF uses estimated distribution parameters with an RBF kernel, PROB-EXP uses empirical yield with an exponential kernel, and PROB-RBF uses empirical yield with an RBF kernel. The results are shown in Figure A4 and Table A3.

When we examine the trajectories at the 25th and 50th percentiles, the RBF kernel appears to perform slightly better than the exponential kernel. Although acceptability (i.e., whether quality characteristics fall within specification) is discontinuous with respect to the quality metrics, the relationships between control parameters and yield are relatively smooth in this problem, as shown in Figure 8. This may partly explain the RBF kernel’s slight advantage. Nonetheless, the variation in performance due to the chosen GPR target is larger than that due to the kernel choice, as evidenced by the 25th-percentile trajectory.

Figure A4. The progression of the 25th, 50th and 75th percentiles of yields obtained for the control parameters suggested over 16 random replications of each algorithm. The black dashed line corresponds to the highest possible yield in this synthetic problem.

Table A3. Yields suggested by BO methods as iterations progressed.

Percentile	25				50
Method	BO DIST-EXP	BO DIST-RBF	BO PROB-EXP	BO PROB-RBF	BO DIST-EXP	BO DIST-RBF	BO PROB-EXP	BO PROB-RBF
Iteration
4	0.479	0.479	0.447	0.449	0.512	0.523	0.483	0.509
9	0.504	0.508	0.472	0.490	0.531	0.538	0.487	0.527
14	0.520	0.538	0.482	0.468	0.536	0.540	0.501	0.517
19	0.526	0.532	0.455	0.467	0.540	0.537	0.491	0.517
24	0.530	0.531	0.482	0.466	0.537	0.537	0.506	0.509
29	0.529	0.532	0.501	0.469	0.538	0.538	0.523	0.513
34	0.531	0.532	0.487	0.517	0.537	0.541	0.524	0.531
39	0.529	0.533	0.511	0.510	0.540	0.541	0.524	0.534
44	0.533	0.537	0.499	0.511	0.540	0.542	0.514	0.527
49	0.531	0.539	0.499	0.510	0.540	0.541	0.519	0.526
Percentile	75
Method	BO DIST-EXP	BO DIST-RBF	BO PROB-EXP	BO PROB-RBF
4	0.526	0.535	0.520	0.538
9	0.542	0.542	0.524	0.541
14	0.542	0.542	0.530	0.536
19	0.542	0.541	0.510	0.536
24	0.542	0.542	0.533	0.535
29	0.542	0.542	0.532	0.537
34	0.541	0.542	0.536	0.538
39	0.541	0.542	0.534	0.538
44	0.541	0.543	0.532	0.541
49	0.542	0.543	0.532	0.540

References

1. Brajlih, T.; Valentan, B.; Balic, J.; Drstvensek, I. Speed and accuracy evaluation of additive manufacturing machines. Rapid Prototyp. J. 2011, 17, 64–75.
2. Kumar, N.; Kennedy, K.; Gildersleeve, K.; Abelson, R.; Mastrangelo, C.M.; Montgomery, D.C. A review of yield modelling techniques for semiconductor manufacturing. Int. J. Prod. Res. 2006, 44, 5019–5036.
3. Katari, M.; Shanmugam, L.; Malaiyappan, J.N.A. Integration of AI and machine learning in semiconductor manufacturing for defect detection and yield improvement. J. Artif. Intell. Gen. Sci. 2024, 3, 418–431.
4. Liang, Y.-Z.; Xie, P.; Chan, K. Quality control of herbal medicines. J. Chromatogr. B 2004, 812, 53–70.
5. Appleton, T.; Bryan, P.; Contos, D.; Henry, T.R.; Lehmann, P.; Ohorodnik, S.; Reed, D.; Robichaud, C.; Schetter, J.; South, N.; et al. Nonclinical dose formulation: Out of specification investigations. AAPS J. 2012, 14, 523–529.
6. Chen, W.; Wiecek, M.M.; Zhang, J. Quality utility—A compromise programming approach to robust design. J. Mech. Des. 1999, 121, 179–187.
7. Zang, C.; Friswell, M.; Mottershead, J. A review of robust optimal design and its application in dynamics. Comput. Struct. 2005, 83, 315–326.
8. Du, X.; Chen, W. Towards a better understanding of modeling feasibility robustness in engineering design. J. Mech. Des. 1999, 122, 385–394.
9. Ochoa, J.S.; Cangellaris, A.C. Random-space dimensionality reduction for expedient yield estimation of passive microwave structures. IEEE Trans. Microw. Theory Tech. 2013, 61, 4313–4321.
10. Styblinski, M.A.; Opalski, L.J. Algorithms and software tools for IC yield optimization based on fundamental fabrication parameters. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 1986, 5, 79–89.
11. Dolecek, L.; Qazi, M.; Shah, D.; Chandrakasan, A. Breaking the simulation barrier: SRAM evaluation through norm minimization. In Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 10–13 November 2008; pp. 322–329.
12. Fang, C.; Yang, F.; Zeng, X.; Li, X. BMF-BD: Bayesian model fusion on bernoulli distribution for efficient yield estimation of integrated circuits. In Proceedings of the 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 1–5 June 2014; pp. 1–6.
13. Ashby, M.F. Multi-objective optimization in material design and selection. Acta Mater. 2000, 48, 359–369.
14. Sarkar, D.; Rohani, S.; Jutan, A. Multi-objective optimization of seeded batch crystallization processes. Chem. Eng. Sci. 2006, 61, 5282–5295.
15. Shao, J. Mathematical Statistics; Springer: New York, NY, USA, 2003.
16. Sano, S.; Kadowaki, T.; Tsuda, K.; Kimura, S. Application of bayesian optimization for pharmaceutical product development. J. Pharm. Innov. 2020, 15, 333–343.
17. Cho, S.; Kim, M.; Lee, J.; Han, A.; Na, J.; Moon, I. Multi-objective optimization of explosive waste treatment process considering environment via bayesian active learning. Eng. Appl. Artif. Intell. 2023, 117, 105463.
18. Iwama, R.; Kaneko, H. Design of ethylene oxide production process based on adaptive design of experiments and bayesian optimization. J. Adv. Manuf. Process. 2021, 3, e10085.
19. Al-Qahtani, K.; Elkamel, A.; Ponnambalam, K. Robust optimization for petrochemical network design under uncertainty. Ind. Eng. Chem. Res. 2008, 47, 3912–3919.
20. Rashidi, N.; Wang, Q.; Burgos, R.; Roy, C.; Boroyevich, D. Multi-objective design and optimization of power electronics converters with uncertainty quantification—Part I: Parametric uncertainty. IEEE Trans. Power Electron. 2020, 36, 1463–1474.
21. Zhang, Y.; Apley, D.W.; Chen, W. Bayesian optimization for materials design with mixed quantitative and qualitative variables. Sci. Rep. 2020, 10, 4924.
22. Chen, Y.; Hassani, H.; Krause, A. Near-optimal bayesian active learning with correlated and noisy tests. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 223–231.
23. Cakmak, S.; Astudillo Marban, R.; Frazier, P.; Zhou, E. Bayesian optimization of risk measures. Adv. Neural Inf. Process. Syst. 2020, 33, 20130–20141.
24. Picheny, V.; Moss, H.; Torossian, L.; Durrande, N. Bayesian quantile and expectile optimisation. In Proceedings of the Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022; pp. 1623–1633.
25. Makarova, A.; Usmanova, I.; Bogunovic, I.; Krause, A. Risk-averse heteroscedastic bayesian optimization. Adv. Neural Inf. Process. Syst. 2021, 34, 17235–17245.
26. Daulton, S.; Cakmak, S.; Balandat, M.; Osborne, M.A.; Zhou, E.; Bakshy, E. Robust multi-objective bayesian optimization under input noise. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 4831–4866.
27. Wang, X.; Yan, C.; Ma, Y.; Yu, B.; Yang, F.; Zhou, D.; Zeng, X. Analog circuit yield optimization via freeze–thaw bayesian optimization technique. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 4887–4900.
28. Iwazaki, S.; Inatsu, Y.; Takeuchi, I. Bayesian quadrature optimization for probability threshold robustness measure. Neural Comput. 2021, 33, 3413–3466.
29. Inatsu, Y.; Iwazaki, S.; Takeuchi, I. Active learning for distributionally robust level-set estimation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4574–4584.
30. Routtenberg, T.; Tabrikian, J. General classes of bayesian lower bounds for outage error probability and MSE. In Proceedings of the 2009 15th IEEE/SP Workshop on Statistical Signal Processing, Cardiff, UK, 31 August–3 September 2009; pp. 69–72.
31. Cabilio, P. Sequential estimation in bernoulli trials. Ann. Stat. 1977, 5, 342–356.
32. Rice, J.A. Mathematical Statistics and Data Analysis; Thomson/Brooks/Cole: Belmont, CA, USA, 2007; Volume 371.
33. Oehlert, G.W. A note on the delta method. Am. Stat. 1992, 46, 27–29.
34. Casella, G.; Berger, R. Statistical Inference; CRC Press: Boca Raton, FL, USA, 2024.
35. Zhao, P.; Wu, J.; Liu, Z.; Wang, C.; Fan, R.; Li, Q. Differential private stochastic optimization with heavy-tailed data: Towards optimal rates. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; AAAI: Washington, DC, USA, 2025; Volume 39, pp. 22795–22803.
36. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer Science & Business Media: New York, NY, USA, 2006.
37. Zhang, Q.; Huang, W.; Jin, C.; Zhao, P.; Shu, Y.; Shen, L.; Tao, D. Multinoulli extension: A lossless yet effective probabilistic framework for subset selection over partition constraints. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
38. Frazier, P.I. A tutorial on bayesian optimization. arXiv 2018, arXiv:1807.02811.
39. Letham, B.; Karrer, B.; Ottoni, G.; Bakshy, E. Constrained bayesian optimization with noisy experiments. Bayesian Anal. 2019, 14, 495–519.
40. Balandat, M.; Karrer, B.; Jiang, D.R.; Daulton, S.; Letham, B.; Wilson, A.G.; Bakshy, E. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; NeurIPS: San Diego, CA, USA, 2020; Volume 33.
41. Lee, C.; Graves, D.B.; Lieberman, M.A.; Hess, D.W. Global model of plasma chemistry in a high density oxygen discharge. J. Electrochem. Soc. 1994, 141, 1546.
42. COMSOL. Coil Optimization of an ICP Reactor. Available online: https://www.comsol.jp/model/coil-optimization-of-an-icp-reactor-96771 (accessed on 10 March 2026).
43. Kokura, H.; Nakamura, K.; Ghanashev, I.P.; Sugai, H. Plasma absorption probe for measuring electron density in an environment soiled with processing plasmas. Jpn. J. Appl. Phys. 1999, 38, 5262.
44. Nakamura, K.; Ohata, M.; Sugai, H. Highly sensitive plasma absorption probe for measuring low-density high-pressure plasmas. J. Vac. Sci. Technol. A 2003, 21, 325–331.
45. Scharwitz, C.; Böke, M.; Hong, S.-H.; Winter, J. Experimental characterisation of the plasma absorption probe. Plasma Process. Polym. 2007, 4, 605–611.
46. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611.
47. Sobol, I.M. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys. 1967, 7, 86–112.
48. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
49. Rodriguez-Galiano, V.; Sanchez-Castillo, M.; Chica-Olmo, M.; Chica-Rivas, M. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geol. Rev. 2015, 71, 804–818.
50. Wald, A. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 1943, 54, 426–482.
51. Willink, R. Relationships between central moments and cumulants, with formulae for the central moments of gamma distributions. Commun. Stat.-Theory Methods 2003, 32, 701–704.

Figure 1. The probability density function of the standard normal distribution and the specification interval. This distribution and interval correspond to the case described in Proposition 1.

Figure 2. MSEs of the classical estimator

{\hat{p}}_{1}

and the proposed estimator

{\hat{p}}_{2}

for estimating the probability shown in Figure 1 for sample sizes (

N = {2, 4, 8, 16, 32, 64, 128}

). For empirical MSEs, we repeated

10^{4}

experiments and took the average.

Figure 3. A sketch of the ICP chamber used in the numerical experiments considered in this paper. The settings of simulation parameters are based on [42]. The simulation model is modified so that the powers of the four coils can be set to distinct values and treated as control parameters. In addition, the pressure at the outlet is parameterized to emulate variations in pressure controllers.

Figure 4. The true distribution of electron density constructed from multiple simulations and the specified interval. Although this distribution is unimodal and in that respect suitable for approximation by a Gaussian distribution, the Shapiro–Wilk test rejects normality, with a p-value of

2.64 \times 10^{- 12}

.

Figure 5. The MSEs of yield estimators

{\hat{p}}_{1}

and

{\hat{p}}_{2}

for estimating the probability shown in Figure 4 for sample sizes (

N = {2, 4, 8, 16, 32, 64, 128}

). For empirical MSEs, we repeated

10^{4}

experiments and took the average.

Figure 6. The true distribution of electron density and temperature constructed from multiple simulation runs, together with the specified intervals. The true distribution is non-Gaussian, and the two quantities are correlated. The desired specification area is represented by the pink square.

Figure 7. The MSEs of yield estimators

{\hat{p}}_{1}

and

{\hat{p}}_{2}

for estimating the probability shown in Figure 6 for various sample sizes (

N = {2, 4, 8, 16, 32, 64, 128}

). For empirical MSEs, we repeated the experiment

10^{4}

times and took the average.

Figure 8. The relationship between control parameter

x

and Gaussian distribution parameters. Both the mean and the variance parameters are determined by the given

x

. The red X corresponds to the maximum yield in this problem.

Figure 9. The progression of the 25th, 50th and 75th percentiles of yields obtained for the control parameters suggested over 16 random replications of the BO PROB and BO DIST algorithms. The black dashed line corresponds to the highest possible yield in this synthetic problem.

Figure 10. The relationship between control parameters and yield in the problem, obtained using the surrogate model. Pw2 and Pw3 correspond to the powers of coils described in Figure 3.

Figure 11. The progression of the 25th, 50th and 75th percentiles of yields obtained for the control parameters suggested over 16 random replications of the BO algorithms considered in this paper. The black dashed line corresponds to the maximal possible yield, obtained via analysis of yield dependency on the ICP control parameters illustrated in Figure 10.

Figure 12. Progression of the ratio of the yield achieved by each algorithm to the maximal possible yield for the STRICT and LENIENT problems defined in Table 14.

Figure 13. Progression of the ratio of the yield achieved by the BO PROB and BO DIST algorithms to the maximal possible yield, as determined via analysis of the function illustrated in Figure 10.

Table 1. Empirical squared errors of estimators (Mean ± STD).

N	Empirical Squared Errors of ${\hat{p}}_{1}$	Empirical Squared Errors of ${\hat{p}}_{2}$
2	1.15 × $10^{- 1}$ ± 1.25 × $10^{- 1}$	6.93 × $10^{- 2}$ ± 1.01 × $10^{- 1}$
4	5.68 × $10^{- 2}$ ± 7.27 × $10^{- 2}$	2.39 × $10^{- 2}$ ± 4.87 × $10^{- 2}$
8	2.89 × $10^{- 2}$ ± 3.90 × $10^{- 2}$	9.30 × $10^{- 3}$ ± 1.90 × $10^{- 2}$
16	1.43 × $10^{- 2}$ ± 1.99 × $10^{- 2}$	4.16 × $10^{- 3}$ ± 7.91 × $10^{- 3}$
32	7.01 × $10^{- 3}$ ± 9.86 × $10^{- 3}$	1.92 × $10^{- 3}$ ± 3.23 × $10^{- 3}$
64	3.57 × $10^{- 3}$ ± 5.05 × $10^{- 3}$	9.44 × $10^{- 4}$ ± 1.48 × $10^{- 3}$
128	1.83 × $10^{- 3}$ ± 2.57 × $10^{- 3}$	4.54 × $10^{- 4}$ ± 6.64 × $10^{- 4}$
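
The kind of comparison summarized in Table 1 can be sketched with a small Monte Carlo experiment. The code below is an assumption-laden illustration, not the paper's procedure: it uses a simple plug-in Gaussian estimator as a stand-in for the proposed estimator, a hypothetical specification interval of [-1, 1], standard normal data, and 2000 repetitions instead of 10^4. The numbers it prints are therefore not expected to reproduce Table 1.

```python
import random
from statistics import NormalDist, fmean, stdev

STANDARD_NORMAL = NormalDist()

def classical_estimate(sample, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    return sum(lower <= x <= upper for x in sample) / len(sample)

def plugin_gaussian_estimate(sample, lower, upper):
    """Fit a Gaussian by sample mean and standard deviation, then integrate it
    over [lower, upper]. A stand-in, not necessarily the paper's estimator."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    return fitted.cdf(upper) - fitted.cdf(lower)

def compare_mse(n, reps=2000, lower=-1.0, upper=1.0, seed=0):
    """Empirical MSE of both estimators on standard normal samples of size n."""
    rng = random.Random(seed)
    true_p = STANDARD_NORMAL.cdf(upper) - STANDARD_NORMAL.cdf(lower)
    se_classical, se_plugin = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        se_classical.append((classical_estimate(sample, lower, upper) - true_p) ** 2)
        se_plugin.append((plugin_gaussian_estimate(sample, lower, upper) - true_p) ** 2)
    return fmean(se_classical), fmean(se_plugin)

mse_classical, mse_plugin = compare_mse(n=16)
print(f"classical: {mse_classical:.4f}  plug-in: {mse_plugin:.4f}")
```

Both estimators are evaluated on the same samples, so the two squared-error lists are paired, which is convenient for a subsequent resampling-based comparison.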

Table 2. One-sided hypothesis testing using the bootstrap method (number of resamples =

10^{4}

).

N	One-Sided Alternative Hypothesis	p-Value	Effect Size (Difference in Average)	95% CI of Difference in Squared Error (Percentiles from 2.5 to 97.5)
2	Squared error of ${\hat{p}}_{2}$ is less than that of ${\hat{p}}_{1}$ .	< $10^{- 16}$	−0.0460	[−0.04912, −0.04284]
128	Squared error of ${\hat{p}}_{2}$ is less than that of ${\hat{p}}_{1}$ .	< $10^{- 16}$	−0.00138	[−0.00143, −0.001327]
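
One way to obtain an effect size, a percentile interval, and a one-sided resampling p-value of this kind is a percentile bootstrap on paired differences of squared errors. The sketch below is an assumption about the general technique, not a description of the authors' exact test; the input differences are synthetic placeholders, and 2000 resamples are used instead of 10^4 to keep the example fast.

```python
import random
from statistics import fmean

def bootstrap_mean_difference(diffs, n_resamples=2000, seed=0):
    """Percentile bootstrap for the mean of paired differences.

    Returns (observed mean, (2.5th, 97.5th) percentile interval, one-sided p-value),
    where the p-value is the share of resampled means that are >= 0.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    p_value = sum(m >= 0 for m in means) / n_resamples
    return fmean(diffs), (lower, upper), p_value

# Synthetic placeholder: differences (squared error of estimator 2 minus estimator 1).
data_rng = random.Random(1)
fake_diffs = [data_rng.gauss(-0.05, 0.1) for _ in range(200)]

effect, (ci_low, ci_high), p_value = bootstrap_mean_difference(fake_diffs)
print(effect, (ci_low, ci_high), p_value)
```

A resampling p-value computed this way cannot resolve values below 1 / n_resamples, so very small reported p-values would require a different calculation than the one sketched here.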

Table 3. Input parameters of simulations for constructing true distribution of electron density.

Item	Type	Value/Distribution
Power (Pw1-4)	Fixed	100 [W]
Ratio of O2 at inlet (xO2)	Varied	$N (0.6, {(0.1)}^{2})$
Pressure (p0)	Varied	$N (0.04, {(0.01)}^{2})$ [torr]
Pressure at outlet (outlet_p)	Fixed	0 [torr]

Table 4. Empirical squared errors of estimators (Mean ± STD).

N	Empirical Squared Errors of ${\hat{p}}_{1}$	Empirical Squared Errors of ${\hat{p}}_{2}$
2	9.26 × $10^{- 2}$ ± 1.20 × $10^{- 1}$	5.58 × $10^{- 2}$ ± 1.06 × $10^{- 1}$
4	4.66 × $10^{- 2}$ ± 6.21 × $10^{- 2}$	2.27 × $10^{- 2}$ ± 4.19 × $10^{- 2}$
8	2.40 × $10^{- 2}$ ± 3.34 × $10^{- 2}$	1.12 × $10^{- 2}$ ± 1.72 × $10^{- 2}$
16	1.18 × $10^{- 2}$ ± 1.67 × $10^{- 2}$	6.16 × $10^{- 3}$ ± 8.32 × $10^{- 3}$
32	5.81 × $10^{- 3}$ ± 8.36 × $10^{- 3}$	4.18 × $10^{- 3}$ ± 4.73 × $10^{- 3}$
64	2.86 × $10^{- 3}$ ± 3.99 × $10^{- 3}$	3.48 × $10^{- 3}$ ± 3.24 × $10^{- 3}$
128	1.43 × $10^{- 3}$ ± 2.00 × $10^{- 3}$	3.19 × $10^{- 3}$ ± 2.32 × $10^{- 3}$

Table 5. One-sided hypothesis testing using the bootstrap method (number of resamples =

10^{4}

).

N	One-Sided Alternative Hypothesis	p-Value	Effect Size (Difference in Average)	95% CI of Difference in Squared Error (Percentiles from 2.5 to 97.5)
2	Squared error of ${\hat{p}}_{2}$ is less than that of ${\hat{p}}_{1}$ .	< $10^{- 16}$	−0.0369	[−0.0400, −0.0337]
128	Squared error of ${\hat{p}}_{1}$ is less than that of ${\hat{p}}_{2}$ .	< $10^{- 16}$	−0.00176	[−0.00183, −0.00170]

Table 6. Empirical squared errors of estimators (Mean ± STD).

N	Empirical Squared Errors of ${\hat{p}}_{1}$	Empirical Squared Errors of ${\hat{p}}_{2}$
2	8.81 × $10^{- 2}$ ± 1.19 × $10^{- 1}$	5.07 × $10^{- 2}$ ± 8.96 × $10^{- 2}$
4	4.41 × $10^{- 2}$ ± 6.06 × $10^{- 2}$	2.34 × $10^{- 2}$ ± 3.53 × $10^{- 2}$
8	2.26 × $10^{- 2}$ ± 3.14 × $10^{- 2}$	1.31 × $10^{- 2}$ ± 1.50 × $10^{- 2}$
16	1.12 × $10^{- 2}$ ± 1.57 × $10^{- 2}$	8.29 × $10^{- 3}$ ± 8.82 × $10^{- 3}$
32	5.53 × $10^{- 3}$ ± 8.01 × $10^{- 3}$	6.10 × $10^{- 3}$ ± 5.97 × $10^{- 3}$
64	2.71 × $10^{- 3}$ ± 3.81 × $10^{- 3}$	4.90 × $10^{- 3}$ ± 4.20 × $10^{- 3}$
128	1.34 × $10^{- 3}$ ± 1.86 × $10^{- 3}$	4.35 × $10^{- 3}$ ± 3.05 × $10^{- 3}$

Table 7. One-sided hypothesis testing using the bootstrap method (number of resamples =

10^{4}

).

N	One-Sided Alternative Hypothesis	p-Value	Effect Size (Difference in Average)	95% CI of Difference in Squared Error (Percentiles from 2.5 to 97.5)
2	Squared error of ${\hat{p}}_{2}$ is less than that of ${\hat{p}}_{1}$ .	< $10^{- 16}$	−0.0373	[−0.0402, −0.0344]
128	Squared error of ${\hat{p}}_{1}$ is less than that of ${\hat{p}}_{2}$ .	< $10^{- 16}$	−0.00301	[−0.00309, −0.00294]

Table 8. Yields suggested by BO methods as iterations progressed.

Percentile	25		50		75
Method	BO DIST	BO PROB	BO DIST	BO PROB	BO DIST	BO PROB
Iteration
4	0.501	0.429	0.528	0.501	0.539	0.524
9	0.536	0.482	0.542	0.511	0.543	0.524
14	0.529	0.476	0.539	0.496	0.541	0.524
19	0.525	0.484	0.538	0.500	0.541	0.528
24	0.534	0.482	0.540	0.516	0.541	0.534
29	0.531	0.479	0.540	0.498	0.542	0.534
34	0.534	0.497	0.540	0.526	0.542	0.534
39	0.536	0.500	0.540	0.522	0.542	0.539
44	0.531	0.497	0.540	0.522	0.542	0.541
49	0.535	0.497	0.542	0.525	0.543	0.535

Table 9. Domain of input parameters used to generate training data for Random Forest-based surrogate model of ICP chamber.

Item	Range
Coil Power (PW1)	10 [W]
Coil Powers (PW2, PW3)	(0, 300) [W]
Coil Power (PW4)	250 [W]
Ratio of O2 at inlet (xO2)	(0.7, 0.9)
Pressure (p0)	(0.01, 0.02) [torr]
Pressure at outlet (outlet_p)	(0, 0.02) [torr]

Table 10. Input parameters of the simulations used to construct the true distribution of electron densities and temperatures.

Item	Type	Domain/Value/Distribution
Power1 (Pw1)	Fixed	10 [W]
Power2 (Pw2)	Control	10–250 [W]
Power3 (Pw3)	Control	10–250 [W]
Power4 (Pw4)	Fixed	250 [W]
Ratio of O2 at inlet (xO2)	Randomly Varied	$N (0.8, {(0.01)}^{2})$
Pressure (p0)	Randomly Varied	$N (0.015, {(0.0005)}^{2})$ [torr]
Pressure at the outlet (outlet_p)	Randomly Varied	$N (0.01, {(0.0005)}^{2})$ [torr]

Table 11. Specification of electron densities and temperatures.

Item	Range
Electron density $(1 / m^{3})$	$(1.8 \times 10^{16}, 2.0 \times 10^{16})$
Electron temperature (eV)	$(3.8, 4.0)$

Table 12. Correspondence between surrogate model experiments and real experiments.

Aspect	Surrogate Model	Real Experiments
Control variables/operating windows	Coils (see Table 3 for details)	Same conditions
Measurement cost/response lag	Negligible; fast surrogate (0.1–1 [s] per evaluation on a workstation)	Sensor measurements while processing wafers, with repeated wafer processing (tens of minutes to hours)
Yield metrics	Probability that both electron density and temperature are within specification limits	Same metrics

Table 13. Yields suggested by BO methods for each iteration.

Iteration	BO DIST (25th percentile)	BO PROB (25th percentile)	BO DIST (50th percentile)	BO PROB (50th percentile)	BO DIST (75th percentile)	BO PROB (75th percentile)
4	0.283	0.297	0.426	0.335	0.621	0.365
9	0.479	0.282	0.507	0.336	0.689	0.381
14	0.507	0.340	0.621	0.369	0.689	0.492
19	0.455	0.371	0.587	0.447	0.638	0.492
24	0.483	0.356	0.507	0.447	0.621	0.524
29	0.503	0.356	0.621	0.424	0.638	0.493
34	0.507	0.356	0.587	0.471	0.638	0.510
39	0.507	0.382	0.621	0.492	0.621	0.621
44	0.507	0.387	0.507	0.471	0.621	0.570
49	0.507	0.356	0.621	0.450	0.621	0.621
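
A percentile summary of this kind can be produced as sketched below, assuming the percentiles are taken across independent repetitions of each BO run at a given iteration (the table itself does not state this). The function name `summarize_runs` and the example numbers are placeholders and are not data from the paper.

```python
from statistics import quantiles


def summarize_runs(yields_by_iteration):
    """Map iteration -> (25th, 50th, 75th) percentile across repeated runs."""
    summary = {}
    for iteration, yields in sorted(yields_by_iteration.items()):
        # method="inclusive" interpolates linearly between order statistics.
        q25, q50, q75 = quantiles(yields, n=4, method="inclusive")
        summary[iteration] = (q25, q50, q75)
    return summary


# Placeholder yields for five hypothetical runs (NOT data from the paper).
example = {
    4: [0.10, 0.20, 0.30, 0.40, 0.50],
    9: [0.30, 0.35, 0.40, 0.45, 0.50],
}
for iteration, (q25, q50, q75) in summarize_runs(example).items():
    print(f"{iteration}\t{q25:.3f}\t{q50:.3f}\t{q75:.3f}")
```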

Table 14. Specifications of electron densities and temperatures used in the sensitivity study on the strictness of the specification.

Item	STRICT	LENIENT
Electron density $(1/\mathrm{m}^{3})$	$(1.8 \times 10^{16}, 1.9 \times 10^{16})$	$(1.8 \times 10^{16}, 2.1 \times 10^{16})$
Electron temperature (eV)	$(3.8, 3.9)$	$(3.8, 4.1)$
