1. Introduction
A/B testing is a statistical strategy for comparing two or more variants to see which one outperforms the others. Participants in an A/B test are assigned to different groups at random and are exposed to different versions (A and B, say). The observed outcomes are then analyzed to determine whether any of the differences are statistically significant. A/B testing helps companies and researchers make choices about product features, marketing strategies, or user experiences based on data, leading to better results supported by real-world evidence. For example, ref. [1] provided insights into how Microsoft utilized online controlled experiments (A/B tests) to make data-driven decisions in product development, and ref. [2] used A/B testing alongside an evaluation of users’ mental models to improve the user experience of a Japanese language mobile learning application. Additionally, many companies regularly share case studies and blog posts detailing how A/B testing has helped them make decisions and obtain better results.
When binary data are available, A/B testing serves as a powerful method to compare two proportions across diverse domains, including applications in mobile gaming and marketing strategies. In the context of mobile gaming, A/B testing can be employed to compare different game features or user interfaces, aiming to identify the version that leads to a higher proportion of player engagement, retention, or in-app purchases. For example, in our study, we will apply A/B testing to analyze the mobile game called Cookie Cats. In this game, players progress through various stages and encounter gates that require them to either wait a certain amount of time or make an in-app purchase to continue. These gates serve a dual purpose: they drive in-app sales and provide players with a break, increasing and extending their enjoyment of the game. The placement of these gates is crucial for maintaining player satisfaction and maximizing retention. We will examine an A/B test in Cookie Cats where we changed the first gate from level 30 to level 40. Our analysis will focus on how this modification affects player retention.
Similarly, A/B testing is highly applicable in marketing strategies, such as evaluating the effectiveness of online advertising on outcomes such as sales or user engagement. A large company with an established user base, for example, may aim to increase sales through targeted advertisements. To determine the effectiveness of these ads, we will investigate an A/B test where users are divided into two groups: a control group that does not receive the advertisements, and a test group that does. By comparing sales performance between the two groups over a specified period, we will evaluate whether exposure to advertisements leads to a significant boost in sales. This data-driven approach enables the company to make informed decisions about its advertising strategy, ensuring that resources are allocated to campaigns that produce measurable results.
In both the mobile gaming and online advertising examples, where A/B testing is utilized as a pivotal tool for refining user experiences and optimizing strategies, the subsequent step of calculating confidence intervals for the proportions under estimation becomes equally important. Once A/B testing identifies the version that yields a superior outcome, the application of confidence intervals provides a quantitative measure of the precision and reliability of those estimates.
Confidence intervals play a fundamental role in the interpretation of proportions in statistical analysis. Estimating the percentage of website users who click a link is an example of a proportion. In two-proportion settings such as A/B testing, confidence intervals quantify the uncertainty about observed differences between groups. These intervals help analysts evaluate their estimates and judge both the statistical significance and the practical relevance of observed effects, improving the interpretation of research and experiments.
It is safe to say that there is a vast number of articles on confidence intervals for one and two proportions. One of the most commonly used methods is the Wald confidence interval, which relies on the normal approximation to the binomial distribution. Despite its simplicity, the Wald interval is widely criticized for its poor performance, particularly with small sample sizes or sample proportions near 0 or 1, where it often exhibits undercoverage. Ref. [3] introduced an alternative interval, called the Wilson interval, which improves upon the Wald interval by addressing its coverage issues; it performs better, especially for small sample sizes and extreme proportions. Ref. [4] proposed the Clopper–Pearson interval, an exact interval based on inverting the equal-tail binomial test. Although it guarantees the nominal coverage probability, this interval tends to be overly conservative and can produce unnecessarily wide intervals. Ref. [5] proposed the so-called “plus four” interval, which adjusts the Wald interval by adding two successes and two failures to the observed counts, improving its performance. Ref. [6] compared seven methods for constructing two-sided confidence intervals for a single proportion, providing a detailed analysis. Ref. [7] conducted a comprehensive evaluation of various intervals focusing on their coverage probability and width, which was further supported and complemented by their subsequent work [8]. Additionally, ref. [9] obtained the smallest confidence intervals for a proportion in the sense of set inclusion. As for the difference between two independent proportions, classical confidence intervals can often be constructed by inverting the corresponding hypothesis tests. For example, the Wald interval can be derived from Goodman’s test [10], and the score interval is based on inverting the test with the standard errors evaluated under the null hypothesis of equal proportions. Ref. [11] compared eleven methods and introduced a hybrid interval that uses information from the respective Wilson intervals for the two proportions. Ref. [12] extended the “plus four” approach to the two-proportion case. Ref. [13] applied an Edgeworth expansion to the Studentized difference of two binomial proportions, proposing two new intervals that correct the skewness in the Edgeworth expansion. Ref. [14] put forward a “recentered” confidence interval with strong overall performance. Most recently, ref. [15] presented an optimal exact confidence interval with uniformly superior performance in terms of infimum coverage probability and total interval width.
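To make the classical one-proportion intervals discussed above concrete, the following is a brief illustrative sketch (not from any of the cited papers; a 95% confidence level and the function names are our own choices) of the Wald, Wilson, and “plus four” intervals, showing the boundary behavior that motivates the criticisms of the Wald interval:

```python
from math import sqrt

Z = 1.959963984540054  # upper 2.5% point of N(0, 1), for 95% confidence

def wald_ci(x, n):
    """Wald interval: normal approximation centered at the sample proportion."""
    p = x / n
    h = Z * sqrt(p * (1 - p) / n)
    return (p - h, p + h)

def wilson_ci(x, n):
    """Wilson score interval: inverts the score test; behaves well near 0 or 1."""
    p = x / n
    denom = 1 + Z**2 / n
    center = (p + Z**2 / (2 * n)) / denom
    h = (Z / denom) * sqrt(p * (1 - p) / n + Z**2 / (4 * n**2))
    return (center - h, center + h)

def plus_four_ci(x, n):
    """"Plus four" (Agresti-Coull) interval: add two successes and two failures."""
    p = (x + 2) / (n + 4)
    h = Z * sqrt(p * (1 - p) / (n + 4))
    return (p - h, p + h)

# With 0 successes in 20 trials, the Wald interval degenerates to a single
# point, while Wilson and plus-four still produce informative intervals.
print(wald_ci(0, 20))       # (0.0, 0.0)
print(wilson_ci(0, 20))
print(plus_four_ci(0, 20))
```

The degenerate Wald output for an all-failure sample is exactly the undercoverage pathology noted above.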
It is worth mentioning that the aforementioned confidence intervals are all constructed based on fixed-size samples. However, in many statistical inference problems, fixed-sample-size methods are infeasible when predetermined accuracy requirements, such as fixed-width confidence intervals, must be met. Sequential or multistage sampling, which adjusts the sample size dynamically according to the collected data, becomes essential for achieving the desired accuracy. These methods are especially valuable in fields requiring prompt decisions, such as clinical trials and manufacturing quality control, enabling timely and accurate estimates while minimizing resource and time expenditure. In the context of sequential confidence intervals for one and two proportions, ref. [16] proposed asymptotically optimal sequential and two-stage procedures to construct confidence intervals of fixed width and confidence level for a Bernoulli success probability p. Ref. [17] developed and compared four exact sequential methods for obtaining fixed-width confidence intervals for p. Ref. [18] considered the construction of fixed-width confidence intervals using sequential sampling and carried out a simulation study. Ref. [19] introduced a novel approach with a tandem-width confidence interval for a Bernoulli proportion. Ref. [20] proposed leveraging importance sampling to calculate confidence intervals that almost always guarantee the specified coverage. Ref. [21] optimized sampling costs while achieving prescribed interval widths, accounting for varying observation costs between distributions. Furthermore, ref. [22] analyzed the achieved coverage and explored the trade-offs between the number of observations and the number of stages needed to achieve the desired width of the confidence interval.
In this article, we explore sequential confidence intervals for comparing two proportions, with a focus on A/B testing applications. We emphasize the importance of confidence intervals in interpreting A/B testing results within the mobile gaming and online advertising sectors. The article is organized as follows: Section 2 introduces fixed-width confidence intervals for the logarithm of the ratio of two proportions and presents simulated studies to validate these methods; Section 3 extends this discussion to fixed-width confidence intervals for the logarithm of the odds ratio, again supported by simulated studies; in Section 4, we apply these statistical tools to A/B testing in mobile gaming through a case study of the game Cookie Cats, highlighting the practical impact of gate placement on player retention; in Section 5, we present a second case study evaluating the effectiveness of online advertising aimed at increasing sales; finally, Section 6 concludes the article by summarizing the findings and underscoring the significance of sequential confidence intervals in facilitating data-driven decision-making via A/B testing.
2. Sequential Confidence Intervals for the Ratio of Two Proportions
Suppose we are interested in some common characteristic, referred to as success, possessed by two independent dichotomous populations, say X and Y. The success probabilities are denoted by p_X and p_Y, respectively, where 0 < p_X, p_Y < 1. Our goal is to compare their magnitudes and determine whether one is significantly greater than the other.
Assume that we have collected random samples X_1, ..., X_m and Y_1, ..., Y_n from X and Y, respectively, where the sample sizes m and n are not necessarily the same. Then, the X_i’s are independent and identically distributed (i.i.d.) Bernoulli(p_X) random variables, and the sample proportion S_m/m, with S_m = X_1 + ... + X_m, serves as an unbiased estimator of p_X. However, even though 0 < p_X < 1, there is a positive probability that S_m/m equals 0 or 1 in a given sample. To avoid this issue, we adopt the “plus four” idea introduced in [5,12], leading to the following biased but consistent estimator:

p̂_X,m = (S_m + 1)/(m + 2).    (1)

This estimator can be treated as a Bayes estimator with a uniform Beta(1, 1) prior, or as a weighted average of S_m/m (the sample proportion) and 1/2 (a naïve estimator of p_X). Notably, p̂_X,m is always strictly between 0 and 1. Similarly, we estimate p_Y by

p̂_Y,n = (T_n + 1)/(n + 2), with T_n = Y_1 + ... + Y_n.    (2)
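As a quick sketch (the function name is ours, and we assume the adjusted estimator takes the form (S + 1)/(m + 2) described above), the shrinkage estimate never touches the boundary values 0 or 1:

```python
def plus_one_estimate(successes, trials):
    """Shrinkage estimate (S + 1)/(m + 2): the posterior mean under a
    uniform Beta(1, 1) prior; always strictly inside (0, 1)."""
    return (successes + 1) / (trials + 2)

# Even an all-failure or all-success sample yields a usable interior estimate:
print(plus_one_estimate(0, 10))   # 1/12, not 0
print(plus_one_estimate(10, 10))  # 11/12, not 1
```

This interior-point property is what makes the variance estimators of the sequential procedure below well defined with probability one.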
To compare the magnitudes of p_X and p_Y, we construct a confidence interval for the ratio ρ = p_X/p_Y (or a monotonic function of ρ) with some prescribed accuracy. As ρ is always positive, we apply the log transformation on it, and the resulting quantity ln ρ = ln p_X − ln p_Y takes values on (−∞, ∞). According to the central limit theorem and the delta method, we find that, for each population,

sqrt(m) (ln p̂_X,m − ln p_X) → N(0, σ_X²) in distribution    (3)

as m → ∞, where σ_X² = (1 − p_X)/p_X, and analogously for Y with σ_Y² = (1 − p_Y)/p_Y. For sufficiently large m and n, we have the following approximate normality of the difference in ln p̂_X,m and ln p̂_Y,n:

ln(p̂_X,m/p̂_Y,n) − ln(p_X/p_Y) ≈ N(0, σ_X²/m + σ_Y²/n),    (4)

where W ≈ F represents that the random variable W is approximately distributed as F. This can be used to construct a large-sample approximate confidence interval for ln ρ to compare p_X and p_Y. For the sake of estimation precision, we pre-specify both the confidence level 1 − α and the interval width 2d. Such an interval is then referred to as a fixed-width confidence interval (FWCI). That is, with prefixed α and d, we consider the confidence interval of the form

J_m,n = [ln(p̂_X,m/p̂_Y,n) − d, ln(p̂_X,m/p̂_Y,n) + d],    (5)

which satisfies

P(ln ρ ∈ J_m,n) ≥ 1 − α.    (6)

Here, ln(p̂_X,m/p̂_Y,n) serves as a point estimator for ln ρ, and d can be interpreted as the half-width of the interval (or, the margin of error).

It is clear that the FWCI J_m,n for ln ρ given by (5) is equivalent to the following interval for the ratio of two proportions ρ:

[(p̂_X,m/p̂_Y,n) e^{−d}, (p̂_X,m/p̂_Y,n) e^{d}],

which is then called a fixed-accuracy confidence interval (FACI) with e^{d} being the accuracy parameter.
Next, we set out to determine the minimum sample sizes needed to meet the fixed width and coverage probability requirements. Define

C = z_{α/2}²/d²,    (7)

where z_{α/2} is the upper α/2 point of a standard normal distribution. From (6), the required sample size in total, m + n, must satisfy that

σ_X²/m + σ_Y²/n ≤ 1/C.    (8)

To minimize the total sample size, applying the Cauchy–Schwarz inequality yields

m + n ≥ C (σ_X + σ_Y)²,    (9)

with equality when m/n = σ_X/σ_Y. In this sense, we can specify the optimal sample sizes of m, n, and m + n as follows:

n_X* = C σ_X (σ_X + σ_Y),  n_Y* = C σ_Y (σ_X + σ_Y),  and  n* = n_X* + n_Y* = C (σ_X + σ_Y)².    (10)

We tacitly disregard the fact that n_X*, n_Y*, or n* may not be an integer.
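A small sketch of this allocation (the function name is ours; it assumes the delta-method standard deviations σ = sqrt((1 − p)/p) used above) computes the optimal sizes directly:

```python
from math import sqrt

def optimal_sizes(p_x, p_y, d, z=1.959963984540054):
    """Optimal sample sizes for a fixed-width CI of ln(p_x / p_y).

    Assumes the delta-method variance (1 - p)/(m p) for ln of the estimated
    proportion, so the half-width requirement z*sqrt(s_x^2/m + s_y^2/n) <= d
    with m + n minimized forces m/n = s_x/s_y (Cauchy-Schwarz equality).
    """
    s_x = sqrt((1 - p_x) / p_x)
    s_y = sqrt((1 - p_y) / p_y)
    c = (z / d) ** 2
    n_x = c * s_x * (s_x + s_y)   # observations from X
    n_y = c * s_y * (s_x + s_y)   # observations from Y
    return n_x, n_y, n_x + n_y

n_x, n_y, total = optimal_sizes(0.5, 0.4, 0.2)
print(round(n_x), round(n_y), round(total))
```

At these sizes the half-width constraint holds with equality, which is the sense in which the allocation is optimal.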
Since p_X and p_Y are two unknown parameters, it is essential to estimate σ_X and σ_Y by updating their estimators at every stage as necessary. Beginning with pilot samples X_1, ..., X_{m_0} from X and Y_1, ..., Y_{n_0} from Y, we propose the following sequential estimation procedure with the associated stopping rule given by

(N_X, N_Y): stop sampling as soon as m + n ≥ C (σ̂_X,m + σ̂_Y,n)²,    (11)

where N_X and N_Y indicate the numbers of observations that are taken from X and Y, respectively, and for m ≥ m_0 and n ≥ n_0,

σ̂_X,m = sqrt((1 − p̂_X,m)/p̂_X,m)  and  σ̂_Y,n = sqrt((1 − p̂_Y,n)/p̂_Y,n),

with p̂_X,m and p̂_Y,n defined in (1)–(2). By utilizing the “plus four” adjustment, 0 < p̂_X,m, p̂_Y,n < 1, so that (σ̂_X,m, σ̂_Y,n) is well defined with probability one (w.p.1). Suppose that at some point, we have gathered m and n observations from X and Y, respectively, but the stopping rule (11) is not satisfied, which implies that we should continue sampling. The question is from which population we are going to take the next observation. According to the equality condition for (9), we propose the following allocation scheme:

take the next observation from X if m/n < σ̂_X,m/σ̂_Y,n, and from Y otherwise.    (12)
This sequential estimation procedure (11)–(12) can be summarized in Algorithm 1. It is implemented as follows. With the pilot samples, if m_0 + n_0 ≥ C(σ̂_X,m_0 + σ̂_Y,n_0)² has already been satisfied, we do not take any additional observations, and the final sample size is (N_X, N_Y) = (m_0, n_0). Otherwise, we compare m/n with σ̂_X,m/σ̂_Y,n, and pick the next observation as per (12). After obtaining the updated p̂_X,m or p̂_Y,n, we check with the boundary crossing condition (11). This process is repeated until the stopping event happens for the first time. By referring to [23], we can claim that P(N_X < ∞) = P(N_Y < ∞) = 1, which shows that the procedure will stop w.p.1. Finally, with the fully accrued data X_1, ..., X_{N_X}, Y_1, ..., Y_{N_Y}, we construct the FWCI

J_{N_X,N_Y} = [ln(p̂_X,N_X/p̂_Y,N_Y) − d, ln(p̂_X,N_X/p̂_Y,N_Y) + d]    (13)

for ln(p_X/p_Y). If the interval J_{N_X,N_Y} contains zero, we conclude that there is no significant difference in p_X and p_Y at a pre-specified level of α; and if J_{N_X,N_Y} contains only positive (negative) values, we conclude that p_X > p_Y (p_X < p_Y) at level α.
Algorithm 1: Sequential sampling strategy (11) and allocation scheme (12)
1  Take pilot samples X_1, ..., X_{m_0} and Y_1, ..., Y_{n_0};
2  Assign m as m_0 and n as n_0;
3  while m + n < C (σ̂_X,m + σ̂_Y,n)² do
4    if m/n ≥ σ̂_X,m/σ̂_Y,n then
5      Collect one additional observation from Y;
6      Update n as n + 1;
7    else
8      Collect one additional observation from X;
9      Update m as m + 1;
10   end
11 end
12 return N_X = m, N_Y = n, and the FWCI J_{N_X,N_Y} in (13).
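For concreteness, the following is a runnable Python sketch of Algorithm 1 under the plug-in assumptions spelled out above (shrinkage estimates (S + 1)/(m + 2), stopping rule m + n ≥ (z_{α/2}/d)²(σ̂_X + σ̂_Y)², and allocation by comparing m/n with σ̂_X/σ̂_Y); all names are our own, not from any reference implementation:

```python
import random
from math import log, sqrt

Z = 1.959963984540054  # upper 2.5% point of N(0, 1), i.e., alpha = 0.05

def shrink(successes, trials):
    """Shrinkage estimate (S + 1)/(m + 2), strictly inside (0, 1)."""
    return (successes + 1) / (trials + 2)

def sequential_fwci(draw_x, draw_y, d, m0=20, n0=20):
    """Run the sequential procedure; return (m, n, lower, upper).

    draw_x / draw_y each yield one Bernoulli observation (0/1); d is the
    prescribed half-width; m0, n0 are pilot sample sizes.  Stopping rule:
    m + n >= (Z/d)^2 (s_x + s_y)^2; allocation: observe X next when
    m/n < s_x/s_y, else Y.
    """
    c = (Z / d) ** 2
    sx = sum(draw_x() for _ in range(m0))
    sy = sum(draw_y() for _ in range(n0))
    m, n = m0, n0
    while True:
        p_x, p_y = shrink(sx, m), shrink(sy, n)
        s_x = sqrt((1 - p_x) / p_x)
        s_y = sqrt((1 - p_y) / p_y)
        if m + n >= c * (s_x + s_y) ** 2:   # boundary crossing: stop
            break
        if m / n < s_x / s_y:               # X side is under-sampled
            sx += draw_x(); m += 1
        else:
            sy += draw_y(); n += 1
    center = log(p_x / p_y)                 # point estimate of ln(p_X/p_Y)
    return m, n, center - d, center + d

random.seed(2024)
m, n, lo, hi = sequential_fwci(lambda: random.random() < 0.5,
                               lambda: random.random() < 0.4, d=0.2)
print(m, n, (round(lo, 3), round(hi, 3)))  # interval width is exactly 2d
```

If the printed interval lies entirely above zero, the run concludes p_X > p_Y at level α = 0.05, exactly as described after (13).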
The sequential estimation procedure (11)–(12) enjoys the following efficiency properties, as summarized in Theorem 1.
Theorem 1. Under the sequential sampling strategy (11) and the allocation scheme (12), with p_X, p_Y, and α fixed, as d → 0, we have:

N_X/n_X* → 1, N_Y/n_Y* → 1, and N/n* → 1 w.p.1, where N = N_X + N_Y;    (26)

P(ln(p_X/p_Y) ∈ J_{N_X,N_Y}) → 1 − α;    (27)

where n_X*, n_Y*, and n* come from (10), and J_{N_X,N_Y} comes from (13).
Proof. One can easily find that the associated stopping rule (11) and allocation scheme (12) are similar to the sampling rule of [24]. Their techniques can be applied here to justify both (26) and (27), so we omit the proof for brevity. One may refer to [25], Chapter 13 of [26], and other sources for many details. □
2.1. Simulated Studies
To investigate the performance of our proposed sequential estimation procedure (11)–(12), we have conducted an extensive set of Monte Carlo simulations. For illustrative purposes, we first present the results under the following settings: X and Y are Bernoulli populations with distinct success probabilities p_X > p_Y; the level α is fixed to be 0.05 so that the confidence level is 95%; the pilot sample sizes are both set to 20; and a wide range of equally spaced values of d (the half-width) is taken into account. For each configuration, we have run the simulation 10,000 times, and we summarize the findings in Table 1. We have recorded the three optimal sample sizes n_X*, n_Y*, and n*; the three average final sample sizes N̄_X, N̄_Y, and N̄, along with their standard deviations; and the three ratios N̄_X/n_X*, N̄_Y/n_Y*, and N̄/n*. In the second-to-last column, the coverage proportion is the proportion of confidence intervals that successfully capture the parameter under estimation, which is to be compared with the confidence level. In the last column, Power refers to the proportion of confidence intervals that successfully identify p_X > p_Y, that is, the proportion of confidence intervals containing positive values alone. Note that this “power” only provides a conservative estimate, because we are using a two-sided confidence interval to help make a one-sided conclusion.
From Table 1, we find that the three ratios N̄_X/n_X*, N̄_Y/n_Y*, and N̄/n* are all slightly below 1. However, as d decreases, the three ratios get closer and closer to 1, which empirically verifies (26) in Theorem 1. The coverage proportions are all around 95%, verifying (27). We also observe that Power increases rapidly as d decreases. At the largest d value considered, one is able to conclude p_X > p_Y at least 97.75% of the time; and as d decreases further, this rate increases to 100%. This indicates that our proposed sequential estimation procedure (11)–(12) can help identify which proportion is larger when there does exist a difference in the two proportions for small d values.
In Table 1, we have considered a scenario with p_X > p_Y. Since the optimal sample size n_X* or n_Y* is not symmetric about 1/2, we have also carried out a set of simulations with the success probabilities replaced by their complements 1 − p_X and 1 − p_Y, with a wide range of d from 0.20 to 0.05 considered. The findings are displayed in Table 2. There is little to no difference in the performance compared to that summarized in Table 1.
Finally, we investigate the performance of the proposed sequential estimation procedure (11)–(12) when the two proportions are identical. In particular, we have conducted simulations under equal success probabilities p_X = p_Y, with d varying from 0.7 to 0.1. This time, note that Power has the same definition as the coverage proportion, so we combine the last two columns of Table 1 and Table 2 and rename the combined column “Coverage/Power”. The findings are summarized in Table 3, which again validate Theorem 1. We leave out many details for brevity.
3. Sequential Confidence Intervals for the Odds Ratio of Two Proportions
In Section 2, the log transformation resulted in an FWCI for ln(p_X/p_Y), the log of the ratio of two proportions. In this section, we consider the logit transformation, which helps to construct an FWCI for ln[p_X(1 − p_Y)/((1 − p_X)p_Y)], the log of the odds ratio of two proportions.
As before, we continue to use p̂_X,m and p̂_Y,n, defined in (1)–(2), as the point estimators of p_X and p_Y, respectively. By the central limit theorem and the delta method, for each population,

sqrt(m) [logit(p̂_X,m) − logit(p_X)] → N(0, 1/(p_X(1 − p_X))) in distribution

as m → ∞, and analogously for Y. For large enough m and n, we then have

logit(p̂_X,m) − logit(p̂_Y,n) − [logit(p_X) − logit(p_Y)] ≈ N(0, 1/(m p_X(1 − p_X)) + 1/(n p_Y(1 − p_Y))),

which can be used to construct a large-sample approximate confidence interval for the log odds ratio to compare p_X and p_Y. With the prefixed half-width d, we consider the FWCI given by

J′_m,n = [logit(p̂_X,m) − logit(p̂_Y,n) − d, logit(p̂_X,m) − logit(p̂_Y,n) + d],    (18)

where logit(p̂_X,m) − logit(p̂_Y,n) serves as a point estimator for the log odds ratio. It should further satisfy that

P(ln[p_X(1 − p_Y)/((1 − p_X)p_Y)] ∈ J′_m,n) ≥ 1 − α,    (19)

where the confidence level 1 − α is also prefixed.

Clearly, the FWCI J′_m,n for the log odds ratio given by (18) is equivalent to the following FACI for the odds ratio p_X(1 − p_Y)/((1 − p_X)p_Y):

[ÔR_m,n e^{−d}, ÔR_m,n e^{d}],  where ÔR_m,n = p̂_X,m(1 − p̂_Y,n)/((1 − p̂_X,m)p̂_Y,n).
Define τ_X = 1/sqrt(p_X(1 − p_X)) and τ_Y = 1/sqrt(p_Y(1 − p_Y)). From (19), the required sample size in total, m + n, must satisfy that

τ_X²/m + τ_Y²/n ≤ 1/C,

where C is defined in (7). Apply the Cauchy–Schwarz inequality, and we obtain

m + n ≥ C (τ_X + τ_Y)²,

with equality when m/n = τ_X/τ_Y. In the same fashion as (10), the optimal sample sizes are

n_X* = C τ_X (τ_X + τ_Y),  n_Y* = C τ_Y (τ_X + τ_Y),  and  n* = C (τ_X + τ_Y)².    (22)
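Mirroring the earlier sketch for the log-ratio case (the function name is ours; the delta-method variance 1/(m p(1 − p)) is assumed), the optimal sizes under the logit transform can be computed as follows. Note that 1/sqrt(p(1 − p)) is unchanged when p is replaced by 1 − p, so these sizes are symmetric about 1/2:

```python
from math import sqrt

def optimal_sizes_logit(p_x, p_y, d, z=1.959963984540054):
    """Optimal sample sizes for a fixed-width CI of the log odds ratio.

    Assumes the delta-method variance 1/(m p (1 - p)) for the logit of the
    estimated proportion; the Cauchy-Schwarz argument then mirrors the
    log-ratio case with t = 1/sqrt(p(1 - p)) in place of sqrt((1 - p)/p).
    """
    t_x = 1 / sqrt(p_x * (1 - p_x))
    t_y = 1 / sqrt(p_y * (1 - p_y))
    c = (z / d) ** 2
    return c * t_x * (t_x + t_y), c * t_y * (t_x + t_y)

nx, ny = optimal_sizes_logit(0.5, 0.4, 0.2)
print(round(nx), round(ny))  # notably larger than the log-ratio sizes
```

The larger required sizes for the same d foreshadow the empirical comparison of the two procedures reported below.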
Again, it is essential to estimate the unknown τ_X and τ_Y by updating their estimators at every stage as necessary. In the spirit of (11)–(12), we propose the following sequential estimation procedure with the associated stopping rule and allocation scheme given by

(N_X, N_Y): stop sampling as soon as m + n ≥ C (τ̂_X,m + τ̂_Y,n)²,    (23)

with m ≥ m_0 and n ≥ n_0, and

take the next observation from X if m/n < τ̂_X,m/τ̂_Y,n, and from Y otherwise,    (24)

where τ̂_X,m = 1/sqrt(p̂_X,m(1 − p̂_X,m)) and τ̂_Y,n = 1/sqrt(p̂_Y,n(1 − p̂_Y,n)). Since 0 < p̂_X,m, p̂_Y,n < 1, the pair (τ̂_X,m, τ̂_Y,n) is well defined w.p.1. The implementation of the sequential estimation procedure (23)–(24) is analogous to that of the procedure (11)–(12), and sampling will terminate w.p.1. After having collected the full data X_1, ..., X_{N_X}, Y_1, ..., Y_{N_Y}, we construct the FWCI

J′_{N_X,N_Y} = [logit(p̂_X,N_X) − logit(p̂_Y,N_Y) − d, logit(p̂_X,N_X) − logit(p̂_Y,N_Y) + d]    (25)

for the log odds ratio. If the interval J′_{N_X,N_Y} contains zero, we conclude that there is no significant difference in p_X and p_Y at level α; and if J′_{N_X,N_Y} contains only positive (negative) values, we conclude that p_X > p_Y (p_X < p_Y) at level α.
In the spirit of Theorem 1, we state the efficiency properties enjoyed by the sequential estimation procedure (23)–(24) in the following theorem.
Theorem 2. Under the sequential sampling strategy (23) and the allocation scheme (24), with p_X, p_Y, and α fixed, as d → 0, we have N_X/n_X* → 1, N_Y/n_Y* → 1, and N/n* → 1 w.p.1, as well as P(ln[p_X(1 − p_Y)/((1 − p_X)p_Y)] ∈ J′_{N_X,N_Y}) → 1 − α, where n_X*, n_Y*, and n* come from (22), and J′_{N_X,N_Y} comes from (25).
Proof. The proof is the same as that of Theorem 1, and is thus omitted for brevity. □
Simulated Studies
To investigate the performance of the sequential estimation procedure (23)–(24), we have conducted an extensive set of Monte Carlo simulations in the same fashion as in Section 2.1. With the confidence level 95% and the same pilot sample sizes as before, we have considered the following two scenarios: (i) X and Y are Bernoulli populations with distinct success probabilities p_X > p_Y; and (ii) X and Y are Bernoulli populations with an identical success probability. We exclude the case in which the success probabilities are replaced by their complements 1 − p_X and 1 − p_Y, since the optimal sample sizes n_X* and n_Y* are both symmetric about 1/2. The corresponding simulated results are summarized in Table 4 and Table 5.
Comparing the simulated results implementing the sequential estimation procedures (11)–(12) under the log transformation and (23)–(24) under the logit transformation, we find little to no difference in terms of coverage probability and power performance. However, distinctions emerge regarding the sample size needed: for the same d value, the former procedure requires a smaller sample size, indicating greater efficiency; conversely, the latter procedure presents a smaller standard deviation in sample size, suggesting lower variability and greater “robustness”. For clarity and brevity, Figure 1 illustrates a visual comparison of the terminal sample sizes obtained from implementing the sequential estimation procedures (11)–(12) and (23)–(24) under a representative pair of Bernoulli success probabilities.
4. Mobile Games A/B Testing
Now, we are in a position to revisit the mobile games A/B testing problem described in Section 1. To illustrate the application, we analyze a dataset collected from the Kaggle platform, accessed on 3 March 2024 (https://www.kaggle.com/code/yufengsui/datacamp-project-mobile-games-a-b-testing/notebook), referred to as the Cookie Cats data. The dataset contains information on over 90,000 users of the mobile puzzle game Cookie Cats, developed by Tactile Entertainment. The following variables are included.
userid: A unique label that identifies each user.
version: The version of the game the user played, either with the first gate at level 30 (gate_30, version A) or at level 40 (gate_40, version B).
sum_gamerounds: The total number of game rounds played by the user during the first week after installation.
retention_1: A binary indicator of whether the user returned to play the game one day after installation (True) or not (False).
retention_7: A binary indicator of whether the user returned to play the game seven days after installation (True) or not (False).
Our primary focus is on the variable retention_7, which measures 7-day retention. This metric is used to determine which version of the game more successfully retains users.
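As a hedged sketch of how such records could be processed (the helper function and the demo rows are ours; only the column names version and retention_7 and the values gate_30/gate_40 come from the dataset's schema), plus-four-style retention estimates and a half-width-d interval for the log retention-rate ratio might be computed as follows:

```python
from math import log

def retention_fwci(rows, d=0.1):
    """Per-version 7-day retention estimates, plus a fixed-width interval
    for ln(p_A / p_B) with half-width d.

    `rows` are dicts carrying the dataset's `version` and `retention_7`
    fields; in practice they would come from csv.DictReader over the
    Kaggle file.  A fixed-sample sketch, not the sequential procedure.
    """
    counts = {"gate_30": [0, 0], "gate_40": [0, 0]}   # [retained, total]
    for row in rows:
        tallied = counts[row["version"]]
        tallied[0] += str(row["retention_7"]).lower() == "true"
        tallied[1] += 1
    (ra, na), (rb, nb) = counts["gate_30"], counts["gate_40"]
    p_a = (ra + 1) / (na + 2)    # shrinkage keeps estimates off 0 and 1
    p_b = (rb + 1) / (nb + 2)
    center = log(p_a / p_b)
    return p_a, p_b, (center - d, center + d)

# Tiny hypothetical rows in the dataset's schema:
demo = [{"version": "gate_30", "retention_7": "True"},
        {"version": "gate_30", "retention_7": "False"},
        {"version": "gate_40", "retention_7": "False"},
        {"version": "gate_40", "retention_7": "False"}]
p_a, p_b, (lo, hi) = retention_fwci(demo)
print(p_a, p_b)   # 0.5 0.25
```

An interval lying entirely above zero would favor version A (gate at level 30), matching the decision rule stated after (13).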
The two Cookie Cats versions can be modeled as two independent Bernoulli populations, as each version was tested on a separate and non-overlapping group of players. Let p_A and p_B denote the 7-day retention rates of version A and version B, respectively. With the level α and the half-width d fixed in advance, we implemented both sequential estimation procedures (11)–(12) and (23)–(24) to collect the data needed for constructing FWCIs to compare the magnitudes of p_A and p_B. The outcomes of these comparisons determine which version is better. To initiate the process, pilot samples of size 50 were taken for each version. The summary of the analyses is displayed in Table 6, where N_A and N_B represent the terminal numbers of users playing version A and version B, respectively, while p̂_A and p̂_B denote the sample 7-day retention rates of version A and version B, respectively. The FWCI refers to either the interval for the log ratio of the retention rates as defined in (13), or the interval for the log odds ratio as defined in (25), accordingly.
The sequential estimation procedure (11)–(12) terminated with 384 observations from version A and 427 observations from version B. The resulting FWCI for the log ratio of the retention rates contains only positive values, indicating that version A, with the first gate at level 30, has a significantly higher 7-day retention rate.
Not surprisingly, the sequential estimation procedure (23)–(24) required larger sample sizes, terminating with 581 observations from version A and 632 observations from version B. The resulting FWCI for the log odds ratio likewise contains only positive values, again indicating that version A, with the first gate at level 30, has a significantly higher 7-day retention rate. Both FWCIs consistently support the conclusion that placing the first gate in Cookie Cats at level 30 is more appealing than placing it at level 40 in terms of the 7-day retention rate.
5. Online Advertising Effectiveness
In this section, we revisit the second A/B testing application in online advertising described in Section 1: a large company with a substantial user base seeks to increase sales through advertisements. To assess the effectiveness of advertisements in boosting sales, an A/B testing experiment was conducted using a dataset collected from the Kaggle platform, accessed on 11 May 2024 (https://www.kaggle.com/datasets/farhadzeynalli/online-advertising-effectiveness-study-ab-testing/data), referred to as the Online Advertising data. The dataset contains information on users’ interactions with online advertisements, including the following variables.
customerID: A unique identifier for each individual customer.
made_purchase: A binary indicator of whether the user made a purchase after viewing an advertisement (TRUE) or not (FALSE).
test group: Specifies whether the user is in the “ads” (advertisements) group or the “psa” (public service announcements, no ads) group.
days_with_most_ads: The day of the month when the user viewed the most ads.
peak_ad_hours: The hour of the day when the user viewed the most ads.
ad_count: The total number of ads viewed by each user.
In this case, we hypothesize that implementing advertisements can lead to an increase in sales, which would be supported if the purchase rate in the ads group is significantly higher than that in the psa group.
The ads and psa groups can be modeled as independent Bernoulli populations because each group consisted of distinct and non-overlapping sets of users. Let p_ads and p_psa denote the purchase rates of the ads group and the psa group, respectively. With the level α and the half-width d fixed in advance, we implemented both sequential estimation procedures (11)–(12) and (23)–(24) to collect the data needed for constructing FWCIs to compare the magnitudes of p_ads and p_psa, which are further used to evaluate the effectiveness of advertisements in boosting sales. To initiate the process, pilot samples of size 500 were taken for each group. The summary of the analyses is displayed in Table 7, where N_ads and N_psa represent the terminal sample sizes for the ads and psa groups, respectively, while p̂_ads and p̂_psa denote the sample purchase rates of the ads and psa groups, respectively. The FWCI refers to either the interval for the log ratio of the purchase rates as defined in (13), or the interval for the log odds ratio as defined in (25), accordingly.
The sequential estimation procedure (11)–(12) terminated with 1424 observations from the ads group and 2121 observations from the psa group. The resulting FWCI for the log ratio of the purchase rates contains only positive values, indicating that the ads group has a significantly higher purchase rate than the psa group.
Similarly, the sequential estimation procedure (23)–(24) required larger sample sizes, terminating with 1595 observations from the ads group and 2253 observations from the psa group. The resulting FWCI for the log odds ratio likewise contains only positive values, confirming that the ads group has a significantly higher purchase rate compared to the psa group. Both FWCIs consistently demonstrate that implementing advertisements increases the purchase rate relative to public service announcements.
6. Conclusions
Traditional A/B tests typically rely on a fixed sample size, with larger sample sizes generally preferred to ensure statistical reliability. In contrast, sequential estimation procedures offer more flexibility by allowing data collection and analysis at multiple points throughout the process. In this paper, we presented a comprehensive study on the application of sequential confidence intervals for comparing two independent Bernoulli proportions in A/B testing, focusing on two real-world scenarios: mobile game design optimization and online advertising effectiveness. We proposed two types of fixed-width confidence intervals (FWCIs), one based on the log transformation and the other on the logit transformation, to evaluate key performance metrics, namely mobile game retention rates and purchase rates, respectively. Both approaches efficiently determined significant differences between experimental groups while minimizing data requirements. The findings demonstrate the practical utility of these methods, successfully identifying optimal strategies for both applications: placing the first gate of Cookie Cats at level 30 and implementing advertisements to boost sales.
As this study primarily focuses on comparing two independent proportions, the methodologies can be extended to broader contexts. For example, applications in online banking could enable financial managers to compare the proportions of users completing credit card applications, thereby refining services based on statistical evidence to improve user satisfaction and overall performance of their credit card offerings.
Future work could explore scenarios involving three or more independent Bernoulli populations, paving the way for research on ranking and selection problems, such as identifying the best-performing population. Sequential stopping rules and allocation schemes could be proposed to construct simultaneous FWCIs for all pairwise comparisons. In a parallel direction, bandit problems for studying the exploration-exploitation trade-off in sequential decision-making have gained a lot of attention in machine learning and reinforcement learning. We could apply FWCIs to two-armed or multi-armed Bernoulli bandits, opening up new possibilities in adaptive decision-making under uncertainty. Such extensions would broaden the scope of sequential A/B testing, making it a powerful tool for handling complex decision-making scenarios across various industries.