1. Introduction
A/B testing is a statistical strategy for comparing two or more variants to see which one outperforms the others. Participants in an A/B test are assigned to different groups at random and are exposed to different versions (A and B, say). The observed outcomes are then analyzed to determine whether any of the differences are statistically significant. A/B testing helps companies and researchers make choices about product features, marketing strategies, or user experiences based on data, leading to better results supported by real-world evidence. For example, ref. [1] provided insights into how Microsoft utilized online controlled experiments (A/B tests) to make data-driven decisions in product development, and ref. [2] used A/B testing alongside an evaluation of users’ mental models to improve the user experience of a Japanese language mobile learning application. Additionally, many companies regularly share case studies and blog posts detailing how A/B testing has helped them make decisions and obtain better results.
When binary data are available, A/B testing serves as a powerful method to compare two proportions across diverse domains, including applications in mobile gaming and marketing strategies. In the context of mobile gaming, A/B testing can be employed to compare different game features or user interfaces, aiming to identify the version that leads to a higher proportion of player engagement, retention, or in-app purchases. For example, in our study, we will apply A/B testing to analyze the mobile game called Cookie Cats. In this game, players progress through various stages and encounter gates that require them to either wait a certain amount of time or make an in-app purchase to continue. These gates serve a dual purpose: they drive in-app sales and provide players with a break, increasing and extending their enjoyment of the game. The placement of these gates is crucial for maintaining player satisfaction and maximizing retention. We will examine an A/B test in Cookie Cats where we changed the first gate from level 30 to level 40. Our analysis will focus on how this modification affects player retention.
Similarly, A/B testing is highly applicable in marketing strategies, such as evaluating the effectiveness of online advertising on outcomes such as sales or user engagement. A large company with an established user base, for example, may aim to increase sales through targeted advertisements. To determine the effectiveness of these ads, we will investigate an A/B test where users are divided into two groups: a control group that does not receive the advertisements, and a test group that does. By comparing sales performance between the two groups over a specified period, we will evaluate whether exposure to advertisements leads to a significant boost in sales. This data-driven approach enables the company to make informed decisions about its advertising strategy, ensuring that resources are allocated to campaigns that produce measurable results.
In both the mobile gaming and online advertising examples, where A/B testing is utilized as a pivotal tool for refining user experiences and optimizing strategies, the subsequent step of calculating confidence intervals for the proportions under estimation becomes equally important. Once A/B testing identifies the version that yields a superior outcome, the application of confidence intervals provides a quantitative measure of the precision and reliability of those estimates.
Confidence intervals play a fundamental role in the interpretation of proportions in statistical analysis. Estimating the percentage of website users who click a link is an example of a proportion. In two-proportion settings such as A/B testing, confidence intervals quantify the uncertainty about observed differences between groups. These intervals help analysts evaluate their estimates and judge both the statistical significance and the practical relevance of observed effects, improving the interpretation of research and experiments.
It is safe to say that there is a vast number of articles on confidence intervals for one and two proportions. One of the most commonly used methods is the Wald confidence interval, which relies on the normal approximation to the binomial distribution. Despite its simplicity, the Wald interval is widely criticized for its poor performance, particularly with small sample sizes or sample proportions near 0 or 1, where it often exhibits undercoverage. Ref. [3] introduced an alternative interval, called the Wilson interval, which improves upon the Wald interval by addressing its coverage issues; it performs better, especially for small sample sizes and extreme proportions. Ref. [4] proposed the Clopper–Pearson interval, an exact interval based on inverting the equal-tail binomial test. Although it guarantees the nominal coverage probability, this interval tends to be overly conservative and can produce unnecessarily wide intervals. Ref. [5] proposed the so-called “plus four” interval, which adjusts the Wald interval by adding two successes and two failures to the observed counts, improving its performance. Ref. [6] compared seven methods for constructing two-sided confidence intervals for a single proportion, providing a detailed analysis. Ref. [7] conducted a comprehensive evaluation of various intervals focusing on their coverage probability and width, which was further supported and complemented by their subsequent work [8]. Additionally, ref. [9] obtained the smallest confidence intervals for a proportion in the sense of set inclusion. As for the difference between two independent proportions, classical confidence intervals can often be constructed by inverting the corresponding hypothesis tests. For example, the Wald interval can be derived from Goodman’s test [10], and the score interval is based on inverting the test with the standard errors evaluated under the null hypothesis of equal proportions. Ref. [11] compared eleven methods and introduced a hybrid interval that uses information from the respective Wilson intervals for the two proportions. Ref. [12] extended the “plus four” approach to the two-proportion case. Ref. [13] applied an Edgeworth expansion to the Studentized difference of two binomial proportions, proposing two new intervals that correct the skewness in the Edgeworth expansion. Ref. [14] put forward a “recentered” confidence interval with strong overall performance. Most recently, ref. [15] presented an optimal exact confidence interval with uniformly superior performance in terms of infimum coverage probability and total interval width.
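To make the classical one-proportion intervals discussed above concrete, the following is a brief illustrative sketch (not from any of the cited papers; a 95% confidence level and the function names are our own choices) of the Wald, Wilson, and “plus four” intervals, showing the boundary behavior that motivates the criticisms of the Wald interval:

```python
from math import sqrt

Z = 1.959963984540054  # upper 2.5% point of N(0, 1), for 95% confidence

def wald_ci(x, n):
    """Wald interval: normal approximation centered at the sample proportion."""
    p = x / n
    h = Z * sqrt(p * (1 - p) / n)
    return (p - h, p + h)

def wilson_ci(x, n):
    """Wilson score interval: inverts the score test; behaves well near 0 or 1."""
    p = x / n
    denom = 1 + Z**2 / n
    center = (p + Z**2 / (2 * n)) / denom
    h = (Z / denom) * sqrt(p * (1 - p) / n + Z**2 / (4 * n**2))
    return (center - h, center + h)

def plus_four_ci(x, n):
    """"Plus four" (Agresti-Coull) interval: add two successes and two failures."""
    p = (x + 2) / (n + 4)
    h = Z * sqrt(p * (1 - p) / (n + 4))
    return (p - h, p + h)

# With 0 successes in 20 trials, the Wald interval degenerates to a single
# point, while Wilson and plus-four still produce informative intervals.
print(wald_ci(0, 20))       # (0.0, 0.0)
print(wilson_ci(0, 20))
print(plus_four_ci(0, 20))
```

The degenerate Wald output for an all-failure sample is exactly the undercoverage pathology noted above.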
It is worth mentioning that the aforementioned confidence intervals are all constructed based on fixed-size samples. However, in many statistical inference problems, fixed-sample-size methods are infeasible when predetermined accuracy requirements, such as fixed-width confidence intervals, must be met. Sequential or multistage sampling, which adjusts the sample size dynamically according to the collected data, becomes essential for achieving the desired accuracy. These methods are especially valuable in fields requiring prompt decisions, such as clinical trials and manufacturing quality control, enabling timely and accurate estimates while minimizing resource and time expenditure. In the context of sequential confidence intervals for one and two proportions, ref. [16] proposed asymptotically optimal sequential and two-stage procedures to construct confidence intervals of fixed width and confidence level for a Bernoulli success probability p. Ref. [17] developed and compared four exact sequential methods for obtaining fixed-width confidence intervals for p. Ref. [18] considered the construction of fixed-width confidence intervals using sequential sampling and carried out a simulation study. Ref. [19] introduced a novel approach with a tandem-width confidence interval for a Bernoulli proportion. Ref. [20] proposed leveraging importance sampling to calculate confidence intervals that almost always guarantee the specified coverage. Ref. [21] optimized sampling costs while achieving prescribed interval widths, accounting for varying observation costs between distributions. Furthermore, ref. [22] analyzed the achieved coverage and explored the trade-offs between the number of observations and the number of stages needed to achieve the desired width of the confidence interval.
In this article, we explore sequential confidence intervals for comparing two proportions, with a focus on A/B testing applications. We emphasize the importance of confidence intervals in interpreting A/B testing results within the mobile gaming and online advertising sectors. The article is organized as follows: Section 2 introduces fixed-width confidence intervals for the logarithm of the ratio of two proportions and presents simulated studies to validate these methods; Section 3 extends this discussion to fixed-width confidence intervals for the logarithm of the odds ratio, again supported by simulated studies; in Section 4, we apply these statistical tools to A/B testing in mobile gaming through a case study of the game Cookie Cats, highlighting the practical impact of gate placement on player retention; in Section 5, we present a second case study evaluating the effectiveness of online advertising aimed at increasing sales; finally, Section 6 concludes the article by summarizing the findings and underscoring the significance of sequential confidence intervals in facilitating data-driven decision-making via A/B testing.
2. Sequential Confidence Intervals for the Ratio of Two Proportions
Suppose we are interested in some common characteristic, referred to as success, possessed by two independent dichotomous populations, say X and Y. The success probabilities are denoted by p_X and p_Y, respectively, where 0 < p_X, p_Y < 1. Our goal is to compare their magnitudes and determine whether one is significantly greater than the other.
Assume that we have collected random samples X_1, ..., X_m and Y_1, ..., Y_n from X and Y, respectively, where the sample sizes m and n are not necessarily the same. Then, the X_i’s are independent and identically distributed (i.i.d.) Bernoulli(p_X) random variables, and the sample proportion S_m/m, with S_m = X_1 + ... + X_m, serves as an unbiased estimator of p_X. However, even though 0 < p_X < 1, there is a positive probability that S_m/m equals 0 or 1 in a given sample. To avoid this issue, we adopt the “plus four” idea introduced in [5,12], leading to the following biased but consistent estimator:

p̂_X,m = (S_m + 1)/(m + 2).    (1)

This estimator can be treated as a Bayes estimator with a uniform Beta(1, 1) prior, or as a weighted average of S_m/m (the sample proportion) and 1/2 (a naïve estimator of p_X). Notably, p̂_X,m is always strictly between 0 and 1. Similarly, we estimate p_Y by

p̂_Y,n = (T_n + 1)/(n + 2), with T_n = Y_1 + ... + Y_n.    (2)
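As a quick sketch (the function name is ours, and we assume the adjusted estimator takes the form (S + 1)/(m + 2) described above), the shrinkage estimate never touches the boundary values 0 or 1:

```python
def plus_one_estimate(successes, trials):
    """Shrinkage estimate (S + 1)/(m + 2): the posterior mean under a
    uniform Beta(1, 1) prior; always strictly inside (0, 1)."""
    return (successes + 1) / (trials + 2)

# Even an all-failure or all-success sample yields a usable interior estimate:
print(plus_one_estimate(0, 10))   # 1/12, not 0
print(plus_one_estimate(10, 10))  # 11/12, not 1
```

This interior-point property is what makes the variance estimators of the sequential procedure below well defined with probability one.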
To compare the magnitudes of p_X and p_Y, we construct a confidence interval for the ratio ρ = p_X/p_Y (or a monotonic function of ρ) with some prescribed accuracy. As ρ is always positive, we apply the log transformation on it, and the resulting quantity ln ρ = ln p_X − ln p_Y takes values on (−∞, ∞). According to the central limit theorem and the delta method, we find that, for each population,

sqrt(m) (ln p̂_X,m − ln p_X) → N(0, σ_X²) in distribution    (3)

as m → ∞, where σ_X² = (1 − p_X)/p_X, and analogously for Y with σ_Y² = (1 − p_Y)/p_Y. For sufficiently large m and n, we have the following approximate normality of the difference in ln p̂_X,m and ln p̂_Y,n:

ln(p̂_X,m/p̂_Y,n) − ln(p_X/p_Y) ≈ N(0, σ_X²/m + σ_Y²/n),    (4)

where W ≈ F represents that the random variable W is approximately distributed as F. This can be used to construct a large-sample approximate confidence interval for ln ρ to compare p_X and p_Y. For the sake of estimation precision, we pre-specify both the confidence level 1 − α and the interval width 2d. Such an interval is then referred to as a fixed-width confidence interval (FWCI). That is, with prefixed α and d, we consider the confidence interval of the form

J_m,n = [ln(p̂_X,m/p̂_Y,n) − d, ln(p̂_X,m/p̂_Y,n) + d],    (5)

which satisfies

P(ln ρ ∈ J_m,n) ≥ 1 − α.    (6)

Here, ln(p̂_X,m/p̂_Y,n) serves as a point estimator for ln ρ, and d can be interpreted as the half-width of the interval (or, the margin of error).

It is clear that the FWCI J_m,n for ln ρ given by (5) is equivalent to the following interval for the ratio of two proportions ρ:

[(p̂_X,m/p̂_Y,n) e^{−d}, (p̂_X,m/p̂_Y,n) e^{d}],

which is then called a fixed-accuracy confidence interval (FACI) with e^{d} being the accuracy parameter.
Next, we set out to determine the minimum sample sizes needed to meet the fixed width and coverage probability requirements. Define

C = z_{α/2}²/d²,    (7)

where z_{α/2} is the upper α/2 point of a standard normal distribution. From (6), the required sample size in total, m + n, must satisfy that

σ_X²/m + σ_Y²/n ≤ 1/C.    (8)

To minimize the total sample size, applying the Cauchy–Schwarz inequality yields

m + n ≥ C (σ_X + σ_Y)²,    (9)

with equality when m/n = σ_X/σ_Y. In this sense, we can specify the optimal sample sizes of m, n, and m + n as follows:

n_X* = C σ_X (σ_X + σ_Y),  n_Y* = C σ_Y (σ_X + σ_Y),  and  n* = n_X* + n_Y* = C (σ_X + σ_Y)².    (10)

We tacitly disregard the fact that n_X*, n_Y*, or n* may not be an integer.
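A small sketch of this allocation (the function name is ours; it assumes the delta-method standard deviations σ = sqrt((1 − p)/p) used above) computes the optimal sizes directly:

```python
from math import sqrt

def optimal_sizes(p_x, p_y, d, z=1.959963984540054):
    """Optimal sample sizes for a fixed-width CI of ln(p_x / p_y).

    Assumes the delta-method variance (1 - p)/(m p) for ln of the estimated
    proportion, so the half-width requirement z*sqrt(s_x^2/m + s_y^2/n) <= d
    with m + n minimized forces m/n = s_x/s_y (Cauchy-Schwarz equality).
    """
    s_x = sqrt((1 - p_x) / p_x)
    s_y = sqrt((1 - p_y) / p_y)
    c = (z / d) ** 2
    n_x = c * s_x * (s_x + s_y)   # observations from X
    n_y = c * s_y * (s_x + s_y)   # observations from Y
    return n_x, n_y, n_x + n_y

n_x, n_y, total = optimal_sizes(0.5, 0.4, 0.2)
print(round(n_x), round(n_y), round(total))
```

At these sizes the half-width constraint holds with equality, which is the sense in which the allocation is optimal.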
Since p_X and p_Y are two unknown parameters, it is essential to estimate σ_X and σ_Y by updating their estimators at every stage as necessary. Beginning with pilot samples X_1, ..., X_{m_0} from X and Y_1, ..., Y_{n_0} from Y, we propose the following sequential estimation procedure with the associated stopping rule given by

(N_X, N_Y): stop sampling as soon as m + n ≥ C (σ̂_X,m + σ̂_Y,n)²,    (11)

where N_X and N_Y indicate the numbers of observations that are taken from X and Y, respectively, and for m ≥ m_0 and n ≥ n_0,

σ̂_X,m = sqrt((1 − p̂_X,m)/p̂_X,m)  and  σ̂_Y,n = sqrt((1 − p̂_Y,n)/p̂_Y,n),

with p̂_X,m and p̂_Y,n defined in (1)–(2). By utilizing the “plus four” adjustment, 0 < p̂_X,m, p̂_Y,n < 1, so that (σ̂_X,m, σ̂_Y,n) is well defined with probability one (w.p.1). Suppose that at some point, we have gathered m and n observations from X and Y, respectively, but the stopping rule (11) is not satisfied, which implies that we should continue sampling. The question is from which population we are going to take the next observation. According to the equality condition for (9), we propose the following allocation scheme:

take the next observation from X if m/n < σ̂_X,m/σ̂_Y,n, and from Y otherwise.    (12)
This sequential estimation procedure (11)–(12) can be summarized in Algorithm 1. It is implemented as follows. With the pilot samples, if m_0 + n_0 ≥ C(σ̂_X,m_0 + σ̂_Y,n_0)² has already been satisfied, we do not take any additional observations, and the final sample size is (N_X, N_Y) = (m_0, n_0). Otherwise, we compare m/n with σ̂_X,m/σ̂_Y,n, and pick the next observation as per (12). After obtaining the updated p̂_X,m or p̂_Y,n, we check with the boundary crossing condition (11). This process is repeated until the stopping event happens for the first time. By referring to [23], we can claim that P(N_X < ∞) = P(N_Y < ∞) = 1, which shows that the procedure will stop w.p.1. Finally, with the fully accrued data X_1, ..., X_{N_X}, Y_1, ..., Y_{N_Y}, we construct the FWCI

J_{N_X,N_Y} = [ln(p̂_X,N_X/p̂_Y,N_Y) − d, ln(p̂_X,N_X/p̂_Y,N_Y) + d]    (13)

for ln(p_X/p_Y). If the interval J_{N_X,N_Y} contains zero, we conclude that there is no significant difference in p_X and p_Y at a pre-specified level of α; and if J_{N_X,N_Y} contains only positive (negative) values, we conclude that p_X > p_Y (p_X < p_Y) at level α.
Algorithm 1: Sequential sampling strategy (11) and allocation scheme (12)
1  Take pilot samples X_1, ..., X_{m_0} and Y_1, ..., Y_{n_0};
2  Assign m as m_0 and n as n_0;
3  while m + n < C (σ̂_X,m + σ̂_Y,n)² do
4    if m/n ≥ σ̂_X,m/σ̂_Y,n then
5      Collect one additional observation from Y;
6      Update n as n + 1;
7    else
8      Collect one additional observation from X;
9      Update m as m + 1;
10   end
11 end
12 return N_X = m, N_Y = n, and the FWCI J_{N_X,N_Y} in (13).
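For concreteness, the following is a runnable Python sketch of Algorithm 1 under the plug-in assumptions spelled out above (shrinkage estimates (S + 1)/(m + 2), stopping rule m + n ≥ (z_{α/2}/d)²(σ̂_X + σ̂_Y)², and allocation by comparing m/n with σ̂_X/σ̂_Y); all names are our own, not from any reference implementation:

```python
import random
from math import log, sqrt

Z = 1.959963984540054  # upper 2.5% point of N(0, 1), i.e., alpha = 0.05

def shrink(successes, trials):
    """Shrinkage estimate (S + 1)/(m + 2), strictly inside (0, 1)."""
    return (successes + 1) / (trials + 2)

def sequential_fwci(draw_x, draw_y, d, m0=20, n0=20):
    """Run the sequential procedure; return (m, n, lower, upper).

    draw_x / draw_y each yield one Bernoulli observation (0/1); d is the
    prescribed half-width; m0, n0 are pilot sample sizes.  Stopping rule:
    m + n >= (Z/d)^2 (s_x + s_y)^2; allocation: observe X next when
    m/n < s_x/s_y, else Y.
    """
    c = (Z / d) ** 2
    sx = sum(draw_x() for _ in range(m0))
    sy = sum(draw_y() for _ in range(n0))
    m, n = m0, n0
    while True:
        p_x, p_y = shrink(sx, m), shrink(sy, n)
        s_x = sqrt((1 - p_x) / p_x)
        s_y = sqrt((1 - p_y) / p_y)
        if m + n >= c * (s_x + s_y) ** 2:   # boundary crossing: stop
            break
        if m / n < s_x / s_y:               # X side is under-sampled
            sx += draw_x(); m += 1
        else:
            sy += draw_y(); n += 1
    center = log(p_x / p_y)                 # point estimate of ln(p_X/p_Y)
    return m, n, center - d, center + d

random.seed(2024)
m, n, lo, hi = sequential_fwci(lambda: random.random() < 0.5,
                               lambda: random.random() < 0.4, d=0.2)
print(m, n, (round(lo, 3), round(hi, 3)))  # interval width is exactly 2d
```

If the printed interval lies entirely above zero, the run concludes p_X > p_Y at level α = 0.05, exactly as described after (13).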
The sequential estimation procedure (11)–(12) enjoys the following efficiency properties, as summarized in Theorem 1.
Theorem 1. Under the sequential sampling strategy (11) and the allocation scheme (12), with p_X, p_Y, and α fixed, as d → 0, we have:

N_X/n_X* → 1, N_Y/n_Y* → 1, and N/n* → 1 w.p.1, where N = N_X + N_Y;    (26)

P(ln(p_X/p_Y) ∈ J_{N_X,N_Y}) → 1 − α;    (27)

where n_X*, n_Y*, and n* come from (10), and J_{N_X,N_Y} comes from (13).
Proof. One can easily find that the associated stopping rule (11) and allocation scheme (12) are similar to the sampling rule of [24]. Their techniques can be applied here to justify both (26) and (27), so we omit the proof for brevity. One may refer to [25], Chapter 13 of [26], and other sources for many details. □
2.1. Simulated Studies
To investigate the performance of our proposed sequential estimation procedure (11)–(12), we have conducted an extensive set of Monte Carlo simulations. For illustrative purposes, we first present the results under the following settings: X and Y are Bernoulli populations with distinct success probabilities p_X > p_Y; the level α is fixed to be 0.05 so that the confidence level is 95%; the pilot sample sizes are both set to 20; and a wide range of equally spaced values of d (the half-width) is taken into account. For each configuration, we have run the simulation 10,000 times, and we summarize the findings in Table 1. We have recorded the three optimal sample sizes n_X*, n_Y*, and n*; the three average final sample sizes N̄_X, N̄_Y, and N̄, along with their standard deviations; and the three ratios N̄_X/n_X*, N̄_Y/n_Y*, and N̄/n*. In the second-to-last column, the coverage proportion is the proportion of confidence intervals that successfully capture the parameter under estimation, which is to be compared with the confidence level. In the last column, Power refers to the proportion of confidence intervals that successfully identify p_X > p_Y, that is, the proportion of confidence intervals containing positive values alone. Note that this “power” only provides a conservative estimate, because we are using a two-sided confidence interval to help make a one-sided conclusion.
From Table 1, we find that the three ratios N̄_X/n_X*, N̄_Y/n_Y*, and N̄/n* are all slightly below 1. However, as d decreases, the three ratios get closer and closer to 1, which empirically verifies (26) in Theorem 1. The coverage proportions are all around 95%, verifying (27). We also observe that Power increases rapidly as d decreases. At the largest d value considered, one is able to conclude p_X > p_Y at least 97.75% of the time; and as d decreases further, this rate increases to 100%. This indicates that our proposed sequential estimation procedure (11)–(12) can help identify which proportion is larger when there does exist a difference in the two proportions for small d values.
In Table 1, we have considered a scenario with p_X > p_Y. Since the optimal sample size n_X* or n_Y* is not symmetric about 1/2, we have also carried out a set of simulations with the success probabilities replaced by their complements 1 − p_X and 1 − p_Y, with a wide range of d from 0.20 to 0.05 considered. The findings are displayed in Table 2. There is little to no difference in the performance compared to that summarized in Table 1.
Finally, we investigate the performance of the proposed sequential estimation procedure (11)–(12) when the two proportions are identical. In particular, we have conducted simulations under equal success probabilities p_X = p_Y, with d varying from 0.7 to 0.1. This time, note that Power has the same definition as the coverage proportion, so we combine the last two columns of Table 1 and Table 2 and rename the combined column “Coverage/Power”. The findings are summarized in Table 3, which again validate Theorem 1. We leave out many details for brevity.
3. Sequential Confidence Intervals for the Odds Ratio of Two Proportions
In Section 2, the log transformation resulted in an FWCI for ln(p_X/p_Y), the log of the ratio of two proportions. In this section, we consider the logit transformation, which helps to construct an FWCI for ln[p_X(1 − p_Y)/((1 − p_X)p_Y)], the log of the odds ratio of two proportions.
As before, we continue to use p̂_X,m and p̂_Y,n, defined in (1)–(2), as the point estimators of p_X and p_Y, respectively. By the central limit theorem and the delta method, for each population,

sqrt(m) [logit(p̂_X,m) − logit(p_X)] → N(0, 1/(p_X(1 − p_X))) in distribution

as m → ∞, and analogously for Y. For large enough m and n, we then have

logit(p̂_X,m) − logit(p̂_Y,n) − [logit(p_X) − logit(p_Y)] ≈ N(0, 1/(m p_X(1 − p_X)) + 1/(n p_Y(1 − p_Y))),

which can be used to construct a large-sample approximate confidence interval for the log odds ratio to compare p_X and p_Y. With the prefixed half-width d, we consider the FWCI given by

J′_m,n = [logit(p̂_X,m) − logit(p̂_Y,n) − d, logit(p̂_X,m) − logit(p̂_Y,n) + d],    (18)

where logit(p̂_X,m) − logit(p̂_Y,n) serves as a point estimator for the log odds ratio. It should further satisfy that

P(ln[p_X(1 − p_Y)/((1 − p_X)p_Y)] ∈ J′_m,n) ≥ 1 − α,    (19)

where the confidence level 1 − α is also prefixed.

Clearly, the FWCI J′_m,n for the log odds ratio given by (18) is equivalent to the following FACI for the odds ratio p_X(1 − p_Y)/((1 − p_X)p_Y):

[ÔR_m,n e^{−d}, ÔR_m,n e^{d}],  where ÔR_m,n = p̂_X,m(1 − p̂_Y,n)/((1 − p̂_X,m)p̂_Y,n).
Define τ_X = 1/sqrt(p_X(1 − p_X)) and τ_Y = 1/sqrt(p_Y(1 − p_Y)). From (19), the required sample size in total, m + n, must satisfy that

τ_X²/m + τ_Y²/n ≤ 1/C,

where C is defined in (7). Apply the Cauchy–Schwarz inequality, and we obtain

m + n ≥ C (τ_X + τ_Y)²,

with equality when m/n = τ_X/τ_Y. In the same fashion as (10), the optimal sample sizes are

n_X* = C τ_X (τ_X + τ_Y),  n_Y* = C τ_Y (τ_X + τ_Y),  and  n* = C (τ_X + τ_Y)².    (22)
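Mirroring the earlier sketch for the log-ratio case (the function name is ours; the delta-method variance 1/(m p(1 − p)) is assumed), the optimal sizes under the logit transform can be computed as follows. Note that 1/sqrt(p(1 − p)) is unchanged when p is replaced by 1 − p, so these sizes are symmetric about 1/2:

```python
from math import sqrt

def optimal_sizes_logit(p_x, p_y, d, z=1.959963984540054):
    """Optimal sample sizes for a fixed-width CI of the log odds ratio.

    Assumes the delta-method variance 1/(m p (1 - p)) for the logit of the
    estimated proportion; the Cauchy-Schwarz argument then mirrors the
    log-ratio case with t = 1/sqrt(p(1 - p)) in place of sqrt((1 - p)/p).
    """
    t_x = 1 / sqrt(p_x * (1 - p_x))
    t_y = 1 / sqrt(p_y * (1 - p_y))
    c = (z / d) ** 2
    return c * t_x * (t_x + t_y), c * t_y * (t_x + t_y)

nx, ny = optimal_sizes_logit(0.5, 0.4, 0.2)
print(round(nx), round(ny))  # notably larger than the log-ratio sizes
```

The larger required sizes for the same d foreshadow the empirical comparison of the two procedures reported below.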
Again, it is essential to estimate the unknown τ_X and τ_Y by updating their estimators at every stage as necessary. In the spirit of (11)–(12), we propose the following sequential estimation procedure with the associated stopping rule and allocation scheme given by

(N_X, N_Y): stop sampling as soon as m + n ≥ C (τ̂_X,m + τ̂_Y,n)²,    (23)

with m ≥ m_0 and n ≥ n_0, and

take the next observation from X if m/n < τ̂_X,m/τ̂_Y,n, and from Y otherwise,    (24)

where τ̂_X,m = 1/sqrt(p̂_X,m(1 − p̂_X,m)) and τ̂_Y,n = 1/sqrt(p̂_Y,n(1 − p̂_Y,n)). Since 0 < p̂_X,m, p̂_Y,n < 1, the pair (τ̂_X,m, τ̂_Y,n) is well defined w.p.1. The implementation of the sequential estimation procedure (23)–(24) is analogous to that of the procedure (11)–(12), and sampling will terminate w.p.1. After having collected the full data X_1, ..., X_{N_X}, Y_1, ..., Y_{N_Y}, we construct the FWCI

J′_{N_X,N_Y} = [logit(p̂_X,N_X) − logit(p̂_Y,N_Y) − d, logit(p̂_X,N_X) − logit(p̂_Y,N_Y) + d]    (25)

for the log odds ratio. If the interval J′_{N_X,N_Y} contains zero, we conclude that there is no significant difference in p_X and p_Y at level α; and if J′_{N_X,N_Y} contains only positive (negative) values, we conclude that p_X > p_Y (p_X < p_Y) at level α.
In the spirit of Theorem 1, we state the efficiency properties enjoyed by the sequential estimation procedure (23)–(24) in the following theorem.
Theorem 2. Under the sequential sampling strategy (23) and the allocation scheme (24), with p_X, p_Y, and α fixed, as d → 0, we have N_X/n_X* → 1, N_Y/n_Y* → 1, and N/n* → 1 w.p.1, as well as P(ln[p_X(1 − p_Y)/((1 − p_X)p_Y)] ∈ J′_{N_X,N_Y}) → 1 − α, where n_X*, n_Y*, and n* come from (22), and J′_{N_X,N_Y} comes from (25).
Proof. The proof is the same as that of Theorem 1, and is thus omitted for brevity. □
Simulated Studies
To investigate the performance of the sequential estimation procedure (23)–(24), we have conducted an extensive set of Monte Carlo simulations in the same fashion as in Section 2.1. With the confidence level 95% and the same pilot sample sizes as before, we have considered the following two scenarios: (i) X and Y are Bernoulli populations with distinct success probabilities p_X > p_Y; and (ii) X and Y are Bernoulli populations with an identical success probability. We exclude the case in which the success probabilities are replaced by their complements 1 − p_X and 1 − p_Y, since the optimal sample sizes n_X* and n_Y* are both symmetric about 1/2. The corresponding simulated results are summarized in Table 4 and Table 5.
Comparing the simulated results implementing the sequential estimation procedures (11)–(12) under the log transformation and (23)–(24) under the logit transformation, we find little to no difference in terms of coverage probability and power performance. However, distinctions emerge regarding the sample size needed: for the same d value, the former procedure requires a smaller sample size, indicating greater efficiency; conversely, the latter procedure presents a smaller standard deviation in sample size, suggesting lower variability and greater “robustness”. For clarity and brevity, Figure 1 illustrates a visual comparison of the terminal sample sizes obtained from implementing the sequential estimation procedures (11)–(12) and (23)–(24) under a representative pair of Bernoulli success probabilities.
4. Mobile Games A/B Testing
Now, we are in a position to revisit the mobile games A/B testing problem described in Section 1. To illustrate the application, we analyze a dataset collected from the Kaggle platform, accessed on 3 March 2024 (https://www.kaggle.com/code/yufengsui/datacamp-project-mobile-games-a-b-testing/notebook), referred to as the Cookie Cats data. The dataset contains information on over 90,000 users of the mobile puzzle game Cookie Cats, developed by Tactile Entertainment. The following variables are included.
userid: A unique label that identifies each user.
version: The version of the game the user played, either with the first gate at level 30 (gate_30, version A) or at level 40 (gate_40, version B).
sum_gamerounds: The total number of game rounds played by the user during the first week after installation.
retention_1: A binary indicator of whether the user returned to play the game one day after installation (True) or not (False).
retention_7: A binary indicator of whether the user returned to play the game seven days after installation (True) or not (False).
Our primary focus is on the variable retention_7, which measures 7-day retention. This metric is used to determine which version of the game more successfully retains users.
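As a hedged sketch of how such records could be processed (the helper function and the demo rows are ours; only the column names version and retention_7 and the values gate_30/gate_40 come from the dataset's schema), plus-four-style retention estimates and a half-width-d interval for the log retention-rate ratio might be computed as follows:

```python
from math import log

def retention_fwci(rows, d=0.1):
    """Per-version 7-day retention estimates, plus a fixed-width interval
    for ln(p_A / p_B) with half-width d.

    `rows` are dicts carrying the dataset's `version` and `retention_7`
    fields; in practice they would come from csv.DictReader over the
    Kaggle file.  A fixed-sample sketch, not the sequential procedure.
    """
    counts = {"gate_30": [0, 0], "gate_40": [0, 0]}   # [retained, total]
    for row in rows:
        tallied = counts[row["version"]]
        tallied[0] += str(row["retention_7"]).lower() == "true"
        tallied[1] += 1
    (ra, na), (rb, nb) = counts["gate_30"], counts["gate_40"]
    p_a = (ra + 1) / (na + 2)    # shrinkage keeps estimates off 0 and 1
    p_b = (rb + 1) / (nb + 2)
    center = log(p_a / p_b)
    return p_a, p_b, (center - d, center + d)

# Tiny hypothetical rows in the dataset's schema:
demo = [{"version": "gate_30", "retention_7": "True"},
        {"version": "gate_30", "retention_7": "False"},
        {"version": "gate_40", "retention_7": "False"},
        {"version": "gate_40", "retention_7": "False"}]
p_a, p_b, (lo, hi) = retention_fwci(demo)
print(p_a, p_b)   # 0.5 0.25
```

An interval lying entirely above zero would favor version A (gate at level 30), matching the decision rule stated after (13).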
The two Cookie Cats versions can be modeled as two independent Bernoulli populations, as each version was tested on a separate and non-overlapping group of players. Let p_A and p_B denote the 7-day retention rates of version A and version B, respectively. With the level α and the half-width d fixed in advance, we implemented both sequential estimation procedures (11)–(12) and (23)–(24) to collect the data needed for constructing FWCIs to compare the magnitudes of p_A and p_B. The outcomes of these comparisons determine which version is better. To initiate the process, pilot samples of size 50 were taken for each version. The summary of the analyses is displayed in Table 6, where N_A and N_B represent the terminal numbers of users playing version A and version B, respectively, while p̂_A and p̂_B denote the sample 7-day retention rates of version A and version B, respectively. The FWCI refers to either the interval for the log ratio of the retention rates as defined in (13), or the interval for the log odds ratio as defined in (25), accordingly.
The sequential estimation procedure (11)–(12) terminated with 384 observations from version A and 427 observations from version B. The resulting FWCI for the log ratio of the retention rates contains only positive values, indicating that version A, with the first gate at level 30, has a significantly higher 7-day retention rate.
Not surprisingly, the sequential estimation procedure (23)–(24) required larger sample sizes, terminating with 581 observations from version A and 632 observations from version B. The resulting FWCI for the log odds ratio likewise contains only positive values, again indicating that version A, with the first gate at level 30, has a significantly higher 7-day retention rate. Both FWCIs consistently support the conclusion that placing the first gate in Cookie Cats at level 30 is more appealing than placing it at level 40 in terms of the 7-day retention rate.
5. Online Advertising Effectiveness
In this section, we revisit the second A/B testing application in online advertising described in Section 1: a large company with a substantial user base seeks to increase sales through advertisements. To assess the effectiveness of advertisements in boosting sales, an A/B testing experiment was conducted using a dataset collected from the Kaggle platform, accessed on 11 May 2024 (https://www.kaggle.com/datasets/farhadzeynalli/online-advertising-effectiveness-study-ab-testing/data), referred to as the Online Advertising data. The dataset contains information on users’ interactions with online advertisements, including the following variables.
customerID: A unique identifier for each individual customer.
made_purchase: A binary indicator of whether the user made a purchase after viewing an advertisement (TRUE) or not (FALSE).
test group: Specifies whether the user is in the “ads” (advertisements) group or the “psa” (public service announcements, no ads) group.
days_with_most_ads: The day of the month when the user viewed the most ads.
peak_ad_hours: The hour of the day when the user viewed the most ads.
ad_count: The total number of ads viewed by each user.
In this case, we hypothesize that implementing advertisements can lead to an increase in sales, which would be supported if the purchase rate in the ads group is significantly higher than that in the psa group.
The ads and psa groups can be modeled as independent Bernoulli populations because each group consisted of distinct and non-overlapping sets of users. Let p_ads and p_psa denote the purchase rates of the ads group and the psa group, respectively. With the level α and the half-width d fixed in advance, we implemented both sequential estimation procedures (11)–(12) and (23)–(24) to collect the data needed for constructing FWCIs to compare the magnitudes of p_ads and p_psa, which are further used to evaluate the effectiveness of advertisements in boosting sales. To initiate the process, pilot samples of size 500 were taken for each group. The summary of the analyses is displayed in Table 7, where N_ads and N_psa represent the terminal sample sizes for the ads and psa groups, respectively, while p̂_ads and p̂_psa denote the sample purchase rates of the ads and psa groups, respectively. The FWCI refers to either the interval for the log ratio of the purchase rates as defined in (13), or the interval for the log odds ratio as defined in (25), accordingly.
The sequential estimation procedure (11)–(12) terminated with 1424 observations from the ads group and 2121 observations from the psa group. The resulting FWCI for the log ratio of the purchase rates contains only positive values, indicating that the ads group has a significantly higher purchase rate than the psa group.
Similarly, the sequential estimation procedure (23)–(24) required larger sample sizes, terminating with 1595 observations from the ads group and 2253 observations from the psa group. The resulting FWCI for the log odds ratio likewise contains only positive values, confirming that the ads group has a significantly higher purchase rate compared to the psa group. Both FWCIs consistently demonstrate that implementing advertisements increases the purchase rate relative to public service announcements.
6. Conclusions
Traditional A/B tests typically rely on a fixed sample size, with larger sample sizes generally preferred to ensure statistical reliability. In contrast, sequential estimation procedures offer more flexibility by allowing data collection and analysis at multiple points throughout the process. In this paper, we presented a comprehensive study on the application of sequential confidence intervals for comparing two independent Bernoulli proportions in A/B testing, focusing on two real-world scenarios: mobile game design optimization and online advertising effectiveness. We proposed two types of fixed-width confidence intervals (FWCIs), one based on the log transformation and the other on the logit transformation, to evaluate key performance metrics, namely mobile game retention rates and purchase rates, respectively. Both approaches efficiently determined significant differences between experimental groups while minimizing data requirements. The findings demonstrate the practical utility of these methods, successfully identifying optimal strategies for both applications: placing the first gate of Cookie Cats at level 30 and implementing advertisements to boost sales.
As this study primarily focuses on comparing two independent proportions, the methodologies can be extended to broader contexts. For example, applications in online banking could enable financial managers to compare the proportions of users completing credit card applications, thereby refining services based on statistical evidence to improve user satisfaction and overall performance of their credit card offerings.
Future work could explore scenarios involving three or more independent Bernoulli populations, paving the way for research on ranking and selection problems, such as identifying the best-performing population. Sequential stopping rules and allocation schemes could be proposed to construct simultaneous FWCIs for all pairwise comparisons. In a parallel direction, bandit problems for studying the exploration-exploitation trade-off in sequential decision-making have gained a lot of attention in machine learning and reinforcement learning. We could apply FWCIs to two-armed or multi-armed Bernoulli bandits, opening up new possibilities in adaptive decision-making under uncertainty. Such extensions would broaden the scope of sequential A/B testing, making it a powerful tool for handling complex decision-making scenarios across various industries.