1. Introduction
Count data models (see, e.g., [1]) with latent exposure or reporting mechanisms are common in many empirical settings, including marketing analytics [2], epidemiology [3], official statistics [4], and gender-violence research [5,6]. Among these, the Poisson-logistic (pogit) model [7] has emerged as a flexible and interpretable framework for modeling observed counts subject to under-reporting or partial observability. By combining a binomial observation equation with a Poisson exposure model, the pogit specification allows researchers to disentangle reporting behavior from the underlying intensity process, while retaining a regression structure that facilitates inference and interpretation.
The pogit model has found applications in diverse areas. In studies of under-reporting, it provides a principled way to account for missing or censored events by explicitly modeling the probability that an event is observed (see, e.g., [7]). In marketing and consumer analytics, it has been used for share-of-wallet (SoW) estimation, where observed purchases represent a noisy subset of latent purchase opportunities (see, e.g., [2]). These applications have motivated a growing methodological literature, including extensions to negative-binomial exposures [4], as well as parametric and nonparametric Bayesian formulations that incorporate prior information and/or variable selection (see, e.g., [5,8,9]).
Despite these advances, a major practical limitation of existing pogit methodologies is their lack of scalability. While the latent count can be analytically marginalized and maximum-likelihood estimation can in principle be based on the observed data likelihood, the resulting objective function is highly nonlinear and tightly couples the reporting and intensity components through a multiplicative structure. This makes direct likelihood maximization numerically delicate in practice, particularly as sample size and covariate dimension grow, or when parameters are weakly identified. Bayesian approaches face related difficulties, as Markov chain Monte Carlo methods must explore high–dimensional posteriors with strong dependence across model components. As a consequence, pogit models remain difficult to deploy in large–scale applications, despite their conceptual suitability for precisely such settings.
In this article, we address this gap by introducing a new expectation-maximization (EM) algorithm [10] for efficient estimation of the standard pogit model. Our approach builds on Polya-Gamma data augmentation [11], which yields conditionally Gaussian complete-data likelihoods and enables closed-form updates in the M-step. Specifically, we adapt the Polya-Gamma-based EM framework developed by Scott et al. [12] for logistic regression to the pogit setting, combining it with the approximate augmentation strategy for Poisson models introduced by D'Angelo et al. [13]. This synthesis results in an EM algorithm with a simple structure and closed-form updates throughout. Crucially, the resulting procedure is naturally amenable to online and mini-batch variants, making it possible to fit pogit models to very large or streaming datasets without sacrificing statistical efficiency.
The remainder of the article is organized as follows. Section 2 introduces the pogit model and reviews its likelihood structure. Section 3 presents the proposed Polya-Gamma-based EM algorithm and discusses its computational properties. Section 4 describes additional computational enhancements, namely quasi-Newton acceleration and mini-batch extensions for large-scale data. Section 5 reports results from simulation studies assessing convergence, finite-sample performance, and computational efficiency, with comparisons to direct numerical maximum-likelihood estimation. Section 6 applies the method to a real-world dataset. Section 7 concludes.
Contributions
This article makes the following contributions:
We introduce a new expectation-maximization algorithm for the standard pogit model based on Polya-Gamma data augmentation. By combining an exact augmentation for the binomial component with a controlled approximation for the Poisson component, the complete-data log-likelihood becomes quadratic in the regression parameters. This yields closed-form expressions for all E-step expectations and reduces the M-step to simple weighted least-squares updates, resulting in a fully analytic EM procedure with low per-iteration computational cost (a minimal code illustration of this E-step/M-step structure is sketched at the end of this list of contributions).
We show that the resulting EM algorithm admits scalable online and mini-batch variants. In particular, the method can be applied to datasets with millions of observations using mini-batch updates, making pogit models feasible in large-scale applications where existing methods break down.
We evaluate the statistical and computational performance of the proposed estimator using both simulated and real datasets. The results demonstrate fast convergence, stable behavior across sample sizes, and competitive estimation accuracy.
We provide a systematic comparison with direct numerical maximization of the observed-data likelihood using generic maximum-likelihood routines, showing that the proposed method delivers substantial runtime improvements while exhibiting stable finite-sample behavior, robust parameter recovery, and numerical stability.
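To fix ideas, the following minimal Julia sketch shows the Polya-Gamma EM update for a logistic (binomial) component in isolation, in the spirit of Scott et al. [12]: the E-step evaluates closed-form Polya-Gamma expectations and the M-step is a weighted least-squares solve. It is only an illustration of this structure under generic notation of our own; the full pogit algorithm of Section 3 additionally requires expectations of the latent Poisson counts and the approximate augmentation of the Poisson component, which are not reproduced here.

using LinearAlgebra

# E-step weight: E[omega] under PG(n, psi); the limit as psi -> 0 is n/4.
pg_mean(n, psi) = abs(psi) < 1e-8 ? n / 4 : n * tanh(psi / 2) / (2 * psi)

# Polya-Gamma EM for logistic regression: y successes out of n trials, design matrix X.
function pg_em_logistic(X, y, n; iters = 200, tol = 1e-8)
    beta = zeros(size(X, 2))
    kappa = y .- n ./ 2                               # working response; fixed across iterations
    for _ in 1:iters
        psi = X * beta                                # linear predictor at the current estimate
        omega = pg_mean.(n, psi)                      # E-step: closed-form PG expectations
        beta_new = (X' * (omega .* X)) \ (X' * kappa) # M-step: weighted least squares
        norm(beta_new - beta) < tol && return beta_new
        beta = beta_new
    end
    return beta
end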
4. Computational Enhancements
The EM algorithm introduced in Section 3 has two key computational advantages: each iteration has low cost due to closed-form updates, and the algorithm admits a simple representation in terms of sufficient statistics. Nevertheless, two practical challenges remain in large-scale applications. First, when the sample size N is very large, even inexpensive full-batch iterations can become costly. Second, like most fixed-point algorithms, EM may converge slowly when the likelihood surface is flat or parameters are weakly identified.
In this section, we address these challenges using two complementary strategies. To scale the algorithm to massive datasets, we develop a mini-batch EM scheme based on Robbins–Monro stochastic approximation. To accelerate convergence (even in samples of moderate size), we combine the EM updates with a quasi-Newton extrapolation technique known as SQUAREM. These enhancements target distinct computational bottlenecks and can be used independently or in combination.
4.1. Mini-Batch EM via Robbins–Monro
To reduce computational cost when N is large, we adopt an online EM approach based on stochastic approximation [19], following the general framework of Cappé and Moulines [20]. The key observation is that the EM algorithm derived in Section 3 depends on the data only through additive sufficient statistics, making it particularly well suited for mini-batching.
Recall that the M-step for each model component depends on the data only through additive sufficient statistics, obtained by summing per-observation contributions defined in Section 3. Given a random mini-batch of observations, rescaling the corresponding mini-batch sums to the full sample size yields unbiased estimators of these statistics, where the randomness arises solely from subsampling.
A naive approach would replace the full-data sufficient statistics directly with their mini-batch counterparts. However, although unbiased, these estimators exhibit persistent sampling noise that prevents convergence of the resulting EM iterations. To stabilize the procedure, Cappé and Moulines [20] propose updating the running sufficient statistics using diminishing step sizes, that is,
$S^{(t+1)} = (1 - \gamma_{t+1})\, S^{(t)} + \gamma_{t+1}\, \widehat{S}^{(t+1)},$
where $S^{(t)}$ denotes the current running value of a sufficient statistic, $\widehat{S}^{(t+1)}$ the unbiased mini-batch estimate computed at iteration $t+1$, and the step-size sequence $\{\gamma_t\} \subset (0,1]$ shrinks in such a way that $\sum_{t} \gamma_t = \infty$ and $\sum_{t} \gamma_t^2 < \infty$.
In practice, we use a step-size sequence that is held constant over an initial block of iterations and then decays polynomially in the iteration index. The initial constant phase stabilizes early iterations, while the polynomial decay ensures asymptotic convergence; the specific tuning values used in our experiments are reported in Supplementary File S1.
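As a minimal sketch of this scheme (the step-size constants below are illustrative placeholders, not the values used in the article), the running update and a constant-then-polynomial schedule can be written as:

# Robbins-Monro update of a running sufficient statistic S, given a fresh
# unbiased mini-batch estimate S_hat and a step size gamma in (0, 1].
rm_update(S, S_hat, gamma) = (1 - gamma) .* S .+ gamma .* S_hat

# Illustrative schedule: constant for the first t0 iterations, then polynomial decay
# with exponent kappa in (1/2, 1], so that the step sizes sum to infinity while
# their squares remain summable.
stepsize(t; gamma0 = 0.5, t0 = 50, kappa = 0.7) = t <= t0 ? gamma0 : gamma0 * (t - t0)^(-kappa)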
To further reduce variance and improve finite-sample performance, we apply Polyak averaging [21,22] to the parameter iterates after burn-in. Writing $\theta^{(t)}$ for the parameter iterate at step $t$ and $t_0$ for the burn-in length, the reported estimator is the running average
$\bar{\theta}_T = \frac{1}{T - t_0} \sum_{t = t_0 + 1}^{T} \theta^{(t)},$
which achieves the optimal asymptotic variance for stochastic approximation schemes [20].
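In code, this averaging step is a one-liner (a sketch, where iterates is assumed to be a vector collecting the parameter vectors produced after each mini-batch update):

# Polyak averaging: report the mean of the parameter iterates collected after burn-in.
# iterates is a vector of parameter vectors; t0 is the burn-in length.
polyak_average(iterates, t0) = sum(iterates[(t0 + 1):end]) / (length(iterates) - t0)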
This mini-batch EM procedure preserves the structure of the exact Polya-Gamma EM updates while reducing the per-iteration cost to depend only on the mini-batch size, independent of the total sample size. As a result, the method scales naturally to datasets with millions of observations.
Remark 8.
Mini-batch EM algorithms based on stochastic approximation may require a large number of iterations and can exhibit substantial variability in the raw parameter iterates. For this reason, we recommend using a sufficiently large iteration budget and assessing convergence in terms of the Polyak-averaged estimator rather than the instantaneous iterate. This practice stabilizes inference and aligns with standard recommendations in stochastic approximation.
Remark 9.
Unlike full-batch EM, mini-batch EM does not guarantee a monotone increase in the observed-data likelihood. While monotonicity can in principle be enforced via safeguarding steps, our simulation experiments indicate that the proposed algorithm exhibits stable behavior without such modifications. For this reason, and to preserve computational simplicity, we adopt the unsafeguarded version in our implementation.
4.2. Quasi-Newton Acceleration via SQUAREM
Even in full-batch settings, EM algorithms may converge slowly when the likelihood surface is flat or parameters are weakly identified. To accelerate convergence, we complement our method with a quasi-Newton extrapolation technique known as SQUAREM [23].
Let $M(\cdot)$ denote the EM update mapping induced by one full E- and M-step, so that the standard EM iteration is $\theta^{(t+1)} = M(\theta^{(t)})$. SQUAREM treats EM as a fixed-point iteration and constructs an accelerated update by extrapolating along the EM trajectory. Starting from the current iterate $\theta^{(t)}$, define
$\theta_1 = M(\theta^{(t)}), \qquad \theta_2 = M(\theta_1), \qquad r = \theta_1 - \theta^{(t)}, \qquad v = (\theta_2 - \theta_1) - r.$
Here, $r$ represents the first-order displacement induced by EM, while $v$ captures curvature by measuring deviations from linearity along the EM path. The accelerated update is then given by
$\theta_{\mathrm{acc}} = \theta^{(t)} - 2\alpha\, r + \alpha^2 v,$
where the step size $\alpha$ is chosen to approximately minimize the norm of the fixed-point residual. Following Varadhan and Roland [23], we set
$\alpha = -\,\lVert r \rVert / \lVert v \rVert.$
A safeguarding step ensures monotonicity of the observed log-likelihood, reverting to the standard EM update whenever the extrapolated iterate decreases the likelihood.
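A compact sketch of one such accelerated step, written for a generic fixed-point map M and an observed-data log-likelihood function used only for safeguarding (function names are ours; implementations of SQUAREM typically also apply one further EM step to the extrapolated iterate):

using LinearAlgebra

# One SQUAREM-accelerated step for a fixed-point (EM) map M.
function squarem_step(M, loglik, theta)
    theta1 = M(theta)                      # one EM update
    theta2 = M(theta1)                     # two EM updates
    r = theta1 .- theta                    # first-order displacement
    v = (theta2 .- theta1) .- r            # curvature along the EM path
    alpha = -norm(r) / norm(v)             # steplength of Varadhan and Roland
    theta_acc = theta .- 2 * alpha .* r .+ alpha^2 .* v   # extrapolated iterate
    # Safeguard: revert to the standard EM update whenever the extrapolation
    # decreases the observed-data log-likelihood.
    return loglik(theta_acc) >= loglik(theta2) ? theta_acc : theta2
end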
SQUAREM is particularly well-suited to our Polya-Gamma-based EM algorithm. First, each EM iteration is deterministic and inexpensive, making the cost of additional EM evaluations negligible relative to the gains in convergence speed. Second, the M-step consists of weighted least-squares updates, which vary smoothly with the parameters and favor quasi-Newton extrapolation. Third, the method is entirely generic and requires no modification of the underlying EM structure.
In our empirical experiments, SQUAREM substantially reduces the number of EM iterations required to reach convergence, often by an order of magnitude, while preserving numerical stability. For this reason, we recommend SQUAREM as the default acceleration strategy when fitting the pogit model in moderate to large samples.
5. Simulation Study
This section evaluates the proposed EM algorithm along three complementary dimensions. First, we study its finite-sample estimation behavior under moderate sample sizes, focusing on signal recovery and dispersion across Monte Carlo replications. Second, we assess numerical robustness with respect to the negative-binomial approximation parameter underlying the Polya-Gamma augmentation. Third, we examine computational scalability in large samples and quantify the gains achieved by standard acceleration techniques. Together, these experiments are designed to validate the statistical accuracy, numerical stability, and practical scalability of the proposed estimation framework.
Remark 10.
All simulations were conducted on a desktop computer equipped with an Intel® Core™ i7-9750H CPU (6 cores, 2.60GHz) and 16 GB of RAM, running Windows 11 (64-bit).
5.1. Finite-Sample Estimation Behavior
We begin by examining the finite-sample behavior of the EM estimator in moderate sample sizes. In all scenarios, each model component includes nine covariates. For each observational unit i, the covariate vectors are generated independently, and a common scaling factor is applied so that, if all regression coefficients were set equal to one, the resulting linear predictors would have variance of moderate magnitude. This normalization keeps the linear predictors within a numerically stable range for the logistic and exponential link functions throughout the simulations.
The true parameter vectors are fixed so that only the first two coefficients in each vector are nonzero. This design induces a sparse signal structure and allows us to assess signal recovery.
Throughout the simulation study, the exposure term is fixed at a common constant for all i. We consider two sample sizes: N = 500 (Scenario 1) and a larger sample size (Scenario 2). For each scenario, independent datasets are generated from the data-generating process described above. In each replication, the EM algorithm is initialized randomly and iterated until convergence, with a maximum of 1000 iterations allowed. Convergence is declared when the maximum relative change across all parameter components between successive iterations falls below a fixed tolerance.
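For concreteness, the data-generating process can be sketched as follows; the covariate distribution, coefficient values, and exposure constant shown here are illustrative placeholders rather than the exact simulation settings.

using Distributions, Random

Random.seed!(1)

N, p = 500, 9                                # Scenario 1 sample size; nine covariates per component
beta  = vcat([1.0, 1.0], zeros(p - 2))       # allocation (reporting) coefficients: sparse signal
theta = vcat([1.0, 1.0], zeros(p - 2))       # intensity coefficients: sparse signal
E = 1.0                                      # common exposure term (placeholder value)

W = 0.3 .* randn(N, p)                       # scaled covariates for the reporting component
X = 0.3 .* randn(N, p)                       # scaled covariates for the intensity component

prob = 1 ./ (1 .+ exp.(-(W * beta)))         # reporting probabilities (logistic link)
lam  = E .* exp.(X * theta)                  # latent Poisson intensities (log link)

n = rand.(Poisson.(lam))                     # latent exposure counts
y = rand.(Binomial.(n, prob))                # observed (under-reported) counts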
Figure 2 summarizes the resulting Monte Carlo distributions of the EM estimates. The nonzero coefficients in both model components are accurately recovered in both scenarios, while coefficients that are truly zero remain tightly concentrated around zero. Increasing the sample size from 500 to the larger value of Scenario 2 leads to a visible reduction in dispersion across replications, indicating improved estimator concentration in larger samples.
5.2. Robustness to the Approximation Parameter
Having established stable finite-sample behavior, we next examine robustness with respect to the tuning parameter r controlling the accuracy of the negative-binomial approximation used to obtain a Polya-Gamma augmentation of the Poisson component. While larger values of r yield a closer approximation to the Poisson likelihood, excessively large choices may introduce unnecessary numerical overhead and potential numerical issues. Evaluating sensitivity to this parameter is, therefore, important for practical implementation.
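For intuition on the role of r: a negative-binomial distribution with mean lambda and size parameter r converges to the Poisson(lambda) distribution as r grows, which is what makes the Polya-Gamma augmentation applicable to the Poisson component. A quick numerical check with illustrative values:

using Distributions

lam = 3.0
for r in (5, 20, 100, 1000)
    nb = NegativeBinomial(r, r / (r + lam))   # size r, success probability r/(r+lam); mean = lam
    po = Poisson(lam)
    maxdiff = maximum(abs(pdf(nb, k) - pdf(po, k)) for k in 0:30)
    println("r = $r: max pmf difference = $(round(maxdiff, sigdigits = 3))")
end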
Figure 3 reports Monte Carlo boxplots for selected coefficients (indices 1 and 2) of both model components across a grid of values of r, with the exposure term held fixed for all i. Across all panels, the empirical distributions remain stable as r varies. While Monte Carlo variability is present, no systematic shifts or trends are observed as r increases from relatively small to moderately large values.
These results indicate that the proposed EM algorithm is not unduly sensitive to the precise choice of the tuning parameter, provided that r is chosen large enough to yield an accurate quadratic approximation of the Poisson component. In particular, the moderate value that we adopt as a default in subsequent experiments already yields estimation behavior comparable to that observed for larger values.
Remark 11.
To complement the results of this particular experiment, we include three additional figures in Appendix C showing the sensitivity of the estimates’ standard errors, the observed log-likelihood, and the effective number of iterations to r. Together, these figures show that the effective number of iterations grows with r, whereas the standard errors and the log-likelihood stabilize once r is moderately large. These supplementary results reinforce our practical recommendation of a moderate default value for r.

5.3. Computational Scalability and Acceleration
Having established finite-sample accuracy and numerical robustness, we now turn to computational performance. These experiments focus on scalability in large samples and are designed to assess whether the proposed EM algorithm and its accelerated variants remain practical in regimes where existing pogit implementations become computationally prohibitive.
We consider three complementary comparisons. First, we benchmark the proposed deterministic EM algorithm against direct maximum-likelihood (ML) optimization of the observed-data likelihood. Second, we evaluate the impact of mini-batch estimation via Robbins–Monro updates on runtime scalability. Third, we assess the gains from quasi-Newton acceleration using SQUAREM. In all cases, the methods target the same likelihood and are run under identical initialization and stopping rules. For the Robbins–Monro mini-batch EM variant, step-size sequences were chosen according to standard diminishing-step-size conditions; all tuning parameters are reported in Supplementary File S1.
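For reference, the direct-ML baseline maximizes the observed-data likelihood in which the latent count has been marginalized out: a binomial observation whose latent number of trials is Poisson-distributed is marginally Poisson with the rate thinned by the reporting probability, so the objective takes the form sketched below. The sketch is written in Julia with Optim.jl's BFGS as a stand-in optimizer and generic variable names of our own; the article's baseline uses BFGS as provided by the R routine stats::optim.

using Optim, SpecialFunctions

# Observed-data negative log-likelihood of the pogit model after marginalizing the
# latent count: y_i ~ Poisson(E_i * exp(x_i' * theta) * logistic(w_i' * beta)).
# Assumes the two design matrices X and W have the same number of columns.
function pogit_negloglik(par, y, X, W, E)
    p = size(W, 2)
    beta, theta = par[1:p], par[(p + 1):end]
    prob = 1 ./ (1 .+ exp.(-(W * beta)))
    rate = E .* exp.(X * theta) .* prob
    return -sum(y .* log.(rate) .- rate .- loggamma.(y .+ 1))
end

# Illustrative call (y, X, W, E as in the data-generating sketch of Section 5.1):
# res     = optimize(par -> pogit_negloglik(par, y, X, W, E), zeros(2 * size(W, 2)), BFGS())
# par_hat = Optim.minimizer(res)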
Figures 4–6 report total runtime (measured in seconds) as a function of sample size (measured in hundreds of thousands of observations). Figure 4 compares the proposed EM algorithm with direct ML optimization (BFGS via the R function stats::optim()); Figure 5 compares standard EM with its SQUAREM-accelerated version; and Figure 6 compares full-batch EM with a Robbins–Monro mini-batch variant.
Across all methods, runtime grows approximately linearly with sample size over the ranges considered. The proposed EM algorithm consistently outperforms direct ML optimization, with a substantial runtime advantage at every sample size considered. The Robbins–Monro variant further reduces runtime in very large samples by lowering per-iteration cost, while SQUAREM substantially decreases the number of iterations required for convergence. These gains are achieved without compromising numerical stability, and the SQUAREM safeguard preserves likelihood monotonicity.
Overall, the simulation results demonstrate that the proposed Polya-Gamma-based EM algorithm combines stable finite-sample behavior, robustness to approximation choices, and scalability to large datasets. These properties address key practical limitations of existing pogit implementations and support the use of the method in modern large-scale applications.
Finally, to facilitate transparency and reproducibility of the computational evidence reported above, we provide the Julia (version 1.12.3) and R (version 4.4.2) code used to generate all simulation results in Supplementary File S1.
6. Real-World Application
We apply the proposed pogit model, estimated via the Polya-Gamma-based EM framework, to an openly available dataset derived from Amazon purchase histories [24], with a specific focus on purchases of Apple products within the technology and electronics category. The purpose of this empirical application is to illustrate the use of the model and the practical implementation of the proposed estimation procedure in a realistic, large-scale empirical setting.
This application is not intended to demonstrate superior empirical fit or to provide a comprehensive benchmarking exercise. Rather, it is designed to illustrate the feasibility, interpretability, and numerical stability of the proposed EM framework when applied to a challenging publicly available dataset characterized by severe partial observability and heterogeneous consumer behavior.
The dataset combines detailed longitudinal Amazon purchase histories with respondent-level demographic and household information for users in the United States. Data were collected through a consent-centric crowdsourcing protocol, under which participants voluntarily provided exports of their personal Amazon purchase histories spanning 1 January 2018 to 19 March 2023, together with survey-based sociodemographic information. This design yields a rich observational dataset linking transactional behavior over time to individual and household characteristics, while ensuring informed consent and preserving user privacy.
A comprehensive descriptive and exploratory analysis of this dataset has been previously reported by Berke et al. [25]. The dataset and its construction are described in detail in Berke et al. [24], and the data have since been used in several independent studies examining distinct dimensions of consumer behavior (e.g., [26,27,28]). Building on this established empirical foundation, the present study does not revisit descriptive statistics or exploratory analyses previously reported in the literature, and instead proceeds directly to the model-based analysis under partial observability.
In the present study, the dataset is used under a deliberately constructed information scenario that mirrors the decision environment faced by a focal firm. Although the underlying data contain transactions for multiple brands within the technology and electronics category, the estimation strategy intentionally restricts the information available for model fitting to Apple-specific transactional histories observed on Amazon, together with respondent-level sociodemographic and household characteristics. This restriction is methodological rather than data-driven and reflects the realistic constraint that firms typically lack access to competitors’ sales and to consumers’ total category expenditure. Within this setting of partial observability, the pogit model illustrates how latent wallet allocation and underlying demand intensity can be inferred using focal-firm transaction data alone.
The model specification captures two distinct latent behavioral mechanisms. An intensity component governs the customer’s overall demand for technology and electronics, defined on a normalized latent scale and corresponding to the Size-of-Wallet (SioW). An allocation component governs the fraction of this latent category demand allocated to the focal firm, corresponding to the share-of-wallet (SoW). Identification under partial observability is achieved through the joint hierarchical structure of the model and a deliberate separation of covariates across these latent components. Apple-specific transactional and behavioral features enter the allocation mechanism, whereas broader sociodemographic and household characteristics enter the intensity mechanism.
In this empirical application, all customers are observed over the same calendar window and are, therefore, assumed to face identical exposure. Accordingly, the offset term is fixed at a common constant for all i, and cross-sectional heterogeneity in overall purchasing activity is captured entirely through the latent intensity component.
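Schematically, and in generic notation of our own rather than the article's exact symbols, the two latent mechanisms combine as
$y_i \mid n_i \sim \mathrm{Binomial}(n_i,\, p_i), \qquad n_i \sim \mathrm{Poisson}(E\,\lambda_i),$
$\mathrm{logit}(p_i) = w_i^{\top}\beta \ \ \text{(allocation, SoW)}, \qquad \log(\lambda_i) = x_i^{\top}\theta \ \ \text{(intensity, SioW)},$
where $w_i$ collects the Apple-specific transactional features, $x_i$ collects the sociodemographic and household characteristics, and $E$ is the common exposure constant.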
To operationalize the model in a count-data setting, we transform Apple expenditure observed on Amazon during the evaluation window (1 November 2021 to 19 March 2023) using a variance-to-mean scaling motivated by the moment structure of the Poisson distribution; a model-based and data-driven justification of this transformation is provided in Appendix D. Specifically, total monetary spending is rescaled by a constant c, estimated exclusively from the training sample, and subsequently rounded to obtain a count-like response variable compatible with the pogit specification. This transformation induces a normalized discrete scale shared by the observed response and the latent exposure, thereby allowing monetary purchase volumes to be modeled coherently within a Poisson–binomial structure. Importantly, neither the observed response nor the latent exposure should be interpreted as a literal count of physical transactions, orders, or items; both quantities represent discretized proxies for relative spending intensity defined on a common latent scale.
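As a minimal sketch of this rescaling step, assuming that the constant c is the variance-to-mean ratio of training-sample spending (one natural reading of the description above; the estimator of c actually used is specified in Appendix D):

using Statistics

# Hypothetical monetary spend vectors (training sample and full sample).
train_spend = [120.0, 0.0, 340.5, 89.9, 0.0, 710.0]
all_spend   = [120.0, 0.0, 340.5, 89.9, 0.0, 710.0, 55.0, 430.0]

# Variance-to-mean scaling: for a Poisson variable the variance equals the mean,
# so dividing spend by c = Var/Mean (estimated on the training sample only)
# yields a rescaled variable whose variance roughly matches its mean.
c = var(train_spend) / mean(train_spend)

# Count-like response on the normalized discrete scale.
y = round.(Int, all_spend ./ c)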
Predictors are computed exclusively from transactions in the training window (1 January 2018 to 31 October 2021), whereas the response is computed exclusively from transactions in the evaluation window (1 November 2021 to 19 March 2023); the train/test partition is defined at the respondent level and applied consistently across both windows. Categorical predictors encoded as single-selection factors are included using an omitted reference level, which defines the reference group, whereas for multiple-selection categorical variables all indicator categories are retained. All quantitative covariates are standardized prior to estimation. For quantitative transactional predictors, correlation-based screening is performed using Spearman rank correlations to accommodate potential nonlinearity and heavy-tailed distributions. All data-adaptive preprocessing steps, including winsorization thresholds, standardization parameters, and correlation-screening rules, are learned exclusively on the training sample and subsequently applied unchanged to the held-out test sample, thereby preventing information leakage.
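The leakage-free protocol can be sketched as follows; the variable names, the screening target (here the training response), and the screening threshold are illustrative assumptions rather than the article's actual configuration, which is documented in Supplementary File S2.

using Statistics, StatsBase

# Hypothetical training/test design matrices (rows = respondents) and training response.
X_train = randn(100, 5); y_train = rand(0:10, 100)
X_test  = randn(40, 5)

# Standardization parameters learned on the training sample only ...
mu = vec(mean(X_train, dims = 1))
sd = vec(std(X_train, dims = 1))

# ... and applied unchanged to both samples.
Z_train = (X_train .- mu') ./ sd'
Z_test  = (X_test  .- mu') ./ sd'

# Spearman-correlation screening of quantitative predictors, again using training data only.
rho  = [corspearman(X_train[:, j], y_train) for j in 1:size(X_train, 2)]
keep = findall(abs.(rho) .>= 0.05)            # illustrative threshold

Z_train_kept = Z_train[:, keep]
Z_test_kept  = Z_test[:, keep]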
Although the full Amazon dataset permits observation of technology and electronics spending across multiple brands within the Amazon channel during the evaluation window, information on non-Apple spending is used exclusively ex post as a channel-restricted benchmark for validation and contextualization of the model-implied SoW.
To ensure transparency and reproducibility of the empirical application, we provide a complete technical report in Supplementary File S2, including data pre-processing procedures, design matrices, coefficient estimates, and detailed catalogs of predictor and response variables. Reproducible implementations of the real-data application in Julia and R are provided in Supplementary File S3.
Against this methodological background, Table 1 and Table 2 summarize the main empirical patterns recovered by the pogit model under the full specification and after covariate selection, respectively. The full specification incorporates the complete set of covariates considered in the empirical analysis. It reveals that most of the model's explanatory content is concentrated in a relatively small subset of predictors related to Apple-specific transactional behavior and household characteristics, most notably RFM variables, income categories, and platform usage intensity, while the remaining covariates contribute limited additional explanatory power. This motivates a more parsimonious specification to facilitate interpretation.
In the full specification (Table 1), variation in SoW is primarily driven by Apple-specific transactional behavior. Both Frequency and Monetary enter with positive and statistically significant coefficients, indicating that customers who purchase Apple products more frequently and spend more on the focal brand allocate a larger share of their latent category demand to Apple, whereas Recency does not reach statistical significance once the full set of transactional and demographic controls is included. Heterogeneity in SioW is mainly associated with household income and platform usage intensity: the Income 100–149k category exhibits a positive and statistically significant association with latent purchase intensity, while the highest income group does not reach conventional significance levels. Platform engagement plays a central role, with accounts shared by three individuals (Use = 3) displaying substantially higher latent category demand. In addition, the life-event indicator Life: Became pregnant enters with a negative and statistically significant coefficient, suggesting a temporary reduction in latent purchasing intensity, whereas other demographic and life-event controls do not exhibit statistically significant effects.
In the parsimonious specification (Table 2), SoW is sharply characterized by Apple-specific behavioral variables in the allocation component. Frequency and Monetary remain the dominant determinants, with positive and statistically significant coefficients, confirming the central role of repeated purchasing and cumulative spending in shaping wallet allocation toward the focal firm. In contrast to the full model, Recency now enters with a negative and statistically significant coefficient, indicating that more recent Apple purchases are associated with a higher share of wallet once irrelevant covariates are removed. Under the same parsimonious specification, heterogeneity in SioW is primarily driven by household income and platform usage intensity: the Income 50–74k and Income 100–149k categories are positively and significantly associated with latent category demand, while platform engagement remains a dominant factor, with accounts shared by three individuals (Use = 3) exhibiting a strong and highly significant positive association with intensity. Consistent with the full specification, the life-event indicator Life: Became pregnant retains a negative and statistically significant effect, whereas other retained income and usage categories do not reach conventional significance levels.
7. Conclusions
This paper develops a scalable expectation-maximization algorithm for the Poisson-logistic (pogit) model, a classical framework for count data subject to partial observability. By exploiting a Polya-Gamma data augmentation, the proposed approach yields a quadratic complete-data log-likelihood and closed-form updates in both the E- and M-steps. As a result, each EM iteration is computationally inexpensive, numerically stable, and well-suited to large datasets.
The primary contribution is methodological. Unlike existing frequentist and Bayesian approaches, which typically rely on generic numerical optimization or Markov chain Monte Carlo methods, the proposed EM formulation scales naturally with sample size and covariate dimension. The algorithm admits deterministic full-batch updates, mini-batch variants based on Robbins–Monro stochastic approximation, and quasi-Newton acceleration via SQUAREM, all targeting the same observed-data likelihood. Simulation experiments confirm stable finite-sample behavior, robustness to the negative-binomial approximation underlying the Polya-Gamma construction, and substantial computational gains relative to direct maximum-likelihood optimization.
The proposed framework also suggests several directions for future research. First, the quadratic structure of the M-step makes the algorithm particularly amenable to regularization in high-dimensional settings. Penalized extensions based on sparsity-inducing penalties or global–local shrinkage priors, such as the horseshoe, could be incorporated either through penalized M-steps or through additional latent-variable augmentations. Second, the Polya-Gamma construction naturally links the present EM approach to variational Bayes methods: variational approximations built on similar augmentation schemes may yield fast approximate Bayesian procedures that complement the EM algorithm developed here.
More broadly, the results highlight the role of Polya-Gamma augmentation as a unifying computational device for efficient inference in models combining nonlinear link functions and discrete outcomes. By making frequentist estimation of pogit models feasible at scale, the proposed EM algorithm expands the practical applicability of these models in modern empirical settings.
Finally, it is worth noting that the proposed EM estimator targets the maximum–likelihood solution of the pogit model, and therefore, inherits the usual asymptotic properties of likelihood–based inference under standard regularity conditions. While Bayesian Polya-Gamma formulations naturally provide finite-sample posterior uncertainty, the present approach relies on asymptotic inference based on the observed Fisher information, yielding a simple and scalable frequentist alternative suited to large datasets.