Article

Poisson Mixed-Effects Count Regression Model Based on Double SCAD Penalty and Its Simulation Study

School of Science, Hubei University of Technology, Wuhan 430068, China
* Authors to whom correspondence should be addressed.
Axioms 2026, 15(3), 214; https://doi.org/10.3390/axioms15030214
Submission received: 24 January 2026 / Revised: 7 March 2026 / Accepted: 10 March 2026 / Published: 12 March 2026

Abstract

This paper focuses on variable selection and parameter estimation for mixed-effects Poisson count regression models. To simultaneously select important variables in both fixed effects and random effects, we propose a double-penalized Poisson count regression model with the Smoothly Clipped Absolute Deviation (SCAD) penalty imposed on both components. To estimate the unknown parameters, we develop a new iterative algorithm called the Double SCAD–Local Quadratic Approximation (DSCAD-LQA) algorithm. Under regularity conditions, the consistency and Oracle property of the proposed estimator are established. Simulation studies are conducted under two types of penalty parameter selection criteria: the Schwarz Information Criterion (SIC) and the Generalized Approximate Cross-Validation (GACV). We evaluate the performance of the proposed method under different levels of correlation among explanatory variables and different covariance structures of random effects. Comparisons are also carried out with the non-penalized model, the single-penalized model, and the double LASSO-penalized model. The results demonstrate that the proposed double SCAD penalty method performs better than the other three methods in terms of important variable selection and coefficient estimation, and is especially effective for sparse models.

1. Introduction

With the transformation of modern production and manufacturing towards intelligence and refinement, the accurate modeling of various count-type quality indicators in the production process (such as the number of defects, failure counts, and counts of non-conforming parts) is a core demand for ensuring production stability and reducing costs. These indicators are inherently non-negative integers, a defining feature of count data. Traditional linear models, such as ordinary least squares (OLS) regression, are not suitable for this type of data because they assume the response variable follows a continuous, normal distribution, which directly contradicts the discrete, non-negative nature of count outcomes.
Meanwhile, manufacturing is a continuous process, and observations from the same production line, equipment, or batch are longitudinally correlated over time. Traditional linear models assume independence among observations, and ignoring this inherent correlation can lead to biased parameter estimates and undermine the reliability of the model. Common modeling approaches for such longitudinal count data include marginal models [1] and mixed-effects models [2,3], among which the mixed-effects Poisson count model with random effects is the most widely used. The primary reason is that longitudinal count samples are non-independent, with a certain degree of correlation among observations from the same unit, and mixed-effects models accommodate this correlation. They also reflect the practical scenario in which the production units selected for data collection are randomly sampled, each associated with distinct raw materials and operators [4,5]. Since the research objective focuses on the parameter characteristics of the overall population rather than individual units, incorporating unit-specific individual effects as random effects into the model enhances modeling accuracy and flexibility [6].
On the other hand, practical modeling often confronts the challenge of high-dimensional data, where the number of variables is close to or even exceeds the sample size [7]. The curse of dimensionality makes variable selection indispensable, and various penalization methods enable concurrent parameter estimation and variable selection [8]. The LASSO (Least Absolute Shrinkage and Selection Operator) penalty proposed by Tibshirani [9] has been extremely widely adopted. It not only estimates the regression parameters in the model but also shrinks the coefficients of irrelevant variables to zero, thereby achieving the goal of variable selection. However, the LASSO penalty tends to introduce estimation bias for large coefficients. To address this limitation, Fan and Li [10] proposed a novel SCAD (Smoothly Clipped Absolute Deviation) penalty based on the LASSO method. This non-convex penalization approach inherits the advantages of the LASSO penalty while possessing favorable statistical properties: it yields sparse solutions, and the resulting estimates satisfy asymptotic normality.
Since its introduction, the SCAD penalty has been widely applied to various models, such as quantile regression [11] and logistic regression [12,13]. For Poisson regression models, Yan and Chen [14] adopted the SCAD penalty for variable selection, while Buu [15] further extended its application to zero-inflated Poisson regression models for the same purpose. Li [16] employed the SCAD penalty to perform variable selection for high-dimensional random-effects linear regression models and high-dimensional random-effects quantile regression models; Li and He [17] also implemented the SCAD penalty for variable selection in high-dimensional random-effects linear regression models. Liu [18] constructed a SCAD-regularized heterogeneous autoregressive model for volatility forecasting. Liu and Chen [19] proposed an analytical framework integrating self-supervised learning with SCAD-Net regularization. Liu, Wang, and Wu [20] conducted research on SCAD-type variable selection and estimation in censored regression models. Zhang and Dong [21] carried out a comparative study on the performance of six variable selection methods—LASSO, Ridge Regression, Elastic Net, SCAD, Bridge, and Adaptive Bridge—in Poisson regression, and verified that the SCAD method exhibits the best performance in terms of model goodness-of-fit and prediction accuracy, making it suitable for large-sample and high-correlation datasets.
Moreover, all the aforementioned applications of the SCAD penalty target only the fixed-effects component of the respective models [22]. For mixed-effects Poisson regression models, performing variable selection solely on the fixed effects while ignoring the random effects introduces bias into the estimation of the fixed effects; conversely, incorporating an excessive number of irrelevant random effects may render the random-effects covariance matrix singular. To address this problem, Ibrahim et al. [23] imposed the SCAD penalty on both the fixed and random effects of generalized linear mixed-effects models for variable selection, with parameters solved via the LLA (Local Linear Approximation) algorithm. However, their method penalizes the variances of the random effects and requires the random effects to follow a normal distribution. This penalization scheme has a drawback: when the variance of a random effect is shrunk to zero, the estimates for all individuals under that random effect are forced to zero, which neglects the potentially non-zero random effects of some individuals.
Against this backdrop, this paper directly penalizes the random effects and proposes a double SCAD-penalized mixed-effects Poisson count regression model that enables simultaneous selection of both fixed and random effects. This penalization framework imposes no constraints on the variances of random effects. Furthermore, the parameter estimation algorithm in this paper is based on the LQA (Local Quadratic Approximation) algorithm [10], which yields more accurate estimates compared with the LLA algorithm.
The model proposed in this paper differs fundamentally from the method of Ibrahim et al. [23] in penalization target, regularization logic, and hierarchical modeling strategy. Although Ibrahim et al. [23] applied the SCAD penalty to both fixed and random effects in generalized linear mixed-effects models, their penalization target was the variance of the random effects. The corresponding regularization logic conducts variable selection by shrinking the variances; when a variance is shrunk to zero, the estimates for all individuals under that random effect are forced to zero, a global selection scheme that fails to capture individual-level heterogeneity. Their hierarchical modeling requires the random effects to follow a normal distribution and adopts the LLA (Local Linear Approximation) algorithm for parameter estimation.
In contrast, this paper directly takes fixed-effect coefficients and random-effect coefficients as the penalization targets and uses the double SCAD penalty to achieve simultaneous selection of both types of effects. The regularization process does not rely on variance shrinkage, allowing accurate identification and retention of meaningful individual random effects. In terms of hierarchical modeling, no distributional constraint is imposed on the variances of random effects, and the more accurate LQA (Local Quadratic Approximation) algorithm is employed for parameter estimation.
When applied to cigarette production data from multiple units, this model simultaneously penalizes the numerous unknown random effects of individual units and the fixed effects of various process parameters. It not only identifies a small set of key process indicators affecting the number of short (under-length) cigarettes but also accurately estimates their impact weights, thereby providing valuable insights for the further adjustment and optimization of process parameters. In fact, similar double-penalization ideas have been explored in the literature; for instance, Luo et al. [24] investigated double regularization penalties for mixed-effects models and proved the convergence of the iterative algorithm.
Most existing penalization strategies for Poisson mixed-effects count regression models are imposed on the variance components of random effects. Such approaches only achieve global shrinkage and overall selection of random effects, and suffer from two critical limitations:
First, they cannot perform independent and precise variable selection on the individual random-effect coefficients $\alpha_i$. In high-dimensional settings, retaining redundant individual effects or excluding key heterogeneous effects tends to introduce substantial fitting bias. Second, when the variance of a random effect is shrunk to zero, the estimates for all individuals under that effect are forced to zero, ignoring the potential heterogeneity whereby some individuals may still exhibit non-zero random effects. This is incompatible with a characteristic of count data in practice: the overall trend is consistent while individuals show specific heterogeneity. Meanwhile, such methods require the random effects to follow a normal distribution, which further restricts their applicability to high-dimensional, non-normal count data.
In contrast, penalizing the random-effect coefficients $\alpha_i$ directly overcomes the above limitations. It enables simultaneous and refined variable selection for both fixed effects and random effects, effectively removing irrelevant individual random effects while preserving meaningful individual heterogeneity. Moreover, it imposes no distributional constraints on the variances of the random effects, making it better suited for modeling high-dimensional count-type quality indicators in manufacturing, biostatistics, and other fields. It thus fills the gap left by traditional penalization strategies in accurately capturing individual heterogeneity and adapting to high-dimensional scenarios.
Section 2 presents the general form of the mixed-effects Poisson count regression model. Based on the likelihood function, a novel double-penalization method is proposed by simultaneously imposing the SCAD penalty on both fixed-effects and random-effects regression coefficients. This section also introduces the iterative local quadratic approximation algorithm for model parameter estimation and discusses the selection criteria for optimal penalty parameters.
Section 3 provides rigorous proofs of the relevant theoretical properties of the proposed method.
Section 4 conducts comparisons between the proposed method and traditional methods via Monte Carlo simulations, evaluating their performance in variable selection and regression coefficient estimation under various scenarios.
Section 5 demonstrates the practical application scenarios of the proposed method, illustrating its applicability in real-world contexts.
Finally, Section 6 summarizes the main findings of the paper and offers concluding remarks.

2. Double-Penalty Poisson Count Regression Model

2.1. Double-Penalty Poisson Count Regression Model with SCAD Penalty

2.1.1. Model Establishment

To ensure compatibility of matrix operations in the model and facilitate verification and reproducibility, this section first standardizes all core variables, the meanings of subscripts, and dimensions involved in the model. The specific definitions are as follows:
Explanation of Subscripts: $i = 1, 2, \ldots, n$ indexes the observation units, reflecting individual heterogeneity; $j = 1, 2, \ldots, m$ indexes the $j$-th repeated observation within the $i$-th observation unit, reflecting the time/batch correlation of longitudinal data.
Definition of Core Variables and Dimensions:
Response Variable: $y_{ij}$ denotes the count-type response variable of the $j$-th observation for the $i$-th observation unit; in matrix form, $y = (y_{11}, y_{12}, \ldots, y_{1m}, y_{21}, \ldots, y_{nm})^T$.
Fixed-Effect Variables: $x_{ij}$ denotes the fixed-effect explanatory variable vector of the $j$-th observation for the $i$-th observation unit, where the subscript $ij$ corresponds strictly to the response variable $y_{ij}$. Its dimension is $x_{ij} \in \mathbb{R}^k$, where $k$ is the total number of fixed-effect explanatory variables; $\beta$ is the fixed-effect regression coefficient vector, whose dimension matches that of $x_{ij}$, i.e., $\beta \in \mathbb{R}^k$, ensuring compatibility of the linear operation $x_{ij}^T \beta$.
Random-Effect Variables: $z_{ij}$ denotes the random-effect explanatory variable (covariate) vector of the $j$-th observation for the $i$-th observation unit, with subscript $ij$ consistent with those of $y_{ij}$ and $x_{ij}$. Its dimension is $z_{ij} \in \mathbb{R}^p$, where $p$ is the total number of random-effect explanatory variables (covariates); $\alpha_i$ is the random-effect coefficient vector of the $i$-th observation unit, whose dimension matches that of $z_{ij}$, i.e., $\alpha_i \in \mathbb{R}^p$, ensuring compatibility of the linear operation $z_{ij}^T \alpha_i$ (a scalar). The stacked form is $\alpha = (\alpha_1^T, \alpha_2^T, \ldots, \alpha_n^T)^T \in \mathbb{R}^{np \times 1}$.
Supplementary Illustration: $X$ is the fixed-effect design matrix, $X = (x_{11}, x_{12}, \ldots, x_{1m}, x_{21}, \ldots, x_{nm})^T \in \mathbb{R}^{N \times k}$, which matches the fixed-effect coefficient vector $\beta \in \mathbb{R}^k$ and ensures compatibility of the matrix operation $X\beta$; $Z$ is the random-effect design matrix, $Z = \operatorname{diag}(Z_1, Z_2, \ldots, Z_n)$ with $Z_i = (z_{i1}, z_{i2}, \ldots, z_{im})^T \in \mathbb{R}^{m \times p}$ and overall dimension $Z \in \mathbb{R}^{N \times np}$, which matches the random-effect coefficient vector $\alpha \in \mathbb{R}^{np \times 1}$ and ensures compatibility of the matrix operation $Z\alpha$. $N$ denotes the total sample size, equal to $nm$; that is, there are $n$ individuals with $m$ observations per individual.
In the general Poisson count regression model, the response variable $Y$ is assumed to follow a Poisson distribution,
$$P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!} = \exp\{ y \log\lambda - \lambda - \log y! \}, \quad y = 0, 1, 2, \ldots \tag{1}$$
where $\lambda > 0$ and $E(Y) = \lambda$. Using the standard exponential-family form of the generalized linear model, we have
$$\theta = \log\lambda, \quad b(\theta) = \lambda = e^{\theta}, \quad a(\phi) = 1, \quad c(y, \phi) = -\log y!, \quad g(\mu) = g\big( E(Y \mid X) \big) = \log\mu = \log b'(\theta) = \log e^{\theta} = \theta = X\beta \tag{2}$$
By this formulation, the general Poisson count regression model is given by
$$E(Y \mid X) = e^{X\beta} \tag{3}$$
where $X$ is the explanatory variable and $\beta$ is the corresponding coefficient.
Assume that there are $n$ subjects, each observed repeatedly. For the $i$-th subject $(i = 1, \ldots, n)$, let $y_{ij}$ be the response variable of the $j$-th observation $(j = 1, \ldots, m)$. Such clustered or longitudinal data arise frequently in the applied sciences. Model (3) then generalizes to a Poisson count regression model with mixed effects:
$$E(y_{ij} \mid x_{ij}, z_{ij}) = e^{x_{ij}^T \beta + z_{ij}^T \alpha_i} \tag{4}$$
where $x_{ij}$ is a $k \times 1$ vector of explanatory variables with unknown fixed-effect coefficient vector $\beta$, and $z_{ij}$ is a $p \times 1$ vector of covariates with unknown random-effect coefficient vector $\alpha_i$. From Equation (1), the probability mass function of the response variable is
$$p(y_{ij} \mid x_{ij}, z_{ij}) = \frac{\big[ E(y_{ij} \mid x_{ij}, z_{ij}) \big]^{y_{ij}} \, e^{-E(y_{ij} \mid x_{ij}, z_{ij})}}{y_{ij}!} \tag{5}$$
From this, the log-likelihood function simplifies (up to an additive constant) to
$$l(\beta, \alpha) = \sum_{i=1}^{n} \sum_{j=1}^{m} \Big[ y_{ij}\big( x_{ij}^T \beta + z_{ij}^T \alpha_i \big) - e^{x_{ij}^T \beta + z_{ij}^T \alpha_i} \Big] \tag{6}$$
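To make the likelihood concrete, the following is a minimal NumPy sketch (illustrative, not the authors' code) that evaluates Equation (6); the array names `X`, `Z`, `y`, and `groups` are assumptions matching the dimensions defined in Section 2.1.1.

```python
import numpy as np

def poisson_mixed_loglik(beta, alpha, X, Z, y, groups):
    """Unpenalized log-likelihood (6) of the mixed-effects Poisson model.

    beta: (k,) fixed effects; alpha: (n, p) random effects, one row per unit;
    X: (N, k) fixed-effect design; Z: (N, p) rows z_ij; groups: (N,) unit
    index of each row. The constant -log(y_ij!) terms are dropped, as in (6).
    """
    eta = X @ beta + np.sum(Z * alpha[groups], axis=1)  # x_ij^T beta + z_ij^T alpha_i
    return float(np.sum(y * eta - np.exp(eta)))
```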
Considering that the fixed-effect coefficient vector $\beta$ and the random-effect coefficient vectors $\alpha_i$ in the above likelihood are unknown, this paper applies the SCAD penalty to both the fixed and random effects. This strategy takes full advantage of the desirable estimation properties of the SCAD penalty, enabling the model to produce unbiased sparse solutions. It can not only select important explanatory variables but also prevent overfitting caused by including excessive random effects. This leads to the following minimized double SCAD-penalized likelihood function:
$$L(\beta, \alpha) = \min\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \Big[ -y_{ij}\big( x_{ij}^T \beta + z_{ij}^T \alpha_i \big) + e^{x_{ij}^T \beta + z_{ij}^T \alpha_i} \Big] + \sum_{l=0}^{k} p_\lambda(|\beta_l|) + \sum_{i=1}^{n} \sum_{t=1}^{p} p_\lambda(|\alpha_{it}|) \right\} \tag{7}$$
The equation above is equivalent to the following:
$$L(\beta, \alpha) = \max\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \Big[ y_{ij}\big( x_{ij}^T \beta + z_{ij}^T \alpha_i \big) - e^{x_{ij}^T \beta + z_{ij}^T \alpha_i} \Big] - \sum_{l=0}^{k} p_\lambda(|\beta_l|) - \sum_{i=1}^{n} \sum_{t=1}^{p} p_\lambda(|\alpha_{it}|) \right\} \tag{8}$$
where the SCAD penalty function takes the specific form
$$p_\lambda(|\theta_i|) = \begin{cases} \lambda |\theta_i|, & 0 \le |\theta_i| \le \lambda \\ -\dfrac{|\theta_i|^2 - 2a\lambda|\theta_i| + \lambda^2}{2(a - 1)}, & \lambda < |\theta_i| \le a\lambda \\ \dfrac{(a + 1)\lambda^2}{2}, & |\theta_i| > a\lambda \end{cases} \tag{9}$$
where $\lambda > 0$ is the penalty parameter and $a$ is a constant, usually taken at the recommended value $a = 3.7$ [10].
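The SCAD penalty (9) and its first derivative, which drives the LQA weights used later, can be sketched as follows (a minimal implementation assuming the default $a = 3.7$):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|) of Equation (9), evaluated elementwise."""
    t = np.abs(theta)
    linear = lam * t                                           # 0 <= |theta| <= lam
    quad = -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1))  # lam < |theta| <= a*lam
    const = (a + 1) * lam**2 / 2                               # |theta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_derivative(theta, lam, a=3.7):
    """First derivative p'_lambda(|theta|): lam, then (a*lam - t)/(a-1), then 0."""
    t = np.abs(theta)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
```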

2.1.2. Statistical Interpretation and Methodological Comparison

(1) Statistical Interpretation of the Double-Penalized Likelihood
The double-penalized likelihood function in Equation (7) enables simultaneous shrinkage estimation of both the fixed effects $\beta_l$, $l = 1, \ldots, k$, and the individual random-effect coefficients $\alpha_{it}$. Unlike traditional variance-component penalization, which targets the covariance structure of the random effects, our approach penalizes the coefficient values $\alpha_{it}$ directly. This has a clear interpretation in terms of hierarchical borrowing of strength:
For random-effect coefficients that are statistically indistinguishable from zero, the SCAD penalty will shrink them to zero, effectively removing redundant individual-level variation from the model.
For coefficients that represent meaningful heterogeneity, the penalty is milder, allowing the model to retain and exploit within-cluster information while avoiding over-borrowing of between-cluster information.
This local, coefficient-wise shrinkage controls the effective model complexity more precisely, which is critical in high-dimensional settings where many random-effect terms may be spurious or irrelevant.
(2) Comparison with Variance-Parameter Penalization
From Table 1, it can be seen that variance-component penalization is a global operation: when a variance is shrunk to zero, it eliminates all individual-level variation for that effect. In contrast, our double SCAD penalty operates at the level of individual realizations, allowing the model to retain sparse and meaningful heterogeneity while discarding noise. This is particularly valuable in industrial applications, where only a subset of production units may exhibit anomalous behavior, while the rest follow the global trend.

2.2. Iterative Local Quadratic Approximation Algorithm (DSCAD-LQA) for Estimating the Unknown Model Parameters

The parameter estimation algorithm for Equation (8) is mainly based on the local quadratic approximation algorithm (LQA) proposed by Fan and Li (2001) [10]. Different from the conventional LQA, Equation (8) in this paper involves two penalty functions. Therefore, we develop an improved iterative algorithm based on LQA, referred to as the DSCAD-LQA (Double Smoothly Clipped Absolute Deviation–Local Quadratic Approximation). The specific steps of the algorithm are as follows.
Step 1: The initial values $(\beta^{(0)}, \alpha^{(0)})$ of the fixed-effect coefficients $\beta$ and the random-effect coefficients $\alpha$ are first obtained from the unpenalized log-likelihood (6) using the Newton iterative algorithm in steps (1) to (3) below, updating the likelihood value at each iteration. The exact process is as follows:
(1) At the $(s+1)$-th iteration, the $\beta$ value, a $k \times 1$ vector, is updated componentwise by
$$\beta_l^{(s+1)} = \beta_l^{(s)} - \left[ \frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta_l^2} \right]^{-1} \frac{\partial l(\beta, \alpha^{(s)})}{\partial \beta_l} \Bigg|_{\beta_l = \beta_l^{(s)}}, \quad l = 1, \ldots, k.$$
Substitute the updated $\beta$ to refresh the likelihood value.
(2) Then the $\alpha$ value, an $np \times 1$ vector, is updated by
$$\alpha_{it}^{(s+1)} = \alpha_{it}^{(s)} - \left[ \frac{\partial^2 l(\beta^{(s+1)}, \alpha)}{\partial \alpha_{it}^2} \right]^{-1} \frac{\partial l(\beta^{(s+1)}, \alpha)}{\partial \alpha_{it}} \Bigg|_{\alpha_{it} = \alpha_{it}^{(s)}}, \quad t = 1, \ldots, p, \; i = 1, \ldots, n.$$
Substitute the updated $\alpha$ similarly to refresh the likelihood value.
(3) The initial values $(\beta^{(0)}, \alpha^{(0)})$ are obtained by iterating (1) and (2) until
$$\left\| \frac{ l(\beta^{(s+1)}, \alpha^{(s+1)}) - l(\beta^{(s)}, \alpha^{(s)}) }{ l(\beta^{(s)}, \alpha^{(s)}) } \right\| < \varepsilon,$$
where $l(\beta, \alpha)$ is the unpenalized log-likelihood (6).
Step 2: Expand the penalty functions $p_\lambda(|\beta_l|)$ and $p_\lambda(|\alpha_{it}|)$ in a second-order Taylor series around non-zero values:
$$p_\lambda(|\beta_l|) \approx p_\lambda(|\beta_l^{(0)}|) + \frac{1}{2} \Big\{ p'_\lambda(|\beta_l^{(0)}|) / |\beta_l^{(0)}| \Big\} \big( \beta_l^2 - \beta_l^{(0)2} \big), \quad \beta_l \approx \beta_l^{(0)} \tag{10}$$
$$p_\lambda(|\alpha_{it}|) \approx p_\lambda(|\alpha_{it}^{(0)}|) + \frac{1}{2} \Big\{ p'_\lambda(|\alpha_{it}^{(0)}|) / |\alpha_{it}^{(0)}| \Big\} \big( \alpha_{it}^2 - \alpha_{it}^{(0)2} \big), \quad \alpha_{it} \approx \alpha_{it}^{(0)} \tag{11}$$
Differentiating gives
$$\big[ p_\lambda(|\beta_l|) \big]' \approx \Big\{ p'_\lambda(|\beta_l^{(0)}|) / |\beta_l^{(0)}| \Big\} \beta_l, \quad \beta_l \approx \beta_l^{(0)} \tag{12}$$
$$\big[ p_\lambda(|\alpha_{it}|) \big]' \approx \Big\{ p'_\lambda(|\alpha_{it}^{(0)}|) / |\alpha_{it}^{(0)}| \Big\} \alpha_{it}, \quad \alpha_{it} \approx \alpha_{it}^{(0)} \tag{13}$$
Step 3: Substitute the initial values $(\beta^{(0)}, \alpha^{(0)})$ from Step 1 into the double SCAD-penalized objective function (8) and compute the initial objective value $L_M = L(\beta^{(0)}, \alpha^{(0)})$. The final estimates $(\beta_{best}, \alpha_{best})$ are then obtained by the Newton iterative method, starting from $(\beta^{(0)}, \alpha^{(0)})$ and alternating the following updates:
(1) At the $(s+1)$-th iteration, the $\beta$ value, a $k \times 1$ vector, is updated by Equation (14). Substitute the updated $\beta$ to refresh the objective value $L_M = L(\beta^{(s+1)}, \alpha^{(s)})$.
$$\beta_l^{(s+1)} = \beta_l^{(s)} - \left[ \frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta_l^2} + \frac{p'_\lambda(|\beta_l|)}{|\beta_l|} \right]^{-1} \left[ \frac{\partial l(\beta, \alpha^{(s)})}{\partial \beta_l} + \frac{p'_\lambda(|\beta_l|)}{|\beta_l|} \beta_l \right] \Bigg|_{\beta_l = \beta_l^{(s)}}, \quad l = 1, \ldots, k \tag{14}$$
(2) Then the $\alpha$ value, an $np \times 1$ vector, is updated by Equation (15) and substituted to refresh the objective value $L_M = L(\beta^{(s+1)}, \alpha^{(s+1)})$.
$$\alpha_{it}^{(s+1)} = \alpha_{it}^{(s)} - \left[ \frac{\partial^2 l(\beta^{(s+1)}, \alpha)}{\partial \alpha_{it}^2} + \frac{p'_\lambda(|\alpha_{it}|)}{|\alpha_{it}|} \right]^{-1} \left[ \frac{\partial l(\beta^{(s+1)}, \alpha)}{\partial \alpha_{it}} + \frac{p'_\lambda(|\alpha_{it}|)}{|\alpha_{it}|} \alpha_{it} \right] \Bigg|_{\alpha_{it} = \alpha_{it}^{(s)}}, \quad t = 1, \ldots, p, \; i = 1, \ldots, n \tag{15}$$
(3) The final estimates $(\beta_{best}, \alpha_{best})$ are obtained by iterating (1) and (2) until
$$\left\| \frac{ L(\beta^{(s+1)}, \alpha^{(s+1)}) - L(\beta^{(s)}, \alpha^{(s)}) }{ L(\beta^{(s)}, \alpha^{(s)}) } \right\| < \varepsilon,$$
where $L(\beta, \alpha)$ is the objective function (8).
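For illustration, the following sketch wires Steps 1–3 together as a coordinate-wise version of the updates (14)–(15), reusing the `scad_penalty`, `scad_derivative`, and `poisson_mixed_loglik` helpers sketched above. It is a simplified rendering of the DSCAD-LQA scheme under stated assumptions, not the authors' implementation; the small `ridge` term anticipates the Tikhonov stabilization discussed in the next subsection.

```python
import numpy as np

def _newton_sweep(beta, alpha, X, Z, y, groups, lam=None, a=3.7, ridge=1e-4):
    """One coordinate-wise Newton sweep over all beta_l and alpha_it.

    With lam=None this is the unpenalized sweep of Step 1; otherwise the
    LQA weight w = p'(|b|)/|b| implements the penalized updates (14)-(15).
    """
    n_units, p = alpha.shape
    for l in range(X.shape[1]):
        mu = np.exp(X @ beta + np.sum(Z * alpha[groups], axis=1))
        g = np.sum((y - mu) * X[:, l])            # d l / d beta_l
        h = -np.sum(mu * X[:, l] ** 2)            # d^2 l / d beta_l^2
        w = 0.0 if lam is None else scad_derivative(beta[l], lam, a) / max(abs(beta[l]), 1e-8)
        beta[l] -= (g - w * beta[l]) / (h - w - ridge)
    for i in range(n_units):
        rows = groups == i
        for t in range(p):
            mu_i = np.exp(X[rows] @ beta + Z[rows] @ alpha[i])
            g = np.sum((y[rows] - mu_i) * Z[rows, t])
            h = -np.sum(mu_i * Z[rows, t] ** 2)
            w = 0.0 if lam is None else scad_derivative(alpha[i, t], lam, a) / max(abs(alpha[i, t]), 1e-8)
            alpha[i, t] -= (g - w * alpha[i, t]) / (h - w - ridge)

def dscad_lqa(X, Z, y, groups, n_units, lam, a=3.7, max_iter=50, eps=1e-6):
    """DSCAD-LQA sketch: Step 1 initialization, then alternating Step 3 updates."""
    beta = np.zeros(X.shape[1])
    alpha = np.zeros((n_units, Z.shape[1]))
    for _ in range(10):                       # Step 1: unpenalized Newton sweeps
        _newton_sweep(beta, alpha, X, Z, y, groups, lam=None)
    obj_old = np.inf
    for _ in range(max_iter):                 # Step 3: penalized updates (14)-(15)
        _newton_sweep(beta, alpha, X, Z, y, groups, lam=lam, a=a)
        obj = (-poisson_mixed_loglik(beta, alpha, X, Z, y, groups)
               + np.sum(scad_penalty(beta, lam, a))
               + np.sum(scad_penalty(alpha, lam, a)))   # objective (7)
        if abs(obj - obj_old) < eps * max(abs(obj_old), 1.0):
            break
        obj_old = obj
    beta[np.abs(beta) < 1e-6] = 0.0           # treat negligibly small
    alpha[np.abs(alpha) < 1e-6] = 0.0         # coefficients as exact zeros
    return beta, alpha
```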

Analysis of Computational Properties and Numerical Stability of the Algorithm

The iterative process of DSCAD-LQA relies fundamentally on the Newton update mechanism; the fixed-effects update (cf. Equation (14)) can be written as
$$\beta_l^{(s+1)} = \beta_l^{(s)} - \left[ \frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta_l^2} + \frac{\partial p_\lambda(|\beta|)}{\partial \beta} \right]^{-1} \left[ \frac{\partial l(\beta, \alpha^{(s)})}{\partial \beta_l} + \frac{\partial p_\lambda(|\beta|)}{\partial \beta} \right] \Bigg|_{\beta_l = \beta_l^{(s)}}, \quad l = 1, \ldots, k$$
where $\partial l / \partial \beta$ and $\partial^2 l / \partial \beta^2$ denote the gradient and Hessian of the unpenalized log-likelihood, respectively, and $\partial p_\lambda(|\beta|) / \partial \beta$ is the derivative of the double SCAD penalty.
(1) The Relationship Between Algorithm Convergence and the Matrix Condition Number
The convergence of DSCAD-LQA mainly depends on the positive definiteness and the condition number $\kappa(\cdot)$ of the modified Hessian matrix
$$\frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta^2} + \frac{\partial p_\lambda(|\beta|)}{\partial \beta}.$$
The condition number of the modified Hessian matrix is defined as the ratio of its largest to its smallest eigenvalue, i.e.,
$$\kappa\!\left( \frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta^2} + \frac{\partial p_\lambda(|\beta|)}{\partial \beta} \right) = \frac{\lambda_{\max}}{\lambda_{\min}}.$$
Its core influence laws are as follows:
When $k$ or $p_n$ is small (the low-dimensional scenario), the modified Hessian matrix is well conditioned and positive definite, with condition number $\kappa \in (1, 10^2)$, and the Newton iteration converges rapidly. In this paper, the convergence threshold is set to $\|\beta^{(s+1)} - \beta^{(s)}\|_2 < 10^{-6}$, which is usually satisfied within 5–10 iterations in low-dimensional scenarios.
As $k$ or $p_n$ increases (the high-dimensional scenario), the modified Hessian matrix tends to become nearly singular, with condition number $\kappa > 10^3$; direct matrix inversion then leads to oscillatory iterations and slower convergence.
To address this issue, the Tikhonov regularization method is adopted in this paper to modify the Hessian matrix, giving the regularized matrix
$$\frac{\partial^2 l(\beta, \alpha^{(s)})}{\partial \beta^2} + \frac{\partial p_\lambda(|\beta|)}{\partial \beta} + \mu I,$$
where $\mu$ is the regularization parameter and $I$ is the identity matrix. This modification constrains the condition number below $10^2$ in high-dimensional scenarios, guaranteeing stable convergence of the iterative procedure. The regularization parameter is set to $\mu = 10^{-4}$, determined by the L-curve method, which effectively improves the condition number of the Hessian matrix while preserving the consistency of the estimates.
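A minimal sketch of this stabilization step (an illustration assuming a symmetric modified Hessian `H`, taken here in positive-definite orientation; the helper name is hypothetical):

```python
import numpy as np

def tikhonov_newton_step(H, g, kappa_max=1e2, mu=1e-4):
    """Tikhonov-stabilized Newton direction: solve (H + mu*I) d = g when needed.

    H is the symmetric modified Hessian and g the penalized gradient.
    mu = 1e-4 follows the L-curve choice described in the text.
    """
    eig = np.linalg.eigvalsh(H)                     # eigenvalues in ascending order
    kappa = abs(eig[-1]) / max(abs(eig[0]), 1e-12)  # condition number
    if kappa > kappa_max or eig[0] <= 0:            # ill-conditioned or indefinite
        H = H + mu * np.eye(H.shape[0])
    return np.linalg.solve(H, g)
```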
(2) Algorithm Scalability Analysis
The scalability of the algorithm mainly reflects the variation in computational efficiency as the sample size $n$, the fixed-effect dimension $k$, and the random-effect dimension $p_n$ increase. Based on the core computational steps of the algorithm, namely gradient evaluation and the construction and inversion of the modified Hessian matrix, the scalability analysis is as follows:
The computational complexity of evaluating the penalized gradient $\partial l(\beta, \alpha^{(s)}) / \partial \beta + \partial p_\lambda(|\beta|) / \partial \beta$ is $O(nk + np_n)$, which depends mainly on the linear combination of the sample size and the variable dimensions; doubling $n$ or $k$ increases the cost approximately linearly.
The construction complexity of the modified Hessian matrix is $O(nk^2 + np_n^2)$, and its inversion complexity is $O(k^3)$, which forms the main computational bottleneck of the algorithm.
For high-dimensional scenarios ($k > 100$ or $p_n > 500$), this paper employs the conjugate gradient method in place of direct matrix inversion, reducing the solve cost to $O(k^2)$, as sketched below. This strategy significantly improves the scalability of the algorithm for high-dimensional data and guarantees that the iterations remain efficient even when $k$ grows to 500.
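A sketch of the conjugate-gradient replacement for the direct inverse (assuming a symmetric positive definite system, e.g., after the Tikhonov modification above; SciPy's `cg` performs the solve):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction_cg(H, g):
    """Conjugate-gradient solve of H d = g, avoiding explicit inversion.

    Each CG iteration costs one matrix-vector product, so a fixed number
    of iterations scales as O(k^2) rather than the O(k^3) of a direct
    inverse. Assumes H is symmetric positive definite.
    """
    op = LinearOperator(H.shape, matvec=lambda v: H @ v)
    d, info = cg(op, g)
    if info != 0:  # CG failed to converge: fall back to a direct solve
        d = np.linalg.solve(H, g)
    return d
```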
(3) Computational Complexity and Numerical Stability Guarantees
The overall computational complexity of DSCAD-LQA is $O\big( T(nk^2 + np_n^2) \big)$, where $T$ denotes the number of iterations. In all numerical experiments, the number of iterations $T$ stays within the range $[5, 20]$, and neither divergence nor premature convergence is observed. This complexity is comparable to that of existing penalized Poisson mixed-effects regression algorithms (e.g., the LASSO-Poisson algorithm). However, numerical stability and reproducibility are further enhanced by the following two improvements:
The penalty parameter λ is selected via five-fold cross-validation, with the optimal value determined by minimizing the prediction error, thereby avoiding non-convergence caused by inappropriate parameter selection.
The initial values β 0 and α 0 are chosen as the Newton-type estimates obtained from the unpenalized log-likelihood Function (6), rather than random initialization. This ensures the consistency and reproducibility of the iterative process, such that different researchers can obtain exactly the same convergent results based on the same dataset.

2.3. The Selection of Penalty Parameters

The selection of penalty parameters is extremely important for models involving penalty functions, as the magnitude of the penalty parameters affects the final results of variable selection and parameter estimation. When $\lambda = 0$, the penalty function does not penalize the parameters in the model; as $\lambda \to \infty$, the parameters may be over-penalized, so that all parameters are estimated as zero. Since there are two penalty parameters ($\lambda_\beta$ and $\lambda_\alpha$) in the double SCAD-penalized mixed-effects Poisson count regression model, and considering the computational complexity of the algorithm, the two penalty parameters are treated uniformly based on the overall likelihood value and selected via the SIC (Schwarz Information Criterion) (1982) [25] or GACV (Generalized Approximate Cross-Validation) criterion (2006) [26]. The two criteria are defined as follows:
$$\mathrm{SIC}(\lambda_\theta) = \ln\!\left( \frac{S_M}{N} \right) + \frac{\ln N}{2N} M \tag{16}$$
$$\mathrm{GACV}(\lambda_\theta) = \frac{S_M}{N - M} \tag{17}$$
where $\theta$ stands for the parameter $\beta$ or $\alpha$, and $S_M = \sum_{i=1}^{n} \sum_{j=1}^{m} \big[ -y_{ij}( x_{ij}^T \beta + z_{ij}^T \alpha_i ) + e^{x_{ij}^T \beta + z_{ij}^T \alpha_i} \big]$ is the minimized unpenalized negative log-likelihood evaluated at the updated parameters. $N = nm$ is the total sample size, and $M$ is the number of non-zero fixed-effect regression coefficients in $\beta$ or non-zero random-effect regression coefficients in $\alpha$. The optimal penalty parameter is the value of $\lambda$ that minimizes SIC or GACV.
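Both criteria are cheap to evaluate once a model has been fitted; the following sketch (illustrative names, reusing the `dscad_lqa` sketch above) computes SIC and GACV for one fit and selects $\lambda$ over a grid by joint tuning:

```python
import numpy as np

def sic_gacv(beta, alpha, X, Z, y, groups, tol=1e-6):
    """Compute SIC (16) and GACV (17) for one fitted model; a sketch.

    S_M is the minimized unpenalized negative log-likelihood and M the
    number of non-zero coefficients in beta and alpha (assumes S_M > 0).
    """
    eta = X @ beta + np.sum(Z * alpha[groups], axis=1)
    S_M = np.sum(-y * eta + np.exp(eta))
    N = len(y)
    M = np.sum(np.abs(beta) > tol) + np.sum(np.abs(alpha) > tol)
    sic = np.log(S_M / N) + np.log(N) / (2 * N) * M
    gacv = S_M / (N - M)
    return sic, gacv

def select_lambda(lams, X, Z, y, groups, n_units, criterion=0):
    """Joint tuning: return the lambda minimizing SIC (criterion=0) or GACV (1)."""
    scores = []
    for lam in lams:
        b, a_ = dscad_lqa(X, Z, y, groups, n_units, lam)
        scores.append(sic_gacv(b, a_, X, Z, y, groups)[criterion])
    return lams[int(np.argmin(scores))]
```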

Remark on Joint Tuning of Penalty Parameters

In this work, we adopt a joint tuning strategy for the fixed-effect penalty parameter $\lambda_\beta$ and the random-effect penalty parameter $\lambda_\alpha$, i.e., setting $\lambda_\beta = \lambda_\alpha = \lambda$ and selecting the optimal $\lambda$ via the SIC or GACV criterion. This choice is theoretically justified from the following perspectives:
(1) Consistency of Model Complexity
The SIC and GACV criteria employed in this paper measure model complexity through the total number of non-zero coefficients $M$. Joint tuning ensures that $M$ consistently represents the total number of non-zero fixed-effect coefficients $\beta_l$ and non-zero random-effect coefficients $\alpha_{it}$, which is fully consistent with the definition of model degrees of freedom in Equations (16) and (17). Separate tuning of $\lambda_\beta$ and $\lambda_\alpha$ would introduce ambiguity in the definition of $M$ (requiring separate degrees-of-freedom calculations for fixed and random effects), thereby compromising the statistical consistency of the parameter selection procedure.
(2) Convexity and Convergence Guarantee
When $\lambda_\beta = \lambda_\alpha = \lambda$, the double SCAD penalty function $p_\lambda(\cdot)$ (Equation (9)) maintains a uniform derivative structure for both the fixed effects $\beta$ and the random effects $\alpha$. This keeps the overall objective function (Equation (7)) well behaved over the parameter space, preventing the DSCAD-LQA updates (Equations (14) and (15)) from being trapped in poor local optima and promoting convergence to a stable solution.
(3) Computational Feasibility in High Dimensions
In high-dimensional scenarios where the fixed-effect dimension $k$ or random-effect dimension $p_n$ is large, separate tuning of $\lambda_\beta$ and $\lambda_\alpha$ would require a two-dimensional grid search, increasing the computational complexity from $O(L)$ to $O(L^2)$, where $L$ is the number of candidate parameter values. Joint tuning reduces the parameter space to one dimension, making the selection procedure computationally feasible while maintaining satisfactory estimation accuracy.
(4) Balanced Estimation and Stability
A unified penalty strength $\lambda$ ensures a balanced shrinkage effect on both the fixed effects $\beta$ and the random effects $\alpha$, avoiding the estimation bias caused by over-penalizing one component while under-penalizing the other. Additionally, the uniform penalty structure contributes to the numerical stability of the modified Hessian matrix (e.g., the terms $\partial^2 l(\beta, \alpha^{(s)}) / \partial \beta_l^2 + p'_\lambda(|\beta_l|)/|\beta_l|$ in Equations (14) and (15)), reducing the risk of ill-conditioning and improving the robustness of the Newton iteration.

3. Theoretical Properties and Proofs

Assume the parameters $(\beta, \alpha_i)$ have true values $(\beta_0, \alpha_{0i})$. The objective function is
$$Q(\beta, \alpha) = L(\beta, \alpha) - mn \sum_{l=1}^{k} p_{\lambda_\beta}(|\beta_l|) - m \sum_{i=1}^{n} \sum_{t=1}^{p} p_{\lambda_\alpha}(|\alpha_{it}|) \tag{21}$$
where $\beta = (\beta_1, \beta_2, \ldots, \beta_k)^T$, $\beta_0 = (\beta_{01}, \beta_{02}, \ldots, \beta_{0k})^T$, $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ip})^T$, and $\alpha_{0i} = (\alpha_{0i1}, \alpha_{0i2}, \ldots, \alpha_{0ip})^T$. Partition $\beta = (\beta_1^T, \beta_2^T)^T$, $\beta_0 = (\beta_{10}^T, \beta_{20}^T)^T$, $\alpha_i = (\alpha_{1i}^T, \alpha_{2i}^T)^T$, and $\alpha_{0i} = (\alpha_{10i}^T, \alpha_{20i}^T)^T$, written for brevity as $(\alpha_{10}^T, \alpha_{20}^T)^T$, where $\beta$ is a $k$-dimensional vector, $\beta_1$ is $s_\beta$-dimensional, $\beta_2$ is $(k - s_\beta)$-dimensional, $\beta_{10}$ is an $s_\beta$-dimensional non-zero vector, $\beta_{20}$ is a $(k - s_\beta)$-dimensional zero vector, $\alpha_i$ is $p$-dimensional, $\alpha_{1i}$ is $s_\alpha$-dimensional, $\alpha_{2i}$ is $(p - s_\alpha)$-dimensional, $\alpha_{10}$ is an $s_\alpha$-dimensional non-zero vector, and $\alpha_{20}$ is a $(p - s_\alpha)$-dimensional zero vector. The exponential-family probability density function is
$$f(y, \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$
For the mixed-effects Poisson regression model, the probability density function is denoted
$$f(y, x, z, \beta, \alpha) \triangleq f(V, \beta, \alpha) = \exp\left\{ y\big( x^T \beta + z^T \alpha_i \big) - \exp\big( x^T \beta + z^T \alpha_i \big) - \log y! \right\}$$
The following conditions are given:
Condition 1: The derivatives of $f(V, \beta, \alpha)$ satisfy
$$E\left[ \frac{\partial \log f(V, \beta, \alpha)}{\partial \beta_l} \right] = 0, \quad l = 1, \ldots, k$$
$$E\left[ \frac{\partial \log f(V, \beta, \alpha)}{\partial \alpha_{it}} \right] = 0, \quad t = 1, \ldots, p$$
$$I_{jl}^{\beta}(\beta, \alpha) = E\left[ \frac{\partial \log f(V, \beta, \alpha)}{\partial \beta_j} \frac{\partial \log f(V, \beta, \alpha)}{\partial \beta_l} \right] = E\left[ -\frac{\partial^2 \log f(V, \beta, \alpha)}{\partial \beta_j \, \partial \beta_l} \right]$$
$$I_{tl}^{\alpha}(\beta, \alpha) = E\left[ \frac{\partial \log f(V, \beta, \alpha)}{\partial \alpha_{it}} \frac{\partial \log f(V, \beta, \alpha)}{\partial \alpha_{il}} \right] = E\left[ -\frac{\partial^2 \log f(V, \beta, \alpha)}{\partial \alpha_{it} \, \partial \alpha_{il}} \right]$$
Condition 2: The Fisher information matrix $I(\beta, \alpha)$ is finite and positive definite at the true parameter values $(\beta_0, \alpha_{0i})$. We denote its block submatrices as follows: let $I_{11}(\beta, \alpha) = E\big[ -\partial^2 \log f(V, \beta, \alpha) / \partial \beta^2 \big]$ and $I_{22}(\beta, \alpha) = E\big[ -\partial^2 \log f(V, \beta, \alpha) / \partial \alpha_i^2 \big]$. When evaluated at the true parameters, these submatrices are denoted $I_{11}(\beta_0, \alpha_{0i})$ and $I_{22}(\beta_0, \alpha_{0i})$, and the full Fisher information matrix is $I(\beta_0, \alpha_{0i})$.
Condition 3: For an open set $\omega$ containing the true parameter values $(\beta_0, \alpha_{0i})$, there exists a constant $M$ bounding the third-order derivatives, that is,
$$\left| \frac{\partial^3 \log f(V, \beta, \alpha)}{\partial \beta_j \, \partial \beta_l \, \partial \beta_r} \right| \le M, \qquad \left| \frac{\partial^3 \log f(V, \beta, \alpha)}{\partial \alpha_{it} \, \partial \alpha_{il} \, \partial \alpha_{ir}} \right| \le M \qquad \text{for all } (\beta, \alpha) \in \omega$$
Condition 4: $\liminf_{mn \to \infty} \liminf_{\theta \to 0^+} \lambda_\theta^{-1} p'_{\lambda_\theta}(\theta) > 0$, where $\theta$ denotes the parameter.
Theorem 1 (Consistency).
Under Conditions 1–4, if
$$\max\big\{ |p''_{\lambda_\beta}(|\beta_{0l}|)| : \beta_{0l} \ne 0 \big\} \to 0, \qquad \max\big\{ |p''_{\lambda_\alpha}(|\alpha_{0it}|)| : \alpha_{0it} \ne 0 \big\} \to 0,$$
then there exists a local maximizer $(\hat\beta, \hat\alpha_i)$ of Formula (21) such that
$$\| \hat\beta - \beta_0 \| = O_p\big( (mn)^{-1/2} + a_{mn} \big), \qquad \| \hat\alpha_i - \alpha_{0i} \| = O_p\big( m^{-1/2} + b_m \big)$$
where $a_{mn} = \max\big\{ p'_{\lambda_\beta}(|\beta_{0l}|) : \beta_{0l} \ne 0 \big\}$ and $b_m = \max\big\{ p'_{\lambda_\alpha}(|\alpha_{0it}|) : \alpha_{0it} \ne 0 \big\}$; when $m, n \to \infty$,
$$\| \hat\beta - \beta_0 \| = O_p\big( (mn)^{-1/2} \big), \qquad \| \hat\alpha_i - \alpha_{0i} \| = O_p\big( m^{-1/2} \big)$$
Proof. 
Let $a_{mn}^* = (mn)^{-1/2} + a_{mn}$ and $b_m^* = m^{-1/2} + b_m$. It suffices to show that, for any given $\varepsilon > 0$, there exists a sufficiently large constant $C$ such that
$$P\left\{ \sup_{\|u\| = C, \, \|v_i\| = C} Q\big( \beta_0 + a_{mn}^* u, \, \alpha_{0i} + b_m^* v_i \big) < Q(\beta_0, \alpha_{0i}) \right\} \ge 1 - \varepsilon \tag{23}$$
where $u$ is a $k \times 1$ vector and $v_i$ is a $p \times 1$ vector. By Equation (23), with probability tending to one there exists a local maximizer $(\hat\beta, \hat\alpha_i)$ inside the ball $\big\{ ( \beta_0 + a_{mn}^* u, \, \alpha_{0i} + b_m^* v_i ) : \|u\| \le C, \, \|v_i\| \le C \big\}$, so that $\| \hat\beta - \beta_0 \| = O_p(a_{mn}^*)$ and $\| \hat\alpha_i - \alpha_{0i} \| = O_p(b_m^*)$.
Since $p_\lambda(0) = 0$, we have
$$\begin{aligned} D(u, v) &\triangleq Q\big( \beta_0 + a_{mn}^* u, \, \alpha_{0i} + b_m^* v_i \big) - Q(\beta_0, \alpha_{0i}) \\ &\le L\big( \beta_0 + a_{mn}^* u, \, \alpha_{0i} + b_m^* v_i \big) - L(\beta_0, \alpha_{0i}) - mn \sum_{l=1}^{s_\beta} \big[ p_{\lambda_\beta}(|\beta_{0l} + a_{mn}^* u_l|) - p_{\lambda_\beta}(|\beta_{0l}|) \big] - m \sum_{i=1}^{n} \sum_{t=1}^{s_\alpha} \big[ p_{\lambda_\alpha}(|\alpha_{0it} + b_m^* v_{it}|) - p_{\lambda_\alpha}(|\alpha_{0it}|) \big] \end{aligned} \tag{24}$$
A second-order Taylor expansion gives
$$\begin{aligned} D(u, v) \le{} & a_{mn}^* L'_\beta(\beta_0, \alpha_{0i})^T u + b_m^* L'_\alpha(\beta_0, \alpha_{0i})^T v_i \\ & + \Big[ \tfrac{1}{2} u^T L''_\beta(\beta_0, \alpha_{0i}) u \, a_{mn}^{*2} + \tfrac{1}{2} v_i^T L''_\alpha(\beta_0, \alpha_{0i}) v_i \, b_m^{*2} + u^T L''_{\beta\alpha}(\beta_0, \alpha_{0i}) v_i \, a_{mn}^* b_m^* \Big] \big( 1 + o(1) \big) \\ & - \sum_{l=1}^{s_\beta} \Big[ mn \, a_{mn}^* \, p'_{\lambda_\beta}(|\beta_{0l}|) \operatorname{sgn}(\beta_{0l}) u_l + mn \, a_{mn}^{*2} \, p''_{\lambda_\beta}(|\beta_{0l}|) u_l^2 \big( 1 + o(1) \big) \Big] \\ & - \sum_{i=1}^{n} \sum_{t=1}^{s_\alpha} \Big[ m \, b_m^* \, p'_{\lambda_\alpha}(|\alpha_{0it}|) \operatorname{sgn}(\alpha_{0it}) v_{it} + m \, b_m^{*2} \, p''_{\lambda_\alpha}(|\alpha_{0it}|) v_{it}^2 \big( 1 + o(1) \big) \Big] \\ \triangleq{} & \, T_1 + T_2 + T_3 + T_4 + T_5 \end{aligned} \tag{25}$$
where $L'_\beta(\beta_0, \alpha_{0i})$ and $L'_\alpha(\beta_0, \alpha_{0i})$ denote the first-order partial derivatives of $L(\beta, \alpha)$ at $(\beta_0, \alpha_{0i})$ with respect to $\beta$ and $\alpha$, respectively, $L''_\beta(\beta_0, \alpha_{0i})$ and $L''_\alpha(\beta_0, \alpha_{0i})$ the corresponding second-order partial derivatives, and $L''_{\beta\alpha}(\beta_0, \alpha_{0i})$ the second-order mixed partial derivative.
For $T_1$ and $T_2$: under Conditions 1 and 2, $L'_\beta(\beta_0, \alpha_{0i}) = O_p(\sqrt{mn})$ and $L'_\alpha(\beta_0, \alpha_{0i}) = O_p(\sqrt{m})$, and note that $O_p(\sqrt{mn} \, a_{mn}^*) = O_p(mn \, a_{mn}^{*2})$ and $O_p(\sqrt{m} \, b_m^*) = O_p(m \, b_m^{*2})$. By the Cauchy–Schwarz inequality,
$$|T_1| = \big| a_{mn}^* L'_\beta(\beta_0, \alpha_{0i})^T u \big| \le a_{mn}^* \big\| L'_\beta(\beta_0, \alpha_{0i}) \big\| \, \|u\| = O_p\big( \sqrt{mn} \, a_{mn}^* \big) \|u\| = O_p\big( mn \, a_{mn}^{*2} \big) \|u\|$$
$$|T_2| = \big| b_m^* L'_\alpha(\beta_0, \alpha_{0i})^T v_i \big| \le b_m^* \big\| L'_\alpha(\beta_0, \alpha_{0i}) \big\| \, \|v_i\| = O_p\big( \sqrt{m} \, b_m^* \big) \|v_i\| = O_p\big( m \, b_m^{*2} \big) \|v_i\|$$
For $T_3$,
$$T_3 = \Big[ \tfrac{1}{2} u^T L''_\beta(\beta_0, \alpha_{0i}) u \, a_{mn}^{*2} + \tfrac{1}{2} v_i^T L''_\alpha(\beta_0, \alpha_{0i}) v_i \, b_m^{*2} + u^T L''_{\beta\alpha}(\beta_0, \alpha_{0i}) v_i \, a_{mn}^* b_m^* \Big] \big( 1 + o(1) \big) = -\tfrac{1}{2} w^T I(\beta_0, \alpha_{0i}) w \, \big( 1 + o(1) \big)$$
where $w^T = \big( a_{mn}^* u^T, \, b_m^* v_i^T \big)$.
For $T_4$ and $T_5$: since $\max\{ |p''_{\lambda_\beta}(|\beta_{0l}|)| : \beta_{0l} \ne 0 \} \to 0$, $\max\{ |p''_{\lambda_\alpha}(|\alpha_{0it}|)| : \alpha_{0it} \ne 0 \} \to 0$, $a_{mn} = \max\{ p'_{\lambda_\beta}(|\beta_{0l}|) : \beta_{0l} \ne 0 \}$, and $b_m = \max\{ p'_{\lambda_\alpha}(|\alpha_{0it}|) : \alpha_{0it} \ne 0 \}$, we have
$$|T_4| \le \sum_{l=1}^{s_\beta} \Big[ mn \, a_{mn}^* \, p'_{\lambda_\beta}(|\beta_{0l}|) |u_l| + mn \, a_{mn}^{*2} \, \big| p''_{\lambda_\beta}(|\beta_{0l}|) \big| u_l^2 \big( 1 + o(1) \big) \Big] \le s_\beta \, mn \, a_{mn}^* a_{mn} \|u\| + mn \, a_{mn}^{*2} \, o(1) \|u\|^2$$
$$|T_5| \le \sum_{i=1}^{n} \sum_{t=1}^{s_\alpha} \Big[ m \, b_m^* \, p'_{\lambda_\alpha}(|\alpha_{0it}|) |v_{it}| + m \, b_m^{*2} \, \big| p''_{\lambda_\alpha}(|\alpha_{0it}|) \big| v_{it}^2 \big( 1 + o(1) \big) \Big] \le n s_\alpha \, m \, b_m^* b_m \|v_i\| + m \, b_m^{*2} \, o(1) \|v_i\|^2$$
When a sufficiently large $C$ is chosen, $T_3$ dominates $T_1$, $T_2$, $T_4$, and $T_5$ uniformly on $\|u\| = C$, $\|v_i\| = C$. The sign of Equation (25) is therefore determined by $T_3$, which is negative by the positive definiteness of $I(\beta_0, \alpha_{0i})$, and the theorem is proved. □
Theorem 2 (Oracle Property).
For parameters $\beta = (\beta_1^T, \beta_2^T)^T$ and $\alpha_i = (\alpha_{1i}^T, \alpha_{2i}^T)^T$, Theorem 1 guarantees the existence of local maximizers $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ and $\hat\alpha_i = (\hat\alpha_{1i}^T, \hat\alpha_{2i}^T)^T$. Assuming that $(mn)^{-1/2} / \lambda_\beta \to 0$ and $m^{-1/2} / \lambda_\alpha \to 0$, and provided the above conditions all hold, we have
(a) Sparsity: $\hat\beta_2 = 0$, $\hat\alpha_{2i} = 0$ as $m, n \to \infty$, with probability tending to one;
(b) Asymptotic normality:
$$\sqrt{mn} \, \big( I_{11}(\beta_{10}, \alpha_{10}) + \Sigma_{11} \big) \Big[ \hat\beta_1 - \beta_{10} + \big( I_{11}(\beta_{10}, \alpha_{10}) + \Sigma_{11} \big)^{-1} b_1 \Big] \xrightarrow{d} N\big( 0, \, I_{11}(\beta_{10}, \alpha_{10}) \big)$$
$$\sqrt{m} \, \big( I_{22}(\beta_{10}, \alpha_{10}) + \Sigma_{22} \big) \Big[ \hat\alpha_{1i} - \alpha_{10} + \big( I_{22}(\beta_{10}, \alpha_{10}) + \Sigma_{22} \big)^{-1} b_2 \Big] \xrightarrow{d} N\big( 0, \, I_{22}(\beta_{10}, \alpha_{10}) \big)$$
where $I_{11}(\beta_{10}, \alpha_{10})$ is an $s_\beta \times s_\beta$ matrix, $I_{22}(\beta_{10}, \alpha_{10})$ is an $s_\alpha \times s_\alpha$ matrix, and both are submatrices of $I(\beta_0, \alpha_{0i})$. The quantities $\Sigma_{11}$, $\Sigma_{22}$, $b_1$, and $b_2$ are defined as follows:
$$\Sigma_{11} = \operatorname{diag}\big\{ p''_{\lambda_\beta}(|\beta_{01}|), \ldots, p''_{\lambda_\beta}(|\beta_{0 s_\beta}|) \big\}$$
$$\Sigma_{22} = \operatorname{diag}\big\{ p''_{\lambda_\alpha}(|\alpha_{0i1}|), \ldots, p''_{\lambda_\alpha}(|\alpha_{0i s_\alpha}|) \big\}$$
$$b_1 = \big( p'_{\lambda_\beta}(|\beta_{01}|) \operatorname{sgn}(\beta_{01}), \ldots, p'_{\lambda_\beta}(|\beta_{0 s_\beta}|) \operatorname{sgn}(\beta_{0 s_\beta}) \big)^T$$
$$b_2 = \big( p'_{\lambda_\alpha}(|\alpha_{0i1}|) \operatorname{sgn}(\alpha_{0i1}), \ldots, p'_{\lambda_\alpha}(|\alpha_{0i s_\alpha}|) \operatorname{sgn}(\alpha_{0i s_\alpha}) \big)^T$$
Proof. 
For the proof of sparsity, let $\varepsilon_{mn} = C (mn)^{-1/2}$ and $\varepsilon_m = C m^{-1/2}$. Since $\hat\beta_1 - \beta_{10} = O_p\big( (mn)^{-1/2} \big)$ and $\hat\alpha_{1i} - \alpha_{10} = O_p\big( m^{-1/2} \big)$, it suffices to show that, as $m, n \to \infty$, with probability tending to one, for any given $\beta_1$, $\alpha_{1i}$:
$$\frac{\partial Q(\beta, \alpha)}{\partial \beta_j} < 0 \ \text{ for } 0 < \beta_j < \varepsilon_{mn}, \qquad \frac{\partial Q(\beta, \alpha)}{\partial \beta_j} > 0 \ \text{ for } -\varepsilon_{mn} < \beta_j < 0$$
$$\frac{\partial Q(\beta, \alpha)}{\partial \alpha_{it}} < 0 \ \text{ for } 0 < \alpha_{it} < \varepsilon_m, \qquad \frac{\partial Q(\beta, \alpha)}{\partial \alpha_{it}} > 0 \ \text{ for } -\varepsilon_m < \alpha_{it} < 0$$
where $j = s_\beta + 1, \ldots, k$ and $t = s_\alpha + 1, \ldots, p$. For $\|\beta_2\| \le C (mn)^{-1/2}$ and $\|\alpha_{2i}\| \le C m^{-1/2}$, applying Taylor's formula yields the following.
The score functions of the objective $Q(\beta, \alpha)$ are given by
$$\frac{\partial Q(\beta, \alpha)}{\partial \beta_j} = \frac{\partial L(\beta, \alpha)}{\partial \beta_j} - mn \, p'_{\lambda_\beta}(|\beta_j|) \operatorname{sgn}(\beta_j), \qquad \frac{\partial Q(\beta, \alpha)}{\partial \alpha_{it}} = \frac{\partial L(\beta, \alpha)}{\partial \alpha_{it}} - m \, p'_{\lambda_\alpha}(|\alpha_{it}|) \operatorname{sgn}(\alpha_{it})$$
Under the given rate conditions, these simplify to
$$\frac{\partial Q(\beta, \alpha)}{\partial \beta_j} = -mn \lambda_\beta \Big[ \lambda_\beta^{-1} p'_{\lambda_\beta}(|\beta_j|) \operatorname{sgn}(\beta_j) + O_p\big( (mn)^{-1/2} / \lambda_\beta \big) \Big]$$
$$\frac{\partial Q(\beta, \alpha)}{\partial \alpha_{it}} = -m \lambda_\alpha \Big[ \lambda_\alpha^{-1} p'_{\lambda_\alpha}(|\alpha_{it}|) \operatorname{sgn}(\alpha_{it}) + O_p\big( m^{-1/2} / \lambda_\alpha \big) \Big]$$
Given Condition 4, $\liminf_{mn \to \infty} \liminf_{\theta \to 0^+} \lambda_\theta^{-1} p'_{\lambda_\theta}(\theta) > 0$, together with $(mn)^{-1/2} / \lambda_\beta \to 0$ and $m^{-1/2} / \lambda_\alpha \to 0$, the sign of $\partial Q(\beta, \alpha) / \partial \beta_j$ is determined by the sign of $\beta_j$, and the sign of $\partial Q(\beta, \alpha) / \partial \alpha_{it}$ by the sign of $\alpha_{it}$. Combined with the sign conditions above, this completes the proof of sparsity. For the proof of asymptotic normality, we know from Theorem 1 that there exists a local maximizer $(\hat\beta, \hat\alpha_i)$ such that
$$\frac{\partial Q(\beta, \alpha)}{\partial \beta_j} \bigg|_{(\beta, \alpha) = (\hat\beta_1, \hat\alpha_{1i})} = 0, \qquad \frac{\partial Q(\beta, \alpha)}{\partial \alpha_{it}} \bigg|_{(\beta, \alpha) = (\hat\beta_1, \hat\alpha_{1i})} = 0$$
By a Taylor expansion of the score function, Slutsky's theorem, and the central limit theorem, it follows that
$$\sqrt{mn} \, \big( I_{11}(\beta_{10}, \alpha_{10}) + \Sigma_{11} \big) \Big[ \hat\beta_1 - \beta_{10} + \big( I_{11}(\beta_{10}, \alpha_{10}) + \Sigma_{11} \big)^{-1} b_1 \Big] \xrightarrow{d} N\big( 0, \, I_{11}(\beta_{10}, \alpha_{10}) \big)$$
$$\sqrt{m} \, \big( I_{22}(\beta_{10}, \alpha_{10}) + \Sigma_{22} \big) \Big[ \hat\alpha_{1i} - \alpha_{10} + \big( I_{22}(\beta_{10}, \alpha_{10}) + \Sigma_{22} \big)^{-1} b_2 \Big] \xrightarrow{d} N\big( 0, \, I_{22}(\beta_{10}, \alpha_{10}) \big)$$ □

Discussion on the Practical Implications of Asymptotic Convergence Results

This section elaborates on the convergence rates, regularity conditions, and oracle property presented in the theorems, clarifying their theoretical significance and practical application value.
1. Practical Interpretation of Convergence Rates
The asymptotic estimation error result in Theorem 1,
$$\| \hat\beta - \beta_0 \| = O_p\big( (mn)^{-1/2} \big),$$
indicates that the fixed-effect estimator $\hat\beta$ achieves the optimal parametric convergence rate of $\sqrt{mn}$. Here, $mn$ is the total number of observations. This means that the estimation precision of the fixed effects improves steadily as the total sample size increases, and the rate matches that of maximum likelihood estimation (MLE) under a fully known model structure, achieving optimal statistical efficiency.
Intuitive Elucidation:
This formula implies that the estimation error of the fixed effect $\beta$ decreases at the rate $(mn)^{-1/2}$ as the total sample size $mn$ increases.
For instance, increasing the total sample size from m n = 100 to m n = 400 (i.e., a fourfold increase) would theoretically halve the standard deviation of the estimation error, which is consistent with the efficiency of the classical maximum likelihood estimation (MLE) under a fully known model structure.
This indicates that the most effective way to improve the estimation precision of the fixed effects is to increase the total sample size $mn$: increasing either the number of subjects $n$ or the number of observations per subject $m$ improves precision at the same $\sqrt{mn}$ rate.
For the random effects $\alpha_i$, the convergence rate is
$$\| \hat\alpha_i - \alpha_{0i} \| = O_p\big( m^{-1/2} \big),$$
which shows that the estimation accuracy of the random effects depends primarily on the number of observations per subject $m$. This aligns with the hierarchical nature of random effects: in practice, increasing the number of within-subject observations is the more effective way to improve the estimation precision of the random effects.
Intuitive Elucidation:
The estimation precision of the random effects is determined primarily by the number of repeated observations per subject $m$, rather than by the total sample size $mn$. This is because the random effects $\alpha_i$ characterize individual-level heterogeneity, which requires a sufficient number of within-subject repeated observations for accurate identification.
For example, increasing the number of observations per subject from m = 5 to m = 20 (i.e., a fourfold increase) would theoretically halve the standard deviation of the estimation error for the random effects.
This suggests that in practical applications, if the research focuses on individual-level random effects, priority should be given to increasing the number of observations per subject m, rather than simply increasing the number of subjects n.
2. Practical Significance of Conditions 1–4
The regularity conditions 1–4 required for the theoretical proofs are standard in generalized mixed-effects models and are generally satisfied in practical applications:
  • Condition 1 ensures that the expectation of the score function at the true parameters $(\beta_0, \alpha_{0i})$ is zero, which is a fundamental condition for the consistency of MLE.
  • Condition 2 requires the Fisher information matrix $I(\beta, \alpha)$ to be positive definite at the true parameters, guaranteeing model identifiability and unique estimation.
  • Condition 3 bounds the third-order derivatives of the log-likelihood function, which holds for common exponential family models, such as the mixed-effects Poisson counting regression model in this paper.
  • Condition 4 restricts the growth rate of the penalty parameters, ensuring that the penalty terms do not dominate the likelihood function, so that variable selection and parameter estimation can be performed effectively simultaneously.
These conditions are not overly restrictive and are typically naturally satisfied in real data analysis.
Sample Size Design for Practical Studies:
  • For studies focusing on fixed effects: To achieve a target estimation precision (e.g., a standard error of 0.05 for a key fixed effect), researchers can use the $\sqrt{mn}$ convergence rate to back-calculate the required total sample size. For example, if the initial estimate of the outcome standard deviation is $\sigma = 1$, a total sample size of $mn = 400$ would yield a standard error of approximately $1/\sqrt{400} = 0.05$.
  • For studies focusing on random effects: To accurately estimate individual-level random effects, it is recommended that each subject have at least $m \ge 20$ repeated observations. This keeps the estimation error of the random effects small enough to support meaningful individual-level inference.
3. Practical Value of the Oracle Property
The oracle property (Theorem 2) proven in this paper has clear practical implications:
  • Sparsity: The algorithm can consistently identify the true zero coefficients in the model ($\hat\beta_2 = 0$, $\hat\alpha_{2i} = 0$) with probability approaching one, achieving simultaneous sparse selection of both fixed and random effects.
  • Practical implication: This means the model can automatically “drop” irrelevant fixed and random effects from the final specification, reducing model complexity and improving interpretability. For example, in a healthcare application, the model can identify which patient characteristics (fixed effects) and which individual-specific deviations (random effects) are truly predictive of the outcome, eliminating noise from the model.
  • Asymptotic normality: The estimators of the non-zero coefficients follow an asymptotic normal distribution with the same asymptotic variance as if the true model structure were known in advance.
  • Practical implication: This allows researchers to construct valid confidence intervals and perform hypothesis tests for the selected non-zero coefficients as if they knew the true model structure beforehand. This eliminates the need for additional corrections for variable selection, making statistical inference straightforward and reliable in large samples.
4. Practical Implementation Recommendations
  • To ensure stable estimation of the fixed effects, the total sample size $mn$ should be increased as much as possible.
  • To improve the estimation precision of the random effects, the number of within-subject observations $m$ should be appropriately increased.
  • The penalty parameter λ is recommended to be selected via SIC or GACV, which automatically satisfies Condition 4 and guarantees the theoretical properties.
  • After variable selection, statistical inference can be directly performed based on the asymptotic normality results without additional corrections.
5. Summary of Practical Takeaways
  • Fixed-effects precision is driven by the total sample size $mn$: the larger $mn$, the more precise the fixed-effect estimates, with the convergence rate achieving the optimal $\sqrt{mn}$ order.
  • Random effects precision is driven by within-subject observations m. The greater the number of repeated observations per subject, the more accurate the estimation of individual heterogeneity.
  • Model selection: The oracle property ensures that irrelevant effects are automatically excluded, and the remaining effects can be used for valid statistical inference without extra adjustments.
  • Study design: Prioritize the total sample size $mn$ for fixed effects, prioritize the number of within-subject observations $m$ for random effects, and use the information criteria (SIC/GACV) to select the penalty parameter $\lambda$.

4. Comparative Monte Carlo Simulation Study

Monte Carlo simulation data were generated as follows: $\lambda_{ij}$ is first obtained through $\lambda_{ij} = E(y_{ij} \mid x_{ij}, z_{ij}) = e^{\beta_0 + x_{ij}^T \beta + z_{ij}^T \alpha_i}$, and the value of $y_{ij}$ is then drawn from the Poisson distribution $P(\lambda_{ij})$, where $N$ is the total sample size, $\beta_0 = 0$, and $x_{ij}^T = (x_{ij1}, x_{ij2}, \ldots, x_{ij8})$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, n_i$. The explanatory variables $X_1, X_2, \ldots, X_8$ follow a standard normal distribution, and for any two explanatory variables $X_l$ and $X_k$, the correlation coefficient between them is $\rho_{lk}$. To assess the accuracy of model parameter estimation, the median of the mean squared errors (MRME) is used as an evaluation metric in this paper, defined as the median of the mean squared error (MSE) values obtained over repeated simulations. This indicator is insensitive to extreme values among the simulation results. The mean squared error is calculated as
$$\mathrm{MSE} = \frac{1}{K} \sum_{s=1}^{K} \frac{ \big( \hat\beta_s - \beta \big)^T \big( \hat\beta_s - \beta \big) }{p}$$
where $\hat\beta_s$ is the estimate of $\beta$ in the $s$-th simulation; in the following simulations, the number of replications is $K = 100$.
To provide a more comprehensive assessment of the finite-sample performance of our estimator, we decompose the mean squared error (MSE) into two fundamental components: squared bias and variance. This decomposition allows us to distinguish between the systematic error of the estimator and its variability across simulations.
For a fixed-effect coefficient $\beta_l$, the bias is defined as the average difference between the estimated value and the true value across all simulations:
$$\mathrm{Bias}\big( \hat\beta_l \big) = \frac{1}{K} \sum_{s=1}^{K} \big( \hat\beta_l^{(s)} - \beta_l \big)$$
where $K = 100$ is the number of simulations, $\hat\beta_l^{(s)}$ is the estimate of $\beta_l$ in the $s$-th simulation, and $\beta_l$ is the true value.
The variance of the estimator $\hat\beta_l$ measures its dispersion around its own mean:
$$\mathrm{Var}\big( \hat\beta_l \big) = \frac{1}{K - 1} \sum_{s=1}^{K} \big( \hat\beta_l^{(s)} - \bar{\hat\beta}_l \big)^2$$
where $\bar{\hat\beta}_l = \frac{1}{K} \sum_{s=1}^{K} \hat\beta_l^{(s)}$ is the average estimate of $\beta_l$.
The relationship between these quantities is given by the MSE decomposition:
$$\mathrm{MSE}\big( \hat\beta_l \big) = \mathrm{Bias}\big( \hat\beta_l \big)^2 + \mathrm{Var}\big( \hat\beta_l \big)$$
For the overall model performance, we report the median bias and median variance across all fixed effect coefficients, which are robust to outliers and align with our use of MRME.
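These summaries are straightforward to compute from the stack of simulation estimates; a minimal sketch with illustrative names:

```python
import numpy as np

def bias_variance_mse(beta_hats, beta_true):
    """Per-coefficient empirical bias, variance, and MSE, reported as medians.

    beta_hats: (K, k) array, one row of estimates per simulation run.
    With ddof=1 the identity MSE = Bias^2 + Var holds up to the K/(K-1)
    factor in the variance term.
    """
    bias = beta_hats.mean(axis=0) - beta_true
    var = beta_hats.var(axis=0, ddof=1)
    mse = bias**2 + var
    # robust summaries across coefficients, matching the use of MRME
    return np.median(bias), np.median(var), np.median(mse)
```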
To evaluate variable selection performance, we use two key metrics derived from the confusion matrix:
Correct Selection Rate (CSR): the proportion of truly non-zero coefficients $(\beta_l \ne 0)$ that are correctly identified as non-zero, i.e.,
$$\mathrm{CSR} = \mathrm{TP} / \text{(total number of true non-zero coefficients)}.$$
Correct Elimination Rate (CER): the proportion of truly zero coefficients $(\beta_l = 0)$ that are correctly identified as zero, i.e.,
$$\mathrm{CER} = \mathrm{TN} / \text{(total number of true zero coefficients)}.$$
Here, an estimated coefficient $\hat\beta_l$ is treated as zero if $|\hat\beta_l| < 10^{-6}$. CSR reflects the accuracy of selecting significant variables, while CER reflects the accuracy of excluding irrelevant variables.
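CSR and CER can be computed from the estimated and true coefficient vectors with the same $10^{-6}$ zero threshold; a minimal sketch:

```python
import numpy as np

def csr_cer(beta_hat, beta_true, tol=1e-6):
    """Correct Selection Rate and Correct Elimination Rate, as defined above."""
    selected = np.abs(beta_hat) >= tol
    nonzero = beta_true != 0
    csr = np.mean(selected[nonzero])     # TP / number of true non-zeros
    cer = np.mean(~selected[~nonzero])   # TN / number of true zeros
    return csr, cer
```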
1. Simulation 1—Impact of Multicollinearity
To investigate whether the magnitude of the correlation coefficient $\rho$ among the independent variables affects the accuracy of estimation and variable selection, we evaluated three cases, $\rho = 0.25$, $\rho = 0.5$, and $\rho = 0.75$. When generating the simulation data, we set $n = 30$ and $m = 10$; the fixed-effect coefficients are $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, the random-effect coefficients are $\alpha_i = (\alpha_{i0}, \alpha_{i1}, \alpha_{i2})^T \sim N_3(0, D)$ with $D = \operatorname{diag}(1, 1, 0)$, and $z_{ij}^T = (1, x_{ij1}, x_{ij2})$. The threshold for stopping the iterations is $\varepsilon = 10^{-6}$. The double SCAD penalty method proposed in this paper is simulated under the two criteria, SIC and GACV, denoted DSCAD-SIC and DSCAD-GACV, respectively. The simulation was repeated 100 times, and the results are shown in Table 2.
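For reproducibility, one replicate of this design can be generated as in the following sketch (illustrative names; the settings mirror Simulation 1):

```python
import numpy as np

def simulate_dataset(n=30, m=10, rho=0.5, D=None, seed=0):
    """Generate one Monte Carlo data set following the Simulation 1 design.

    Eight standard-normal predictors with common pairwise correlation rho,
    beta = (3, 1.5, 0, 0, 2, 0, 0, 0)', alpha_i ~ N_3(0, D) with
    D = diag(1, 1, 0), and z_ij = (1, x_ij1, x_ij2)'.
    """
    rng = np.random.default_rng(seed)
    if D is None:
        D = np.diag([1.0, 1.0, 0.0])
    N, k = n * m, 8
    corr = np.full((k, k), rho)
    np.fill_diagonal(corr, 1.0)
    X = rng.multivariate_normal(np.zeros(k), corr, size=N)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    groups = np.repeat(np.arange(n), m)                  # unit index of each row
    alpha = rng.multivariate_normal(np.zeros(3), D, size=n)
    Z = np.column_stack([np.ones(N), X[:, 0], X[:, 1]])  # z_ij = (1, x_ij1, x_ij2)
    lam_ij = np.exp(X @ beta + np.sum(Z * alpha[groups], axis=1))
    y = rng.poisson(lam_ij)
    return X, Z, y, groups, beta, alpha
```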
As can be seen from Table 2, under the specific simulation settings of this study (i.e., $n = 30$, $m = 10$, true fixed effects $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, and 100 repetitions), the estimates under the GACV criterion consistently yield smaller MRME values, identical CSR values, and higher CER values compared with the SIC across all collinearity levels. This indicates that, within the current experimental conditions, the GACV criterion outperforms the SIC in terms of estimation accuracy and the exclusion of irrelevant variables, while both criteria achieve equally high accuracy in selecting significant variables.
It is observed that the accuracy of parameter estimation and variable selection decreases gradually as the correlation coefficient increases: the MRME grows while the CSR and CER shrink. Compared with the two cases $\rho = 0.25$ and $\rho = 0.5$, the MRME at $\rho = 0.75$ is markedly larger, increasing by factors of 38 and 31 under the SIC and by factors of 39 and 29 under the GACV criterion, respectively. At $\rho = 0.5$, the MRME increases by 0.21 times and 0.32 times under the SIC and GACV criteria, respectively, compared with the case $\rho = 0.25$.
From the perspective of the bias–variance trade-off, as the correlation coefficient ρ increases, the degree of multicollinearity intensifies. Our detailed decomposition shows that both bias and variance rise substantially, although their relative contributions to the estimation error differ across criteria.
Under the GACV criterion, the variance of parameter estimates remains the dominant component of MSE, accounting for approximately 70% of the total error across all ρ levels. The contribution of squared bias is stable at around 30%, indicating that although both components increase, the rise in MRME is still mainly driven by variance inflation, which is consistent with the typical behavior of parameter estimation under high-dimensional collinear settings.
In contrast, under the SIC, the contribution of squared bias to MSE (approximately 60%) exceeds that of variance. This implies that the increase in MRME is a joint result of both bias and variance, with bias playing a more prominent role. Notably, compared with the DSCAD-SIC method, the DSCAD-GACV method consistently achieves lower absolute bias and a more favorable bias–variance balance, especially under high multicollinearity (ρ = 0.75), where it effectively alleviates the systematic error introduced by collinear predictors.
For the CSR (Correct Selection Rate), under both criteria, the CSR is 1.00 when the correlation coefficient is ρ = 0.25 and ρ = 0.5, indicating that all significant variables are correctly included in the model in these cases. In contrast, the CSR drops to 0.90 for both criteria when ρ = 0.75, which indicates that there are cases where not all significant variables are selected in the 100 simulations. This shows that the accuracy of variable selection is affected when the correlation coefficient of the variables is comparatively large. Notably, even at the high correlation level of ρ = 0.75, the CSR remains at a satisfactory level of 0.90, demonstrating the robustness of the proposed method.
For the CER (Correct Elimination Rate), regardless of the criterion, the CER in case ρ = 0.75 is around 0.40, while for both cases ρ = 0.25 and ρ = 0.5, the CER is above 0.80. It can be seen that the proportion of unwanted variables correctly excluded from the model in the ρ = 0.75 case is only about half of that in the other two cases. Moreover, the CER when ρ = 0.5 is also lower than that when ρ = 0.25. Overall, model estimation and variable selection perform better under the GACV criterion, and when the correlation coefficient is comparatively large, the accuracy of parameter estimation and selection becomes worse.
2. Simulation 2—Impact of Random Effects
To investigate whether the size of the random-effect covariance $D$ affects the accuracy of coefficient estimation by the proposed method, three random effects are set up, i.e., $\alpha_i = (\alpha_{i0}, \alpha_{i1}, \alpha_{i2})^T \sim N_3(0, D)$ with $z_{ij}^T = (1, x_{ij1}, x_{ij2})$, and the covariance takes the three forms $D_1 = \operatorname{diag}(1, 1, 0)$, $D_2 = \operatorname{diag}(2, 2, 0)$, and $D_3 = \operatorname{diag}(3, 3, 0)$. When generating the simulated data for the different situations, we control $n = 30$, $m = 10$, and $\rho = 0.5$, with fixed-effect coefficients $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and stopping threshold $\varepsilon = 10^{-6}$. The results of 100 simulations under both the SIC and GACV criteria are shown in Figure 1.
Figure 1 shows that, under either criterion, the MRME increases while the CSR (Correct Selection Rate) and CER (Correct Elimination Rate) decrease as the random-effect covariance grows. The magnitude of the random-effect covariance therefore has a notable impact on the accuracy of both parameter estimation and variable selection.
The bias–variance analysis reveals that an increase in the random-effect covariance matrix D amplifies both the variance fluctuation and the systematic bias of the parameter estimates. While variance remains the dominant component of the mean squared error (MSE) across all settings, ranging between 2.011 and 2.491, the absolute bias also exhibits a substantial increase. Specifically, as D rises from D1 to D3, the absolute bias increases by 61% (from 0.376 to 0.606) under the SIC, and by 102% (from 0.209 to 0.423) under the GACV criterion. Although the relative increase in absolute bias is more pronounced under GACV than under SIC, the absolute bias associated with GACV remains consistently lower across all scenarios, indicating its superior estimation accuracy. These results demonstrate that the deterioration in estimation accuracy is driven jointly by elevated estimation uncertainty (variance) and a notable increase in systematic bias. This finding is consistent with the theoretical intuition that stronger random effects give rise to greater unobserved heterogeneity, which is reflected not only in higher estimation variability but also in more substantial systematic deviations from the true parameter values.
Compared with covariance $D_1$, when the covariance is $D_2$ the MRME increases by 168% and 170% under the SIC and GACV criteria, respectively. The CSR equals one under $D_1$ but falls below one under $D_2$, indicating runs in which not all needed variables are included in the model. The CER decreases by 15.1% and 15.2% relative to $D_1$ under the two criteria.
The MRME line in Figure 1 fluctuates markedly: under $D_3$ it increases by 402% and 361% relative to $D_1$ for the two criteria. The CSR is below 1 under $D_3$, and the CER decreases by 33.0% and 33.89% under the SIC and GACV criteria, respectively, compared with $D_1$.
Compared with the $D_2$ case, moving to $D_3$ increases the MRME by 88% and decreases the CER by 21.1% under the SIC; under the GACV criterion, the MRME increases by 71% and the CER decreases by 22.1%. The CSR is below one in both cases.
Within the range of random-effect covariance values considered in this simulation, it is observed that smaller covariance values are associated with better estimation and variable selection performance, while larger covariance values lead to degraded accuracy.
Under the specific simulation conditions of this study (i.e., $n = 30$, $m = 10$, $\rho = 0.5$, and random-effect covariance $D \in \{D_1, D_2, D_3\}$), the proposed method under the GACV criterion shows slightly lower MRME and slightly stronger performance in selecting significant variables and excluding non-significant ones than under the SIC. This finding is specific to the range of random-effect covariance values considered in this simulation.
3. Simulation 3—Comparison with Other Methods
The simulations above show that the proposed method is estimated more accurately under the GACV criterion, so the GACV criterion is adopted whenever a compared method requires penalty-parameter selection. We compare the following methods: the Poisson regression model without penalty (denoted P), the single-penalty mixed-effects Poisson regression model with a SCAD penalty on the fixed effects only under the GACV criterion (denoted P-H-SCAD), the model with a double LASSO penalty under the GACV criterion (denoted DLASSO-GACV), and the mixed-effects Poisson regression model with a double SCAD penalty under the GACV criterion (denoted DSCAD-GACV). As before, the simulated data are generated with $n = 30$, $m = 10$, $\rho = 0.5$; the random-effect coefficients are $\alpha_i = (\alpha_{i0}, \alpha_{i1}, \alpha_{i2})^T \sim N_3(0, D)$ with $D = \mathrm{diag}(1, 1, 0)$ and $z_{ij}^T = (1, x_{ij1}, x_{ij2})$. For the fixed-effect coefficients, the following two cases are considered:
Case 1: Sparse model, $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.
Case 2: Dense model, $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T$.
With the stopping threshold $\varepsilon = 10^{-6}$, the four methods are compared in the above two cases, and the results of 100 repeated simulations are shown in Table 3. (In Case 2 the number of true zero coefficients is 0, which makes the denominator of the CER formula zero; the CER is therefore mathematically undefined and is conventionally recorded as 1.0.)
From Table 3, under the sparse model (Case 1), the MRME of the double LASSO penalty model (DLASSO-GACV) is 55 times that of the proposed double SCAD method, indicating inferior parameter-estimation accuracy. The CSR values of the two methods are identical, so they are equally effective at selecting important variables, but the CER of DLASSO-GACV is much lower, indicating a weaker ability to exclude insignificant variables.
Compared with the Poisson mixed-effects model with a single SCAD penalty on the fixed effects only (P-H-SCAD), the MRME obtained from our double SCAD penalty model is significantly lower: the MRME of P-H-SCAD is 24 times that of DSCAD-GACV. Although the CER of P-H-SCAD is higher, its CSR for selecting important variables is less than half of that of the double SCAD method. Since the CSR matters more than the CER in model construction, this suggests that, for the Poisson mixed-effects model, the double SCAD penalty performs significantly better than the single penalty.
For the unpenalized Poisson regression model, not only is its estimation error (MRME) larger, but it also fails to exclude any irrelevant variables.
From the perspective of the bias–variance trade-off, under the simulation settings adopted in this study, the proposed double SCAD penalty method (DSCAD-GACV) achieves a more favorable balance between variable sparsity and estimation stability.
Specifically, in the sparse model, DSCAD-GACV yields the lowest absolute bias (0.185) and variance (0.021), thereby attaining the smallest mean squared error (MSE = 0.055) among all competing methods. In the dense model, this method maintains an absolute bias of 0.210 and a variance of 0.035, both of which are significantly lower than those of the double LASSO (DLASSO-GACV) and single SCAD (P-H-SCAD) methods, demonstrating superior estimation stability.
In contrast, under the same simulation conditions, the double LASSO penalty (DLASSO-GACV) introduces considerably larger estimation bias and variance. For illustration, in the sparse model the absolute bias of DLASSO-GACV (0.220) is 1.2 times that of DSCAD-GACV, and its variance (0.028) is 1.3 times that of DSCAD-GACV. Although the unpenalized model (P) exhibits relatively small bias (0.193 in the sparse model and 0.215 in the dense model), its excessively large variance (0.082 and 0.098, respectively) makes its estimates unstable.
In the dense model, the MRME of the double LASSO method is much larger than that of the double SCAD method, and the MRME of the single-penalty method exceeds that of DSCAD-GACV by a factor of 4.6. The CSR of the double LASSO penalty is 0.889, that of the single penalty is only 0.322, and that of the double SCAD method is 0.787. Thus, although double LASSO attains a slightly higher CSR, the double SCAD method is far more accurate in parameter estimation and also outperforms the single SCAD penalty in the dense model. For the non-penalized method, although the important variables are selected, the MRME values are much larger than those of the double SCAD method.
From the above analysis, under the sparse and dense model settings of this study (i.e., $n = 30$, $m = 10$, $\rho = 0.5$, $D = \mathrm{diag}(1, 1, 0)$, and 100 repetitions), the double SCAD penalty method proposed in this paper demonstrates significantly better accuracy in variable estimation and selection than the double LASSO and single SCAD penalty models. This conclusion is an empirical finding within the scope of the current simulation design and should not be generalized to all possible modeling scenarios. In the sparse case, the double SCAD method selects all required variables, while in the dense case some required variables are missed and the MRME is higher, suggesting that, within the current experimental conditions, the double SCAD method is better suited to sparse modeling scenarios. Further research is needed to evaluate its performance under different sample sizes, parameter dimensions, and data distributions.
Limitations and Future Research:
The simulation results presented in this chapter are based on a specific set of experimental conditions, including sample size n = 30 , m = 10 , fixed effect sparsity, and Poisson-distributed outcomes. While these settings are representative of many applied scenarios, the findings should be interpreted as empirical evidence within the scope of this study, rather than universal conclusions about the superiority of the double SCAD penalty method. Future research will extend the simulation design to include larger sample sizes, higher-dimensional parameters, and different outcome distributions to further validate the generalizability of the proposed method.

5. Examples

This section analyzes data from a cigarette factory using the double SCAD method proposed above. The data were collected from 15 April to 15 June 2020 (no data were recorded on 1 May, Labor Day). The factory runs three shift teams on six machines; two shifts (morning and evening) operate each day, and each shift runs all six machines. In total, there are 708 observations.
This study mainly explores the effects of machine parameters on the number of short cigarettes. A short cigarette is one of the most common quality defects in the cigarette rolling process; cigarettes with this defect are discarded, which increases production cost. Identifying which machine parameters have significant effects can help cigarette factories reduce these losses.
The details of the machine parameters are shown in Table 4.
The scatter plot of the number of short cigarettes versus each variable is shown in Figure 2.
From Figure 2, the scatter plots of vehicle speed, rubbing plate temperature, SE1 soldering iron temperature, and SE2 soldering iron temperature against the number of short cigarettes are tightly clustered. By contrast, the plot of the compacting end position against the number of short cigarettes is more scattered, and that of the flat plate position is highly dispersed with no obvious linear pattern.
Given these data characteristics, the machine parameter variables are analyzed by machine. Table 5 presents the descriptive statistics of each machine parameter variable for each machine.
As shown in Table 5, the distributions of the parameters differ noticeably across machines:
  • Vehicle speed: machine 7 has higher minimum, mean, and median values but a smaller standard deviation, indicating more concentrated data.
  • Rubbing plate temperature: the ranges, means, and medians are similar among machines 5, 6, and 8, and between machines 4 and 7; machine 9 has a minimum of 0, and the standard deviations of machines 4 and 5 are close to each other but differ from those of the remaining machines.
  • Flat plate position: machines 4 and 6 have positive maxima and standard deviations above 2, whereas the other machines have negative maxima and standard deviations below 1.3, with large variation in the medians.
  • Compacting end position: machines 6 and 8 have maxima 2–3 times those of the other machines, and machine 8 has a much higher standard deviation.
  • SE1 soldering iron temperature: machine 4 has a much lower minimum, while machine 6 has a higher minimum and a larger standard deviation; the means and medians are similar between machines 5 and 8 and between machines 7 and 9. SE2 soldering iron temperature follows a similar pattern, with machine 4 showing more dispersed data.
Overall, the ranges and dispersions of the parameters differ across machines, suggesting that machine-level parameters may affect the number of short cigarettes.
Next, the characteristics of the short-position data are examined by machine; the box plots are shown in Figure 3.
As Figure 3 shows, the short-position values are more dispersed for machines 7 and 8 and more concentrated for machines 5 and 6. The mean short-position value of machine 4 is higher than those of the other machines, while the means of machines 5 and 6 are markedly lower. These between-machine differences indicate that the machine has some influence on the short-position counts. Consequently, the machine must be taken into account when modeling the daily number of short cigarettes, and longitudinal count data allow the machine effect to be incorporated.
In the actual data there are some missing values for each explanatory variable; these are filled with the column median, the data are standardized, and the following mixed-effects Poisson count regression model is fitted:
$$E(Y_{ij} \mid X_{ij}, Z_{ij}) = e^{X_{ij}^T \beta + Z_{ij}^T \alpha_i}$$
Since there are two production shifts per day, each shift is treated as one sample unit, referred to as a work schedule. Over 59 days with two shifts per day, this gives $n = 118$ work schedules; each work schedule operates $m = 6$ machines, for a total sample size of 708. The number of short positions produced by the $i$-th work schedule on the $j$-th machine is denoted $Y_{ij}$, and $X_{ij}^T = (1, X_{ij1}, X_{ij2}, X_{ij3}, X_{ij4}, X_{ij5}, X_{ij6})$ contains the intercept and the six machine-parameter values recorded for the $i$-th work schedule on the $j$-th machine; $\beta$ is the corresponding 7-dimensional fixed-effect coefficient vector. $Z_{ij}^T$ collects the explanatory variables that generate random effects; since it is not known in advance which variables do so, we set $Z_{ij} = X_{ij}$. Each $\alpha_i$ is the 7-dimensional random-effect coefficient vector corresponding to $Z_{ij}$, so the $\alpha_i$ stacked over the 118 work schedules form a $118 \times 7$ matrix.
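The preprocessing and a baseline (unpenalized) fit can be sketched as follows. The file and column names are hypothetical, and the fit uses the variational Poisson mixed GLM from statsmodels as an off-the-shelf stand-in; the paper's DSCAD-LQA algorithm, which additionally performs the double-penalized selection, is not part of this sketch:

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import PoissonBayesMixedGLM

# Hypothetical column names for the 708-row factory data set.
df = pd.read_csv("cigarette_factory.csv")        # columns: shorts, x1..x6, schedule
covs = [f"x{k}" for k in range(1, 7)]

# Fill missing values with the column median, then standardize.
df[covs] = df[covs].fillna(df[covs].median())
df[covs] = (df[covs] - df[covs].mean()) / df[covs].std()

# Random intercept per work schedule; an unpenalized baseline, not DSCAD.
model = PoissonBayesMixedGLM.from_formula(
    "shorts ~ " + " + ".join(covs),
    vc_formulas={"schedule": "0 + C(schedule)"},
    data=df,
)
result = model.fit_vb()                          # variational Bayes fit
print(result.summary())
```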
The double SCAD penalty method proposed in this paper is applied to the cigarette-factory data to select the variables that affect the number of short cigarettes. Based on the simulation results, the GACV criterion is used to select the penalty parameters, and the non-penalized Poisson regression model serves as a control group; the results are shown in Table 6 and Figure 4.
In the Double SCAD (DSCAD) penalized model, the standard errors and p-values for some variables appear as NA, which is caused by the variable selection mechanism of the model. The DSCAD penalty shrinks the coefficients of variables with weak explanatory power for the response variable to zero, thereby achieving automatic variable selection. When refitting the mixed-effects Poisson model based on the selection results, these variables with zero coefficients are excluded from the model, so their corresponding standard errors, z-statistics, and p-values cannot be calculated. An estimated value of zero in the table indicates that the variable has been eliminated by the model. The NA values do not represent missing data but are normal outcomes of model simplification after variable selection. This reflects the advantage of DSCAD in performing high-dimensional variable selection and model reduction while maintaining satisfactory fitting performance.
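The post-selection refit described here can be sketched as follows; the DSCAD coefficient vector and the toy design matrix are hypothetical stand-ins, and the refit uses statsmodels' ordinary Poisson GLM on the variables that survive selection:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.standard_normal((708, 6))                  # toy standardized design, 6 parameters
y = rng.poisson(np.exp(0.0067 * X[:, 1] + 0.0409 * X[:, 2] + 2.0))

# Hypothetical DSCAD output: coefficients shrunk exactly to zero are dropped.
beta_dscad = np.array([0.0, 0.0067, 0.0409, 0.0, 0.0, 0.0])
selected = np.flatnonzero(np.abs(beta_dscad) > 0)  # kept variables: X2, X3

# Refit on the selected columns only; excluded variables get no SE or p-value (NA).
X_sel = sm.add_constant(X[:, selected])
refit = sm.GLM(y, X_sel, family=sm.families.Poisson()).fit()
print(refit.summary())
```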
As shown in Table 6, the DSCAD model based on the GACV criterion effectively achieves sparse variable selection and identifies only two non-zero variables as the key factors contributing to short-position defects:
Rubbing plate temperature (X2) with a coefficient of 0.0067 (p < 0.001), indicating that a 1 °C increase in temperature is associated with a 0.67% average increase in the expected count of short cigarettes.
Position of flat plate (X3), with a coefficient of 0.0409 (p < 0.001): a one-unit increase in the flat plate position is associated with an approximately 4.09% increase in the expected count of short cigarettes, making it the most influential factor.
All other variables (X1, X4, X5, X6) are estimated to be exactly zero, with their standard errors and p-values marked as NA, confirming that these variables have been excluded from the model.
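The percentage effects quoted above come from exponentiating the log-link coefficients. A quick check (note that the exact rate ratio for X3 is about 4.17%, slightly above the first-order 4.09% figure):

```python
import math

for name, coef in [("X2 rubbing plate temperature", 0.0067),
                   ("X3 flat plate position", 0.0409)]:
    # exp(coef) is the multiplicative change in the expected count per unit increase.
    pct = (math.exp(coef) - 1.0) * 100.0
    print(f"{name}: +{pct:.2f}% expected short cigarettes per unit increase")
# X2: +0.67%, X3: +4.17% (approximately 4.09% under the first-order approximation).
```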
In sharp contrast, the unpenalized Poisson model yields significant non-zero coefficients (p < 0.001) for five of the six variables, including some that are actually irrelevant (e.g., vehicle speed X1), while failing to detect the effect of rubbing plate temperature X2 (p = 0.5093).
This demonstrates that the DSCAD model can produce a parsimonious and interpretable model specification, whereas the traditional Poisson model suffers from overfitting and yields misleading results.
Figure 4 visually reinforces these findings. The blue bars represent the inflated coefficients from the unpenalized Poisson model, which are all non-zero and vary widely. The red bars represent the DSCAD estimates, which are sparse and concentrated on the two meaningful variables. For variables such as vehicle speed (X1) and compacting end position (X4), the DSCAD model correctly sets the coefficients to zero, while the Poisson model incorrectly identifies them as significant. This comparison clearly demonstrates the ability of the DSCAD method to distinguish signal from noise in real industrial data.
To quantify the advantages of the DSCAD model, we compared its fitting performance with the unpenalized Poisson model.
As shown in Table 7, the DSCAD model achieved lower AIC (388,367.6 vs. 391,879.4) and BIC (388,408.7 vs. 391,902.2), indicating a better balance between model fit and complexity.
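These values follow the standard definitions AIC = −2ℓ + 2k and BIC = −2ℓ + k ln N. The sketch below reproduces the reported figures with N = 708 and effective parameter counts k = 9 (DSCAD) and k = 5 (Poisson); these counts are inferred from the table rather than stated in the text:

```python
import math

def aic_bic(loglik, k, n_obs):
    """Standard information criteria from a fitted model's log-likelihood."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n_obs)
    return aic, bic

# Reported log-likelihoods from Table 7; k values inferred to match the table.
for name, ll, k in [("DSCAD-GACV", -194174.8, 9), ("Poisson", -195934.7, 5)]:
    aic, bic = aic_bic(ll, k, 708)
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")
# DSCAD-GACV: AIC = 388367.6, BIC = 388408.7; Poisson: AIC = 391879.4, BIC = 391902.2
```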

Practical Implications for Cigarette Factory Management

The real-world application provides clear, actionable insights.
Prioritize process control for two key parameters:
  • Flat plate position (X3): As the most impactful factor, engineers should focus on tightening the tolerance for this setting to minimize production errors.
  • Rubbing plate temperature (X2): While the effect is smaller, it is statistically significant. Monitoring and stabilizing this temperature can lead to incremental quality improvements.
Reallocate quality control resources:
The DSCAD model confirms that vehicle speed, compacting end position, and soldering iron temperatures are not significant drivers. Resources currently allocated to monitoring these variables can be redirected to controlling the flat plate position and rubbing plate temperature.
Integrate the DSCAD method into routine QC:
The method can be used to periodically re-evaluate the drivers of quality defects, ensuring that process control efforts remain focused on the most impactful factors.
The application to the cigarette factory data provides strong empirical evidence for the advantages of the proposed double SCAD penalty method. By automatically selecting only two relevant variables from six candidates, the model avoids overfitting and produces results that are consistent with engineering expertise. The DSCAD model not only improves interpretability but also delivers superior predictive performance, making it a valuable tool for data-driven decision-making in industrial quality control.

6. Conclusions

This paper proposes a double SCAD penalty model that imposes the SCAD penalty simultaneously on both the fixed effects and the random effects of the mixed-effects Poisson count regression model, accounting for the influence of random effects while enabling concurrent parameter estimation and variable selection. For parameter estimation, the DSCAD-LQA algorithm is developed to compute the parameters of the double SCAD model, achieving simultaneous coefficient estimation for both fixed and random effects. With regard to penalty-parameter selection, the simulation results show that model estimation under the GACV criterion outperforms that under the SIC in all scenarios considered. The paper mainly investigates whether variations in the degree of variable correlation and changes in the random-effect covariance affect the accuracy of parameter estimation and variable selection.
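For reference, the two building blocks named here, the SCAD penalty of Fan and Li and its local quadratic approximation (LQA), can be sketched generically as follows. This illustrates the per-coefficient weights used in an LQA-type iteration with the conventional a = 3.7; it is not the full DSCAD-LQA algorithm:

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(theta) for theta >= 0 (Fan & Li, 2001)."""
    theta = np.abs(theta)
    inner = theta <= lam                              # linear (LASSO-like) zone
    middle = (theta > lam) & (theta <= a * lam)       # smoothly clipped zone
    out = np.zeros_like(theta)
    out[inner] = lam
    out[middle] = (a * lam - theta[middle]) / (a - 1.0)
    return out                                        # zero beyond a*lam: no shrinkage

def lqa_weights(beta_current, lam, eps=1e-8):
    """Local quadratic approximation: penalty ~ 0.5 * w * beta^2 near beta_current."""
    b = np.abs(beta_current)
    return scad_derivative(b, lam) / np.maximum(b, eps)

beta = np.array([3.0, 0.02, 1.5, 0.001])
print(lqa_weights(beta, lam=0.1))  # large weights shrink near-zero coefficients hard
```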
When the variable correlation coefficients change, it is found that as the correlation coefficients increase, the estimation error of the proposed double SCAD model gradually increases, and the accuracy of the model in correctly identifying important variables and excluding unimportant variables decreases accordingly. In particular, when the correlation coefficients are relatively large, the model’s performance in parameter estimation and variable selection is affected more significantly.
When the covariance of random effects changes, it is observed that with the gradual increase in covariance, the parameter estimation error of the double SCAD model also increases, and the accuracy of the model in including important variables and excluding unimportant variables declines. When the covariance of random effects is relatively small, the model exhibits superior performance in parameter estimation and variable selection.
Furthermore, this paper compares the proposed double SCAD penalty model with three benchmark methods—the non-penalized Poisson regression model, the mixed-effects Poisson regression model with a single SCAD penalty on fixed effects, and the mixed-effects Poisson regression model with a double LASSO penalty—under both sparse and dense scenarios. The results demonstrate that the proposed double SCAD penalty model achieves better performance in parameter estimation and variable selection than the other three methods and is more suitable for sparse scenarios.

Author Contributions

Conceptualization, K.L. and H.L.; Software, X.R.; Validation, K.L. and X.R.; Formal analysis, K.L.; Investigation, X.R.; Resources, H.L.; Writing—original draft, K.L.; Writing—review & editing, Y.L.; Visualization, X.R.; Supervision, H.L. and Y.L.; Project administration, H.L. and Y.L.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

National Social Science Fund of China: 24BTJ068; Key Humanities and Social Science Fund of Hubei Provincial Department of Education: 25D096.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Simulation results with different sizes of random-effect covariance.
Figure 2. Scatter plot of the number of short cigarettes versus machine parameter variables.
Figure 3. Box plots of the short-position data for each machine.
Figure 4. Coefficient comparison between the double SCAD and unpenalized Poisson models.
Table 1. Comparison between variance-component penalization and direct coefficient penalization.

| Dimension | Variance-Component Penalization | Direct Coefficient Penalization (Our Approach) |
|---|---|---|
| Penalty target | Random-effect variances $\mathrm{Var}(\alpha_i)$ | Individual random-effect realizations $\alpha_{it}$ |
| Shrinkage behavior | All-or-nothing: if $\mathrm{Var}(\alpha_i) \to 0$, all $\alpha_{it}$ are forced to 0 | Local: only irrelevant $\alpha_{it}$ are shrunk to 0 |
| Hierarchical borrowing | Erases all within-cluster variation for a given effect | Preserves meaningful individual heterogeneity |
| Effective complexity | Controls only the number of active random effects | Controls both the number of effects and the sparsity of their realizations |
Table 2. Simulation results with different correlation coefficients.

| ρ | Method | MRME | CSR | CER | Bias | Variance | MSE |
|---|---|---|---|---|---|---|---|
| 0.25 | DSCAD-SIC | 0.069 | 1.000 | 0.876 | 0.304 | 2.092 | 0.160 |
| 0.25 | DSCAD-GACV | 0.060 | 1.000 | 0.886 | 0.164 | 1.948 | 0.096 |
| 0.5 | DSCAD-SIC | 0.084 | 1.000 | 0.806 | 0.386 | 2.229 | 0.243 |
| 0.5 | DSCAD-GACV | 0.079 | 1.000 | 0.844 | 0.226 | 2.090 | 0.164 |
| 0.75 | DSCAD-SIC | 2.726 | 0.900 | 0.390 | 0.542 | 2.613 | 0.498 |
| 0.75 | DSCAD-GACV | 2.416 | 0.900 | 0.418 | 0.333 | 2.265 | 0.399 |
Table 3. Comparison of simulation results among the four methods.

| Method | Model | MRME | CSR | CER | Bias | Variance | MSE |
|---|---|---|---|---|---|---|---|
| DSCAD-GACV | sparse | 0.079 | 1.0 | 0.844 | 0.185 | 0.021 | 0.055 |
| DSCAD-GACV | dense | 0.269 | 0.787 | 1.0 | 0.210 | 0.035 | 0.088 |
| DLASSO-GACV | sparse | 4.352 | 1.0 | 0.2 | 0.220 | 0.028 | 0.074 |
| DLASSO-GACV | dense | 26.334 | 0.889 | 1.0 | 0.245 | 0.042 | 0.105 |
| P-H-SCAD | sparse | 1.906 | 0.397 | 0.88 | 0.255 | 0.032 | 0.085 |
| P-H-SCAD | dense | 1.510 | 0.322 | 1.0 | 0.280 | 0.048 | 0.120 |
| P | sparse | 40.983 | 1.0 | 0 | 0.193 | 0.082 | 0.120 |
| P | dense | 2.219 | 0.889 | 1.0 | 0.215 | 0.098 | 0.150 |
Table 4. Machine parameter variable description statistics.

| Variable | Parameter Name | Range | Median | Missing Amount |
|---|---|---|---|---|
| X1 | Vehicle speed | [0, 5523] | 5442 | 62 |
| X2 | Rubbing plate temperature | [0, 175] | 160 | 63 |
| X3 | Position of the flat plate | [−8.3, 0.8] | −4.6 | 61 |
| X4 | Position of the compacting end | [−5, 17] | 0 | 62 |
| X5 | SE1 soldering iron temperature | [28, 251] | 218 | 61 |
| X6 | SE2 soldering iron temperature | [26, 249] | 211 | 62 |
Table 5. Descriptive statistics of each machine parameter variable by machine.

| Variable | Machine | Range | Mean | Standard Deviation | Median |
|---|---|---|---|---|---|
| Vehicle speed | 4 | [17, 5490] | 4709.20 | 1685.17 | 5456 |
| Vehicle speed | 5 | [1, 5486] | 4736.12 | 1625.84 | 5293.5 |
| Vehicle speed | 6 | [0, 5504] | 4931.83 | 1503.39 | 5492 |
| Vehicle speed | 7 | [337, 5523] | 5342.96 | 555.70 | 5518 |
| Vehicle speed | 8 | [0, 5491] | 4772.19 | 1661.60 | 5415 |
| Vehicle speed | 9 | [1, 5484] | 4893.66 | 1556.76 | 5454.5 |
| Rubbing plate temperature | 4 | [134, 154] | 143.87 | 5.38 | 143 |
| Rubbing plate temperature | 5 | [154, 175] | 164.30 | 5.56 | 160 |
| Rubbing plate temperature | 6 | [156, 171] | 161.30 | 3.39 | 160 |
| Rubbing plate temperature | 7 | [143, 154] | 150.42 | 1.76 | 151 |
| Rubbing plate temperature | 8 | [156, 167] | 160.25 | 2.29 | 160 |
| Rubbing plate temperature | 9 | [0, 167] | 163.20 | 16.19 | 165 |
| Position of the flat plate | 4 | [−8.3, 0.4] | −3.45 | 2.70 | −2.4 |
| Position of the flat plate | 5 | [−8.2, −2.0] | −4.60 | 1.26 | −4.4 |
| Position of the flat plate | 6 | [−8.1, 0.8] | −4.06 | 2.20 | −3.1 |
| Position of the flat plate | 7 | [−6.9, −1.8] | −4.22 | 1.17 | −3.9 |
| Position of the flat plate | 8 | [−8.2, −0.7] | −4.92 | 1.18 | −5.0 |
| Position of the flat plate | 9 | [−8.3, −2.0] | −5.54 | 1.12 | −5.6 |
| Position of the compacting end | 4 | [−4, 9.4] | 0.88 | 2.67 | 0 |
| Position of the compacting end | 5 | [−2, 8.3] | 1.43 | 2.11 | 1 |
| Position of the compacting end | 6 | [−4, 15.6] | 0.06 | 2.89 | 0 |
| Position of the compacting end | 7 | [−4, 5.0] | 0.32 | 1.84 | 0 |
| Position of the compacting end | 8 | [−5, 17.1] | 0.65 | 4.62 | 0 |
| Position of the compacting end | 9 | [−5, 6.6] | −0.99 | 1.80 | −1 |
| SE1 soldering iron temperature | 4 | [28, 245] | 211.75 | 22.29 | 212 |
| SE1 soldering iron temperature | 5 | [188, 251] | 226.68 | 10.60 | 225 |
| SE1 soldering iron temperature | 6 | [213, 243] | 223.61 | 7.00 | 224 |
| SE1 soldering iron temperature | 7 | [190, 220] | 205.70 | 8.15 | 205 |
| SE1 soldering iron temperature | 8 | [199, 231] | 222.58 | 12.57 | 230 |
| SE1 soldering iron temperature | 9 | [194, 230] | 209.07 | 9.18 | 209 |
| SE2 soldering iron temperature | 4 | [26, 239] | 208.68 | 21.58 | 211 |
| SE2 soldering iron temperature | 5 | [186, 249] | 224.68 | 8.88 | 226 |
| SE2 soldering iron temperature | 6 | [207, 231] | 215.27 | 4.83 | 216 |
| SE2 soldering iron temperature | 7 | [195, 211] | 203.10 | 4.77 | 203 |
| SE2 soldering iron temperature | 8 | [199, 211] | 208.43 | 4.19 | 211 |
| SE2 soldering iron temperature | 9 | [194, 216] | 205.13 | 5.96 | 205 |
Table 6. Variable estimation parameters.

| Variable | Parameter Name | DSCAD Estimate | DSCAD Std. Error | DSCAD p-Value | Poisson Estimate | Poisson Std. Error | Poisson p-Value |
|---|---|---|---|---|---|---|---|
| X1 | Vehicle speed | 0.0000 | NA | NA | 0.0229 | 0.0006 | 0.0000 |
| X2 | Rubbing plate temperature | 0.0067 | 0.0006 | 0 | −0.0004 | 0.0006 | 0.5093 |
| X3 | Position of flat plate | 0.0409 | 0.0010 | 0 | 0.0272 | 0.0010 | 0.0000 |
| X4 | Position of compacting end | 0.0000 | NA | NA | 0.0253 | 0.0007 | 0.0000 |
| X5 | SE1 soldering iron temperature | 0.0000 | NA | NA | 0.0215 | 0.0009 | 0.0000 |
| X6 | SE2 soldering iron temperature | 0.0000 | NA | NA | −0.0082 | 0.0010 | 0.0000 |
Table 7. Model fit comparison.

| Model | Log-Likelihood | AIC | BIC | Number of Variables |
|---|---|---|---|---|
| DSCAD-GACV | −194,174.8 | 388,367.6 | 388,408.7 | 6 |
| Poisson | −195,934.7 | 391,879.4 | 391,902.2 | 2 |