1. Introduction
The normal distribution serves as a fundamental element in statistical modeling because of its analytical tractability and its essential role in limit theorems. However, real-world data frequently exhibit heavier or lighter tails, asymmetry, and, in certain contexts, multimodality, which a single Gaussian density cannot adequately represent [1]. Generalized classes of the normal distribution have been proposed to address this [2]; in particular, flexible symmetric families can provide better fits and may accommodate bimodality. For rates and proportions on the unit interval, beta regression is a standard choice [3].
Within this landscape, two-piece constructions provide an appealing route to flexibility by allowing distinct shapes on either side of a central location while preserving closed-form expressions for key quantities. Building on this idea, the two-piece normal family (symmetric double normal, with an asymmetric extension) offers a tractable kernel with a shape parameter that controls departures from normality, enabling richer behavior than the symmetric Gaussian distribution and supporting likelihood-based inference with standard tools [4,5].
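As a concrete illustration of such a kernel, the following Python sketch uses the equal-weight two-component normal mixture form of the double normal referenced in this Introduction; the paper's exact parameterization is given in Section 2, so the function name, default shape value, and mixture form here are assumptions for illustration only:

```python
import math

def norm_pdf(z):
    # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def double_normal_pdf(x, mu=0.0, sigma=1.0, delta=1.5):
    """Illustrative double-normal kernel: equal-weight mixture of
    N(mu - delta*sigma, sigma^2) and N(mu + delta*sigma, sigma^2).
    Symmetric about mu; reduces to N(mu, sigma^2) when delta = 0,
    and is bimodal when delta > 1."""
    z = (x - mu) / sigma
    return 0.5 * (norm_pdf(z - delta) + norm_pdf(z + delta)) / sigma
```

Under this form, the shape parameter delta alone governs the transition from unimodality to bimodality, which is the kind of controlled departure from normality described above.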
These features are consequential in applications involving censoring and bounded support. We leverage them to develop the following: (i) a censored model for limited outcomes, (ii) a doubly truncated model on (0,1), and (iii) a survival specification with a log-two-piece baseline and Gamma frailty. In biostatistics and environmental monitoring, detection limits induce mass accumulation at thresholds; in econometrics, corner solutions yield limited dependent variables; and for proportions, support is intrinsically constrained to (0,1) [1]. A coherent framework based on the double normal (two-component normal mixture) family accommodates such constraints while allowing for flexible shapes (including possible bimodality) and straightforward computation [4,5]. Note that the specification considered in Section 2 is symmetric by construction; any asymmetry in the observed data distribution may arise from censoring/truncation mechanisms rather than from the latent error distribution itself.
This paper develops two extensions tailored to these scenarios. First, a censored specification of the double normal distribution provides a Tobit-type model that allows for flexible modality while accounting for accumulation at a censoring boundary [1]. Second, a doubly truncated specification on (0,1) offers a natural competitor to beta regression for proportions and rates, with the added ability to represent platykurtic shapes and, when supported by the data-generating mechanism, multimodality through a principled truncation operator [3,6]. We also consider a survival formulation with Gamma frailty, linking heterogeneity modeling to established survival-analysis methodology [7].
Our contributions are as follows. First, we establish formal properties for the proposed models, including stochastic representations, generating functions, moments, and conditions for modality [4,5]. Second, we propose regression structures via suitable link functions for bounded outcomes and survival contexts [1,7]; for completeness and reproducibility, we also provide the corresponding log-likelihood expressions under censoring/truncation and the analytical derivatives (score functions and observed Fisher information) used for numerical maximization. Third, we conduct Monte Carlo experiments to assess finite-sample performance (bias and root mean squared error) across sample sizes, censoring and truncation intensities, and modality regimes, and we compare models using classical information criteria [8,9,10]. Finally, empirical illustrations in biomedical, economic, and labor datasets benchmark the proposed models against Gaussian, skewed, and beta alternatives and include formal tests of unimodality using the Hartigan–Hartigan dip test [11].
Section 2 introduces the symmetric two-piece normal (double normal) distribution and summarizes its main properties (moments, kurtosis, and modality conditions).
Section 3 presents the censored two-piece normal (CTN) model, including its distributional properties, the Tobit-type regression extension, and likelihood-based estimation components for CTN.
Section 4 develops the doubly truncated two-piece normal (TTN) model on (0,1) and its regression specification.
Section 5 introduces the survival formulation with a log-two-piece normal baseline and Gamma frailty and provides the resulting marginal likelihood.
Section 6 consolidates general inference and implementation details (asymptotics, standard errors, and boundary issues).
Section 7 reports the simulation study, and Section 8 presents the empirical applications. Finally, Section 9 concludes.
7. Simulation Study
Here, LTN denotes the log-two-piece normal baseline survival model (Section 5); left-censoring at L enters the likelihood through the corresponding cdf evaluated at L (baseline or marginal, depending on whether frailty is included), paralleling the boundary contribution induced by censoring in the CTN setting.
We conducted a Monte Carlo study to assess the finite-sample performance of the MLEs for the left-censored LTN regression model. A categorical covariate with three levels was simulated, and inference was based on 1000 replicates with sample sizes
. The true parameters were
,
,
,
, and
. For the Gamma-frailty extension, we considered
. We report the mean estimates, absolute bias, RMSE, and Monte Carlo standard error (SE). Results are shown in
Table 1,
Table 2 and
Table 3; NF denotes the LTN model without frailty. Estimation used
optim in R with the BFGS method.
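The censored-likelihood construction that the optimizer maximizes can be illustrated in a few lines. The Python sketch below is illustrative only: it substitutes a plain Gaussian kernel for the paper's two-piece kernel, so it conveys the structure of the objective (a density term for uncensored points, a cdf term at the boundary) rather than the actual model fitted with optim/BFGS in R:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_loglik(mu, sigma, data, L):
    """Left-censored log-likelihood: observations at or below the
    threshold L contribute log F(L); uncensored observations
    contribute the log density."""
    ll = 0.0
    for y in data:
        if y <= L:                       # boundary (point-mass) contribution
            ll += math.log(norm_cdf((L - mu) / sigma))
        else:                            # density contribution
            z = (y - mu) / sigma
            ll += math.log(norm_pdf(z) / sigma)
    return ll
```

Passing the negative of this function to a quasi-Newton routine (BFGS, as in the study) yields the MLEs; the same two-term structure carries over when the Gaussian kernel is replaced by the two-piece one.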
We note that we do not include a separate TTN simulation study for proportion data: the TTN model on (0,1) is a purely doubly truncated continuous specification, and its likelihood-based estimation follows the standard truncated-likelihood construction. Our simulations therefore focus on the settings that introduce the main inferential challenges of the paper, namely censoring-induced point mass (CTN) and frailty marginalization in survival models.
The data were generated as follows:
- (i)
Fix the model parameters: , , the LTN shape parameter , the frailty variance (when applicable), the sample size n, and the target left-censoring proportion .
- (ii)
For each
, draw a group indicator
with equal probabilities. Construct
and set
. In the frailty model, draw
; otherwise, set
. Conditional on
, draw
, set
, and obtain the failure time by inversion as follows:
where
is the LTN baseline survival function (no frailty).
- (iii)
Given
, choose a common threshold
L such that
by solving
using
for the model without frailty and
for the marginal frailty model. Finally, define the observed time and indicator as follows:
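Steps (i)–(iii) above can be sketched as follows. This Python fragment is illustrative rather than the paper's implementation: it substitutes the equal-weight double-normal mixture for the exact two-piece kernel and assumes the left-censored observation takes the form max(T, L) with event indicator 1 when T exceeds L:

```python
import math
import random

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dn_cdf(z, delta):
    """CDF of the illustrative double-normal kernel (equal-weight mixture)."""
    return 0.5 * (norm_cdf(z - delta) + norm_cdf(z + delta))

def dn_quantile(p, delta, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert dn_cdf by bisection (it is continuous and strictly increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dn_cdf(mid, delta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_time(xb, sigma, delta, z_frailty=1.0, rng=random):
    """Step (ii): with S(t | Z) = S0(t)**Z and
    S0(t) = 1 - F_W((log t - xb) / sigma),
    solve S(t | Z) = u for t by inversion."""
    u = rng.random()
    p = 1.0 - u ** (1.0 / z_frailty)  # target value of F_W
    return math.exp(xb + sigma * dn_quantile(p, delta))

def left_censor(t, L):
    """Final step (assumed form): observed time max(t, L),
    event indicator 1 if t > L, else 0."""
    return (max(t, L), 1 if t > L else 0)
```

Setting z_frailty = 1 recovers the model without frailty; drawing z_frailty from a Gamma distribution with unit mean gives the conditional-frailty generation of step (ii).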
Figure 1,
Figure 2 and
Figure 3 display MSEs by censoring level (0%, 25%, 50%) and sample size. Panels labeled without frailty correspond to independence. The MSEs of the regression coefficients are most affected by heavier censoring, particularly under independence, and decrease as
n grows. The LTN parameters show some sensitivity in
, while
is comparatively stable. Overall, with or without frailty, the estimators track the reference values well, even under censoring.
8. Applications
We present two real-data examples to illustrate two-piece normal models for censored data, both with and without covariates, and to highlight their practical value under censoring.
8.1. HIV Dataset
As a first illustration, we analyze HIV-1 RNA measurements from the Colombian SIVIGILA surveillance system (Ministry of Health). The database preserves patient anonymity and records age, sex, date of entry, HAART (highly active antiretroviral therapy) status at different stages, CD4 count, and viral load (copies of HIV-1 RNA). We focus on 106 women with at least one year of HAART. The assay detection limit is copies/mL; on the log scale, the censoring threshold is . In this sample, of observations fall below L.
We fit the censored two-piece normal model to the viral-load data. For the 69 uncensored observations (women on HAART ≥ 1 year), the summary statistics were: mean , variance , skewness , and kurtosis . The mild right skewness and kurtosis below 3 indicate lighter-than-normal tails, a feature that CTN accommodates under censoring.
Hartigan’s dip test [11] yields ( p-value ), rejecting unimodality. Alongside CTN, we also fitted a censored bimodal normal (CBN) model and a censored flexible normal (CFN) model (see [22] for a flexible bimodal normal family).
Table 4 reports the corresponding MLEs (with standard errors), information criteria, and Anderson–Darling goodness-of-fit results. Maximum likelihood estimation was carried out using the optim function in R. Model comparison relied on AIC [8], BIC [9], and CAIC [10]. Larger p-values in the Anderson–Darling (AD) goodness-of-fit test indicate better agreement with the fitted cdf; we computed the AD statistic using the goftest package in R. Note that in the CFN competitor, the shape parameter is not restricted in sign, so negative estimates are admissible; by contrast, in our TN-based models, the shape parameter is nonnegative by construction.
As shown in Table 4, according to AIC/BIC/CAIC, CTN and CFN provide the best fits (in that order). The AD p-values likewise support CTN/CFN for these data. The observed censoring proportion is . Model-implied expected censoring proportions, computed as , were (CTN), (CFN), and (CBN), indicating that CTN and CFN closely reproduce the observed censoring level.
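This check is simple to reproduce: compare the fitted cdf evaluated at the detection limit with the empirical fraction of censored observations. A minimal Python sketch, again under the illustrative equal-weight mixture parameterization rather than the paper's exact CTN cdf:

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def implied_censoring(mu, sigma, delta, L):
    """Model-implied expected censoring proportion F(L) under the
    illustrative double-normal cdf (equal-weight mixture form)."""
    z = (L - mu) / sigma
    return 0.5 * (norm_cdf(z - delta) + norm_cdf(z + delta))

def observed_censoring(ys, L):
    """Empirical fraction of observations at or below the threshold L."""
    return sum(y <= L for y in ys) / len(ys)
```

Agreement between the two quantities at the MLEs is the informal calibration criterion reported above for CTN, CFN, and CBN.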
8.2. FoodExpenditure Dataset
Following [3], we analyze the FoodExpenditure data, where the response is the proportion of household income spent on food, and the covariates are household income and household size [23]. The data are distributed with the R package betareg [24].
We fit three regression models for Y as follows: (i) a beta regression as in [3]; (ii) a doubly truncated normal (truncated Gaussian, TG); and (iii) a doubly truncated two-piece normal (TTN), noting that TG is the limit of TTN as the shape parameter tends to 0.
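The truncation operator behind TG and TTN is the standard one: divide the kernel density by the mass it assigns to the interval. A Python sketch under the illustrative equal-weight mixture parameterization (not the paper's exact TTN form):

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dn_pdf(z, delta):
    # illustrative double-normal kernel (equal-weight mixture)
    return 0.5 * (norm_pdf(z - delta) + norm_pdf(z + delta))

def dn_cdf(z, delta):
    return 0.5 * (norm_cdf(z - delta) + norm_cdf(z + delta))

def ttn_pdf(x, mu, sigma, delta, a=0.0, b=1.0):
    """Doubly truncated density on (a, b): renormalize the kernel
    by the probability mass F(b) - F(a) it assigns to the interval."""
    if not (a < x < b):
        return 0.0
    za, zb, zx = (a - mu) / sigma, (b - mu) / sigma, (x - mu) / sigma
    return (dn_pdf(zx, delta) / sigma) / (dn_cdf(zb, delta) - dn_cdf(za, delta))
```

Setting delta = 0 collapses ttn_pdf to the truncated Gaussian, which mirrors the limiting relationship between TG and TTN noted above.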
Table 5 reports MLEs (SEs) and the information criteria AIC, BIC, and CAIC. Estimation used optim in R; beta regression used betareg [24]. By all three criteria, TTN provides the best fit, followed by TG and then beta.
To check model adequacy and detect outliers, we used the transformed martingale residual with normal envelopes [25,26]. The QQ envelopes (Figure 4) indicate that TTN adheres more closely to the reference line than TG and beta, corroborating the information-criteria ranking. We also report the Anderson–Darling (AD) statistic with p-values in Table 5; larger p-values favor TTN over TG and beta.
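For reference, a commonly used form of this transform is the deviance-type transformation of the martingale residual r_M = delta + log S(t), which is approximately standard normal under a correctly specified model and is therefore suitable for QQ envelopes. A Python sketch (the exact variant used in [25,26] may differ in detail):

```python
import math

def martingale_residual(delta, surv):
    """r_M = delta + log S(t): event indicator plus the log of the
    fitted survival probability at the observed time."""
    return delta + math.log(surv)

def deviance_transform(delta, surv):
    """Deviance-type transform of r_M:
    sign(r_M) * sqrt(-2 * [r_M + delta * log(delta - r_M)]).
    Approximately standard normal under a well-specified model."""
    rm = martingale_residual(delta, surv)
    inner = rm + (delta * math.log(delta - rm) if delta > 0 else 0.0)
    return math.copysign(math.sqrt(-2.0 * inner), rm)
```

Plotting the ordered transformed residuals against normal quantiles, with simulated envelopes, gives the diagnostic displayed in Figure 4.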
8.3. Mroz Dataset
We analyze 753 married women from the classic labor–supply study: 428 work positive hours, while 325 do not. Let be the annual hours worked and define , where , so non-workers map to the point and workers to a continuous range below . As covariates, we use the number of children aged 6–18 (), the woman’s age (), and years of work experience (). We fit censored normal (CN), censored skew–normal (CSN), and censored two-piece normal (CTN) linear models by maximum likelihood.
Table 6 reports parameter estimates (SEs) and information criteria. According to AIC, BIC, and CAIC, the CTN model provides the best fit, followed by CSN and CN. QQ envelopes of the transformed martingale residuals (Figure 5) also favor CTN over the competitors.
The envelopes for CN, CSN, and CTN are shown in Figure 5a–c. The CTN envelopes adhere most closely to the reference, reinforcing the information-criteria ranking.
8.4. Childhood Cancer Dataset
We analyze children in Colombia (2023) with confirmed cancer who were hospitalized, obtained from the National Institute of Health via the SIVIGILA portal [27]. The outcome is the time from symptom onset to hospitalization (in days). Left-censoring arises when the onset date is unknown at the time of admission. Covariates are sex (female/male) and age in years at consultation.
Figure 6 indicates a skewed distribution of times.
In total, children were analyzed; the event proportion was . Sample summaries were as follows: mean , sd , median , , (days). Additional descriptive statistics by censoring status and sex are reported in Appendix B (Table A1).
Model Estimates
We fitted log-two-piece normal baseline models for the time distribution, both without and with Gamma frailty. The parameters are the regression vector , scale , shape , and (for frailty) the variance . Table 7 reports MLEs (SEs); bold entries denote Wald-test significance. The model without frailty is labeled LTN, and the model with Gamma frailty is labeled LTN–G. Covariates: (intercept), (male vs. female), and (age in years, centered).
Under the model without frailty, covariate effects are not statistically different from zero. In contrast, with Gamma frailty, there is strong evidence of association as follows: and , indicating (conditional on frailty) shorter expected times for males and for older children on the log–time scale. The estimated scale is (SE ), and the frailty variance suggests substantial unobserved heterogeneity. A likelihood–ratio comparison favors the frailty model (LR , ), with fit indices: LTN, , AIC , BIC , CAIC ; LTN–G, , AIC , BIC , CAIC . This specification accommodates baseline bimodality (via ) and heterogeneity (via ).
9. Conclusions
This paper develops a unified likelihood-based framework for limited outcomes using the two-piece normal construction. The framework covers a censored specification for boundary-inflated responses, a doubly truncated specification on for rates and proportions, and a survival formulation with a log-two-piece normal baseline and Gamma frailty. We provide explicit closed-form building blocks (pdf, cdf, survival, hazard, and cumulative hazard) and likelihood expressions that support routine maximum likelihood estimation and reproducible implementation.
Monte Carlo results suggest that the maximum likelihood estimators exhibit a small bias and decreasing RMSE as the sample size increases. Censoring primarily increases the variability of regression coefficients; the scale parameter is comparatively stable, whereas the shape parameter can be more sensitive under heavy censoring. In the frailty setting, the variance component is recoverable at moderate sample sizes in the scenarios considered, and the observed information provides standard errors with satisfactory frequentist behavior.
The empirical analyses broadly support these findings. For HIV-1 RNA subject to a detection limit, the censored two-piece normal attains competitive information-criterion values and passes goodness-of-fit checks; model-implied censoring proportions closely match the observed fraction. For household food expenditure on (0,1), the doubly truncated two-piece normal improves fit relative to the truncated Gaussian and beta regression in these datasets. In the labor-supply application, the censored specification improves fit over Gaussian and skew-normal alternatives. For childhood cancer survival data, adding Gamma frailty to a log-two-piece baseline improves fit and highlights covariate effects consistent with unobserved heterogeneity; however, sensitivity to starting values and parameterization should be assessed.
Limitations and future work. While the proposed models retain closed forms and integrate naturally with standard diagnostics (information criteria, Anderson–Darling tests, QQ envelopes, and residual analyses), caution is warranted under extreme censoring/truncation or weak separation, where the shape parameter may be weakly identified. Practical strategies include warm starts from simpler baselines, multiple initializations, and profile-likelihood checks; moreover, when the shape parameter approaches the boundary value 0, standard large-sample approximations for inference and likelihood-ratio testing may require adjustment. Finally, in the doubly truncated setting, regression on the location parameter does not necessarily imply mean regression, and interpretation should be made explicit. Future work includes Bayesian implementations with identifiability-aware priors, penalized and robust estimation in contaminated settings, shared-frailty and multilevel extensions, interval/combined censoring and time-varying covariates, semi-/nonparametric baselines within the two-piece construction, and tailored diagnostics for multimodality; public software and fully reproducible workflows would further support adoption in applied domains.