2.1. Data: The Continuous Sample of Working Histories
The empirical application is based on the Continuous Sample of Working Histories (CSWH), a longitudinal administrative database for the Spanish labor market. Following Troncoso-Ponce (2017, 2018), we use this source to construct a sample of 132,262 young low-skilled workers observed over the period 2000–2019. The CSWH contains complete labor market histories for more than one million individuals and represents a 4% non-stratified random draw from the population linked to the Spanish Social Security Administration. It includes both wage earners and recipients of Social Security benefits, such as unemployment benefits, disability benefits, survivor pensions, and maternity leave (see, for example, Arranz & García-Serrano, 2011; Lafuente, 2020; Lapuerta, 2010).
The database provides the start and end dates of employment and unemployment episodes throughout the observed labor history of each individual, together with personal and job-related information. This structure makes it possible to reconstruct monthly labor market spells and to estimate transition rates between employment and unemployment in a longitudinal framework. Additional details on the variables available in the CSWH can be found in García-Pérez (2008) and Arranz et al. (2013). This type of administrative source has also continued to prove useful in recent research on labor market transitions in Spain based on discrete-time duration frameworks and rich employment histories (Carrasco et al., 2024).
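The reconstruction of monthly spells from start and end dates can be illustrated with a minimal Python sketch. The function names, the record layout, and the convention of counting every calendar month touched by an episode are illustrative assumptions; they are not taken from the CSWH documentation or from the authors' code.

```python
from datetime import date

def months_between(start: date, end: date) -> int:
    """Calendar months touched by an episode, counting both end months (assumed convention)."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

def expand_spell(person_id, state, start, end, censored):
    """Return one record per month of the episode.

    'exit' equals 1 only in the last month of a completed (uncensored) episode.
    """
    n_months = months_between(start, end)
    rows = []
    for t in range(1, n_months + 1):
        rows.append({
            "id": person_id,
            "state": state,   # "u" for unemployment, "e" for employment
            "t": t,           # elapsed duration in months
            "exit": int(t == n_months and not censored),
        })
    return rows

# Hypothetical unemployment episode from March to June 2005.
rows = expand_spell(1, "u", date(2005, 3, 15), date(2005, 6, 2), censored=False)
```

Each record produced this way corresponds to one person-month observation, the unit on which a discrete-time hazard is evaluated.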
The estimation sample consists of young workers with low educational attainment and qualifications in the Spanish labor market between 2000 and 2019. Their average age is 21.9 years and 36.7% are women. On average, each worker experiences 6.4 employment episodes and 6.7 unemployment episodes, with mean durations of 10.3 and 11.7 months, respectively. Approximately 25% of these episodes last no more than 2 months in unemployment and 3 months in employment, whereas 5% last at least 37 months in unemployment and 35 months in employment, indicating both substantial turnover and persistent non-employment spells. After expanding the spell data to the monthly level and redefining the time-varying information accordingly, the estimation sample contains 8,001,341 person-month observations, of which 4,485,930 correspond to unemployment and 3,515,411 to employment.
Appendix A.2 reports the main descriptive statistics.
2.2. Econometric Model
The empirical framework jointly estimates monthly transition rates from two mutually exclusive labor market states: employment and unemployment. Individuals are followed over the observation period and, in each month, are observed in exactly one of these two states (see Allison, 1982; Jenkins, 1995; Lancaster, 1992). The model is specified in discrete time and allows the transition process to differ across origin states, both in terms of observed covariates and unobserved heterogeneity.
Consider an individual who begins an (un)employment episode, where time is measured in monthly intervals. The individual is then observed month by month until either a transition occurs from the current state to the destination state modeled for that equation or the observation period ends, in which case the spell is right-censored.
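The hazard rate out of (un)employment is modeled as follows:

$$h_j(t) = F\left(\gamma_j(t) + x_j(t)'\beta_j + \eta_j\right) \qquad (1)$$

where $j \in \{u, e\}$, with $u$ denoting unemployment and $e$ denoting employment, and $F(\cdot)$ maps the index into the unit interval. The hazard rate at month $t$ depends on duration dependence, captured by $\gamma_j(t)$, on a vector of observed covariates $x_j(t)$, whose effects are summarized by $\beta_j$, and on a state-specific unobserved component $\eta_j$.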
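A discrete-time hazard of this kind can be sketched as follows. The logistic link is an assumption made for illustration (a complementary log-log link is another common choice), and the argument names gamma_t, xb, and eta are hypothetical labels for the duration term, the covariate index, and the unobserved component.

```python
import math

def hazard(gamma_t, xb, eta):
    """Discrete-time hazard with a logistic link (the link is an assumption)."""
    return 1.0 / (1.0 + math.exp(-(gamma_t + xb + eta)))

def survivor(gammas, xbs, eta):
    """Probability of remaining in the state through month t = len(gammas).

    Computed as the product over months s = 1..t of (1 - hazard at s).
    """
    s = 1.0
    for gamma_t, xb in zip(gammas, xbs):
        s *= 1.0 - hazard(gamma_t, xb, eta)
    return s

# With a zero index the monthly hazard is one half, so surviving two months has probability 0.25.
h_example = hazard(0.0, 0.0, 0.0)
s_example = survivor([0.0, 0.0], [0.0, 0.0], 0.0)
```

Because the unobserved component enters the index additively, a larger value shifts the hazard up in every month of the spell, which is what distinguishes it from the month-specific duration terms.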
In both equations, duration dependence is modeled flexibly through sets of duration dummies. In the employment equation, the duration profile of apprenticeship contracts is allowed to differ from that of other temporary contracts, in line with the evidence suggested by non-parametric Kaplan–Meier estimates. The covariate vector includes both common and state-specific regressors. In the employment equation, the covariates capture individual characteristics, type of labor contract, sector of activity, regional conditions, and calendar-time effects. In the unemployment equation, the covariates include individual characteristics, characteristics of the previous job spell, regional labor market conditions, and calendar-time effects. In both cases, the specification allows for time-varying covariates whenever the underlying information changes over the spell. This flexible formulation is intended to isolate duration dependence from compositional differences across workers and labor market conditions.
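A key feature of the proposed framework is that, unlike the earlier hshaz2 and hshaz commands, which are restricted to the estimation of hazard rates from a single state, it allows the joint estimation of transition rates from both states within a unified specification. By applying the methodology proposed by Heckman and Singer (1984), it also accommodates unobserved heterogeneity as a non-parametric bivariate discrete mixture of state-specific latent effects $(\eta_u, \eta_e)$. The contribution to the total likelihood function of an individual $i$ with unobserved heterogeneity captured by the vector $\eta = (\eta_u, \eta_e)$ is given by:

$$L_i(\eta_u, \eta_e) = \prod_{k \in \mathcal{U}_i} \left[h_u(t_k \mid x_u, \eta_u)\right]^{d_k} \left[1 - h_u(t_k \mid x_u, \eta_u)\right]^{1 - d_k} S_u(t_k - 1 \mid x_u, \eta_u) \times \prod_{k \in \mathcal{E}_i} \left[h_e(t_k \mid x_e, \eta_e)\right]^{d_k} \left[1 - h_e(t_k \mid x_e, \eta_e)\right]^{1 - d_k} S_e(t_k - 1 \mid x_e, \eta_e) \qquad (2)$$

where $\mathcal{U}_i$ and $\mathcal{E}_i$ index the unemployment and employment episodes of individual $i$, $t_k$ is the last observed month of episode $k$, and $d_k$ equals one if the episode ends in a transition and zero if it is right-censored. $h_u(t)$ and $h_e(t)$ denote the hazard rates out of unemployment and employment at month $t$, respectively. Likewise, $S_u(t)$ and $S_e(t)$ denote the survivor functions at month $t$ in the unemployment and employment states, with $S_j(t) = \prod_{s=1}^{t} \left[1 - h_j(s)\right]$. As shown in Equation (2), both the hazard rates and the survivor functions depend on state-specific observed and unobserved covariates, some of which may vary over time.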
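The likelihood contribution of a single individual, conditional on the latent effects, can be sketched in Python as a product over episodes. This is a standard discrete-time construction written under assumed conventions: each episode is passed as a list of monthly hazards already evaluated at the individual's covariates and latent effect, plus a flag indicating whether it ended in a transition. The function names are hypothetical.

```python
def spell_contribution(hazards, completed):
    """Likelihood contribution of one episode observed for len(hazards) months.

    Survival through all months before the last one, times either the hazard
    in the last month (completed episode) or one minus it (right-censored).
    """
    surv = 1.0
    for h in hazards[:-1]:
        surv *= 1.0 - h
    last = hazards[-1]
    return surv * (last if completed else 1.0 - last)

def individual_contribution(spells):
    """Product of episode contributions over all episodes of one individual."""
    value = 1.0
    for hazards, completed in spells:
        value *= spell_contribution(hazards, completed)
    return value

# Hypothetical history: a completed two-month unemployment episode followed by
# a right-censored two-month employment episode.
history = [([0.2, 0.25], True), ([0.2, 0.25], False)]
contribution = individual_contribution(history)
```

In applied work the product is usually accumulated in logs to avoid numerical underflow for individuals with long histories.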
Unobserved heterogeneity is introduced through a discrete-mixture specification that is allowed to differ across the unemployment and employment equations. The latent effects are represented by a finite number of support points for each state, and their joint distribution is modeled as a bivariate discrete mixture. Identification is achieved from the longitudinal structure of the data and the joint estimation of transitions from both states. Since individuals contribute repeated employment and unemployment episodes over time, the model exploits within-individual variation in spell durations, censoring patterns, and transition histories to recover the support points and their associated probabilities.
The number of support points used to approximate the distribution of unobserved heterogeneity must balance flexibility and parsimony. In practice, models with 2, 4, or 9 mass points can be estimated, corresponding to increasingly rich discrete approximations of the latent distribution. Model choice should be guided by a combination of statistical fit and practical interpretability. In particular, improvements in the log-likelihood and information criteria, together with the stability of the estimated support points and their probabilities, provide a natural basis for selection (see, for example,
Gaure et al., 2007;
Nicoletti & Rondinelli, 2010). When additional mass points generate only marginal gains in fit, very small estimated probabilities, or substantially higher computational costs without altering the substantive conclusions, the more parsimonious specification is preferable. In the empirical application below, the four-point specification offers a useful compromise between flexibility, interpretability, and computational feasibility.
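A comparison of specifications by information criteria can be sketched as below. The log-likelihood values, parameter counts, and sample size are invented purely for illustration and are not results from this article; the choice of sample size entering the BIC (individuals versus person-months) is also left as an assumption.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion."""
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical fits: label -> (maximized log-likelihood, number of parameters).
candidates = {
    "2 points": (-1000.0, 10),
    "4 points": (-990.0, 13),
    "9 points": (-989.5, 20),
}
n_obs = 500  # hypothetical sample size

best_aic = min(candidates, key=lambda name: aic(*candidates[name]))
best_bic = min(candidates, key=lambda name: bic(*candidates[name], n_obs))
```

In this made-up example both criteria favor the intermediate specification, because the richest model adds parameters while barely improving the log-likelihood.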
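The total likelihood function for the empirical specification used in this article is:

$$L = \prod_{i=1}^{N} \sum_{a=1}^{2} \sum_{b=1}^{2} \pi_{ab}\, L_i\left(\eta_u^{a}, \eta_e^{b}\right) \qquad (3)$$

with $\pi_{ab} = \Pr\left(\eta_u = \eta_u^{a}, \eta_e = \eta_e^{b}\right)$, where $\sum_{a=1}^{2} \sum_{b=1}^{2} \pi_{ab} = 1$, $N$ is the number of individuals, and $L_i(\eta_u^{a}, \eta_e^{b})$ is the likelihood contribution of individual $i$ evaluated at the corresponding pair of support points. Equation (3) corresponds to the four-point specification used in the baseline empirical exercise. More generally, the same framework can be extended to alternative numbers of support points by enlarging the discrete support of the latent effects in each state and reparameterizing the associated probability masses accordingly.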
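The mixing step can be illustrated with a short Python sketch that integrates the latent effects out over a finite set of mass points. The multinomial-logit mapping from unrestricted parameters to probabilities is one common reparameterization and is an assumption here, as are the function names; the conditional log contributions are taken as given inputs.

```python
import math

def mass_probabilities(free_params):
    """Map M-1 unrestricted parameters to M probabilities (assumed multinomial logit)."""
    scores = list(free_params) + [0.0]  # last mass point serves as the reference
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def log_mixture(log_contribs, probs):
    """Log of sum_m probs[m] * exp(log_contribs[m]), computed with log-sum-exp."""
    terms = [math.log(p) + lc for p, lc in zip(probs, log_contribs)]
    top = max(terms)
    return top + math.log(sum(math.exp(term - top) for term in terms))

def total_loglik(per_individual_log_contribs, probs):
    """Sum of individual log-likelihoods after mixing over the mass points."""
    return sum(log_mixture(lc, probs) for lc in per_individual_log_contribs)

# Four equally likely mass points and one hypothetical individual whose
# conditional contributions are 0.2, 0.4, 0.1, and 0.3.
probs = mass_probabilities([0.0, 0.0, 0.0])
example = [[math.log(0.2), math.log(0.4), math.log(0.1), math.log(0.3)]]
value = total_loglik(example, probs)
```

Working with unrestricted parameters keeps the probabilities positive and summing to one without imposing constraints on the optimizer, and moving to a different number of mass points only changes the length of the parameter list.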
The model parameters are estimated by maximum likelihood, using analytical expressions for the gradient vector and the Hessian matrix of the log-likelihood function. This makes it possible to implement the d2 method of Stata's ml command for the maximization of the log-likelihood, which improves numerical efficiency in the estimation of the two-state duration model.
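The role of analytical derivatives can be illustrated with a deliberately small example: Newton-Raphson maximization of an intercept-only logistic hazard, where the log-likelihood, gradient, and Hessian all have closed forms. This toy model and its function names are assumptions for illustration; it is not the two-state model of this article, and it is written in Python rather than as a Stata evaluator.

```python
import math

def loglik_grad_hess(alpha, exits):
    """Log-likelihood, gradient, and Hessian for an intercept-only logit hazard.

    'exits' is a list of 0/1 indicators, one per person-month at risk.
    """
    p = 1.0 / (1.0 + math.exp(-alpha))
    n = len(exits)
    y = sum(exits)
    loglik = y * math.log(p) + (n - y) * math.log(1.0 - p)
    grad = y - n * p
    hess = -n * p * (1.0 - p)
    return loglik, grad, hess

def newton(exits, alpha=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations using the analytical gradient and Hessian.

    Requires at least one exit and one non-exit, otherwise no finite maximizer exists.
    """
    for _ in range(max_iter):
        _, grad, hess = loglik_grad_hess(alpha, exits)
        step = grad / hess
        alpha -= step
        if abs(step) < tol:
            break
    return alpha

# One exit in four person-months: the maximizer is the logit of 0.25.
alpha_hat = newton([1, 0, 0, 0])
```

Supplying exact derivatives avoids the repeated function evaluations that numerical differentiation would require, which matters when each evaluation runs over millions of person-month observations.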
In the employment equation, the covariates include: (i) personal characteristics (sex, age, age squared, education, and nationality); (ii) business-cycle conditions, measured by the quarterly regional unemployment rate and its interactions with the elapsed duration of the spell; (iii) regional fixed effects; (iv) a fully non-parametric baseline hazard defined through employment-duration dummies; and (v) characteristics of the current job, including the type of contract and the sector of activity. In the unemployment equation, the specification includes: (i) the same set of personal characteristics; (ii) the quarterly regional unemployment rate; (iii) regional fixed effects; (iv) a fully non-parametric baseline hazard defined through unemployment-duration dummies; and (v) characteristics of the previous job spell, including the type of contract previously held and the sector of the last job. Descriptive statistics for both sets of regressors are reported in Appendix A.2.