Smoothed Maximum Score Estimation of Discrete Duration Models

This paper extends Horowitz’s smoothed maximum score estimator to discrete-time duration models. The estimator’s consistency and asymptotic distribution are derived. Monte Carlo simulations using various data generating processes with varying error distributions and shapes of the hazard rate are conducted to examine the finite sample properties of the estimator. The bias-corrected estimator performs reasonably well for the models considered with moderately-sized samples.


Introduction
Parametric discrete-time duration models are used extensively within econometrics and the other statistical sciences. Since misspecification of these models can lead to invalid inferences, a variety of semiparametric alternatives have been proposed. However, even these alternative semiparametric estimators exploit certain smoothness and moment conditions, which may be untenable in some circumstances. To address these shortcomings, we propose a new estimator, based on Horowitz (1992)'s smoothed maximum score estimator of single-period binary choice models, which relaxes these assumptions. To motivate and contextualize this estimator, we use this Introduction to review the relevant literature on discrete duration and binary choice models and indicate how our proposed estimator fills a gap in the literature.
In econometrics, discrete-time duration models are typically framed as a sequence of binary choices. The probability of remaining in a state at time s (the continuation probability) is denoted F s (β 0 ), and the hazard rate is simply h s (β 0 ) = 1 − F s (β 0 ). Many parametric forms have been employed for the hazard rate in these models including extreme value, logistic, normal and other parsimonious specifications. Examples using a logistic specification include: Huff-Stevens (1999), Finnie and Gray (2002), Bover et al. (2002) and D' Addio and Rosholm (2005); normal distribution: Meghir and Whitehouse (1997) and Chan and Huff-Stevens (2001); extreme value (also known as the complementary log-log model): Baker and Rea (1998), Cooper et al. (1999), Holmas (2002), Fennema et al. (2006) and Gullstrand and Tezic (2008). These and others were reviewed in Allison (1982) and Sueyoshi (1995). Hess (2009) has suggested using the generalized Pareto distribution, which nests the extreme value and logistic distributions. These specifications lead naturally to maximum likelihood estimation of β 0 , although it is useful to note that there are alternative ways to estimate β 0 including nonlinear regression, treating F s (β 0 ) as a conditional mean. As with any parametric approach, misspecification of the hazard rate can lead to invalid inferences. In this regard, we consider various relevant semiparametric alternatives, which relax the parametric assumptions.
We note first that semiparametric estimation of continuous-time models has been the focus of substantial research in the discipline. Numerous authors have developed distribution theory for semiparametric estimation of various continuous-time duration models including Horowitz (1999), Nielsen et al. (1998), Van der Vaart (1996) and Bearse et al. (2007). While these and other semiparametric estimators allow for the relaxation of some parametric assumptions associated with continuous-time duration models, they are not generally appropriate when the duration random variable has a discrete distribution.
We adopt the standard approach in econometrics of constructing the continuation probability from an underlying latent regression structure. In a standard single-period basic binary choice model, is the usual indicator function, Z is an index function of observable random variables and unknown parameters and U has a distribution function F. With discrete-time duration models, the observed duration is the sum of a sequence of indicators so that There is a large literature on semiparametric estimation of single-period binary choice models. We briefly review this, highlighting how it has been adapted for certain multivariate discrete choice and/or discrete-duration models and finally how our proposed estimator fills a gap in this research. Since in some cases, the conditional mean of Y in the single-period case can be written as F(β 0 ), the parameter of interest, β 0 , can be estimated from a semi-parametric regression. This was suggested by Ichimura (1993) to obtain a √ N-consistent estimator of β 0 . With respect to duration models and exploiting the fact that F s can also be written as the conditional mean of the choice variable, Reza and Rilstone (2014) minimized a sum of squared semiparametric residuals to estimate the parameters of interest. In a similar vein, Klein and Spady (1993) developed a semi-parametric maximum likelihood estimator of β 0 with the single observation likelihood function written as l Klein and Spady's (1993) estimator essentially consists of replacing F with a nonparametric conditional mean function. Reza and Rilstone (2016) adapted Klein and Spady's (1993) estimator to the discrete duration case. They also derived the efficiency bounds and showed that their estimator obtained these bounds. We note that the approaches in Ichimura (1993) and Klein and Spady (1993) require continuity of F in the underlying covariates and are limited with respect to the forms of allowable heteroskedasticity (for example, heteroskedasticity from time-varying parameters is precluded). Another problem is simply that identification may not be possible under the mean-independence restriction that E[U|Z] = 0. 1 By extension, the estimators of Rilstone (2014, 2016) suffer the same shortcomings as applied to duration models.
With respect to single-period binary choice models, Manski's (1975Manski's ( , 1985 Maximum Score (MS) estimator circumvents these limitations using simply the median-independence restriction that Median[U|Z] = 0. The MS estimator can be written as the maximizer of: where Z i (β) is an index function of the observable covariates. As is usually the case, a normalization of β is necessary. For the estimator to be consistent, a few restrictions need to be imposed, in particular with respect to the distribution of U. The shortcomings of the estimator are that it is only N 1/3 -consistent, and its asymptotic distribution, a form of Brownian motion, is not amenable for use in the applied work. From one perspective, the shortcomings of the MS estimator derive from its use of the non-differentiable indicator function. Horowitz (1992) largely circumvented its limitations in this regard by replacing the indicator function with a smoothed indicator function, K † (Z i (β)/γ). The objective function for the Smoothed Maximum Score (SMS) estimator is: 1 Horowitz (1998) gave a discussion of these issues.
The SMS is typically better than N 1/3 -consistent, but slower than √ N, the speed of convergence depending on the smoothness of K † and the distribution of the random components of the model. Note that the √ N-convergence of the estimators such as Klein and Spady's (1993) is linked to the manner in which they use kernels. These estimators are a form of double averages. However, the objective functions for MS and SMS are nonparametric point estimators, which are single averages. With some caveats, the SMS estimator reflects the fact that the only exploitable information is at or close to the median of the U's. The √ N estimators effectively use all the data points. The main objective of this paper is to show how to extend SMS to estimate discrete duration models. The MS and SMS estimators have been used in other situations such as Lee (1992) and Melenberg and Van Soest (1996), who extended the MS and SMS, respectively, to ordered-response models. De Jong and Woutersen (2011) have extended the SMS estimator to binary choices with dynamic time series data. Fox (2007) adapted the MS estimator to multinomial choices. Charlier et al. (1995) extended the SMS to panel data. Other researchers have modified the MS and SMS estimators to improve their sampling properties. Kotlyarova and Zinde-Walsh (2010) suggested using a weighted average of different SMS estimators to reduce mean squared error. Iglesias (2010) derived the second-order bias, which can be used to reduce the bias of the SMS estimator. Jun et al. (2015) proposed a Laplace estimator alternative to improve on the N 1/3 -consistency of the MS estimator. To our knowledge, neither the MS nor SMS estimators have been extended to duration models.
Sections 2 and 3 discuss the class of models considered and present the basic estimator along with its main asymptotic properties. Section 4 provides some simulation results concerning the sampling distribution of the estimator, and Section 5 concludes.

Modelling
As mentioned, a standard approach for modelling a discrete duration process is to construct it as a sequence of binary choice models, with observed and unobserved heterogeneity. The standard binary choice model is adapted such that in each time period, s, a choice is made by individual i to continue in a state if the latent variable: is greater than zero. Here, Z is (β) = X * is + X is β 2 is an index where X * is is a scalar random variable and X is is a k × 1 vector, which may include a function of s, while β is a k × 1 vector of constants.
We assume the U is 's and . . , S. We estimate the parameters by effectively estimating the density of Z is (β 0 ) at zero by nonparametric methods. For notational convenience, we often suppress the i subscripts. Another way to view the modelling is that in any given period s with Y s−1 = 1, this is a standard binary choice variable with the key difference being that the index Z is a function of some covariates and the number of completed periods, s. The duration variable for period s is simply The evolution of the Y s 's, conditional on the covariates and duration, is given by: 2 Some normalization of the parameter space is necessary. We find it most convenient to impose a unit coefficient on The model is easily reformulated to incorporate functions of the Y j 's, j ≤ s as conditioning variables.
Note that this representation is such that Y s is zero if the subject left the state prior to period s and becomes a standard binary choice model in period s if the subject elected to continue in the state in period s − 1.
We put an upper limit, S, on the length of spells. This is common in empirical work. 4 Allowing for unbounded S introduces technical difficulties that are not readily resolved.
It is useful to note that by iterated expectations: so that, tautologically, F s , the continuation probability function, is:

The Estimator
Adapting the SMS estimator to the discrete duration model as outlined in Section 2, the objective function is: K † (w), a smoothed indicator function, is the anti-derivative of K(w) = dK † (w)/dw and has the properties: |K † (w)| ≤ M < ∞, lim w→−∞ K † (w) = 0, lim w→∞ K † (w) = 1. In most kernel density estimation, K is a density function and K † is its associated cumulative distribution function. The technical requirements here sometimes require use of a higher order kernel.
Note that the objective function is of the same form as the usual SMS estimator with the modifications that there is a double summand over individuals and time periods and each of the summands at period s is multiplied by Y s−1 , so that after exit, there is no further contribution to the objective function by that individual.
Implicitly, we impose the identification condition that the coefficient on X * is is unity 5 (e.g., Li and Racine 2007). Horowitz (1992) discussed the identification issue. X * is is assumed to have a continuous distribution, conditional on X is and Y is−1 . Let: For example, Cameron and Heckman (1998) defined S as the upper limit to years of education. In practice, for programming purposes, it suffices to set S equal to the longest duration in the dataset being used. In the simulations reported in Section 4, the maximum duration was 37.
The estimator solves the first-order conditions ψ N ( β) = 0, which are given by: Concerning notation, when a function's argument β is suppressed, it is evaluated at β 0 , e.g., G(u s |z s , x s , y s−1 ) and g(u s |z s , x s , y s−1 ) denote the cumulative distribution and density functions of U s conditional on Z s , X s , Y s−1 = 1, and f (z s |x s , y s−1 ) denotes the density functions of Z s conditional on X s , Y s−1 . The superscript [j] indicates the j th derivative of a function with respect to z s , and in particular, Let Pr[u s , x s , x * s |Z s−1 ] denote the probability distribution of U is , X is , X * is given Z i,s−1 . The distributional assumptions we make are as follows.
Assumption 2. For s = 1, . . . , S, (a) the support of the distribution of x * s , x s is not contained in any proper linear subspace of R k+1 , (b) 0 < Pr(y s = 1|x * s , x s , y s−1 = 1) < 1 for almost every x * s , x s and (c) for almost every x s , y s−1 , the distribution of x * s conditional on x s , y s−1 has everywhere positive density with respect to the Lebesgue measure. Assumption 11. Q is negative definite.
These assumptions adapt those in Horowitz (1992) to allow for the dependency structure. They also embed Manski's (1985) assumptions with S = 1. Notice that the random sampling assumption refers to N random draws within each being the potentially S observations.
Identification (see Proof of Proposition 1 in Appendix A) follows by adapting Manski's (1985) proof for the MS estimator. Of interest here is that we wish to allow for time dependence. Note that for the MS/SMS case, nothing precludes the inclusion of a constant in the index so long as, say, x s is not co-linear 6 (in fact, simulation and empirical results such as in Horowitz (1998) indicate good results for intercept estimates). For the m-multinomial choice model, Lee (1992) included m non-stochastic threshold parameters (including a constant). In our case, the same applies for including certain non-stochastic functions of s in x s , such as including indicators for each s or a polynomial in s. For parsimony in our numerical/empirical work, we have included quadratics to allow for increasing, decreasing and non-monotonic time dependency. This allows for straight-forward testing. In this regard, we note that the semiparametric information matrix derived in Reza and Rilstone (2016) was singular for this class of models. There is no contradiction here, since the singularity indicates that those parameters are not estimable at the √ N-rate; it does not imply that they cannot be identified or estimated at a less than √ N-rate, which we do here. We have the following lemma, which permits simple derivation of the asymptotic properties of the estimator. The asymptotic distribution of the estimator can be summarized easily using the following result.
The proofs are in Appendix A. In the statement of the proposition, note the presence of the first-order bias, γ m Q −1 B, for which it may be advisable to adjust the raw estimator. One of the benefits of this estimator is that one can effectively ignore the dependence of the observations, pool all the observations across individuals for whose Y i,s−1 = 1 and use standard SMS optimization procedures. This is what we have done in the simulations. Reza and Rilstone's (2016) setup (extension of Klein and Spady 1993) allows for estimation of the hazard rate, 1 − F s , with a natural estimate of time dependence from the semiparametric estimates of ∆h s = F s−1 − F s . Note that Reza and Rilstone's (2016) estimator of ∆h s only has a √ Nγ-rate of convergence. As for the SMS estimator, we can consider the optimal choice of window width. As with Horowitz (1992), we consider choices that minimize an MSE criterion. Therefore, if we consider that the asymptotic results correspond to the distribution of a random variable, say W, with mean 6 In this case, the random sampling assumption should be interpreted as referring to the stochastic elements of x s . γ m Q −1 B and variance Q −1 CQ −1 /(Nγ), we can consider minimizing, say, the inner product MSE of Ω 1/2 W, where Ω is a positive definite weighting matrix, i.e., minimize E[W ΩW] with respect to γ. This results in: For inferences it is necessary to obtain consistent estimates of the components of the first-order bias and variance. These cannot be directly estimated as they depend on the distribution of the unobservable U's. However, by extension of the arguments in Horowitz (1992), they may be obtained through various derivatives of the objective function. Specifically, put: By the uniform law of large numbers, B( β) It is well known that the first-order asymptotic results may provide a poor approximation to the sampling distribution of the SMS estimator. Thus, it may be preferable to use some higher order method to approximate the distribution. Apart from Iglesias (2010) who applied the results in Rilstone et al. (1996) to derive the second-order bias of β, little is known (explicitly) about the second-order properties of the SMS estimator. Estimates can be bootstrapped. In this regard, we note that one should resample individuals. That is, bootstrap estimates should be based on resamples: {Z * iS } N i=1 , where the * 's indicate random draws from the original data. Horowitz (2002) documents some of the issues associated with bootstrapping the distribution of β. In particular, the corresponding re-estimates: β * j , say, and corresponding standard errors should be calculated using an under-smoothing window-width such as γ ∈ [.5γ * , γ * ].

Simulation Exercise
To examine the estimator's performance in finite samples, we conducted Monte Carlo simulations with several Data Generating Processes (DGPs). We adapted simulations in Horowitz (1992) by augmenting the models with duration dependence, and a variety of error distributions. The latent processes we considered included those with homoskedastic errors: u is ∼ N(0, 1) and those with heteroskedastic errors: We conducted the simulations for two sample sizes, N = 500 and N = 1000. The X's were drawn as i.i. d. N(0, 1). For the DGP with homoskedastic normal errors, this resulted in duration times with averages of 5.7 (N = 500, 1000) and standard deviations also 5.7 (N = 500, 1000). With heteroskedastic errors, the average duration times were 8.7 (N = 500, 1000) with standard deviations of 9.6 (N = 500) and 9.5 (N = 1000). For identification purposes, the coefficient on X 1 was normalized to one, and our key parameter of interest was the coefficient on X 2 , with a true value of one. We conducted 500 replications for each specification. We followed Horowitz (1992) to estimate the parameters in two steps: first using simulated annealing to find the approximate maximizer of Ψ N (β) followed by gradient methods for greater precision. We then used the bias correction described in the previous section to bias-adjust the parameter estimates. We used a Gaussian kernel with a window-width γ = N −1/6 . 7 Standard errors and the bias correction were based on the consistent estimators B( β), Q( β) and C( β) from Equation (13).
Tables 1-3 report the summary statistics of the simulations for the estimates of the coefficients on X 2 , (s/100) and (s/100) 2 , respectively. We also conducted corresponding probit estimates as benchmarks. Note that, with normal errors, the probit estimates were fully efficient. The summary statistics indicated that the semiparametrically-estimated coefficients on X 2 were very close to the true parameter. The bias and standard deviation both decreased with sample size. This is particularly true compared to the (misspecified) probit estimator when the errors were heteroskedastic. As for the coefficient on the linear duration dependence term (s/100), there appeared to be some bias, particularly in the presence of heteroskedasticity. However, the bias and RMSE of the SMS estimators diminished with sample size. This was not the case with the probit estimators. As indicated earlier, estimating duration dependence term at the √ N-rate was not possible. The estimates of the coefficient on the quadratic term of the duration dependence were somewhat biased, although the bias decreased with the sample as did the RMSE. Larger sample sizes than used here may be required to estimate, with precision, more nuanced forms of duration dependence using the proposed SMS in these contexts. Estimates using a fourth-order kernel as in Horowitz (1992) yielded very similar results. The non-stochastic window-width was used, rather than, say, a plug-in window-width, to keep the simulations manageable. We also examined the distribution of the estimates. Figures 1-3 graph the QQ-plots of the standardized SMS estimates of the coefficients on X 2 , s/100 and (s/100) 2 , respectively. Most of the standardized estimates appeared to be close to the standard normal quantiles, except for a few extreme values. The extreme values are potentially due to difficulties with numerical optimization. This would seem to indicate that the sampling distributions of the estimators in our simulation exercise were reasonably well approximated by a normal distribution.

Conclusions
This paper has shown that the SMS estimator can be readily adapted to consistently estimate the parameters of a popular class of discrete duration models, while relaxing the distributional assumptions of parametric models and certain semiparametric models. The asymptotic distribution of the estimators was derived and can be readily approximated using standard software. Simulations illustrated the viability of the approach. We are currently working on an empirical application of the estimator. is (β 0 ), suppress the i subscripts, and write: suppressing the X s and Y s−1 arguments in g(u s |z s , x s , y s−1 ) and h(z s |x s , y s−1 ).