Minimum Message Length Inference of the Exponential Distribution with Type I Censoring

Data with censoring is common in many areas of science and the associated statistical models are generally estimated with the method of maximum likelihood combined with a model selection criterion such as Akaike’s information criterion. This manuscript demonstrates how the information theoretic minimum message length principle can be used to estimate statistical models in the presence of type I random and fixed censoring data. The exponential distribution with fixed and random censoring is used as an example to demonstrate the process where we observe that the minimum message length estimate of mean survival time has some advantages over the standard maximum likelihood estimate.


Introduction
In Type I random censoring we observe for each item i either the true survival time T i = t i (t i > 0) or the censoring time C i = c i (c i > 0), where capital letters are used to denote random variables. The data consists of joint realisations of the random variables (Y i = y i , ∆ i = δ i ) (i = 1, . . . , n) where The censoring time C i may be fixed (i.e., C i = c for all i = 1, . . . , n) or a random variable that may depend on other factors (e.g., loss to follow-up). The likelihood function of n observed data points D = {(y 1 , δ 1 ), . . . , (y n , δ n )} is where p T (t|θ) and F T (t|θ) denote the probability density and the cumulative density function of the random variable T, respectively. Inference about the survival times (t 1 , . . . , t n ) is of key interest in many areas of science and is commonly done by maximizing the likelihood and dropping terms relevant to C only. This manuscript examines inference of models in the presence of censored data under the minimum message length (MML) framework. MML (see Section 3) is Bayesian technique for model selection and parameter estimation that is based on data compression and key principles of information theory. MML is known to possess strong theoretical properties [1][2][3] and has previously been successfully applied to a wide range of statistical models [1]. Here, we demonstrate how MML can be used to infer models under fixed censoring as well as type I random censoring. We use the exponential distribution (see Section 2) as a simple example to demonstrate the key steps and compare the MML estimator to the well-known maximum likelihood estimator in this setting (see Section 2.1). Although MML analysis of the exponential distribution is not new (see, for example, [1,4]), the MML principle has not been applied to any kind of survival data with censoring to date.
The main contributions of this manuscript are to: (i) introduce the MML principle of inductive inference and demonstrate how the Wallace-Freeman MML approximation can used to infer exponential models with type I censored data; (ii) show that the MML estimate of the mean lifetime has some advantages over the usual maximum likelihood estimate for small samples and that it converges to the maximum likelihood estimate for large sample sizes, (iii) incorporate the proposed codelengths for censored exponential distributions into MML finite mixture models allowing for inference of all parameters as well as the number of mixture classes; and (iv) compare the MML principle to the closely related minimum description length principle.

Exponential Distribution
Consider the case of a randomly censored exponential parameter studied in [5] where the lifetime data and the censoring data are assumed to be exponentially distributed and α, β > 0 denote the mean censoring time and survival time, respectively. Under this model, the joint probability distribution of (Y i = y i , where (Y i , ∆ i ) are defined in (1) and (2) respectively. In contrast to random censoring, in fixed censoring an item is observed for a period of time, say c > 0, and its actual survival time t i is known if the item fails before time t i ≤ c; otherwise, we only know that the item survived past t i > c. In the case of exponentially distributed survival times, the observed data (Y = y i , ∆ i = δ i ) follows where c > 0 is a fixed constant (the follow up period) known a priori. Given n data points D = {(y 1 , δ 1 ), . . . , (y n , δ n )}, the aim is to estimate the unknown mean lifetime survival β > 0 (random censoring) or θ > 0 (fixed censoring).

Maximum Likelihood Estimation
The method of maximum likelihood is the most common approach used to obtain parameter estimates in parametric models. Under the censored exponential model, maximum likelihood proceeds by setting the parameter estimateβ(D) to the value that maximises the probability of the data. From (4) and (5), the joint probability of the data D is where k = (∑ i δ i ) is the number of observed uncensored survival times. Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function Maximum likelihood estimates of the mean survival and censoring times arê respectively. Provided the count of observed survival times k ∈ (0, n), the maximum likelihood estimatesα(D) andβ(D) are finite; otherwise, if k = 0 or k = n, one of the maximum likelihood estimatesα(D) orβ(D) is infinite. Kim [5] showed that maximum likelihood estimates have infinite mean and variance in this setting. However, the expected value of the maximum likelihood estimateβ(D) is finite if we condition on k > 0. Kim [5] further showed that, provided k ∈ (0, n), the maximum likelihood estimatesα(D) and β(D) are unbiased, strongly consistent (without any condition on k) and asymptotically normally distributed.
In the case of Type I censored data with a fixed censoring time c > 0, the negative log-likelihood function of the data is where k = ∑ i δ i as before. Under fixed censoring, the maximum likelihood estimate of the mean survival timeθ(D) is (see, for example, [6]) In case of no censoring (i.e., k = n implying complete data), (11) reduces to (∑ i y i )/n, which is the usual maximum likelihood estimate for the exponential distribution with complete data. The sampling distribution of (11) is asymptotically normal with mean θ and variance Conditional upon k > 0, Mendenhall and Lehman [7] obtained the exact mean and variance of the maximum likelihood estimate (11) for a = 1, 2, . . . and V{k −1 } = E{k −2 } − (E{k −1 }) 2 . However, the large sample normal approximation of the distribution of the maximum likelihood estimates is inaccurate and not representative of the behaviour of the estimate in the small to moderate sample size regime [7]. Balakrishnan and Davies [8] further show that the maximum likelihood estimate computed based on a censoring time c will always produce an estimate which is Pitman closer to the data generating model θ than the maximum likelihood estimate computed with a shorter censoring time c < c . In the next section, we introduce the MML principle of inductive inference (see Section 3) and demonstrate how MML can be used to infer exponential models with censoring (see Section 4).

Minimum Message Length
Introduced in the late 1960s by Wallace and Boulton [9], the minimum message length (MML) principle [1,[9][10][11] is a framework for inductive inference based on ideas in information theory and data compression. Under the MML framework, the aim is to transmit a set of data (a message) from a hypothetical sender to a receiver over a noiseless transmission channel. The MML message is designed to consist of two parts:

1.
the assertion: an encoding of the model structure and the associated model parameters θ ∈ Θ ∈ R p . 2.
the detail: a description of the data D using the model p(D|θ) that was specified in the assertion.
The length of the assertion measures the complexity of the model, with complex models requiring longer codelengths compared to simpler models, while the detail captures how well a model fits the data. The length of the two-part message, I(D, θ), is the sum of the length of the assertion, I(θ), and the length of detail, I(D|θ); namely, Within the MML framework we seek the model that minimises the length of this message. Due to the two-part nature of the message, MML automatically balances the trade-off between model complexity and the goodness of fit of the model to the data. By measuring the quality of a model in (say) bits, MML is a yardstick that can be universally used to compare models with different parameters and structures. There exist several approches to computing message lengths (13), with the strict MML procedure (SMML) [1,12] and the MML87 approximation [1,10] being the most widely known. In contrast to the SMML procedure whose construction is known to be NP hard [13], the MML87 approximation is computationally tractable and most widely used in practice. The MML87 codelength approximation to (13) is where π θ (θ) is the prior distribution for the parameters θ, |J θ (θ)| is the determinant of the expected Fisher information matrix, p(D|θ) is the likelihood function of the model and κ p is a quantization constant [14,15] that depends on the number of parameters p. Specifically, for small p we have while κ p is well-approximated for large p by [1]: where γ ≈ 0.5772 is the Euler-Mascheroni constant. The MML87 codelength, evaluated at the minimum, is the shortest codelength of a two-part message that encodes both the model parameters θ ∈ Θ and the data D. The MML87 approximation is known to be invariant under smooth one-to-one reparameterizations of the likelihood function and is asymptotically equivalent to the well-known Bayesian information criterion (BIC) [16] as n → ∞ with p > 0 fixed; that is, where the O(1) term depends on the prior distribution, the Fisher information and the number of parameters p. Unlike model selection criteria such as Akaike's information criterion (AIC) and BIC, MML allows for both parameter estimation and model selection within the same unified framework. Furthermore, in models where the number of parameters grows with n or the sample size is relatively small, the difference between the MML87 codelength and BIC can be substantial. Examples include analysis of multiple short time series, where several measurements are collected over a period of time for a large number of study participants [17], learning finite mixture models [18] and discriminating between Poisson and geometric distributions based on observed data [19]. In the latter example, both the Poisson and geometric distribution have the same number of free parameters so that model selection with BIC is equivalent to choosing the model with the higher likelihood. In contrast, MML87 takes into account the complexity of each distribution [20] and not just the number of parameters, resulting in improved model selection performance for small sample sizes [19]. MML has been successful applied to a wide range of problems (e.g., decision trees [21], factor analysis [22], linear causal models [23], mixture modelling [18,24]) demonstrating excellent parameter estimation properties and model selection performance that is on par or better than commonly used techniques such as Akaike's information criterion (AIC) [25] and the Bayesian information criterion (BIC). A brief tutorial overview of minimum message length can be found in [19].

Minimum Message Length Inference of Type I Censored Exponential Data
To encode and transmit censored data D = {(y 1 , δ 1 ), . . . , (y n , δ n )} between the hypothetical sender and receiver within the MML framework, we have two options: • Transmit the censoring indicators (δ 1 , . . . , δ n ) first and then transmit the lifetime survival data (y 1 , . . . , y n ) given the receiver now knows which of the n data points are censored (see Section 4.1); • Transmit the censoring indicators and the lifetime data simultaneously (see Section 4.2).
We shall now estimate the MML87 codelength (15) for both the joint and the conditional encoding schemes for the censored exponential distribution setting introduced in Section 2.

Conditional Encoding of the Data
Under the conditional encoding framework, the sender transmits the censoring indicators δ i first, and then transmits the lifetime data y i using the conditional distribution of the data given the observed censoring indicators. The total message length of the data D = {(y 1 , δ 1 ), . . . , (y n , δ n )} and the parameters θ with the conditional encoding is where θ = {φ, ψ} are the model parameters defined below, I(φ, δ) denotes the message length of the censoring indicators δ = (δ 1 , . . . , δ n ), and I(ψ, y|δ) denotes the codelength of the survival data y = (y 1 , . . . , y n ), given that the censoring indicators are known to the receiver. From (3), the probability of observing an uncensored datum, say φ > 0, is implying that the censoring indicators follow a Bernoulli distribution with probability φ; that is, δ i ∼ Bernoulli(φ), or equivalently, k follows the binomial distribution k ∼ binomial(φ, n). The MML87 codelength of the binomial distribution was previously derived in [1,18] and is included here for completeness. Briefly, to compute the MML87 codelength (15) we require the Fisher information J φ (φ) and the prior distribution π φ (φ) for the probability of observing an uncensored datum. The Fisher information is well-known We assume the prior distribution for the censoring probability φ to be the beta distribution (φ ∼ beta(a, b)) with probability density function where a, b > 0 are the shape and scale parameters respectively and B(a, b) is the usual beta function. Substituting (21) and (22) into the MML87 codelength (15) and noting that κ 1 = 1/12, yields where, as before, k = (∑ i δ i ). The codelength (23) is minimised at the MML87 estimatê Note that, in the special case of uniform prior distribution (a = b = 1), the MML87 estimate simplifies toφ The shortest MML87 codelength for the censoring indicators is therefore given by I 87 (φ 87 , δ).
It remains to work out the conditional codelength of the surivival times given the censoring indicators, I 87 (ψ, y|δ). We note that the conditional likelihood of the lifetime datum y i is which is the exponential distribution with mean ψ = (1/β + 1/α) −1 ; that is, The Fisher information of the exponential distribution is In terms of the prior distribution for ψ, Schmidt and Makalic [4] consider the conjugate exponential distribution with a hyperparameter ψ 0 that controls the prior mean. Here, we would like an objective prior distribution on the mean ψ that is free of hyperparameters and has heavy tails so that large values of ψ are not penalized too severely. Additionally, our choice of the prior distribution should ideally lead to an easy to compute analytic estimate of ψ. A reasonable option is the half-Cauchy distribution which has heavy tails however it leads to MML estimates that are roots of polynomial functions of s = ∑ i (y i ). Instead, we will use the Fréchet (inverse Weibull) distribution with probability density function which is a type of generalized extreme value distribution and has Cauchy-like heavy tails. Substituting (29), (28) into (3), we obtain the MML87 codelength The MML87 estimate of the mean ψ isψ 87 (y) = s + 1 n + 1 (31) where, as before, s = ∑ i y i . The MML87 estimate corresponds to the usual maximum likelihood estimateψ ML (y) = s/n with one additional data point that has a unit contribution to the mean. The expected mean squared error of the MML87 estimate is which dominates the maximum likelihood estimate for ψ > n n + n(2n + 1) , n > 0.
As the sample size n increases, we note that lim n→∞ n n + n(2n implying the MML87 estimate dominates maximum likelihood for all ψ > 0.414 in terms of expected mean squared error for large n. However, we note that, unlike the MML87 estimate with this choice of prior distribution, the maximum likelihood estimate is invariant to scaling of the data. Substituting I 87 (φ, δ) and I 87 (ψ, y|δ) into (19) yields the total (conditional) codelength of the data. The MML87 estimates of the mean lifetimeβ 87 (D) and censoring timeα 87 (D) can be recovered from for φ ∈ (0, 1). Next, we examine how the same message can be encoded using joint encoding of lifetime data and censoring indicators.

Joint Encoding of the Data
Unlike in the conditional encoding, the sender now transmits the survival times and the indicator variables simultaneously. The negative log-likelihood function of the data D = {(y 1 , δ 1 ), . . . , (y n , δ n )} is given in (8). The Fisher information in this parameterization is We would like to use prior distributions for α and β that are comparable to those in the conditional coding described in Section 4.1. Noting that φ ∼ beta(a, b), ψ has the standard Frechet distribution and the Jacobian of the transformation from (φ, ψ) → (α, β) is implying that a commensurate joint prior distribution for α, β is where B(·, ·) is the beta function. In the special case where φ is given a uniform prior (i.e., a = b = 1), we have Substituting (36) and (39) into (15), the MML87 codelength is where θ = {α, β} and the quantization constant κ 2 = 5/(36 √ 3). The MML87 estimates that minimize the codelength (41) arê If required, the corresponding estimates of φ and ψ can be obtained from (37).

Properties
First we show the the conditional (19) and joint (41) MML codelengths are equivalent up to a constant to be specified below. From Sections 4.1 and 4.2, we note the joint density of (Y, ∆) can be expressed as a product of the binomial p ∆ (δ|φ) and exponential densities p Y (y|ψ) where Y and ∆ are independent random variables (see Kim [5] (p. 104)). Consequently, as MML87 is invariant under smooth one-to-one reparameterizations of the sampling model, the MML87 joint codelength (41) and the corresponding conditional codelength (19) are identical (except for the minor efficiency gain in the joint codelength discussed below). Specifically, the relationship between the joint codelength, I 87 (α, β, D) and conditional codelength, I 87 (ψ, φ, D), can be expressed as where the term log(3 √ 3/5) ≈ 0.0385 arises due to the quantization constant being smaller in higher dimensions since its more efficient to encode multiple parameters simultaneously compared to encoding each parameter independently.
Furthermore, as the MML87 estimate of φ isφ ∈ (0, 1) (see (24)), MML87 estimates of the mean survival times (α, β) are finite for all k ∈ [0, n] in contrast to the corresponding maximum likelihood estimates (9) which are finite for k ∈ (0, n). As n → ∞, it is wellknown that the MML87 estimates are equivalent to the maximum likelihood estimates (see (18)) which implies that the MML87 estimates are similarly asymptotically normally distributed and strongly consistent.
The expected mean square error E{(β − β * ) 2 } of the ML and MML87 estimates of β * , conditional on k > 0, is expressible in terms of the generalized hypergeometric function for any n > 0. Figure 1 (top) depicts the expected mean squared error between the MML87 and ML estimate of β * , expressed as a ratio of MML87 to ML with smaller values indicating preference for the MML87 estimate. The expected mean squared error of the MML87 estimate of β * was generally lower than the corresponding maximum likelihood estimate (except when the true censoring proportion φ * was small) with the biggest difference observed for small sample sizes, while the two estimates were practically indistinguishable for larger sample sizes n ≥ 100.
We also compared the MML87 and the ML estimates in terms of the relative entropy or the Kullback-Leibler (KL) divergence, conditional on k ∈ (0, n). The KL divergence between the true data generating model (α 1 , β 1 ) and the approximating model (α 2 , β 2 ) is which, as expected, is the sum of the KL divergences between two exponential and two binomial distributions. The KL divergence may be interpreted as the expected amount of extra information required to encode data from (α 1 , β 1 ) using the model (α 1 , β 2 ). The expected KL divergence for both the ML and MML estimators is shown in the bottom of Figure 1, conditional on k > 0 and k < n. It is clear that for n = 5 the MML87 estimate dominates the maximum likelihood estimate in terms of the KL divergence for all φ * ∈ (0.05, 0.95). When the sample size is increased (n = 10), the MML87 estimate exhibits smaller KL divergence compared to the ML estimate for all φ * except when φ * → 0 or φ * → 1 where the maximum likelihood estimate has smaller KL divergence.

Minimum Message Length Inference with Fixed Censoring
Consider now the fixed censoring scenario (6) introduced in Section 2 where the negative log-likelihood function of the data is given in (10). If we wish to encode the data D using the joint MML code (see Section 4.2), we require the negative log-likelihood, the Fisher information and a prior distribution for the mean survival time θ > 0. The negative log-likelihood is given in (10) while the Fisher information for Type I censored data with fixed censoring is: where F T (·|θ) is the cumulative distribution function of the survival data T (see Section 2). The reduction in information due to censoring is clearly a function of θ and the cumulative density function of T, with large c resulting in little information loss compared to small c.
As expected, as c gets larger which is the usual Fisher information for the exponential distribution with no censoring. The prior distribution for θ is chosen to be the Fréchet ditribution with scale c and probability density function Substituting (44) and (46) into (15), we obtain the complete MML87 codelength for the joint encoding Due to the form of the Fisher information, the MML87 estimate of θ that minimizes this codelength is unavailable analytically and must be obtained via numerical optimisation. The maximum likelihood estimate (11) may be used as a starting point for the numerical search. Consider now the conditional encoding (see Section 4.1) where the probability of observing an uncensored datum, say φ > 0, is so that the number of uncensored data points k follows the binomial distribution k ∼ binomial(φ, n). This implies that the mean survival time can then written as A naive conditional coding approach proceeds by encoding the censoring indicators following Section 4.1 with codelength (23). To encode I(y|δ), one would use the conditional probabilities of the lifetime data which are and for all i = 1 . . . , n. The conditional likelihood of the k = (∑ i δ i ) data points is then Once the receiver has the censoring data and an estimate of φ, they implicitly know θ from (49). The length of the message required to transmit the data y is which is the negative log-likelihood of the data. However, this codelength is inefficient since the probability of censoring φ(θ) is not independent of the mean survival time θ. This implies that the precision to which φ(θ) is encoded must depend on the lifetime data y, which is not the case in the naive approach where the precision quantum for φ(θ) depends on the censoring data δ only. Consequently, joint MML coding should be used instead of the conditional encoding approach for the fixed censoring setup.

Example
We observe n = 20 items with an exponential life distribution for c = 150 h. Out of the 20 items k = 15 items fail during the observation period and the sum of their lifetimes (in hours) is s = ∑ i y i δ i = 835 [6]. The maximum likelihood estimate of the mean lifetime θ iŝ with a negative log-likelihood of 84.9 at the minimum. The MML87 estimate is obtained by a numerical search and isθ 87 (D) = 110.1 h with a codelength of 124.905 bits. The MML87 codelength at the maximum likelihood estimate is 124.924 bits suggesting that there is little difference between the two estimates in this example.

Properties
To evaluate the performance of the MML87 estimate, we computed the mean squared error risk and the expected Kullback-Leibler (KL) divergence for MML87 and the ML estimates under the data generating model θ * = 1 and sample sizes n ∈ {5, 25}. Since the ML estimate is undefined for k = 0, all the results discussed below are conditional on k > 0. The KL divergence from the 'true' model θ 1 to the approximating model θ 2 is The results are shown in Figure 2 where the x-axis of each plot is the censoring point c set to the p-th percentile of the data generating model Exp(θ * = 1); for example c = 0.69 corresponds to the p = 0.50-th percentile of Exp(θ * = 1). It is clear that the MML87 estimate is a reasonable alternative to the maximum likelihood estimate under fixed censoring. For p ≥ 0.20 (i.e., the 20-th percentile of Exp(θ * = 1)) the MML87 estimate dominates the ML estimate in terms of KL risk, while for p < 0.20 the estimates are very similar for both sample sizes tested. In terms of the expected mean squared error, the MML87 estimate dominates the ML estimate for all 0.20 ≤ p ≤ 0.80 when n = 5; for n = 25, the estimates are indistinguishable for p < 0.1 and p > 0.69 and the MML87 estimate again dominates ML for all 0.1 < p < 0.5.

Discussion
This manuscript has demonstrated how minimum message length can be used to infer data with censoring information. Specifically, we have derived MML87 codelengths for the exponential distribution with fixed censoring and random type I censoring. Although information theoretic universal models for the exponential distribution, including those corresponding to MML codes, are known [4], this is the first time MML has been applied to censored data.
The MML87 codelength for the exponential distribution with censoring provides a new means of parameter estimation as well as model selection. In terms of parameter estimation, the MML87 estimate of the mean lifetime θ under type I censoring described in this paper has some advantages over the usual maximum likelihood estimate for small sample sizes. First, the MML87 estimate is defined for all proportions of censoring unlike the maximum likelihood estimate which does not exist when all observations are censored; i.e., k = (∑ i δ i ) = 0. In addition, the MML87 estimate has on average lower mean squared error risk and lower KL divergence from the data generating model for a wide range of censoring proportions.
In the case of random censoring, the MML87 estimate is available in closed-form while for fixed censoring, it can only be obtained by numerical optimisation. Although the experiments in the manuscript utilised heavy-tailed prior distributions for the scale parameter as recommended in, for example, [26], the Bayesian nature of MML allows for information prior information to be incorporated directly into the estimation process. The effect of the prior distribution in the examples considered here is expected to be negligible for medium to large sample sizes.
Importantly, the proposed MML87 codelengths can also be used to discriminate between competing models (e.g., exponential vs lognormal) and offer some advantages over the well-known BIC model selection approach. BIC only considers the sample size and the number of parameters when measuring model complexity. In contrast, MML takes into account not just the number of parameters, but also the complexity of the distribution (i.e., the number of random data strings that are fitted well by the distribution). As the sample size n → ∞, the MML87 codelength converges to the BIC and therefore inherits the favourable asymptotic properties of BIC, such as model selection consistency.
The codelengths derived in this manuscript are extendable to MML inference of other censored data types, such as the Weibull and the lognormal distribution, and can be incorporated into more complex models as shown in the next section.

Clustering Survival Data
To demonstrate the applicability of the codes derived in the manuscript, we implemented the MML87 codes into a Matlab software package for inference of finite mixture models. Our software, called Matlab Snob, features mixture models with categorical data (e.g., multinomial distribution), count data (e.g., geometric, Poisson and negative binomial distributions), continuous data (e.g., normal, Laplace, gamma and Weibull distributions) and survival data (type I fixed and random censored exponential distribution). As a demonstration of Matlab Snob, we used two publicly available survival data sets: (1) Rossi et al.'s criminal recidivism data [27], and (2) survival from malignant melanoma [28].
The crime data was recently analyzed in [29] using variational Bayes estimated finite mixture models. For clustering we used all n = 432 observations and the following seven attributes: (1) financial aid (no, yes), (2) full-time work experience before incarceration (no, yes), (3) marital status at time of release (married, not married), (4) released on parole (no, yes), (5) number of convictions prior to current incarceration, (6) age in years at time of release and (7) week of first arrest after release (73.6% censored). This is an example of fixed censoring as all censored observations were censored at 52 weeks. We modelled the categorical attributes using a multinomial distribution, number of convictions was modelled with a negative binomial distribution, while a Gaussian distribution was used for age at time of release. For the week of arrest, we used the exponential distribution model with fixed type I censored data (see Section 5).
The melanoma data set consists of n = 205 patients from Denmark who were diagnosed with malignant melanoma. Five attributes were used for clustering: (1) sex (male, female), (2) ulcer (present, absent), (3) age at diagnosis in years, (4) tumour thickness in mm, and (5) censored survival time in years (65.3% censored). Sex and ulcer were modelled via multinomial distributions, while age and tumour thickness were modelled with univariate Gaussian distributions. For the survival time, we used an exponential distribution with random type I censoring (see Section 4) and combined death due to melanoma and death due to other causes as the primary outcome of interest.
Clustering results for the Crime and the Melanoma data sets with Matlab Snob are shown in Table 1. First, since Matlab Snob learns finite mixture models using the MML87 codelength approximation, the same framework is used to estimate all model parameters as well as select the number of classes. For the Crime data, the model with three classes had the smallest codelength, while two classes were selected for the Melanoma data set. We observe that all the classes are relatively well differentiated in terms of average survival time. In the case of the Crime data set, class 1 had the shortest average time to arrest (θ = 119 weeks) and consists of younger individuals (mean age 20.7 years, std. dev. 2.1 years) who are primarily unmarried and have no full-time work experience before incarceration. In contrast, class 3 comprised older individuals (mean age 36.3 years, std. dev. 4.9 years) 82% of which had full-time work experience, was estimated to have longest average time to arrest (θ = 345.9 weeks). For the melanoma data set, class 1 was estimated to have the shortest average survival time (β = 7.3 years) and consisted of individuals diagnosed at an older age (mean: 57.3 years, std. dev. 17.1 years) with larger tumours (mean: 5.4 mm, std. dev. 3.4 mm). In contrast, patients assigned to class 2 were diagnosed at a younger age, had smaller tumours and were estimated to have longer survival time on average (β = 37.0 years). The Matlab Snob clustering software is freely available for download from the Mathworks Filexchange website (ID: 72310) and will be extended to incorporate other survival distributions in the future (eg, Weibull and lognormal distribution). We note that the MML87 codelengths for type I censored exponentially distributed data derived in this paper can also be used in decision tree modelling [21,30,31]. For example, one could represent the leaves of the decision tree with a censored exponential distribution and use MML to infer an optimal decision tree for a data set.

Minimum Message Length and Minimum Description Length
Minimum message length is closely related to minimum description length (MDL), an inductive inference principle independently developed by Rissanen and colleagues [32][33][34][35]. Like MML, the MDL principle is rooted in information theory and, given a data set, seeks a model that would result in the shortest encoding of the data. A recent and popular version of the MDL principle is the normalized maximum likelihood (NML) code which says that the codelength for data y with respect to model class M parameterised by models θ ∈ R p ∈ M is − log p NML (y|M) = − log p(y|θ(y), M) + log ∑ whereθ is the maximum likelihood estimate of the p parameters and the sum in the second term is taken over the entire data space; we replace the sum with an integral in the case of continuous data. The first term in the NML codelength is the negative log-likelihood of the data evaluated at the maximum likelihood estimate, while the second term represents the parametric complexity of the model class and measures how well models θ ∈ M within the model class M approximate random data sequences. In particular, a high parametric complexity says that a large number of data sequences can be well-approximated by models within the class. In contrast, the parametric complexity of a simple model that can only well-approximate a few data sequences will tend to be small. Rissanen [33] derives an asymptotic approximation for the NML codelength which is accurate for medium to large sample sizes: where J 1 (·) is the per-sample Fisher information matrix. Mera et al. [36] derive a somewhat sharper approximation to the NML codelength using Riemannian geometry tools and apply their new approximation to principal component analysis. Rissanen further shows that, like the MML87 codelength, the NML codelength reduces to the well-known Bayesian information criterion (BIC) in the limit as the sample size n → ∞. Unfortunately, in the case of the exponential distribution with or without type I censoring, the parametric complexity is infinite for both the exact NML codelength and the asymptotic approximation. To circumvent the problem of infinite parametric complexity, one may consider the restricted approximate normalized maximum likelihood (ANML), the two-part ANML or the objective Bayesian code, among others [37].
Although there exist many similarities in the approaches to inference between MML and MDL, there are some important differences which we summmarize below: • MDL relies on the maximum likelihood estimator and does not offer new means for parameter estimation; • MDL is decidedly non-Bayesian avoiding the use of any (subjective or objective) prior information; • MDL nominates the model class M that would result in the shortest encoding of the data and does not does not infer a fully specified model; • while MML minimises the expected (average) codelength of the data with respect to the marginal data distribution, MDL minimizes the worst-case codelength relative to the ideal code.
In addition to the NML code, other MDL codes exist including the sequential NML code [38] and the conditional NML distribution [34], among others. Clearly, both MML and MDL approaches to inductive inference have merit, and if used correctly, will result in excellent model selection performance as shown in a wide range of applications. A more detailed discussion of MML and MDL similarities and differences can be found in [39] and [1] (pp. 413-415).