Fitmix: An R Package for Mixture Modeling of the Budding Yeast S. cerevisiae Replicative Lifespan (RLS) Distributions

Replicative lifespan (RLS) of the budding yeast is the number of mother cell divisions until senescence and is instrumental to understanding mechanisms of cellular aging. Recent research has shown that replicative aging is heterogeneous, which argues for mixture modeling. The mixture model is a statistical method to infer subpopulations of the heterogeneous population. Mixture modeling is a relatively underdeveloped area in the study of cellular aging. There is no open access software currently available that assists extensive comparison among mixture modeling methods. To address these needs, we developed an R package called fitmix that facilitates the computation of well-known distributions utilized for RLS data and other lifetime datasets. This package can generate a group of functions for the estimation of probability distributions and simulation of random observations from well-known finite mixture models including Gompertz, Log-logistic, Log-normal, and Weibull models. To estimate and compute the maximum likelihood estimates of the model parameters, the Expectation–Maximization (EM) algorithm is employed.


Introduction
An increase in mortality rate is typically interpreted as aging. In cellular aging, the budding yeast S. cerevisiae has revealed many significant properties in the mechanism of eukaryotic lifespan regulation [1][2][3]. Studies of yeast aging have led to many conserved components that affect lifespan [2,4,5]. Yeast lifespan is typically measured by two different approaches. The first is chronological lifespan (CLS), defined as the duration of time that a mother cell remains alive without division. The second is the replicative lifespan (RLS) of a mother cell, calculated as the number of mother cell divisions until senescence [6]. Historically, analysis of the RLS data has been conducted with nonparametric methods or using typical parametric survival models [7]. Survival analysis of the lifespan datasets has been generally delineated with the standard lifespan statistical distributions such as Gompertz, Weibull, Logistic, and Log-logistic [8]. In addition to these conventional statistical distribution models, several models for RLS data of budding yeast have been recently published [9].
Recent studies show that yeast replicative aging is heterogeneous and contains at least two subpopulations [10,11], which argues for mixture models. Finite mixtures are studied and applied to several applications, especially in biological (failure data) and medical (disease distributions) areas since the end of the 19th century. Both grouped and ungrouped datasets are aimed at the calculation of the parameters of finite mixtures [12].
Heterogeneous distributions of mixture population models such as the budding yeast S. cerevisiae can be utilized to model a population with subpopulations. Insights might 2 of 14 be gained by comparing the mixture models of the wildtypes, mutants, and various treatments such as calorie restriction. The mixture survival analysis of the RLS of a cell is a compilation of statistical approaches for lifespan data analysis of mixture models expressing heterogeneous states of yeast cell populations with diverse genotypes. In a complete population, mixture components are the densities (probabilities) of the subpopulations, and weights are represented by the fraction of each subpopulation in the complete population.
Frequency distribution is useful in the instance of single-mode survival lifespan data with the conventional probability distribution model [9,13,14]. The mixture of probability density functions is even more advantageous since it is used to portray a heterogeneous lifetime dataset when there is an indication of simple bimodality or multimodality [15,16].
The RLSs of the budding yeast distributions are not sufficiently portrayed by a single probability distribution since mother cell decrease is related to asymmetric phenotypes such as dysregulation of vacuole acidity, genomic instability, and partial reservation of protein aggregates [3,6,17]. In such settings, often finite mixture distributions are used to describe complex and multi-modal asymmetric division distributions. Marin et al. (2005) propose a mixture of Weibull models utilizing Bayesian analysis to model the heterogeneous lifespans [18], generalizing an earlier finding that Tsionas (2002) studied Weibull distributions of a finite mixture with a fixed number of components [19]. In Al-Hussaini et al. (1999), a finite mixture model of a Gompertz component and a surviving portion in the framework of Types I and II censored samples from heterogeneous population is considered [20].
The basic goal of the functions of fitmix, which is an open-source R package, is to offer a confined set of models to fit a diverse lifespan, even with large datasets, specifically the replicative lifespan of budding yeast or other biological units. It is a publicly accessible software published on 2021-04-19 and obtainable at https://cran.r-project.org/web/ packages/fitmix/index.html (accessed on 18 March 2021) on the Comprehensive R Archive Network (CRAN) protected with a GPL3 license. Specifically, functions are intended to facilitate comparison between parameter estimation techniques and models, utilizing a set of models of mixture fit standards, without requiring an in-depth understanding of the statistical computations and coding skills performed in the estimation procedure.
The fitmix package would have prompt use in biostatistics and bioinformatics, in addition to other biological and medical areas in which mixture modeling lifespan, survival, and lifetime distributions are a significant task (e.g., cancer survival data, disease datasets, and lifespan datasets of single units). This study aims to characterize the practicality of the fitmix package and demonstrate its traits using the analysis of yeast replicative lifespan data [9,14,21].

Survival Time Functions
Let T ≥ 0 be a continuous random variable, i.e., the survival time, and let F(t) be a cumulative distribution function on the [0, ∞) interval. The continuous variable T distribution can be defined by three functions. First is F(t), which is a cumulative of T, where t is 0 ≤ t < ∞. P is called the probability that the random variable T ≤ t. F(t) represents the definition of the cumulative distribution function that an individual fails before t. The other function is S(t), which is the survival (reliability) function and can be described as the probability that a biological trait endures longer than t: where S(t) is a non-increasing function of time t used to describe and demonstrate the survival data (RLS). The h(t) function is called the hazard function and expresses the death rate for an item of a survival time t. The h(t) function analyzes the probability of survival at a given point in time, based on whether an item survives to an earlier time t. In other words, it is the probability that if something survives one moment, it will also survive the next. This turns into an instantaneous hazard rate as ∆t leads to zero: The cumulative hazard function H(t), on the other hand, is defined as The CDF F(t) can be used to calculate the probability of failure, which is simply a continuous failure depending on a failure distribution

Gompertz Distribution
Failure (mortality) rate in biological aging usually follows the Gompertz distribution, i.e., an exponential law. The Gompertz distribution with a provided PDF and CDF is determined as where R and G are the rate and shape parameters, respectively. Further, the survival (viability) and hazard functions of the Gompertz distribution are given by and H(t) = R exp(Gt), respectively.

Log-Logistic Distribution
The Log-logistic distribution is also a power law. It can be utilized to fit the lifetime of an object, a service, or an organism. The Log-logistic model with a provided PDF and CDF is determined as respectively, where λ and κ > 0 are the scale and shape parameters of each. Further, the survival (viability) and hazard functions of Log-logistic distribution are provided by S(t) = 1 1 + λt κ and H(t) = λκt κ−1 1 + λt κ , respectively

Log-Normal Distribution
In a Log-normal distribution case, the natural logarithm (ln(t)) of the lifespan t is supposed to be normally distributed. The Log-normal model with a given PDF defined as where mean is Et = e µ + σ 2 2 and variance is V(t) = e 2µ+σ 2 e σ 2 − 1 . The Log-normal distribution generally is used with non-censored data. However, as this distribution is employed for censored data, calculations may become difficult provided that the hazard function gives zero at t = 0, increases to a maximum, and then decreases, getting close to zero as t leads to infinity.

General Case for the Mixture Model
The CDF, PDF, and HF of a typical κ-parameter mixture model consisting of κ subpopulations (subgroups or subclasses) have been presented by Blischke et al. (2011) [22]. A κ-parameter finite mixture model of the CDF can be characterized as where F j (t) is the CDF and p j is the probability of the mixture of jth subpopulation, where p j > 0 and K ∑ j=1 p j = 1. The PDF is given by where f j (t) is the probability density function related to F j (t). The HF is where h j (t) represents subpopulation j, and where ∑ K j=1 w j (t) = 1 with R j (t) = 1 − F j (t), j = 1, 2, . . . , K For the subpopulations with the weights, it can be followed from Equation (3) that the failure rate for the given distribution is a weighted mean of the failure rate varying with t, t ≥ 0.

An Example: Deriving Gompertz Mixture Model
The CDF of the two-parameter mixture model for K = 2 in Equation (1) for the random variable is given by If F 1 (t) follows Gompertz (R 1 , G 1 ) and F 2 (t) follows Gompertz (R 2 , G 2 ) distributions, the CDF for the two-parameter Gompertz mixture model from Equation (8) becomes The hazard function is Analogously, by utilizing different CDFs from different time distributions, other κ-component mixture models can be derived.

Maximum Likelihood Estimations of the Parameters with EM Algorithm in Mixture of Distributions
The EM algorithm has been applied as an approach to estimate and compute the parameters by the method of maximum likelihood in the finite mixture models [23]. In the EM framework, the realizations t 1 , . . . t n are considered as partial data and implicit class variables z 1 , · · · z g to be non-present where z ki = z k (t i ) = 1 if observation t i is included at the kth class, and 0 otherwise for k = 1, · · · , g and i = 1, · · · , n.
The EM algorithm is used in the mixture distributions by processing z as nonpresent data.

Estimating the Parameters of Replicative Lifespan Datasets Fitted by Mixture Models
In addition to modeling lifespan distributions, describing the link between replicative lifespan of genetic backgrounds is a significant mission for acquiring precise lifespan extension and failure measurements. For instance, the Weibull model is useful to infer the emergence characteristics of aging during the initial life stage. Most of the survival functions, such as the logistic function, are studied in late life during aging [24].
Whereas, the Log-normal model behaves differently since the model is not suitable for lifetime modeling where hazards increase with old age. Regardless of its revolting features, the Log-normal distribution has been broadly employed as a failure distribution in several cases, such as the analysis of electrical isolation or time to develop lung cancer among smokers. Moreover, the Log-normal distribution has often been used as a mixture model [25].

Goodness-of-Fit Measurement of Mixture Modeling
Deciding on the number of components is one of the most significant issues in fitting heterogeneous distributions with mixture models [25]. In mixture models, the number of components can be defined with a variety of criteria. The fitmix package utilizes four approaches to assess the performance of fitted models. Given a collection of mixture models for the RLS (lifetime or survival data), the measurements of four goodness-of-fits can be listed as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Kolmogorov-Smirnov (KS), and log-likelihood (log.likelihood) statistics.

Akaike Information Criterion (AIC)
AIC is commonly used to assess the performance of distinct mixture models. Therefore, AIC yields a mean for a (mixture) model selection. The user can implement AIC for defining the number of components.
The AIC measure is defined by the maximum likelihood estimation of the model as a measure of fit, AIC = −2 ln (L) + 2k (17) where L is the maximized value of likelihood function and k is the number of parameters of finite mixture model [16,26].

Bayesian Information Criterion (BIC)
Bayesian information criterion (BIC) is one of the goodness-of-fit measurements for the selection of a (mixture) model among finite (mixture) models. It is based on the maximum likelihood estimation procedure and closely related to AIC. It is likely to increase the value of likelihood by including more parameters, which might result in overfitting. The BIC settles the overfitting issue by inserting a penalty term for the number of parameters in the chosen model. The value of the penalty term is smaller in AIC than in BIC.
where k is the number of parameters, n is the number of data points in RLS (lifetime or survival time), and L is a value which is the maximum value of likelihood function in the finite mixture model [27].

Kolmogorov-Smirnov (KS)
As in the other goodness-of-fit measurements, the Kolmogorov-Smirnov measurement allows for comparison between different models and is used to determine if RLS data (lifetime or survival data) arises from a population with a specific distribution [28]. Although KS is non-parametric, it can be utilized to check the analysis of variance under the assumption of normality.
When the size of lifetime data is small enough, KS can be used to compare a variable of the cumulative distribution function with a particular distribution. The value of test statistic D is calculated with the assumption of the null hypothesis of no difference between the theoretical distribution and empirical data as where F o (t) = k n is the empirical distribution function (eCDF) of n, n is the total number of realizations in which k is the number of realizations, with k ≤ T, and F r (t) is the cumulative distribution function (CDF). The D value is subject to the KS test table.

The Log-Likelihood Test
The log-likelihood function is a natural logarithmic transformation of the likelihood function, which is the method of obtaining the unknown parameters of a (mixture) model via the maximum of likelihood function. It is one of the measures of goodness-of-fit of a theoretical distribution to lifetime data points for unknown parameters [29]. In the lifetime or survival time case, generally, the likelihood function is defined as discrete distributions.
Given a model of probability density functions t → f(tα, β) where α, β are the parameters, the likelihood function is α, β → f(t | α, β) and can be written as where t > 0 are the discrete data points of the lifetime. The log-likelihood function of the chosen model for the given PDF is as follows:

Working Examples: Fitting Finite Mixture Models to Lifespan Datasets
The fitmixEM function fits lifespan datasets (in the case of yeast cells, this would be the replicative lifespan of a given genetic background) utilizing a finite mixture model in the form of fitmixEM (lifespan, model, κ, initial = FALSE, starts) The lifespan argument denotes genotype backgrounds, strains, or single-gene-deletion strains of lifespan data of dividing cells in the context of RLS; otherwise, it can represent survival time until death, failure, divorce, or relapse. The initial argument depends on the choice of the user, either FALSE or TRUE. If the argument is assigned to FALSE, a value would not set for the starts argument by default. The user must define a vector of (ω, α, β) starting values, which should be assigned to starts. The three-length vector is of κ = 1, · · · , g number of parameters of a given mixture model with the parts weight ω κ of the κth component, α κ shape, and β κ scale or rate parameters of a specified model argument of the κth component when initial is set to TRUE.
via the maximum of likelihood function. It is one of the measures of goodness-of-fit of a theoretical distribution to lifetime data points for unknown parameters [29]. In the lifetime or survival time case, generally, the likelihood function is defined as discrete distributions.
Given a model of probability density functions t ↦ f( t | | α, β ) where α, β are the parameters, the likelihood function is α, β ↦ f(t | α, β) and can be written as where t > 0 are the discrete data points of the lifetime. The log-likelihood function of the chosen model for the given PDF is as follows: log L(α, β|t ) = LL(α, β|t ) = ∑ log f , (t ) LL(α, β|t ) = LL(f , (t ))

Working Examples: Fitting Finite Mixture Models to Lifespan Datasets
The fitmixEM function fits lifespan datasets (in the case of yeast cells, this would be the replicative lifespan of a given genetic background) utilizing a finite mixture model in the form of fitmixEM (lifespan, model, κ, initial = FALSE, starts) The lifespan argument denotes genotype backgrounds, strains, or single-gene-deletion strains of lifespan data of dividing cells in the context of RLS; otherwise, it can represent survival time until death, failure, divorce, or relapse. The initial argument depends on the choice of the user, either FALSE or TRUE. If the argument is assigned to FALSE, a value would not set for the starts argument by default. The user must define a vector of (ω, α, β) starting values, which should be assigned to starts. The three-length vector is of κ = 1, ⋯ , g number of parameters of a given mixture model with the parts weight ω of the κth component, α shape, and β scale or rate parameters of a specified model argument of the th component when initial is set to TRUE.

Distribution Functions for Finite Mixture Models
Similar to classic survival functions in R studio for studying standard probability distributions, we present functions for the computation of the density function dmix, distribution function pmix, and randomly generated rmix vector of values of the finite mixture distributions utilized in fitmix package. The mixture functions presented in fitmix are in the following forms:

Distribution Functions for Finite Mixture Models
Similar to classic survival functions in R studio for studying standard probability distributions, we present functions for the computation of the density function dmix, distribution function pmix, and randomly generated rmix vector of values of the finite mixture distributions utilized in fitmix package. The mixture functions presented in fitmix are in the following forms: where lifespan is a vector of samples (realizations). The model argument is a character string available as PDFs of Gompertz ("gompertz"), Log-logistic ("Log-logistic "), Lognormal ("Log-normal"), and Weibull ("weibull") models indicating the distributions which are employed to fit lifespan data. M is the number of entries to be simulated for the mixture random generation from the model of the finite mixture, p is a positive real number that meets 0 < p < 1, κ is the number of parameters of a given model, and par is the parameter vector for finite mixture model that carries the same shape as the argument starts of the fitmixEM function. κ is a sole integer value representing the number of unknown parameters of a given PDF to contain in the model of the finite mixture. Models of the finite mixture parameters are gauged utilizing the Expectation-Maximization (EM) algorithm. As a demonstration, the script below outputs 500 observations from a two-parameter mixture model of Gompertz and demonstrates the results in Figure 2.

Finite Gompertz Mixture Model: Yeast Mutants with Known Effects on RLS Application
To better demonstrate the usefulness of the fitmix package, we performed Gompertz mixture model on experimental RLS measurements of single gene deletion mutants with known effects on aging [17,30]. We estimated parameters of the Gompertz mixture model  where lifespan is a vector of samples (realizations). The model argument is a character string available as PDFs of Gompertz ("gompertz"), Log-logistic ("Log-logistic "), Lognormal ("Log-normal"), and Weibull ("weibull") models indicating the distributions which are employed to fit lifespan data. M is the number of entries to be simulated for the mixture random generation from the model of the finite mixture, is a positive real number that meets 0 < p < 1, κ is the number of parameters of a given model, and par is the parameter vector for finite mixture model that carries the same shape as the argument starts of the fitmixEM function. κ is a sole integer value representing the number of unknown parameters of a given PDF to contain in the model of the finite mixture. Models of the finite mixture parameters are gauged utilizing the Expectation-Maximization (EM) algorithm. As a demonstration, the script below outputs 500 observations from a twoparameter mixture model of Gompertz and demonstrates the results in Figure 2.

Finite Gompertz Mixture Model: Yeast Mutants with Known Effects on RLS Application
To better demonstrate the usefulness of the fitmix package, we performed Gompertz mixture model on experimental RLS measurements of single gene deletion mutants with known effects on aging [17,30]. We estimated parameters of the Gompertz mixture model from RLSs of different yeast mutants shown in Figure 3. Since the Gompertz model is a good fit for lifespan data [9,31], such as for deletion mutants tor1, sch9, sir2, hap4, sod2, and pmr1, the fitmix package is even better because it uses the Gompertz mixture model to portray lifespan data.   Table 1 for estimated parameters of the finite Gompertz mixture fits.

Discussion
A common task in yeast aging studies is modeling the lifespan distribution. In this study, we have developed and introduced an R package called fitmix that provides several models and estimation techniques for modeling replicative lifespan distributions In Li et al. 2020, two subpopulations were reported during replicative aging. In BY4741 wildtype genetic background, cells have two approximately equally weighted subpopulations. In two single-gene deletion mutants, sir2-deletion and hap4-deletion, the subpopulations are skewed to one subpopulation over another [32]. Our mixture fitting results confirm these significant findings. Based on its histogram, even hap4-deletion may have three subpopulations of cells (Table 1 and Figure 4).    Table 1 for estimated parameters of the finite Gompertz mixture fits.

Discussion
A common task in yeast aging studies is modeling the lifespan distribution. In this study, we have developed and introduced an R package called fitmix that provides several models and estimation techniques for modeling replicative lifespan distributions  Table 1 for estimated parameters of the finite Gompertz mixture fits.
Based on the estimated parameters in Table 1 and using Equations (13) and (15), we obtain the probability density functions and survival functions of the Gompertz mixture distributions as illustrated in Figures 3 and 4. In our analysis, tor1, sir2, and sod2 mutants have RLS that can be fit into two subpopulations. Table 1 demonstrates scg9, hap4, and pmr1 mutants can even be fitted into three subpopulations.
Estimated model parameters of single-mode Gompertz aging model for shape parameter α ranged between 0.05 and 0.25, rate parameter β ranged between 0.0001 and 0.04, which supports that our mixture parameters are correctly estimated by the previous studies [9,14]. We can deduce that the Gompertz mixture model method provides the best fit for the RLS of budding yeast. This agrees with our assumption that a mixture model may be biologically meaningful.

Discussion
A common task in yeast aging studies is modeling the lifespan distribution. In this study, we have developed and introduced an R package called fitmix that provides several models and estimation techniques for modeling replicative lifespan distributions from both individual lifespan data and grouped lifespan data, as well as providing additional functionality for simulating finite mixture distributions and fitting curves to divisionprobability of the data. In our analysis, we show that RLS in many mutants, such as tor1, sch9, sir2, hap4, sod2, pmr1, can be fit into two subpopulations, some of them even three subpopulations. Our study suggests replicative aging in these probably have at least two subpopulations in the other mutants as well (Supplementary Materials), and we argue that these are promising biological insights or hypotheses that may be pursued by experimental biologists. Since experimental studies are expensive and time-consuming, our mixturemodeling approach provides a reasonable alternative research method to assess the aging process in many more yeast mutants.
To help facilitate the comparison of the different model estimation methods provided by the fitmix, we include multiple goodness-of-fit measures that the user can determine the best model technique for their purposes. The fitmix package employs the EM algorithm that might have some other alternatives. However, from the previous section, we can see that simply evaluating Equation (22) to find such parameters would be very hard. An analytical approach to the estimation of parameters requires an explicit formula for the maximum likelihood investigation [33]. Fortunately, there is an iterative method we can use to achieve this purpose. It is called the Expectation-Maximization, or simply EM algorithm [34,35]. It is widely used for optimization problems where the objective function has complexities such as the one that has been encountered in the fitmix package. To avoid converging to a local maximum of the observed data likelihood function, we compare our estimated mixture parameters depending on starting values previously reported [9,14,31].
While this package was developed in the context of yeast research, the models we fit and simulate have numerous applications throughout other biological and medical fields, such as cancer survival data, disease datasets, and lifespan datasets of single units. For example, a similar model of a mixture of a Weibull component and a surviving fraction in the context of a lung cancer trial is considered [36][37][38] and finite mixture distributions are broadly applied in medical and survival statistics (for numerous examples, see [18,39]).
The fitmix package is open-source software released under a GPL3 license on CRAN.
As future work, we think of some software functionality ideas that include: (1) an expanded set of the finite mixture models, particularly the binomial aging model outlined in Qin et al. (2019) and Jackson et al. (2016) [9,40]; (2) a function that automates plots and illustrates survival datasets fitted by mixture models; (3) expansion of user-detected goodness-of-fit measures, including a comparison to the best-fit model; and (4) addition of a Gaussian noise effect [14] that accommodates grouped distribution and replicative lifespan data of the budding yeast, e.g., stands, wildtype strains, single-gene mutant deleted genes.
Another study that could be conducted is examining distinct mixture models to model heterogeneous lifetime data, particularly RLS data. A mixture of Gompertz distributions, a mixture of Log-logistic distributions, a mixture of Log-normal distributions, and a mixture of Weibull distributions would be tested for the best fit to the experimental and simulated survival datasets. The comparisons of theoretical statistical distributions and mixture distribution models for lifetime and survival datasets would be studied in the future. Funding: HQ thanks the support of NSF Career award #1453078 and #1720215, NSF BD Spoke #1761839.
Institutional Review Board Statement: Not applicable.

Data Availability Statement:
The data that support the findings of this study are available through personal communication with Matt Kaeberlein via email: info@kaeberleinlab.org.

Conflicts of Interest:
The authors declare that they have no conflict of interest.