Algorithms
  • Article
  • Open Access

7 March 2024

A Markov Chain Genetic Algorithm Approach for Non-Parametric Posterior Distribution Sampling of Regression Parameters

Information Systems School of Business Administration, Pennsylvania State University at Harrisburg, 777 West Harrisburg Pike, Middletown, PA 17057, USA
This article belongs to the Special Issue Nature-Inspired Algorithms in Machine Learning (2nd Edition)

Abstract

This paper proposes a genetic algorithm-based Markov Chain approach that can be used for non-parametric estimation of regression coefficients and their statistical confidence bounds. The proposed approach can generate samples from an unknown probability density function if a formal functional form of its likelihood is known. The approach is tested on the non-parametric estimation of regression coefficients, where minimizing the least-square criterion is treated as maximizing the likelihood of a multivariate distribution. This approach has an advantage over traditional Markov Chain Monte Carlo methods because it is proven to converge and to generate unbiased samples in a computationally efficient manner.

1. Introduction

Let α(x) be some unknown continuous multivariate probability density function. Let xi = [xi1,…, xin] be the ith iid sample generated from the distribution. Assume that m such samples are available; then, a maximum likelihood estimator α̂ will maximize the following expression:

$$\hat{\alpha}(\mathbf{x}) = \prod_{i=1}^{m} \hat{\alpha}(x_i) \qquad (1)$$
If X and Y are random variables generated from the multivariate distribution (X, Y), where the dimensionality of some response variable Y is one and an iid sample from the distribution is represented as (x1, y1),…, (xm, ym), then the least-square non-parametric estimator will minimize the following expression [1]:

$$\sum_{i=1}^{m} \left( y_i - \hat{\alpha}(x_i) \right)^2 \qquad (2)$$
In regression problems, the dependence of y on x is considered to be given by some regression function α̂(x) belonging to some class of functions [2]. In this paper, assuming that some functional model form α̂(x) = f(x; θ1,…, θn) is known or a priori established, the parameters θ1,…, θn are computed by minimizing the least-square expression in Equation (2). When the true density α(x) lies in the parametric class of functions parameterized by the vector θ = [θ1,…, θn], finding the parameters by minimizing Equation (2), i.e., maximizing the likelihood, yields estimators that satisfy properties such as convergence, consistency, and lack of bias [3]. However, when the actual density does not lie in the class of parametric functions, there is an acute need for non-parametric estimation [4].
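As a concrete illustration of this parametric least-square step, the following minimal Python sketch fits a hypothetical linear model f(x; θ) = θ1 + θ2·x1 + θ3·x2 by directly minimizing the expression in Equation (2); the data, model form, and optimizer are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data and model form (assumptions for this sketch only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(25, 2))                      # two synthetic predictors
y = 2.0 + 1.5 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 1, 25)

def sse(theta, X, y):
    """Equation (2): sum of squared errors for the assumed linear model."""
    pred = theta[0] + X @ theta[1:]
    return np.sum((y - pred) ** 2)

result = minimize(sse, x0=np.zeros(3), args=(X, y))
print(result.x)                                           # estimated [theta1, theta2, theta3]
```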
Methods in Bayesian statistics condition the problem of learning parameters (θ) on the dataset D that is used to learn the parameters. These methods allow for the specification of priors on the parameters as p(θ) and require a formal specification of the data likelihood function p(D|θ), which is conditioned on the parameters θ. The learning of the parameters and of confidence bounds on the parameters occurs using Markov Chain Monte Carlo (MCMC) methods to estimate the posterior distribution p(θ|D) via the following Bayesian rule:
$$p(\theta \mid D) \propto p(D \mid \theta) \times p(\theta) \qquad (3)$$
To reduce any bias in confidence bounds on the parameters, samples from the initial burn-in period are excluded from the computation of confidence bounds.
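For readers less familiar with this baseline, the following sketch shows a generic random-walk Metropolis sampler with burn-in exclusion. It is a textbook illustration of the MCMC workflow described above, not part of the proposed method, and the log-posterior function, step size, and burn-in length are all assumptions.

```python
import numpy as np

def metropolis(log_post, theta0, n_samples=10_000, burn_in=2_000, step=0.1, seed=0):
    """Generic random-walk Metropolis sampler for an unnormalized log posterior."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.normal(0, step, size=theta.shape)  # proposal density
        lp_new = log_post(proposal)
        if np.log(rng.random()) < lp_new - lp:                    # accept/reject step
            theta, lp = proposal, lp_new
        samples.append(theta.copy())
    return np.array(samples[burn_in:])                            # burn-in samples excluded
```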
The method described in this paper does not require any knowledge of data likelihoods or user-defined priors. The “population”-based method directly generates posterior distribution samples p(θ|D) so that confidence bounds on the parameters can be established. There is no need to exclude any initial burn-in period samples. More specifically, a genetic algorithm (GA) approach is used to estimate the parameters and their confidence bounds. This method begins with a random sample (i.e., population) estimate p_0(θ|D) and iterates over GA generations k so that a steady-state distribution of the sample, p_k(θ|D), minimizing the average least-square error, is obtained. It is shown that from one iteration to the next, the GA forms a Markov Chain whose transition matrix can be determined using a fitness measure based on minimizing the sum-of-squared-error expression (2). Once the population converges, the final population can be used to establish confidence bounds on the regression parameters. The transition matrix is shown to be irreducible and aperiodic, which guarantees both the uniqueness of the steady-state distribution and convergence to a steady state, regardless of the initial random population p_0(θ|D).
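A minimal sketch of this GA-as-Markov-chain idea is given below: each generation depends only on the current population, so the sequence of populations forms a Markov Chain, and the converged population is read off as a sample from p(θ|D). The fitness form λ/(1 + RMS) follows Equation (14) reported later; the arithmetic crossover, Gaussian mutation, and all numeric settings are illustrative assumptions rather than the paper's exact operators.

```python
import numpy as np

rng = np.random.default_rng(1)

def rms(theta, X, y):
    """Root-mean-squared error of a linear model theta on data (X, y)."""
    pred = theta[0] + X @ theta[1:]
    return np.sqrt(np.mean((y - pred) ** 2))

def fitness(theta, X, y, lam):
    return lam / (1.0 + rms(theta, X, y))          # Equation (14), assumed form

def evolve(pop, X, y, lam, pc=0.3, pm=0.05):
    """One GA generation = one Markov Chain transition on populations."""
    f = np.array([fitness(t, X, y, lam) for t in pop])
    idx = rng.choice(len(pop), size=len(pop), p=f / f.sum())  # fitness-proportional selection
    new = pop[idx].copy()
    for i in range(0, len(new) - 1, 2):            # arithmetic crossover with probability pc
        if rng.random() < pc:
            a = rng.random()
            c1 = a * new[i] + (1 - a) * new[i + 1]
            c2 = a * new[i + 1] + (1 - a) * new[i]
            new[i], new[i + 1] = c1, c2
    mask = rng.random(new.shape) < pm              # mutation: small Gaussian perturbation
    new[mask] += rng.normal(0, 0.1, size=mask.sum())
    return new

# Usage sketch: pop is the random initial population p_0(theta | D); after enough
# generations it approximates the steady-state sample p_k(theta | D).
# for g in range(200):
#     pop = evolve(pop, X, y, lam=3.34)
```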
The rest of the paper is organized as follows: In Section 2, preliminaries of the GA Markov Chain framework are described. In Section 3, the approach is applied to a dataset, and the results of the experiments are reported. Section 4 concludes the paper with a summary and directions for future work.

3. Experiments and Results

The regression data selected for the experiments are related to the quality of the delivery system network of a soft drink company [9]. The dependent variable in the data is the time needed by an employee to restack an automatic soft drink vending machine. This time is called the total service time and is measured in minutes. There are two independent variables: the first variable is the number of stocked items, and the second is the distance an employee walks, measured in feet. The dataset contains 25 observations. Table 1 illustrates this dataset.
Table 1. Soft drink restacking times dataset.
When the ordinary least-square (OLS) regression is run on the dataset, it results in parameter values of β1 = 1.61591, β2 = 0.0143848, and β3 = 2.34123. The overall regression model is significant at a 95% statistical confidence level, and the adjusted R-squared value is 95.4%. The root-mean-squared (RMS) error for the OLS model is 3.05766. From Equation (12), the λ value is 3.34123. The no-information IW for the non-parametric posterior distribution regression parameters β1, β2, and β3 is [−3.2577, 3.2577] for a value of γ = 5%.
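The following sketch reproduces this benchmarking step with NumPy, assuming the 25 observations are available as predictor columns and a response vector; λ is taken as a given constant from Equation (12) (not restated here), and the function merely computes the OLS coefficients, the RMS error, and the no-information IW [−(1 − γ/2)λ, (1 − γ/2)λ].

```python
import numpy as np

def ols_benchmark(X, y, lam, gamma=0.05):
    """OLS fit, RMS error, and no-information interval width for the GA benchmark.

    X: (25, 2) array of predictors (items stocked, distance walked); y: service times.
    lam: the lambda constant from Equation (12) (3.34123 for this dataset).
    """
    A = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    residuals = y - A @ beta
    rms_error = np.sqrt(np.mean(residuals ** 2))
    half_width = (1.0 - gamma / 2.0) * lam         # 0.975 * lambda = 3.2577 here
    return beta, rms_error, (-half_width, half_width)
```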
Some implementation aspects of the procedure were not described in Section 2. The first is the fitness function of the GA procedure. For a given population member, the predicted outputs are computed from its genes using Equation (5). The RMS error on the dataset is computed next. Let us say that this RMS error is ξs for some population member model s ∈ {1,…, Ω}. The fitness value for this population member s in generation g, f_s^g, is computed using the following expression:
$$f_s^g = \frac{\lambda}{1 + \xi_s} \qquad (14)$$
Equation (14) is maximized by minimizing the RMS error. Note that λ is a predetermined constant. The number one in the denominator is added to avoid division by zero if ξs takes a value of zero in the event of a perfect regression model fit. Plugging the OLS results into Equation (14) gives a value of 0.823. This value provides a benchmark for the approximate best value of the best-fitness member in the GA procedure. If the best member fitness value exceeds 0.823, then the GA procedure has found a regression model with a lower RMS than the OLS regression model. Usually, it would be rare to find better results from a heuristic GA regression model. Thus, the threshold value of 0.823 should be considered an upper bound on the fitness value of the best population member found in the GA experiments. Second, the no-information IW for the regression parameters was computed as [−0.975 × λ, 0.975 × λ] for the value of γ = 5%. If, for the GA population considered for computing the final posterior distribution sample, the lb = 100 × (γ/2)% and ub = 100 × (1 − γ/2)% quantiles represent the lower and upper bounds for a given regression parameter, then the percentage improvement in the parameter IW over the no-information IW can be computed using the following expression:
$$\text{Percentage IW Improvement (\%imp.)} = 1 - \frac{ub - lb}{2 \times (1 - \gamma) \times \lambda} \qquad (15)$$
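As a quick numeric check of Equations (14) and (15), the snippet below uses the OLS values reported above (λ = 3.34123, RMS = 3.05766) and, purely for illustration, the β1 bounds quoted later from Table 2; the denominator follows Equation (15) as reconstructed here.

```python
lam, gamma = 3.34123, 0.05
rms_ols = 3.05766

print(lam / (1.0 + rms_ols))                   # Equation (14): about 0.823

def piwi(lb, ub, lam, gamma):
    """Percentage IW improvement over the no-information interval width (Equation (15))."""
    return 1.0 - (ub - lb) / (2.0 * (1.0 - gamma) * lam)

print(piwi(0.568, 1.604, lam, gamma))          # e.g. the beta1 bounds reported in Table 2
```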
Finally, in Section 2, it was mentioned that a posterior distribution sample from the final population at convergence is considered for computing the parameter confidence intervals. Since the mutation operation introduces some randomness, it is not always necessary to pick the last generation of the population to compute a parameter confidence interval. From a practical standpoint, a sample is extracted from generation g* for population Q*, where g* is determined using the following expression:
$$g^* = \underset{g \in \{0, \ldots, \vartheta\}}{\operatorname{argmax}} \; \frac{\sum_{s=1}^{\Omega} f_s^g}{\Omega} \qquad (16)$$
Equation (16) implies that a sample is extracted when the average population fitness value is highest. Generally, this extraction generation is close to the final generation ϑ, but it may or may not be the final population at generation ϑ.
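A sketch of this extraction step is shown below: if each generation's population and fitness values are recorded during the run, g* is simply the argmax of the per-generation mean fitness, and the parameter bounds are the γ/2 and 1 − γ/2 quantiles of that population. The function and argument names are illustrative.

```python
import numpy as np

def extract_posterior_sample(populations, fitnesses, gamma=0.05):
    """populations: list of (Omega, n_params) arrays; fitnesses: list of (Omega,) arrays."""
    g_star = int(np.argmax([f.mean() for f in fitnesses]))   # Equation (16)
    sample = populations[g_star]                             # posterior distribution sample
    lb = np.quantile(sample, gamma / 2, axis=0)              # 2.5% quantile per parameter
    ub = np.quantile(sample, 1 - gamma / 2, axis=0)          # 97.5% quantile per parameter
    return g_star, sample, lb, ub
```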
The objective of using a GA for learning regression parameters is similar to that of ridge regression: to avoid overfitting the training dataset. As a heuristic procedure, in most cases, a GA-based regression model may not outperform OLS in terms of the RMS error. Still, it may provide better generalizability for unseen future cases, leading to improved predictability compared to the traditional OLS regression model. Given a small dataset, three regression parameters, and an initial seeded population, minor initial experiments were conducted to select the GA parameters for the experiments. These parameters were Ω = 100, ϑ = 200, pc = 30%, and pm = 5%. The reader should note that the initial population in the procedure used in this research was always random for each run. This randomization was obtained using a random number generator seeded with the computer clock time. There are random number generators that allow users to define the seed for the random number generation procedure. When such a random number generator is used, the same initial population can be generated for each run by keeping the value of the seed constant. In such cases, the selection of GA parameters is essential because the parameters are then the only criteria governing the quality of the final results. However, when the initial population for each run is randomly generated using computer clock times, the impact of the parameters on the quality of solutions is hard to ascertain. With clock-time-seeded random numbers, slightly different results can be obtained for the same set of parameters owing to different starting populations, so extensive experiments on the GA parameters are not necessary in this case. Different runs with different initial populations can be conducted, and the percentage improvement criterion in Equation (15) can be used to separate high-quality results from low-quality results. Additionally, the average fitness of the GA population can be compared with the benchmark mentioned earlier (a value of 0.823) to monitor the gap between the average population fitness and its upper bound value, with a lower gap representing a better solution.

The GA literature advises against using very high mutation rate values because high mutation rates introduce randomness [10]. Some randomness is necessary for searching for better solutions, but too much randomness can be detrimental to solution quality [10]. A general rule of thumb is that the crossover rate should be less than 50%, and the mutation rate should be less than the crossover rate. Additionally, both the crossover rate and the mutation rate should be non-zero. In this paper, multiple runs were conducted with different starting populations, and only the best results are reported.
Figure 2 and Figure 3 illustrate the results of the GA experiments for the two crossover operations. In both figures, the top solid line is the best-fitness population member and the dotted line is the average fitness of the population members. While the top line appears straight, minor improvements in the best population member's fitness do occur. For Figure 2, the best population member fitness value improves from 0.823015 in the 1st generation to 0.823325 in the 200th generation. In Figure 3, these numbers are 0.822516 and 0.823437, respectively. The average fitness values for the posterior distribution samples extracted at the 192nd generation from Figure 2 and the 194th generation from Figure 3 were 0.631558 and 0.714597, respectively. The higher average fitness value for the sample extracted from Figure 3 suggests that the arithmetic crossover results are better and will provide a greater percentage IW improvement (PIWI) from Equation (15). The reader may also visually observe the gap between the average population fitness values and the best-fitness population member values. This gap is lower in Figure 3, indicating that arithmetic crossover results are better than single-point crossover results.
Figure 2. The single-point crossover results.
Figure 3. The arithmetic crossover results.
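For reference, the two crossover operators compared in Figures 2 and 3 can be sketched as follows for real-valued chromosomes [β1, β2, β3]; these are textbook forms of the operators, and the exact details used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def single_point_crossover(p1, p2):
    """Swap the gene tails of two parents after a random cut point."""
    cut = rng.integers(1, len(p1))
    return (np.concatenate([p1[:cut], p2[cut:]]),
            np.concatenate([p2[:cut], p1[cut:]]))

def arithmetic_crossover(p1, p2):
    """Blend two parents with a random weight, keeping children between the parents."""
    a = rng.random()
    return a * p1 + (1 - a) * p2, a * p2 + (1 - a) * p1
```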
Table 2 illustrates the descriptive statistics of results obtained from the two crossover operators. Bayesian regression results taken from a text [9] are also reported for comparison. As expected, the PIWIs are higher for the arithmetic crossover, with all PIWIs being higher than 46%. This results in a tighter 95% confidence interval of the GA regression parameters for the arithmetic crossover operator. In all cases, the arithmetic crossover operator PIWIs are higher than those for the single-point crossover operator. A point of interest for the reader may be to note the starting average fitness values for both crossover operations at generation one. Both operators start at an average fitness value of around 0.4. Since the initial GA population is seeded with values close to the OLS regression parameters, the starting point illustrates that there is enough diversity in the population for the GA operators to still evolve the population to a higher overall fitness.
Table 2. Posterior distribution summaries.
When comparing the arithmetic crossover results with the Bayesian model results, the Bayesian model confidence bounds are tighter, partly due to the likelihood distribution assumptions that Bayesian models make. The only exception is the confidence bound for the regression intercept, which is lower for the GA regression model. The non-parametric distribution is negatively skewed, since the mean values for the arithmetic crossover GA models are always lower than the median values. When tight bounds are desired, it is possible to use the GA regression model first to understand the underlying properties of the non-parametric posterior distribution and then select the appropriate data likelihood and prior distributions in Bayesian regression. This way, Bayesian regression modelers can make an informed decision and improve confidence in the results of their investigations.
Table 3 illustrates the parameter values for the best-fitness population member found across all GA generations. Both models provide somewhat similar results in terms of RMS values, which in turn are identical to the RMS results obtained through OLS regression. Given that there are three candidate values for a regression parameter (mean, median, and best-fitness population member genes), the question is which value should be used as the final set of regression parameters. A decision maker should use the best-member parameter values from Table 3 if those values fall within the 95% confidence bounds provided in Table 2. For arithmetic crossover, the value of β1 = 1.616 does not belong to its 95% confidence bounds of [0.568, 1.604] from Table 2 and should therefore be rejected. The reason for this rejection is that the value of β1 = 1.616 may not be a natural outcome of the GA population evolution but may have been retained due to the initial seeding of the GA population with the OLS regression parameters. Once a best-member value is rejected, the median values should be used as the final set of regression parameters. Ideally, the best-fitness values should be chosen if they fall within their respective 95% confidence bounds; otherwise, the median values represent the next best solution. In the case of single-point crossover, the best-member values fall within the 95% confidence bounds provided in Table 2, and these are therefore used as the final set of regression parameters.
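This selection rule can be written down compactly; the sketch below (illustrative naming) keeps the best-member genes only if every value falls inside its 95% bound and otherwise falls back to the posterior-sample medians, which for the arithmetic-crossover case above would reject β1 = 1.616 and return the medians.

```python
import numpy as np

def choose_final_parameters(best_genes, medians, lower, upper):
    """Best-fitness member genes if all lie within their 95% bounds, else the medians."""
    best_genes = np.asarray(best_genes, dtype=float)
    inside = (np.asarray(lower) <= best_genes) & (best_genes <= np.asarray(upper))
    return best_genes if inside.all() else np.asarray(medians, dtype=float)
```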
Table 3. The best member fitness function parameters.
It may be possible to use an ensemble value, i.e., the average of all three values (median, mean, and best-member genes), as the final value for the regression model. The merits of different approaches for deciding the final set of regression function parameters are considered out of scope for the current study. However, the multiple values offer different selection possibilities, where each possibility may have advantages and disadvantages.

4. Summary and Directions for Future Research

This paper proposes a GA-based Markov Chain approach to directly generate samples from posterior distributions. Using linear regression as an example and a least-square error minimization criterion, a sample from the posterior distribution was extracted, and 95% confidence bounds on the regression parameters were established. This procedure is guaranteed to converge as long as non-zero parameter values for the GA are selected. Compared to traditional MCMC methods, the proposed method has certain advantages: the procedure searches for all parameters simultaneously instead of one parameter at a time, no proposal density functions are required, and no burn-in period sample rejection is necessary. Knowledge of the data likelihood and priors is not required either. It is well known that MCMC methods struggle to incorporate multi-modal distributions [11], whereas the current method does not impose such a restriction. The only challenge in using this method is that some knowledge of the maximum likelihood criterion for the posterior distribution is necessary. This knowledge is directly used for creating the GA fitness function.
There are some areas that could be explored where the proposed approach may be helpful. One of the advantages of the proposed method is that it does not require a continuous or differentiable likelihood function. The method may be adapted to generate truncated posterior distributions by imposing penalties in the GA fitness function. These penalties may be imposed by using IF–THEN rules. As noted earlier, the proposed method can also be used to aid in selecting data likelihood density functions for MCMC methods. While the linear regression problem domain was used in this research due to its widespread application and simplicity, the proposed method can easily be used for non-linear regression and linear and non-linear discriminant analysis. This method is likely more efficient than MCMC methods and will likely converge faster. When both the current method and the MCMC method can be used on a problem domain (as was the case in this research), both can be used to gain confidence in the final results. When results vary, it is possible to use the data-mining literature to devise approaches to combine different values to reduce error variance and gain confidence in the selected set of parameters. Future research is needed to explore the additional merits of the proposed GA procedure.

Funding

This research received no funding.

Data Availability Statement

All data are reported in the paper.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Geman, S.; Hwang, C.-R. Nonparametric Maximum Likelihood Estimation by the Method of Sieves. Ann. Stat. 1982, 10, 401–414.
  2. Gasser, T.; Engel, J.; Seifert, B. Nonparametric Function Estimation. In Handbook of Statistics; Elsevier Science Publishers: Amsterdam, The Netherlands, 1993; Volume 9, pp. 423–465.
  3. Agarwal, R.; Chen, Z.; Sarma, S.V. A Novel Nonparametric Maximum Likelihood Estimator for Probability Density Functions. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1294–1308.
  4. Ferreira, T.R.; Liska, G.R.; Beijo, L.A. Assessment of Alternative Methods for Analysing Maximum Rainfall Spatial Data Based on Generalized Extreme Value Distribution. SN Appl. Sci. 2024, 6, 34.
  5. Pendharkar, P.C.; Koehler, G.J. A General Steady State Distribution Based Stopping Criteria for Finite Length Genetic Algorithms. Eur. J. Oper. Res. 2007, 176, 1436–1451.
  6. Pendharkar, P.C. A Steady State Convergence of Finite Population Floating Point Canonical Genetic Algorithm. Int. J. Comput. Sci. 2008, 2, 184–199.
  7. Pendharkar, P.; Rodger, J. An Empirical Study of Impact of Crossover Operators on the Performance of Non-Binary Genetic Algorithm Based Neural Approaches for Classification. Comput. Oper. Res. 2004, 31, 481–498.
  8. van Ravenzwaaij, D.; Cassey, P.; Brown, S.D. A Simple Introduction to Markov Chain Monte-Carlo Sampling. Psychon. Bull. Rev. 2018, 25, 143–154.
  9. Ntzoufras, I. Bayesian Modeling Using WinBUGS; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 2009.
  10. Goldberg, D.E. Genetic Algorithms in Search, Optimization & Machine Learning; Addison-Wesley: Reading, MA, USA, 1989.
  11. Tucker, J.D.; Shand, L.; Chowdhary, K. Multimodal Bayesian Registration of Noisy Functions Using Hamiltonian Monte Carlo. Comput. Stat. Data Anal. 2021, 163, 107298.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
