On a Robust MaxEnt Process Regression Model with Sample-Selection

In a regression analysis, a sample-selection bias arises when a dependent variable is partially observed as a result of the sample selection. This study introduces a Maximum Entropy (MaxEnt) process regression model that assumes a MaxEnt prior distribution for its nonparametric regression function and finds that the MaxEnt process regression model includes the well-known Gaussian process regression (GPR) model as a special case. Then, this special MaxEnt process regression model, i.e., the GPR model, is generalized to obtain a robust sample-selection Gaussian process regression (RSGPR) model that deals with non-normal data in the sample selection. Various properties of the RSGPR model are established, including the stochastic representation, distributional hierarchy, and magnitude of the sample-selection bias. These properties are used in the paper to develop a hierarchical Bayesian methodology to estimate the model. This involves a simple and computationally feasible Markov chain Monte Carlo algorithm that avoids analytical or numerical derivatives of the log-likelihood function of the model. The performance of the RSGPR model in terms of the sample-selection bias correction, robustness to non-normality, and prediction, is demonstrated through results in simulations that attest to its good finite-sample performance.


Introduction
The Bayesian nonparametric method is a powerful approach for regression problems when the shape of the underlying regression function is unknown, the function may be difficult to evaluate analytically, or other requirements such as design costs may complicate the process of information acquisition process. Bayesian orthogonal basis expansion regression, spline smoothing regression, wavelet regression, and Gaussian process regression (GPR) are powerful nonparametric Bayesian approaches that address these regression problems. These regression techniques have been extensively used in fields such as psychology, data science, engineering, neuroscience, and fishery [1][2][3][4][5].
Sample selection (or incidental truncation) in regression analysis is known to often arise in a wide variety of practical problems and standard analysis of data with sample selection leads to biased results because the selected sample represents only a subset of a full population; see [6][7][8].
Regression analysis also has problems regarding sensitivity to outliers and departures from the normality of the dependent variable (see [9,10]). Thus, when one implements nonparametric Bayesian regression with non-normal data with sample selection, the selection mechanism and non-normality of the data must be jointly modeled with the Bayesian nonparametric regression model to correct the sample-selection bias and to implement a robust statistical inference. In this regard, several estimation procedures have been considered in the literature to produce robust linear regression models that are subject to sample selection including for instance [6,7], for frequentist methods, and [8,9,11] for Bayesian methods. See [9,12] to obtain robust Bayesian sample-selection models other than the regression model. In addition, no studies have generalized a nonparametric regression model to deal with non-normal data with the sample selection.
The objective of this paper is to introduce the Maximum Entropy (MaxEnt) process regression model as a new Bayesian nonparametric regression model and to then generalize this model to propose a robust sample-selection Bayesian nonparametric regression model along with its inferential methodology. The MaxEnt process regression model is obtained by assuming a MaxEnt prior distribution for its nonparametric regression function, and it includes the GPR model as a special case. This provides a relationship between the MaxEnt nonparametric regression approach and the rationale to conduct a Gaussian regression analysis. This study focuses on the GPR model as a special MaxEnt process regression model and a powerful analysis model towards nonparametric regression problems. Then, the GPR model is generalized to obtain a robust sample-selection Gaussian process regression (RSGPR) model. This RSGPR model extends the GPR model to account for the sample selection scheme, and it is robust when the data are heavy-tailed or contain outliers.
The RSGPR model consists of two components. The first is a robust GPR model that determines the level of the dependent variable of interest and the second is an equation that describes the selection mechanism that determines whether we have observed the dependent variable or not. The sample-selection bias arises when these two components are correlated and must be modeled jointly. A Bayesian hierarchical methodology is developed here to estimate the RSGPR model. This methodology relies on a stochastic representation technique (see, e.g., [13]) to set up the Bayesian hierarchy of the RSGPR model, and it has three attractive features. First, given the likelihood function of the model, the posterior of its parameters does not belong to any well-known parametric family, but the methodology uses a simple Markov chain Monte Carlo (MCMC) algorithm that does not resort to generating random draws from the complex posterior. Second, the output of the algorithm not only provides a Bayesian analogue of confidence intervals for the regression function, but it also readily gives an indication of the presence (or absence) of the sample-selection bias. Third, if there is prior information, such as restrictions on the regression function, such information can be incorporated easily through a prior distribution.
The remainder of this study is organized as follows. Section 2 introduces the MaxEnt process regression model that strictly includes the GPR model. Then, this section formulates the RSGPR model that is obtained by incorporating the GPR model with a class of scale mixtures of normal errors as well as a selection model comprising a class of scale mixtures of the probit sample selection equations. Properties of this RSGPR model are studied including the exact distribution of a selected observation, a stochastic representation, a distributional hierarchy, and the magnitude of the sample-selection bias. In Section 3, we construct a Bayesian hierarchical model for inference in the RSGPR model by exploiting the stochastic representation and distributional hierarchy. Then we develop a Bayesian estimation methodology based on the hierarchical model to provide a simple estimation procedure for the RSGPR model. We further construct a computationally feasible MCMC algorithm through a Bayesian hierarchical approach. Section 4 examines the finite-sample performance of the method through a limited but informative simulation. This numerical illustration shows the usefulness of the RSGPR model for the Gaussian process regression analysis of non-normal data with the sample selection. The study then concludes with a discussion in Section 5. Proofs and additional details are provided in the Appendix A.

MaxEnt Process Regression Model
Consider the following nonparametric regression model, where y n = (y 1 , . . . , y n ) is n × 1 vector of responses, is the n × p design matrix, and n = ( 1 , . . . , n ) is a n × 1 vector of i.i.d. random noises with zero mean vector. In the basic model structure of (1), the parametric form of the regression function η n (x) is not assumed, but η n (x) is assumed to have specific types of functional structure. For example, η n (x) can be represented with a Fourier series [14], splines [15], kernels [16] and others.
In the Bayesian nonparametric regression, we assume that the regression function (or signal term) η n (x) is a random function that follows a particular distribution. This distribution is subjective in the sense that the distribution reflects our uncertain prior information regarding the function. Sometimes, we have a situation in which partial prior information on η n (x) is available, outside of which it is desirable to use a prior that is as non-informative as possible. In this situation, Boltzmann's maximum entropy theorem (see, e.g., [17]) yields a maximum entropy prior π max (η n (x)) that is an exponential form and maximizes the entropy, in the presence of partial information for various moment functions of η n (x). In a special case where we only have partial prior information about the mean vector and covariance matrix functions of η n (x) of the Bayesian nonparametric regression model (1), Boltzmann's maximum entropy theorem yields the following prior distribution of η n (x).
Lemma 1. Let n × 1 regression function vector η n (x) have a prior distribution on R n whose partial information on the mean and covariance functions are m(x) = (m(x 1 ), . . . , m(x n )) and K(x) = {κ(x i , x j )}, respectively. Then the maximum entropy prior of η n (x) is for η n (x) ∈ R n . This is a density of GP (m(x), K(x)), a Gaussian process defined by the mean function m(x) and the covariance function K(x).
Note that the Gaussian process GP m(x), K(x) defines a collection of random functions wherein any finite subset of the process has multivariate normal (Gaussian) distribution. From now on, we will write the Gaussian process as η n (x) ∼ GP m(x), K(x) . The only restriction on the Gaussian process is that the covariance function K(x) must be an n × n positive definite symmetric (pds) matrix. If K(x) is not a pds matrix, then the corresponding value of H(π max ) = p(1 + log(2π))/2 + log(|K(x)|)/2 will not be defined (see, e.g., [18]). As a result, this paper introduces yet another Bayesian nonparametric regression model by combining the regression model (1) and the MaxEnt prior in Lemma 1 and introducing a normal regression error distribution. The model is named as a "MaxEnt process regression model" and defined by y n = η n (x) + n , n ∼ N n (0, σ 2 I n ), where N n (0, σ 2 I n ) is an n-variate normal distribution with mean vector 0 and covariance matrix σ 2 I n , IG(ν 1 , ν 2 ) denotes an inverse gamma distribution with shape parameter ν 1 and scale parameter ν 2 , η n (x) is independent of σ 2 and n , the mean function m(x i ) reflects the expected function value at input x i , i.e., m(x i ) = E[η(x i )], and the covariance function κ(x i , x j ) models the dependence between the function values at different input points x i and x j , i.e., κ( . . , n. See [19] for choice of an appropriate covariance function based on assumptions such as smoothness and likely patterns to be expected in the data. A commonly used isotropic covariance function in practice is the squared exponential covariance function given by where u 0 and w 0 are hyperparameters and which are relevant for the shape of MaxEnt process regression. Here u 0 stands for global scale of the covariance matrix K(x) and w 0 stands for smoothing parameter, respectively. We can easily see that the MaxEnt process regression model (5) is the same as the GPR model considered by [19,20]. This proves the following corollary and can hence be used as an information theoretic justification for using the GPR model as a Bayesian approach for a nonparametric regression analysis.
Corollary 1. Suppose E[η n (x)] = m(x) and Cov(η n (x)) = K(x) are all the prior information on η n (x) for the Bayesian nonparametric regression model (1). Then MaxEnt prior distribution of η n (x) is GP m(x), K(x) which defines the GPR model.
According to Corollary 1, we shall denote the MaxEnt process regression model by the GPR model. When there is no functional constraint in the GPR model, then the prior specification in the model (3) can be used, and posterior inference can be performed without difficulty. It is seen, from [20], that the conditional posterior distribution of η n (x) is normal with the mean and covariance given by Var[η n (x)|(y n , x), However, in the GPR analysis, a sample-selection scheme often applies to the response variable that results in missing not at random (MNAR) observations on the variable. In this case, the regression analysis using only the selected cases will lead to biased results (see, e.g., [6][7][8]). This study provides a bias correction procedure for the GPR analysis with MNAR data generated from the sample-selection scheme. For the analysis, we propose a robust sample-selection GPR (RSGPR) model based on a family of scale mixtures of normal (SMN) distributions (see [21,22] for details). This approach reflects the MNAR mechanism as well as its robustness against departures from the normality assumption (see, e.g., [11,12]), and proposes a robust GPR model to analyze the partially observed sample-selection data.

Proposed Model
We propose the RSGPR model through the following steps. First, we modify the GPR model (3) by incorporating the SMN error distribution for a robust GPR analysis. Then, we connect the robust GPR model directly to a sample-selection model by introducing some latent variables to explain the partially observed sample-selection data. To model the sample-selection mechanism, we need to introduce some notation for the partially observed data.
Let s i be a binary variable that takes on value 1 if y i of subject i is observed using the sample-selection scheme, and 0 if that of the subject is not observed using the same scheme. Then, we introduce the following RSGPR model to represent the regression equation of the observable variable y i : where I(·) is an indicator function, SMN 2 (0, Σ, δ, G), a scale mixture of bivariate normal distributions with mixture function δ(ω) and mixing variable ω ∼ G. Here γ = (γ 1 , . . . , γ q ) and Σ = σ 2 ρσ ρσ 1 are parameters to be elicited by using the priors p 0 (γ) and p 0 (Σ).
Without loss of generality, we assume that the single sample selection scheme s i = I(z i ≥ 0) is applied to a random sample of n observations (y i 's) associated with the model (1) and gives only the first n 1 observed values of y i 's out of the n (n > n 1 ) observations according to the sample-selection scheme. Thus, the overall available data information of the RSGPR model consists of the set of s i binary values and the n 1 -tuples of observations (y i , v i ) corresponding to individuals with s i = 1, while v i for those with s i = 0. The purpose of this study is to estimate the regression function η n (x) based on partially observed data (i.e., sample-selection data) with size n 1 .
For fixed η(x i ), the density of the RSGPR model (6) is composed of a continuous component h(y i |s i = 1) and a discrete component p(s i ). The discrete component is . . , n 1 . This density is given by where This distribution is essentially a member of the class of skew-scale mixtures of normal (skew-SMN) distributions discussed by [13,23,24]. We will denote the distribution law of [y i |η(x i ), The following lemma is useful to generate the partially observed y i 's and indicate the difference between the RSGPR model (6) and the GPR model (3).

Lemma 2. For a given value of
for the RSGPR model can be represented by the following two-stages of distributional hierarchy: where U i iid ∼ N (0, δ(ω)) and Z C i ind ∼ T N C i (0, δ(ω)) are independent conditionally on ω, and T N C i (0, δ(ω)) denotes a N (0, δ(ω)) distribution truncated to the interval C i = (α i , ∞).
Lemma 2 shows that the RSGPR model applies to relax the classic assumption of the underlying normality as well as to reflect the sample-selection scheme. This lemma also indicates that the partially observed data y i 's does not represent a random sample from the GPR model generating y i 's, even after controlling for the regression function η(x i ). If we want to apply a GPR analysis to the partially observed sample-selection data, a fitted model should be the RSGPR model. The RSGPR model changes depending on the choice of the distribution of ω and its function δ(ω). In the special case wherein the distribution of ω degenerates at δ(ω) = 1, the RSGPR model produces a sample-selection Gaussian process normal error regression (SGPR N ) model. When we choose ω ∼ G(ν/2, ν/2), a gamma distribution with mean 1 and δ(ω i ) = 1/ω i , the model becomes a sample-selection Gaussian process t ν error regression (SGPR t ν ) model, allowing to regulate the tail distribution of the model by means of the degrees of freedom. We also see that the RSGPR model strictly includes the GPR model because the latter is obtained by setting ρ = 0. For the remainder of this study, we use the symbols in the preceding sections with the same definitions.

The Sample-Selection Bias
As indicated by the density (8) and Lemma 2 the selected observations [y i |s i = 1]'s do not represent a random sample from the GPR model generating y i 's, but they are missing not at random (MNAR) [25] inducing a sample-selection bias. The following results on the sample-selection bias are noted in the Bayesian estimation of the GPR model with the partially observed data. where y n 1 = (y 1 , . . . , y n 1 ) be an n 1 × 1 observed vector, m 1 (x) = (m(x 1 ), . . . , m(x n 1 )) , and K 11 (x) is the first n 1 × n 1 diagonal sub-matrix of K(x).
As shown in Lemma 3, if we use the partially observed y i 's to estimate the GPR model, the existence of the truncated normal distribution term (i.e., W C β 1 ) in Equation (10) induces a biased estimation of the regression function. Note that the distribution becomes normal (i.e., W 1 ∼ N n 1 (0, Ω 2 )) in the GPR model for the case where y i 's are fully observed. Therefore, the usual estimation of the regression function based on the GPR model will produce inconsistent results when ρ = 0. This clearly reveals that sample-selection bias occurs in Bayes estimation of the regression function η n 1 . The magnitude of this bias is as follows.
Corollary 2. Instead of the SGPR N , if the GPR model is used for estimating η n 1 based on observed data y n 1 then a sample-selection bias occurs in its conditional posterior mean. This bias is , and Ω * 2 = Ω 2 .
The sample-selection bias in calculating the marginal effect (or propensity) of a predictor can be also expected.
where v ki and x ki are k-th element of v i and x i , then difference in the marginal effect of the predictor x ki on the selected observation y i between the RSGPR model and the GPR model is where γ k is the k-th element of γ, and E ω denotes the expectation is taken with respect to the distribution of ω ∼ G(ω).
To compare the SGPR N model with the GPR model, various values of the sample-selection bias associated with the posterior mean (see, Corollary 2) and the difference in the marginal effect of the k-th predictor (see, Corollary 3) were calculated and are depicted in Figure 1. For the calculation, we set σ = 1, γ k = 1, and K 11 (x) = 0.5I n 1 + 0.51 n 1 1 n 1 /n 1 , an intra-class covariance matrix, where 1 n 1 is an n 1 × 1 summing vector whose elements are all one. The left panel in Figure 1 is a graph of the sample-selection bias for different values of β i and ρ. This graph shows the values of the first element of the bias vector given in Corollary 2. From the graph, we see that the sample-selection bias occurs in the GPR analysis with sample selection, and its magnitude becomes larger as the values of |ρ| or β i become larger. The sign of the bias is opposite to that of ρ. The right panel shows a graph of the difference in the marginal effect (defined by Equation (11)) as a function of α i and ρ. This graph shows that the absolute value of the difference increases rapidly as α i tends to have a large value, and this difference tends to be larger as the absolute value of ρ becomes larger. Furthermore, the signs of the difference and ρ are different, which is expected for the case where γ k > 0. These panels imply that an inconsistent nonparametric regression analysis is unavoidable, provided that the GPR model is fitted to the partially observed sample-selection data. Instead, the proposed RSGPR model should be used to correct the sample-selection bias and to estimate the true marginal effect of each predictor in the regression analysis.

Hierarchical Representation of the RSGPR Model
Let us revisit the RSGPR model (6) in Section 2.2. From Equations (7) and (8), we see that the log-likelihood function of the RSGPR model based on the partially observed n-tuples of observations This is a complex function for the Bayesian estimation of the parameters (η n 1 and Ψ) of the RSGPR model. Instead, the following hierarchical representation of the RSGPR model is useful for a simple estimation of the parameters.
First, the likelihood function in Equation (12) can be represented by the following distributional hierarchy.
Theorem 1. For the n-pairs of independent observations, (y i , s i ), generated from the RSGPR model defined by Equation (6), their distribution can be written by the following Bayesian hierarchical model: When the prior information on ξ, τ 2 , and γ is not available, a convenient strategy of avoiding improper posterior distribution is to use proper priors with their hyperparameters fixed as appropriate quantity to reflect the diffuseness of the priors (i.e., limiting non-informative priors). For this convenience, the prior distributions in Theorem are used to elicit the prior distributions of ξ, τ 2 , and γ. All hyperparameters that appeared in the prior distributions of the Bayesian hierarchical model are assumed to be given from the prior information of previous studies or other sources.

Full Conditional Posteriors
Let y n 1 = (y 1 , . . . , y n 1 ) , V = (v 1 , . . . , v n ), and s = (s 1 , . . . , s n ) be observed. Further suppose that z = (z 1 , . . . , z n ) and ω = (ω 1 , . . . , ω n ) are the latent observation vector and the scale mixing vector, respectively. Then, based on the RSGPR model, we obtained joint posterior distribution of Θ = {η n 1 , τ 2 , ζ, γ, z, ω} given the observed data set D n = {y n 1 , V, s} : where g(·) is the p.d.f. of the scale mixing variable ω. Note that the joint posterior in (13) is not simplified in an analytic form of the known density and is thus intractable for posterior inference. Instead, we derive conditional posterior distribution of each parameter in Θ in an explicit form, which will be useful for posterior inference by using a Markov chain Monte Carlo (MCMC) method. Given the joint posterior distribution (13), we can obtain the following posterior distributions whose derivations are provided in Appendix A: (1) The full conditional posterior distribution of η n 1 is given by where The full conditional posterior distribution of τ 2 is an inverse Gamma distribution: (3) The full conditional posterior distribution of ζ is a normal distribution: where The full conditional posterior distributions of z i 's are independent and their distributions are given by for i = 1, . . . , n, where The full conditional posterior density of γ is: The full conditional posterior densities of ω i 's are independent and they are given by

Markov Chain Monte Carlo Method
The MCMC scheme, working with the full conditional distributions of the parameters in Θ, is not complicated to implement. A routine Gibbs sampler can be used to generate posterior samples of η n 1 , τ 2 , ζ, z i , and γ based on each of their full conditional posterior distributions obtained in Section 3.2. In posterior sampling of ω i 's, Metropolis-Hastings (M-H) within the Gibbs algorithm can be applied because their conditional posterior densities may not have explicit form of known distribution as in Equation (19). For Gibbs sampling, one should note the following points: (1) Given the initial values of Θ (0) , the implementation of the Gibbs sampler involves R iterative sampling from each of the full conditional posterior distributions obtained in Equation (14) through Equation (19). (2) Gibbs samples of ρ and σ 2 can be obtained by using those of ζ = ρσ and τ 2 = σ 2 (1 − ρ 2 ).
If ω i degenerates at δ(ω i ) = 1, the RSGPR model can be reduced to the SGPR N model. In this case, the MCMC procedure excludes the Gibbs sampling of ω i 's by using the posterior distribution (19). (4) For various distributions of mixing variable ω i and mixing functions δ(ω i ) of the SMN distributions such as t ν , logit, stable, slash, and exponential power models (see, e.g., [21,22]).

(5)
When ω i iid ∼ G(ν/2, ν/2) and δ(ω i ) = 1/ω i , the RSGPR model becomes the SGPR t ν model. For generating ω i 's, we may use the following posteriors where z C i = z i − v i γ. Except for the SGPR N and SGPR t ν , we need to adopt the Metropolis-Hastings algorithm within the Gibbs sampler because the conditional posterior density of ω i does not have explicit form of known distribution. See [26,27] for the algorithm for sampling ω i from the posterior density. (6) When the squared exponential covariance function K(x) in Equation (4) is chosen with unknown hyperparameters u 0 and w 0 , we need to elicit the priors of u 0 and w 0 for the full Bayes methods based on the MCMC method. The priors considered by [28] can be used for this assessment as follows. The prior distributions are a conjugate u 0 ∼ IG(a, b) and w 0 ∼ HC(c, d). Here HC(c, d) denotes the half-Cauchy distribution with the p.d.f. HC(w 0 ; c, d), location parameter c, and scale parameter d. See [28], for compatibility with w 0 ∼ HC(c, d) to elicit the prior information on w 0 . (7) Full conditional posterior distributions of u 0 and w 0 are where a * = a + n 1 /2 and b * = b + u 0 (η n 1 − m 1 (x)) K 11 (x) −1 (η n 1 − m 1 (x)). Note that the conditional posterior density of w 0 does not have explicit form of known distribution. This implies the use of the Metropolis-Hastings algorithm within the Gibbs sampler to generate w 0 from the posterior density. (8) After obtaining the Gibbs samples of Θ, we can use them for Monte Carlo estimation of regression function η n 2 and missing observations y n 2 . They can be also used for predicting regression functions and y i s evaluated at new predictors (see, e.g., [26]).

Prediction with Bias Corrected Regression Function
According to the Gaussian (i.e., MaxEnt) process prior, the joint distribution of the training outputs (η n 1 ) and test outputs (η n 2 ) is where η n = (η n 1 , η n 2 ) , K 12 (x) denotes the n 1 × n 2 matrix of the covariances evaluated at all pairs of training points {x i |i = 1, . . . , n 1 } and test points {x j |j = n 1 + 1, . . . , n}, and similarly for the other entities K 11 (x), K 21 (x), K 22 (x). The RSGPR framework provides a straightforward way of predicting test outputs based on the relevant test points and the training outputs. Conditioning the joint Gaussian prior distribution on the training observations, we arrive at the predictive distribution for the future (or missing) regression function given by The regression function (η n 2 ) value can be sampled from the predictive distribution (21) by evaluating the mean and covariance matrix of the distribution. Thus, it can be generated within the preceding MCMC algorithm for estimating the RSGPR model: We can generate η n 2 and unobserved observation vector y n 2 = (y N 1 +1 , . . . , y n ) in the r-th iteration of the algorithm whose Markov chain is augmented by the following conditional distributions. η (r) where η (r) n 2 = (η(x n 1 +1 ) (r) , . . . , η(x n ) (r) ) and D 2 (δ(ω)) (r) = diag{δ(ω n 1 +1 ) (r) , . . . , δ(ω n ) (r) }. n 2 are respective samples generated from R iterations, then bias corrected expected value of η n 2 and that of posterior predictive distribution of y n 2 can be approximated via Monte Carlo byη

Numerical Illustrations
This section presents empirical results of the Bayesian hierarchical RSGPR analysis of non-normal data with the sample selection. We provide results obtained from simulated data applications comparing the performance of the RSGPR model with that of the GPR model. We developed our program written in R (see, e.g., [29]), which is available from the authors upon request.

Simulation Scheme
In this simulation, we evaluated the finite-sample performance of the RSGPR model by using sample-selection data generated for different sizes. The performance was assessed in terms of sample-selection bias correction and robustness to non-normal model errors. These could be measured by comparing the posterior estimation and prediction results of the RSGPR model with those based on the GPR model. Specifically, we compared the results obtained from the SGPR N (or SGPR t 10 ) analysis with the results of the GPR (or GPR t 10 ) analysis based on a partially observed sample-selection data. This study also demonstrated that the SGPR t ν model is more robust against outliers compared to the SGPR N model. To evaluate the performance, we generated M = 300 sets of partially observed sample-selection data with size n = 300 with n 1 = 150 (i.e., the missing rate is 0.5) from each of the three models (see details below). The general form of the three models is as follows: where s i = I(z i ≥ 0), γ = 0, ρ = 0.5, and σ = 3. Model 1 was defined by assuming that the distribution G degenerates at ω = 1. Model 2 was obtained from the model (22) by setting δ(ω) = 1/ω and G ∼ G(10/2, 10/2). Model 3 assumed a mixture of bivariate normal errors instead of the SMN 2 (0, Σ, δ, G) distribution. Throughout our simulation, the hyper-parameters for the Bayesian hierarchical model in Theorem 1 were chosen to reflect the diffuseness of the priors. To obtain the limiting non-informative priors of ζ, τ 2 and γ, their hyper-parameters were assessed as θ 0 = 0, σ 0 = 10, c = 0.001, d = 0.001, γ 0 = 0, and Ω 0 = 10. Note that when our observational data were augmented through proper prior information, as in this simulation study, the issue to identify the parameters in the RSGPR model disappeared.
In the simulation, we proceeded as follows to estimate the parameters. Using each of M = 300 datasets generated from the models (Model 1, Model 2, and Model 3), we fitted the RSGPR and GPR models and applied the proposed Bayesian hierarchical methodology to estimate the parameters of the fitted models by assuming the above prior distributions. To implement the methodology by using each generated dataset, we obtained 15,000 posterior samples from the developed MCMC algorithm (in Section 3) with 5 thinning periods after a burn-in period of 5000 samples. This sampling plan guaranteed a convergence of the chain of the MCMC algorithm. The MCMC method (applied to each of M = 300 datasets) gave estimates (or predictions) of the nonparametric regression function (η(x)) as well as the other parameters of the RSGPR model.
The variability in the regression function estimates (η n 1 ) and predictions (η n 2 ) obtained by using a dataset were then visualized as shown in Figures 2 and 3. These figures compare the estimates (or predictions) of the nonparametric regression function obtained from two models (the RSGPR and GPR models). The black line of each graph in the figures shows the true regression function of the model (22). The red dashed line denotes the posterior mean (or predicted value) of the regression function of the model (22) obtained by using a Bayesian hierarchical RSGPR analysis with the sample-selection data of size n 1 = 150, while the blue dashed line depicts that obtained by using a GPR analysis of the sample-selection data. The 97.5th quantile and 2.5th quantile of 3000 posterior samples (predictions) of each regression function (η(x i )) in the RSGPR model were also calculated. In each figure, these quantiles were used to draw 95% posterior (or prediction) intervals of η(x i )'s by using the gray band. The accuracy of parameter estimates was calculated by using the mean absolute bias (MAB) and the root mean square error (RMSE): where M = 300 andθ k is the posterior estimate of -th element of p × 1 parameter vector θ in the k-th replication.

Sample-Selection Data Generated from Model 1
If the distribution G is degenerated at ω = 1, we then obtain the SGPR N model from the RSGPR model (6). Using each of M = 300 datasets generated from Model 1, the proposed Bayesian hierarchical methodology was applied to estimate the parameters of the model. If we set ρ = 0, the methodology yielded posterior estimates of the GPR model. The estimation results for the parameter η n 1 of our primary interest, based on the SGPR N and GPR models, are shown in the left panel of Figure 2. The left panel provides the following results. First, the posterior estimates of η n 1 based on the proposed SGPR N model (red dashed line) are close to their true values (black line), while those based on the GPR model (blue dashed line) tend to have severe sample-selection bias. Second, when the SGPR N model was used to fit the generated sample-selection dataset, the posterior estimates of regression function based on the model concentrated true values of the η(x i )'s as shown in their 95% posterior intervals (gray band). Third, the difference between the true regression function (black line) and the estimated regression function obtained by using the GPR model (the blue dashed line) confirms Lemma 4, which shows the existence of the sample-selection bias in the GPR regression for data with the sample selection. In summary, the left panel of Figure 2 illustrates the existence of the sample-selection bias in the GPR analysis with the sample-selection data, as discussed in Section 2.3. It also demonstrates the performance of the proposed methodology based on the SGPR N model to eliminate the sample-selection bias (or inconsistency in estimating the regression function), which could not be achieved by using the GPR model.
The mean of M = 300 estimation results for the other parameters were listed in Table 1. As shown in Table 1, the MCMC parameter estimates were close to their true values for the SGPR N model, while those based on the GPR model were severely biased. In addition, the MAB and RMSE values in the table ensure that the performance of the SGPR N model is far better than the GPR model when the sample selection data was used for Bayesian nonparametric regression analysis. For both models, the small values of Monte Carlo (MC) error (compared to the RMSE) of each parameter suggests that approximate convergence was reached and the sequence generated from the MCMC samples was well mixed. The proposed Bayesian hierarchical methodology for the SGPR t 10 model was applied to each of M = 300 datasets generated from Model 2. The SGPR t 10 model can be obtained from the SGPR model (6) by setting δ(ω) = 1/ω and ω ∼ G(10/2, 10/2). If we set ρ = 0, the methodology could also be used to obtain posterior samples to estimate the GPR t 10 model (GPR model with t 10 errors). The results of the simulation appear in the right panel of Figure 2 and Table 1. Graphs in the right panel of Figure 2 depict the prediction results of η n 2 based on the SGPR t 10 and GPR t 10 models. The graphs clearly show that the sample-selection bias in predicting η(x i )'s based on the GPR t 10 model is too large to allow for a prediction of the true regression function η n 2 (or future regression function). However, the proposed methodology using the SGPR t 10 model correctly predicted the true regression function; see 95% prediction interval andη n 2 , i.e., red dashed line. The prediction of η n 2 based on the SGPR t 10 model is far better than that based on the GPR t 10 model. Compared to the GPR t 10 model, the methodology based on the SGPR t 10 model yields smaller MAB and smaller RMSE of the parameter estimates; see Table 1. Table 1 shows that the parameter estimates of the SGPR t 10 model with heavy-tailed errors tend to produce larger estimation errors (MAB and RMSE) than those of the SGPR N model. The results of the above simulation demonstrate the superior performance of the SGPR t 10 model and the usefulness of the proposed Bayesian hierarchical methodology to remedy the sample-selection bias in the prediction that occurred in the GPR t 10 analysis of the sample-selection data.

Data Generated from Model 3 with Normal Mixture Errors
We generated datasets from Model 3 with size n = 300. Model 3 was defined by the model (22) with independent bivariate normal mixture errors: viz. 0.4 N 2 (0, Σ (1) ) + 0.2 N 2 (0, Σ (2) ) + 0.2 N 2 (0, Σ (4) ) + 0.1 N 2 (0, Σ (8) ) + 0.1 N 2 (0, Σ (16) ), where 50% of the outcomes were missing in each dataset and Σ (k) was equal to Σ whose value of σ 2 was 9k. The generated dataset was fitted to the SGPR N , SGPR t 5 , and SGPR t 10 models in turn. Based on posterior samples, we calculated the Bayes estimates of the three models' parameters together with their deviance information criterion (DIC) values introduced by [30]. The average DIC values obtained from the dataset were found to be 2727.06, 1477.43, 1396.24 for the SGPR N , SGPR t 10 , and SGPR t 5 models, respectively. This suggested that the SGPR t 5 model is the best fitting model among the three models, while the SGPR N model is the worst.
The graphs in the left panel of Figure 3 show the true regression function (black line) and estimated regression functions (η n 1 ) under the best fitting SGPR t 5 model (red line) and the SGPR N model (blue line). The graphs in the right panel of Figure 3 depict predicted regression function (η n 2 ) based on the best fitted model (in red), the SGPR N model (in blue), and the true regression function (in black). Even though the best fitted model based on bivariate t 5 error distributions was misspecified, 95% posterior intervals (or prediction intervals) of η(x i ) obtained from the SGPR t 5 model did include the true regression function values (see gray bands in Figure 3). The prediction result of the SGPR t 5 and SGPR N models are very wild due to outliers generated by the normal mixture errors, while the graphs of the SGPR t 5 are more robust for the model misspecification.

Conclusions
This study considered a MaxEnt approach to develop a Bayesian nonparametric regression analysis of non-normal data with the sample selection. For this purpose, by using Boltzmann's maximum entropy theorem, we introduced a MaxEnt process regression model that reflects partial prior information for an uncertain regression function. We found that a special case of the MaxEnt regression model reduced to the well-known GPR model. Second, we generalized the GPR model to propose the RSGPR model and explored its theoretical properties. These properties showed that the new model was well-designed to correct the sample-selection bias and implement a robust GPR analysis. Third, we developed a hierarchical RSGPR model based on a stochastic representation of the RSGPR model and proposed a Bayesian hierarchical methodology for the RSGPR analysis of a non-normal data with sample selection. A simulation study showed that the finite sample performance of the proposed methodology eliminated the sample selection bias and estimated the population model parameters with robustness and high accuracy. In a comparative numerical study on the analysis of nonparametric regression models with sample selection data, we found that the estimation results using the RSGPR model outperformed those using the GPR model for both in-sample estimation and out-of-sample forecasts.
The theoretical results of the RSGPR model and the methodology for the RSGPR analysis proposed in this study have several interesting issues that are worth considering further. First, the RSGPR framework using the MaxEnt process prior can be generalized to the so called stochastically constrained RSGPR regression that uses the constrained MaxEnt process as the prior distribution of the regression function with uncertain constraints. Second, an empirical study with real data as well as an asymptotic evaluation, such as consistency, would be particularly noteworthy to explore. For example, estimating monotone regression function with or without uncertainty and testing the monotonicity of the regression function can be considered in the context of a constrained RSGPR analysis with sample-selection data. Finally, the Bayesian hierarchical methodology can be broadened in various regression models with the general class of skew-S MN error distributions considered by [11]. For example, this methodology can be applied to a von Bertalanffy growth curve analysis of heavy-tailed fishery data with sample selection (see, e.g., [28]). We hope to address all of these in the future. (3) Full conditional distribution of ζ: Equation (13) gives the full conditional density of ζ given by which is the kernel of N (θ ζ , σ 2 ζ ) distribution. (4) Full conditional distributions of z i 's: Equation (13) indicates that the full conditional posterior densities of z i s are mutually independent, and that for each i, Since the support of z i is {z i ; z i ≥ 0} for s i = 1, while {z i ; z i < 0} for s i = 0, we have the truncated normal distributions. (5) Full conditional distribution of γ: The full conditional posterior density of γ is given by which is the kernel of N q (θ γ , Σ γ ) distribution.