Bayesian Inference under Small Sample Sizes Using General Noninformative Priors

Abstract: This paper proposes a Bayesian inference method for problems with small sample sizes. A general type of noninformative prior is proposed to formulate the Bayesian posterior. It is shown that this type of prior can represent a broad range of priors, such as classical noninformative priors and asymptotically locally invariant priors, and can be derived as the limiting states of normal-inverse-Gamma conjugate priors, allowing for analytical evaluations of Bayesian posteriors and predictors. The performance of different noninformative priors under small sample sizes is compared using a likelihood combining both fitting and prediction performances. The Laplace approximation is used to evaluate this likelihood. A realistic fatigue reliability problem is used to illustrate the method. Following that, an actual aeroengine disk lifing application with two test samples is presented, and the results are compared with those of the existing method.


Introduction
Sample sizes are often quite small in many engineering fields due to time, economic, and physical constraints, for example, life testing data of high-reliability mechanical components, large and complex engineering systems, and so on. The influence of sample size on interpreting classical significance test results has been discussed in several studies [1][2][3]. The predictive models established based on scarce samples may highly depend on the chosen parameter estimation method [4,5]. In those cases, the probabilistic approach is usually preferred over the deterministic approach [6][7][8][9][10][11]. The Bayes rule provides a consistent and rational mathematical device to incorporate relevant information and prior knowledge for probabilistic inference.
To motivate the discussion, consider an observable random variable X with a conditional probability density function (PDF) (or simply density) of p(x|φ), where φ ∈ Φ is also a random variable. The inverse problem is to make inferences about φ given an observed value x of X. The Bayesian approach to the solution is to use some density p(φ) over Φ to represent the prior information of φ. In this way, the prior knowledge of φ can be encoded through the Bayes rule to obtain the posterior PDF of φ given x, p(φ|x) ∝ p(φ)p(x|φ). (1) The forward inference, e.g., the PDF of a certain variable or the probability of an event involving φ, can be made on the basis of the posterior PDF. Using the method of Markov chain Monte Carlo (MCMC), samples can directly be drawn from the posterior distribution without knowing the normalizing constant p(x) = ∫ p(φ)p(x|φ)dφ in Equation (1). The Bayesian method has been successfully demonstrated in many important disciplines [12][13][14][15][16][17][18][19].
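As a minimal illustration of Equation (1) and the normalizing constant p(x), the posterior for a one-parameter problem can be evaluated on a grid. The setup below is hypothetical (a unit-variance Gaussian likelihood, a standard normal prior, and a single observation x = 2) and serves only as a sketch:

```python
import numpy as np

# Grid sketch of Bayes' rule: p(phi|x) ∝ p(phi) p(x|phi).
# Hypothetical setup: infer the mean phi of a N(phi, 1) model from a single
# observation x = 2.0, with a N(0, 1) prior on phi.
phi = np.linspace(-6.0, 6.0, 4001)            # discretized parameter space
dphi = phi[1] - phi[0]
prior = np.exp(-0.5 * phi**2)                 # N(0, 1), unnormalized
x = 2.0
likelihood = np.exp(-0.5 * (x - phi)**2)      # N(phi, 1) density at x

unnorm = prior * likelihood                   # p(phi) p(x|phi)
evidence = unnorm.sum() * dphi                # p(x) = ∫ p(phi) p(x|phi) dphi
posterior = unnorm / evidence                 # normalized posterior on the grid

post_mean = (phi * posterior).sum() * dphi
print(post_mean)  # the conjugate closed form gives x/2 = 1.0
```

For this conjugate pair, the grid result matches the closed-form posterior N(x/2, 1/2), which is a convenient check of the normalization.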
The choice of p(φ) can have a great influence on the inference result. The proper choice of priors has been extensively discussed in the probability and statistics communities, and it can never be overlooked, as it is one of the fundamental pieces of Bayesian inference [20][21][22][23][24][25][26]. On the one hand, a formal rule for constructing a prior regardless of the data and likelihood is sought in the field of physics [9,27]. Jaynes and Bretthorst [28] argued that "problems of inference are ill-posed until we recognize three essential things. (A) The prior probabilities represent our prior information, and are to be determined, not by introspection but by logical analysis of that information. (B) Since the final conclusions depend necessarily on both the prior information and the data, it follows that, in formulating a problem, one must specify the prior information to be used just as fully as one specifies the data. (C) Our goal is that inferences are to be completely 'objective' in the sense that two persons with the same prior information must assign the same prior probabilities." (p. 373). On the other hand, the choice of a prior may highly depend on the data and likelihood in practice. Gelman et al. [29] argued that "a prior can in general only be interpreted in the context of the likelihood with which it will be paired" (p. 12).
To ensure a consistent and objective inference, rules for constructing priors with minimal subjective constraints are sought. Early work on the construction of such priors is based on the "ignorance" over the parameter space using invariance techniques [27,30,31]. The fundamental reasoning is that the priors should carry the same amount of information such that a change of scale and/or a shift of location does not affect the inference results on those parameters. The ignorant prior can systematically be derived using the concept of the transformation group. Using different transformation groups, different priors can be obtained. For simplicity, such priors are loosely referred to as noninformative priors. One of the most notable priors for a scale parameter σ of a distribution is Jeffreys' prior, i.e., p(σ) ∝ 1/σ. Jeffreys' prior can be obtained using the tool of the transformation group under the condition that a change of scale does not change the state of knowledge. A further extension of the noninformative priors is the "reference priors" [32,33]. The theoretical framework has been discussed in many studies, including, but not limited to, [31,34]. Apart from the noninformative approach to deriving priors, there are several less objective approaches to constructing priors. Gelman [35] proposed weakly informative priors based on the idea of conditional conjugacy for hierarchical Bayesian models. The conditionally conjugate priors provide some computational convenience, as a Gibbs sampler can be used to draw samples from a posterior distribution, and some parameters for inverse-Gamma distributions in Bayesian applications are suggested. Simpson et al. [36] proposed a method for building priors. The basic idea is to penalize the complexity induced by deviating from a simpler base model, using the Kullback-Leibler divergence as a metric [37].
For critical problems with small sample sizes, the choice of the prior in Bayesian inference can have a great impact on the inference results, rendering decision-making unreliable. Despite a great deal of research, including that mentioned above, the choice of an optimal prior remains nontrivial from a practical point of view, and a systematic method to cope with such cases is rarely seen. Moreover, a quantitative measure to evaluate the performance of a prior is equally important to identify the optimal prior. This study develops a noninformative Bayesian approach to probabilistic inference under small sample sizes. A general 1/σ^q-type of noninformative prior for the location-scale family of distributions is proposed. This type of prior is further shown to be the limiting state of the commonly used normal-inverse-Gamma conjugate prior, thus allowing for an analytical evaluation of the posterior of the parameters. Given a linear model with a Gaussian error variable, the analytical form of the posterior of the model prediction can also be obtained. The informative or noninformative nature of the prior is centered more on the concept of whether the prior encodes subjective information; therefore, it is less relevant to the likelihood function (and data). The subjective information is the source of inconsistency in the inference result, as different practitioners can use different subjective information in the form of assumptions, which in most cases may only be justified by oneself. However, the prior has to be built with a certain state of knowledge. This state of knowledge, or loosely called information, is not related to experience nor assumptions, but is a result of a logical reasoning process [27]. The purpose of the study is not to argue whether these assumptions are justified or not, nor to question the suitability of the assumptions made during the inference.
This study provides an alternative to existing methods in such cases to avoid introducing subjectiveness into the Bayesian inference. In particular, when the number of samples is small, the influence of the prior can be as important as the likelihood (and data). In such cases, subjective information encoded in the prior can lead to biased results.
The remainder of the paper is organized as follows. First, the Bayesian model with noninformative priors is developed; in particular, a general 1/σ^q-type of noninformative prior for the location-scale family of distributions is proposed for small sample problems. The noninformative priors are further shown to be the limiting states of the normal-inverse-Gamma (NIG) conjugate priors. The closed-form expressions of the Bayesian posterior and predictors using the proposed noninformative priors are obtained under a Gaussian likelihood. Next, a performance measure considering both the fitting performance and predictive performance is proposed using the concept of Bayes factors. Different priors are treated as models in a Bayesian hypothesis testing context for comparisons. Following that, the method is illustrated using two examples.

Bayesian Linearized Models with Noninformative Priors
To motivate the discussion, consider a general linear or linearized model:

y_i = x_i θ + ε_i,    (2)

where θ is a k-dimensional column vector and the ε_i are independent and identically distributed random error variables. The distribution of ε_i determines the likelihood function, or vice versa. Without loss of generality, the distribution belongs to a location-scale family of distributions. Furthermore, it is enough to use a single scale parameter to characterize the distribution of ε, since any constant nonzero mean, no matter known or unknown, can be grouped into θ. Denote the scale parameter of ε as σ². A Bayesian model incorporates both the prior information and the observation through Bayes' rule. Using the matrix form y = xθ + ε, the Bayesian posterior of (θ, σ²) writes:

p(θ, σ²|y) ∝ p(θ, σ²)p(y|θ, σ²).    (3)

The common Gaussian error variable, i.e., ε_i ∼ N(0, σ²), corresponds to the following likelihood function for n observations y = (y_1, y_2, ..., y_n)ᵀ and n input vectors x = (x_1, x_2, ..., x_n)ᵀ, where x_i, i = 1, ..., n, is a row vector of size k:

p(y|θ, σ²) = N(xθ, σ²I).    (4)

It should be noted that the error variable does not necessarily follow a Gaussian PDF, and other types of error distributions can be used. For example, an extreme-value PDF for the error corresponds to a Weibull likelihood for the log-transformed model prediction.

A General Form of Noninformative Priors-∝ 1/σ q
A general ∝ 1/σ^q (q ≥ 0) form of priors is considered here. The classical Jeffreys' prior and the related asymptotically locally invariant priors are introduced first for completeness.
Consider the PDF of a random variable x characterized by a parameter vector φ. Jeffreys' noninformative prior distribution of φ is proportional to the square root of the determinant of the Fisher information matrix, i.e.,

p(φ) ∝ √det I(φ),    (5)

where det(·) is the determinant operator and I(·) is the Fisher information matrix. The key feature of this prior is invariance under monotone transformations of φ. This feature is achieved using the change-of-variables theorem. Denote the reparameterized variable or vector as ψ; it can be shown that:

p(ψ) = p(φ)|det(∂φ/∂ψ)|,    (6)

√det I(φ) |det(∂φ/∂ψ)| = √det I(ψ).    (7)

For a Gaussian likelihood with unknown parameters µ and σ², Jeffreys' prior for the joint parameter (µ, σ²) is:

p(µ, σ²) ∝ √det I(µ, σ²),    (8)

where I(µ, σ²) is:

I(µ, σ²) = −E[∂² ln p(x|µ, σ²)/∂(µ, σ²)∂(µ, σ²)ᵀ],    (9)

and E(·) is the expectation operator. Using algebraic deduction and integration, I(µ, σ²) is simplified to:

I(µ, σ²) = diag(1/σ², 1/(2σ⁴)).    (10)

As a result, Jeffreys' prior for (µ, σ²) is:

p(µ, σ²) ∝ 1/σ³.    (11)

It is noted that the distribution of interest is (µ, σ²), not (µ, σ); therefore, the derivative is taken with respect to σ² as a whole instead of σ. In other words, the parameter space is on σ², not σ. For example, the Jeffreys' prior for σ², with a fixed value of µ, is:

p(σ²) ∝ 1/σ².    (12)

The Jeffreys' priors for (µ, σ) and for σ with a fixed µ are ∝ 1/σ² and ∝ 1/σ, respectively. It is shown in Appendix B that the Fisher information matrix is also the Hessian matrix of the Kullback-Leibler (KL) distance of a deviated distribution with respect to the true distribution, evaluated at the true parameters. For exponential families of distributions, the KL distance has analytical forms, allowing for the evaluation of the Fisher information matrix without involving the integrals in Equation (9).
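The entries of I(µ, σ²) can be sanity-checked numerically. The sketch below, an illustrative Monte-Carlo check not part of the original derivation, estimates the Fisher information at µ = 0, σ² = 1 from the score vector and recovers diag(1/σ², 1/(2σ⁴)), so that √det I(µ, σ²) scales as 1/σ³:

```python
import numpy as np

# Monte-Carlo estimate of the Fisher information matrix for a Gaussian
# likelihood with parameters (mu, sigma^2), evaluated at mu=0, sigma^2=1.
# The score is d ln p / d(mu, sigma^2); I = E[score score^T].
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

mu, v = 0.0, 1.0
s_mu = (x - mu) / v                            # d ln p / d mu
s_v = -0.5 / v + (x - mu) ** 2 / (2 * v**2)    # d ln p / d sigma^2

I = np.array([
    [np.mean(s_mu * s_mu), np.mean(s_mu * s_v)],
    [np.mean(s_v * s_mu), np.mean(s_v * s_v)],
])
# Analytic result: diag(1/sigma^2, 1/(2 sigma^4)) -> diag(1, 0.5) here,
# so sqrt(det I) is proportional to 1/sigma^3.
jeffreys = np.sqrt(np.linalg.det(I))
print(I)
print(jeffreys)   # approximately 1/sqrt(2)
```

The off-diagonal terms vanish in expectation, which is why the joint prior factors so cleanly.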
The asymptotically locally invariant (ALI) prior is another type of prior that satisfies invariance under certain transformations. An ALI prior can be uniquely determined using the following equation according to [31]:

d ln p(φ)/dφ = −E[f₁ f₂]/E[f₂],    (13)

where f₁ = ∂ ln p(x|·)/∂φ and f₂ = ∂² ln p(x|·)/∂φ². For a Gaussian distribution with a fixed mean and a random σ, the two terms are E[f₁ f₂] = −6/σ³ and E[f₂] = −2/σ². Solve:

d ln p(σ)/dσ = −(−6/σ³)/(−2/σ²) = −3/σ    (14)

to obtain the ALI prior for σ:

p(σ) ∝ 1/σ³.    (15)

Similarly, the ALI prior for σ² is:

p(σ²) ∝ 1/σ⁴.    (16)

When both µ and σ are random, the joint ALI prior for (µ, σ) is:

p(µ, σ) ∝ 1/σ³.    (17)

It is noticed that the Jeffreys', ALI, and uniform priors are all reproduced from ∝ 1/σ^q as q takes different integer values. In the following, the derivation of a 1/σ^q prior from NIG conjugates is shown.
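The two expectations quoted above, E[f₁f₂] = −6/σ³ and E[f₂] = −2/σ², can be verified by simulation. The check below, at σ = 1, is purely illustrative:

```python
import numpy as np

# Monte-Carlo check of the ALI expectations for a Gaussian with a fixed mean,
# evaluated at sigma = 1: E[f1*f2] = -6/sigma^3 and E[f2] = -2/sigma^2.
rng = np.random.default_rng(42)
sigma = 1.0
z = rng.standard_normal(2_000_000)       # standardized samples (x - mu)/sigma

f1 = (z**2 - 1) / sigma                  # d ln p / d sigma
f2 = (1 - 3 * z**2) / sigma**2           # d^2 ln p / d sigma^2

e_f1f2 = float(np.mean(f1 * f2))
e_f2 = float(np.mean(f2))
print(e_f1f2, e_f2)                      # close to -6 and -2
# The ratio -E[f1*f2]/E[f2] = -3/sigma integrates to the ALI prior 1/sigma^3.
```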

Derivation of the 1/σ q Priors as the Limiting States of NIG Conjugates under Gaussian Likelihood
In Bayesian models, for a given likelihood function, the posterior p(φ|x) and the prior p(φ) are called conjugate distributions if both are in the same family of distributions. The prior can be viewed as being reconditioned by encoding the evidence through the likelihood; the evidence or data merely change the distribution parameters of the prior and yield another distribution of the same type, but with a different set of parameters.
For a location-scale parameter vector (θ, σ 2 ) used in linear or linearized Bayesian models with a Gaussian likelihood, the corresponding conjugate prior is the NIG distribution. The closed-form posterior PDFs for the parameter and prediction are given in Appendix A.
Noninformative priors, including the Jeffreys', ALI, and reference priors, for (θ, σ²) are mostly in the form of 1/σ^q, q ∈ {1, 2, ...}. The uniform prior can be seen as the special case of 1/σ^q with q = 0. It is shown as follows that these noninformative priors can be obtained as certain limiting states of NIG conjugates of (θ, σ²). For example, the NIG distribution with parameters (α, β, Σ, µ) given by Equation (A3) can reduce to the Jeffreys' prior:

p(θ, σ²) ∝ 1/σ²    (18)

as:

α → −k/2, β → 0, Σ⁻¹ → 0.    (19)

Furthermore, the 1/σ^q-type of priors can all be derived as reduced NIG distributions. By assigning the initial values for α, β, and Σ of the NIG distribution, different q values are obtained. Table 1 presents priors with different q values as reduced NIG distributions, the corresponding α, β, and Σ of the NIG distributions, and the α*, β*, and Σ* of the resulting NIG posterior distributions. The posterior distribution of σ² is an inverse-Gamma distribution IG(α*, β*), and its PDF is given by:

p(σ²|y) = (β*^α*/Γ(α*))(σ²)^−(α*+1) exp(−β*/σ²).    (20)

The posterior distribution of θ is obtained by integrating out σ² from the joint posterior distribution of Equation (A6) as:

p(θ|y) = t_{2α*}(µ*, (β*/α*)Σ*),    (21)

which is a (2α*) degree-of-freedom multivariate t-distribution with a location vector of µ* and a shape matrix of (β*/α*)Σ*. In particular, when the NIG prior for (θ, σ²) is reduced to the Jeffreys' prior 1/σ² by Equation (19), the resulting posteriors of σ² and θ are:

p(σ²|y) = IG((n − k)/2, SSE/2)    (22)

and:

p(θ|y) = t_{n−k}(θ̂, [SSE/(n − k)](xᵀx)⁻¹), with θ̂ = (xᵀx)⁻¹xᵀy,    (23)

respectively. The term SSE in Equation (22) is the sum of squared errors, given by:

SSE = (y − xθ̂)ᵀ(y − xθ̂).    (24)

The prediction posterior in this case is:

p(ỹ|y) = t_{n−k}(x̃θ̂, [SSE/(n − k)][1 + x̃(xᵀx)⁻¹x̃ᵀ]),    (25)

where x̃ is the input vector of the prediction. The advantage of treating the 1/σ^q-type of noninformative priors as reduced NIG conjugates is that the Bayesian posteriors of the model prediction and parameters all have analytical forms, allowing for efficient evaluations without resorting to MCMC techniques, as done in regular Bayesian analysis. It is worth noting that the analytical results are obtained under the condition (or assumption) that the likelihood is Gaussian. This condition implies that the conjugate prior is an NIG.
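Under the Jeffreys' prior 1/σ², the posterior quantities above reduce to ordinary least-squares summaries. The sketch below assembles α*, β*, the posterior mean of σ², and the shape matrix of the multivariate t posterior of θ for a small hypothetical dataset (the data, design, and noise level are illustrative):

```python
import numpy as np

# Closed-form posterior under the Jeffreys' prior 1/sigma^2 for a Gaussian
# linear model y = x theta + eps (hypothetical data below).
rng = np.random.default_rng(1)
n, k = 12, 2
x = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # n x k design matrix
theta_true = np.array([0.5, 2.0])
y = x @ theta_true + 0.3 * rng.standard_normal(n)

theta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)          # posterior location
sse = float(np.sum((y - x @ theta_hat) ** 2))              # sum of squared errors

alpha_star = (n - k) / 2                                   # IG shape
beta_star = sse / 2                                        # IG scale
sigma2_mean = beta_star / (alpha_star - 1)                 # E[sigma^2 | y] (alpha* > 1)

# theta | y is multivariate t with n - k dof, location theta_hat, and
# shape matrix (beta*/alpha*) (x^T x)^{-1} = [SSE/(n-k)] (x^T x)^{-1}.
shape = (beta_star / alpha_star) * np.linalg.inv(x.T @ x)
print(theta_hat, sigma2_mean, np.diag(shape))
```

Because the posterior location equals the least-squares estimate, the residuals are orthogonal to the design columns, which is a convenient internal check.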
The values of the NIG parameters are obtained by equating the noninformative prior to the NIG prior. In other words, equating the two distributions for mathematical convenience is the premise for resolving those parameter values.

Assessment of Noninformative Priors
To evaluate the performance of priors, a Bayesian model assessment method using efficient asymptotic approximations is proposed. The idea is to recast the assessment as a model comparison problem. The participating models here are the posteriors obtained with different priors. The comparison can then be made using Bayes factors in a Bayesian hypothesis testing context.

Fitting Performance
The Bayes factor, on the basis of observed data y, evaluating the plausibility of two different models, M₁ and M₂, can be expressed as:

B₁₂ = p(y|M₁)/p(y|M₂) = [∫ p(y|θ, σ², M₁)p(θ, σ²|M₁)dθdσ²] / [∫ p(y|θ, σ², M₂)p(θ, σ²|M₂)dθdσ²].    (26)

The comparison of two models can be made based on the ratio of the posterior probabilities of the models:

p(M₁|y)/p(M₂|y) = B₁₂ · p(M₁)/p(M₂).    (27)

The ratio, when the prior probabilities of the models are equal, reduces to the Bayes factor of Equation (26). The assessment of the Bayes factor involves two integrals over the parameter space of (θ, σ²). The integrands for models M₁ and M₂ are p(y|θ, σ², M₁) · p(θ, σ²|M₁) and p(y|θ, σ², M₂) · p(θ, σ²|M₂), respectively.
For general multidimensional integration, asymptotic approximation and simulation-based estimation are two commonly used methods. The Laplace approximation method evaluates the integral using a multivariate normal distribution. Consider the integrand term in Equation (26), i.e., p(y|θ, σ², M₁) · p(θ, σ²|M₁). Drop the model symbol for simplicity of the derivation, and denote (θ, σ²) as φ. The natural logarithm of the integrand p(y|φ)p(φ), i.e., of p(y, φ), can be expressed using a Taylor expansion around a point φ′ as:

ln p(y, φ) = ln p(y, φ′) + ∇ln p(y, φ′)ᵀ(φ − φ′) + ½(φ − φ′)ᵀ∇²ln p(y, φ′)(φ − φ′) + O[(φ − φ′)³],    (28)

where ∇ln p(y, φ′) is the gradient of ln p(y, φ) evaluated at φ′, ∇²ln p(y, φ′) is the Hessian matrix of ln p(y, φ) evaluated at φ′, and O[·] denotes the higher-order terms. When the higher-order terms are negligible,

ln p(y, φ) ≈ ln p(y, φ′) + ∇ln p(y, φ′)ᵀ(φ − φ′) + ½(φ − φ′)ᵀ∇²ln p(y, φ′)(φ − φ′).    (29)

The gradient term is zero at the mode φ* of the distribution; therefore, expanding ln p(y, φ) around φ* eliminates the gradient term and yields:

ln p(y, φ) ≈ ln p(y, φ*) + ½(φ − φ*)ᵀ∇²ln p(y, φ*)(φ − φ*).    (30)

Exponentiate the above equation to obtain:

p(y, φ) ≈ p(y, φ*) exp[½(φ − φ*)ᵀ∇²ln p(y, φ*)(φ − φ*)].    (31)

Realizing that the first term of the above equation is a constant and the exponential term is the variable part of a multivariate normal distribution with a mean vector of φ* and a covariance matrix Σ = [−∇²ln p(y, φ*)]⁻¹, the integration of p(y, φ) writes:

∫ p(y, φ)dφ ≈ p(y, φ*)(2π)^(k+1)/2 det(Σ)^1/2.    (32)

Notice that θ is a k-dimensional vector, so φ = (θ, σ²) is a (k + 1)-dimensional vector. Using the result of Equation (32), the integral associated with the jth model in Equation (26) is:

P(y|M_j) ≈ p(y, (θ, σ²)*|M_j)(2π)^(k+1)/2 det(Λ_j)^1/2,    (33)

where (θ, σ²)* is the mode of the distribution and Λ_j is the covariance matrix, i.e., the inverse of the negative Hessian matrix of ln p(y, θ, σ²|M_j) evaluated at (θ, σ²)*. Proper numerical derivative methods based on finite difference schemes can achieve reliable results for the gradient and the Hessian in the above equation.
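A compact implementation of the result in Equation (32), with the Hessian formed by central finite differences, might look as follows. The quadratic test integrand is chosen so the exact integral is known (the Laplace approximation is exact for Gaussian integrands); all names and values are illustrative:

```python
import numpy as np

def laplace_integral(log_f, mode, h=1e-3):
    """Approximate the integral of exp(log_f(phi)) by
    f(mode) * (2*pi)^(d/2) * det(Sigma)^(1/2), Sigma = [-Hessian(log_f)]^{-1},
    with the Hessian formed by central finite differences at the mode."""
    d = len(mode)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (log_f(mode + e_i + e_j) - log_f(mode + e_i - e_j)
                       - log_f(mode - e_i + e_j) + log_f(mode - e_i - e_j)) / (4 * h**2)
    Sigma = np.linalg.inv(-H)
    return np.exp(log_f(mode)) * (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))

# Check on an integrand with a known integral: exp(-phi^T A phi / 2),
# whose integral is (2*pi)^(d/2) / sqrt(det A).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
log_f = lambda phi: -0.5 * phi @ A @ phi
approx = laplace_integral(log_f, mode=np.zeros(2))
exact = 2 * np.pi / np.sqrt(np.linalg.det(A))
print(approx, exact)
```

For non-Gaussian but unimodal, near-symmetric integrands, the same routine gives the asymptotic approximation used in the text, with the mode supplied by a numerical optimizer.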
Experience has shown that for a unimodal and approximately symmetric distribution, the Laplace approximation method can achieve sufficiently accurate results for engineering applications [38,39].
It should be noted that the 1/σ^q-type of noninformative prior is not a proper prior, i.e., the normalizing constant for 1/σ^q is not bounded for σ ∈ (0, +∞); therefore, it is necessary to limit the support of 1/σ^q to a proper range so as to have a finite normalizing constant. Assume the effective range of σ is (σ₋, σ₊) for q ∈ {0, 1, ...}; the normalizing constant is:

Z(σ) = ∫ from σ₋ to σ₊ of σ^−q dσ = (σ₋^(1−q) − σ₊^(1−q))/(q − 1) for q ≠ 1, and Z(σ) = ln(σ₊/σ₋) for q = 1.    (34)

Substitute p(θ, σ²) with 1/(Z(σ)σ^q) in Equation (3). The likelihood of Equation (33) then compares the fitting performance of the different 1/σ^q-type priors.
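The normalizing constant above splits into a logarithmic case at q = 1. A small helper (an illustrative sketch, with hypothetical truncation bounds) and a quadrature cross-check:

```python
import numpy as np

def Z(sigma_lo, sigma_hi, q):
    """Normalizing constant of 1/sigma^q on the truncated range (sigma_lo, sigma_hi)."""
    if q == 1:
        return np.log(sigma_hi / sigma_lo)
    return (sigma_lo ** (1 - q) - sigma_hi ** (1 - q)) / (q - 1)

# Cross-check against a brute-force Riemann sum on a fine grid.
lo, hi = 0.01, 10.0                    # hypothetical effective range of sigma
s = np.linspace(lo, hi, 2_000_001)
ds = np.diff(s)
results = {}
for q in (0, 1, 2, 3):
    numeric = float(np.sum(ds / s[:-1] ** q))   # left Riemann sum of sigma^{-q}
    results[q] = (float(Z(lo, hi, q)), numeric)
    print(q, results[q])
```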

Predictive Performance
The predictive performance of a model is measured by the likelihood of the data that are not used for model parameter estimation. The likelihood of the future measurement data ỹ under the posterior PDF p(θ, σ²|y) writes:

P(ỹ|y) = ∫ p(ỹ|θ, σ²)p(θ, σ²|y)dθdσ² = [∫ p(ỹ|θ, σ²)p(y|θ, σ²)p(θ, σ²)dθdσ²]/P(y),    (36)

where P(y) is evaluated using Equation (33). The comprehensive performance, integrating both the fitting performance and the predictive performance in terms of the likelihood in the joint space of (y × ỹ), is:

P(y, ỹ) = P(y)P(ỹ|y).    (37)

Realize that Equations (32), (36), and (37) can all be evaluated using the method of Laplace approximation, so the performance of different priors can quantitatively be compared. It is noted that this likelihood is the product of the fitting component and the prediction component, and the influence of the prior is incorporated in the fitting component, i.e., P(y).

Application Examples
To illustrate the proposed Bayesian inference with noninformative priors, two examples are presented. The results were compared with the classical least squares approach. The fitting and comprehensive performance of the 1/σ q -type of priors were evaluated using the method of Laplace approximation.

Low-Cycle Fatigue Reliability Assessment with Sparse Samples
Noninformative priors offer an advantage over the flat prior in probabilistic inference under small sample sizes. To examine this usefulness in realistic problems and to further compare the performance of different noninformative priors, a fatigue life prediction problem is presented.
The low-cycle fatigue testing data reported in ASTM E739-10 [40] were used and are shown in Table 2. The cyclic strain amplitudes (∆ε/2) were approximately 0.016, 0.0068, 0.0016, and 0.0005, and each level had two or three data points. The following log-linear model was adopted:

ln N = a₀ + a₁ ln(∆ε/2),    (38)

where a₀ and a₁ are model parameters that need to be identified. It should be noted that other variants of the above equation can be used to include material plasticity, the mean stress effect, etc. The discrepancy between the experimental data and the model prediction can be modeled using an uncertain variable with a zero mean and a standard deviation of σₑ. To compare with the regular least squares method, a Gaussian likelihood was assumed, and the Bayesian posterior is:

p(a₀, a₁, σₑ|D) ∝ (1/σₑ^q) ∏ from i = 1 to n of N(ln Nᵢ; a₀ + a₁ ln(∆εᵢ/2), σₑ²).    (39)

The exponent q takes the values of 0, 1, 2, ... to generate a flat prior and different noninformative priors, and (∆εᵢ, Nᵢ), i = 1, ..., n, is the ith of the total n experimental data points used for parameter estimation.
Notice that when q = 0, Equation (39) reduces to a regular least squares format. One data point (i = 6) was arbitrarily chosen for the prediction performance evaluation, and the remaining eight data points were used to estimate the model parameters using Equation (39).
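The q = 0 (least squares) fit of the log-linear strain-life model can be sketched as follows. The Table 2 data are not reproduced here, so the points below are hypothetical values placed at the four amplitude levels quoted in the text:

```python
import numpy as np

# Least-squares fit of ln N = a0 + a1 ln(de/2); equivalent to the flat-prior
# (q = 0) Bayesian MAP estimate. Strain amplitudes and lives are hypothetical.
strain = np.array([0.016, 0.016, 0.0068, 0.0068, 0.0016, 0.0016, 0.0005, 0.0005])
life = np.array([200.0, 240.0, 2.0e3, 2.6e3, 6.0e4, 7.5e4, 1.2e6, 1.5e6])

X = np.column_stack([np.ones_like(strain), np.log(strain)])
a, *_ = np.linalg.lstsq(X, np.log(life), rcond=None)     # (a0, a1)
resid = np.log(life) - X @ a
sigma_e = float(np.sqrt(np.sum(resid**2) / (len(life) - 2)))  # scatter estimate
print(a, sigma_e)   # a1 < 0: life decreases with strain amplitude
```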
To compare the performance of the 1/σ^q-type priors, q = 0, ..., 5 were used, covering the regular least squares, Jeffreys', and ALI priors. The likelihoods of the fitting performance and the prediction performance were evaluated using the method of Laplace approximation, and the results are shown in Figure 1. The total likelihood of the fitting and prediction results indicates that the Jeffreys' prior 1/σ² outperformed the others. The result associated with q = 0 is the Bayesian version of the regular least squares estimation. To further demonstrate the difference between the regular least squares and the noninformative Bayesian approach under this sample size, the following reliability problem was considered. The fatigue life given a prescribed probability of failure (POF) was calculated for the cyclic strain amplitude (∆ε/2) between 10⁻⁴ and 10⁻¹. Given a cyclic strain amplitude and a POF, the fatigue life N_POF can be expressed as:

N_POF = exp[F⁻¹(POF; ∆ε/2)],    (40)

where F⁻¹(·; ∆ε/2) is the inverse CDF of the predictive distribution of ln N at the given strain amplitude. For the regular least squares estimator, it can be obtained using the t-statistics given by Equation (A9), where γ/2 should be replaced with the prescribed POF because only the lower one-sided bound is needed. For the Bayesian estimator with Jeffreys' prior, the one-sided bound can trivially be obtained using the POF-quantile of the prediction results evaluated using MCMC samples of the posterior PDF. Figure 2 presents the comparison of the fatigue life results, where an obvious difference is observed. The results of the noninformative Bayesian method were larger than those of the regular least squares, indicating a longer fatigue life for the same risk level, e.g., a POF of 10⁻⁵.
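For the Jeffreys'-prior case, the POF-quantile can also be drawn from the closed-form posterior rather than a full MCMC run. The sketch below uses hypothetical stand-ins for the fit summaries (θ̂, (xᵀx)⁻¹, and SSE), since the Table 2 data are not reproduced here:

```python
import numpy as np

# Monte-Carlo sketch of N_POF under the Jeffreys' prior: draw (theta, sigma^2)
# from the closed-form posterior and take the POF-quantile of the predictive
# ln N. All fit summaries below are hypothetical.
rng = np.random.default_rng(2)
n, k = 8, 2
theta_hat = np.array([-5.0, -2.5])                 # hypothetical (a0, a1)
XtX_inv = np.array([[4.0, 0.6], [0.6, 0.1]])       # hypothetical (x^T x)^{-1}
sse = 0.5                                          # hypothetical sum of squared errors

m = 100_000
# sigma^2 | y ~ IG((n-k)/2, SSE/2); sample as 1 / Gamma(shape=(n-k)/2, scale=2/SSE)
sigma2 = 1.0 / rng.gamma((n - k) / 2, 2.0 / sse, size=m)
L = np.linalg.cholesky(XtX_inv)
theta = theta_hat + (rng.standard_normal((m, k)) @ L.T) * np.sqrt(sigma2)[:, None]

x_new = np.array([1.0, np.log(0.001)])             # predict at amplitude 0.001
ln_N_pred = theta @ x_new + rng.standard_normal(m) * np.sqrt(sigma2)
pof = 1e-3
N_pof = float(np.exp(np.quantile(ln_N_pred, pof)))
print(N_pof)                                       # POF-quantile of the life
```

Sampling θ conditionally on σ² and then adding the σ² observation noise reproduces the predictive t-distribution of the closed-form result.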

Aeroengine Turbine Disk Lifing and Failure Rate Assessment
The prescribed safe life is a critical parameter for airworthiness and service health management of turbine disks in aeroengines. To determine the prescribed safe life for turbine disks, life testing is performed using disk samples in a rotating pit. The number of tested pieces is usually less than five due to cost and time constraints; sometimes, only one or two test pieces are available. The determination of the prescribed safe life must be made carefully to mitigate the risk. A general and standard procedure described in [41] is centered on the idea that two sources of uncertainty, namely the sampling uncertainty and the life distribution uncertainty, should be incorporated. To account for the sampling uncertainty, the 95% single-sided lower bound of the sample mean is used as the mean of the life distribution. To account for the uncertainty of the life distribution itself, no more than one in 750 service disks is allowed to develop an engineering crack. The total safety margin is a 1/750 cumulative failure probability at a 95% confidence. The life is assumed to be a log-normal random variable, so the log-life is a normally distributed variable. Denote the mean and standard deviation of the normal distribution of the log-life as ln N_µ and σ. The 95% single-sided lower bound of the sample mean can be expressed as:

ln N_µ,95% = ln N_µ − λ_α σ/√n,    (41)

where N_µ is the geometric mean of the life, N_µ,95% is the single-sided lower bound of the geometric mean, λ_α = Φ⁻¹(1 − α), and Φ⁻¹(·) is the inverse cumulative distribution function (CDF) of the standard normal variable, used when the standard deviation σ is known. For the 95% single-sided lower bound, λ_α = Φ⁻¹(0.95) = 1.645. When the standard deviation is unknown, the estimate of the standard deviation based on the samples should be used. In such cases, the value of λ_α is obtained using the Student distribution with a degree of freedom (DOF) of n − 1, i.e., λ_α = t⁻¹(1 − α, n − 1).
The function t⁻¹(·, n − 1) is the inverse CDF of the Student distribution with a DOF of n − 1. The 1/750 failure probability log-life corresponds to a −3σ shift from the mean of the log-life distribution. When using ln N_µ,95% as the mean of the log-life distribution, the 1/750 failure probability life N_µ−3σ,95% writes:

ln N_µ−3σ,95% = ln N_µ,95% − 3σ.    (42)

Combine Equations (41) and (42) to have:

ln N_µ−3σ,95% = ln N_µ − λ_α σ/√n − 3σ.    (43)

The safety factor that ensures a 1/750 failure probability at a 95% confidence can be expressed by transforming Equation (43) into the linear scale as:

SF = N_µ/N_µ−3σ,95% = exp(λ_α σ/√n + 3σ).    (44)

It is seen that the prescribed safe life depends on the number of tested samples n and the standard deviation of the distribution. The deterministic prescribed safe life A_r is finally obtained as:

A_r = N_µ/SF.    (45)

The cumulative burst probability when the disk has been used for y (log-transformed) cycles can be expressed as:

F_b(y|x′) = Φ[(y − x′)/σ],    (46)

where x′ is the log-transformed geometric mean of the samples and y is the log-transformed number of cycles, i.e., y = ln N. The failure rate in terms of failures per cycle can be expressed as:

h(y) = f_b(y|x′)/[1 − F_b(y|x′)],    (47)

where f_b(y|x′) is the burst PDF:

f_b(y|x′) = [1/(σ√(2π))] exp[−(y − x′)²/(2σ²)].    (48)

The sampling uncertainty in the log-transformed geometric mean x′ can be quantified using the sampling PDF. The sample mean x′ distributes according to:

x′ ∼ N(x̄, σ_s²) for a known σ, or x′ ∼ x̄ + (s/√n)t(n − 1) for an unknown σ,    (49)

where s is the standard deviation of the sample, t(n − 1) is the standard Student distribution with a DOF of n − 1, and σ_s = σ/√n. The final failure rate can be obtained by marginalizing out the sampling error variable x′ as:

h(y) = [∫ f_b(y|x′)p(x′)dx′]/[1 − ∫ F_b(y|x′)p(x′)dx′].    (50)

With the above discussion, turbine disk life testing results on two disks were obtained, as shown in Table 3. It is seen that the key piece of information is the standard deviation of the population. For one thing, whether it is considered a known quantity requires justification, and its actual value relies on empirical evidence and historical data. For another, when it needs to be estimated from the samples, a proper method should be used due to the limited sample size.

Existing Method
Previous disk life testing results showed that the scatter factor, defined as the ratio of the maximum life (N_max) over the minimum life (N_min), is less than six. Furthermore, N_max and N_min are considered as the +3σ and −3σ points of the normal distribution of the log-life. The empirical evidence of N_max/N_min = 6, combined with the assumption that ln N_max − ln N_min = 6σ, yields:

σ = (ln 6)/6.    (51)

The safety factor that ensures a 1/750 failure probability at a 95% confidence can be expressed, by substituting Equation (51) into Equation (43), as:

SF = exp[(λ_α/√n + 3)(ln 6)/6].    (52)

Given the assumption that the standard deviation of the log-life distribution is known, the first case of Equation (49) is used. In addition, the term 1 − F_b(y|x′) ≈ 1 for rare events. In such a case, the hazard rate function of Equation (50) reduces to the convolution of two normal PDFs, and the resulting hazard rate function is:

h(y) = [1/√(2π(σ_b² + σ_s²))] exp[−(y − x̄)²/(2(σ_b² + σ_s²))].    (53)

Note that y = ln N; the hazard rate in terms of failures per flight hour is found by using the chain rule as:

h_t(t) = h(ln(βt)) · (1/t),    (54)

where β is the exchange rate (cycles/hour) converting cycles to equivalent engine flight hours. Using the data in Table 3, the geometric mean of the life is N_µ = 24,205 cycles, the standard deviation of the log-life distribution is σ_b = (ln 6)/6, and the standard deviation of the sampling distribution is σ_s = (ln 6)/(6√2). Substituting these parameters into Equation (45) gives the prescribed safe life A_r = 6981 cycles.
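The safety factor and prescribed safe life quoted above can be reproduced directly from the stated values; a minimal check:

```python
import math

# Reproduce the existing-method numbers from the values stated in the text.
N_mu = 24_205                     # geometric mean life of the two disks (cycles)
n = 2                             # number of tested disks
sigma_b = math.log(6) / 6         # scatter-factor assumption: ln Nmax - ln Nmin = 6 sigma
sigma_s = sigma_b / math.sqrt(n)  # sampling std. dev. of the log mean
lam = 1.645                       # Phi^{-1}(0.95), sigma treated as known

SF = math.exp(lam * sigma_b / math.sqrt(n) + 3 * sigma_b)  # safety factor
A_r = N_mu / SF                   # prescribed safe life (cycles)
print(SF, A_r)                    # A_r comes out near the quoted 6981 cycles
```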

The Proposed Noninformative Bayesian Method
Without relying on the empirical evidence and the assumption, the noninformative Bayesian estimator for (µ, σ) of the log-life distribution can be obtained by writing the posterior as the likelihood of the samples with a noninformative prior. For demonstration, the prior ∝ 1/σ² was used, and the posterior writes:

p(µ, σ|D) ∝ (1/σ²) ∏ from i = 1 to n of N(ln Nᵢ; µ, σ²),    (55)

where D denotes the event of the observed life samples N₁, ..., N_n. To estimate µ and σ, 1 × 10⁷ samples were drawn from Equation (55) using the MCMC method. The resulting histogram of the samples is shown in Figure 3. The 95% single-sided confidence bound µ_95% can be estimated using the 0.05-quantile of the MCMC samples as 9.86565, or 19,259 in the linear scale. Centering on µ_95% = 9.86565, the value corresponding to a 1/750 cumulative failure probability was found by subtracting 3σ from µ_95%. The term σ was estimated using the MCMC sample mean and in this case was found to be 0.1893. Consequently, after applying the two safety margins, the prescribed safe life (in log scale) using the noninformative Bayesian method was found as µ_95% − 3σ = 9.2976, or A_r = 10,911 cycles. The failure rates calculated using the existing method and the proposed noninformative Bayesian method are compared in Figure 4. The failure rate per flight hour was based on an exchange rate of β = 2.5.
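A minimal random-walk Metropolis sketch of Equation (55) is given below. The Table 3 disk lives are not reproduced here, so the two life values are hypothetical, and the chain is far shorter than the 10⁷ samples used in the study:

```python
import numpy as np

# Random-walk Metropolis sketch of p(mu, sigma | D) ∝ (1/sigma^2) prod N(ln Ni; mu, sigma^2).
# The two disk lives below are hypothetical stand-ins for the Table 3 data.
y = np.log(np.array([20_000.0, 29_000.0]))        # log-lives of two test disks

def log_post(mu, sigma):
    if sigma <= 0:
        return -np.inf
    return (-2.0 * np.log(sigma)                  # noninformative prior 1/sigma^2
            - len(y) * np.log(sigma)              # Gaussian likelihood (normalization)
            - np.sum((y - mu) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(3)
mu, sigma = float(y.mean()), float(y.std()) + 0.1
lp = log_post(mu, sigma)
samples = np.empty((100_000, 2))
for i in range(len(samples)):
    mu_p = mu + 0.2 * rng.standard_normal()       # symmetric random-walk proposal
    sigma_p = sigma + 0.2 * rng.standard_normal()
    lp_p = log_post(mu_p, sigma_p)
    if np.log(rng.random()) < lp_p - lp:          # Metropolis accept/reject
        mu, sigma, lp = mu_p, sigma_p, lp_p
    samples[i] = mu, sigma

post = samples[20_000:]                           # discard burn-in
mu_95 = float(np.quantile(post[:, 0], 0.05))      # 95% single-sided lower bound
print(mu_95, float(np.median(post[:, 0])))
```

With only two observations the posterior of µ is heavy tailed, so robust summaries (quantiles, median) are preferable to the raw sample mean.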
Based on the results shown in Figure 4, the usable flight hours obtained using the noninformative Bayesian method were about 1000 h more than those obtained using the existing method given the same risk constraint. For example, the flight hours corresponding to a failure rate of 10 −8 /h were 2734 and 1734, respectively, for the two methods. When the failure rate was required to be 10 −6 /h, the usable flight hours were 3817 and 2720, respectively. It should be noted that the proposed noninformative Bayesian method does not rely on empirical evidence and assumptions other than the log-normal distribution for disk life.

Conclusions
The study developed a noninformative Bayesian inference method for small sample problems. In particular, a 1/σ^q-type of prior was proposed to formulate the Bayesian posterior. It was shown that this type of prior can represent the classical Jeffreys' prior (Fisher information), the flat prior, and the asymptotically locally invariant prior. More importantly, this type of prior was derived as the limiting state of normal-inverse-Gamma (NIG) conjugate priors, allowing for fast and efficient evaluations of the Bayesian posteriors and predictors in closed-form expressions under a Gaussian likelihood. To illustrate the overall method and compare the performance of noninformative priors, a realistic fatigue reliability problem was discussed in detail. The combined fitting and prediction performance in terms of the likelihood was used to evaluate different priors. It was observed that Jeffreys' prior, 1/σ², yielded the maximum likelihood. The developed method was further applied to an actual aeroengine disk lifing problem with two test samples, and the results on the prescribed safe life and the risk in terms of failures/hour were comparable with those obtained using the existing method incorporating the empirical evidence and assumptions. It should be emphasized that the study does not promote abandoning justified experience and/or information in prior construction. It was argued in this study that in the absence of such data or experience, the noninformative prior should be used to avoid introducing unjustified information into the prior. The following conclusions were drawn based on the current results:
• The 1/σ^q-type of priors can be used equivalently as NIG conjugate priors at their limiting states. The great features of conjugate priors, such as having analytical posterior and prediction PDFs under a Gaussian likelihood, are retained when noninformative priors are used for Bayesian linear regression analysis;
• For the 1/σ^q-type of noninformative priors, the classical Jeffreys' prior 1/σ² yielded the optimal fitting and prediction performance in terms of the likelihood or Bayes factors; when q = 0, the Bayesian estimator reduces to the regular least squares estimator.
The results of the two case studies showed the advantage of using the noninformative Bayesian estimator with the ∝ 1/σ² prior over the regular least squares estimator and other empirical methods under small sample sizes.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Conjugate Prior, Posterior, and Prediction
Assume a general linear model with a parameter vector θ and a Gaussian likelihood with a scale parameter σ², i.e., the modeling error ε_i = (y_i − x_iθ) ∼ Norm(0, σ²).
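As a minimal stdlib-only sketch of this setup (the parameter values are hypothetical), the linear-Gaussian model and its errors can be simulated as follows:

```python
import random

# Simulate the linear-Gaussian model: y_i = x_i * theta + eps_i with
# eps_i ~ Norm(0, sigma^2). Parameter values here are hypothetical.
random.seed(1)
theta, sigma = 2.5, 0.3
xs = [0.1 * i for i in range(1, 21)]
ys = [x * theta + random.gauss(0.0, sigma) for x in xs]

errors = [y - x * theta for x, y in zip(xs, ys)]   # the modeling errors eps_i
print(len(ys))
```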

Appendix A.1. Conjugate Prior
It is known that when the likelihood function belongs to the exponential family, a conjugate prior exists and also belongs to the exponential family. To obtain the joint conjugate using p(θ, σ²) = p(θ|σ²)p(σ²), it is necessary to look at the conjugate prior of σ² first. For a Gaussian likelihood with an unknown variance σ² and zero mean, the conjugate prior for σ² is an inverse-Gamma distribution, i.e., p(σ²) ∼ IG(α, β), expressed as:

p(σ²) = [β^α/Γ(α)] (σ²)^(−α−1) exp(−β/σ²), (A1)

where α > 0 and β > 0 are shape and scale parameters, respectively. The posterior distribution of σ², given n data points, has the following new shape and scale parameters:

α_n = α + n/2, β_n = β + SSE/2, (A2)

where SSE = ∑_i^n (y_i − x_iθ)² is the sum of squared errors. Considering the joint conjugate of (θ, σ²), a widely adopted conjugate prior for Bayesian linear regression is the normal-inverse-Gamma (NIG) prior for (θ, σ²), which can be written as Equation (A3):

p(θ, σ²) = NIG(μ, V, α, β) = p(θ|σ²)p(σ²), (A3)

where:

p(θ|σ²) = Norm(μ, σ²V), (A4)

and:

p(σ²) = IG(α, β). (A5)

Using the inverse-Gamma distribution for σ², the initial distribution parameters (α, β) still need to be determined. Unfortunately, there is no formal rule to do that. Gelman [35] discussed the inverse-Gamma distribution IG(ε, ε), where ε is set to a low value such as 1, 0.01, or 0.001. A difficulty of this prior is that ε must be set to a proper value: inferences become very sensitive to ε for cases where low values of σ² are possible. For this reason, the IG(ε, ε) family of priors for σ² was not recommended by the author. It was further argued that inference based on it for cases where σ² is estimated near zero is the case for which classical and Bayesian inferences differ the most [35]. Browne and Draper [42] also mentioned this disadvantage and suggested using a uniform prior U(0, 1/ε) to reduce this defect. However, choosing an appropriate value of ε for U(0, 1/ε) still requires attention to and understanding of the data.
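The inverse-Gamma conjugate update for σ² described above can be sketched numerically as follows (the prior hyperparameters and the simulated data are illustrative only):

```python
import math
import random

# Inverse-Gamma conjugate update for the variance of a zero-mean Gaussian
# likelihood: prior IG(alpha, beta) -> posterior IG(alpha + n/2, beta + SSE/2).
random.seed(0)
true_var = 4.0
data = [random.gauss(0.0, math.sqrt(true_var)) for _ in range(50)]

alpha0, beta0 = 2.0, 2.0                # illustrative prior hyperparameters
n = len(data)
sse = sum(x * x for x in data)          # zero-mean model: SSE is the sum of squares

alpha_n = alpha0 + n / 2.0              # posterior shape
beta_n = beta0 + sse / 2.0              # posterior scale

posterior_mean = beta_n / (alpha_n - 1.0)   # mean of IG(alpha_n, beta_n)
print(round(posterior_mean, 3))
```

With a moderate sample, the posterior mean of σ² lands near the true variance used to generate the data, as expected from the conjugate update.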
Indeed, the effect of the prior and its parameters depends strongly on the data and the problem, and the sensitivity of Bayesian inferences to the prior can differ dramatically between sparse and extensive data, since with large samples the data dominate the posterior by the law of large numbers.
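The sensitivity of the IG(ε, ε) prior under sparse data can be illustrated with a short sketch (the three-point residual sample is hypothetical): with only n = 3 observations, the posterior mean of σ² changes by more than an order of magnitude across commonly suggested values of ε.

```python
# Sensitivity of the IG(eps, eps) prior for sigma^2 under sparse data (n = 3):
# the posterior mean of sigma^2 shifts strongly with eps when the data barely
# constrain the variance. The residual values below are hypothetical.
data = [0.12, -0.05, 0.08]
n = len(data)
sse = sum(x * x for x in data)

means = []
for eps in (1.0, 0.01, 0.001):
    alpha_n = eps + n / 2.0             # posterior shape under IG(eps, eps)
    beta_n = eps + sse / 2.0            # posterior scale
    means.append(beta_n / (alpha_n - 1.0))
print(means)
```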

Appendix A.3. Prediction
The Bayesian prediction of the response given the data y can be obtained using the above distributions with a new set of input variables x̃. Denote the corresponding prediction as ỹ. The posterior predictive distribution of ỹ can be expressed as:

p(ỹ|y) = ∫∫ p(ỹ|θ, σ²) p(θ, σ²|y) dθ dσ² = St_{2α_n}(x̃μ_n, (β_n/α_n)(I + x̃V_n x̃ᵀ)), (A8)

where (μ_n, V_n, α_n, β_n) are the parameters of the posterior NIG distribution. When a single point is evaluated, Equation (A8) reduces to a one-dimensional Student's t-distribution, and the confidence interval is readily evaluated using the inverse Student's t-distribution. For multipoint simultaneous evaluations, the interval contours can be approximated using a few methods, as suggested in [43].
To compare, the simple linear regression results for the prediction mean and bounds are given. The prediction mean is x̃θ̂, and the prediction bounds at level (1 − γ) are:

ỹ_{1−γ} = x̃θ̂ ± tinv(γ/2, n − k) · σ̃_y, (A9)

where θ̂ = (xᵀx)⁻¹(xᵀy) is the maximum likelihood estimator of θ. It is known that θ̂ is also identical to the least squares estimator and the Bayesian estimator, i.e., Equation (23). The term tinv(γ/2, n − k) represents the γ/2-percentile of the standard Student's t-distribution with n − k degrees of freedom, and σ̃_y is the standard error of the prediction, given by:

σ̃_y = [(eᵀe/(n − k)) · (1 + x̃(xᵀx)⁻¹x̃ᵀ)]^{1/2}, (A10)

where e = y − xθ̂ is the discrepancy vector between the model and the observations, also called the residual or error. When multiple points are estimated, the F-statistic can be used instead of the t-statistic to yield the Working–Hotelling confidence bands [44]:

ỹ_{1−γ} = x̃θ̂ ± [k · Finv(γ, k, n − k)]^{1/2} · σ̃_y, (A11)

where Finv(γ, k, n − k) represents the γ-percentile of the F-distribution with (k, n − k) degrees of freedom.
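The least-squares fit and the prediction standard error of Equations (A9) and (A10) can be sketched for a straight-line model using only the standard library (the data are illustrative; the t-quantile factor is omitted since it requires a statistics library):

```python
# Least-squares fit, prediction mean, and prediction standard error for a
# straight-line model y = a + b*x (stdlib only; data are illustrative).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]
n, k = len(xs), 2                        # k = 2 parameters (intercept, slope)

xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

e = [y - (a + b * x) for x, y in zip(xs, ys)]   # residual vector e = y - x*theta_hat
s2 = sum(ei ** 2 for ei in e) / (n - k)         # error variance estimate e'e/(n - k)

x_new = 6.0                                     # new input point
y_pred = a + b * x_new                          # prediction mean
# standard error of a new observation at x_new (scalar form of the prediction
# standard error: 1 + x_tilde (x'x)^{-1} x_tilde' = 1 + 1/n + (x_new - xbar)^2/sxx)
se_pred = (s2 * (1.0 + 1.0 / n + (x_new - xbar) ** 2 / sxx)) ** 0.5
print(round(y_pred, 3), round(se_pred, 3))
```

Multiplying `se_pred` by the appropriate t-quantile then gives the prediction bounds of Equation (A9).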

Appendix B. Fisher Information Matrix and Kullback-Leibler Divergence
The equivalence of the Fisher information matrix and the Kullback-Leibler divergence can be shown as follows. The matrix form of the Fisher information matrix of a continuous PDF p_θ(x) with a parameter vector θ writes:

I(θ) = E[∇ln p_θ(x) ∇ln p_θ(x)ᵀ]. (A12)

Under the condition that the integration and derivation may be exchanged and the log-function is twice differentiable, the following Equation (A13) can be attained:

E[∇ln p_θ(x)] = ∫ ∇p_θ(x) dx = ∇ ∫ p_θ(x) dx = 0. (A13)

It is noted that:

∇²ln p_θ(x) = ∇²p_θ(x)/p_θ(x) − [∇p_θ(x)/p_θ(x)][∇p_θ(x)/p_θ(x)]ᵀ, (A14)

and E[∇²p_θ(x)/p_θ(x)] = ∫ ∇²p_θ(x) dx = 0. The following result, that the negative expected Hessian matrix of the log-function is equal to the Fisher information matrix, can be obtained:

I(θ) = −E[∇²ln p_θ(x)], (A15)

where ∇²ln p_θ(x) is recognized as the Hessian matrix of ln p_θ(x), denoted as H_{ln p_θ(x)}. The Fisher information matrix can loosely be thought of as the curvature matrix of the log-function graph.

The Kullback-Leibler divergence, also called the relative entropy, between two distributions p_θ and p_θ' can be written as:

D(θ||θ') = ∫ p_θ(x) ln[p_θ(x)/p_θ'(x)] dx. (A16)

Expand ln p_θ'(x) around θ to the second order to have:

ln p_θ'(x) ≈ ln p_θ(x) + (θ' − θ)ᵀ∇ln p_θ(x) + (1/2)(θ' − θ)ᵀ∇²ln p_θ(x)(θ' − θ). (A17)

Substitute Equation (A17) into Equation (A16):

D(θ||θ') ≈ −Δθᵀ E[∇ln p_θ(x)] − (1/2)Δθᵀ E[∇²ln p_θ(x)] Δθ, (A18)

where Δθ = (θ' − θ). The first term of the right-hand side of Equation (A18) is zero by Equation (A13), and the second term of the right-hand side is related to Equation (A15). The KL divergence can finally be expressed as:

D(θ||θ') ≈ (1/2)Δθᵀ I(θ) Δθ. (A19)

Furthermore, differentiate Equation (A16) with respect to θ' to obtain:

∇_θ' D(θ||θ') = −∫ p_θ(x) [∇p_θ'(x)/p_θ'(x)] dx. (A20)

Continue to differentiate Equation (A20) with respect to θ' to obtain:

∇²_θ' D(θ||θ') = ∫ p_θ(x) {[∇p_θ'(x)/p_θ'(x)][∇p_θ'(x)/p_θ'(x)]ᵀ − ∇²p_θ'(x)/p_θ'(x)} dx. (A21)

Notice that the expectation of the second term in Equation (A21), when θ' = θ, reads:

∫ ∇²p_θ(x) dx = 0, (A22)

and the first term is the Fisher information matrix. As a result, the Fisher information matrix is the Hessian matrix of the Kullback-Leibler divergence evaluated at the true parameter θ:

∇²_θ' D(θ||θ')|_{θ'=θ} = I(θ). (A23)
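The identity above can be checked numerically for a one-dimensional Gaussian with known variance, where I(θ) = 1/s² and the KL divergence has the closed form D(θ||θ') = (θ' − θ)²/(2s²); a central finite difference recovers the Fisher information (the values of s and θ are illustrative):

```python
# Numerical check of the Fisher-information/KL-Hessian identity for a
# one-dimensional Gaussian Norm(theta, s^2) with known s: the Fisher
# information is 1/s^2, and D(theta||theta') = (theta' - theta)^2 / (2 s^2)
# has the same second derivative at theta' = theta.
s = 2.0
theta = 0.7
fisher = 1.0 / s ** 2                    # I(theta) = 1/s^2 for known-variance Gaussian

def kl(theta_p):
    # Closed-form KL divergence between Norm(theta, s^2) and Norm(theta_p, s^2)
    return (theta_p - theta) ** 2 / (2.0 * s ** 2)

h = 1e-4                                 # central finite-difference step
hessian = (kl(theta + h) - 2.0 * kl(theta) + kl(theta - h)) / h ** 2
print(fisher, hessian)
```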