Optimal Sample Size for the Birnbaum–Saunders Distribution under Decision Theory with Symmetric and Asymmetric Loss Functions

Abstract: The fatigue-life or Birnbaum–Saunders distribution is an asymmetrical model that has been widely applied in several areas of science and mainly in reliability. Although diverse methodologies related to this distribution have been proposed, the problem of determining the optimal sample size when estimating its mean has not yet been studied. In this paper, we derive a methodology to determine the optimal sample size under a decision-theoretic approach. In this approach, we consider symmetric and asymmetric loss functions for point and interval inference. Computational tools in the R language were implemented to use this methodology in practice. An illustrative example with real data is also provided to show potential applications.


Introduction
The determination of the sample size is a relevant topic in all studies where statistical methods are applied. For example, in clinical trials, this determination is discussed in detail in Chapter 6 of [1] and the references therein, which give an overview of the topic. The optimal sample size in the classical statistical setting depends crucially on the alternative hypothesis. However, this is not the case in a Bayesian framework, where there is no need to state a specific alternative hypothesis.
In order to determine the sample size in any knowledge area, prior information must be available. Introducing uncertainty into this information is essentially a Bayesian approach. Hence, the use of Bayesian methods for determining an optimal sample size should be explored within the distributional framework that is relevant to the problem under study. In general, there are empirical limitations that require sample sizes to be determined in advance. Therefore, we can determine an optimal sample size that satisfies a criterion based on the Bayes risk. Given specific loss and sampling cost functions, a full Bayesian analysis may be performed to determine an optimal sample size. In the absence of precise information on costs and losses, the loss functions can be approximated, and a Bayesian approach might be employed to provide reasonable estimates of the optimal sample size. For more details about this methodology, see [2][3][4][5][6][7][8] and the references therein.
Birnbaum and Saunders [9] introduced a family of distributions to model failure times for metals subject to periodic stress. The authors provided a natural physical justification for this family, which is known as the fatigue-life or Birnbaum-Saunders (BS) distribution. In the last few decades, this distribution has received considerable attention in the literature, and many methodologies have been proposed for parameter inference. Such attention is justified by its wide applicability, and its variations have been applied in several areas [10][11][12][13][14][15][16], but mainly in reliability [17][18][19]. A detailed review of the BS distribution including methodologies under the classical and Bayesian approaches was presented in [20]. Recently, guidelines about the minimum sample size for monitoring the BS median in quality control under a classical statistical approach were presented in [21]. Although diverse methodologies related to the BS distribution have been proposed, the problem of determining an optimal sample size when estimating the BS mean has not yet been studied.
The main objective of this paper is to derive a methodology for determining the optimal sample size when estimating the mean of the BS distribution under decision theory. We develop a methodology via a Bayesian decision-theoretic approach based on a criterion that minimizes the Bayes risk and sampling cost. The proposed approach depends on ad-hoc loss functions (symmetric or asymmetric) defined to accommodate the implications of a decision. We consider three loss functions for point inference and two for interval inference. Computational tools in the R language were developed to use this methodology in practice.
The paper unfolds as follows. In Section 2, we provide background on the BS distribution, the inference of its parameters, and the Bayesian approach. Section 3 presents the methodology to obtain the optimal sample size under a decision-theoretic approach. In Section 4, we show the use of the main functions and methods of an R package implemented by the authors for the present work [22]. An illustrative example is also provided in this section. Finally, we conclude with a discussion of the results in Section 5, including ideas for future research.

The Birnbaum-Saunders Model
In this section, we present the properties of the BS distribution, discuss the inference of its parameters, and outline the Bayesian approach. Much of the background information about the BS distribution presented in this section has been gathered from other works [9,20,23,24].

Properties
Let X be a BS distributed random variable with shape parameter α > 0 and scale parameter β > 0, which we denote by X ∼ BS(α, β). Then, the probability density function of X is given by:

$$ f_X(x; \alpha, \beta) = \frac{1}{2\alpha\beta\sqrt{2\pi}} \left[ \left(\frac{\beta}{x}\right)^{1/2} + \left(\frac{\beta}{x}\right)^{3/2} \right] \exp\left\{ -\frac{1}{2\alpha^2}\left( \frac{x}{\beta} + \frac{\beta}{x} - 2 \right) \right\}, \quad x > 0. $$

Besides being a scale parameter, β is also the median of the BS distribution. Furthermore, the mean and variance of the BS distribution are stated as:

$$ \mu = \mathrm{E}[X] = \beta\left( 1 + \frac{\alpha^2}{2} \right), \qquad \mathrm{Var}[X] = (\alpha\beta)^2\left( 1 + \frac{5\alpha^2}{4} \right). \tag{1} $$

Moreover, if X is BS distributed, then:

$$ X = \beta\left[ \frac{\alpha Z}{2} + \sqrt{\left(\frac{\alpha Z}{2}\right)^2 + 1} \right]^{2}, \tag{2} $$

where Z follows a standard normal distribution, which is useful to generate random values from the BS(α, β) distribution. It is possible to show that:

$$ Z = \frac{1}{\alpha}\left( \sqrt{\frac{X}{\beta}} - \sqrt{\frac{\beta}{X}} \right) \sim \mathrm{N}(0, 1). $$

Some useful properties of the BS distribution are:
P1: If X ∼ BS(α, β), then cX ∼ BS(α, cβ) for c > 0; that is, the BS distribution is a homogeneous family.
P2: If X ∼ BS(α, β), then 1/X ∼ BS(α, 1/β); that is, the BS distribution is invariant under the reciprocal transformation. This property can be important in financial applications.
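As an illustration, the representation in (2) yields a simple random number generator; below is a minimal R sketch, where rbs() is our illustrative name rather than a function of the samplesizeBS package:

```r
# Minimal sketch: draw BS(alpha, beta) values via the normal
# representation in (2); rbs() is our illustrative name.
rbs <- function(n, alpha, beta) {
  z <- rnorm(n)  # Z ~ N(0, 1)
  beta * (alpha * z / 2 + sqrt((alpha * z / 2)^2 + 1))^2
}
set.seed(2021)
x <- rbs(1e4, alpha = 0.5, beta = 4)
c(sample_mean = mean(x), mu = 4 * (1 + 0.5^2 / 2))  # compare with (1)
```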

Inference
Given a sample X_n = (X_1, . . . , X_n) and its observed values x_n = (x_1, . . . , x_n), the modified moment estimates [25] for α and β are expressed, respectively, as:

$$ \tilde{\alpha} = \left\{ 2\left[ \left( \frac{\bar{x}_a}{\bar{x}_h} \right)^{1/2} - 1 \right] \right\}^{1/2}, \qquad \tilde{\beta} = (\bar{x}_a \bar{x}_h)^{1/2}, $$

where x̄_a is the sample arithmetic mean and x̄_h is the sample harmonic mean, that is,

$$ \bar{x}_a = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x}_h = \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i} \right]^{-1}. $$

In addition, the estimates α̃ and β̃ are well defined because x̄_a ≥ x̄_h > 0; that is, these estimates are always positive and unique. We may use α̃ and β̃ as initial values in a sampling algorithm. Furthermore, the likelihood function obtained from the BS(α, β) probability density function satisfies:

$$ L(\alpha, \beta; \boldsymbol{x}_n) \propto \alpha^{-n} \beta^{-n} \prod_{i=1}^{n} \left[ \left(\frac{\beta}{x_i}\right)^{1/2} + \left(\frac{\beta}{x_i}\right)^{3/2} \right] \exp\left\{ -\frac{1}{2\alpha^2} \sum_{i=1}^{n} \left( \frac{x_i}{\beta} + \frac{\beta}{x_i} - 2 \right) \right\}. \tag{3} $$

For the parameters α and β of the model, we consider proper prior distributions because the use of non-informative prior distributions yields an improper posterior distribution and continuous conjugate priors do not exist [26]. A possible choice of a prior distribution for β is the inverse gamma (IG) distribution, whose probability density function satisfies:

$$ \pi(\beta) = \frac{b_1^{a_1}}{\Gamma(a_1)}\, \beta^{-(a_1 + 1)} \exp\left( -\frac{b_1}{\beta} \right), \quad \beta > 0, $$

where a_1 and b_1 are positive and known constants (hyperparameters) of the IG distribution. We denote this as β ∼ IG(a_1, b_1), that is, the IG distribution with parameters a_1 and b_1. We also assume an IG prior distribution for α² with hyperparameters a_2 and b_2. Thus, the model can be written hierarchically as:

$$ X_i \mid \alpha^2, \beta \overset{\text{IID}}{\sim} \mathrm{BS}(\alpha, \beta), \quad i = 1, \ldots, n; \qquad \alpha^2 \sim \mathrm{IG}(a_2, b_2); \qquad \beta \sim \mathrm{IG}(a_1, b_1), $$

where "IID" stands for independent and identically distributed. In this context, the conditional posterior distribution of α² given β and x_n is stated as:

$$ \alpha^2 \mid \beta, \boldsymbol{x}_n \sim \mathrm{IG}\left( \frac{n}{2} + a_2, \; b_2 + \frac{1}{2}\sum_{i=1}^{n}\left( \frac{x_i}{\beta} + \frac{\beta}{x_i} - 2 \right) \right), \tag{4} $$

whereas the marginal posterior distribution of β given x_n can be obtained from:

$$ \pi(\beta \mid \boldsymbol{x}_n) \propto \pi(\beta) \int_{0}^{\infty} L(\alpha, \beta; \boldsymbol{x}_n)\, \pi(\alpha^2)\, \mathrm{d}\alpha^2, \tag{5} $$

which is not a known distribution [26]. In this way, we use the random walk Metropolis-Hastings algorithm [27] to generate samples from the marginal posterior distribution of β given x_n. Using this sampling algorithm and the posterior distribution defined in (4), we may generate values from the joint posterior distribution of α² and β: for a given x_n, first, we generate values of β from (5), and with them, we draw values of α² using (4). Note that the parameter of interest µ is the mean of the BS distribution, which is a function of α² and β. In order to obtain a random sample from the posterior distribution of µ given x_n, we may draw values from the joint posterior of α² and β and then apply the expression defined in (1) to each sampled pair of values.
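As a concrete illustration, the following R sketch (ours, not the samplesizeBS implementation) codes one iteration of this scheme; the kernel used for (5) assumes α² has been integrated out of the joint posterior under the IG priors above:

```r
# Sketch of one sampler iteration: random-walk Metropolis-Hastings for beta
# using the kernel of (5), then alpha^2 drawn from the inverse gamma in (4).
log_post_beta <- function(beta, x, a1, b1, a2, b2) {
  if (beta <= 0) return(-Inf)
  n <- length(x)
  S <- sum(x / beta + beta / x - 2)
  -n * log(beta) +
    sum(log((beta / x)^(1/2) + (beta / x)^(3/2))) -  # likelihood factor in beta
    (n / 2 + a2) * log(b2 + S / 2) -                 # alpha^2 integrated out
    (a1 + 1) * log(beta) - b1 / beta                 # log IG(a1, b1) prior kernel
}
mh_step <- function(beta, x, a1, b1, a2, b2, sd_prop = 0.1) {
  beta_new <- beta + rnorm(1, sd = sd_prop)          # random-walk proposal
  log_ratio <- log_post_beta(beta_new, x, a1, b1, a2, b2) -
               log_post_beta(beta,     x, a1, b1, a2, b2)
  if (log(runif(1)) < log_ratio) beta <- beta_new    # accept/reject
  S <- sum(x / beta + beta / x - 2)
  alpha2 <- 1 / rgamma(1, shape = length(x) / 2 + a2, rate = b2 + S / 2)  # (4)
  c(alpha2 = alpha2, beta = beta)
}
```

The modified moment estimates α̃ and β̃ can serve as the starting point for the chain.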

Optimal Sample Size
In this section, we introduce the methodology to obtain the optimal sample size for estimating µ of the BS distribution under a decision-theoretic approach. Furthermore, we define the different loss functions to be considered.

Determining the Optimal Sample Size
We may approach the problem of determining the optimal sample size as a decision problem [4,28]. Given that µ is the parameter of interest, we specify a loss function L(µ, d_n) based on a sample X_n = (X_1, . . . , X_n) and a decision function d_n ≡ d_n(X_n). For a given n and depending on the adopted loss function, the action d_n(x_n) consists of specifying a quantity (point inference case) representing an estimate for µ or two quantities (interval inference case) representing the lower and upper limits of a credible interval for µ.
Let π be the prior distribution for the unknown parameter µ and d_n be a decision function. Then, the Bayes risk [4] is defined as:

$$ r(\pi, d_n) = \int_{M} \int_{\mathcal{X}_n} L(\mu, d_n(\boldsymbol{x}_n))\, g(\boldsymbol{x}_n \mid \mu)\, \pi(\mu)\, \mathrm{d}\boldsymbol{x}_n\, \mathrm{d}\mu, \tag{6} $$

where g is related to the sampling distribution for X_n given µ, M is the parameter space, and 𝒳_n is the sample space. The decision d*_n that minimizes r(π, d_n) among all possible decisions d_n is called the Bayes rule. In this context, we define the optimal sample size as the one that minimizes the total cost (TC) stated as:

$$ \mathrm{TC}(n) = r(\pi, d_n^{*}) + C(n), $$

where C(n) is a function representing the cost of sampling n observations. Here, we take C(n) = cn, where c is the per-unit cost of observing a unit in the population. Since it is not possible to compute r(π, d*_n) analytically, we use Monte Carlo simulation as an alternative to estimate TC(n) for each n.
Suppose that the order of integration may be reversed in (6); this reversal is possible whenever the conditions of the Fubini theorem are satisfied. In this case, as is known, minimizing the Bayes risk is equivalent to minimizing the posterior expected loss. Then, we have:

$$ r(\pi, d_n) = \int_{\mathcal{X}_n} \mathrm{E}\left[ L(\mu, d_n(\boldsymbol{x}_n)) \mid \boldsymbol{x}_n \right] m(\boldsymbol{x}_n)\, \mathrm{d}\boldsymbol{x}_n, $$

where m is the marginal (prior predictive) density of X_n, so that we may estimate the minimized Bayes risk through the posterior expected value of the loss function applied to the Bayes rule d*_n. This may be done as summarized in Algorithm 1.
After obtaining an estimate of r(π, d*_n), we add the respective sampling cost cn, which provides an estimate of TC(n) for a given n. We apply this procedure over a grid of plausible values of n. For example, if we set this grid as n = 2, 12, . . . , 82, 92, then we estimate TC(2), TC(12), . . . , TC(82), TC(92), respectively. The choice of the grid values is arbitrary, but the shorter the distance between consecutive grid values, the better the visualization of the behavior of the TC. However, as this distance decreases, the required computing time to obtain all these estimates also increases. Thus, the choice of the grid must balance these considerations.

Algorithm 1: Estimation of the minimized Bayes risk.
1. Set values for the hyperparameters, which reflect the prior knowledge about α² and β.
2. Draw one value of α² and one value of β from the respective prior distributions, and compute α as the square root of α².
3. Given α and β, generate a value of X_i from the BS(α, β) distribution using (2), for i = 1, . . . , n, obtaining a sample x_n = (x_1, . . . , x_n).
4. Given x_n, collect a sample of size N (as large as possible) from the joint posterior distribution of α² and β as explained in Section 2, generating values (α²_j, β_j), for j = 1, . . . , N.
5. For j = 1, . . . , N, compute the posterior values µ_j using the values generated in Step 4 and the expression stated in (1).
6. Obtain the corresponding Bayes rule d*_n using the sample from the posterior distribution of µ generated in Step 5.
7. Use the values computed in Step 5 to estimate E[L(µ, d*_n)|x_n].
8. Repeat Steps 1-7 K times (K as large as possible), generating K estimates of E[L(µ, d*_n)|x_n].
9. Take the average of the K estimates generated in Step 8 as an estimate of r(π, d*_n).
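For concreteness, the outer Monte Carlo loop of Algorithm 1 may be sketched in R as follows, here under the quadratic loss L2 of Section 3.2; rbs() is the generator sketched in Section 2, and posterior_draws() is a hypothetical stand-in for the sampler of Section 2 returning N draws of (α², β):

```r
# Sketch of Algorithm 1 for a single grid value n under the quadratic loss L2.
# posterior_draws() is hypothetical: it should return a list with components
# alpha2 and beta holding N joint posterior draws given the data x.
estimate_risk <- function(n, K, N, a1, b1, a2, b2) {
  risks <- replicate(K, {
    beta   <- 1 / rgamma(1, shape = a1, rate = b1)   # Step 2: prior draws
    alpha2 <- 1 / rgamma(1, shape = a2, rate = b2)
    x <- rbs(n, sqrt(alpha2), beta)                  # Step 3: simulate a sample
    post <- posterior_draws(x, N, a1, b1, a2, b2)    # Step 4: posterior sample
    mu <- post$beta * (1 + post$alpha2 / 2)          # Step 5: mu via (1)
    var(mu)                                          # Steps 6-7: E[L | x] under L2
  })
  mean(risks)                                        # Step 9: average over K
}
```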

In Step 4 of Algorithm 1, when sampling from the marginal posterior distribution stated in (5), we consider a burn-in of 500 iterations and a thinning of 20, with a final number of 500 iterations, which we use to estimate the Bayes risk. We inspected trace and autocorrelation plots for a low value of the grid used for n, expecting the same or better behavior as n increases along the grid. All the trace plots showed random behavior around a value, and in all the autocorrelation plots, the autocorrelations for almost all lags were negligible. For each value of n in the grid, we estimated the Bayes risk ten times. We set the burn-in, thinning, and final number of iterations based on the trace plots, autocorrelation plots, and the acceptance rate; all these diagnostic tools showed that such settings provide good results. Increasing these values may yield the same or better behavior, but the computational cost grows markedly as they increase. Taking all this into account, we decided to set 500 iterations. Due to the computational burden, other values were not tested, but we computed results in triplicate to show the stability of our findings.
Consider the fitting proposed by [8] of the total cost curve established as:

$$ \mathrm{TC}(n) = cn + \frac{E}{n^{G}}, $$

to the grid of values of n and the respective estimates of TC(n), denoted by tc(n), where E and G are parameters to be estimated. This curve may be linearized by means of:

$$ \log\left( \mathrm{tc}(n) - cn \right) = \log(E) - G \log(n), $$

whereas the estimates of E and G can be computed by the least-squares method. In this setting, the optimal sample size (n_o) is the integer closest to:

$$ \left( \frac{\widehat{G}\,\widehat{E}}{c} \right)^{1/(\widehat{G} + 1)}, $$

where Ê and Ĝ are, respectively, the least-squares estimates of E and G.
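This fitting step may be sketched in R as follows; fit_no() is an illustrative name, and tc_hat denotes the Monte Carlo estimates tc(n) over the grid n_grid:

```r
# Minimal sketch: least-squares fit of TC(n) = c*n + E/n^G on the linearized
# scale, returning the integer closest to (G*E/c)^{1/(G+1)}.
# Requires tc_hat > c * n_grid so that the logarithm is defined.
fit_no <- function(n_grid, tc_hat, c) {
  y <- log(tc_hat - c * n_grid)        # linearized Bayes-risk component
  fit <- lm(y ~ log(n_grid))           # log E is the intercept, -G the slope
  E_hat <- exp(coef(fit)[[1]])
  G_hat <- -coef(fit)[[2]]
  round((G_hat * E_hat / c)^(1 / (G_hat + 1)))
}
```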

Loss Functions
We consider five loss functions to determine the optimal sample size. The loss functions L1 and L2 may be used for point inference and are symmetric. The loss function L3 may also be used for point inference, but it is asymmetric. The loss functions L4 and L5 may be used for interval inference. Such loss functions are defined below.

L1: Loss Function 1. The first loss function is defined as:

$$ L(\mu, d_n) = |\mu - d_n|, $$

which is known as the absolute loss function. For this loss function, the Bayes rule d*_n is the median of the posterior distribution of µ. Given a sample µ_j, for j = 1, . . . , N, from the posterior distribution of µ, estimates of d*_n and E[L(µ, d*_n)|x_n] may be obtained from the sample median and the mean absolute deviation of the µ_j about that median, respectively.

L2: Loss Function 2. Second, we consider the well-known quadratic loss function stated as:

$$ L(\mu, d_n) = (\mu - d_n)^2. $$

For this loss function, the Bayes rule d*_n corresponds to the posterior expected value of µ, and in this case, E[L(µ, d*_n)|x_n] = Var(µ|x_n). Given a sample µ_j, for j = 1, . . . , N, from the posterior distribution of µ, an estimate of E[L(µ, d*_n)|x_n] may be obtained from the respective sample variance.
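Given posterior draws of µ, these quantities are straightforward to estimate; a minimal R sketch, with simulated draws as a stand-in for the posterior sample:

```r
# Sketch: Bayes rules and estimated posterior expected losses for L1 and L2.
mu <- rgamma(500, shape = 2, rate = 0.5)   # stand-in for posterior draws of mu
d1 <- median(mu)                           # L1: Bayes rule (posterior median)
risk1 <- mean(abs(mu - d1))                # L1: mean absolute deviation
d2 <- mean(mu)                             # L2: Bayes rule (posterior mean)
risk2 <- var(mu)                           # L2: posterior variance
```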

L3: Loss Function 3.
The loss functions L1 and L2 suffer from two disadvantages in practical applications: both are symmetric and unbounded. Among the bounded loss functions that might be considered, we may include those suggested in [29,30]. However, in our case, these loss functions are not simple to deal with. Nevertheless, there is a simple, well-known asymmetric loss function that we may consider: the linear exponential (known as LINEX) loss function, given by:

$$ L(\mu, d_n) = \exp\{\lambda(d_n - \mu)\} - \lambda(d_n - \mu) - 1, $$

where λ ≠ 0. As λ increases positively, overestimation is more costly than underestimation. As λ increases negatively, the situation is reversed [31]. From p. 447 in [31], the Bayes rule for this loss function is established as:

$$ d_n^{*} = -\frac{1}{\lambda} \log\left( \mathrm{E}\left[ \exp\{-\lambda\mu\} \mid \boldsymbol{x}_n \right] \right). $$

Given a sample µ_j, for j = 1, . . . , N, from the posterior distribution of µ, we may compute an estimate of d*_n through:

$$ \widehat{d}_n^{\,*} = -\frac{1}{\lambda} \log\left( \frac{1}{N} \sum_{j=1}^{N} \exp\{-\lambda \mu_j\} \right). $$

For the LINEX function, we have that [5]:

$$ \mathrm{E}\left[ L(\mu, d_n^{*}) \mid \boldsymbol{x}_n \right] = \lambda\left( \mathrm{E}[\mu \mid \boldsymbol{x}_n] - d_n^{*} \right). $$

An estimate of E[L(µ, d*_n)|x_n] may be obtained from:

$$ \lambda\left( \frac{1}{N}\sum_{j=1}^{N} \mu_j - \widehat{d}_n^{\,*} \right). $$

L4: Loss Function 4. The fourth loss function is defined as:

$$ L(\mu, d_n) = \rho\tau + (a - \mu)_{+} + (\mu - b)_{+}, \tag{7} $$

where 0 < ρ < 1 is a weight, τ = (b − a)/2 is the half-length of the desired interval, and the function x_+ is equal to x if x > 0 and equal to zero otherwise. Note that as τ decreases, the interval is narrower. The terms (a − µ)_+ and (µ − b)_+ are included to penalize intervals that do not contain the parameter of interest µ. These terms are equal to zero if µ ∈ [a, b] and increase as µ moves away from the interval. Note that the loss function given in (7) is a weighted sum of two terms, τ and (a − µ)_+ + (µ − b)_+, where the weights are ρ and one, respectively. The Bayes rule d*_n corresponds to taking a and b as the 100(ρ/2)th and 100(1 − ρ/2)th quantiles of the posterior distribution of µ [8,32]. If we consider this loss function applied to the Bayes rule, we have that:

$$ \mathrm{E}\left[ L(\mu, d_n^{*}) \mid \boldsymbol{x}_n \right] = \rho\tau^{*} + \mathrm{E}\left[ (a^{*} - \mu)\,\delta_{\{\mu < a^{*}\}} + (\mu - b^{*})\,\delta_{\{\mu > b^{*}\}} \mid \boldsymbol{x}_n \right], $$

where τ* = (b* − a*)/2, a* and b* are the corresponding bounds of the Bayes rule d*_n, and δ is the indicator function. Given a sample µ_j, for j = 1, . . . , N, from the posterior distribution of µ, an estimate of E[L(µ, d*_n)|x_n] may be obtained from:

$$ \rho\tau^{*} + \frac{1}{N}\sum_{j=1}^{N}\left[ (a^{*} - \mu_j)\,\delta_{\{\mu_j < a^{*}\}} + (\mu_j - b^{*})\,\delta_{\{\mu_j > b^{*}\}} \right]. $$
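A minimal R sketch of the corresponding estimates for L3 and L4, given posterior draws mu (the function names are ours):

```r
# Sketch: LINEX (L3) and interval (L4) quantities from posterior draws mu.
linex <- function(mu, lambda) {
  d <- -log(mean(exp(-lambda * mu))) / lambda   # Bayes rule for LINEX
  c(rule = d, risk = lambda * (mean(mu) - d))   # posterior expected loss
}
interval_L4 <- function(mu, rho) {
  ab <- quantile(mu, c(rho / 2, 1 - rho / 2))   # Bayes rule: quantile bounds
  tau <- unname(diff(ab)) / 2
  risk <- rho * tau + mean(pmax(ab[1] - mu, 0) + pmax(mu - ab[2], 0))
  list(interval = unname(ab), risk = risk)
}
```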

L5: Loss Function 5.
The fifth and last loss function considered here is expressed as:

$$ L(\mu, d_n) = \gamma\tau + \frac{(\mu - m)^2}{\tau}, \tag{8} $$

where γ > 0 is a fixed constant and m = (a + b)/2 is the center of the credible interval, with τ = (b − a)/2 as before.
The first term defined in (8) involves the half-width of the interval, and the second term is the square of the distance between the parameter of interest µ and the center of the interval, which is divided by the half-width to maintain the same measurement unit as the first term. The weights attributed to each term stated in (8) are γ and one, respectively. If γ < 1, we attribute the largest weight to the second term; if γ > 1, the situation is reversed; and if γ = 1, the two terms have the same weight. For this loss function, the Bayes rule d*_n corresponds to the quantities that define the interval [a*, b*] = [m* − SD_γ, m* + SD_γ], where m* = E[µ|x_n] and SD_γ = γ^{−1/2}(Var(µ|x_n))^{1/2}, that is, the correspondingly scaled posterior standard deviation [4,8,32]. In this case, we have that:

$$ \mathrm{E}\left[ L(\mu, d_n^{*}) \mid \boldsymbol{x}_n \right] = 2\gamma^{1/2}\left( \mathrm{Var}(\mu \mid \boldsymbol{x}_n) \right)^{1/2}. \tag{9} $$

Given a sample µ_j, for j = 1, . . . , N, from the posterior distribution of µ, an estimate of E[L(µ, d*_n)|x_n] may be obtained from the sample variance and the expression (9).
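A minimal R sketch for L5, given posterior draws mu (the function name is ours):

```r
# Sketch: L5 Bayes rule and estimated posterior expected loss via (9).
interval_L5 <- function(mu, gamma) {
  sd_g <- sqrt(var(mu) / gamma)            # SD_gamma = gamma^{-1/2} * posterior sd
  c(lower = mean(mu) - sd_g, upper = mean(mu) + sd_g,
    risk = 2 * sqrt(gamma * var(mu)))      # expression (9)
}
```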

Computational Aspects and Empirical Applications
In this section, we provide the characteristics of the computer used in our study. We also show the capabilities and features of a new R package named samplesizeBS [22], available from GitHub at https://github.com/santosneto/samplesizeBS (accessed on 21 May 2021). The samplesizeBS package allows us to calculate an optimal sample size when estimating the BS mean and to generate random values from the joint posterior distribution under the BS/IG model. This section finishes with an empirical application based on real data.

Computer Characteristics
The characteristics of the Cluster Euler, used when calculating the optimal sample sizes in Section 4.3, are available at http://www.cemeai.icmc.usp.br/Euler/index.html.

The samplesizeBS Functions
In Table 1, we present the details of the functions contained in the samplesizeBS package. In order to determine the BS optimal sample size, we use the function bss.dt.bs(). In the example below, we calculate the optimal sample size considering the loss function L1, with a_1 = a_2 = 8, b_1 = b_2 = 50, and c = 0.01. The function also returns a graph of the sample size (n) versus the total cost TC(n) of sampling; that is, TC(n) takes into account both the cost of sampling n observations and the cost of inference through the loss function. In this way, the optimal sample size is the value that minimizes the total cost (see Figure 1).
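A call of the following form reproduces this example; note that the argument names shown are assumptions for illustration, not the package's documented signature, which should be checked in the package help:

```r
# Illustrative call (argument names are assumptions; check ?bss.dt.bs).
# Install via: remotes::install_github("santosneto/samplesizeBS")
library(samplesizeBS)
res <- bss.dt.bs(loss = "L1", a1 = 8, b1 = 50, a2 = 8, b2 = 50, cost = 0.01)
```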

Optimal Sample Sizes
Next, we calculate the optimal sample size assuming different scenarios. For the hyperparameters of the prior distribution of β, we take b_1 = 50 and a_1 = 8, 10, 13, 15. With these values, we have different degrees of prior information (see Figure 2). For the prior distribution of α², we set a_2 = a_1 and b_2 = b_1. We consider c = 0.001, 0.01, 0.1 for the per-unit cost. For the loss function L3, we take λ = 0.50, 1.00, 2.00; for L4, ρ = 0.01, 0.05, 0.10; and for L5, γ = 0.25, 0.50, 1.00. For each combination of these values, we compute the optimal sample size n_o for estimating µ. The average acceptance rate for the Metropolis-Hastings algorithm over all these combinations was ≈70%. Since the proposed methodology is based on simulation methods, we obtain n_o in triplicate and observe the difference among the three values. Table 2 reports the optimal sample sizes computed with these settings. In addition, n_o may be obtained via the following link, which also presents a graph with the fitted curve: https://santosneto.shinyapps.io/samplesizeBSapp (accessed on 21 May 2021).

Note that the values of the hyperparameters are fixed according to the practitioner's knowledge about the parameters α² and β (see Step 1 of Algorithm 1). For example, setting the hyperparameters a_1 = 8 and b_1 = 50 means that our prior knowledge indicates that the most likely values for β lie between 3 and 5 (see Figure 2), and similarly for the parameter α². Thus, for instance, if the practitioner's prior knowledge is that the possible values for β (and/or α²) lie between 6 and 9, the practitioner must set the values of the hyperparameters (parameters of the IG distribution) such that the greatest probability mass of the IG distribution lies between 6 and 9.

Illustrative Example
A practical application of the methodology to compute the optimal sample size when estimating the mean of the BS distribution is illustrated here with one example. In this example, we consider a data set composed of 46 observations given in [33] and available in the samplesizeBS package [22]. These observations correspond to maintenance data on active repair times (in hours) for an airborne communications transceiver. Let x_i be the ith observation of this data set, associated with the random variable X_i ∼ BS(α, β).
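As a starting point, the data can be loaded and the modified moment estimates of Section 2 computed as initial values for the sampler; a minimal sketch, where we assume data(repair) yields the 46 repair times as a numeric structure:

```r
# Sketch: load the repair-time data and compute the modified moment estimates.
# We assume data(repair) yields a numeric structure of the 46 repair times;
# adjust the extraction below if the object is a data frame with named columns.
library(samplesizeBS)
data(repair)
x <- as.numeric(unlist(repair))
xa <- mean(x)                          # sample arithmetic mean
xh <- 1 / mean(1 / x)                  # sample harmonic mean
alpha_tilde <- sqrt(2 * (sqrt(xa / xh) - 1))
beta_tilde  <- sqrt(xa * xh)           # also an estimate of the BS median
c(alpha = alpha_tilde, beta = beta_tilde)
```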

Discussion, Conclusions and Future Research
We proposed a methodology to compute the optimal sample size for estimating the mean of the Birnbaum-Saunders distribution, a widely applied and studied distribution in several areas of science. We considered five different loss functions, which allowed us to perform both point and interval inference for the parameter of interest.
An advantage of the proposed methodology is that the per-unit cost, represented by c, is explicitly taken into account. When the cost c is fixed and b_1 = 50, the optimal sample size n_o decreases as a_1 increases (or as the prior variance decreases), as expected, since in such a case, the prior knowledge increases with a_1. This occurred with all loss functions considered in our study. For b_1 = 50 and a_1 fixed, n_o also decreases as c increases, but the total sampling cost increases. For example, if we take the loss function L1, a_1 = 8, and c = 0.001, the corresponding n_o is 651 (see Table 2), which generates a sampling cost of C(651) = 0.001 × 651 = 0.651, whereas if we take c = 0.1, the corresponding n_o is 23 (see Table 2), which produces a sampling cost of C(23) = 0.1 × 23 = 2.3. For the loss function L3, n_o increases as λ increases, if we consider a_1 and c fixed, which makes sense since overestimation is more costly as λ increases. For the loss function L4, when ρ increases, n_o also increases, if we consider a_1 and c fixed. This makes sense because ρ is the weight attributed to the term τ in L4, a term related to the length of the credible interval. In addition, when ρ increases, we expect shorter credible intervals, and consequently, the probability content of the respective interval decreases. The same is valid for γ in the loss function L5, but in this case, the shortening of the corresponding credible interval is easily seen from the term γ^{−1/2} in the expression of the respective Bayes rule. When γ increases, this term shrinks the length of the interval.
Since the proposed methodology was based on simulations, we obtained the value of n_o in triplicate for each scenario of values of a_1, c, λ, ρ, and γ. We observed that the largest discrepancies among the scenarios occurred for a_1 = 8. However, these discrepancies decreased as a_1 increased, or when the prior variance decreased. This also occurred when c = 0.001 and/or when we considered the loss function L2. In general, the discrepancy was close to zero, but if a large discrepancy occurs, we suggest visually inspecting the graph of the fitted curves and taking the value of n_o that corresponds to the best fit. Nevertheless, if all the curves fit visually well, we suggest using the median of the values obtained for n_o. In our case, we obtained the values of n_o in triplicate, as mentioned. For example, in Figure 3, under the loss function L5 with a_1 = 8, c = 0.001, and γ = 0.50, the values of n_o were 1208, 1179, and 1183. Since there was a discrepancy between these values and the fittings of the curves were visually suitable, in this case, we suggest using n_o = 1183. Note that we have optimal sample sizes equal to zero in Table 2 in some scenarios, which means that it is not worth sampling in these cases because the sampling cost outweighs the reduction in the minimized Bayes risk. This was also observed in [2,6]. We recall that the objective of the present investigation was to calculate the optimal sample size when estimating the mean of the Birnbaum-Saunders distribution. Simulation experiments should be performed to study the relative bias of the estimator of the mean (or the width of the corresponding interval estimate). A sensitivity analysis considering such simulation experiments would also be helpful. However, as mentioned, our methodology uses very intensive computational resources, which limits the implementation of such experiments. In addition, we plan to consider different priors for the parameters and to examine the robustness and the sensitivity of the results. Moreover, we are interested in studying the possibility of using bounded loss functions and other sampling algorithms.
The relevance of calculating sample sizes in statistics is undeniable. Another important challenge is related to calculating the sample size when estimating the Birnbaum-Saunders mean (or other parameters) under more complex modeling structures, such as regression, temporal, spatial, functional, PLS, and errors-in-variables settings [34][35][36][37][38]. We hope to report findings associated with these open aspects in future research.

Data Availability Statement:
The real-world data set used in this work is available in the samplesizeBS package. To access the data, use the R command data(repair).