To Burn-In, or Not to Burn-In: That’s the Question

In this paper it is shown that the bathtub-curve (BTC) based time derivative of the failure rate at the initial moment of time can serve as a suitable criterion of whether burn-in testing (BIT) should or should not be conducted. It is also shown that this criterion is, in effect, the variance of the random statistical failure rate (SFR) of the mass-produced components that the product manufacturer receives from numerous vendors, whose commitments to reliability are unknown, so that their random SFR might vary in a very wide range, from zero to infinity. A formula for the non-random SFR of a product comprised of mass-produced components with random SFRs is derived, and a solution for the case of a normally distributed random SFR is obtained.


Introduction
Burn-in testing (BIT) [1][2][3][4][5][6][7][8][9][10] has for many years been an accepted practice for detecting and eliminating early failures in newly fabricated electronic products prior to shipping the "healthy" ones that survived BIT to customers. BIT is mandatory on most high-reliability procurement contracts, such as for military and aerospace applications, but is also a must for automotive, medical, long-haul telecommunication and other electronic materials, packages and systems, whose high operational performance is paramount. BIT stimulates failures in defective materials and vulnerable structural elements of the manufactured products by accelerating the stresses that will supposedly cause these materials and elements to fail. BIT is usually conducted at the component level, because the cost of testing and replacing parts is the lowest at this level. The products are tested by applying stress extremes, usually, but not necessarily, of the expected operational stressors. It is believed that once a sufficiently long BIT process is complete, no further early failures are likely to occur.
Depending on the anticipated operating conditions of the product and the testing capabilities of a particular manufacturer, BIT can be based on temperature cycling, elevated temperatures, voltage, current, humidity, random vibrations, and so on, or, since the principle of superposition does not apply in reliability engineering, on an appropriate combination of these stressors. The duration of stressing depends on the product, the manufacturing technology and the reliability requirements, with consideration of the consequences of possible failures. Elevated-temperature soaking (say, 125 °C for 168 h) or elevated-stress screening (say, twenty temperature cycles from −10 °C to 70 °C) is most often used. For complex products, dynamic BIT might be employed: the thermal stress caused by the change in temperature is combined in these tests with dynamic (shocks, random vibrations) loading. Such a temperature-dynamic bias is thought to provide worst-case operating conditions [11,12]. For commercial applications, BIT, if conducted at all, does not last longer than one or two days (24 or 48 h). BIT is a costly effort, and its application is therefore thoroughly planned and carefully executed.
It goes without saying that, as a result of successfully applied BIT, early failures are avoided and the infant-mortality portion (IMP) of the bathtub curve (BTC) (Figure 1) is eliminated, at the expense of an undesirable reduction in yield caused by the BIT process. In addition, high BIT stresses might not only eliminate "freaks", but could cause permanent damage to the main population of "healthy" products, thereby reducing their lifetime. It is unclear, however, to what extent this actually happens: highly accelerated life testing (HALT) [13], a "black box" that tries "to kill many birds with one stone" and is at present the testing procedure of choice employed as a BIT vehicle, is unable to provide any information on that.
It remains unclear what could possibly be done to develop an insight into what is actually happening during and as a result of BIT, and what could possibly be done to effectively eliminate "freaks", while shortening the testing time and not damaging the sound devices. In a mature production, when HALT is relied upon to do the BIT job, it is not easy even to determine whether there exists a decreasing failure rate: to determine the failure time for a very low percentage of the production, one has to destroy a large number of devices, unless there are additional considerations of what could possibly be done to enhance the merits of the BIT process and to minimize its shortcomings. Thus, there is an obvious incentive to develop ways in which the BIT process could be quantified, monitored and possibly optimized.
Accordingly, in this analysis some important aspects of BIT are addressed for an electronic product comprised of numerous mass-produced components. Our intent is to shed some quantitative light on the BIT process. Particularly, we try to develop a suitable and predictive criterion that would be able to answer the fundamental "to burn-in or not to burn-in" question.
Two mutually complementing modeling studies have been carried out here: (1) an analysis of the configuration of the IMP of the BTC of a more or less well-established manufacturing technology; and (2) an analysis of the role of the random statistical failure rate (SFR) of the mass-produced components that the product of interest is comprised of. Particularly, as far as the second study is concerned, we consider the effect that the random SFR of the mass-produced components might have on the non-random initial SFR of the product. Although this paper offers neither a straightforward and ultimate answer to the "to burn-in or not to burn-in" question, nor a way to optimize the BIT process in terms of its cost and duration, the suggested physics-of-failure and statistics-of-failure based criterion, and the calculated probabilities of non-failure for the given loading conditions and testing time, provide, in our judgment, a useful step forward in advancing the state of the art in today's BIT practice.
BIT, being a HALT effort, is, in effect, a failure-oriented accelerated test (FOAT) [14][15][16] and, as such, should be geared to a physically meaningful accelerated-test model capable of confirming the anticipated physics of failure and the expected failure modes. The application of the probabilistic design for reliability (PDfR) approach [17,18] and its constituents, FOAT and the multi-parametric Boltzmann–Arrhenius–Zhurkov (BAZ) equation [19][20][21], is beyond the scope of this paper. The PDfR/FOAT/BAZ concept is considered, however, as important future work. Let us briefly elaborate on its substance.
If the well-known Arrhenius model [22] is employed, FOAT should be conducted to determine the corresponding activation energies and other data that characterize the device reliability [23]. The desirable steady-state portion of the BTC occurs, as is known, at the end of the BIT process as a result of the interaction of two major irreversible processes: the "favorable" SFR process, resulting in a failure rate that decreases with time, and the "unfavorable" physics-of-failure-related (PFR) process, resulting in an increasing failure rate. The first process dominates at the IMP of the BTC and is considered in this paper; the second one dominates at the wear-out portion. These two processes start to compensate for each other at the beginning of the steady-state portion of the BTC, characterized by a low enough and acceptable failure-rate level λ0. The SFR process can be predicted [24,25] for a product comprised of mass-produced components from sheer theoretical considerations. Assuming that the physics-of-failure and statistics-of-failure processes are statistically independent, the failure rates of the first process at a given moment of time can be obtained by simply deducting the predicted SFR values from the experimentally obtained BTC ordinates. In our BIT analysis, a different application of the findings of Refs. [24,25] is employed, namely, to quantify, on a probabilistic basis, some more or less well-known considerations underlying the existing BIT practice, including the "to burn-in, or not to burn-in" question. Application of the PDfR/FOAT/BAZ concept will hopefully be able not only to answer this question for a given manufacturing technology, but, most importantly, to establish the appropriate elevated stresses and their levels, and to decide on the effective BIT duration, in order to minimize the number of devices that will be destroyed and the time of testing.
The numerical example in Appendix B gives an indication of what could be expected from the application of the PDfR/FOAT/BAZ concept.

Prediction Based on the Analytical Approximation of the Bathtub-Curve (BTC)
The typical BTC, the "reliability passport" of a mass-produced electronic product (Figure 1), can be approximated by the following expressions [18]:

λ(t) = λ0 + (λ1 − λ0)(1 − t/t1)^n1, 0 ≤ t ≤ t1; λ(t) = λ0 + (λ2 − λ0)[1 − (T − t)/t2]^n2, T − t2 ≤ t ≤ T. (1)

Here λ(t) is the time-dependent failure rate, λ0 is its steady-state minimum, λ1 is its initial (high) value at the beginning of the IMP, t1 is the duration of this portion, λ2 is the final (actual or acceptable) value of the failure rate at the end of the wear-out portion, t2 is the duration of this portion, and the exponents n1 and n2 are expressed through the fullnesses β1 and β2 of the BTC infant-mortality and wear-out portions as n1,2 = β1,2/(1 − β1,2). These fullnesses are defined as the ratios of the areas below the BTC (i.e., the areas between the BTC and the time axis) to the areas (λ1 − λ0)t1 and (λ2 − λ0)t2 of the corresponding rectangles. The exponents n1 and n2 change from zero to one, when the fullnesses β1 and β2 change from zero to 0.5. The "to burn-in or not to burn-in" question can be tentatively answered based on the derivative

dλ(t)/dt = −(n1/t1)(λ1 − λ0)(1 − t/t1)^(n1 − 1), (2)

calculated for the initial moment of time t = 0. This yields:

dλ(t)/dt |t=0 = −(n1/t1)(λ1 − λ0) = −[β1/(1 − β1)] (λ1 − λ0)/t1. (3)

If this derivative is zero or next to zero, this means that there is no IMP at all, so that no BIT is needed to eliminate this portion, and "not to burn-in" is the answer to our basic question. This certainly happens when the initial value λ1 of the BTC is not different from its steady-state value λ0. What is less obvious is that the same result takes place for β1 t1 = 0. This means that no more or less durable BIT is needed in such a case, because there are not too many "freaks" in the population, and these "freaks" are characterized by very low probabilities of non-failure, so that the planned BIT process is a next-to-instantaneous one. The maximum value of the fullness β1 is β1 = 0.5. This corresponds to the case when the IMP of the BTC is a straight line connecting the initial, λ1, and the steady-state, λ0, values of the BTC.
In this case,

dλ(t)/dt |t=0 = −(λ1 − λ0)/t1. (4)

The derivative of (3) with respect to the fullness β1 changes from the value expressed by formula (4) to a value four times greater, when the fullness β1 changes from zero to 0.5. But how can one establish the most likely λ1 value and the required BIT time, even for the worst-case scenario β1 = 0.5, so that the question "to burn-in or not to burn-in?" could be answered with some certainty? To do that, let us address two additional and independent methodologies: one based on the use of the SFR [24,25] and, briefly, also one based on the application of the BAZ constitutive equation [19][20][21].
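The slope criterion above can be sketched numerically. The sketch below assumes the power-law IMP approximation and the initial-slope formula (3) as reconstructed here; the numerical values of λ0, λ1 and t1 are hypothetical, chosen only for illustration.

```python
import math

def imp_failure_rate(t, lam0, lam1, t1, beta1):
    """Infant-mortality portion of the BTC, assuming the power-law form
    lambda(t) = lam0 + (lam1 - lam0)*(1 - t/t1)**n1, n1 = beta1/(1 - beta1)."""
    n1 = beta1 / (1.0 - beta1)
    return lam0 + (lam1 - lam0) * (1.0 - t / t1) ** n1

def initial_slope(lam0, lam1, t1, beta1):
    """BTC-based burn-in criterion: d(lambda)/dt at t = 0, i.e.
    -(beta1/(1 - beta1))*(lam1 - lam0)/t1 under the same approximation."""
    return -(beta1 / (1.0 - beta1)) * (lam1 - lam0) / t1

# Hypothetical illustration: lam1 = 1e-4 1/h at t = 0, decaying to
# lam0 = 1e-6 1/h over a t1 = 100 h infant-mortality portion.
slope_line = initial_slope(1e-6, 1e-4, 100.0, 0.5)    # straight-line IMP (beta1 = 0.5)
slope_flat = initial_slope(1e-6, 1e-4, 100.0, 0.05)   # nearly "empty" IMP (beta1 -> 0)
```

A next-to-zero slope (β1 → 0) suggests that no durable BIT is needed, while a significant slope indicates "freaks" that even a short BIT would remove.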

Prediction Based on the Analysis of the SFR Process
In the simplest case of uniformly distributed random failure rates λ, when the probability density distribution function f(λ) is constant, formula (A3) of Appendix A yields:

λST(t) = 1/t. (5)

In such a case, the probability of non-failure becomes time independent, that is, constant over the entire operation range:

P = exp(−λST t) = e^−1 = 0.3679. (6)

This result does not seem to make physical sense. Let us consider therefore a more realistic case, when the random failure rates λ of the components are normally distributed:

f(λ) = [1/√(2πD)] exp[−(λ − λ̄)^2/(2D)]. (7)

Here λ̄ is the mean value of the random SFR λ and D is its variance. Introducing (7) into formula (A3) and using [26], the following expression for the non-random SFR of the product can be obtained:

λST(t) = √(2D) ϕ(τ). (8)

The function

ϕ(τ) = −τ + 1/Φ(τ) (9)

depends on the dimensionless "physical" (effective) time

τ = √(D/2) t − s, s = λ̄/√(2D), (10)

and so do the auxiliary function Φ(τ) = √π e^(τ^2) [1 − erf(τ)] and the probability integral (Laplace function) erf(τ) = (2/√π) ∫0^τ e^(−u^2) du. For large τ values, the auxiliary function can be computed using the asymptotic expansion

Φ(τ) = (1/τ)[1 − 1/(2τ^2) + 3/(4τ^4) − . . .]. (11)

The term s in formula (10) can be interpreted as a sort of measure of the level of uncertainty of the random SFR. The s value changes from infinity to zero, when the variance D changes from zero, in the case of a non-random SFR, to infinity, in the case of an "ideally random" SFR. As is evident from formula (10), the "physical" time τ of the SFR process depends not only on the "chronological" (actual) time t, but also on the mean λ̄ and variance D of the components' random SFR. The rate of change of the "physical" time τ with the change in the "chronological" time t is dτ/dt = √(D/2): the "physical" time τ changes the faster, the larger the standard deviation √D of the random SFR is. Considering this relationship, formula (8) yields dλST(t)/dt = √(2D) [dϕ(τ)/dτ] (dτ/dt) = D dϕ(τ)/dτ. The "physical" time τ is zero when the "chronological" time is t = λ̄/D, and it changes from −∞ to ∞ when the variance D of the random SFR changes from zero to infinity. The function ϕ(τ) is tabulated in Table 1. It changes from 3 to zero when the "physical" time τ changes from −3 to infinity, that is, when the "chronological" time changes from zero to infinity.
The function ϕ(τ) in this table is calculated numerically.
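The tabulated function ϕ(τ) can indeed be computed numerically. A minimal sketch, assuming the reconstructed form ϕ(τ) = −τ + 1/Φ(τ) with Φ(τ) = √π e^(τ^2)[1 − erf(τ)]:

```python
import math

def phi(tau):
    """phi(tau) = -tau + 1/Phi(tau), with the auxiliary function
    Phi(tau) = sqrt(pi)*exp(tau**2)*erfc(tau).  For moderate tau,
    math.erfc is numerically stable, so the asymptotic expansion (11)
    is not needed here."""
    return -tau + math.exp(-tau * tau) / (math.sqrt(math.pi) * math.erfc(tau))

# phi(0) = 1/sqrt(pi) ~ 0.5642; phi(-3) ~ 3; phi(tau) -> 0 as tau grows,
# reproducing the behavior described for the Table 1 data.
```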
The expansion (11) can be used to calculate the auxiliary function Φ(τ) for large τ values, exceeding, say, 2.5, and has been, in effect, employed when computing the Table 1 data. The function Φ(τ) changes from infinity to zero, when the "physical" time τ changes from −∞ to ∞. For times τ below −2.5, the function Φ(τ) is large, and the second term in (9) becomes small compared to the first one. In this case the function ϕ(τ) coincides with the time τ itself, although with an opposite sign. As is evident from the Table 1 data, the derivative dϕ(τ)/dτ can be put, at the initial moment of time, equal to −1.0, and therefore

dλST(t)/dt |t=0 = −D. (12)

This result explains the physical meaning of the initial failure rate λ1 of the BTC. At the initial moment of time (t = 0) the formulas (10), (11) and (8) yield:

τ(0) = −s, ϕ(−s) = Ψ(s), λ1 = λST(0) = √(2D) Ψ(s), (13)

where the function

Ψ(s) = s + e^(−s^2) / {√π [1 + erf(s)]} (14)

is tabulated in Table 2. This function changes from 1/√π = 0.5642 to infinity, when the factor s changes from zero to infinity. With the product's initial SFR value λST(0) = λ1 (the degradation failure rate λDG is obviously zero at the initial moment of time, so that the initial value of the non-random SFR coincides with the initial value λ1 of the BTC), the last formula in (13) indicates that, as the factor s increases from zero to infinity (see Table 2), the ratio λ1/√(2D) = Ψ(s) increases from 1/√π = 0.5642 to infinity. The initial failure rate can be put equal to its mean value, if the ratio λ̄/√(2D) exceeds 2.5. This is usually indeed the case in an actual situation, since the accepted normal distribution, when applied to a random variable that cannot be negative, should be characterized by a significant ratio of its mean value to the standard deviation, so that the negative values of such a distribution, although they exist, are insignificant and do not contribute appreciably to the sought information.
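Under the same reconstruction, the function Ψ(s) and the initial SFR λ1 = √(2D) Ψ(s) can be sketched as follows; the mean and variance values are hypothetical, chosen only to illustrate the s > 2.5 observation above.

```python
import math

def Psi(s):
    """Psi(s) = s + exp(-s**2)/(sqrt(pi)*(1 + erf(s))): the ratio of the
    initial non-random SFR lambda_1 to sqrt(2*D), per formula (14)."""
    return s + math.exp(-s * s) / (math.sqrt(math.pi) * (1.0 + math.erf(s)))

def initial_sfr(mean_lam, D):
    """lambda_1 = sqrt(2*D)*Psi(s), with the 'safety factor' s = mean_lam/sqrt(2*D)."""
    root = math.sqrt(2.0 * D)
    return root * Psi(mean_lam / root)

# For s = mean_lam/sqrt(2*D) > 2.5, Psi(s) ~ s, so lambda_1 ~ mean_lam:
# the initial failure rate can be put equal to its mean value.
```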
The probability of non-failure,

P(t) = Φ(τ)/Φ(−s), (15)

can be calculated accordingly, and is tabulated in Table 3 as a function of the "physical" time τ and the "safety factor" s. From (8) we obtain:

dλST(t)/dt = D dϕ(τ)/dτ. (16)

The derivative dϕ(τ)/dτ can be evaluated analytically or obtained numerically using the Table 3 data.
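The probability of non-failure can likewise be sketched numerically; the implementation below assumes the reconstructed closed form (15), P(t) = Φ(τ)/Φ(−s), and hypothetical mean/variance values.

```python
import math

def Phi_aux(tau):
    """Auxiliary function Phi(tau) = sqrt(pi)*exp(tau**2)*erfc(tau)."""
    return math.sqrt(math.pi) * math.exp(tau * tau) * math.erfc(tau)

def prob_no_failure(t, mean_lam, D):
    """P(t) = Phi(tau)/Phi(-s) for a normally distributed component SFR
    with mean mean_lam and variance D; tau and s follow formula (10)."""
    root = math.sqrt(2.0 * D)
    s = mean_lam / root
    tau = (D * t - mean_lam) / root
    return Phi_aux(tau) / Phi_aux(-s)

# For a small variance (large "safety factor" s), P(t) approaches the
# deterministic exponential exp(-mean_lam * t), as one would expect.
```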

Conclusions
The following conclusions could be drawn from the carried-out analysis:
• Two mutually complementing modeling studies have been carried out: (1) the analysis of the configuration of the IMP of the BTC, the reliability "passport" of an established semiconductor technology; and (2) the analysis of the role of the random SFR of the mass-produced components that the product of interest is comprised of.
• The first analysis has shown that the BTC-based time derivative of the failure rate at the initial moment of time can be considered as a suitable criterion of whether BIT should or should not be conducted. If this derivative is small, no BIT might be needed, because the initial part of the IMP is more or less parallel to the time axis, which is an indication that there are no highly unreliable items ("freaks") in the lot and that the initial moment of time is, in effect, the start of the steady-state BTC condition. In the opposite extreme case, when this derivative is significant, BIT is needed, but could be made very short, because the "freaks" are so unreliable that even a very short and weak BIT could successfully remove them.
• The second analysis has indicated that the above criterion is, in effect, the variance of the random SFR of the mass-produced components that the product manufacturer received from numerous vendors, whose commitments to reliability were unknown, so that their random SFR might vary in a very wide range, from zero to infinity.
• A solution for the case of the normally distributed random SFR was obtained. Using this solution, probabilities of non-failure were calculated as functions of time and of the ratio of the mean value of the random SFR of the mass-produced components to its standard deviation (in the analysis of structures this ratio is known as the safety factor). This adds useful information to the next-step investigations and enables a more effective answer to our fundamental "question in question".
• Although this paper does not offer a straightforward and ultimate answer to this question, the suggested physics-of-failure and statistics-of-failure based criterion, and the calculated probabilities of non-failure for the given loading conditions and time of testing, provide a useful step forward in advancing today's BIT practice, which is based on HALT, a "black box" that has many merits, but does not quantify reliability, even on a deterministic basis.
• Future work should include experimental verification of the suggested "to burn-in or not to burn-in" criterion, as well as of its acceptable values, which would enable answering the "to burn-in or not to burn-in" question. It should also include investigation of the effects of other possible distributions of the random SFR, such as, for example, the Rayleigh distribution.

Conflicts of Interest:
The author declares no conflict of interest.

Acronyms
BAZ  Boltzmann–Arrhenius–Zhurkov (equation)
BIT  burn-in testing
BTC  bathtub curve
FOAT  failure-oriented accelerated test
HALT  highly accelerated life testing
IMP  infant-mortality portion (of the BTC)
PDfR  probabilistic design for reliability
PFR  physics-of-failure-related (process)
SFR  statistical failure rate

Appendix B

Since the activation energy U0 should remain the same at both temperature levels, the sensitivity factor γR could be found from the following formula:

γR = n1 exp[U0/(kT1)] = n2 exp[U0/(kT2)], n1,2 = −ln P1,2(R*)/(R* t1,2). (A6)

Let, for example, the following data be accumulated during a product launch or lot release: the BIT at the temperature T1 = 125 °C = 398 K was conducted for t1 = 12 h, and 1.5% of the tested devices failed (P1 = 0.985); when the test was conducted for t2 = 24 h with another group of the same devices at the temperature T2 = 150 °C = 423 K, 11.5% of the tested devices failed (P2 = 0.885). The observed failure modes were mechanical failures of the solder joints, and the failures corresponded to an increase in the electrical resistance to the level of R* = 450 Ω. The second formula in (A6) then yields the n1 and n2 values needed in the second step.
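The two-step calculation just described can be sketched numerically. The sketch below assumes the reconstructed form of (A6), n1,2 = −ln P1,2/(R* t1,2), together with the Arrhenius-type relation n = γR exp[−U0/(kT)] implied by the equal-activation-energy argument; it is an illustration of the procedure, not the author's original computation.

```python
import math

k_B = 8.617e-5  # Boltzmann constant, eV/K

# BIT data from the example in the text
R_star = 450.0                    # failure-threshold resistance, Ohm
T1, t1, P1 = 398.0, 12.0, 0.985   # 125 C, 12 h, 1.5% failed
T2, t2, P2 = 423.0, 24.0, 0.885   # 150 C, 24 h

# Step 1: formula (A6), as reconstructed here
n1 = -math.log(P1) / (R_star * t1)
n2 = -math.log(P2) / (R_star * t2)

# Step 2: with n = gamma_R*exp(-U0/(k*T)) holding at both temperatures,
# the activation energy and the sensitivity factor follow from the two tests
U0 = k_B * math.log(n2 / n1) / (1.0 / T1 - 1.0 / T2)   # eV
gamma_R = n1 * math.exp(U0 / (k_B * T1))
```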
In an approximate analysis, when the times t1 and t2 are short, one could tentatively evaluate the variance of the random SFR from data of this kind; a more accurate prediction could be obtained using the Table 3 data. It is noteworthy that a similar approach could be applied to different failure modes, such as short or open circuits, leakage current, charge accumulation, etc.