Long-Term Ultrasonic Benchmarking for Microstructure Characterization with Bayesian Updating

: Ultrasonic non-destructive characterization is an appealing technique for identifying the microstructures of materials in place of destructive testing. However, the existing ultrasonic characterization techniques do not have sufﬁcient long-term gage repeatability and reproducibility (GR&R), since benchmarking data are not updated. In this study, a hierarchical Bayesian regression model was utilized to provide a long-term ultrasonic benchmarking method for microstructure characterization, suitable for analyzing the impacts of experimental setups, human factors, and environmental factors on microstructure characterization. The priori distributions of regression parameters and hyperparameters of the hierarchical model were assumed and the Hamilton Monte Carlo (HMC) algorithm was used to calculate the posterior distributions. Characterizing the nodularity of cast iron was used as an example, and the benchmarking experiments were conducted over a 13-week transition period. The results show that updating a hierarchical model can increase its performance and robustness. The outcome of this study is expected to pave the way for the industrial uptake of ultrasonic microstructure characterization techniques by organizing a gradual transition from destructive sampling inspection to non-destructive one-hundred-percent inspection.


Introduction
The microstructure of a material is a critical factor that influences and limits its qualities, affecting the material fatigue strength, yield strength, creep characteristics, and other mechanical properties [1][2][3][4][5]. As a result, it is critical to precisely characterize the microstructure of materials. Microstructure characterization methods include destructive and non-destructive methods [6][7][8], and ultrasonic characterization techniques have gained popularity in the non-destructive testing (NDT) community because they can measure internal microstructure, cause no harm to the human body, and are inexpensive [9]. Theoretical modeling [10,11], numerical simulation [12][13][14], and ultrasonic benchmarking [15] are three types of ultrasonic microstructure characterization techniques.
Theoretical modeling methods use elastodynamics theories, micromechanics theories, and crystallography theories to define the models of direct or indirect measurements for deriving microstructure parameters or elastic tensors. However, it is challenging to establish a model of velocity, attenuation, backscattering, and non-linear coefficients for some materials with complicated microstructures [10]. Numerical simulation methods can calculate the ultrasonic characteristics of materials with complicated microstructures [13,14], but their accuracy is dependent on mesh creation and boundary conditions [12]. Furthermore, three-dimensional simulation has poor computing efficiency, and surrogate models may be necessary for practical applications.
Currently, the tools for ultrasonic benchmarking comparisons include least-square fitting, machine learning, and Bayesian regression [15]. Ordinary least-square fitting techniques can provide clear empirical formulas for linking ultrasonic characteristics to microstructure parameters, and their connection can even be non-causal, as principal components analysis or specialized signal analysis methodologies are used (Lyapunov exponent, fractal dimension, recurrence analysis, etc.). The resultant models, however, are oversimplified due to the omission of many uncertainties [16][17][18], and the impacts of coupling factors or multi-variable factors on ultrasonic characteristics cannot be eliminated [19]. Machine learning methods are data-driven, which reduces the need for artificial selection of ultrasonic characteristics. Through repeated training, the characteristics may be automatically extracted from the original ultrasound signals by anchoring with microstructural parameters. However, the interpretability and generalizability of machine learning still need to be improved. It is commonly challenging to collect a big data set in the application of ultrasonic non-destructive characterization [20]. Additionally, the experimental data for least-square fitting and machine learning are often obtained across a single or several days. Thus, it is impossible to check their long-term gage repeatability and reproducibility (GR&R).
On the contrary, Bayesian regression methods are updatable and interpretable, and they can treat uncertain parameters as random variables, demonstrating their potential in the field of NDT. Liu and colleagues have been working on Bayesian estimations of fatigue damage prognosis and yield strength with ultrasonic inspection data throughout the last 12 years [21][22][23][24][25]. Their contributions involved the maximum entropy approach, reversible jump Markov chain Monte Carlo (MCMC) method, baseline quantification model, missing data disposal, and Bayesian network technique. Tant et al. [26] used a Bayesian analysis of simulated ultrasonic phased array data to reconstruct inhomogeneous wave speed maps for flaw imaging in random media. More recently, Bai and Drinkwater et al. [27,28] also employed Bayesian inference for defect characterization based on ultrasonic phased array imaging. They compared the results of defect information inversion using Bayesian inference versus machine learning. However, the aforementioned studies ignored the impacts of uncertainties in the experiment setup, human factor, and environmental factor, all of which might affect the Bayesian estimation of microstructure parameters.
Hierarchical Bayesian regression models allow the exploration of the influence of minor factors on major factors [29], which is a potential option for ultrasonic benchmarking in practical applications. Aldrin et al. [30] developed a hierarchical Bayes' model for estimating the probability of detection (POD) and the model-assisted POD for the eddy-current non-destructive inspection by considering random variables in physics-based models. Chiachío et al. [31] proposed a hierarchical Bayesian approach for damage identification in composite laminates, which took into account the model parameters, damage hypotheses, and damage patterns. Nevertheless, Refs. [30,31] do not describe how to deal with the residual between the regression function and destructive metallographic observation using the Bayesian confidence interval.
In this study, the hierarchical Bayesian regression model proposed by Austin et al. [29] for health research was modified for ultrasonic non-destructive characterization. First, data with a hierarchical structure were defined by ultrasonic characteristics that depending on various experiment setups, human factors, and environmental factors. Next, the priori distributions of regression parameters and hyperparameters were assumed, and the Hamilton Monte Carlo (HMC) algorithm [32] was used to calculate the posterior distributions. An example of classical application for characterizing the nodularity of ductile iron was then revisited without loss of generality. We collected equal amounts of metallographic and ultrasonic experimental data weekly, and we updated the hierarchical model periodically. Through a data collection process that lasted several months, the hierarchical Bayesian regression model was built. Finally, it was compared against the least-square fitting, traditional Bayesian regression model and a hierarchical model that has not been updated.

Traditional Bayesian Regression Model
To begin, a brief review of traditional Bayesian regression models is necessary, accompanied by a declaration of our nomenclature. Suppose just one microstructure parameter needs to be characterized, for instance. N is the number of benchmarking specimens that will be updated each time, and M is the number of ultrasonic characteristics that will be measured from each benchmarking specimen. We define the input characteristic vector as {x 1 , x 2 , · · · , x N }, i = 1, 2,···, N, x i ∈ R M . We denote the microstructure parameter of the ith specimen measured by metallographic experiment as y i , and the mean of microstructure parameter of the ith specimen is µ i , such that the random fluctuation between the measured value and the mean value is ε i = y i − µ i . Assuming ε i is independent and identically distributed (i.i.d.), let ε i = ε, where ε obeys normal distribution with zero mean, i.e., ε ∼ N 0, σ 2 . σ is the standard deviation of the fluctuation term. Thus, the traditional Bayesian regression model is [29]: where the regression parameter array associated with the mean value is η = {β, α}, β j is the parameter corresponding to each ultrasound characteristic, and α is the intercept. In terms of Bayes' theorem, we have [33]: where Pr({η, σ}|y i ) is the posterior distribution. It reflects the probability that the model parameters will take values of {η, σ} when the observed data y i is obtained. Pr({η, σ}) is the prior distribution of {η, σ}, and Pr(y i ) is the marginal distribution of the observed data y i . The likelihood function Pr(y i |{η, σ}) determines the conditional probability of the observed data y i for each set of augmented parameter array {η, σ}.
If the prior distribution of η is normally distributed, the likelihood function for a single specimen is also normally distributed, as [33]: Therefore, multiplying N i.i.d. normal distributions yields the likelihood functions corresponding to all specimens, as [33]: Consequently, the posterior distributions of parameter array {η, σ} can be calculated as [33]: After obtaining the posterior distribution Pr( {η, σ}|y), we can utilize it to calculate the regression function and its associated confidence bounds. When updating the Bayesian model by adding new benchmarking specimens and their correlated ultrasonic characteristics, the posterior distribution of the old model could be utilized as the input prior distribution of the new model.

Hierarchical Bayesian Regression Model
Considering that the establishment of ultrasonic benchmarking is inextricably impacted by the experiment setups, human factors, and environmental factors, the traditional Bayesian regression model is insufficient for reliable microstructure characterization. When analyzing data having a multilevel or hierarchical structure, a hierarchical Bayesian regression model might be employed to avoid overfitting or underfitting. Meanwhile, the hierarchical structure is often depicted as distinct groups, and the hierarchical model facilitates the discovery of data relationships between the groups. Additionally, the hierarchical model would apply appropriate treatment to the uncertainty of each group, avoiding the estimation results being dominated by one group's data [34]. Figure 1 is our flow diagram of using hierarchical model to achieve long-term ultrasonic benchmarking for microstructure characterization.

Hierarchical Bayesian Regression Model
Considering that the establishment of ultrasonic benchmarking is inextricably impacted by the experiment setups, human factors, and environmental factors, the traditional Bayesian regression model is insufficient for reliable microstructure characterization. When analyzing data having a multilevel or hierarchical structure, a hierarchical Bayesian regression model might be employed to avoid overfitting or underfitting. Meanwhile, the hierarchical structure is often depicted as distinct groups, and the hierarchical model facilitates the discovery of data relationships between the groups. Additionally, the hierarchical model would apply appropriate treatment to the uncertainty of each group, avoiding the estimation results being dominated by one group's data [34]. Figure 1 is our flow diagram of using hierarchical model to achieve long-term ultrasonic benchmarking for microstructure characterization.  For example, it is assumed that there are K1 types of experiment setups settings, K2 human factors, and K3 environmental factors for ultrasonic benchmarking. Thus, a total of . In this study, we developed a hierarchical Bayesian regression model with a two-level structure using the aforementioned destructive and non-destructive data, where the base level's regression parameters were utilized as the dependent variables for the second level. We assumed that ( ) k η could be determined by another For example, it is assumed that there are K 1 types of experiment setups settings, K 2 human factors, and K 3 environmental factors for ultrasonic benchmarking. Thus, a total of K 1 C 1 K 2 C 1 K 3 C 1 = K sets of ultrasonic measurements for each benchmarking specimen should be undertaken, where n C r is the number of combination, n is the total number of elements in the set, and r is the number of elements chosen from this set. The measured microstructure parameter of the ith specimen in the kth (k = 1, 2, · · · , K) group is y i (k), and its mean value µ i (k) is calculated by the regression parameter array η(k) = {β(k), α(k)}. In this study, we developed a hierarchical Bayesian regression model with a two-level structure using the aforementioned destructive and non-destructive data, where the base level's regression parameters were utilized as the dependent variables for the second level. We assumed that η(k) could be determined by another hyperparameter array κ η = γ η ,ν η , i.e., a multi-dimensional joint prior to distribution Pr( η| κ η ) could be used to describe the correlation between {η(1), η(2), . . . , η(K)}. As a result, by specifying a hyper-prior distribution Pr(κ η ) for the hyperparameter κ η , the Bayesian estimation of the regression parameter array η(k) = {β(k), α(k)} could be implemented.
First, the first level of the hierarchical Bayesian regression model is [29]: Next, all regression parameters at the first level were rewritten as dependent variables. We presumed that the intercept term α(k) at the first level had an mean term of µ α (k) and a random fluctuation of ε α , and that the new intercept term and slope terms of µ α (k) at the second level were γ α and ν α . Similarly, the slope term β j (k) at the first level was supposed to have an mean term of µ βj (k) and a random fluctuation of ε βj , and that the new intercept term and slope terms of µ β j (k) at the second level were γ β j and ν β j . Thus, the second level of the hierarchical model is [29]: where x l (k) are called the covariates. They serve as independent variables at the second level, assessing the difference of ultrasound characteristics of each group [29]. We assumed that the random fluctuations ε α and ε β j followed a multidimensional joint normal distribution with zero means. Their covariance matrix is denoted by Cov, as [32]: where the diagonal elements σ 2 α and σ 2 β j are the variances of ε α and ε β j , respectively. cov a,b are the covariances between a and b (a, b = α, β 1 , β 2 , · · · , β M ). Finally, substituting Equation (7) into (6), the hierarchical Bayesian regression model could be obtained as [29]: On the right side of Equation (9), the first row represents the fixed-effects term, while the second row represents the random-effects term. Both are associated with the group number k, and the other terms without k communicate information between groups. As a result of communications, there is a distinction between developing a hierarchical model with k-groups and building k traditional models individually. This helps the model to take into account both the information about each group and the overall information, and to synthesize all the learned knowledge in order to effectively improve each group's prediction accuracy.
Following the establishment of a hierarchical Bayesian regression model, the prior distributions of the parameters must be assumed, and the posterior distributions must be estimated. Among the several Bayesian parameter estimation strategies, the traditional Markov chain Monte Carlo approach can only find the optimal solution by random walk. However, the HMC algorithm is more efficient in analyzing the state space than the MCMC algorithm due to the application of the concept of dynamics in physical systems. In the next section, we will utilize the HMC algorithm to calculate the future states of the Markov chains in order to obtain faster convergence.

Experiments and Results
The study's overall goal was to propose a gradual transition from destructive sampling inspection to non-destructive one-hundred-percent inspection through the use of long-term ultrasonic benchmarking with Bayesian updating. As an illustration, we reconstructed a classical application that characterized the nodularity of ductile iron. Although ultrasonic nodularity characterization was accomplished 40 years ago [35], it is still not widely available in the industrial applications (at least not in China's factories). For this reason, we established a transition period and a long-term experiment during which destructive sampling inspections were continued and additional ultrasonic measurements were conducted. The ultrasonic benchmarking procedure was performed weekly, and data from the current week were utilized to update the model from the previous week. The transition period was sustained until the Bayesian estimation results were accurate enough to meet industrial requirements. In addition, the impact of experimental conditions on ultrasonic benchmarking was discovered. A series of damping levels were employed to reflect a variety of experimental setups. For human factors, inspectors with varied grades of expertise were engaged. Environmental factors were represented by varying electromagnetic noise levels. The detailed steps and results of the experiments are shown below.

Experimental Preparations
The chemical composition of ductile iron QT500-7 (China, GB) was (in wt.%): 3.60-3.80 C, 2.50-2.90 Si, Mn ≤ 0.60, P ≤ 0.08, S ≤ 0.025, and Fe, balance. Our experiment lasted for 13 weeks, from November 1st to January 30th. Weekly tests were conducted on ten benchmarking specimens (30 mm × 30 mm × 25 mm) cut from various portions of blanks and subjected to metallographic and ultrasonic measurements (12,000 mm × 2000 mm × 30 mm). After grinding and polishing, five metallographic images of each specimen were acquired using an Olympus optical microscope. The graphite particles in the metallographic images were then statistically measured to determine the nodularity [36]. For each benchmarking specimen, the mean value and standard deviation of the five metallographic images were used as the results of the destructive measurement.   Next, water immersion ultrasonic measurements were performed on ten specimens every week. As shown in Figure 1, the ultrasonic system consisted of a JSR DPR 300 ultrasonic pulse generator/receiver (Imaginant Inc., Pittsford, NY, USA), a 200 MHz ADLINK PCIe-9852 DAQ card, an Olympus immersion transducer (I3-0508, 0.5 inch, 5 MHz, Olympus, Tokyo, Japan), and a computer-controlled micro-positioning system. A dual-mode ultrasonic velocity method was used to obtain the longitudinal wave velocity and transverse wave velocity [37]. A total of 12 groups of experimental conditions are shown in Table 1. Different damping levels were chosen since junior inspectors in the factory Next, water immersion ultrasonic measurements were performed on ten specimens every week. As shown in Figure 1, the ultrasonic system consisted of a JSR DPR 300 ultrasonic pulse generator/receiver (Imaginant Inc., Pittsford, NY, USA), a 200 MHz ADLINK PCIe-9852 DAQ card, an Olympus immersion transducer (I3-0508, 0.5 inch, 5 MHz, Olym-pus, Tokyo, Japan), and a computer-controlled micro-positioning system. A dual-mode ultrasonic velocity method was used to obtain the longitudinal wave velocity and transverse wave velocity [37]. A total of 12 groups of experimental conditions are shown in Table 1. Different damping levels were chosen since junior inspectors in the factory change the settings at whim. The maximum value, default value, and minimum value of damping in DPR 300 are 333, 49, and 25 Ohms, respectively. Consequently, we selected these three representative values for the experiments. Junior and senior inspectors were chosen based on their different operating skills and understanding of ultrasonic characterization. Different noise levels were chosen, owing to the inevitability of electromagnetic noise in the factory, and the likelihood that the double-shielded cable would be broken due to frequent use. Thus, a standard cable (with a high level of noise) and a double-shielded cable (with a low level of noise) were utilized to imitate this scenario. In addition, the Bayesian models in this study were mostly implemented using the BRMS package from Stan software (R package version 2.21.3). With the assistance of Stan software for HMC, a fast posterior distribution estimation for Equation (9) may be accomplished. Running the BRMS software in particular facilitates the rapid convergence of hierarchical model, regardless of whether the prior distributions are conjugated or not.

Results for First
Week without Updating Figure 3 shows the relationships between nodularity, longitudinal wave velocity c L , and transverse wave velocity c T , based on 12 groups of experiments conducted on ten specimens during the first week. The results show that the nodularity was proportional to the c L and c T , which agreed with earlier experimental findings. An interesting finding was that the specific results from the 12 groups of experimental conditions varied slightly. When building a least-square fitting model or a traditional Bayesian regression model, certain users may ignore the impact of the experimental conditions. Therefore, we first built a traditional model by combining all of the data and then built a hierarchical model to see whether any discrepancies existed. The slope terms β 1 and β 2 of these two models are related to c L and c T in the following.

Traditional Model
By mixing the 120 data points collected during the first week, a traditional Bayesian regression model was built. α, β 1 , β 2 , and σ were the parameters that were to be calculated. Four Markov chains were set up for each parameter, and each chain was iterated 4000 times, containing 1000 warm-up and 3000 actual sampling results. The four Markov chains for each parameter are shown in Figure 4a1-d1. Each chain began with varying values, but all the chains converged after the warm-up step (shown by the blue background). In terms of the kernel density estimation (KDE) for the actual sampling results of 4 × 3000 data [38], the posterior distributions of each parameter in the traditional model are shown in Figure 4a2-d2. From the posterior distributions, the mean, standard deviation, and 90% confidence bounds of each parameter can be indicated.

Results for First
Week without Updating Figure 3 shows the relationships between nodularity, longitudinal wave velocity cL, and transverse wave velocity cT, based on 12 groups of experiments conducted on ten specimens during the first week. The results show that the nodularity was proportional to the cL and cT, which agreed with earlier experimental findings. An interesting finding was that the specific results from the 12 groups of experimental conditions varied slightly. When building a least-square fitting model or a traditional Bayesian regression model, certain users may ignore the impact of the experimental conditions. Therefore, we first built a traditional model by combining all of the data and then built a hierarchical model to see whether any discrepancies existed. The slope terms β1 and β2 of these two models are related to cL and cT in the following.

Traditional Model
By mixing the 120 data points collected during the first week, a traditional Bayesian regression model was built. α, β1, β2, and σ were the parameters that were to be calculated. Four Markov chains were set up for each parameter, and each chain was iterated 4000 times, containing 1000 warm-up and 3000 actual sampling results. The four Markov chains for each parameter are shown in Figure 4a1-d1. Each chain began with varying values, but all the chains converged after the warm-up step (shown by the blue background). In terms of the kernel density estimation (KDE) for the actual sampling results of 4 × 3000 data [38], the posterior distributions of each parameter in the traditional model are shown in Figure 4a2-d2. From the posterior distributions, the mean, standard deviation, and 90% confidence bounds of each parameter can be indicated.   According to the posterior distribution of the parameters, the regression function and its associated confidence bounds can be obtained. Figure 5a,b show two cross-sections of the regression function used to characterize nodularity, one at cT = 3070 m/s and the other at cL = 5606 m/s, respectively. These two particular velocity values correspond to the mean values for cT and cL in the mixing dataset. The light blue areas illustrate the 90% confidence According to the posterior distribution of the parameters, the regression function and its associated confidence bounds can be obtained. Figure 5a,b show two cross-sections of the regression function used to characterize nodularity, one at c T = 3070 m/s and the other at c L = 5606 m/s, respectively. These two particular velocity values correspond to the mean values for c T and c L in the mixing dataset. The light blue areas illustrate the 90% confidence intervals for nodularity, the dark blue areas illustrate the 90% confidence intervals for the mean value of nodularity, and the red lines illustrate the mathematical expectations for the mean value of nodularity. To verify the model further, the residuals were calculated and analyzed as shown in Figure 6. The red bars represent the absolute values of the residual. The blue bars represent the mean of the red bars in each group, and the maximum deviation of group mean residual was 1.659% (between Group #9 and Group #12). The green line represents the mean of the blue bars, indicating that the total mean residual was 1.604%. Additionally, the residuals were all covered within the confidence intervals, shown by the light blue areas in Figure 5. In addition, the estimations of the nodularity of validation specimens T-1 and T-2 are shown in Section 3.3. Therefore, it is evident that the traditional Bayesian regression model's estimation accuracy was low, the range of confidence intervals was large, and the model has to be optimized.
The iterations of the Markov chain were plotted to examine if they converged, and some of the results are depicted in Figure 7. The results of Markov chain sampling for the parameters γ α , ν α,1 , γ β 1 , ν β 1 ,1 , σ, σ α , cov α,β 1 , and cov β 1 ,β 2 are shown in Figure 7a1-h1. The associated posterior distributions are shown in Figure 7a2-h2, and we can see that the posterior distributions of the random effect terms of second level, such as σ α , cov α,β 1 , and cov β 1 ,β 2 , were not normally distributed. The mean value, standard deviation, and 90% confidence interval for each parameter are given as shown in Table 2.

Hierarchical Model
Next, the parameters of the hierarchical Bayesian regression model were estimated using 12 groups of experimental condition, including the fixed effect terms γα,   The preceding parameters' estimation results were used to calculate the regression function and to characterize nodularity. Figure 8a1,a2, like Figure 5a Figure 8b1,b2,c1,c2,d1,d2 correspond to Groups #6, #7, and #10. As noted, the confidence intervals for Groups #1 and #6 were rather large. This was because a higher level of electromagnetic noise led to in increased uncertainty, and the junior inspector was more likely to bias the experimental results. Furthermore, Figure 9 depicts the hierarchical model's residuals. Both Groups #7 and #10 had relatively low residuals, indicating that the effects of damping setting on ultrasonic nodularity characterization were limited. Additionally, the hierarchical model's total mean of residual was 1.175%, which is 0.429% lower than the total mean residual in the traditional model. 1 2 , cov β β , were not normally distributed. The mean value, standard deviation, and 90% confidence interval for each parameter are given as shown in Table 2.   (a2-h2) Posterior distributions of parameters γ α , ν α,1 , γ β 1 , ν β 1 ,1 , σ, σ α , cov α,β 2 , and cov β 1 ,β 2 .
In addition, the model can be evaluated according to the information law, and the widely applicable information criterion (WAIC) is calculated as [39]: where LPPD is the log pointwise predictive density, defined as the sum of the mean of likelihood corresponding to all observations, and p WAIC is the number of valid parameters obtained from the variance of the log-likelihood values of the observation. A small value for p WAIC indicates a simple model, while a large value for LPPD indicates an accurate model, implying that the WAIC evaluation is a balance between simplicity and accuracy. The WAIC value of a traditional Bayesian model is 515.1, but the WAIC value of a hierarchical Bayesian model is 488.3, demonstrating that the hierarchical model is more effective.
noted, the confidence intervals for Groups #1 and #6 were rather large. This was because a higher level of electromagnetic noise led to in increased uncertainty, and the junior inspector was more likely to bias the experimental results. Furthermore, Figure 9 depicts the hierarchical model's residuals. Both Groups #7 and #10 had relatively low residuals, indicating that the effects of damping setting on ultrasonic nodularity characterization were limited. Additionally, the hierarchical model's total mean of residual was 1.175%, which is 0.429% lower than the total mean residual in the traditional model.  In addition, the model can be evaluated according to the information law, and the widely applicable information criterion (WAIC) is calculated as [39]: where LPPD is the log pointwise predictive density, defined as the sum of the mean of likelihood corresponding to all observations, and pWAIC is the number of valid parameters obtained from the variance of the

Traditional Model
To update the original traditional model weekly, 120 benchmarking data of nodularity and ultrasonic longitudinal and transverse wave velocities were collected from ten benchmarking specimens using the aforementioned procedure. Figure 10 shows the evolution of the posterior distribution of parameter β 2 from the first week to the thirteenth week. It can be seen that the standard deviation of posterior distribution of β 2 gradually decreased as the sampling size of data increased with each model's updating. Figure 10 shows the evolution of posterior distribution of parameter β 2 from the first week to the thirteenth week. The standard deviation of the posterior distribution of β 2 gradually decreased as the sample size of data increased with each updating. After a long-term Bayesian updating for the traditional model, the results of the regression function and its associated confidence bounds are shown in Figure 11, which still had two cross-sections. Compared with Figure 5, both the 90% confidence intervals for nodularity and the mean value of nodularity given by the updated model were significantly smaller, due to the larger sampling size of the input data. Because the experiment was repeated for three months, the long-term gage repeatability and reproducibility of the updated Bayesian model were superior to those of the non-updated Bayesian model, and that is self-evident.

Hierarchical Model
Finally, we divided all of the benchmarking data into 12 groups depending on the different experimental conditions to update the hierarchical model. Based on a total of thirteen weeks of data, Figure 12 shows the best regression function and accompanying 90% confidence bounds from Group #7. The confidence intervals were noticeably smaller in comparison to Figure 8c1,c2 without updating. Table 3 summarizes the results of the nodularity estimation on validation specimens T-1 and T-2 using various methods, including the metallographic method, least-square fitting method of all data, traditional and hierarchical Bayesian regression models for the first week, traditional and hierarchical models for the sixth week, and traditional and hierarchical models for the thirteenth week. As can be seen, the least-square fitting approach was rarely used to determine the standard deviation and confidence interval for the estimated nodularity. The hierarchical model performed better than the traditional model, whether updated or not. When comparing the hierarchical models for the first, the sixth, and the thirteenth weeks, it was found that the updated model had improved accuracy and a reduced standard deviation. After longterm Bayesian updating, the confidence intervals got shorter, and the residual between the regression function and destructive metallographic observation could be described by the confidence intervals. In addition, the estimation results of the updated hierarchical

Hierarchical Model
Finally, we divided all of the benchmarking data into 12 groups depending on the different experimental conditions to update the hierarchical model. Based on a total of thirteen weeks of data, Figure 12 shows the best regression function and accompanying 90% confidence bounds from Group #7. The confidence intervals were noticeably smaller in comparison to Figure 8c1,c2 without updating. Table 3 summarizes the results of the nodularity estimation on validation specimens T-1 and T-2 using various methods, including the metallographic method, least-square fitting method of all data, traditional and hierarchical Bayesian regression models for the first week, traditional and hierarchical models for the sixth week, and traditional and hierarchical models for the thirteenth week. As can be seen, the least-square fitting approach was rarely used to determine the standard deviation and confidence interval for the estimated nodularity. The hierarchical model performed better than the traditional model, whether updated or not. When comparing the hierarchical models for the first, the sixth, and the thirteenth weeks, it was found that the updated model had improved accuracy and a reduced standard deviation. After longterm Bayesian updating, the confidence intervals got shorter, and the residual between the regression function and destructive metallographic observation could be described by the confidence intervals. In addition, the estimation results of the updated hierarchical

Hierarchical Model
Finally, we divided all of the benchmarking data into 12 groups depending on the different experimental conditions to update the hierarchical model. Based on a total of thirteen weeks of data, Figure 12 shows the best regression function and accompanying 90% confidence bounds from Group #7. The confidence intervals were noticeably smaller in comparison to Figure 8c1,c2 without updating. Table 3 summarizes the results of the nodularity estimation on validation specimens T-1 and T-2 using various methods, including the metallographic method, least-square fitting method of all data, traditional and hierarchical Bayesian regression models for the first week, traditional and hierarchical models for the sixth week, and traditional and hierarchical models for the thirteenth week. As can be seen, the least-square fitting approach was rarely used to determine the standard deviation and confidence interval for the estimated nodularity. The hierarchical model performed better than the traditional model, whether updated or not. When comparing the hierarchical models for the first, the sixth, and the thirteenth weeks, it was found that the updated model had improved accuracy and a reduced standard deviation. After long-term Bayesian updating, the confidence intervals got shorter, and the residual between the regression function and destructive metallographic observation could be described by the confidence intervals. In addition, the estimation results of the updated hierarchical model for the last week were comparable to the metallographic results, requiring the need for industrial application.

Conclusions
In this study, we proposed a long-term ultrasonic benchmarking method for microstructure characterization, along with a hierarchical Bayesian regression model, and analyzed the impacts of experimental setups, human factors, and environmental factors. The ultrasonic non-destructive characterization of the nodularity of cast iron was taken as an example. The benchmarking experiments were carried out over a transition period of 13 weeks. The results indicate that as the hierarchical model was considered the specialty of

Conclusions
In this study, we proposed a long-term ultrasonic benchmarking method for microstructure characterization, along with a hierarchical Bayesian regression model, and analyzed the impacts of experimental setups, human factors, and environmental factors. The ultrasonic non-destructive characterization of the nodularity of cast iron was taken as an example. The benchmarking experiments were carried out over a transition period of 13 weeks. The results indicate that as the hierarchical model was considered the specialty of employing diverse experiment conditions, it worked better than the traditional Bayesian regression model, whether updated or not. Comparing the first-, sixth-, and thirteenth-weeks' results, we observed that the updated hierarchical model had a more accurate mean value, a smaller standard deviation, and shorter confidence intervals. Moreover, the previous week's updated hierarchical model provided comparable outcomes to the metallographic testing, bringing the transition period to an end. The proposed method was limited in that independent benchmarking and Bayesian updating were necessary for materials of varying grades, meaning that the regression models could not be transferred. In the next study, we will combine a Bayesian network with a transfer learning network to achieve ultrasonic benchmarking. We may also expand our method to characterize other microstructure parameters, such as the porosity of additive manufacturing materials, the content of second phase, and the grain size of polycrystals in the future.