1. Introduction
Additive manufacturing (AM) has been defined as “a process of joining materials to make objects from 3D model data, usually layer upon layer, as opposed to subtractive manufacturing methodologies” [
1]. The goal of AM is to rapidly and efficiently fabricate intricately designed parts, which may be impossible to produce with traditional machining techniques [
2] using a wide array of materials, including polymers, metals, and ceramics [
3]. Compared to traditional manufacturing, AM facilitates the production of highly customized and complex parts [
4]; however, these benefits often imply an increase in economic investment and production time [
5,
6,
7] along with poor geometric accuracy [
8], repeatability, and reproducibility [
9]. Coupled with these facts, the lack of standardization of best practices has slowed the development of quality assurance and control in AM [
10]. Multiple institutions have attempted to implement industrial criteria for AM, including the International Organization for Standardization, ASTM International, the National Aeronautics and Space Administration, and America Makes [
11,
12,
13,
14,
15], but benchmarks across these guidelines vary. For example, standards such as AMS7003, AMS7032, ISO/ASTM 52930, ISO/ASTM and NASA-STD-6030 [
16,
17,
18,
19,
20,
21] can be used for a type of AM called laser-based powder bed fusion of metals (PBF-LB/M), but each requires different sets of measurements, even though there is overlap among the various guidelines.
Recent work has continued to emphasize that qualification and certification remain major barriers to the broader industrial deployment of metal additive manufacturing. In high-consequence applications, qualification still depends heavily on costly inspection, destructive testing, and part-specific evidence, while the connection between in situ process signals and final part quality remains incomplete [
22,
23]. At the same time, recent reviews of powder bed fusion quality assessment and metal–laser-AM monitoring have highlighted substantial progress in sensing, monitoring, and control, but also persistent gaps in standardization, comparability, and practical industrial implementation [
24,
25]. These challenges extend beyond sensor development alone: recent work on standardized evaluation of in situ process-monitoring systems underscores that the field still lacks broadly accepted frameworks for objectively comparing monitoring outputs and deciding whether a modified process remains acceptably aligned with a trusted baseline [
26]. Together, these developments reinforce the need for practical, interpretable, and sample-efficient statistical methods for AM process comparability, requalification, and change control.
Some of these attempts at standardization can introduce other problems as well. The quality of parts printed by PBF-LB/M can also be assessed with the Metallic Materials Properties Development and Standardization (MMPDS) Handbook [
27]. This handbook outlines statistical guidelines for first characterizing the distributions of various mechanical properties (ultimate tensile strength, elastic modulus, elongation, etc.) and then uses a traditional hypothesis testing framework to determine whether the property meets some minimal standard. This framework, however, does not consider the high cost of samples and smaller average batch sizes of PBF-LB/M. First, sample size requirements depend on the shape of the underlying distribution of the mechanical property being tested. The shape of the underlying distribution may not be known a priori, and if nonparametric methods are required, then sample size requirements can also be prohibitively large (up to 299 samples, [
27]). These large sample sizes can make the statistics robust to outliers, but if sample sizes are kept small, these outliers can have a disproportionate effect on parameter estimates. Finally, MMPDS depends on using traditional statistics to determine whether mechanical properties meet the minimally desired level. Determining whether two processes produce equivalent results is critical for qualification [
28], calibration [
29], and validation [
30], and since statistical tests such as the t-test can be used to determine whether summary statistics such as the mean differ between groups, at first blush this looks like a logical framework. However, traditional statistical practices are not well equipped to determine whether two populations are equivalent [
31,
32].
Under the traditional hypothesis testing framework, we assume that the shape of the underlying distribution is known and that we have some evidence of the values of the parameters defining that population, either from historical information or from statistical inference based on samples from the population. Thus, hypotheses are usually made around these values to determine if a sample came from a population whose shape is defined by the hypothesized values. For instance, a classical null hypothesis is that two populations are the same in terms of the values of the parameters defining them; the alternative hypothesis is that they are different. This places the burden of proof on the wrong hypothesis, namely that there is a difference between the groups. Because a failure to reject the null is then read as evidence of sameness, it is relatively easy to conclude incorrectly that two populations are the same when in fact they are not. This issue is magnified in studies where samples are expensive and thus scarce. These studies lack statistical power [
33], so populations will appear equivalent even when they are not.
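The low-power problem described above can be illustrated with a small simulation. The following is a minimal sketch, not part of the proposed method: it assumes normally distributed data with unit variance, five samples per group, and a pooled two-sample t-test at the 5% level, and all function names are hypothetical.

```python
import random
import statistics

N_PER_GROUP = 5
# Two-sided 5% critical value of Student's t with 2 * 5 - 2 = 8 degrees of freedom.
# This constant is only valid for N_PER_GROUP = 5.
T_CRITICAL = 2.306

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
    std_err = (pooled_var * 2 / len(x)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / std_err

def rejection_rate(true_shift, reps=5000, seed=1):
    """Fraction of simulated experiments in which the t-test reports a difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        reference = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        candidate = [rng.gauss(true_shift, 1.0) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(reference, candidate)) > T_CRITICAL:
            rejections += 1
    return rejections / reps

# A real shift of half a standard deviation is detected only rarely at this sample size.
print("detection rate with a true shift of 0.5 SD:", rejection_rate(0.5))
print("false alarm rate with no shift:", rejection_rate(0.0))
```

Under these assumptions, most experiments fail to reject the null even though the populations differ, which is the sense in which scarce samples make non-equivalent populations appear the same.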
Equivalence tests were developed to resolve these issues [
30,
34]. Equivalence tests work by defining an acceptance region (the indifference zone; [
35]) for the differences in means between two or more groups [
36]. In one formulation of these tests, if the confidence interval for the differences between the groups lies within this region, the two are considered equivalent. If it lies outside of the region, or the confidence interval is so wide that it encompasses both limits of the acceptance region (which would happen for small sample sizes), then the populations are considered to be non-equivalent [
31].
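The confidence-interval formulation described above can be sketched as follows. This is an illustrative example only: it assumes a large-sample normal approximation for the interval (a t-based interval would be more appropriate for very small samples), and the function name, the `margin` argument, and the example data are hypothetical.

```python
from statistics import NormalDist, mean, variance

def ci_equivalence(reference, candidate, margin, alpha=0.05):
    """Return (is_equivalent, (lower, upper)) for the difference in means.

    The groups are declared equivalent only if the whole interval lies
    inside the acceptance region (-margin, +margin).
    """
    diff = mean(candidate) - mean(reference)
    std_err = (variance(reference) / len(reference)
               + variance(candidate) / len(candidate)) ** 0.5
    # A (1 - 2 * alpha) interval corresponds to two one-sided tests at level alpha.
    z = NormalDist().inv_cdf(1 - alpha)
    lower, upper = diff - z * std_err, diff + z * std_err
    return (-margin < lower and upper < margin), (lower, upper)

reference = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
similar = [10.0, 10.1, 9.9, 10.2, 9.9, 10.0, 10.1, 9.8]
shifted = [value + 1.0 for value in similar]
too_few = [9.0, 11.0]  # same mean as the reference, but a very wide interval

for label, candidate in [("similar", similar), ("shifted", shifted), ("too few", too_few)]:
    equivalent, interval = ci_equivalence(reference, candidate, margin=0.5)
    print(label, equivalent, interval)
```

The third case shows the behavior noted in the text: a small, noisy sample produces an interval that spans both limits of the acceptance region, so equivalence is not declared even though the means agree.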
Equivalence testing has seen extensive use in psychological and medical research [
29,
36,
37,
38,
39], but has seen little use in manufacturing or quality control [
32,
40]. This could be due to the fact that equivalency tests are still limited in their applicability. Fundamentally, the standard equivalence test only determines equivalency with respect to differences in a point estimate of a distribution (which is almost always the mean). Differences in other aspects of the distribution (variance, skew, etc.) only matter insofar as they influence the mean. Additionally, various methods of realizing these equivalency tests depend in part on the underlying distribution, and a mismatch of distribution and method can lead to biased results [
31].
Equivalence tests are also incomplete. These tests are only formulated for single variables, but manufactured parts can differ in potentially dozens of different relevant dimensions [
41], so equivalency with respect to a single dimension is not sufficient. These variables may also be best characterized by different distributions, such as normal, lognormal, and Weibull [
42], so the corresponding statistical test should be agnostic to the type of distribution. Finally, equivalency tests are often presented in a vacuum, when in reality the assumptions of the test (especially independence of measurements) also need to be explicitly tested. These tests, then, should be presented in a larger framework where other features of the experiment (such as stability) are assessed in conjunction with equivalency.
Other AM qualification and comparability approaches generally fall into three broad categories: specification-based acceptance methods tied to particular material/property standards, classical hypothesis-testing approaches focused on differences in summary statistics, and process-monitoring or high-throughput screening workflows intended to accelerate parameter exploration [
22]. Each has clear value, but none directly resolves the specific problem addressed here: determining whether a candidate process is sufficiently similar to a stable reference distribution for a single response variable while remaining robust to distributional form and economically feasible at small sample sizes. Specification-based approaches are often application-specific and may require large datasets or stronger assumptions than are practical in some qualification settings; mean-based comparisons can miss changes in variance or tail behavior; and high-throughput workflows, while powerful in LPBF and related contexts, do not eliminate the need for decision rules when throughput is limited or artifacts are expensive [
43]. Accordingly, there remains a need for a lightweight equivalency framework that is nonparametric, explicit about stability assumptions, and compatible with sequential stopping.
Here, we define a process that minimizes the cost of an experiment by discretizing the distribution of the performance metric and by evaluating information as the experiment progresses, stopping when there is sufficient information to determine whether or not two processes are equivalent. Discretization limits the resolution of the distribution, so smaller sample sizes are needed to resolve it. Sequential sampling allows the user to stop the experiment before the maximal sample size is attained, and thus can save resources. By using a nonparametric approach, we minimize the influence of extreme observations and avoid committing to a parametric family. However, outliers can be diagnostically meaningful in AM: they may reflect transient process faults (e.g., recoater events, spatter, parameter drift) or unusually favorable outcomes. For this reason, we view formal outlier detection and root-cause investigation as a complementary secondary analysis. We perform simulations to validate our approach, estimating Type I and Type II error rates for different experimental protocols. We integrate control charts into the process to ensure that the variable being measured is stable. Finally, we perform validation experiments to ensure that this statistical method has real-world value. To facilitate adoption and reproducibility, we provide an accompanying R package (AMEquivalency, version 3.0.0.9000) that implements control-chart assessment, percentile binning, sequential sampling schedules, and the equivalency decision rule [
44].
Figure 1 and
Box 1 summarize the workflow and the key equations used to assess equivalency. An additional tutorial is available in the GitHub repository.
Box 1. Overview of the proposed equivalency framework, showing the recommended analysis parameters, governing equations for bin construction and sequential sampling, experimental constraints, and the stopping rule used to assess equivalency between a candidate and reference process.
Variables:
| Variable | Definition | Suggested Value |
| m | Number of Subgroups | 20 |
| n | Number of Samples per Subgroup | 2 |
| b | Number of Bins | 8 |
| | Maximal Sample Size | 299 |
| | Initial Sample Size | 5 |
| T | Maximum Number of Sequential Samples | 14 |
| | Sig. Level for Non-equivalent Stopping Rule | 0.01 |
Equations:
Experimental Constraints:
Characterizing the reference distribution requires sampling m subgroups of size n and using an x-bar chart to determine whether the process is stable. If it is, then use percentiles (Equation (1)) to divide the distribution into b equally likely bins. If it is not, then perform root-cause analysis to determine the source of instability and then redraw samples. Determining whether the candidate distribution is equivalent to the reference requires a protocol for selecting the sample sizes of sequential samples (Equation (2)), measuring how evenly the observed counts are distributed across the bins (Equation (3)), converting this value to a standardized goodness-of-fit metric (Equation (4)), and comparing it to the upper bound of its expected value (Equation (5)). For each sample, check whether the metric falls within this bound. If it does, keep sampling until the maximum number of sequential samples (T) is reached. If all samples fall within this bound, the processes are equivalent. If any one of these samples triggers the non-equivalent stopping rule, then the processes are not equivalent and sampling can stop there.
2. Characterizing the Reference Distribution
Qualification and certification are critical steps for developing and manufacturing products [
45]. Qualification involves examining a prototype, a material, or a product during its production [
28]. Many different aspects of AM production can be qualified, including techniques, machines, materials, parts, and suppliers [
28]. These same processes can also be certified, which is primarily established by satisfying a centralized authority or organization [
28]. In both cases, properties of the process to be qualified or certified must be assessed and compared to some industry standard [
27]. Here, we describe how a stable reference distribution can be established for a single response variable to enable principled comparisons across builds, machines, parameter sets, or suppliers (
Figure 1A).
First, a reference process should be selected. A reference process should ideally already be certified. That is, it should be able to produce products that meet institutional requirements with respect to a performance metric, such as tensile strength [
12], although some care should be taken to ensure that such metrics are sensitive to underlying defects or other undesirable characteristics [
46]. If such a process is not available, then there should be a good reason to believe that the reference process is reliable or otherwise represents a gold standard of production. These criteria are considered representative of an acceptable reference process since they indicate a degree of repeatability (the parts produced by the machine have similar mechanical properties per material) and provide a higher level of user control (most process parameters are available to the user).
Second, a sample of the reference process should be taken. This sample should serve a dual purpose: it can be used to assess process stability, and it can be used to characterize the reference distribution. In the field of statistical quality control, a process is stable if sequential measurements show no persistent pattern and thus can be modeled as Gaussian noise [
47]. Additionally, if there are no outliers that draw samples outside of process control limits, then the process is said to be ‘in control’ [
48]. Stability is a necessary condition for the reference process, as otherwise it is untrustworthy.
In
Section 2.1, we show how control charts can be used to assess stability, and in
Section 2.2, we discuss how percentiles can be used to characterize the reference distribution. In
Section 2.3, we will give recommendations for sample sizes.
2.1. Establishing Process Stability with Control Charts
To determine whether or not a process is stable, researchers in additive manufacturing can utilize control charts [
49]. Control charts, also known as Shewhart charts or process-behavior charts, are a key tool in statistical process control (SPC). They help monitor how a process changes over time by plotting data points in time order [
50]. While there are multiple types of control charts, including CUSUM and EWMA charts [
51], we will focus on the Shewhart control chart, as it is the chart type that has been the focus of the most optimization studies [
52].
A Shewhart control chart consists of three lines and a series of points on a graph (Figure 2; [48]). The chart is built by taking measurements of the focal metric over time. The data are grouped by time period, location, or another blocking factor of the process. In an AM context, a group could comprise the products printed on a single build plate. These sets of data are called subgroups, and each subgroup will have more than one observation. One then takes the mean of each subgroup and plots these means chronologically on the control chart. The global average of these subgroups gives the center line. The lines above and below the center line (the upper and lower control limits, respectively) are derived from the center line, the range of the metric within each subgroup, and n. One then uses various rules to determine whether or not the process is in control (Table 1; [53]), although different authors use different variations of these rules (e.g., [50]). If a sample passes each of these tests, then the reference process is considered stable and can be characterized. If it does not pass, the source of special-cause variation should be identified and corrected, and samples should be retaken, although this is beyond the scope of this publication. For candidate processes, control-chart assessment is best performed after all planned samples are collected (or after enough subgroups exist to estimate limits reliably), because early control limits can be noisy and may yield false alarms (a stable process appears unstable).
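The construction of the center line and control limits can be sketched in a few lines. This is a minimal illustration rather than the implementation in the accompanying package: it assumes the standard range-based x-bar limits (center line plus or minus A2 times the mean subgroup range), it checks only the simplest rule (a subgroup mean outside the limits), and the function name is hypothetical.

```python
from statistics import mean

# Standard tabulated A2 constants for range-based x-bar charts, keyed by subgroup size n.
A2 = {2: 1.880, 3: 1.023, 4: 0.729, 5: 0.577}

def xbar_chart(subgroups):
    """Return (center, lower_limit, upper_limit, flagged_indices) for an x-bar chart."""
    n = len(subgroups[0])
    if n not in A2 or any(len(group) != n for group in subgroups):
        raise ValueError("subgroups must share one size between 2 and 5")
    subgroup_means = [mean(group) for group in subgroups]
    center = mean(subgroup_means)
    mean_range = mean(max(group) - min(group) for group in subgroups)
    upper = center + A2[n] * mean_range
    lower = center - A2[n] * mean_range
    # Only the "point beyond the control limits" rule is checked here;
    # run and trend rules would need additional logic.
    flagged = [i for i, m in enumerate(subgroup_means) if m > upper or m < lower]
    return center, lower, upper, flagged

# Hypothetical data: 19 similar subgroups followed by one shifted subgroup.
subgroups = [[9.9, 10.1] for _ in range(19)] + [[12.0, 12.2]]
center, lower, upper, flagged = xbar_chart(subgroups)
print(f"center={center:.3f}, limits=({lower:.3f}, {upper:.3f}), flagged={flagged}")
```

In this example the final subgroup falls above the upper control limit, so the process would not be considered stable until the cause is identified and samples are retaken.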
2.2. Dividing Reference Distribution into Discrete Bins
In many situations, characterizing a distribution requires fitting various families of distributions to data and then determining which of these fits the best [
54]. However, such procedures can be sensitive to outliers [
55] and since various goodness-of-fit tests can be insignificant for multiple types of distributions, there can be some ambiguity when two distribution types seem to describe the same empirical distribution well. Additionally, when determining whether or not two distributions are equivalent, it is not necessary to know the exact distribution type (if one even exists for a particular dataset). It is sufficient to know where future datapoints are likely to occur. We can therefore sidestep the issue of an arbitrary selection by characterizing the reference distribution nonparametrically. That is, we can describe the distribution with empirical percentiles rather than with point estimates of distribution parameters. These percentiles divide the distribution into equal-probability bins, and we can use these bin limits to compare the reference to the candidate process via
$\chi^2$-like goodness-of-fit tests.
Percentile binning introduces an intentional loss of resolution, and this is both a limitation and a design feature of the method. By converting a continuous distribution into
b empirical bins, the procedure creates an explicit trade-off between distributional resolution and feasibility. Increasing the number of bins improves resolution and allows finer distinctions between candidate and reference distributions, but it also increases sample-size requirements and can reduce statistical power [
56]. Conversely, using fewer bins improves feasibility and may increase power, but at the cost of reduced information content, since small shifts that occur largely within bins may be blurred and therefore remain undetected. The choice of
b should therefore be treated as a user-selected design parameter rather than a universal constant, allowing investigators to determine what level of resolution is worth the associated sampling cost. This binning strategy also provides some robustness to extreme observations, since outliers are absorbed into the first or last bins rather than exerting disproportionate influence on the full characterization procedure. More generally, the use of a
$\chi^2$-based goodness-of-fit framework can provide favorable power relative to some alternatives under certain conditions [
56], further supporting its use in small-sample settings. For these reasons, we view the proposed method as especially appropriate when investigators prefer a coarse but robust nonparametric comparison that can be supported by limited sample sizes. When the primary goal is to detect very subtle distributional shifts and larger samples are available, methods that operate directly on continuous distributions may be preferable.
To set percentile $P_i$ at index $i$ for a given reference distribution,
$$P_i = \frac{100\, i}{b}, \qquad i = 0, 1, \ldots, b.$$
For example, if $b = 4$, then $P = (0, 25, 50, 75, 100)$, where $P$ is the vector which contains each percentile. These percentiles define the ranges of each bin. For instance, if an observation of the metric of interest from the candidate process falls around the 10th percentile of that same metric for the reference process, then it would appear in the first bin, as it is bounded by the 0th and 25th percentiles.
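Percentile binning can be sketched with the Python standard library. This is a hedged example: the choice of percentile estimator (`method="inclusive"`) is an assumption, the outermost bins are treated as open-ended so that extreme observations fall into the first or last bin, and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles

def bin_edges(reference, b):
    """Interior cut points that split the reference sample into b equal-probability bins."""
    # quantiles() returns the b - 1 interior percentiles; the estimator is an assumption.
    return quantiles(reference, n=b, method="inclusive")

def bin_counts(candidate, edges):
    """Count candidate observations per bin; the first and last bins are open-ended."""
    counts = [0] * (len(edges) + 1)
    for value in candidate:
        counts[bisect_right(edges, value)] += 1
    return counts

reference = list(range(1, 41))          # stand-in for 40 reference measurements
edges = bin_edges(reference, b=4)       # interior 25th, 50th, and 75th percentiles
print(edges)
print(bin_counts(reference, edges))     # the reference itself fills the bins evenly
print(bin_counts([-100, 12, 1000], edges))
```

Because only the bin membership of each observation is recorded, an extreme candidate value contributes exactly one count to an outer bin regardless of its magnitude.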
2.3. Setting the Sample Size for the Reference Distribution
The production of parts with PBF-LB/M as well as other AM methods is expensive, so sample sizes should be kept as small as possible. Still, characterizing the reference distribution requires more than demonstrating stability alone. Although statistical control is a necessary prerequisite, the reference distribution must also be based on measurements obtained under an adequate and consistent measurement system, from sampling conditions that are representative of the machine–material–parameter configuration of interest, and with appropriate attention to spatial and temporal sources of variation that could affect the intended equivalency claim. We therefore suggest a fixed sample size method for assessing the reference distribution derived from constraints set by both control charts and $\chi^2$ tests.
Control charts require $n$ samples from $m$ chronologically ordered groups, so the total sample size is $N = mn$. Generally, $m \geq 20$ and $n \geq 2$ ([48]; although it is also possible to build control charts with single measurements where $n = 1$; [57]), so the minimal sample size needed to assess stability is $N = 20 \times 2 = 40$. The $\chi^2$ test requires that the expected number of observations for each bin is 5, so when there are $b$ bins, $N = 5b$. Setting these requirements equal to one another means that for a sample size of 40, we can assess $b = 40/5 = 8$ bins. Therefore, characterizing the reference distribution requires at least 40 samples at a resolution of 8 bins. This represents the lowest sample size limit for characterizing the reference process, although sample sizes can go up to
if the budget permits it. Here, the upper limit for
b is 60. It should be noted, however, that there is some evidence that smaller within-subgroup sample sizes (
) can outperform larger subgroup sample sizes (
n = 4–6) for
[
58] in terms of determining stability, so there may be diminishing returns from increasing the sample size. As control limits constrict with large sample sizes, many samples may appear to be out of control even when no special cause of variation is present, so the false positive rate can be artificially high with large sample sizes [
59]. Additionally, we show using simulations that the bin limits stabilize around their true values at $N = 40$ (Appendix A), so this small sample size should be feasible for many applications. That said, after qualification, the process in question may continue printing the same parts, so control charts should be used to continue monitoring stability. These additional sample points can then be pooled into the existing dataset, and bin limits can be updated accordingly (Appendix B). This recommendation of $N = 40$, then, only represents the starting point of characterizing the reference distribution.
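The sample-size arithmetic in this section can be captured in a short helper. This is a sketch under the assumptions stated in the text (a total of m times n reference samples and an expected count of 5 per bin); the function name and its arguments are hypothetical.

```python
def reference_design(m, n, expected_per_bin=5):
    """Return (total_samples, bins) for m subgroups of size n.

    The number of bins is the largest b for which each bin is still
    expected to contain `expected_per_bin` reference observations.
    """
    if m < 1 or n < 1 or expected_per_bin < 1:
        raise ValueError("m, n, and expected_per_bin must be positive")
    total_samples = m * n
    bins = total_samples // expected_per_bin
    return total_samples, bins

# The suggested starting design: 20 subgroups of 2 samples each.
print(reference_design(m=20, n=2))
```

For the suggested values of 20 subgroups of size 2, this reproduces the 40 samples and 8 bins discussed above; larger designs can be explored by changing the arguments.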
3. Optimal Sequential Sampling for Equivalency
After the reference distribution has been characterized, it can be compared to multiple candidate processes. In this situation, there is a clear null and alternative hypothesis. The null is that the candidate distribution is equivalent to that of the reference distribution. Numerically, this means that observations from the candidate process are uniformly distributed across the bins defined by the reference process. If this distribution is anything other than uniform, then the two processes are not equivalent. As there is a clear alternative hypothesis, we can save on production costs of parts by using an iterative sampling method. Here, we do not sample all at once, instead dividing the sample into smaller sub-samples and assessing equivalency on the way.
In practice, equivalency should only be evaluated for processes that are demonstrably stable with respect to the metric of interest. In this work, candidate-process stability is treated as a prerequisite to equivalency testing rather than as a condition that can be assumed without verification. This distinction is especially important in additive manufacturing, where both temporal variability (e.g., drift across builds, recoater wear, optics degradation, powder aging, or environmental changes) and spatial variability (e.g., location-dependent effects within a build plate) may introduce a systematic structure into the data. Accordingly, we assume that the submitting entity has established process stability using time-ordered statistical process control (SPC) evidence prior to equivalency evaluation, for example through control charts or an equivalent monitoring framework applied to a stable machine–material–parameter configuration [27]. In practice, this means that the measurements used to define the reference distribution, as well as those collected from the candidate process, should be drawn from production conditions for which no unresolved out-of-control signals, sustained trends, or nonrandom patterns are present. When relevant, stability assessment should also account for within-build spatial heterogeneity by either controlling artifact location, randomizing sampling locations, or explicitly demonstrating that positional effects are acceptably small relative to the intended equivalency claim.
In addition, because the proposed framework operates on the observed measurements, equivalency decisions reflect the combined effects of intrinsic process variability and measurement-system uncertainty. For this reason, the method should be applied only when the measurement system has also been shown to be adequately controlled, for example through calibration, repeatability/reproducibility studies, or equivalent metrological evidence appropriate to the application. If measurement error is large relative to the process differences of interest, or if measurement conditions differ systematically between the reference and candidate datasets, equivalency decisions may become difficult to interpret because measurement noise can either obscure real process differences or induce apparent non-equivalence. If instability is detected, whether in the process itself or in the measurement system, the appropriate response is not to proceed directly to equivalency testing, but to first identify and address the source of variation and then re-establish control. The proposed framework should therefore be interpreted as operating after both the process and the measurement system have been shown to be adequately controlled, not as a substitute for the broader process-validation, metrology, and SPC activities required to demonstrate that control. This is consistent with the broader qualification ethos that process capability and control are prerequisites to statistical accept/reject decisions [27].
Here, we specify a sequential sampling method specifically geared to test equivalency in a way that optimally trades off the information gain from sequential sampling with the marginal costs of those samples. This optimization protocol results in a set of closed-form equations where users set the parameters so that the sampling strategy is tailored to their needs. We validate these results with simulations and give general guidelines on how to set the experimental parameters. To derive these equations, we follow these steps:
1. We show that the $\chi^2$ statistic is a poor statistic to assess equivalency, as it is independent of sample size (N).
2. We show that Cramér’s V has a predictable relationship with N, so it makes for a better measure of sample quality.
3. We define functions for setting the sample sizes of sequential samples, the information gain of sequential samples (derived from Cramér’s V), and their marginal costs.
4. We combine these functions into a single objective function to be minimized by the sequential sampling strategy.
5. We constrain the objective function by limiting the space of possible solutions.
6. We find the sampling strategy that minimizes the objective function in the acceptable domain.
7. We define a stopping rule for sequential sampling when the candidate distribution is sufficiently different from the reference.
8. We discuss the potential utility of outlier detection as a complementary approach to equivalency.
9. We optimize experimental parameters for various non-equivalent scenarios with power analyses via simulations and compare final sample sizes to those required by the fixed sample size method.
10. We summarize the process so that readers have the minimal information needed to perform sequential sampling.
11. We discuss a different formulation of equivalency based on current standards, and discuss how the two methods align.
To begin, we assume that percentiles have already been used to characterize the reference distribution and are available for comparison with the candidate distribution.
3.1. $\chi^2$ Statistic Is Invariant to Sample Size
Once the continuous reference distribution has been discretized into $b$ empirical percentile bins, the comparison problem becomes one of assessing whether the candidate observations are allocated across those bins as expected under equivalency. Because bins are defined by reference percentiles, each bin has the same expected probability under the reference distribution, and therefore a candidate sample of size $N$ should yield expected bin counts of approximately $N/b$ under equivalency. This converts the problem into a multinomial goodness-of-fit setting, for which a $\chi^2$-type discrepancy statistic provides a natural and interpretable measure of departure between the observed and expected counts. An additional advantage of this choice is that the contribution of each bin to the overall discrepancy can be examined directly, and the resulting statistic can be normalized through Cramér’s $V$ to support comparisons across sample sizes within the sequential sampling framework. In this context, the $\chi^2$ test statistic captures the difference between the expected counts from a fitted distribution ($E_i$) and the observed counts ($O_i$) within each bin:
$$\chi^2 = \sum_{i=1}^{b} \frac{(O_i - E_i)^2}{E_i} \tag{8}$$
where $b$ is the number of bins. One may expect that as the sample size ($N$) increases, $\chi^2$ should decrease, perhaps asymptotically approaching some constant value if the null hypothesis is true; otherwise it may diverge. To calculate the expected value of $\chi^2$ across $N$ given the null hypothesis, we can simplify Equation (8) by substituting $E_i = N/b$. This, coupled with the fact that the number of observations in each bin should be equivalent, means that the summation is unnecessary and we can simply multiply by the number of summation steps, so
$$\mathbb{E}[\chi^2] = b\, \frac{\mathbb{E}\left[(O_i - N/b)^2\right]}{N/b} = \frac{b^2}{N}\, \mathbb{E}\left[(O_i - N/b)^2\right].$$
As $O_i$ is a binomial random variable (an observation may or may not occur in bin $i$),
$$\mathbb{E}\left[(O_i - N/b)^2\right] = \operatorname{Var}(O_i) = N p (1 - p),$$
where $p = 1/b$ and $1 - p = (b - 1)/b$, so:
$$\mathbb{E}[\chi^2] = \frac{b^2}{N} \cdot N \cdot \frac{1}{b} \cdot \frac{b - 1}{b} = b - 1.$$
Notice that the $N$s cancel out in this expression, so $\mathbb{E}[\chi^2]$ does not change with the sample size, and $\chi^2$ therefore cannot be used to compare the quality of two samples with different sample sizes. The larger sample may have the higher value not due to any features or bugs of the sampling strategy, but rather due to a random fluctuation.
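The invariance of the expected statistic to sample size can be checked numerically. The following is a minimal simulation sketch, assuming candidate observations fall uniformly at random into b equal-probability bins (the null hypothesis of equivalency); the function names are hypothetical.

```python
import random

def chi_square_uniform(counts):
    """Chi-square discrepancy between observed counts and equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)

def mean_chi_square(sample_size, b, reps=4000, seed=0):
    """Average chi-square statistic over many simulated equivalent samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        counts = [0] * b
        for _ in range(sample_size):
            counts[rng.randrange(b)] += 1  # each bin is equally likely under equivalency
        total += chi_square_uniform(counts)
    return total / reps

# With b = 8 bins, the average should stay close to b - 1 = 7 at every sample size.
for sample_size in (10, 40, 160):
    print(sample_size, round(mean_chi_square(sample_size, b=8), 2))
```

The simulated averages stay near b - 1 whether 10 or 160 observations are drawn, which is why the raw statistic cannot rank samples of different sizes.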
3.2. Cramér’s V Varies Predictably with Sample Size
Cramér’s $V$ is a measure of the effect size of a $\chi^2$ test, quantifying the association between nominal variables [60]. If $V = 0$, then there is no association between the variables; therefore, the expected counts of each cell in each row are equivalent. Conversely, if it is high, then there is a strong association; the counts are different from one another. Cramér’s $V$ is given by the expression
$$V = \sqrt{\frac{\chi^2 / N}{\min(b - 1,\; r - 1)}}$$
where r is the number of rows in a contingency table. For our equivalency tests, $r = 2$, and the lowest possible value of $b$ is 2, so $\min(b - 1,\; r - 1) = 1$ and the denominator of this expression is always 1. Therefore, this expression simplifies to
$$V = \sqrt{\frac{\chi^2}{N}}$$
To make this value more interpretable, we can normalize it so that it varies between 0 and 1, where 0 indicates that the observed and expected counts are identical and 1 represents maximal difference between these counts. We can normalize this statistic with the equation
$$V^* = \frac{V - V_{\min}}{V_{\max} - V_{\min}} \tag{14}$$
$V_{\min}$ occurs when $\chi^2$ is at its smallest possible value, where $O_i = E_i$ in every bin, so $\chi^2 = 0$ and thus $V_{\min} = 0$. $V_{\max}$ occurs when $\chi^2$ is at its largest possible value ($\chi^2_{\max}$), which occurs when all observations occur in a single bin ($O_i = N$ for one bin and 0 for all others), so $\chi^2_{\max} = N(b - 1)$ and therefore
$$V_{\max} = \sqrt{b - 1}$$
Plugging this into Equation (14) along with $V_{\min}$ and $V$ and simplifying yields
$$V^* = \sqrt{\frac{\chi^2}{N(b - 1)}}$$
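To get the expected value of $V^*$, we first separate the random variable $\chi^2$ from the rest of the expression:
$$V^* = \frac{1}{\sqrt{N(b - 1)}} \sqrt{\chi^2}$$
which means that $\sqrt{\chi^2}$ is a $\chi$-distributed random variable. To get the expected value of $V^*$, we multiply the expected value of a $\chi$ distribution with $b - 1$ degrees of freedom by the scaling factor on the left:
$$E[V^*] = \frac{1}{\sqrt{N(b - 1)}} \cdot \sqrt{2} \, \frac{\Gamma(b/2)}{\Gamma((b - 1)/2)}$$
We further simplify by moving b to the right-hand side of this expression:
$$E[V^*] = \frac{1}{\sqrt{N}} \cdot \sqrt{\frac{2}{b - 1}} \, \frac{\Gamma(b/2)}{\Gamma((b - 1)/2)}$$
The right side of this expression acts as a correction factor, adjusting the expected value of $V^*$ given different values of b. Its minimum is approximately 0.798 when $b = 2$ and approaches 1 when $b \to \infty$. To simplify further derivations, we define the function $g(b)$ so that it contains all of the expressions related to b:
$$g(b) = \sqrt{\frac{2}{b - 1}} \, \frac{\Gamma(b/2)}{\Gamma((b - 1)/2)}$$
which gives us our final equation for $E[V^*]$ (Figure 3), where $g(b)$ can be understood as the degree to which $V^*$ needs to be corrected to account for the fact that it derives from a $\chi$ distribution rather than a $\chi^2$ distribution:
$$E[V^*] = \frac{g(b)}{\sqrt{N}}$$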
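As a numerical companion to this derivation, the sketch below computes the normalized statistic from a vector of bin counts and evaluates the correction factor through log-gamma functions. It is an illustration under the assumption of equal expected counts in every bin; the function names are hypothetical and are not taken from the AMEquivalency package.

```python
import math


def normalized_cramers_v(counts):
    """Normalized Cramér's V for bin counts with equal expected counts:
    sqrt(chi^2 / (N * (b - 1))), which lies between 0 and 1."""
    n = sum(counts)
    b = len(counts)
    expected = n / b
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return math.sqrt(chi2 / (n * (b - 1)))


def correction_factor(b):
    """sqrt(2 / (b - 1)) * Gamma(b / 2) / Gamma((b - 1) / 2),
    evaluated with lgamma to avoid overflow for large b."""
    log_ratio = math.lgamma(b / 2.0) - math.lgamma((b - 1) / 2.0)
    return math.sqrt(2.0 / (b - 1)) * math.exp(log_ratio)


def expected_normalized_v(n, b):
    """Expected normalized V under the null hypothesis for sample size n."""
    return correction_factor(b) / math.sqrt(n)


print(round(correction_factor(2), 3))          # the stated minimum, about 0.798
print(normalized_cramers_v([40, 0, 0, 0]))     # all observations in one bin
print(normalized_cramers_v([10, 10, 10, 10]))  # observed equals expected
```

Unlike the raw discrepancy, the expected value returned by `expected_normalized_v` shrinks with the square root of the sample size, which is what makes it usable across sequential samples of different sizes.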
3.3. Defining the Sampling Function, Information Gain, and Marginal Costs of Sequential Samples
Let
t be the sampling step and
, where
T is an experimental parameter set by the user which determines the maximal number of sequential samples they are willing to take. The sample size at each step is
, which cumulatively adds the number of samples from each step. For instance, if
and the user wants to add 5 samples, then
. This iterative process can be expressed as
where
is an initialization parameter set by the user and
determines how many samples will be added to
to produce
. We assume that
is linear where sequential values of
increase or decrease at a constant rate
s and is scaled by
,
which has the closed form:
Our optimization goal is to find the value of
s which simultaneously maximizes the information gain of the next sample and minimizes its cost. We hypothesize that the optimal value of
s depends in part on the underlying shape of the cost function. Generally, the cost of producing
parts can be calculated as
where
F is the fixed cost of producing parts,
d is the per-unit cost of testing the quality of
parts,
q is the per-unit cost of producing
parts, and
determines the scaling relationship between the sample size and the cost of each part ([
61];
Figure 4). To define the optimization problem, we focus on the costs that vary with the sequential sampling decision itself rather than attempting to model the full cost of qualification. Accordingly, fixed costs are omitted because they remain constant across the candidate sequential strategies considered here and therefore do not affect which strategy is preferred. We also omit testing cost in the present formulation because every produced part is assumed to be measured for equivalency evaluation, so measurement cost scales directly with the number of produced parts and does not introduce an additional trade-off separate from sample production. Under these assumptions, the relevant cost quantity is the incremental cost associated with continuing to produce and evaluate additional samples before a stopping decision is reached. Therefore the cost equation simplifies to
When the scaling exponent is less than 1, the cost of production diminishes with sample size, perhaps as a result of some efficiency that can be leveraged to produce parts at scale. Here, s can be greater than 0, as later samples cost less than earlier samples. When the exponent equals 1, the scaling relationship is linear; the cost per part does not change with the total number of parts produced. Spacing between sequential samples can therefore also be linear; there will be no difference in costs of later samples versus earlier ones. Finally, if the exponent is greater than 1, then later samples will be more expensive than earlier samples, so we can set $s < 0$ to avoid the region of high costs.
The marginal cost of the next sample is given as
whereas the expected information gained from the next sample will be measured as the expected reduction in
across samples
which simplifies to
Note that this cost model is intentionally simplified. It is not intended to capture the full economics of any single AM process and does not explicitly represent build-level batching, setup, fixturing, post-processing, queue time, or parallel specimen production. In LPBF especially, multiple artifacts may be produced in a single build, which can flatten or otherwise alter the marginal-cost structure assumed by the model [
43]. Accordingly, the role of the current cost function is heuristic: it provides a transparent way to encode whether marginal sampling cost is expected to increase, decrease, or remain approximately linear across sampling steps. Future work should replace this abstraction with process-specific cost models that explicitly incorporate batching, measurement constraints, and build-level economics.
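The three cost regimes discussed in this section can be illustrated with a short sketch. It assumes a power-law form in which the simplified cost is a per-unit production cost multiplied by the sample size raised to a scaling exponent; this form, the function names, and the numerical values are assumptions made for illustration only and should not be read as the exact cost equation used in this work.

```python
def production_cost(n_parts, unit_cost, exponent):
    """Assumed simplified cost: unit_cost * n_parts ** exponent."""
    return unit_cost * n_parts ** exponent


def marginal_cost(n_current, n_next, unit_cost, exponent):
    """Extra cost of growing the cumulative sample from n_current to n_next."""
    return (production_cost(n_next, unit_cost, exponent)
            - production_cost(n_current, unit_cost, exponent))


# Compare the cost of adding 10 parts early (10 -> 20) and late (30 -> 40).
for exponent in (0.8, 1.0, 1.2):
    early = marginal_cost(10, 20, unit_cost=1.0, exponent=exponent)
    late = marginal_cost(30, 40, unit_cost=1.0, exponent=exponent)
    print(exponent, round(early, 2), round(late, 2))
```

Under this assumed form, an exponent below 1 makes late increments cheaper than early ones, an exponent of 1 makes them equal, and an exponent above 1 makes them more expensive, which mirrors the qualitative argument in the text.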
3.4. Defining the Objective Function to Trade off Information Gain with Marginal Cost
Our optimization goal is to find the value of s which maximizes information gain and minimizes marginal cost. To do this, we perform linear scalarization: we transform both objectives so that they are on the same scale, and then we add the two to generate the objective function. We then find the value of s which minimizes this objective function.
First, as our goal is to minimize the objective function, we take the negative of the information gain so that
To scale
, we divide this expression by the maximal possible difference. As
is already scaled to take on values between 0 and 1, the maximal difference is 1. To scale
, we also divide by the maximal possible difference. This difference occurs when the sample size is 0 and when the sample size is at its maximal allowable value,
, another parameter set by the user. Therefore, the maximal possible difference is
and
is given by
where the
w’s are the weights attributed to each minimization goal. Here, we assume that the weights sum to unity, so
.
simplifies to
As we set the differences between sequential sample sizes to be linear (Equation (
24)), we can fully determine
s by only evaluating the first two samples,
and
.
is a constant set by the user, and
(setting
in Equation (
25)), and we can substitute these values into
and
, respectively:
3.5. Setting Constraints on s
Not all real values of s are allowed. In sequential sampling, later samples are added to prior samples, so . Additionally, we cannot allow any strategies that generate larger sample sizes than the limit set by the user, so . These set lower and upper bounds on s, respectively.
To find
, we set
, which is
and solve for
:
To find
, we set
, which is
and solve for
3.6. Minimizing the Objective Function in the Acceptable Domain of s
To optimize
, we first find the derivative of Equation (
34):
By definition,
,
, and
, meaning this derivative is always positive when
. This is least likely to be true when
. Given Equation (
36),
will be at its minimum when
, so
which is true so long as
, which is another necessary constraint of sequential sampling. This means that
is always positive in the acceptable domain, so its minimum will be at the lower boundary
. This means that regardless of the value of
and
, the optimal value of
s will always be negative.
With this optimal value of
s, we can finally define a function for selecting sequential samples by substituting
in for
s in Equation (
25):
which simplifies to:
3.7. Stopping Rule for Determining Equivalency
The main benefit of sequential sampling is that if two processes are not equivalent, then early samples will show this, and larger sample sizes would only incur an unnecessary cost. A mismatch between the reference and candidate processes would manifest as a larger-than-expected normalized Cramér’s V. However, under the traditional hypothesis testing framework, an early sample with a normalized Cramér’s V within its expected bounds is not necessarily proof that the null hypothesis (the two processes are equivalent) is true. Only repeated failures to reject the null hypothesis would provide evidence that it is true, or at least approximately true. Therefore, one would need to sample T times if the two processes are equivalent, but if the two processes are not equivalent then one would need to sample fewer than T times.
We stop sampling when the measured value of $V^*$ for sample t is greater than the upper confidence limit for this random variable at the sample size $N_t$ of that sample. We find this limit by substituting the square root of the critical value of the $\chi^2$ distribution into the expected value of $V^*$ for a given significance level $\alpha$:
$$V^*_{\mathrm{crit},t} = \sqrt{\frac{\chi^2_{1-\alpha,\,b-1}}{N_t (b - 1)}}$$
It should also be noted that a root cause analysis should be performed when a sample is declared non-equivalent. A few bad parts can throw off the statistic, so if it can be determined that a part was malformed due to a one-time event (e.g., an operator mistake), then it can be discarded and built again. If, however, the sample is determined to be representative of the entire process, it cannot be discarded.
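The stopping rule can be sketched in a few lines using only the standard library. The chi-square quantile is obtained here by bisection on a series expansion of the regularized incomplete gamma function, which is one of several possible numerical routes and not necessarily the one used in the AMEquivalency package. The function names and the default significance level of 0.05 are assumptions made for illustration.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(prob, df):
    """Invert the CDF by bisection after bracketing the quantile."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def normalized_v(counts):
    """Normalized Cramér's V for bin counts with equal expected counts."""
    n, b = sum(counts), len(counts)
    expected = n / b
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return math.sqrt(chi2 / (n * (b - 1)))


def critical_v(n, b, alpha=0.05):
    """Upper limit: sqrt(chi-square critical value / (n * (b - 1)))."""
    return math.sqrt(chi2_quantile(1.0 - alpha, b - 1) / (n * (b - 1)))


def should_stop(counts, alpha=0.05):
    """True when the cumulative sample is declared non-equivalent."""
    return normalized_v(counts) > critical_v(sum(counts), len(counts), alpha)


print(should_stop([8, 2, 0, 0]))  # counts piled into the first bin
print(should_stop([3, 2, 3, 2]))  # counts close to uniform
```

A call to `should_stop` would be made after each sequential sample on the cumulative bin counts; sampling continues only while it returns `False`.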
3.8. Outliers as a Complementary Diagnostic, a Recommended Secondary Analysis
The proposed equivalency workflow is intentionally nonparametric and percentile-based, which reduces sensitivity to single extreme observations by confining their influence to the outer bins. This robustness is desirable for accept/reject decisions when sample sizes are limited. Nevertheless, outliers can be operationally important in AM because they may indicate transient special-cause events (process instabilities, hardware faults, powder anomalies) or unusually favorable behavior worth understanding. We therefore recommend that outlier screening be treated as a complementary diagnostic rather than part of the equivalency decision rule; numerous robust and classical procedures for univariate and multivariate outlier identification are available in the literature (e.g., robust distance and minimum covariance determinant methods, influence-function-based diagnostics, and forward search techniques), and the reader is referred to established treatments for practical implementation and guidance [
62,
63,
64].
In practice, if equivalency is rejected, investigators should examine which bins (often tails) contribute most to the discrepancy statistic and whether a small number of observations dominate those bins, then conduct a root-cause analysis to distinguish metrology artifacts from genuine process excursions. Even when equivalency is accepted, reviewing tail observations can help identify rare but high-consequence failure modes (or opportunities for improvement). Outlier-handling rules should be specified a priori to prevent post hoc bias.
3.9. Finding Optimal Values of T and the Initial Sample Size with Power Analyses
The parameters T and the initial sample size can be set at the discretion of the user based on the cost and time constraints of the experiment. However, different combinations of these values will yield different discriminatory capacities of the experiment, so these capacities should also be included in the experimental design process. Here, we perform power analyses via simulation to determine the values of T and the initial sample size which minimize N while achieving a power level of 0.8 or higher and minimizing the false positive rate.
3.9.1. Simulation Description
In this context, the null hypothesis is that the candidate distribution is equivalent to the reference. The alternative hypothesis is that they differ in some way. Power is the probability that a statistical test will determine that two processes are not equivalent when this is in fact the ground truth. Under the null hypothesis, there is an equal probability that any observation will occur in any bin. The alternative hypothesis is that the probability distribution is not uniform. As the order in which the probabilities differ does not matter for the calculation of the $\chi^2$ statistic (Equation (8)), we choose to vary the probabilities with a truncated geometric distribution. When the single parameter of this distribution ($p$) is high, the probability mass clusters in the early bins (Figure 5; [65]). When $p$ approaches 0, it produces a near-uniform distribution; thus, by altering a single parameter we can produce a wide array of potential experimental outcomes:
$$P(X = i) = \frac{p (1 - p)^{i - 1}}{1 - (1 - p)^{b}}, \quad i = 1, \ldots, b$$
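The behaviour of the truncated geometric distribution at the two extremes of its parameter can be seen in a small sketch. The standard parameterization is assumed, in which unnormalized geometric weights are renormalized over the available bins, and a parameter of exactly 0 is treated as the uniform limit. The function name and the four-bin example are illustrative assumptions.

```python
def truncated_geometric(p, b):
    """Bin probabilities proportional to p * (1 - p) ** i for i = 0..b-1.
    A parameter of exactly 0 is treated as its limit, the uniform distribution."""
    if p == 0:
        return [1.0 / b] * b
    weights = [p * (1.0 - p) ** i for i in range(b)]
    total = sum(weights)
    return [w / total for w in weights]


for p in (0.0, 0.01, 0.5, 0.9):
    print(p, [round(q, 3) for q in truncated_geometric(p, 4)])
```

Small parameter values give probabilities close to one quarter in each of the four bins, while large values concentrate nearly all of the probability in the first bin.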
We set
to account for the possibility that users will set the sample size for the reference distribution to be as small as possible. While we show what the optimal sampling strategy is given this constraint, we also give a blueprint for how such simulation studies can be done for future studies which may allow for higher sample sizes and bin counts. This optimization framework is also implemented in the AM equivalency package [
44].
To run these simulations, we first find the values of
and
T which give a final sample size of 40. Each sampling strategy can be summarized as
, and there are 8 strategies where
. These are
,
,
,
,
,
,
,
. Note that
s is either 0 or negative for each of these experiments, so the change in sample sizes either stays the same (as is the case for
) or decreases until the difference between the last and second-to-last samples is 1. Once these experimental parameters are set, we use Equation (
43) to determine the sample size sequence. Next, for half of our simulations, we randomly draw the parameter of the truncated geometric distribution from a uniform distribution between 0 and 1 to set the probability that an observation will occur in each bin. For the other half of our simulations, we set this parameter to 0, which yields the uniform distribution of the null hypothesis. We then sequentially sample from this truncated geometric distribution. For each sample t, we calculate the normalized Cramér’s V (Equation (4)). We then compare the sample’s value to its critical value (Equation (5)) for a given significance level. If the observed value does not exceed the critical value, then we continue sampling. If it exceeds the critical value, then the processes are not equivalent and sampling ceases. If this condition is never met, then sampling continues until the final step, $t = T$. In cases where the null hypothesis is false, we calculate power (the probability that the sampling strategy correctly rejects the null hypothesis) as well as the final sample size at which this determination was made. In cases where the null hypothesis is true, we calculate the false positive rate: the probability that the null hypothesis was rejected even though it is true. For every sampling strategy, we run 100 simulations where the null hypothesis is true and another 100 where the null hypothesis is false, each with a set value of the distribution parameter. We repeat each simulation 100 times. In total, there are $8 \times 200 \times 100$
= 160,000 simulations. We also set
.
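The simulation loop described above can be sketched compactly. Because the observed and critical values of the normalized statistic share the same scaling by sample size and bin count, comparing them is equivalent to comparing the raw discrepancy with its chi-square critical value, which is what this sketch does. The four-bin setting, the schedule of cumulative sample sizes (10, 20, 30, 40), the tabulated 0.95 critical value of 7.815 for three degrees of freedom, and all function names are illustrative assumptions rather than the settings used in this study.

```python
import random


def truncated_geometric(p, b):
    """Bin probabilities; a parameter of 0 gives the uniform (null) case."""
    if p == 0:
        return [1.0 / b] * b
    weights = [p * (1.0 - p) ** i for i in range(b)]
    total = sum(weights)
    return [w / total for w in weights]


def run_trial(probs, schedule, chi2_critical, rng):
    """One sequential experiment. Returns (rejected, sample size at stop)."""
    b = len(probs)
    counts = [0] * b
    n = 0
    for target in schedule:
        # Add new observations so the cumulative sample size reaches `target`.
        for i in rng.choices(range(b), weights=probs, k=target - n):
            counts[i] += 1
        n = target
        expected = n / b
        chi2 = sum((c - expected) ** 2 / expected for c in counts)
        if chi2 > chi2_critical:
            return True, n
    return False, n


def rejection_rate(p, schedule, b=4, chi2_critical=7.815, reps=500, seed=1):
    """Fraction of trials rejected and the mean sample size at stopping."""
    rng = random.Random(seed)
    probs = truncated_geometric(p, b)
    results = [run_trial(probs, schedule, chi2_critical, rng) for _ in range(reps)]
    rate = sum(1 for rejected, _ in results if rejected) / reps
    mean_n = sum(n for _, n in results) / reps
    return rate, mean_n


schedule = [10, 20, 30, 40]  # hypothetical cumulative sample sizes
print("alternative:", rejection_rate(0.8, schedule))  # rate estimates power
print("null:", rejection_rate(0.0, schedule))         # rate estimates false positives
```

Running the same function over a grid of schedules and parameter values would give the power, false positive rate, and average final sample size for each candidate strategy.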
3.9.2. Finding Optimal Sampling Strategy
As this is a multi-objective optimization problem where we want to maximize power (discrimination level) while minimizing final sample size and the false positive rate, we use desirability functions to combine these 3 variables into a single metric and find the strategy that maximizes this composite metric [
66]. Each of these performance measures $y_j$, indexed by $j$, can be recast as a desirability $d_j$. When the goal is to maximize $y_j$, as is the case with power, we use the following function:
$$d_j = \begin{cases} 0 & y_j < L \\ \left(\dfrac{y_j - L}{T - L}\right)^{k} & L \le y_j \le T \\ 1 & y_j > T \end{cases}$$
where $L$ is the lowest acceptable power value, $T$ is the target value of the power, and the exponent k determines how important it is to hit the target. $0 < k < 1$ produces a concave function, $k = 1$ produces a linear function, and $k > 1$ produces a convex function. Here we set
and
.
Next, we set the desirabilities for sample size and the false positive rate. In both of these cases, we want to minimize the performance metric, so we use the following function:
$$d_j = \begin{cases} 1 & y_j < T \\ \left(\dfrac{U - y_j}{U - T}\right)^{k} & T \le y_j \le U \\ 0 & y_j > U \end{cases}$$
where T is still the target but U is the upper limit. Here we set
,
,
and
. We also set
for each desirability function.
Next, we introduce the weight $w_j$ to set the relative importance of each variable, where the weights sum to unity, $\sum_j w_j = 1$. If $w_j = 0$, then the j’th variable is unimportant and will not influence the final optimization result. In contrast, if $w_j = 1$, then the j’th variable is maximally important and will be the only variable that influences the final result.
With these weights, we can introduce the overall desirability D, which is the product of the desirabilities with their associated weight set as the exponent:
$$D = \prod_{j} d_j^{\,w_j}$$
As two of these performance metrics are involved in evaluating situations where the null hypothesis is false (the reference and candidate distributions are not equivalent) and only one is calculated when the null hypothesis is true, we set the weights so that the null is not true case does not overwhelm the null is true case: .
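The desirability calculation can be sketched as follows, using the standard larger-is-better and smaller-is-better forms and a weighted geometric combination. The function names, limits, targets, weights, and performance values in the example are hypothetical placeholders chosen only to show the mechanics; they are not the settings used in this study.

```python
def desirability_maximize(y, lower, target, k=1.0):
    """Larger-is-better: 0 below `lower`, 1 above `target`, power curve between."""
    if y < lower:
        return 0.0
    if y > target:
        return 1.0
    return ((y - lower) / (target - lower)) ** k


def desirability_minimize(y, target, upper, k=1.0):
    """Smaller-is-better: 1 below `target`, 0 above `upper`, power curve between."""
    if y < target:
        return 1.0
    if y > upper:
        return 0.0
    return ((upper - y) / (upper - target)) ** k


def overall_desirability(desirabilities, weights):
    """Product of desirabilities raised to their weights (weights sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    result = 1.0
    for d, w in zip(desirabilities, weights):
        result *= d ** w
    return result


# Hypothetical strategy: power 0.9, mean final sample size 15, false positive rate 0.1.
d_power = desirability_maximize(0.9, lower=0.8, target=1.0)
d_size = desirability_minimize(15, target=10, upper=40)
d_fpr = desirability_minimize(0.1, target=0.0, upper=0.2)
print(overall_desirability([d_power, d_size, d_fpr], [0.25, 0.25, 0.5]))
```

Because the combination is a product, any single measure with a desirability of 0 and a nonzero weight drives the overall score to 0, so a strategy cannot compensate for an unacceptable value on one measure with good values on the others.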
We find that
D peaks at
(
Figure 6). Here, the expected sample size (given the null is not true) is minimal while the power level is above
and the false positive rate is around
. It therefore has some desirable properties which make it a good candidate as a sampling strategy. This sampling strategy corresponds to the sequence
.
3.10. Summarized Steps of Equivalency
Box 1 gives the minimal amount of information needed to perform experiments using the equivalency process. It gives quick descriptions of variable names, their recommended values, and the equations needed to test for equivalency. All analyses in this paper are reproducible using the accompanying R package AMEquivalency (publicly available at the repository [44]), which implements reference-process stability checks, percentile-based binning, sequential sampling schedules, computation of the normalized Cramér’s V and its critical value, and the stopping rule. The package also supports simulation-based power analyses for selecting the sampling strategy under user-defined constraints. An introduction on how to use the package is also available in the repository.
3.11. Univariate Equivalency Based on Variation Limits
There are numerous potential ways to determine equivalency between two different distributions even beyond traditional goodness-of-fit tests. One alternative approach is based on tolerance-bound statistics such as
and
, which are used in specification-based qualification frameworks for metallic materials, including MMPDS-style procedures [
27]. A
value is a one-sided lower tolerance limit, representing a
confidence interval lower limit on the first percentile of the distribution, which can be computed via parametric or nonparametric means ([
27]). A
value is the upper tolerance limit of the
interval for the 99th percentile of that distribution. To establish equivalency, one simply measures these statistics for the reference and candidate distributions and checks whether the latter is nested within the former (
Figure 7). If this is the case, the processes are equivalent; otherwise they are not equivalent. Significant work needs to be done to determine whether such an option is feasible.
Still, obvious comparisons can be made between this method and our bin-based method. In a sense, the bin-based method subsumes this variation limit method, as the bin-based method would be sensitive to differences in the mean, variability, and other moments of the reference and candidate distributions. Conversely, the proposed approach may be less sensitive to some mean shifts when the candidate distribution is substantially less variable than the reference distribution. This occurs because the method evaluates discrepancies through candidate counts in reference-defined percentile bins rather than through the mean alone. If the candidate mean is shifted but the candidate distribution is also relatively narrow, many observations may remain concentrated within a small number of central bins, which can limit the overall discrepancy in bin counts. In such cases, a change in location does not necessarily produce a strong equivalency signal unless enough observations are displaced across multiple bin boundaries.
4. Validation Case Study
We performed a set of experiments designed to test the efficacy of the univariate equivalency method. In these experiments, a single process is used to create a reference and candidate distribution while a second process is used to create a secondary candidate distribution. If the equivalency package works as intended, then the reference distribution and the first candidate distribution should be equivalent, while the second candidate distribution should differ from the reference. We first discuss how the experiment was performed, and then we go through the steps highlighted in
Figure 1 to demonstrate how the package should be used and the results of the case study.
4.1. Experimental Methods
Plate scans were selected for the validation of the univariate equivalency method proposed in this work. Unlike three-dimensional parts, plate scans consist of a single-layer build in which vectors or melt-pools are isolated for a layer-wise analysis. This approach also reduces the time required to obtain and measure samples. In this case, two LPBF systems, AconityMIDI+ and SLM280 HL, were used to produce the populations required for the univariate equivalency method. The AconityMIDI+ (Aconity3D GmbH, Herzogenrath, Germany) features a continuous-wave Yb-fiber laser (CFL-500, nLIGHT, Vancouver, WA, USA) with a maximum power of 500 watts and is controlled by a scanner head co-developed by Aconity3D and Raylase GmbH (Weßling, Germany). This system was selected to create the reference population since it offers the user extensive configurability of multiple parameters. The same machine was also used for the candidate population that, theoretically, should be classified as equivalent by the method proposed here. For the “non-equivalent” candidate population, the SLM280 HL (Nikon SLM Solutions AG, Lübeck, Germany) system was used. This machine is equipped with two 400-watt, continuous-wave Yb-fiber lasers and a SCANLAB intelliSCAN III 20 scanner controller (SCANLAB GmbH, Puchheim, Germany). In contrast to the AconityMIDI+, the SLM280 is a more closed system that gives users less freedom to configure parameters.
To guarantee a standardized process and eliminate the potential for external confounding variables, all tests were printed under the same conditions. Each plate was positioned at the center of the build plate of the corresponding system, with the same orientation relative to the flow direction. The surface of the plates was aligned to the focal plane in the z-axis. All plate scans were also produced under manufacturing conditions, maintaining the build chamber oxygen concentration below 1000 ppm.
Following the steps described in this publication, the reference population was produced by consecutively printing 20 subgroups, or plates, of 2 samples each. Each sample was manually positioned at the center of the build platform before activating the laser to continue with the next subgroup. For the candidate populations, the samples were printed according to the sequential sampling strategy. To expedite the data gathering process for purposes of validating the univariate equivalency method for this publication, the 40 samples for both candidate populations were printed with a 20-min time interval between each sample increment to accommodate the change of plates between runs. However, the data were not processed using the univariate equivalency method at this phase. By omitting intermediate data-processing steps while preserving the planned validation framework, the overall experiment duration was reduced relative to a fully sequential implementation in which data would be processed and evaluated after each sampling stage. If such intermediate processing is retained, the framework can still reduce the total number of samples required, but the calendar-time savings may be smaller because each sequential step must be followed by measurement, analysis, and a stopping decision before additional samples are collected.
Once the plates were printed, a Keyence VHX-7000 4K microscope (Keyence Corporation, Osaka, Japan) was used to image the samples at 150× magnification. The geometry selected for the validation experiment consisted of a quadrilateral with angle values of 45°, 90°, and 135°. The Keyence VHX-7000 software was used to measure the corner deviation of the 135° corner. This measurement quantified the deviation between the scanned corner and the intersection of the centerlines of the vectors forming the corner, as illustrated in
Figure 8. The centerlines of the vectors were created manually and used to generate a bisector guideline that crossed the inner and outer edges of the corner. Then the distance from the intersection of the centerlines to both edges of the corner vectors along the bisector guideline was measured, ensuring that both measurements were taken at the same angle. The corner deviation value was calculated by adding the measurements to both edges. Measurements falling within the bisector were assigned “positive” values, while those outside the bisector were assigned “negative” values. An ideal corner would have a corner deviation of zero, indicating that the centerline intersection is equidistant from the vector edges of the corner. Corner deviation was selected as the metric for this study because it can be influenced by polygon delay, a hidden parameter in some LPBF systems. Equivalency was determined using the available R Package [
44].
4.2. Experimental Results
The corner deviations measured on the AconityMIDI+ were found to be stable according to the Shewhart control chart (
Figure 9A). This ensures that the process can produce a viable reference distribution. Furthermore, sequentially sampled corner deviations from the AconityMIDI+ were shown to be equivalent to the reference distribution (
;
;
;
Table 2;
Figure 9B). However, corner deviations measured from traces produced by the SLM280 HL were found to be non-equivalent by the 1st sample (
;
;
;
Table 2;
Figure 9C). Thus, this univariate equivalency method successfully classified two equivalent processes as being equivalent while determining that a non-equivalent process was indeed non-equivalent to the reference at a small sample size. In applications where non-equivalency is detected, a natural next step is to inspect tail bins and potential outliers to identify whether the divergence reflects a systematic shift or rare special-cause excursions.
5. Discussion
One major challenge for the maturation of additive manufacturing as an industrial technology is its ability to scale [
67]. While economies of scale are achievable for AM generally [
5], there are a number of significant costs that make it expensive for mass production relative to traditional manufacturing techniques [
10]. One major cost is the cost of testing for the purposes of qualification, certification, or in-house validation [
68]. There is also a lack of standards for AM processed objects [
69], resulting in huge variation in quality among additively manufactured parts. Our goal with developing this univariate equivalency workflow is to provide a reliable and cost-friendly approach for comparing a candidate process to a stable reference with respect to a single response variable. This workflow is implemented in an open-source R package (
AMEquivalency) for reproducibility and ease of adoption [
44]. We demonstrated the method using a validation case study based on geometric measurements of quadrilateral traces (plate scans), correctly classifying a candidate process expected to be equivalent to the reference and detecting a non-equivalent process at a small sample size. We also briefly discussed an alternative equivalency strategy based on tolerance limits [
27], and argued that the bin-based method provides a more general distributional comparison that is sensitive to differences in multiple moments (mean, variability, and higher-order features).
While this study is motivated in part by qualification challenges in additive manufacturing, equivalency testing should not be viewed as a standalone replacement for existing qualification standards. Significant additional work is still needed to determine how distributional equivalency in specific metrics relates to part performance, reliability, and defect tolerance across applications, and how univariate, and ultimately multivariate, equivalency evidence should be incorporated into formal qualification and certification frameworks. Even so, distributional equivalency has important uses beyond formal qualification. Potential applications include machine-to-machine comparability, parameter-set change control, supplier comparisons, requalification after hardware or software updates, calibration of in situ monitoring proxies against ex situ measurements, and early-stage process development in which the goal is to match a known-good baseline while minimizing the number of test artifacts. The proposed workflow is especially valuable in settings where sample production, post-processing, or destructive evaluation is sufficiently expensive that early stopping can yield meaningful practical savings. In some LPBF workflows, however, many specimens can be produced in a single build and paired with high-throughput evaluation strategies, which reduces the marginal benefit of sequential sampling [
43]. By contrast, the framework may be particularly useful for lower-throughput processes, expensive qualification artifacts, machine-to-machine comparability studies, requalification after process changes, or any setting in which only a few artifacts can be produced or evaluated at a time [
70]. The LPBF case study presented here should therefore be interpreted primarily as a validation of the statistical workflow rather than as evidence that LPBF is universally the setting of greatest benefit. In these near-term contexts, the workflow is best positioned as a practical tool for process comparability and change control, supporting principled accept/reject decisions relative to a reference baseline while explicitly accounting for cost constraints through sequential sampling.
It should be noted that equivalency is not the only method which can be used to compare reference and candidate distributions. Alternative methods for comparing continuous distributions which do not require binning include empirical-distribution-function procedures such as the Kolmogorov–Smirnov (KS) test and distance-based methods such as Wasserstein metrics [
71,
72]. These approaches preserve the continuous scale of the data and can be more sensitive than a discretized method to certain fine-grained differences between distributions. However, they do not by themselves resolve the experimental-design problem addressed here: how to construct a sequential, interpretable, and cost-aware decision rule for small-sample AM equivalency testing. We therefore view such methods as useful complements rather than direct replacements for the present framework. To make this comparison explicit, we include
Appendix D, which compares the empirical power of the proposed method to that of the two-sample KS test under the same simulation logic used throughout the manuscript.
The validation case study provides more than a simple proof of implementation. Most importantly, it shows that the sequential workflow behaves differently in the two practically relevant regimes that motivated its development. For the candidate population produced on the same system as the reference, the method did not trigger an early non-equivalency decision and instead continued through the full planned sample size, indicating a conservative tendency when evidence remains consistent with equivalency. By contrast, the candidate population produced on the second LPBF system was rejected at the first sequential step, showing that large distributional departures can be detected with very little sampling effort. This asymmetric behavior is desirable in many qualification and change-control settings: strong evidence of non-equivalency can be identified quickly, whereas claims of equivalency require sustained agreement with the reference distribution across the full sampling sequence. At the same time, the case study should be interpreted as a validation of the statistical workflow rather than as a comprehensive validation of equivalency for LPBF more generally. The example is based on a single response metric derived from plate scans, and additional work will be needed to determine how equivalency in such metrics relates to broader questions of part performance, reliability, and certification.
This work also highlights several directions for future research. First, the present framework evaluates only a single attribute of a candidate build, underscoring the need for a multivariate extension that can assess equivalency across multiple correlated or independent attributes (e.g., tensile strength, surface roughness, and geometric accuracy) simultaneously and thereby support a more holistic notion of process comparability. Such an extension will be important if equivalency testing is to move beyond specification-limited comparisons and toward broader qualification and certification applications. Second, the general framework developed here may be especially valuable in other additive manufacturing settings in which throughput is lower and qualification artifacts are more expensive, including processes such as laser directed energy deposition and extrusion-based AM. Because only one or a few parts may be built simultaneously in these processes, extending the method to these contexts may help clarify where sequential sampling provides the greatest practical advantage. Third, sequential sampling may also be leveraged to reduce sample-size requirements for establishing reference distributions when artifacts are extremely expensive, provided that process stability can still be assessed reliably. Finally, when non-equivalency is detected, follow-on diagnostic analyses (for example, targeted investigation of tail behavior or outliers as possible indicators of special-cause variation) may help identify why a candidate process diverged from the reference and guide corrective actions.
6. Conclusions
This study introduced a nonparametric framework for assessing univariate equivalency between additive-manufacturing processes through comparison of full empirical distributions rather than isolated summary statistics. By characterizing the reference process with percentile-defined bins and evaluating candidate data through a sequential sampling rule, the method provides an interpretable accept/reject workflow that is explicitly designed for settings in which sample production and evaluation are costly. The principal advantage of the framework is therefore not that it replaces broader qualification standards, but that it offers a practical and statistically grounded tool for process comparability, requalification, and change control when only limited sample sizes are feasible.
The results show that the method can distinguish between equivalent and non-equivalent candidate distributions while supporting early stopping when strong evidence of non-equivalency is present. More generally, the framework provides a way to compare processes on the basis of their observed distributions without requiring parametric assumptions about distributional form. This makes it attractive for AM applications in which normality may not hold, tail behavior may matter, and investigators wish to retain information about variability and distributional shape rather than relying only on means or specification-based checks.
At the same time, the method should be interpreted within its intended scope. It evaluates observed distributions for a single response metric and assumes that both process stability and measurement-system adequacy have been established before equivalency testing is applied. Accordingly, it is best viewed as a component of a broader validation or monitoring workflow rather than as a standalone replacement for qualification and certification standards. Its greatest practical benefit is expected in settings where artifacts are expensive, throughput is limited, and early stopping can reduce the number of required samples. In higher-throughput settings, the main value of the method may lie less in reducing calendar time and more in providing a principled and interpretable framework for process comparison.
Future work should extend this approach to multivariate equivalency, strengthen the treatment of measurement uncertainty, and further evaluate its applicability across additional AM settings, materials, and process conditions relevant to industrial qualification workflows. Even in its present form, however, the proposed workflow establishes a useful foundation for distribution-based process comparison in additive manufacturing and provides a practical step toward more rigorous and sample-efficient equivalency assessment in this field.