Speeding up Statistical Tolerance Analysis to Real Time

Statistical tolerance analysis based on Monte Carlo simulation can be applied to obtain a cost-optimized tolerance specification that satisfies both the cost and quality requirements associated with manufacturing. However, this process requires time-consuming computations. We found that an implementation that uses the graphics processing unit (GPU) for vector-chain-based statistical tolerance analysis scales better with increasing sample size than a similar implementation on the central processing unit (CPU). Furthermore, we identified a significant potential for reducing runtime by using array vectorization with NumPy, the proper selection of row- and column-major order, and the use of single precision floating-point numbers for the GPU implementation. In conclusion, we present open source statistical tolerance analysis and statistical tolerance synthesis approaches with Python that can be used to improve existing workflows to real time on regular desktop computers.


Introduction
Manufactured components are subject to deviations that reduce the functional and aesthetic quality of the final product. For this reason, tolerances are specified based on the results of statistical tolerance analyses to minimize the degradation of geometrical accuracy and the number of rejects in manufacturing. Additionally, economic aspects must be considered: needlessly tight tolerance specifications lead to higher expenses and an increase in production time due to the stricter demands on the manufacturing process. A common approach to satisfying both of these requirements is establishing statistical tolerance analyses based on Monte Carlo simulation to identify the cost-optimal manufacturing tolerance specification based on repeated random sampling and statistical analysis [1]. Monte Carlo simulation has consistently been the preferred choice of the tolerance community to evaluate the statistical implications of manufacturing deviations on the key characteristics of mechanical assemblies. Furthermore, Monte Carlo simulation is an established and common standard in a large variety of scientific and industrial areas because it allows simulating and tackling complex systems and processes. However, Monte Carlo simulation is computationally intensive, which means that the associated execution times can range from seconds to hours and even days. The ongoing digital transformation of manufacturing through the adoption of Industry 4.0 [2-4] requires that product characteristics be analyzed after each manufacturing process in order to establish a continuous information flow to enable a self-learning and self-adapting manufacturing process [5,6].
To support the digitization of manufacturing products with high geometric accuracy, it is therefore important to accelerate the statistical tolerance analysis. The runtime optimization of Monte Carlo simulation is nothing new to academic research. Various studies have discussed approaches to speeding up Monte Carlo simulation, mainly focusing either on a proper distribution of the random sample generation among several local computers (such as [7]) or on carrying out the Monte Carlo simulation on a single GPU or on multiple GPUs (such as the application of NVIDIA's Compute Unified Device Architecture (CUDA) in [7]). However, none of the existing work considers the specifics of tolerance engineering and investigates whether a Monte Carlo simulation using the GPU architecture has the potential to significantly speed up statistical tolerance analysis and statistical tolerance synthesis. The present paper investigates this using scientific open source computing libraries provided by the Python programming language.
The rest of this paper is structured as follows. Section 2 gives an overview of the software solutions used in tolerance engineering and provides sufficient background on the basics of the programming language Python. Section 3 defines the research question and the main challenge of the paper. In Section 4, an overconstrained door hinge assembly is introduced, which serves as a case study. The implementations of a statistical tolerance analysis and statistical tolerance synthesis of the door hinge assembly are detailed in Sections 5.1-5.4. This is followed by a brief description of the runtime measurement procedures and details on the software and hardware used in Sections 5.5-5.7. The runtime results of the statistical tolerance analysis and statistical tolerance synthesis implementations are presented and discussed in Sections 6.1 and 6.2. In Section 6.2.2, the sample size required for a credible statistical tolerance synthesis for the tolerance model is determined, followed by some conclusions in Section 7, in which recommendations for tolerance engineers and researchers are given to speed up statistical tolerance analysis to real time.

Numerical and Programming Foundations for This Study
For about 30 years, commercial software tools for the analysis and optimization of tolerance specifications have been available. These so-called computer-aided tolerancing (CAT) tools enable the design engineer to carry out a statistical tolerance analysis of any given mechanical assembly in a virtual 3D-CAD environment. All of them use Monte Carlo simulation to obtain the statistical results. However, due to their high complexity and high costs, very few companies employ commercial CAT tools. According to a survey from [8], 80% of the questioned companies in Germany use the spreadsheet program MS Excel to carry out vector-chain-based statistical tolerance analyses, while the remaining respondents "relied on analytical calculation of tolerance stacking problems by hand, using established and simple approaches such as a worst-case tolerance analysis [9] or the root-sum-square approach" [8]. In research, however, the use of the commercial numeric computing environment MATLAB is the common standard.
In the present paper, the authors expand this portfolio of programming languages with Python. Python is an open source, dynamic scripting language that reduces the effort of implementing an initial prototype. It also provides a large number of libraries, both shipped with the language and external, which are useful for a wide variety of scientific and engineering applications. The fact that Python is a dynamic programming language, however, is also one of its biggest weaknesses: it makes Python significantly slower than statically typed programming languages, especially when a large number of operations needs to be performed repeatedly, as in the case of loops [10]. An efficient approach to improving the performance of Python code is to use NumPy arrays to replace explicit loop structures with "pre-compiled code written in a low-level language (e.g., C) to perform mathematical operations over a sequence of data" [11]. Figure 1a illustrates the general procedure and the runtime advantages for the Ishigami function [12]. Due to the drastic performance increase, NumPy array programming has become the standard within the Python community, especially in the context of scientific computing. According to [13], NumPy played an important role in the software stack that led to the discovery of gravitational waves. For additional information on NumPy, see [14,15]. Another external open source Python library used in this paper is the CuPy library [16]. It can be used as a drop-in replacement for the NumPy library to execute computations on the GPU without having to apply NVIDIA's CUDA programming syntax (see Figure 1b). Whenever one is working with larger array sizes in NumPy, as in the case of Monte Carlo simulation, it is necessary to be aware of row-major and column-major ordering. In order to reduce the overhead of memory access operations, the storage layout needs to match the loop structure [17]. With row-major order, the elements are arranged consecutively along the rows; with column-major order, they are arranged consecutively along the columns [15,17]. The choice of floating-point precision, as defined in the IEEE/ISO/IEC 60559-2020 standard [18], is also important for both the performance and the memory use of the statistical tolerance synthesis. For example, the GeForce RTX 2060 GPU used in this research, which is based on the NVIDIA Turing architecture, provides 64 FP32 ALUs (arithmetic logic units) designed for floating-point calculations on each of its 30 streaming multiprocessors. This allows fast floating-point arithmetic operations on the graphics unit; see Table 1.
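The three levers discussed above (vectorization, storage order, and floating-point precision) can be sketched in a few lines. The Ishigami function and the sample size are illustrative choices for this sketch; the layout and precision options mirror the configurations compared later in this paper:

```python
import numpy as np

# Ishigami function (a = 7, b = 0.1), written in vectorized form: NumPy's
# pre-compiled C loops process whole arrays instead of a slow Python loop.
def ishigami(x1, x2, x3, a=7.0, b=0.1):
    return np.sin(x1) + a * np.sin(x2) ** 2 + b * x3 ** 4 * np.sin(x1)

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(3, 100_000))

# Column-major (F-order) storage and single precision: float32 needs half
# the memory of float64 and is often markedly faster, especially on GPUs.
x32 = np.asfortranarray(x, dtype=np.float32)
y = ishigami(x32[0], x32[1], x32[2])
```

The same function body works unchanged for scalar inputs or million-element arrays, which is what makes the drop-in swap to a GPU backend (see below) practical.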

Research Question
A comparatively high number of samples is required to obtain statistically reliable results when performing statistical tolerance simulations (such as tolerance analysis and tolerance synthesis) [20-22]. However, with an increasing sample size, not only the accuracy of the obtained results increases, but also the computational effort [23,24]. Recommendations range from 5000 [25] to 100,000 samples [24]. Especially for more complex tolerance simulations, a size of 10,000 samples has become established among industrial experts and researchers [26]. The choice of the sampling procedure as well as the corresponding sample size have a significant influence on the reliability of a statistical tolerance simulation and must therefore always be made with care. Hence, the key to speeding up vector-chain-based statistical tolerance simulations is to accelerate the Monte Carlo simulation. The authors face the research question: Is it possible to realize an implementation of the Monte Carlo simulation in real time on a regular desktop computer with a sample size that satisfies the quality requirements in tolerance engineering?
In this paper, the definition of real time is based on response times of only a few milliseconds, as achieved by state-of-the-art automation solutions with a programmable logic controller or fieldbus-based production lines. In the following sections, statistical tolerance analyses and statistical tolerance syntheses are developed and executed with different programming approaches and varying sample sizes. For this purpose, a non-trivial case study of an overconstrained door hinge assembly is used. We aim to quantify (i) how the computation times behave for different sample sizes with different implementations and (ii) whether, and if so how significantly, a tolerance simulation is sped up.

Case Study: Overconstrained Door Hinge Assembly
The overconstrained door hinge assembly is shown in Figure 2, following [27,28]. It should be noted in advance that this supposedly 'simple' assembly (the assumption of a one-dimensional chain with purely linear dependencies seems obvious) is significantly influenced by interactions between the occurring deviations (due to the overconstrained design) and is thus non-trivial. The hinge consists of two main parts (the outer and inner hinge plates) connected by a pin. The hinge enables the opening and closing of a cover plate. Due to manufacturing deviations of the dimensions M1 to M7, a variation of the vertical position of the cover results. These deviations are limited by the tolerances specified in Table 2. The tolerances follow a Gaussian or a uniform distribution. In order to ensure a sufficient sealing gap, the standard deviation of the gap p should not exceed the limit of 0.1 mm. Hence, the functionally relevant key characteristic of the assembly is the vertical displacement of the inner hinge plate with respect to the outer hinge plate, which directly influences the width p of the gap for the gasket. First, the mathematical relation between the functional key characteristic p and the deviating dimensions M1 to M7 is required. The overconstrained design of the assembly already comes into play here: due to the design of the hinge and the resulting interactions of the deviations, two assembly scenarios are possible (see Figure 3). The assembly scenarios differ in how the plates' surfaces are in contact. In scenario #1, the plates are in contact in the upper slot, whereas in scenario #2, there is contact in the lower slot. Thus, the distances A1 (upper slot) and A2 (lower slot) can be determined from M1 to M7.
If A1 ≤ A2, the hinge will be assembled according to scenario #1, and the resulting key characteristic p is the determined value of the distance A1; otherwise, the hinge is assembled according to scenario #2 and p equals A2. This results in a piecewise-defined mathematical function of the gap p.
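In code, the scenario selection can be evaluated element-wise over the whole Monte Carlo sample with np.where. The chain functions below are placeholders for illustration only; the actual expressions for A1 and A2 follow from the hinge geometry of Figure 3:

```python
import numpy as np

# Placeholder vector chains for the two contact scenarios; the real
# A1(M), A2(M) are derived from the hinge geometry (Figure 3), where
# M is a (7, n) array of sampled dimensions M1..M7.
def A1(M):
    return M[0] - M[1] + M[2]

def A2(M):
    return M[3] - M[4] + M[5] - M[6]

def gap(M):
    # Scenario #1 (contact in the upper slot) applies where A1 <= A2 and
    # yields p = A1; otherwise scenario #2 applies and p = A2.
    a1, a2 = A1(M), A2(M)
    return np.where(a1 <= a2, a1, a2)

M = np.random.default_rng(1).normal(1.0, 0.01, size=(7, 10_000))
p = gap(M)
```

Because the condition is evaluated per sample, both assembly scenarios can occur within the same simulated batch, which is exactly what makes the overconstrained hinge non-trivial.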

Methods: Detailing the Approach to Speeding Up Tolerance Simulations
The core of this work is the execution of the statistical tolerance analysis and the statistical tolerance synthesis considering different programming and architecture approaches as well as varying sample sizes for the underlying Monte Carlo simulation. First, the procedure (Section 5.1) as well as the implementation (Section 5.2) of the statistical tolerance analysis are presented. The following sections focus on the procedure (Section 5.3) and the implementation of the statistical tolerance synthesis using numerical optimization (Section 5.4). This is followed by details on the implementations and hardware used (Sections 5.5 and 5.6).

Procedure for the Statistical Tolerance Analysis
The application of Monte Carlo simulation [29] in statistical tolerance analysis carries out a virtual reproduction of the manufacturing, assembly, and inspection of a large number of products. Figure 4 illustrates these three steps. First, a defined number n of variants of each individual part is generated virtually; these variants differ from each other in the various properties (e.g., dimensions) that are subject to tolerances [30]. The value of each deviation is determined taking into account the associated probability distribution of its tolerance. This step virtually reproduces the production of n single parts. In the next step, the generated single parts are assembled into n final products. This corresponds to the n-fold virtual assembly of the final product and is based on the "pulling without adding back" of the individual parts according to Bernoulli's urn model [31]. Finally, the relevant key characteristic is determined using the vector chain and documented for each of the n virtually assembled products. This corresponds to the inspection of the products after their assembly.
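The three steps can be sketched as follows. The nominal values and tolerance limits are illustrative placeholders, not the specification of Table 2, and the final vector chain is a stand-in for the hinge's piecewise function:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of virtually produced parts and assemblies

# Step 1 -- virtual manufacturing: draw n deviating values per dimension
# from its tolerance distribution (nominals/limits here are illustrative).
M1 = rng.normal(loc=20.0, scale=0.05 / 3, size=n)   # Gaussian tolerance
M2 = rng.uniform(low=9.95, high=10.05, size=n)      # uniform tolerance

# Step 2 -- virtual assembly: pairing the i-th variant of every part
# corresponds to drawing parts without replacement (urn model).

# Step 3 -- virtual inspection: evaluate the vector chain for each of the
# n assemblies and collect the key characteristic (placeholder chain).
p = M1 - 2.0 * M2
```

The resulting array p is then analyzed statistically, e.g., its standard deviation is compared against the 0.1 mm requirement.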

Implementation of the Monte Carlo Simulation
The sample generation and simulation are the most runtime-intensive steps of the tolerance optimization process, as a sufficient number of samples must first be generated and subsequently processed in the simulation step in order to obtain a trustworthy estimate for each tolerance configuration. Accelerating either of these steps therefore provides the largest reduction of the runtime of the overall process. Consequently, the two steps 'sample generation' and 'simulation' have been implemented on the CPU and GPU hardware architectures of present-day desktop computers. To investigate the influence of different memory storage layouts and floating-point number configurations on the runtime, several implementations have been included in this study, as well as an implementation in MATLAB (see Tables 3 and 4 for an overview of the CPU and GPU implementations). The generation of samples and the simulation are accomplished either on the CPU or with the use of the graphics unit. In more detail, the NumPy implementations use the np.random.normal function to generate normally distributed pseudorandom samples and the np.random.uniform function for uniformly distributed samples. The simulation step was realized with NumPy arrays for fast vectorized array operations. The CuPy implementations use the same code as the NumPy implementations, but with CuPy as a drop-in replacement for NumPy to perform the sample generation and simulation on the GPU. The MATLAB implementations use the mlfg6331_64 generator of RandStream to generate the samples on the CPU and the mrg32k3a generator of parallel.gpu.RandStream for the GPU implementation. The simulation is implemented using MATLAB matrix and vector operations. The Python program used is available on GitHub [32], accessed on 28 April 2021 (see page 17).
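The drop-in character of CuPy can be sketched as follows. The distribution parameters are illustrative, and the alias import is one common way to switch backends without touching the simulation code:

```python
# CPU baseline; to run the identical code on the GPU, assuming CuPy and a
# CUDA-capable card are available, swap this line for `import cupy as xp`.
import numpy as xp

# Normally and uniformly distributed pseudorandom samples, analogous to
# the np.random.normal / np.random.uniform calls described above.
normal_part = xp.random.normal(0.0, 0.1 / 3, size=1_000_000)
uniform_part = xp.random.uniform(-0.05, 0.05, size=1_000_000)

# Vectorized simulation step on the sampled deviations (placeholder chain),
# followed by the statistical evaluation of the key characteristic.
p = normal_part + uniform_part
sigma_p = float(xp.std(p))
```

Since both libraries expose the same array API for these calls, the sample generation and simulation code paths stay identical across the CPU and GPU implementations.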

Procedure for Statistical Tolerance Synthesis
The purpose of statistical tolerance synthesis is to allocate the maximum tolerance of the functional key characteristic to the characteristics of the parts of the product [9]. It is thus the inverse of statistical tolerance analysis [9]. The goal of statistical tolerance synthesis is to identify the best compromise between (usually two) divergent requirements for a product and thus for the tolerancing of the individual parts. In the case of a cost-driven statistical tolerance synthesis, this compromise is the tolerance specification that is associated with the lowest manufacturing costs but at the same time ensures that a defined rejection rate of the final product is not exceeded [9]. If this compromise is identified by means of numerical optimization, the term statistical tolerance optimization has become established. Pseudocode for the implementation of a cost-driven statistical tolerance optimization algorithm is provided in Figure 5. Let T be a selected set of tolerances, which is chosen by a global optimization solver. The cost function is denoted by C. The total production cost is the sum of the production costs of each tolerance C(T) on the basis of the tolerance-cost model, which is usually specified by the coefficients C_ind, m, and k to approximate the dependencies between the costs and tolerances [33,34]. The manufacturing costs consist of the following components: the variable costs C_var and the fixed manufacturing costs C_fix for each tolerance T. Let S be the system model of the corresponding tolerance problem and D be the probability distributions of the considered tolerances T.
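A reciprocal-power form is one common way to approximate such a cost-tolerance dependency. The form and the coefficient values below are assumptions for illustration, not necessarily the exact model of [33,34]:

```python
import numpy as np

def tolerance_cost(T, C_fix, k, m):
    # Assumed reciprocal-power tolerance-cost model: fixed manufacturing
    # costs plus variable costs that grow as the tolerance T is tightened.
    return C_fix + k * np.power(T, -m)

def total_cost(T, coeffs):
    # Total production cost: sum of the per-tolerance costs C(T_i).
    return sum(tolerance_cost(t, *c) for t, c in zip(T, coeffs))

coeffs = [(1.0, 0.05, 1.0), (2.0, 0.02, 1.5)]   # illustrative coefficients
cost = total_cost([0.1, 0.2], coeffs)
```

The essential property for the optimization is the monotonicity: tightening any tolerance increases its cost, so the solver is pulled toward the widest tolerances that still satisfy the quality constraint.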

Implementation of Statistical Tolerance Synthesis Using Numerical Optimization
To investigate the impact on runtime, both the open source CPU (NumPy) and GPU (CuPy) implementations of the statistical tolerance analysis have been integrated into the statistical tolerance synthesis process to calculate the cost-optimal tolerances using the differential evolution algorithm of [35]. The chosen parameter settings of the algorithm are listed in Table 5. For the CPU implementation with NumPy, the population is distributed among all CPU cores. For the GPU implementation with CuPy, a single Python process is executed sequentially. The cost-driven tolerance optimization has been applied to the overconstrained door hinge assembly defined in Section 4, leading to an optimization problem that minimizes the total production cost subject to the constraint σ_p(X) ≤ 0.1 (Equation (7)) within given tolerance bounds. To ensure the functionality of the assembly, the requirement for p is defined as an inequality constraint: the standard deviation σ_p must be less than or equal to 0.1 mm in order to guarantee sufficient tightness. The tolerance-cost models are taken from [28]; their coefficients are listed in Table 6. Table 6. Parameters of tolerance-cost models.
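A minimal sketch of such a cost-driven optimization with SciPy's differential evolution is shown below. The two-tolerance chain, the cost model, and the penalty handling of the σ_p constraint are simplifications for illustration, not the setup of Table 5 or the hinge model:

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
N = 5_000  # Monte Carlo sample size per cost evaluation (small for speed)

def sigma_p(T):
    # Monte Carlo estimate of the gap's standard deviation for a simple
    # placeholder chain p = x1 + x2, Gaussian deviations with 3*sigma = T_i.
    x = rng.normal(0.0, np.asarray(T) / 3.0, size=(N, 2))
    return float(x.sum(axis=1).std())

def objective(T):
    cost = float(np.sum(1.0 / np.asarray(T)))      # placeholder cost model
    penalty = 1e4 * max(0.0, sigma_p(T) - 0.1)     # enforce sigma_p <= 0.1 mm
    return cost + penalty

result = differential_evolution(objective, bounds=[(0.01, 0.5)] * 2,
                                seed=1, maxiter=40, polish=False)
```

Because the objective is evaluated by Monte Carlo simulation, it is noisy; a derivative-free, population-based solver such as differential evolution tolerates this noise, which is one reason it is a natural fit here.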

Runtime Measurement Notes
The Python built-in time function time.time is used to determine the runtime of each Python implementation. For the MATLAB implementations, the integrated stopwatch timer functions tic and toc were used.

Software Details
The programming languages MATLAB 2018b and Python 3.8.2 (including the Python libraries NumPy 1.18.4, CuPy 8.1.0, and SciPy 1.5.4) were used.

Platform Details
The runs were conducted on a Windows 10 workstation with an AMD Ryzen 9 3900 12-core processor (3.8 GHz clock speed), 32 GB of DDR4 RAM (3200 MHz clock speed), and an Nvidia GeForce RTX 2060 graphics unit with 6.0 GB of VRAM (CUDA compute capability 7.5).

Results and Discussion
In this section, we present and discuss the results of the study. First, we review the results of the statistical tolerance analysis implementations. Next, we discuss the results of the statistical tolerance synthesis and determine, based on an appropriate sample size, which of the two best open source implementations is more appropriate for the underlying tolerance problem.

Statistical Tolerance Analysis
The statistical tolerance analysis was performed on the basis of the implementations described in Section 5.2. The results are presented for the CPU in Section 6.1.1 and for the GPU in Section 6.1.2.

CPU Implementations
Figure 6 and Table 7 present the average runtime in seconds, with its standard error of the mean, of 25 executions of the steps of generating the samples and then simulating them using the CPU architecture. It can be seen that both CPU MATLAB implementations perform better than the Python open source implementations with NumPy for nearly all sample sizes. This is due to MATLAB being able to work with all processor cores of the workstation by default, while the Python implementations are limited to one CPU core by the 'Global Interpreter Lock' (GIL) [36]. However, in Section 6.2.1, the GIL constraint is bypassed using the multiprocessing feature of the numerical solver in the tolerance synthesis. Multiprocessing is the creation of several new Python processes (each with its own GIL), in which each process takes a portion of the computation. A difference in runtime between the use of C-order and F-order as well as between single precision and double precision can also be identified. For all CPU implementations, the use of F-order as the NumPy array storage layout tends to be more advantageous than C-order. Furthermore, runtime gains can also be achieved using single precision instead of double precision floating-point numbers for the CPU implementations. This, however, involves a loss of precision in the estimation.

GPU Implementations
Figure 7 and Table 8 show the average runtime in seconds, with the standard error of the mean, of 25 executions of the statistical tolerance analysis as a function of the sample size using the GPU architecture.

Since, for the GPU implementations, the generation and processing of the random numbers do not take place in the workstation's main memory but in the typically smaller video memory of the GPU, some distinctive results can be observed compared to the CPU implementations: (i) The video memory of the GPU in use reaches its capacity limit for double precision floating-point numbers at sample sizes larger than 4.5 × 10^7 (NaN results in Table 8). With single precision, which requires only half as much memory as double precision, the memory limit is reached at sample sizes greater than 9.0 × 10^7. (ii) The choice between single and double precision also has a greater influence on the runtime, which is significantly lower in the case of single precision. (iii) As with the CPU implementations, the use of the F-order storage layout seems to be more advantageous: on the one hand, the runtime is consistently lower, and on the other, the GPU implementations with C-order storage layout exhibit runtime spikes for higher sample sizes. We see the cause in the less efficient memory access operations compared to the implementations using the F-order storage layout. (iv) In contrast to the CPU implementations, the GPU Python implementations perform significantly better than the GPU implementations with MATLAB.
A comparison of the CPU and GPU results shows that the statistical tolerance analysis using the GPU scales significantly better with an increase in sample size (apart from memory-related effects). The most effective implementation (CuPy Float32 F-Order) is able to perform the statistical tolerance analysis in real time up to a sample size of one million, which is sufficient to analyze even complex tolerance problems.

Statistical Tolerance Synthesis
In the following (Section 6.2.1), the fastest Python CPU and GPU implementations are incorporated into the statistical tolerance synthesis (described in Section 5.3) to determine how each implementation performs for various sample sizes. For this purpose, the average runtime of the differential evolution step is used as a metric for comparing the CPU and GPU implementations in the statistical tolerance synthesis. This enables a statement about which implementation achieves better runtime results, independently of how fast the solver converges to a solution in each individual case. In Section 6.2.2, we then analyze what sample size is actually required in order to obtain a reasonably good estimate from the statistical tolerance synthesis.

CPU vs. GPU Implementations
Figure 8 and Table 9 present the average runtime in seconds with its standard error of the mean of 10 executions of the differential evolution for each configured implementation, with the parameters shown in Table 5.
The results in Table 9 and Figure 8 show that the computational cost of the statistical tolerance synthesis increases linearly with the sample size, analogous to the results of the statistical tolerance analysis. The CPU implementation (which, in the context of a statistical tolerance synthesis, is able to use all CPU cores) provides a faster computation of the differential evolution for sample sizes smaller than 50,000. For larger sample sizes, however, the GPU implementation is faster. This indicates that, beyond a certain complexity, the calculation using a GPU architecture is faster than a computation using only the CPU. Given a sample size of 10,000, which is a common standard in industrial applications, it is therefore possible to perform a faster tolerance synthesis with the CPU implementation due to its lower runtime for the differential evolution. To double-check this assumed standard, we determine in Section 6.2.2 how large the sample size actually needs to be to obtain a reliable estimate for the given tolerance model.

Sample Size Required for a Reliable Estimate
Figure 9a displays the manufacturing costs calculated by the statistical tolerance synthesis (with CuPy Float32 F-Order) from Section 6.2.1 vs. the sample size. Figure 9b displays the corresponding standard deviation σ_p, determined with a sample size of 1.0 × 10^7, in order to determine whether a sample size of 10,000 is sufficient for the tolerance optimization problem. The spread of the results of the statistical tolerance syntheses in Figure 9a,b shows that, for the given tolerance problem, a sample size of at least 50,000 should be used to satisfy the boundary condition (σ_p < 0.1 mm) with sufficient confidence. Smaller sample sizes tend to result in manufacturing costs, and thus tolerances, that do not comply with the required boundary conditions of the statistical tolerance synthesis. On this basis, computation using the GPU is preferable due to its lower runtime for the differential evolution. Figure 10 illustrates the reduction of manufacturing costs during optimization. The optimization is carried out using the parameter settings shown in Table 5 and a sample size of 100,000 with the CuPy single precision F-Order implementation. A cost-optimal tolerance specification is obtained after 55 steps of the evolution (approximately 13.5 s).

Additional Notes on Accuracy and Comparability of the Implementations
Although the NumPy and CuPy implementations are based on the same codebase (with the exception of the imports), the question arises whether the different implementations provide the same level of accuracy. It might be that the performance gain of the GPU implementation was partially obtained at the expense of the accuracy of the estimation, for example, by using a lower-quality random sample generation process. A fair comparison between the Python and the MATLAB implementations is considerably more difficult than a comparison among only the Python implementations, due to the different programming languages, their quirks, and their individual restrictions. For this reason, the claim of this paper is by no means meant to be a Python vs. MATLAB argument. Rather, the claim is that one can accomplish a statistical tolerance analysis and a statistical tolerance synthesis with open source software and methods.

Current Drawbacks of the Method and Future Research Demands
The tolerance optimization methodology using sampling methods with stochastic solution algorithms is approaching the limits of reasonable computation time according to [37]. With our approach of computing the Monte Carlo simulation on the GPU, we show a possible way to reduce this computation time, which is expected to further increase in the future due to the growing complexity of tolerance synthesis models [37]. However, several aspects require improvement and further research.

•
The mathematical model used does not yet take into account geometric deviations of the manufactured components. Their consideration, however, is warranted, because a majority of companies consider geometric deviations in the specification of tolerances [8]. The implementation in this paper is not limited to deviations in size; vector chains with geometric tolerances can also be analyzed.

•
The vector chain needs an extension to fully account for 3D effects, because the assumption that most problems can be traced back to 1D or 2D is usually overly optimistic [37]. However, the demonstrated approach using a derivative-free optimization algorithm already lays a foundation for implementing this in the future, since the problem does not have to be oversimplified [37]. The prerequisite is that a closed vector chain is available.

•
The approach has been tested on an Nvidia GPU. However, the CuPy library also has experimental support for Radeon Open Compute (ROCm) [16], AMD's non-proprietary open source alternative to CUDA. Further work could investigate how applicable this is.

•

Finally, we see a need for research concerning the support of multiple GPUs. This would allow the statistical tolerance synthesis to be distributed across several graphics units, analogous to the already existing CPU implementation, and thus to be accelerated. For this purpose, the applied solver or alternative approaches could be investigated.

Conclusions
With the approaches presented in this study, it is possible to reduce the runtime of vector-chain-based statistical tolerance analysis to real time and thus to drastically speed up statistical tolerance optimization. From the engineer's point of view, it can be concluded that the presented implementations of statistical tolerance analysis are capable of real time execution: the calculation times of the statistical tolerance analysis with a sample size of 10,000 were all well below 1 ms, and only with a sample size of more than 5 million samples did runtimes (>5 ms) exceed the real time requirement set. It has therefore been shown that statistical tolerance analyses (with a sufficient sample size) can be accelerated to real time in order to establish a continuous flow of information for the implementation of self-awareness and self-adjustment in manufacturing and thus finally promote the implementation of Industry 4.0. Furthermore, the results provide information on how the CPU and GPU architectures can be used effectively in the context of tolerance analysis and tolerance synthesis and which of the considered parameters significantly influence the runtime of the implementations. From this, concrete recommendations for future work can be derived.
The general advice for improving the runtime of statistical tolerance analyses and statistical tolerance syntheses is to use NumPy arrays for vectorization. The results clarify the significance of an adequate choice of memory layout, especially for implementations that use a GPU. We therefore recommend testing which memory layout option is better suited to the code used. Another finding is that the choice of architecture depends on the complexity of the considered tolerance problem. We recommend first evaluating how many samples are needed to obtain a reasonable estimate. If a correspondingly powerful GPU is available, it is highly recommended to carry out the calculations on the GPU instead of the CPU once a certain computational complexity has been reached. This is due to the better parallel processing options and the typically higher core count of a GPU. However, if the tolerance problem is not computationally complex, the additional overhead of copying the data into video memory may exceed the benefit of the faster computation on the GPU. In this case, an implementation that uses the CPU architecture can be more advantageous. If the calculation is carried out on the GPU, we recommend using single instead of double precision floating-point numbers.
We hope that this work will help to further improve existing approaches to statistical tolerance analysis and statistical tolerance synthesis with respect to the ongoing transition to Industry 4.0 and provide a practical guide as well as an entry point for future tolerance engineering research focusing on heterogeneous computer architectures and parallel computing. Finally, efficient open source alternatives to the predominantly deployed software packages are available to speed up computer simulations to real time.

Figure 3 .
Figure 3. Two assembly scenarios occurring due to the overconstrained design of the door hinge assembly.

Figure 4 .
Figure 4. Virtual reproduction of the manufacturing, assembly and inspection of each part by statistical tolerance analysis using Monte Carlo simulation.

Figure 5 .
Figure 5. Pseudocode of the implementation of the cost-driven statistical tolerance synthesis.

Figure 6 .
Figure 6. Statistical tolerance analysis using CPU: Mean runtime results with error bars.

Figure 7 .
Figure 7. Statistical tolerance analysis using GPU: Mean runtime results with error bars.

Figure 8 .
Figure 8. Performing Statistical Tolerance Synthesis on CPU (NumPy) vs. GPU (CuPy): Average runtime of the differential evolution step with error bars.** workers = −1: Distributes the populations among all available CPU cores for parallel computation, * workers = 1: A single Python process is spawned.

Table 2 .
Dimensions and corresponding tolerance specifications.

Table 3 .
Implementations of statistical tolerance analysis using a CPU.

Table 4 .
Implementations of the statistical tolerance analysis using a GPU.

Table 5 .
Settings for the differential evolution algorithm. ** workers = −1: Distributes the populations among all available CPU cores for parallel computation, * workers = 1: A single Python process is spawned.

Table 7 .
Statistical tolerance analysis using CPU: Mean runtime results in seconds with standard error of the mean.

Table 8 .
Statistical tolerance analysis using GPU: Mean runtime results in seconds with standard error of the mean.

Table 9 .
Average runtime of the differential evolution step in seconds with standard error of the mean. ** workers = −1: Distributes the populations among all available CPU cores for parallel computation, * workers = 1: A single Python process is spawned.