An Overview of AI-Assisted Design-on-Simulation Technology for Reliability Life Prediction of Advanced Packaging

Several design parameters affect the reliability of wafer-level advanced packaging, such as the upper and lower pad sizes, solder volume, buffer layer thickness, and chip thickness. Conventionally, the accelerated thermal cycling test (ATCT) is used to evaluate the reliability life of electronic packaging; however, optimizing the design parameters through ATCT is time-consuming and expensive, so reducing the number of experiments has become a critical issue. In recent years, many researchers have adopted finite-element-based design-on-simulation (DoS) technology for the reliability assessment of electronic packaging. DoS technology can effectively shorten the design cycle, reduce costs, and optimize the packaging structure. However, simulation results depend strongly on the individual researcher and are often inconsistent between researchers. Artificial intelligence (AI) can help researchers avoid this human factor. This study demonstrates AI-assisted DoS technology that combines artificial intelligence and simulation technologies to predict wafer-level package (WLP) reliability. To ensure prediction accuracy, the simulation procedure was validated against several experiments before a large AI training database was created. This research studies several machine learning models, including artificial neural network (ANN), recurrent neural network (RNN), support vector regression (SVR), kernel ridge regression (KRR), K-nearest neighbor (KNN), and random forest (RF) models, which are evaluated in terms of prediction accuracy and CPU time consumption.


Introduction
Electronics packaging plays an important role in the semiconductor industry. Currently, the mainstream electronic packaging structures include heterogeneous packaging, 3D packaging, system-in-packaging (SiP), fan-out (FO) packaging, and wafer-level packaging [1][2][3][4][5][6][7][8]. As packaging structures grow more complex, manufacturing reliability test vehicles and conducting ATCT experiments have become time-consuming and very expensive, so the design-on-experiment (DoE) methodology for packaging design is becoming infeasible. As a result of the wide adoption of finite element analysis [9][10][11][12][13][14][15], accelerated thermal cycling tests have been reduced significantly in the semiconductor industry, and package development time and cost have been reduced as well. In a 3D WLP model, Liu [16] applied the Coffin-Manson life prediction empirical model to predict the reliability life of a solder joint within an accurate range. However, the results of finite element simulations are highly dependent on the mesh size, and there is no guideline to help researchers address this issue. Therefore, Chiang et al. [17] proposed the concept of "volume-weighted averaging" to determine the local strain, especially in critical areas. Tsou [18] successfully predicted packaging reliability through finite element simulation with a fixed mesh size in the critical area of the WLP structure. However, simulation results remain highly dependent on the individual researcher and are usually inconsistent between researchers; AI can help avoid this human factor. For the implementation of the AI algorithms in this study, an AI training database was generated using finite element simulation combined with the Coffin-Manson empirical equation to predict WLP reliability life cycles, and this standard simulation procedure was experimentally verified before the massive simulation database was generated.
The rest of this paper is organized as follows: Section 2 describes the finite element method for the WLP. Section 3 discusses different types of machine learning models. Section 4 presents the results and discussion for the machine learning models, and Section 5 offers concluding remarks.

Finite Element Method for WLP
If a simulation consistently predicts the result of the experiment [52], then the simulation can stand in for the experiment; the experimental work can be replaced by a validated simulation procedure to create a large database for AI training and to obtain a compact, accurate AI model for reliability life prediction. Once the final AI model is obtained for a new WLP structure, developers can simply input the WLP geometries, and the life cycle is returned. Figure 1 illustrates this procedure. Because a huge amount of data is required to build the AI training model, this work used a two-dimensional finite element method (FEM) model for simulation. Before the database was built, the simulation process had to be shown reliable; this work therefore validated the simulation results against five WLP test vehicles (Tables 1 and 2). The sizes and specifications of the different materials and the mean times to failure of the test vehicles are presented in Tables 1 and 2 [53,54]. The simulation method mainly adopted a fixed mesh size at a critical location, determined through appropriate mechanics concepts, and an empirical equation was used to validate the reliability life cycles of all test vehicles. Following Tsou et al. [18], this work fixed the mesh size of the solder joint at the maximum distance from the neutral point of the WLP to fix the modeling pattern and simulation procedure. As shown in Figure 2, the width and height of the fixed mesh were 12.5 µm and 7.5 µm, respectively. The solder joint geometry was generated using Surface Evolver [55]. In the simulation, the solder was modeled as a nonlinear plastic material; therefore, PLANE182, which has good convergence characteristics and can handle large deformations, was used as the solder ball element. PLANE42 was used for the other components, which have linear material properties. Table 3 lists the individual material properties.
The Young's moduli of the solder joint at different temperatures are listed in Table 4. Figure 3 shows the stress-strain curve for an Sn-Ag-Cu (SAC)305 solder joint. The stress-strain curve [56], obtained by tensile testing, and the Chaboche kinematic hardening model were used to describe the tensile behavior at different temperatures. Once the model is built, boundary conditions and external thermal loading are required for the WLP simulation. Electronics packaging geometry is usually symmetrical; therefore, in this study, half of the 2D structure was modeled along the diagonal, as shown in Figure 4. The X-direction displacement on each node of the symmetry plane was fixed to zero owing to the Y-symmetry. To prevent rigid body motion, all degrees of freedom were fixed at the node at the lowest point of the neutral axis, on the printed circuit board (PCB). The complete finite element model and the boundary conditions are shown in Figure 5. The thermal loading condition used in this research was JEDEC JESD22-A104D condition G [57], with a temperature range of −40 °C to 125 °C, a ramp rate fixed at 16.5 °C/min, and a dwell time of 10 min. In a qualified design, the mean cycles to failure (MTTF) should exceed 1000 thermal cycles. After the simulation is completed, the incremental equivalent plastic strain in the critical zone is substituted into the strain-based Coffin-Manson model [58] for reliability life prediction. For a fixed temperature ramp rate, this method is as accurate as the energy-based empirical equations [59,60] but requires much less CPU time. The empirical formula of the Coffin-Manson equivalent plastic strain model is shown in Equation (1), where N_f is the mean cycles to failure, C and η are empirical constants, and ∆ε_eq^pl is the incremental equivalent plastic strain. For SAC solder joints, C and η are 0.235 and 1.75 [61,62]. Table 5 presents the predicted reliability life cycles of the WLP structures.
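The post-processing step above can be sketched in a few lines. Note that this is a hedged illustration, not the paper's code: the exact form of Equation (1) is assumed here to be the common power law N_f = C·(∆ε_eq^pl)^(−η), and the strain value used in the example is purely illustrative; only the constants C = 0.235 and η = 1.75 are taken from the text.

```python
# Hedged sketch of strain-based Coffin-Manson life prediction for SAC solder.
# Assumed form (Equation (1) is not reproduced in the text): N_f = C * de**(-eta).
C, ETA = 0.235, 1.75  # SAC-solder empirical constants quoted in the text [61,62]

def coffin_manson_life(delta_eps_pl_eq):
    """Mean cycles to failure from the incremental equivalent plastic strain."""
    return C * delta_eps_pl_eq ** (-ETA)

# Illustrative per-cycle equivalent plastic strain increment of 1%:
nf = coffin_manson_life(0.01)
```

Under this assumed form, smaller strain increments map to exponentially longer lives, which is consistent with the qualitative behavior described in the text.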
The results show that the difference between the FEM-predicted life cycle and experiment result is within a small range. Therefore, experiments can be replaced by this validated FEM simulation to minimize the cost and time. Compared with the experiment approach, this validated FEM simulation procedure can provide large amounts of data within much less time and can be effectively used to generate a database for AI training.

Machine Learning
Machine learning is an AI methodology that processes huge datasets to train a computer and, finally, build a simple regression model. In this study, several machine learning algorithms were used, including ANN, RNN, SVR, KRR, KNN, and RF, implemented in the Python language. A supervised regression model, e.g., the WLP reliability life cycle prediction model, requires both input data (geometry parameters) and the corresponding output results (life cycles) for the machine learning algorithms to build the final AI model of the WLP package. Once the regression model is established, for a new WLP structure the designer can simply input the WLP geometries of each component into the AI regression model to obtain the reliability life of this new WLP. This is a powerful and reliable technique for new packaging design.

Establishment of Dataset
The WLP structure consists of several components, including the solder mask, solder ball, I/O pad, stress buffer layer, and silicon chip (Figure 6). For illustration purposes, the four most influential parameters, namely the silicon chip thickness, stress buffer layer thickness, upper pad diameter, and lower pad diameter, were selected to build the AI model and predict the reliability life cycles of new WLP structures. These four design parameters were used to generate both training and testing datasets for the AI machine learning algorithms. Tables 6 and 7 show the training datasets obtained through FEM simulation: two training datasets were generated, containing 576 (Table 6) and 1296 (Table 7) samples, respectively. For testing, 54 samples were selected randomly from the interpolation range of the above training datasets. Increasing the training dataset improves AI performance; however, the CPU/GPU time also increases.
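The full-factorial enumeration of the four design parameters can be sketched as follows. The level values below are placeholders, not the actual levels of Tables 6 and 7; the point is only that six levels per parameter yield the 6^4 = 1296 combinations of the larger dataset, each of which corresponds to one FEM run whose predicted life cycle becomes a training label.

```python
from itertools import product

# Illustrative (assumed) parameter levels, in micrometers:
chip_thickness   = [200, 250, 300, 350, 400, 450]
buffer_thickness = [5, 10, 15, 20, 25, 30]
upper_pad_dia    = [200, 220, 240, 260, 280, 300]
lower_pad_dia    = [220, 240, 260, 280, 300, 320]

# Every combination of the four parameters is one design point / FEM run.
design_points = list(product(chip_thickness, buffer_thickness,
                             upper_pad_dia, lower_pad_dia))
print(len(design_points))  # 1296
```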

ANN Model
The ANN model is based on the concept of the brain's self-learning ability, mimicking the human nervous system to process information. It is a multilayer neural network, as shown in Figure 7. The model consists of three layer types: the input layer, where the data are provided; the hidden layers, where the input data are processed; and the output layer, where the results are displayed [63]. As the numbers of neurons and hidden layers increase, the ability to handle nonlinearity improves; however, this may also result in high computational complexity, overfitting, and poor predictive performance. In the ANN model, a_i^l is the ith activation element of the lth hidden layer and b_i^l is its bias; a_i^l equals the previous layer's outputs multiplied by the weights W_ji^l plus the bias, as given in Equation (2). From the calculation point of view, the input layer data are first combined with the weights and bias to obtain the value a_i^l, which is then substituted into an activation function, e.g., the sigmoid, to convert it into a nonlinear form serving as the input a_i^(l+1) of the next layer,
where the activation function is shown in Figure 8.
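The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the trained model of this study: the layer sizes, random weights, and input vector are assumptions, chosen only to mirror the structure of four geometry inputs mapping to one life-cycle output.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation converting the weighted sum into a nonlinear form."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate an input through all layers: a = sigmoid(W @ a + b) per layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # weighted sum plus bias (Eq. (2)), then activation
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]             # 4 geometry inputs -> 2 hidden layers -> 1 output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]
y = forward(np.array([0.2, 0.5, 0.1, 0.9]), Ws, bs)
```

Because the output neuron also uses a sigmoid here, the raw output lies in (0, 1); in a regression setting the predicted life cycles would be recovered by inverting the output scaling applied during preprocessing.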

RNN Model
RNN is a type of neural network that can model time-series-like data, and it commonly adopts a nonlinear structure in deep learning. An RNN [64,65] works on the principle that the output of a particular layer is fed back to the input layer, realizing a time-dependent neural network and a dynamic model. Consequently, an ANN with nodes connected in a ring shape is obtained, as shown in the left half of Figure 9. The ring-shaped neural network is expanded along the "time" axis, as shown in the right half of Figure 9, where at "time" step t the hidden state s_t can be expressed as a function of the state from the previous "time" step (s_t−1) and the current input (x_t). U, V, and W denote the weights shared across "time" steps in RNN models. Generally, RNN series models can be divided into four types according to the numbers of inputs and outputs over the "time" steps: one to one (O to O), one to many (O to M), many to one (M to O), and many to many (M to M). To synchronize the input features with the output results, RNN models can be subdivided into different series models, as shown in Figure 10.
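The recurrence of Figure 9 can be sketched as follows; the dimensions and random shared weights U, V, and W are illustrative assumptions, and tanh is assumed as the state nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 6, 1
U = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input  -> state weights
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # state  -> state weights
V = rng.normal(scale=0.5, size=(n_out, n_hidden))     # state  -> output weights

def rnn_forward(xs):
    """Unroll the ring-shaped network along the "time" axis (Figure 9)."""
    s = np.zeros(n_hidden)
    outputs = []
    for x in xs:                    # one iteration per "time" step
        s = np.tanh(U @ x + W @ s)  # s_t depends on x_t and s_{t-1}; U, W shared
        outputs.append(V @ s)       # o_t = V s_t
    return outputs

outs = rnn_forward(rng.normal(size=(3, n_in)))
```

Keeping every element of `outputs` corresponds to the many-to-many variant; keeping only `outputs[-1]` corresponds to the many-to-one variant.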

SVR Model
This regression method evolved from the support vector machine algorithm. It transforms the data into a high-dimensional feature space and adopts the ε-insensitive loss function (Equation (4)) to perform linear regression in that feature space (Equation (5)). In this regression method, the norm of w is also minimized to avoid overfitting.
In other words, f (X, w), which is the function of the SVR model, will be as flat as possible. The SVR concept is illustrated in Figure 11. The data points outside the ε-insensitive zone are called support vectors, and two slack variables, ξ i and ξ * i , are used to record the loss of each support vector. Thus, the whole SVR problem can be seen as an optimization problem (Equation (6)).
where L(y, f (X)) is the loss of a single output variable y as a function of the n input variables X under the model f (X), w is the weight vector, b is the bias, and ∅(X) is the transformation function. C is a penalty factor that controls the accuracy of the regression model; if C is set to infinity, only accuracy matters, not model complexity. The SVR problem can be solved easily as a dual problem using a kernel function K(x_i, x_j) that satisfies Mercer's condition in the objective function. Here, α_i and α*_i are Lagrange multipliers, and data points with positive, non-zero α_i and α*_i are support vectors.
In order to solve the optimization problem, the regression model is built as shown in Equation (7), where b is the bias of the SVR model.
In Equation (7), the term K(X, X i ) is known as the kernel function, and X i is the training sample, with X as an input variable. This kernel function should be chosen as a dot product in the high-dimensional feature space [67]. There are numerous types of kernel functions. The commonly used kernel functions for SVR are the linear kernel, polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel.
All of the kernel functions satisfy Mercer's condition; however, the regression results of the kernels vary. Therefore, it is essential to choose the best kernel function for the SVR algorithm to obtain optimal performance.
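The four kernels named above can be written out explicitly as follows; the hyperparameter values (gamma, degree, coef0) are illustrative assumptions, not the tuned values of this study.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    return (x @ y + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # Radial basis function: depends only on the squared distance ||x - y||^2.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return np.tanh(gamma * (x @ y) + coef0)

x, y = np.array([1.0, 2.0]), np.array([1.0, 2.0])
# The RBF kernel of a point with itself is always 1 (zero distance):
print(rbf_kernel(x, x))  # 1.0
```

This zero-distance property is one reason the RBF kernel behaves as a smooth similarity measure, which is consistent with its strong performance in the results below.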

KRR Model
KRR combines ridge regression with the kernel "trick". This model learns a linear function in the space induced by the chosen kernel and the dataset; with nonlinear kernels, it corresponds to a nonlinear function in the original space. Several kernels, such as the RBF, sigmoid, and polynomial kernels, were also evaluated for the KRR algorithm to find a suitable kernel function for the nonlinear WLP dataset.
KRR is possibly the most elementary kernelized algorithm: ridge regression with a kernel [68]. In this study, a linear function modeling the dependency between the covariate input variables x_i and the response variables y_i is found. The classic method minimizes the quadratic cost, as shown in Equation (8). For a nonlinear dataset, however, the input is mapped from the original lower-dimensional space into a higher-dimensional feature space; that is, X_i → Φ(X_i). In this higher-dimensional space, the predictive model is prone to overfitting; hence, the cost function requires regularization.
where C(W) is the cost function and W^T is the transpose of the weight vector W. A simple and effective way to regularize is to penalize the norm of W; this is called "weight decay", and it remains to choose the regularization factor λ, for example, by cross-validation to avoid overfitting. Hence, the total cost function becomes Equation (9), which needs to be minimized; its derivative is therefore obtained and set equal to zero.
To optimize the above cost function C, this study introduces Lagrange multipliers into the problem. Consequently, the derivation step becomes similar to that in the SVR case. After the optimization problem is solved, the resulting KRR regression model is shown in Equation (10).
where K(x, x_i) is the kernel function of the training sample x_i with x as the input variable, and α*_i is the weight of the KRR model, given by Equation (11) as α* = (K + λI)^−1 y, where λ is the regularization factor, I is the identity matrix, and y is the response vector. The introduction of the kernel function K makes Equation (11) simple and flexible, and the model avoids both high model complexity and long computational times.
One big disadvantage of ridge regression is that there is no sparseness in the α vector; that is, there is no concept of support vectors. Sparseness is useful because when a new example is tested, only the support vectors need to be summed, which is much faster than summing the entire training set. In SVR, the only source of sparseness is the inequality constraints because, according to the complementary slackness conditions, if the constraint is inactive, then the multiplier α i is zero.
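A minimal NumPy sketch of the closed-form solution described above follows. The training data, kernel width, and regularization value are synthetic and illustrative; the point is that the dual weights come from a single linear solve, and that a prediction sums over all training samples (hence no sparseness, unlike SVR).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(size=(30, 2))              # synthetic 2D training inputs
y = np.sin(3 * X[:, 0]) + X[:, 1]          # synthetic smooth response

lam = 1e-3                                 # regularization factor lambda
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lam*I)^-1 y

def predict(Xq):
    # Kernel-weighted sum over ALL training samples -- no support vectors.
    return rbf(Xq, X) @ alpha

train_err = np.abs(predict(X) - y).max()
```

Because every α_i is generally non-zero, each query must touch the whole training set, which is exactly the sparseness disadvantage relative to SVR noted above.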

KNN Model
The KNN model is a statistical tool for estimating the value of an unknown point based on its nearest neighbors [69]. The nearest neighbors are usually calculated as the points with the shortest distance to the unknown point [70]. Several techniques are used to measure the distance between the neighbors. Two simple techniques are used in this study: the Euclidean distance function d(x, y), provided in Equation (12), and the Manhattan distance function d(x, y), provided in Equation (13).
where x = (x_1, . . . , x_n), y = (y_1, . . . , y_n), and n is the vector size. The K neighbor points with the shortest distances to the unknown point are used to estimate its value using Equation (14),
where w_i is the weight of each neighbor point y_i with respect to the query point ŷ [71]. The KNN estimate defined in Equation (14) is the weighted average of the neighborhood. The simplest KNN model is the mean of the neighborhood, obtained with uniform weights w_i = 1/n, where all neighbor points have the same effect on the estimation. In contrast, when the neighbor points are assumed to have different effects on the query point estimation, different weights can be applied; the simplest weight function is provided in Equation (15),
where d_i is the distance between the unknown point and its ith neighbor. The weight function reaches its maximum value at zero distance from the interpolated point and decreases as the distance increases [72]. The KNN estimate in Equation (14) depends only on the neighbor points and therefore neglects the trend of the whole dataset; however, the distance weighting of Equation (15) provides better KNN estimates because it emphasizes the nearest values. With this weight function, the KNN algorithm is suitable for both regression and classification problems.
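Putting Equations (12), (14), and (15) together gives the following sketch; the four one-dimensional training points are a toy example, not data from this study, and a small epsilon is added (an implementation assumption) to guard against division by zero when a query coincides with a training point.

```python
import numpy as np

def knn_predict(Xq, X, y, k=3, weighted=True, eps=1e-12):
    """Distance-weighted KNN regression estimate for a single query point Xq."""
    d = np.sqrt(((X - Xq) ** 2).sum(axis=1))   # Euclidean distances (Eq. (12))
    idx = np.argsort(d)[:k]                    # the K nearest neighbors
    if weighted:
        w = 1.0 / (d[idx] + eps)               # inverse-distance weights (Eq. (15))
    else:
        w = np.ones(k) / k                     # uniform weights
    return (w * y[idx]).sum() / w.sum()        # weighted average (Eq. (14))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 10.0, 20.0, 30.0])
print(knn_predict(np.array([1.1]), X, y, k=2))
```

For the query 1.1, the two nearest neighbors are 1.0 and 2.0; inverse-distance weighting pulls the estimate close to the nearer neighbor's value of 10, whereas uniform weighting would simply return their mean of 15.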

The RF Regression Model
RF is a collection of decision trees. These tree models usually consist of fully grown and unpruned CARTs.
The structure of the RF regression model is shown in Figure 11. This algorithm creates an RF by combining several decision trees built from the training dataset. The CART tree selects one feature from all of the input features as the segmentation condition according to the minimum mean square error method.
The RF algorithm procedure comprises three steps. In step 1, the bagging method is used to create a subset that accounts for approximately 2/3 of the total data volume. In step 2, if the data value is greater than the selected feature value, the data points are separated to the right from the parent node, otherwise to the left of the parent node ( Figure 12). Afterward, a set of trained decision trees is created. In step 3, the RF calculates the average value of all decision tree results to obtain the final predicted value.
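The node-splitting rule in step 2 can be sketched as follows: for one CART node, scan the candidate (feature, threshold) pairs and keep the split that minimizes the weighted mean squared error of the two child nodes. The data are synthetic, and this shows a single node rather than a full tree or forest.

```python
import numpy as np

def best_split(X, y):
    """Return (feature index, threshold, weighted child MSE) of the best split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            right = X[:, j] > t               # larger values go to the right child
            left = ~right
            if left.sum() == 0 or right.sum() == 0:
                continue                      # skip degenerate splits
            mse = (left.sum() * y[left].var() +
                   right.sum() * y[right].var()) / len(y)
            if mse < best[2]:
                best = (j, t, mse)
    return best

X = np.array([[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]])
y = np.array([10.0, 12.0, 30.0, 31.0])
print(best_split(X, y))   # feature 1 is constant, so the split uses feature 0
```

A random forest repeats this greedy selection recursively on bootstrap subsets of the data (step 1) and finally averages the predictions of all trees (step 3).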

Training Methodology
To obtain the best performance and avoid overfitting of the final trained regression model, several techniques, including data preprocessing (for standardization), cross-validation (for parameter selection), and grid search (for hyperparameter determination), were applied during training. The AI regression model was estimated following the flowchart in Figure 13.

Data Preprocessing
The dataset values are not in a uniform range. Hence, before the machine learning model is developed, the data need to be preprocessed to standardize all of the input and output datasets and improve the modeling performance. Several data preprocessing methods, including min-max scaling, robust scaling, max absolute scaling, and standard scaling, were used in this study; among them, the method that yields the most accurate predictions from the input dataset must be selected.
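Two of the scalers named above can be sketched directly; both are applied column-wise, and the three-row feature matrix below is a synthetic example.

```python
import numpy as np

def min_max_scale(X):
    """Map each column linearly onto [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def standard_scale(X):
    """Give each column zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic features on very different ranges (e.g., chip vs. buffer thickness):
X = np.array([[200.0, 5.0], [300.0, 15.0], [400.0, 25.0]])
Xs = standard_scale(X)
```

Without such scaling, the column with the larger numeric range would dominate distance-based models such as KNN and kernel methods such as SVR and KRR.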

Cross-Validation
Cross-validation is the most frequently used parameter selection method. The basic idea of cross-validation is that not all of the dataset is used for training; a part of it (which does not participate in training) is used to test the parameters generated by the training set. The training data are trained with different model parameters and verified against the validation set to determine the most appropriate model parameters. Cross-validation methods can be divided into three categories: the hold-out method, the k-fold method, and the leave-one-out method. Owing to the high variance of the hold-out method and the huge computational burden of the leave-one-out method, the k-fold method was chosen for this study (Figure 14). After the choice of data preprocessing method was confirmed, cross-validation was performed to avoid overfitting of the machine learning model, as shown in Figure 14. The dataset was divided into 10 parts, and each part acted as either a validation or a training set in different training steps. The validation sets were also used to predict the training results.
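The 10-fold scheme of Figure 14 can be sketched as follows: shuffle the sample indices, cut them into k folds, and let each fold act once as the validation set while the remaining folds form the training set. The seed and fold count are illustrative.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

# 10 folds over the 576-sample training dataset:
splits = list(k_fold_indices(576, k=10))
```

Every sample appears in exactly one validation fold, so each candidate parameter setting is scored on data it was never trained on.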

Grid Search Technique
Grid search is an exhaustive method for finding the best hyperparameters for the training model. The search range of each hyperparameter must be set by the model builder. Although the method is simple and easy to perform, it is time-consuming; nevertheless, compared with manually searching the hyperparameter space, it reduces the overall effort. This work therefore adopted the grid search technique to find the best hyperparameters, and the training model was then fixed with these hyperparameters to produce the best AI model.
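A grid search can be sketched as follows: every combination of the user-specified hyperparameter levels is scored on a held-out validation set, and the combination with the lowest error is kept. The "model" here is a toy polynomial fit and the grid is illustrative; in this study the grids would instead span, e.g., kernel parameters or layer counts.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 80)
y = np.sin(4 * x) + 0.05 * rng.normal(size=x.size)   # synthetic noisy data
x_tr, y_tr = x[::2], y[::2]      # even samples for training
x_va, y_va = x[1::2], y[1::2]    # odd samples for validation

grid = {"degree": [1, 3, 5, 7]}  # user-specified search range
best_score, best_params = np.inf, None
for (deg,) in product(*grid.values()):        # exhaustive scan of the grid
    coeffs = np.polyfit(x_tr, y_tr, deg)
    err = np.abs(np.polyval(coeffs, x_va) - y_va).mean()
    if err < best_score:
        best_score, best_params = err, {"degree": deg}
```

After the scan, `best_params` is fixed and the model is retrained once with it, mirroring how the final AI models in this study were fixed with their grid-searched hyperparameters.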

Results and Discussion
This section discusses the outcomes of the different machine learning algorithms used for the WLP structure design. In this study, we analyzed both the accuracy and the CPU time consumption of the algorithms. Regarding accuracy, the mean absolute error (MAE) and the maximum absolute error between the FEM-predicted and AI-predicted reliability life cycles of the WLP structure were calculated. We discuss the training and testing error analyses for the different datasets, as well as the CPU time required by every regression model to predict the WLP structure reliability life. Training datasets comprising 576 and 1296 samples and a testing dataset comprising 54 WLP geometric combinations were used in this study. Table 8 presents the ANN regression results for the 576-sample training dataset. As shown in the table, different numbers of neurons, from 10 to 500, and hidden layers, from 2 to 20, were tested. After the numbers of hidden layers and neurons were tuned, the best ANN-predicted number of life cycles was calculated. Validated against the FEM results, the best MAE of the AI prediction was five cycles, and the maximum absolute error was 28 cycles, with 10 hidden layers and 200 neurons. Similarly, Table 9 presents the ANN regression results for the 1296-sample training dataset; the best MAE was three cycles, and the maximum absolute error was 18 cycles, with five hidden layers and 500 neurons. Therefore, the ANN regression model based on the 1296-sample dataset was more accurate than that based on the 576-sample dataset. Moreover, the CPU times for the two datasets differed (Table 10): 151 s was required for the 576-sample dataset, whereas 235 s was required for the 1296-sample dataset. Thus, although the larger dataset resulted in higher accuracy, it also required more CPU time.
Table 11 presents the results of the RNN regression models for the 576-sample training dataset. The RNN models considered different numbers of neurons and hidden layers. The best MAE was six cycles, and the maximum absolute error was 37 cycles, with 500 neurons and five hidden layers. Table 12 lists the RNN regression results for the 1296-sample training dataset. From the table, the best MAE was three cycles, and the maximum absolute error was 27 cycles. Therefore, the RNN regression model based on the 1296-sample dataset exhibited better accuracy. However, its CPU time (698 s) was higher than that of the model based on the 576-sample dataset (173 s; Table 13). Given the above, the ANN regression model outperformed the RNN regression model in both accuracy and CPU time, implying that the ANN model is more flexible than the RNN owing to the RNN model's complex structure. Table 14 presents the SVR results for the 576-sample dataset, covering the accuracy and CPU time for different kernel functions and hyperparameters. The best MAE for the testing data was 13 cycles, and the maximum absolute error was 55 cycles for the RBF kernel-based model; the shortest CPU time was 0.093 s. For the 1296-sample training dataset, the SVR exhibited better accuracy (Table 15), with a best testing MAE of 7.3 cycles and a maximum absolute error of 30 cycles.
However, the larger dataset required more CPU time. Given the above SVR results, we can infer that the RBF kernel plays a more important role in obtaining good SVR performance than the sigmoid and polynomial kernel functions. Table 16 presents the KRR results for the 576-sample training dataset, covering the accuracy and CPU time for the different kernel functions and hyperparameters used in the KRR algorithm. The best MAEs for the training and testing datasets were 8.4 and 12.2 cycles, respectively, for the model with the RBF kernel function, which also required a short CPU time (0.093 s). The KRR model trained on the larger dataset showed better performance (Table 17): the best testing MAE was 5.6 cycles, and the maximum absolute error was 24 cycles for the 1296-sample training dataset. However, the CPU time was higher than that for the 576-sample dataset (Table 18). As with SVR, the KRR results show that the RBF kernel achieved better accuracy and CPU time than the sigmoid and polynomial kernel functions. The KRR algorithm outperformed the SVR model in terms of accuracy and CPU time. Meanwhile, the ANN outperformed the RNN, SVR, and KRR in terms of accuracy; however, KRR and SVR outperformed ANN and RNN in terms of CPU time, owing to the larger numbers of hidden layers and neurons used in the ANN and RNN. The KNN results are shown in Table 19, which presents the accuracy of KNN in terms of the Euclidean and Manhattan distances versus the number of nearest neighbors (K) for the 576-sample training dataset. The best MAE and maximum absolute error were 18.2 and 72 cycles, respectively, with the Manhattan distance and K = 9. From Table 19, the algorithm performs better with the Manhattan distance than with the Euclidean distance at the same K value.
Similarly, Table 20 shows the KNN results for the 1296-sample dataset; the best MAE and maximum absolute error were 7.5 and 23 cycles, respectively, with K = 3. Table 21 compares uniform weighting with distance-based weighting, both using the Euclidean distance. From the table, distance-based weighting yielded a better MAE and maximum absolute error than uniform weighting for the KNN model of the WLP structure. With the increase in the training dataset from 576 to 1296 samples, the model accuracy improved, whereas the CPU times remained similar, i.e., 0.03 and 0.034 s (Table 22). Table 23 compares the RF regression results for the two training datasets: the testing MAE was 26.3 cycles for the 1296-sample dataset and 36 cycles for the 576-sample dataset. Therefore, increasing the training dataset improved the accuracy; however, the CPU time was higher for the larger training dataset (Table 23). Finally, error and CPU time analyses were conducted for the different machine learning algorithms used to model the WLP structure. Figure 15 shows the MAE analysis of the AI algorithms for the 576-sample training dataset: the ANN model exhibited the smallest error, i.e., 4 cycles, whereas the RF model exhibited the highest, i.e., 35.7 cycles; the errors of the other AI models were close to that of the ANN model. Similarly, Figure 16 shows the MAE results for the 1296-sample training dataset: the ANN model had the lowest MAE, i.e., 2 cycles, whereas the RF model had the highest error, i.e., 26.4 cycles, with the MAEs of the other AI models close to that of the ANN model. Furthermore, the CPU times required by the different AI algorithms for the WLP structure were investigated.
Figure 17 presents the CPU time analysis results for different AI models based on the 576-and 1296-feature training datasets.
From Figure 17, for the 576 training dataset, the KNN model required the lowest CPU time (0.03 s), whereas the RNN model required the highest (174 s). The CPU times required by the RF, SVR, and KRR models were close to that of the KNN model. Similarly, for the 1296 training dataset, the KNN model required the least CPU time (0.034 s), whereas the RNN model required the highest CPU time (699 s). Again, the CPU times required by the RF, SVR, and KRR models were close to that of the KNN model. The ANN and RNN models required more CPU times than the KRR, SVR, KNN, and RF models because of the usage of more neurons and hidden layers to estimate the WLP lifetimes.
Finally, all of the above algorithms were compared in terms of MAE (in cycles) and CPU time consumption for training datasets ranging from 500 to 9000 samples. Figure 18 shows the MAE versus the training dataset size for all of the above models and demonstrates that the testing error decreases as the training dataset grows for every algorithm. Among all of the algorithms, the ANN, RNN, SVR, and KRR testing errors are very close to each other; however, the accuracy of KNN and RF is not as good, because these two algorithms are more suitable for classification purposes. Similarly, Figure 19 shows the CPU time versus the training dataset size for all of the above AI models; the CPU time increases with the size of the training dataset. The ANN and RNN required more CPU time owing to their larger numbers of neurons and hidden layers, and the SVR algorithm also required more CPU time than the KRR, KNN, and RF algorithms owing to the larger number of hyperparameters tuned with the grid search technique.
Overall, the use of AI models may have a major impact on the electronic packaging industry. However, the result of the study confirms that a validated FEM solution procedure is crucial for generating reliable training datasets and that increasing the number of design features increases the CPU time needed to build the AI model for the WLP structure.

Conclusions
The machine learning algorithms used a large database generated by a validated FEM procedure for training analysis and obtained a structural reliability life cycle regression model for WLP. These AI regression models can predict the reliability life cycle of WLP structures when given the WLP geometry as the input. In terms of accuracy, the MAE between FEM and AI was less than 10 life cycles, which is acceptable, and the AI training CPU time consumption of several AI algorithms was small. The ANN algorithm exhibited the best accuracy, whereas the RF regression algorithm exhibited the lowest accuracy, presumably because RF is designed for classification purposes. Other algorithms such as KRR and SVR also showed good accuracy owing to the usage of the RBF kernel function. However, the KRR model slightly outperformed the SVR algorithm in terms of accuracy because of the use of fewer hyperparameters and lower model complexity. KNN is more suitable for classification purposes; nonetheless, the KNN regression model also exhibited good accuracy for the WLP structure database.
The study also investigated the CPU time required for training the above-mentioned AI algorithms to obtain a final regression model for predicting the reliability life of the newly designed WLP structure. The KNN, KRR, and RF regression models required less than 10 s, whereas the ANN, RNN, and SVR regression models required 150 to 1800 s. These times are very low compared with those required by FEM modeling and simulation, which can range from several hours to several days.
It is known in the electronic packaging community that the experiment-based design procedure may take 8 months to a year to complete one run of WLP test vehicle fabrication and ATCT testing; this has become unacceptable for today's advanced packaging development. Based on AI/machine learning algorithms and validated finite-element-based simulation technology, this research developed an AI-assisted design-on-simulation technology that can effectively and accurately predict the reliability life cycle of WLPs of various geometries. In addition, after the AI-trained model of the WLP is obtained, developers only need to input the geometric data of a new WLP, and the reliability life cycle is obtained within one second. Therefore, WLP structure optimization becomes feasible because the reliability prediction of any geometric combination of the WLP can be completed in a few seconds. The AI-assisted design-on-simulation technology can also be applied to other packaging types, such as system-in-packaging, heterogeneous, fan-out, and 3D packaging.