Predicting Wafer-Level Package Reliability Life Using Mixed Supervised and Unsupervised Machine Learning Algorithms

With the increasing demand for electronic products, electronic packaging has gradually developed toward miniaturization and high density. The most significant advantage of the Wafer-Level Package (WLP) is that it can effectively reduce the volume and footprint of the package. An important issue in WLP design is how to quickly and accurately predict reliability life under the accelerated thermal cycling test (ATCT). If the simulation approach is not adopted, designing a WLP usually takes several ATCTs, and each ATCT takes several months to produce reliability life results, which increases development time considerably. Simulation shortens this cycle, but simulation results may differ depending on the designer's domain knowledge, ability, and experience. This shortcoming can be overcome with artificial intelligence (AI). In this study, finite element analysis (FEA) is combined with machine learning algorithms, e.g., Kernel Ridge Regression (KRR), to create an AI model for predicting the reliability life of electronic packaging. KRR combined with the K-means clustering algorithm provides a highly accurate and efficient way to obtain AI models for large-scale datasets.


Introduction
Reliability is an important topic in the field of electronic packaging, and solder ball reliability analysis is the key to measuring the reliability of WLP. One main cause of package failure is the thermally induced mismatch in the coefficient of thermal expansion (CTE) between different materials. In ATCT, the first solder ball failure usually occurs at the diagonal corner of the package, which is the location with the largest distance from the neutral point (DNP). Although traditional ATCT tests can obtain the reliability life result, they are too time-consuming (usually several months). The long experiment time decreases R&D efficiency, which cannot meet market demand.
Finite element simulation is a feasible approach to reducing design cycles and time. Lin et al. [1] built a finite element model for WLP; the reliability life can be obtained by substituting the incremental equivalent plastic strain into the Coffin-Manson model, and the simulation and experimental results are in good agreement. Because the element mesh size at the upper-right corner of the solder ball is crucial to the final simulation result, Cheng [2] built a 3D finite element model for an area-array-type package. In this study, we build our simulation database for machine learning following our lab's modeling experience [1-4].
For FEA, different researchers may obtain different results, and it still takes time (several days or weeks) to get simulation results. Therefore, it is worthwhile to introduce machine learning to lower the training threshold for simulation, unify the results, and reduce development time. Machine learning can be divided into supervised and unsupervised learning based on the presence or absence of artificially assigned labels. This study employs both: supervised learning (KRR) for life prediction and unsupervised learning (K-means) to partition the training data.

Reliability Life Prediction Model
The reliability life prediction methods for packaging structures can be divided into two main types: energy-based and strain-based. The Coffin-Manson model [10] used in this research belongs to the strain-based methods. The incremental equivalent plastic strain is the key to evaluating the solder's reliability life. The Coffin-Manson model is shown in Equation (1):

$N_f = \alpha \left(\Delta\varepsilon^{pl}_{eq}\right)^{\varphi}$   (1)

where $N_f$ is the predicted reliability life (cycles) and $\Delta\varepsilon^{pl}_{eq}$ is the incremental equivalent plastic strain. $\alpha$ and $\varphi$ are empirical constants; in this study, $\alpha = 0.235$ and $\varphi = -1.75$.
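As a minimal sketch (assuming Python; not the authors' code), the Coffin-Manson calculation above can be written as:

```python
# Coffin-Manson life prediction: N_f = alpha * (delta_eps_pl_eq) ** phi,
# with the empirical constants used in this study.
ALPHA = 0.235
PHI = -1.75

def coffin_manson_life(delta_eps_pl_eq: float) -> float:
    """Predicted reliability life in cycles from the incremental equivalent plastic strain."""
    return ALPHA * delta_eps_pl_eq ** PHI

# A smaller strain increment per cycle predicts a longer life.
print(coffin_manson_life(0.01))  # ~743 cycles
```

Because $\varphi$ is negative, halving the strain increment raises the predicted life by a factor of roughly $2^{1.75}$.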

Ridge Regression (RR)
We often use the Least Squares Method (LSM) to solve regression problems in statistics. It is a mathematical optimization method that finds the best function match for the data by minimizing the sum of squared errors. The squared loss function of LSM is shown in Equation (2):

$L(\beta) = \|y - X\beta\|^2$   (2)

In Equation (2), $X$ is the matrix of data inputs: each row is one data sample and each column is one data dimension. $y$ is the output, and $\beta$ is the coefficient vector. In LSM, the target value $\beta$ is the one that minimizes $L(\beta)$.
To find it, we set the partial derivative of $L(\beta)$ with respect to $\beta$ to zero; the resulting expression for $\beta$ is shown in Equation (3):

$\beta = (X^T X)^{-1} X^T y$   (3)

Training models with LSM fits known sample points quickly and accurately, but it may cause overfitting. In Equation (3), when there is multicollinearity among the data dimensions, $X^T X$ is no longer a full-rank matrix, and it is difficult to compute its inverse directly. RR is an algorithm proposed to solve this problem. RR is essentially an improved LSM: by giving up the unbiasedness of LSM, a more robust model is obtained at the expense of losing some information and reducing training accuracy. Equations (4) and (5) show the new loss function and target value:

$L(\beta) = \|y - X\beta\|^2 + k\|\beta\|^2$   (4)

$\beta = (X^T X + kI)^{-1} X^T y$   (5)

In Equations (4) and (5), $k$ is the ridge parameter, and $I$ is the identity matrix. As $k$ increases, the undetermined coefficients $\beta$ stabilize; we look for the smallest $k$ at which the coefficients are stable.
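The closed-form solutions in Equations (3) and (5) can be sketched in a few lines of NumPy (the numerical example below is illustrative, not data from this study):

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, k: float) -> np.ndarray:
    """Ridge estimator of Equation (5): beta = (X^T X + k I)^{-1} X^T y.
    Setting k close to 0 recovers the least-squares solution of Equation (3)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(d), X.T @ y)

# Illustrative data with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true
beta_hat = ridge_fit(X, y, k=1e-8)  # tiny k: essentially least squares
print(np.round(beta_hat, 4))
```

Using `np.linalg.solve` instead of explicitly inverting $X^T X + kI$ is the numerically preferred way to evaluate Equation (5).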

Kernel Ridge Regression (KRR)
KRR [11-13] combines RR with the kernel trick. In many cases, mapping data into a high-dimensional space improves machine learning performance. The same effect can be achieved directly by defining a function $K$, called the kernel function. Three kernel functions are commonly used: the polynomial kernel, the sigmoid kernel, and the radial basis function (RBF) kernel, shown in Equations (6)-(8):

$K(x_i, x_j) = \left(\gamma \langle x_i, x_j \rangle + r\right)^d$   (6)

$K(x_i, x_j) = \tanh\left(\gamma \langle x_i, x_j \rangle + r\right)$   (7)

$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$   (8)

In Equations (6)-(8), $x_i$ and $x_j$ represent two data points, $\langle \cdot, \cdot \rangle$ denotes the dot product, and $\|x_i - x_j\|$ is the Euclidean distance between $x_i$ and $x_j$. The polynomial kernel has three parameters ($\gamma$, $r$, $d$), the sigmoid kernel has two ($\gamma$, $r$), and the RBF kernel has only one ($\gamma$), which controls its width. The amount of computation in KRR is significantly reduced by utilizing the RBF kernel, so this study focuses on the RBF kernel.
To introduce the kernel function, we need to write the RR solution as an inner product. Converting the original formula into the required form uses the matrix inversion lemma [14]:

$(H - FE^{-1}G)^{-1} = H^{-1} + H^{-1}F\,(E - GH^{-1}F)^{-1}\,GH^{-1}$   (10)

Letting $H = kI$, $F = X^T$, $G = -X$, and $E = I$ in Equation (10) and simplifying the optimal solution of $\beta$ from Equation (5), we obtain the new expression shown in Equation (12):

$\beta = X^T (kI + XX^T)^{-1} y$   (12)

Now it is very close to kernelization. Our remaining task is to predict $y^*$ when a new sample point $x^*$ arrives. Writing $\beta$ in the form of a vector summation, let $\alpha = (kI + XX^T)^{-1} y$; then Equation (12) becomes Equation (13):

$\beta = X^T \alpha = \sum_{i=1}^{n} \alpha_i x_i$   (13)

We can see that $\beta$ is just a weighted sum of all the samples. Thus, the predicted value for a new sample is

$y^* = \beta^T x^* = \sum_{i=1}^{n} \alpha_i \langle x_i, x^* \rangle$   (14)

i.e., a weighted sum of the inner products between the new sample and all the old samples. After converting the original formula to this inner-product form, the dot product can be replaced by any of the kernel functions above to simplify the calculation.
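A hedged sketch of this dual-form computation, with the RBF kernel substituted for the inner product (a NumPy illustration, not the authors' implementation):

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Equation (8): K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def krr_fit(X: np.ndarray, y: np.ndarray, k: float = 1e-3, gamma: float = 1.0) -> np.ndarray:
    """Dual coefficients alpha = (k I + K)^{-1} y, the kernelized Equation (13)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(k * np.eye(len(X)) + K, y)

def krr_predict(X_train: np.ndarray, alpha: np.ndarray, X_new: np.ndarray,
                gamma: float = 1.0) -> np.ndarray:
    """Kernelized Equation (14): y* = sum_i alpha_i K(x_i, x*)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Illustrative fit to a smooth 1-D function.
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
alpha = krr_fit(X, y, k=1e-6, gamma=10.0)
y_hat = krr_predict(X, alpha, X, gamma=10.0)
```

Note that the dual system is $n \times n$ (one coefficient per training sample), which is why training time grows quickly with dataset size, as discussed later in this paper.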

K-Means Clustering
K-means [15] is an algorithm that implements cluster analysis based on the minimum-distance principle. The value $K$ must be given in advance; it represents the number of cluster centers. In each iteration, the mean of the sample points in each cluster is computed to update the cluster center. K-means clustering divides the $n$ samples into $K$ sets so that the within-cluster sum of squares (WCSS) is smallest, as shown in Equation (15); the update formula for the cluster centers is shown in Equation (16):

$WCSS = \sum_{i=1}^{K} \sum_{p \in C_i} dist(p, c_i)^2$   (15)

$c_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$   (16)

In Equations (15) and (16), $c_i$ is the $i$-th cluster center, $C_i$ is the set of samples assigned to it, $p$ is one sample point, and $dist(p, c_i)$ is the Euclidean distance from $p$ to the cluster center.
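The assign-then-update loop described by Equations (15) and (16) can be sketched as follows (a plain NumPy illustration, not the implementation used in this study):

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Lloyd's algorithm: alternate nearest-center assignment (the Eq. (15)
    objective) and cluster-mean center updates (Eq. (16))."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):          # keep empty clusters unchanged
                centers[i] = points[labels == i].mean(axis=0)
    return centers, labels

# Two well-separated groups should end up in different clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
centers, labels = kmeans(pts, k=2)
```

Each iteration can only decrease the WCSS, so the loop converges to a local minimum of Equation (15).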

WLP FEA Model Validation
It is assumed that the CTE difference between the substrate and the wafer is $\Delta\alpha$ and that the DNP of the solder ball is $L$. Once the material parameters are selected, $\Delta\alpha$ is fixed. Solder balls farther from the chip's center experience a greater deformation mismatch under thermal loading, so in WLP the thermal-loading failure usually occurs at the outermost diagonal solder ball of the package. This study uses five WLP test vehicles (TV: WLP1-5) [16-18] and one fan-out WLP (FO-WLP [19]) for FEA model validation. The structure component sizes, materials, and mean-time-to-failure (MTTF) reliability life are shown in Tables 1-3 [17-19]; this research uses these data to verify our simulation results.
In order to reduce the computational cost, this study adopts a simplified two-dimensional finite element model with the following basic assumptions: each structure consists of homogeneous and isotropic materials; the temperature of the structure is uniform; residual stress is not considered; and all contact surfaces between different materials are treated as perfectly bonded. Because the package body is a symmetric structure, semi-diagonal two-dimensional models were used in this study to simplify the modeling and finite element analysis processes; examples can be seen in Figures 1-4, and PLANE182 was selected as the element type in ANSYS (Figure 4). For the boundary conditions, all nodes in the center of the structure are fixed in the x-direction, and the node at the bottom of the center of the structure is fixed in both the x- and y-directions to avoid rigid body motion.
The FEA model for WLP 1-5 includes the following materials: silicon chip; low-k layer; stress buffer layer (SBL); redistribution layer (RDL); solder ball; under bump metallurgy (UBM); copper pad; printed circuit board (PCB); and solder mask. To further simplify the 2D model, the connection between the UBM and the solder ball is simplified. The element mesh size at the upper-right corner of the solder ball affects the final simulation result; based on our previous research experience, the mesh size at this key position, the upper-right corner of the outermost solder ball, is fixed. The controlled mesh size is 7.5 µm in height and 12.5 µm in width, as shown in Figure 4. The material properties are listed in Tables 3 and 4 [17,18]. The stress-strain curves for the solder balls at different temperatures in this study are shown in Figure 5 [20].
The solder balls are modeled with a nonlinear kinematic hardening rule; in its uniaxial form, the back-stress increment can be written as

$\Delta\alpha = C\,\Delta\varepsilon^{pl} - \gamma\,\alpha\,\Delta\varepsilon^{pl}$

where $\alpha$ is the back stress, $\sigma_0$ is the initial yield stress, $C$ is a constant proportional to the hardening modulus, $\gamma$ is the rate of decrease of the hardening modulus, and $\Delta\varepsilon^{pl}$ is the increment of plastic strain.
Figure 5. The stress-strain curve for SAC305 [19,21].
Finally, the thermal cycling load [22-26] was applied to our FEA model according to JEDEC JESD22-A104D Condition G, with a temperature range of −40 °C to 125 °C. The ramp rate is fixed at 16.5 °C/min, and the dwell time is 10 min, so a complete temperature cycle takes 40 min in total. The thermal cycling temperature profile is shown in Figure 6. After eight cycles, the incremental equivalent plastic strain stabilizes and can be input into the Coffin-Manson equation to calculate the predicted reliability life of the WLP. The reliability life from experiment and simulation is compared in Table 5; the difference between the experimental and simulated reliability life falls within an acceptable range for all five test vehicles, so the WLP simulation models can be trusted. This study uses this validated simulation procedure and controlled mesh size, together with an automatic model generation technique we developed, to create a large database of different design parameters for machine learning.

Supervised Learning-KRR
In this study, we select four design parameters that greatly influence the reliability of WLP to build the database: chip thickness (CT), SBL thickness (SBLT), upper pad diameter (UPD), and lower pad diameter (LPD). The other structure parameters are fixed as in WLP-2. A diagram of the structure design parameters is shown in Figure 7.
In Table 11, we randomly pick 100 data points as testing data. The five training datasets (Tables 6-10) are used to train the KRR model separately, and the testing dataset is used to test each model's generalization. This study applies grid search for parameter optimization, and the data preprocessing step is MinMaxScaler. The best model performance for each training set is shown in Table 12. We use four criteria to measure model quality: maximum training difference, average training difference, maximum testing difference, and average testing difference. The model's generalizability is the main concern of this research, and the testing performance of the model is shown in Figures 8 and 9.
From Figure 8, we can see that the maximum testing difference is very stable, which shows that the trained model does not overfit. From Figure 9, the average testing difference gradually decreases as the training data increase, meaning the accuracy of the model is improving; we can therefore continue to add training data to improve model performance. On the other hand, judging from the growth curve of the training CPU time, using a larger dataset causes the model to take a long time to train. In order to reduce the training time of the model, this study introduces the K-means algorithm.
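The grid-search-plus-MinMaxScaler workflow described above can be sketched with scikit-learn (the library choice and the synthetic data below are assumptions for illustration; they are not the study's dataset):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four design parameters (CT, SBLT, UPD, LPD)
# and a smooth stand-in response in place of reliability life.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = np.sin(X @ np.array([1.0, 2.0, -1.0, 0.5]))

pipe = Pipeline([
    ("scale", MinMaxScaler()),           # preprocessing used in the study
    ("krr", KernelRidge(kernel="rbf")),  # RBF-kernel ridge regression
])
param_grid = {
    "krr__alpha": [1e-3, 1e-2, 1e-1],    # ridge parameter k
    "krr__gamma": [0.1, 1.0, 10.0],      # RBF kernel width
}
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
best_model = search.best_estimator_
```

Wrapping the scaler and regressor in one pipeline ensures the MinMaxScaler is refit on each cross-validation fold, avoiding leakage from the held-out data into the grid search.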

Supervised/Unsupervised Machine Learning-KRR Mixed with K-Means
First, to demonstrate the effectiveness of K-means, a larger training dataset is generated. This research mixes the five training datasets (Tables 6-10) with the 1296 testing data (Table 11) and removes duplicate data; the final total is 9601 data points. We randomly pick 9000 as training data and 601 as testing data. When the pure KRR algorithm is used for training, the total time spent is 1340 s. The specific performance of the model is shown in Table 13, where we can see that the model's prediction accuracy is further improved: the average training and testing differences are both under five cycles. Our target now is to reduce the model training time. Using K-means as a preprocessing step for KRR, the training data is divided into K clusters, and each cluster corresponds to a KRR sub-model. When testing data is input, the K-means model determines which cluster each test point belongs to, and that point is evaluated on the corresponding sub-model. With K = 4, the performance of the K-K model is shown in Tables 14 and 15. In Table 14, "n" represents the number of training data in each cluster (9000 in total), and "m" represents the number of testing data in each cluster (601 in total). The final maximum difference is the maximum value among all sub-models, and the average training difference is the weighted average. The final training and testing results are shown in Table 15; the accuracy is similar to the pure KRR model. In this study, the focus of the K-K model is the training CPU time: in Table 16 and Figure 10, it can be seen that as the value of K increases, the training time decreases rapidly.
From Figure 10, we can see that the CPU time gradually decreases, with the trend slowing as K increases, while the average difference remains stable: both training and testing are under five cycles. In addition, a comparison of different machine learning algorithms using 9000 training data and 601 testing data is presented in Table 17. The performance of the KRR model and the SVR model is very close in terms of training time and training error. The K-K hybrid model achieves accuracy similar to the pure KRR, ANN, and SVR algorithms; however, in terms of training time, it reaches the level of the RF algorithm, which dramatically improves training efficiency.
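A minimal sketch of the K-K hybrid routing described above, again assuming scikit-learn (the class and parameter names here are ours, not the authors'):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_ridge import KernelRidge

class KKModel:
    """K-means partitions the training set; one RBF-KRR sub-model per cluster.
    Test points are routed to the sub-model of their nearest cluster center."""

    def __init__(self, k: int = 4, alpha: float = 1e-2, gamma: float = 1.0):
        self.km = KMeans(n_clusters=k, n_init=10, random_state=0)
        self.alpha, self.gamma = alpha, gamma
        self.models = {}

    def fit(self, X: np.ndarray, y: np.ndarray) -> "KKModel":
        labels = self.km.fit_predict(X)
        for c in np.unique(labels):
            sub = KernelRidge(kernel="rbf", alpha=self.alpha, gamma=self.gamma)
            sub.fit(X[labels == c], y[labels == c])
            self.models[c] = sub
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        labels = self.km.predict(X)
        out = np.empty(len(X))
        for c, sub in self.models.items():
            mask = labels == c
            if mask.any():
                out[mask] = sub.predict(X[mask])
        return out

# Illustrative use on synthetic data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 4))
y = np.sin(X @ np.array([1.0, 2.0, -1.0, 0.5]))
model = KKModel(k=4).fit(X, y)
y_hat = model.predict(X)
```

The speed-up follows from the cubic cost of the dual KRR solve: fitting K sub-models on n/K points each costs roughly K·(n/K)³ = n³/K², rather than n³ for one model.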


Conclusions
This study, using validated FEA, created five training datasets and one testing dataset for machine learning. In comparison with FEA, the results indicate that the KRR and K-K machine learning algorithms are fast and effective in predicting the reliability life of WLP. As the training data increases, the accuracy of the AI model gradually improves; however, as the dataset grows, training time increases dramatically, making it necessary to reduce the training time. A hybrid model combining K-means and KRR can significantly reduce training time while maintaining similar prediction accuracy. When K is 32, we obtain a prediction model with an average error of around four cycles and an overall CPU training time of 7 s, much less than the pure KRR model's 1340 s. Compared with ANN, RF, and SVR, the K-K model is both fast and accurate.

Conflicts of Interest: The authors declare no conflict of interest.