Development of ANN-Based Warpage Prediction Model for FCCSP via Subdomain Sampling and Taguchi Hyperparameter Optimization

This study aims to establish an accurate prediction model using artificial neural networks (ANNs) to effectively and efficiently predict the process-induced warpage of a flip-chip chip-scale package (FCCSP). To enhance model performance, a novel subdomain-based sampling strategy and Taguchi hyperparameter optimization are proposed in the ANN algorithm. To simulate the warpage behavior the FCCSP during fabrication, a process modeling approach is proposed, where the viscoelastic behavior of the epoxy molding compound is included, in which the viscoelastic properties are determined using dynamic mechanical measurement. In addition, the temperature-dependent thermal-mechanical properties of the materials in the FCCSP are assessed through thermal-mechanical analysis and dynamic mechanical analysis. The modeled warpage results are verified by the warpage measurement. Next, warpage parametric analysis is performed to identify the key factors most affecting warpage behavior for use in the construction of the warpage prediction model. Moreover, the advantages of the proposed sampling and hyperparameter tuning approaches are proved by comparing with other existing models, and the validity of the developed ANN-based deep learning warpage prediction model is demonstrated through a validation dataset.


Introduction
Nowadays, semiconductor manufacturing technologies have advanced rapidly, driven by the demand for smaller, faster, and more reliable electronic devices. However, this advancement has also brought up a significant technical challenge, in which the physical limit of transistor scaling creates enormous difficulties in continuing on the path of Moore's Law [1]. In the post-Moore era, the concept of "More than Moore" based on heterogeneous integration using new packaging technologies [2][3][4][5][6][7][8] is becoming more critical and demanding. Flip-chip chip-scale packaging (FCCSP) possesses the capacity of a high I/O count, miniaturization, and great electrical performance, and thus becomes one of the promising packaging solutions for realizing heterogeneous system integration.
Despite the fact that FCCSP has been one of the mainstream packaging technologies today (see, e.g., [6][7][8]), there are still several technical issues that need to be addressed, such as yield, reliability, thermal performance, and warpage. Among them, the warpage induced during the manufacturing process is particularly important as it can cause various process problems in subsequent process steps, such as handling, registration, and alignment, eventually resulting in yield and throughput losses [9,10]. It is, thus, essential to have a thorough understanding of its warpage behavior in the initial design stage. In the literature, few studies have been reported on the characterization and management of the warpage behavior of FCCSP during fabrication through theoretical analysis, such as finite element analysis (FEA), and experiments [8]. As compared to experimental approaches, theoretical analysis can be not only more efficient and cost-effective, but also capable of giving better insight into the physical mechanisms.
To solve the aforementioned challenges and even lessen the prediction uncertainty and modeling error made by less experienced engineers, researchers start seeking the integration of simulation and machine learning (see, e.g., [11][12][13][14][15][16]). To date, due to the rapid advance of computer technologies and machine learning algorithms, it has evolved into a critical tool for addressing a wide range of real-world issues, with applications covering medical diagnosis, transportation, space exploration, defense systems and various engineering fields. Deep learning is a branch of machine learning which incorporates artificial deep learning neural networks (NNs). These NNs are composed of a number of neurons, each of which performs a simple mathematical function of the inputs. By assembling these layers, deep learning models have the ability to learn complex information and features from raw data, such as images, audio, text, and sensor readings. This has allowed the development of a deep learning-based prediction model for timely and effective predictive analysis of very complex systems. The simulation-based deep learning prediction models have been extensively applied in advanced microelectronic packaging for a quick and accurate assessment of their thermal-mechanical performance, such as thermal performance [12,13] and reliability [14][15][16]. For example, Law et al. [12] developed a deep learning model for the prediction of the thermal performance of quad flat no-lead (QFN) packages using an ANN. Subbarayan et al. [14] applied an artificial NN (ANN) algorithm to build up a reliability prediction model for a ball grid array (BGA) package. Yuan et al. [15] applied an ANN-based simulation framework to investigate the solder joint reliability of a wafer-level chip-scale package, where the initial parameters of the ANN model, namely, the weights and bias, were obtained using a genetic algorithm (GA). Hsiao and Chiang [16] combined FEA together with random forest (RF) to explore the solder joint reliability of wafer-level packaging subjected to thermal cycling.
In addition to conventional gradient-based back-propagation (BP) approaches, evolutionary algorithms (EAs), such as GA, evolutionary strategy (ES), and particle swarm optimization (PSO), have been extensively applied to network topology design and connection weight adaption [17][18][19][20], mainly because of their advantages over the conventional approaches', such as conceptual simplicity and flexibility, capability to solve problems without any human expertise, and higher probability to reach a global optimum. White and Ligomenides [17] proposed a two-stage approach to explore the network topology and connection weights of an NN model by combining a GA and a BP approach. The underlying idea behind this approach is that if the GA was unable to obtain an appropriate network solution, the BP approach with an MP algorithm was further performed to locally explore the optimal weights using the calculated connection weights from the GA as initial values. A similar approach can also be found in Ding et al. [18]. Juang [19] introduced an evolutionary recurrent network for a temporal sequence production problem using an evolutionary learning algorithm based on a hybrid of GA and particle swarm optimization (PSO). Ahmadizar et al. [20] developed an ANN model using an evolutionary-based algorithm that integrates grammatical evolution (GE) for the network topology design and GA for better weight adaptation. The NN prediction model performance can be alternatively improved through hyperparameter tuning [21][22][23][24][25]. Hyperparameters are crucial for the performance of a machine learning model because they control the architecture of a neural network. Well-tuned hyperparameters can also prevent the model from overfitting or underfitting (see, e.g., [26]). In the literature, the hyperparameters were mostly tuned using trial-and-error parametric analysis (one factor at a time) [21], grid search [22], and random search [23]. The former two approaches are either unable to account for the interaction effect of hyperparameters, or computationally expensive, especially for models with a large number of hyperparameters and a huge search space. Random search could be a more efficient and cost-effective approach; however, theoretically, it is less probable to find the optimal hyperparameter setting. EAs, such as GAs [24,25], are a feasible alternative to determine the best set of hyperparameters. Even though Erpolat Taşabat and Aydin [25] found that for hyperparameter optimization, GAs can be more efficient in computation than grid search, the heuristic algorithms may fail to converge to an optimal or even good result due to their premature convergence in nature, and, in addition, they are computationally cost-ineffective due to their poor convergence. Consequently, a more effective and cost-effective hyperparameter optimization approach is preferred and needed.
According to the above literature survey, there are still very limited studies on the development of the warpage prediction model for electronic packaging, not to mention FCCSP. Thus, this work attempts to develop a prediction model using a proposed FEA-based ANN approach to facilitate an effective and quick estimate of the process-induced warpage behavior of FCCSP for use in subsequent fabrication process design. In order to upgrade model prediction accuracy and training performance, an ANN algorithm integrating a novel subdomain-based sampling strategy and Taguchi hyperparameter optimization is proposed for prediction model design and training. To simulate the fabrication process, an FEA-based process modeling approach is proposed, which takes into account the viscoelastic behavior of the epoxy molding compound (EMC) and the temperature-dependence of the thermal-mechanical properties of the materials in FCCSP. For the validation of the proposed process modeling approach, the warpage simulated results are compared against the warpage measurement data. Moreover, warpage parametric analysis is performed to characterize the crucial factors that mainly influence the warpage behavior. These characterized crucial factors are utilized for the ANN prediction model's construction. The benefits of the proposed sampling and hyperparameter tuning techniques are shown by comparison to other existing approaches. Furthermore, the feasibility of the developed warpage prediction model is evaluated using the validation dataset.

Structure and Fabrication Process of FCCSP
The FCCSP assembly under investigation is depicted in Figure 1, which mainly consists of a silicon die, an EMC, copper pillar bumps (CPBs), and a coreless substrate. The coreless substrate used in this study is a three-layer 168 µm thick embedded trace substrate (ETS), which comprises two solder mask (SM) protective layers, two prepreg (PP) dielectric layers, and three metal (Cu) layers. The schematic diagram of the cross section of the coreless substrate is shown in Figure 2. In this investigation, three FCCSP test vehicles (TV) with different geometric dimensions are discussed. Taking TV1 as an example, the die is 8.6 mm in length, 8.2 mm in width, and 200 µm thick. The die is connected on the coreless substrate using copper pillar bumps, and then the electronic assembly is fully covered by an EMC material with a size of 15 mm (length) 15 mm (width) 430 µm (thickness). Compared to TV1, the EMC thickness of TV2 and of TV3 is 450 µm. The die thicknesses for TV1, TV2 and TV3 are 200 µm, 200 µm and 175 µm, respectively. The detailed dimensions of TV1, TV2, and TV3 are shown in Table 1. Figure 3 illustrates the fabrication process of the FCCSPs, which includes two major process steps, namely die bonding process (steps 0-3) and mold cure process (steps 3-6). The die bonding process starts by mounting the silicon die on the coreless substrate using copper pillar bumps. The silicon die is aligned with the bond pads on the substrate and then heated to 260 • C to activate the solder bumps and form an electrical and mechanical connection. Once the die-bonding process is completed, a liquid-type EMC is used to encapsulate the silicon die through the mold cure process with a mold cure temperature of 175 • C. The mold cure process helps provide additional electrical insulation and environmental protection.

Theoretical Model of ANN
ANNs, data processing models, are designed to mimic the human nervous system [12]. The architecture and behavior of ANNs are inspired by the biological NNs in human brains, which process information in a parallel and distributed manner. A typical ANN model mainly comprises three layers, namely, the input layer, hidden layer, and output

Theoretical Model of ANN
ANNs, data processing models, are designed to mimic the human nervous system [12]. The architecture and behavior of ANNs are inspired by the biological NNs in human brains, which process information in a parallel and distributed manner. A typical ANN model mainly comprises three layers, namely, the input layer, hidden layer, and output

Theoretical Model of ANN
ANNs, data processing models, are designed to mimic the human nervous system [12]. The architecture and behavior of ANNs are inspired by the biological NNs in human brains, which process information in a parallel and distributed manner. A typical ANN model mainly comprises three layers, namely, the input layer, hidden layer, and output

Theoretical Model of ANN
ANNs, data processing models, are designed to mimic the human nervous system [12]. The architecture and behavior of ANNs are inspired by the biological NNs in human brains, which process information in a parallel and distributed manner. A typical ANN model mainly comprises three layers, namely, the input layer, hidden layer, and output layer, as shown in Figure 4. The input layer, i.e., the first layer of an ANN model, is primarily responsible for receiving the external inputs. The hidden layers, the intermediate layers or the neural layers between the input layer and the output layer, manage the ANN's data processing and computation. Increasing the hidden layers enhances the capability of mimicking a more complex and nonlinear features and behaviors, meanwhile raising the computational complexity and effort, and potentially causing overfitting and poor prediction performance. The output layer, the last layer of an ANN model, is in charge of providing predictions based on the computations performed in the hidden layers. The links connecting neurons in an ANN model are termed connection weights, which are to be solved through optimization. Figure 4 illustrates the process of passing information through an NN having two inputs (x 1 , x 2 ) and outputs (o 1 , o 2 ), one hidden layer with three neurons (z 1 , z 2 , z 3 ) inside, where w is the weight, b the bias of the layer, and σ the activation function of the layer. The goal of an ANN model is to modify the weights through optimization or learning process to minimize the discrepancy of the ANN outputs and the target data. Additionally, the setting of hyperparameters of an ANN model, including optimizer, number of hidden layers and neurons, activation function, learning rate, and batch size, is critical to the prediction model's performance. The most commonly used hyperparameter optimization methods include trial-and-error parametric analysis [21], grid search [22], random search [23] and EAs such as GA [24,25]. However, these methods hold various drawbacks (see Introduction). Consequently, a more effective and cost-effective approach using the Taguchi method is proposed to determine the optimal hyperparameter setting for constructing the best-fitted prediction model. layer, as shown in Figure 4. The input layer, i.e., the first layer of an ANN model, is primarily responsible for receiving the external inputs. The hidden layers, the intermediate layers or the neural layers between the input layer and the output layer, manage the ANN's data processing and computation. Increasing the hidden layers enhances the capability of mimicking a more complex and nonlinear features and behaviors, meanwhile raising the computational complexity and effort, and potentially causing overfitting and poor prediction performance. The output layer, the last layer of an ANN model, is in charge of providing predictions based on the computations performed in the hidden layers. The links connecting neurons in an ANN model are termed connection weights, which are to be solved through optimization. Figure 4 illustrates the process of passing information through an NN having two inputs (x1, x2) and outputs (o1, o2), one hidden layer with three neurons (z1, z2, z3) inside, where w is the weight, b the bias of the layer, and σ the activation function of the layer. The goal of an ANN model is to modify the weights through optimization or learning process to minimize the discrepancy of the ANN outputs and the target data. Additionally, the setting of hyperparameters of an ANN model, including optimizer, number of hidden layers and neurons, activation function, learning rate, and batch size, is critical to the prediction model's performance. The most commonly used hyperparameter optimization methods include trial-and-error parametric analysis [21], grid search [22], random search [23] and EAs such as GA [24,25]. However, these methods hold various drawbacks (see Introduction). Consequently, a more effective and cost-effective approach using the Taguchi method is proposed to determine the optimal hyperparameter setting for constructing the best-fitted prediction model.

Process Modeling
An FEA-based process modeling approach that integrates the ANSYS element death/birth technique and nonlinear FEA is introduced for effectively evaluating the warpage of the FCCSP during the fabrication process. Considering its symmetry, a quarter-symmetric FEA model of the FCCSP is adopted, where a symmetric boundary condition is imposed on these symmetric planes, i.e., the nodal displacements normal to the symmetric planes are zero. In addition, to avoid rigid body motion, the displacement of the bottom node on the intersecting line of these two symmetric planes is constrained in the z-direction. The FEA model of the FCCSP is primarily composed of a coreless substrate, an EMC, Cu pillar bumps, and a silicon die, as shown in Figure 5, together with the imposed boundary conditions. Hexahedral solid elements in ANSYS, i.e., solid 185, are adopted. Table 2 lists the number of nodes and solid elements of the FEA models associated with TV1, TV2, and TV3. The Young's modulus (E) and coefficient of thermal expansion (CTE) of the EMC, prepreg, Sn-Ag-Cu(SAC)305 solder, and solder mask are

Process Modeling
An FEA-based process modeling approach that integrates the ANSYS element death/birth technique and nonlinear FEA is introduced for effectively evaluating the warpage of the FCCSP during the fabrication process. Considering its symmetry, a quarter-symmetric FEA model of the FCCSP is adopted, where a symmetric boundary condition is imposed on these symmetric planes, i.e., the nodal displacements normal to the symmetric planes are zero. In addition, to avoid rigid body motion, the displacement of the bottom node on the intersecting line of these two symmetric planes is constrained in the z-direction. The FEA model of the FCCSP is primarily composed of a coreless substrate, an EMC, Cu pillar bumps, and a silicon die, as shown in Figure 5, together with the imposed boundary conditions. Hexahedral solid elements in ANSYS, i.e., solid 185, are adopted. Table 2 lists the number of nodes and solid elements of the FEA models associated with TV1, TV2, and TV3. The Young's modulus (E) and coefficient of thermal expansion (CTE) of the EMC, prepreg, Sn-Ag-Cu(SAC)305 solder, and solder mask are characterized using a thermal-mechanical analyzer (TMA) (TA Instruments, New Castle, DE, USA) and a dynamic mechanical analyzer (DMA) (TA Instruments, New Castle, DE, USA), and the results are displayed in Figure 6. Except for the EMC, which is assumed to be a linearly viscoelastic material, they are considered to be linearly elastic, isotropic, and temperature-dependent. In addition, the CTEs and Young's moduli of the silicon die and Cu are 2.8 ppm/ • C and 160 GPa, and 16.3 ppm/ • C and 121 GPa, respectively. According to the fabrication process displayed in Figure 3, the process modeling primarily involves the die bonding process (steps 0-3) and mold cure process (steps 3-6). At step 0, the silicon die, solder layer of the CPB, and EMC are deactivated. At step 1, i.e., heating to the die bonding temperature (260 • C), the solder layer and silicon die are activated to form a mechanical connection between the silicon die and the coreless substrate. At step 4, i.e., heating to the mold cure temperature (175 • C), the EMC is activated to simulate a fully cured EMC.
Micromachines 2023, 14, x FOR PEER REVIEW 6 of 18 characterized using a thermal-mechanical analyzer (TMA) (TA Instruments, New Castle, DE, USA) and a dynamic mechanical analyzer (DMA) (TA Instruments, New Castle, DE, USA), and the results are displayed in Figure 6. Except for the EMC, which is assumed to be a linearly viscoelastic material, they are considered to be linearly elastic, isotropic, and temperature-dependent. In addition, the CTEs and Young's moduli of the silicon die and Cu are 2.8 ppm/°C and 160 GPa, and 16.3 ppm/°C and 121 GPa, respectively. According to the fabrication process displayed in Figure 3, the process modeling primarily involves the die bonding process (steps 0-3) and mold cure process (steps 3-6). At step 0, the silicon die, solder layer of the CPB, and EMC are deactivated. At step 1, i.e., heating to the die bonding temperature (260 °C), the solder layer and silicon die are activated to form a mechanical connection between the silicon die and the coreless substrate. At step 4, i.e., heating to the mold cure temperature (175 °C), the EMC is activated to simulate a fully cured EMC.    characterized using a thermal-mechanical analyzer (TMA) (TA Instruments, New Castle, DE, USA) and a dynamic mechanical analyzer (DMA) (TA Instruments, New Castle, DE, USA), and the results are displayed in Figure 6. Except for the EMC, which is assumed to be a linearly viscoelastic material, they are considered to be linearly elastic, isotropic, and temperature-dependent. In addition, the CTEs and Young's moduli of the silicon die and Cu are 2.8 ppm/°C and 160 GPa, and 16.3 ppm/°C and 121 GPa, respectively. According to the fabrication process displayed in Figure 3, the process modeling primarily involves the die bonding process (steps 0-3) and mold cure process (steps 3-6). At step 0, the silicon die, solder layer of the CPB, and EMC are deactivated. At step 1, i.e., heating to the die bonding temperature (260 °C), the solder layer and silicon die are activated to form a mechanical connection between the silicon die and the coreless substrate. At step 4, i.e., heating to the mold cure temperature (175 °C), the EMC is activated to simulate a fully cured EMC.   EMC materials play a significant role in the thermal-mechanical behavior of electronic packaging [9]. Typically, EMC materials reveal temperature-, time-and strain-ratedependent viscoelastic behaviors (see, e.g., [10]), such as creep, stress relaxation, and even hysteresis behavior. The viscoelastic relaxation behavior is generally depicted by a generalized Maxwell model, comprising multiple Maxwell elements and an independent spring connected in parallel. This generalized Maxwell model is well approximated by a Prony series representation for fitting measured relaxation data, wherein E(t) denotes the relaxation modulus of the entire model, E i the relaxation modulus of the ith Maxwell element, E ∞ the long-term fully relaxed modulus, t the time, τ i the relaxation time, and m the total number of Maxwell elements. Based on the following relationship between the unrelaxed modulus E 0 and E ∞ , the Prony series representation of the generalized Maxwell model (Equation (1)) can be rewritten as where β i represents E i /E 0 . The time and temperature dependence of the mechanical properties of a viscoelastic material can be correlated using the time-temperature superposition principle (TTSP) [10]. More specifically, the TTSP suggests that a relaxation curve of a viscoelastic material at a specific temperature can be employed as a reference for further characterizing the relaxation curves at other temperatures by conducting a horizontal translation of the reference relaxation curve in the logarithmic time domain. The temperature translation factor λ T is normally approximated using an empirical relationship, the so-called Williams-Landel-Ferry (WLF) equation, In Equation (4), κ 1 and κ 2 are the curve fit coefficients, and T r the reference temperature. The master curve of the relaxation modulus at a reference temperature can be constructed by translating the measured frequency-dependent storage moduli at multiple temperatures along the time axis with temperature translation factors λ T . Based on the relaxation modulus at different isothermal temperatures under 1% applied strains [10], the constructed reference master curve at the glass transition temperature of the EMC is shown in Figure 7 and the fitted coefficients (β i ,τ i ) of the Prony series model with 21 terms are given in Table 3. Furthermore, the fitted coefficients κ 1 and κ 2 of the WLF model for the characterized translation factors as a function of temperature are 74.7 and 313.9, respectively.

Characterization of Process-Induced Warpage of FCCSP
The process-dependent warpage evolution of these three FCCSP test vehicles (TV1, TV2 and TV3) during the fabrication process is calculated, and displayed in Figure 8. The stress-free temperature of the substrate is set 145 • C to model the initial warpage of the substrate, i.e., about 67 µm ± 15 µm. The results show a significant rise in warpage after the die bonding process step. This dramatic increase in warpage suggests that the process temperature plays a significant role in the warpage. Furthermore, the process-induced warpage of the FCCSP is significantly reduced after the mold cure process, mainly owing to the EMC's ability to reduce the CTE mismatch between the substrate and die. This implies the capability of the EMC for suppressing the warpage. The FEA results are compared with the warpage measurement data obtained using Shadow Moiré, as shown in Table 4. Note that the warpage is measured on the bottom surface of the substrate at room temperature after the mold cure process, i.e., step 6, using a Shadow Moiré measurement technique (Akrometrix TherMoire AXP 2.0, Atlanta, GA, USA). In addition, the warpage is defined as the discrepancy between the maximum and the minimum of the z-direction deformation. The viscoelastic effect of the EMC on the process-induced warpage of the FCCSP is also examined. It is found that the warpage result after the mold cure process obtained from the process modeling approach considering the EMC viscoelastic effect shows a much more consistency with the measurement data than that without considering the effect, indicating that the EMC viscoelastic effect is essential for the prediction of the processinduced warpage. The modeled and measured warpage contour plots of the FCCSP after the mold cure process are presented in Figure 9, where the FCCSP would deform in a convex shape. Moreover, the minimal warpage takes place at the center of the FCCSP while the maximal warpage takes place at the four corners. Evidently, these two warpage contours also agree well with each other. The close agreement in warpage between the simulation (with the EMC viscoelastic effect) and the measurement clearly proves the effectiveness of the proposed process modeling approach in warpage prediction.

Identification of Key Factors Affecting Process-Induced Warpage
The influences of some geometric and material factors on the process-induced warpage of the FCCSP are investigated through parametric analysis using the validated FEA-based process modeling approach. The considered geometric and material factors are the side length and thickness of the die, side length of the package, thickness of the EMC, CTE of the EMC and substrate, and Young's modulus (E) of the substrate. It should be noted that the variation of the side length of the package would correspondingly change the side length of the substrate and EMC while keeping the dimension of the other components unchanged. These design factors are nominally varied by ±15% from their nominal values. Note that the width and length of this FCCSP package are identical, and those of the silicon die are very similar. Since there is a very comparable parametric analysis result with respect to the width and length of the package and silicon die, they are simply replaced by the "side length" of the package and silicon die for better clarity and conciseness of presentation. The parametric results of the effects of the side length of the silicon die and package, and the effect of the thickness of the die and EMC are presented in Figure 10a. The process-induced warpage is found to increase with an increasing die side length and a decreasing package side length. This is mainly because as the die's side length goes up, the mechanical stresses because of the CTE mismatch between the top layer (the composite layer of the EMC and die) and the bottom substrate become more pronounced, thereby resulting in an increased warpage. On the other hand, the decrease in package side length would also reduce the size of the substrate and EMC, which as the die size remains unchanged, would increase the proportion of the silicon die in the top layer, and resultingly enhance the CTE mismatch between the composite top layer and the substrate. Furthermore, an increased warpage can be also observed with an increasing die thickness and a decreasing EMC thickness, primarily due to the growth in the CTE mismatch between the composite top layer and the bottom substrate as a result of the increased proportion of the silicon die in the top layer. Figure 10b summarizes the parametric results of the influences of the Young's modulus and CTE of the substrate and the CTE of the EMC. It demonstrates that a decrease in the CTE and Young's modulus of the substrate would diminish the process-induced warpage, while an increase in the CTE of the EMC would lessen it. The explanations can also be found above. FCCSP while the maximal warpage takes place at the four corners. Evidently, these two warpage contours also agree well with each other. The close agreement in warpage between the simulation (with the EMC viscoelastic effect) and the measurement clearly proves the effectiveness of the proposed process modeling approach in warpage prediction.  (a) (b) Figure 9. The (a) simulated and (b) measured warpage contour plots.

Establishment of Training/Test and Validation Datasets
Based on the results of the parametric analysis, the degree of influence of these seven factors on the warpage behavior of the FCCSP after the fabrication process is ranked from the highest to the lowest as follows and as also listed in Table 5: EMC thickness, substrate CTE, EMC CTE, die side length, die thickness, substrate Young's modulus and package side length. Out of them, the top six highest-influence factors are chosen to establish the ANN prediction model for the process-induced warpage. They are considered as the input parameters in the ANN input layer, and are also used to establish the training/test dataset. A variation range of ±20% is considered for these input parameters. For the establishment of the training/testing dataset, several sampling strategies are available. Two of the most widely used ones are global structured (GS) sampling and global random (GR) sampling.
GS sampling is a well-established factorial design-based sampling strategy in which a full factorial design (FFD) of design of experiment (DOE) is utilized for the entire design domain (the design region of the factors) to create the sample data. This method exploits all the combinations of factors at all levels. It is known for its ability to achieve an even distribution of sample data across the entire design domain, which in turn could give a better assessment of the interactions among factors. Because the number of sample data increases with an increase in the number of factors, the number of sample data could be very large if the amount of input features and levels are excessive, probably leading to a high computational cost [21]. In addition, this strategy needs to reconstruct the sampling datasets using an FFD of DOE when more data are needed to obtain a more accurate prediction, thereby being less flexible in additional data generation. In contrast, the GR is a sampling strategy that randomly selects a subset of the population (sample data) from the entire population (the entire designed region of the factors). This is a very simple and straightforward method, as compared to the GS sampling strategy, since it has no need of prior knowledge about the sampling population. In addition, because of the use of randomization, this strategy would better avoid sampling and selection biases and enjoy high flexibility in generating additional data if needed. Nevertheless, this method may result in an uneven data distribution, which is not beneficial to a thorough and effective evaluation of the interactions among factors.
warpage contours also agree well with each other. The close agreement in warpage be-tween the simulation (with the EMC viscoelastic effect) and the measurement clearly proves the effectiveness of the proposed process modeling approach in warpage prediction.  (a) (b) Figure 9. The (a) simulated and (b) measured warpage contour plots.

Identification of Key Factors Affecting Process-Induced Warpage
The influences of some geometric and material factors on the process-induced warpage of the FCCSP are investigated through parametric analysis using the validated FEA-based process modeling approach. The considered geometric and material factors are the side length and thickness of the die, side length of the package, thickness of the EMC, CTE of the EMC and substrate, and Young's modulus (E) of the substrate. It should be noted that the variation of the side length of the package would correspondingly change the side length of the substrate and EMC while keeping the dimension of the other components unchanged. These design factors are nominally varied by ±15% from their nominal values. Note that the width and length of this FCCSP package are identical, and  Figure 10a. The process-induced warpage is found to increase with an increasing die side length and a decreasing package side length. This is mainly because as the die's side length goes up, the mechanical stresses because of the CTE mismatch between the top layer (the composite layer of the EMC and die) and the bottom substrate become more pronounced, thereby resulting in an increased warpage. On the other hand, the decrease in package side length would also reduce the size of the substrate and EMC, which as the die size remains unchanged, would increase the proportion of the silicon die in the top layer, and resultingly enhance the CTE mismatch between the composite top layer and the substrate. Furthermore, an increased warpage can be also observed with an increasing die thickness and a decreasing EMC thickness, primarily due to the growth in the CTE mismatch between the composite top layer and the bottom substrate as a result of the increased proportion of the silicon die in the top layer. Figure 10b summarizes the parametric results of the influences of the Young's modulus and CTE of the substrate and the CTE of the EMC. It demonstrates that a decrease in the CTE and Young's modulus of the substrate would diminish the process-induced warpage, while an increase in the CTE of the EMC would lessen it. The explanations can also be found above.

Establishment of Training/Test and Validation Datasets
Based on the results of the parametric analysis, the degree of influence of these seven factors on the warpage behavior of the FCCSP after the fabrication process is ranked from the highest to the lowest as follows and as also listed in Table 5: EMC thickness, substrate CTE, EMC CTE, die side length, die thickness, substrate Young's modulus and package side length. Out of them, the top six highest-influence factors are chosen to establish the ANN prediction model for the process-induced warpage. They are considered as the input To take into account the flexibility and feasibility of additional data generation without the need of re-establishing the sampling dataset, and achieve an even data distribution, this study proposes a subdomain random (SR) sampling strategy to construct the datasets by partitioning the whole design domain into multiple subdomains, in which a random generation of sample data is made. The schematic diagrams of the GS, GR, and SR sampling strategies are shown in Figure 11. In addition, two of their combinations (the so-called hybrid approaches), namely, GS combined with GR (hereafter termed the GSGR strategy), and GS combined with SR (hereafter termed the GSSR strategy), are also proposed. The training/testing dataset is constructed using these five sampling strategies and their results in terms of model performance are compared to each other. Four different sample data sets are considered for the GS, GR and SR strategies, i.e., 216, 324, 540, and 810. For the GSGR and GSSR hybrid sampling strategies, three additional different sample data sets, namely, those with 108, 324, and 594 samples, are generated by GR and SR, respectively, and they are further combined with the 216-sample data set generated by the GS to form three sample data sets, i.e., data sets of sizes 324, 540, and 810. For GS, the corresponding factors and levels together with the total number of sample data used in the training/test phase are presented in Table 6. The same total number of sample data are created using GR. For SR, an increase in the number of subdomains would enhance the sampling and modeling complexities. To reduce the number of subdomains, any two design factors are grouped together into one cluster, and each of them is further divided into two regions. For this six-factor design, a total of eight subdomains can be formed. Beside the training/test dataset, a validation dataset with sixty-four sample data is established using GS to verify the prediction accuracy of the trained ANN warpage prediction model. The factors and levels used in the validation phase for GS are listed in Table 7.

Hyperparameter Optimization Using Taguchi Method
The design of a machine learning prediction model consists in seeking hyperparameter optimization. In addition, the optimal values of the hyperparameters are highly problemdependent. In this investigation, the initial/untuned hyperparameter values, namely, optimizer (Adam) [27], activation function (ReLU) [15], the number of hidden layers (three) [28], and the estimate of the upper limit of the number of neurons in each layer [29], are specified based on the literature results. Note that the size of the training dataset is directly related to the number of neurons applied in each layer. The learning rate is set according to the default value of the Keras Adam optimizer, and the fold number (K) of the K-fold cross-validation method is set according to the size of the test dataset, which is 10~15% of the training data. The GS sampling strategy is used to form the training/test dataset. Moreover, the untuned hyperparameter values used for constructing the ANN models with four different training/test datasets (i.e., 216-, 324-, 540-and 810-sample data) are listed in Table 8. Min-max normalization is applied to scale these input features to a fixed range such that each feature has a comparable weight for the feature learner. The mean square errors (MSEs) of the predictions of the trained ANN models on the test data for these four training/test datasets are shown in Table 9. It is clear to see that these MSE values are considerable, indicating that there is a significant degree of discrepancy between the predictions and calculations, and, also, there is a great room to improve the ANN prediction models. To improve the model's performance, the Taguchi method is applied to determine the optimal hyperparameter values. For the six-factor, three-level experimental design problem, one two-level factor and seven three-level factors are employed in Taguchi's orthogonal array (OA), as shown in Table 10, to seek the optimal combinations of hyperparameters and levels. According to the previous literature reports, three hidden layers are found to effectively reduce the calculation time, and, meanwhile, provide good prediction accuracy [28]; the optimizer Adam could achieve good results with high efficiency on most neural network architectures [27]. Thus, they are implemented in the ANN model. The considered hyperparameters for optimization are activation function (A), batch size (B), learning rate (C), number of neurons in the first hidden layer (D), number of neurons in the second hidden layer (E), number of neurons in the third hidden layer (F), and two dummy factors. For the hyperparameter optimization, the training/test dataset with 324 sample data is used. The hyperparameters and their levels used in the Taguchi experimental design are presented in Table 11. The MSE of the predictions of the trained ANN model on the test dataset is considered as the objective of the Taguchi experimental design. The the-smallerthe-better criterion is used for the minimization of the MSE. The signal-to-noise (S/N) ratios of all the experimental runs in the OA are calculated. The mean S/N ratio for each level of control factors is summarized in the S/N response graph shown in Figure 12. The response graph is utilized to identify the most significant hyperparameters and their optimal level set for achieving an improved performance of the trained ANN warpage prediction model. From this response graph, it is found that the optimal level set of these hyperparameters is A1, B3, C3, D3, E3, and F3, i.e., exponential linear unit (ELU) activation function, a batch size of 30, a learning rate of 0.005, and 11 neurons for all three hidden layers.  1  1  1  1  1  1  1  1  1  2  1  2  2  2  2  2  1  2  3  1  3  3  3  3  3  1  3  4  2  1  1  2  2  3  1  3  5  2  2  2  3  3  1  1  1  6  2  3  3  1  1  2  1  2  7  3  1  2  1  3  2  1  3  8  3  2  3  2  1  3  1  1  9  3  3  1  3  2  1  1  2  10  1  1  3  3  2  2  2  1  11  1  2  1  1  3  3  2  2  12  1  3  2  2  1  1  2  3  13  2  1  2  3  1  3  2  2  14  2  2  3  1  2  1  2  3  15  2  3  1  2  3  2  2  1  16  3  1  3  2  3  1  2  2  17  3  2  1  3  1  2  2  3  18  3  3  2  1  2  3  2  1   Table 11. Hyperparameters and levels considered for Taguchi hyperparameter optimization.

Performance Characterization and Comparison of the Trained ANN Models on Validation Dataset
With the optimal set of hyperparameters, four ANN models are trained again using these four different training/test datasets (i.e., 216, 324, 540 and 810 sample data) generated with the GS sampling strategy. The MSEs and their standard deviations on the test data associated with the four training/test datasets are characterized and described in Table 12. For comparison, the corresponding results of the trained ANN models with the untuned hyperparameter setting are also presented in this table. Noteworthy is that for each training/test dataset, the same training/test data are utilized for both the untuned and tuned hyperparameter settings. Evidently, this demonstrates that both the prediction accuracy (MSEs) and precision (standard deviations) of the trained ANN models with the tuned hyperparameter setting are exceptionally improved for all these four training/test datasets, suggesting that the present hyperparameter optimization using the Taguchi method is an effective and feasible means of enhancing the performance of the ANN prediction model. With the same optimal hyperparameter setting, the ANN models are also trained on these four different training/test datasets generated by the other four sampling strategies (namely, GR, SR, GSGR, GSSR). In total, there are twenty trained ANN prediction models in accordance with the four training/test datasets and five sampling strategies.
orthogonal array (OA), as shown in Table 10, to seek the optimal combinations of hyperparameters and levels. According to the previous literature reports, three hidden layers are found to effectively reduce the calculation time, and, meanwhile, provide good prediction accuracy [28]; the optimizer Adam could achieve good results with high efficiency on most neural network architectures [27]. Thus, they are implemented in the ANN model. The considered hyperparameters for optimization are activation function (A), batch size (B), learning rate (C), number of neurons in the first hidden layer (D), number of neurons in the second hidden layer (E), number of neurons in the third hidden layer (F), and two dummy factors. For the hyperparameter optimization, the training/test dataset with 324 sample data is used. The hyperparameters and their levels used in the Taguchi experimental design are presented in Table 11. The MSE of the predictions of the trained ANN model on the test dataset is considered as the objective of the Taguchi experimental design. The the-smaller-the-better criterion is used for the minimization of the MSE. The signal-to-noise (S/N) ratios of all the experimental runs in the OA are calculated. The mean S/N ratio for each level of control factors is summarized in the S/N response graph shown in Figure 12. The response graph is utilized to identify the most significant hyperparameters and their optimal level set for achieving an improved performance of the trained ANN warpage prediction model. From this response graph, it is found that the optimal level set of these hyperparameters is A1, B3, C3, D3, E3, and F3, i.e., exponential linear unit (ELU) activation function, a batch size of 30, a learning rate of 0.005, and 11 neurons for all three hidden layers.   After the ANN models were suitably trained, the aforementioned validation dataset with sixty-four sample data generated with the GS sampling strategy was further used to assess and compare the performance of these twenty trained ANN warpage prediction models. The corresponding prediction performances of the ANN models with the tuned hyperparameter settings are summarized in Table 13 and Figure 13, in terms of the difference in the average warpage with standard deviation and the maximum warpage between the calculations and predictions. The following facts can be observed from this table. First of all, it is clear to see that a larger number of training/test data tends to yield a better prediction result in terms of both the average warpage and maximum warpage for all these five sampling strategies. This result is aligned with the literature findings, such as Panigrahy et al. [21]. Then, among these five sampling strategies, without a doubt, the GS strategy would have the best prediction performance, irrespective of the training/test datasets applied, due to its even data distribution, while the GR would obtain the worst prediction results due to an uneven data distribution. It is, however, pointed out that even though the GS can obtain the best prediction results, it needs to completely reconstruct the whole sampling dataset using an FFD of DOE when more sample data are in demand for better prediction accuracy, thus requiring a much higher computational and modeling effort in data generation. Next, it is interesting to see that the proposed SR sampling strategy tends to demonstrate a superior prediction capability than the GR, and even the GSGR, especially in the average warpage difference between the calculations and predictions, despite having a poorer performance than the GS. Apart from that, the proposed GSSR hybrid sampling strategy outperforms not only the GR and GSGR but also the proposed SR. Finally, the GS and the proposed GSSR provide a very comparable prediction performance, but the latter is comparatively much more flexible in producing more sample data and is also more computationally cost-effective.  Figure 13. Influences of sampling strategies on prediction performance of the ANN models.

Conclusions
This study successfully establishes an ANN-based deep learning warpage prediction model using a novel subdomain-based sampling technique and Taguchi hyperparameter optimization to facilitate the process-induced warpage prediction and design of the FCCSP in the initial design stage. To characterize the process-dependent warpage behavior of the FCCSP, an FEA-based process modeling approach that takes into account the viscoelastic behavior of the EMC material. The effectiveness of the proposed process modeling approach is extensively demonstrated by comparing the simulated results with the measured data. The validated process modeling approach is subsequently applied in both the parametric analysis for exploring the key factors most affecting the processinduced warpage behavior, and the construction of the warpage prediction model using the ANN. The superiority of the proposed sampling and hyperparameter tuning techniques is extensively justified by comparing with other existing models, and the applicability of the constructed warpage prediction model is well confirmed using the validation dataset. Several essential conclusions are deduced below.

Conclusions
This study successfully establishes an ANN-based deep learning warpage prediction model using a novel subdomain-based sampling technique and Taguchi hyperparameter optimization to facilitate the process-induced warpage prediction and design of the FCCSP in the initial design stage. To characterize the process-dependent warpage behavior of the FCCSP, an FEA-based process modeling approach that takes into account the viscoelastic behavior of the EMC material. The effectiveness of the proposed process modeling approach is extensively demonstrated by comparing the simulated results with the measured data. The validated process modeling approach is subsequently applied in both the parametric analysis for exploring the key factors most affecting the process-induced warpage behavior, and the construction of the warpage prediction model using the ANN. The superiority of