A High-Generalizability Machine Learning Framework for Analyzing the Homogenized Properties of Short Fiber-Reinforced Polymer Composites

This study aims to develop a high-generalizability machine learning framework for predicting the homogenized mechanical properties of short fiber-reinforced polymer composites. The ensemble machine learning model (EML) employs a stacking algorithm using three base models of Extra Trees (ET), eXtreme Gradient Boosting machine (XGBoost), and Light Gradient Boosting machine (LGBM). A micromechanical model of a two-step homogenization algorithm is adopted and verified as an effective approach to composite modeling with randomly distributed fibers, which is integrated with finite element simulations for providing a high-quality ground-truth dataset. The model performance is thoroughly assessed for its accuracy, efficiency, interpretability, and generalizability. The results suggest that: (1) the EML model outperforms the base members on prediction accuracy, achieving R2 values of 0.988 and 0.952 on the train and test datasets, respectively; (2) the SHapley Additive exPlanations (SHAP) analysis identifies the Young’s modulus of matrix, fiber, and fiber content as the top three factors influencing the homogenized properties, whereas the anisotropy is predominantly determined by the fiber orientations; (3) the EML model showcases good generalization capability on experimental data, and it has been shown to be more effective than high-fidelity computational models by significantly lowering computational costs while maintaining high accuracy.


Introduction
Applications of short fiber-reinforced polymer composites (SFRPCs) have been rapidly growing in fields such as bionic manufacturing, automotive, and aviation industries due to their lightweight, high strength, and ease of processing and manufacturing [1,2].Significant advantages of SFRPCs over continuous fiber composites are that they are easily produced materials, which are more suitable for rapid and high-volume manufacturing in complex geometries with lower expenses.
The mechanical performance of SFRPCs strongly depends on the microstructural parameters, such as the Young's modulus and Poisson's ratio of the fibers and matrix, fiber dimensions and volume content, and the alignment distribution of fibers [3].In particular, the orientation of fibers and the number of fibers aligned with the loading conditions can be customized by the injection molding process in manufacturing [1].Engineers and scientists can evaluate materials' functional characteristics before manufacturing, cut down on pointless experimental testing, streamline the entire design process, and lay a crucial foundation for expanding their applications and improving material performance by accurately predicting the mechanical performance of SFRPCs.
The orientation distribution of the fibers of SFRPCs can be experimentally examined using microscopic techniques like optical diffraction, electron microscopy, and X-ray radiography [4,5].Image-informed statistical models are accordingly proposed to address the fiber distributions such as the diffusion orientation distribution function [6].However, the complexity and diversity of fiber orientation states also present the primary challenge in aspects of constitutive relations [7], computational models [8], and numerical implementations [9].
Great efforts have been devoted to performing multiscale simulations utilizing Representative Volume Element (RVE) models to involve more accurate distributed fibers [10].RVEs often fail to effectively reflect the intricate spatial fiber orientation, and increasing algorithm difficulties occur with rising fiber volume content [11,12].The high computational time cost of Finite Element Analysis (FEA) presents another significant hurdle in simulations.The two-step homogenization method is proposed as an alternative effective approach [3,13].The hybrid framework employs an Orientation Averaging (OA) approach to derive the mechanical properties on the basis of analysis of RVEs involving unidirectional fibers.This method significantly reduces modeling complexity and offers broader applicability.Nevertheless, it still cannot avoid the time-consuming nature of finite element calculations [14].
In recent years, machine learning (ML) techniques have been widely adopted in relevant fields of biochemistry [15], material science [16,17], and mechanical performance analysis [18,19].with an end-to-end prediction paradigm, using a simulated dataset is commonly reported [19,20].As suggested in [21], by providing enormous amounts of relevant data, numerical modeling naturally complements the ML technique and aids in creating reliable data-driven models.Simulation data with composite RVEs have been widely used to develop ML surrogates for fast prediction of the homogenized properties of composites [22][23][24].
The model selection is the key to determining the model's performance, which varies depending on the complexity and heterogeneity of the microstructures.The simplest tree models and supported vector regressions (SVR) are efficient in dealing with homogenized elastic properties of 2D materials with idealized well-arranged microstructures [25].Meanwhile, deep neural networks (DNN) have demonstrated efficiency in discovering distinct structures and learning the nonlinear relationship between the feature vector of microstructures and the expected effective mechanical properties [14,26,27].Advanced image-based convolutional neural networks (CNN) and generative adversarial networks (GAN) display their strong capability in predicting the stress-strain curve of composites directly from RVEs that embed cracks and voids [28,29].
For composites with short fiber inclusions, CNN and DNN models are constantly used to investigate homogenized properties from the three-dimensional diverse spatial microstructures of fibers.Ref. [30] suggested a CNN model that takes the 2D RVE image as input with a different range of Young's modulus of carbon fibers and neat epoxy, and obtains output as a visualization of the stress components.Ref. [14] utilized a DNN model that incorporates a micromechanical model to investigate the homogenized elastic properties of short fiber-reinforced composite.On one hand, training neural networks, particularly the CNN algorithm, is very time-consuming.On the other hand, existing works pay more attention to finding super-machine-learning models with capabilities beyond human limits while ignoring the key interpretability [31,32], resulting in the unknown mechanism of how the features modulate the machine-learning models.Lack of interpretability makes the data-driven models untrustworthy and would be problematic in model generalization [33,34].
The selection of appropriate machine learning models and efficient explanation strategies are receiving more attention [35].Ensemble machine learning (EML) [36] is a general meta-approach machine learning that seeks better predictive performance by combining the predictions from multiple weak simple learners, such as tree models, SVR, and Gradient Boosting models.The EML model has demonstrated its application in accelerating composite material performance investigation [37,38] and optimization design [39,40].
Regarding the model interpretability methods, the SHapley Additive exPlanations (SHAP) method has gained increasing popularity, which explains how each feature influences the model and allows local and global analysis [41].SHAP analysis has offered valuable insights into the disciplinary fields of composite performance analysis [19,37,42,43] and sensor fault detection [44], as well as manufacturing processes [45].
In this study, we propose an interpretable and high-generalizability data-driven model for accurately predicting the homogenized mechanical properties of SFRPCs.We utilize a staking algorithm of three base learners of ET, XGBoost, and LGBM.The dataset is constructed by numerical simulations using composite RVEs that are integrated with the two-step homogenization method.We comprehensively assess the model's accuracy, efficiency, interpretability, and generalization capability.In particular, a SHAP analysis is performed to extensively evaluate the influence features and the underlying mechanisms on predictions.The model's capacity to generalize is thoroughly investigated by including experimental data that take into account the real fiber distribution of SFRPCs as reported in the literature [46,47].

Material and Method
2.1.Two-Step Homogenization Procedure 2.1.1.Orientation Tensor for Describing Fiber Distribution Figure 1 depicts the fiber orientation within a unit sphere Ω.Within a Cartesian coordinate system, the orientation vector p can be properly defined using two angular values [θ, α] [48]: Figure 1.The probability distribution of fibers in a unit sphere Ω the one fiber direction can be expressed using two angle values of α and θ.
The probability distribution function ψ( p) can be used to address the orientation of all fibers within the region [48]: such that the fiber distributions with a specific region ω can be statistically described via a second-order orientation a, which is derived from the probabilities: The tensor a possesses properties of symmetry with a ij = a ji , and it holds a trace equal to 1.And the orientation tensor can be further decomposed into a diagonal tensor Λ and a rotation one R: where , γ 2 and γ 3 are rotation angles around axis of e1, e2 and e3, respectively.

The Two-Step Homogenization Method
The process of the two-step homogenization method is illustrated in Figure 2. Composites having aligned short fibers with uniform length and mechanical properties are first considered to build the unidirectional RVEs (noted as UD RVEs).Accordingly, their homogenized properties can be obtained via EF simulations.Afterward, OA is applied to UD RVEs with orientation tensors to calculate the properties of composite RVEs with distributions of fiber orientation, without modeling of RVEs with realistically distributed fibers [3,9].Derivation of the stiffness tensor can be obtained as [13]: where C ijkl represents the stiffness matrix of the desired arbitrary orientation RVE, a ij and a ijkl denote its second-order and fourth-order orientation tensors respectively, and δ ij is the Kronecker delta.B i (i = 1 − 5) in Equation ( 6) are parameters that are derived from the stiffness matrix of UD RVE C UD ijkl .The 21 components of the stiffness matrix C can be acquired from Equations ( 5) and ( 6): The homogenized mechanical properties of SFRPCs are mainly determined by the properties of matrix and fiber, as well as the microstructure parameter of fibers [2].For this specific problem, we intend to make the prediction of the composite matrix using 12 features, with the selection based on the strategies discussed in [7,14].
The 12 input features include 7 properties [E m , ν m , E f , ν f , d, a, VF], and 5 individual components of the orientation tensor in Equation (3) for indicating the distribution of the short fibers.E and ν denote Young's modulus and Poisson's ratio, respectively, with the subscript m and f representing the matrix and fiber.d, a, and VF describe the fiber diameter, the ratio of fiber length to diameter, and the volumetric fraction of fibers.

Data Preparation
The dataset is constructed by numerical simulations using composite RVEs with the two-step homogenization method.
Firstly, 180 sets of UD RVEs are set up for parametric simulations in ABAQUS 6.13.The parameter space in UD RVE simulations is designed as in Figure 3.The parameter pairs for matrix Young's modulus and Poisson's ratio (Figure 3a), fiber Young's modulus and Poisson's ratio (Figure 3b), fiber diameter and fiber aspect ratio (Figure 3c), and fiber volume fraction and diameter (Figure 3d) are chosen [14].The fiber volume fraction VF follows a normal distribution and the rest of the parameter pairs obey uniform distributions.
Secondly, for each UD RVE, 60 different orientation tensors are applied using the OA method, yielding a total of 10,800 sets for calculation of the homogenized mechanical properties.The uniformity of orientation distributions is explored by creating diagonal tensors initially and then employing the rotation angles.
In particular, we choose ten diagonal tensors, three of which represent unidirectional dis- and three-dimensional isotropic distribution (a =   0.33 0.33 0.33

 
).The remaining seven tensors symbolize randomly distributed fibers in three-dimensional space.Thus, for each set of UD RVEs, 60 different orientation tensors can be obtained by randomly rotating each diagonal tensor five times.Additionally, to prevent excessive concentration of angles uniform distribution is applied to γ 1 and γ 3 in the interval (0, π), and the following formula is used to sample γ 2 : where t is selected randomly in interval (0, 1). Figure 4 displays the distribution of diagonal tensor and rotation angles, created according to the process explained above.In Figure 4a, greater scatterings of data points are seen at the triangular surface's vertices, particularly in the case of a planar or 3D-random surface.This is a result of the limitations of the parameter space and the random generation of fiber orientation sets as in Equation ( 8). Figure 4b shows the randomness of the first 2000 sets of angles.Three angles in each set are used to rotate the diagonal orientation tensors around three axes.It can be found that there are no significant gaps or clusters of data points in the distribution.

Data Analysis
The dataset with 10,800 samples is collected by the hybrid simulation framework using the two-step homogenization algorithm.Figures 3 and 4 visualize the distribution of the input features following the statements in [14].Figure A1 in Appendix A further describes the distribution of each parameter, and Figure A2 shows the correlation matrix that is obtained using a correlation study analysis.
The correlation analysis is presented via the criterion based on the Pearson Correlation Coefficient (PCC) [49], which is a measure of the linear dependence between two random variables.The PCC has a number between -1 and 1 that measures the strength and direction of the relationship between two variables.As noted in Figure A2, the input features are basically independent, while the tensor variables of a 11 and a 22 in Figure 4 implying negative but still weak correlations.This is aligned with the derivation of the diagonal tensor components in the two-step homogenization method, and the understanding of the mechanical definition in the knowledge about fiber-reinforced polymer composites [50].However, as stated in [51], PCC cannot reflect a certain correlation with each other.It is therefore necessary to apply the optimal ML model combined with the ML explainer based on SHAP to investigate the internal relations between model outputs and input features.
More knowledge and mathematical theories for PCC can be found in [36].

Ensemble Machine Learning
This study uses a stacking ensemble model [37,38], which involves fitting numerous model types to the same data and using a different model to learn how to combine the predictions most effectively.

Base Learners
Enlightened by reference works that are devoted to the homogenization problem within composites that are reinforced with different fiber types in [12,14,[22][23][24][25]28,37,52], as we discussed in the introduction section, we have experimented with a variety of base models based on the dataset, ranging from the basic SVR, RF, and DT models to more complicated ET, XGBoost, and LGBM models.
The preliminary comparison investigation also found that the basic yet interpretable model does not perform well on this challenge (R 2 values less than 0.85), given the distribution of short fibers introduces more features that influence the anisotropy of the composites.This is consistent with the findings of earlier studies that used simulated datasets to predict composite homogenization.We therefore subsequently decided on three options for designing the EML model: ET, XGBoost, and LGBM.
In particular, the ET algorithm is an enhancement of the Random Forest algorithm that employs a "Bagging" technique and consists of numerous decision tree learners [53,54].Both XGBoost and LGBM use a gradient-boosting framework.LGBM performs leaf-wise (vertical) growth as opposed to level-wise (horizontal) growth in XGBoost, which produces more loss reduction and, as a result, improved accuracy while being faster [55].As such, those models are trained on the unchanged train dataset, ensuring that they make different assumptions and, in turn, have less correlated prediction errors.
It should be noted that although the base learners are complicated models, they are far simpler than neural network-based models.Appendix B summarizes the preliminary knowledge for the base learners, such as backgrounds, objective functions, and loss functions.In terms of models such as SVR, RF, and DT models, which could have been learned in-depth in Reference [36], the basic introductory knowledge will not be addressed in this work.Furthermore, functional interfaces for multiple classification, regression, and clustering methods are available in the machine learning library Scikit-learn [56].The three base learners and models of SVR, RF, and DT are also included.Users can simply utilize the library to create models using the Python programming language after they have a basic understanding of the basic mathematical concepts.

Stacking Mode
As shown in Figure 5, the stacking model takes into account diverse base learners, trains them concurrently, and then combines them by training a meta-learner to produce a prediction based on the predictions of the various base learners.The base learners are trained using the original dataset, in which the 5-fold crossvalidation is employed to avoid overfitting.Each base learner has different hyperparameters that need to be optimized.An ensemble learning model inputs the predictions as the features and the target is the ground-truth values in the dataset, it attempts to learn how to best combine the input predictions to make a better output prediction [36].
This work trains the EML model on the constructed dataset to predict the 21 components of the stiffness matrix based on the 12 input features.

The SHAP-Based Interpretation Analysis
Low interpretability is often caused by machine learning models' complexity and highly non-linear architecture.Compared to conventional approaches that just specify what feature is crucial but fail to explain how that feature influences the predictions [57,58], the SHAP method offers a more effective means for evaluating the "black box" of machine learning models [41].Based on cooperative game theory, the method uses SHAP values to quantify the impact of individual features by calculating the contribution among various permutations of feature combinations.
The explanatory model f (x) in response to a input feature x can be expressed as [41]: where ϕ 0 is a constant of " base value" depicting the average of all predicted values, and N is the total number of input features, and x i is a binary feature value where x i = 1 represents a present feature and x i = 0 represents absent features.ϕ i is the SHAP value of the i-th feature that measures the contribution of the i-th feature from a composition of all features.
The SHAP analysis is compatible with various model types.In SHAP, each feature is assigned an importance value, and the influence of all the features on the predicted responses can be assessed via global and local interpretations.According to Equation ( 9), the global interpretation uses SHAP values to depict feature relevance with a positive or negative impact on predictions, whereas the local interpretation generates SHAP values and illustrates how the fed-in features contribute to that specific prediction [32,44].

Hyperparameter Optimization and Model Training
We build, train, and optimize the EML model with the PyTorch package in Python [59].The model hyperparameter adjusting performs a crucial factor in the prediction performance of the ML model.Selecting the optimal values for the hyperparameters of a machine learning algorithm is a critical step for developing an accurate and trustworthy model [60].This work implements hyperparameters optimization, using the grid search method [61,62], to detect all possible values in the search space in order to find the optimal structure for the three base learners.Additionally, 5-fold cross-validation is used to overcome the issue of overfitting and is combined with a hyperparameter tuning technique based on grid search to identify the optimal values.
Basic knowledge of ML techniques of grid search and K-fold cross-validation are provided in Appendix C.1 and C.2, respectively.For the three base learners, each algorithm had its unique hyperparameter to be optimized, with a detailed description provided in Appendix C. 3.
The optimal values of hyperparameters of the three base learners are listed in Table 1.

Validation of Dataset Based on the Two-Step Homogenization Method
The method utilizes a statistical tensor to effectively involve the fiber orientations in homogenization calculations [2,14,63].A brief validation analysis is first conducted to ensure the accuracy of the two-step homogenization method.
Figure 6 demonstrates one UD RVE and four RVEs with different fiber rotation angles.The microstructural parameters for UD RVE (specified as α = 0, β = 0) in Figure 6a and the simulated homogenization mechanical properties are given Tables 2 and 3, respectively.Table 2.The microstructural parameters for the UD RVE in Figure 6a.The homogenized mechanical properties for RVEs in Figure 6b-e are first calculated using the two-step homogenization method.Meanwhile, full FEA simulation results using the RVEs are conducted in ABAQUS to verify the above hybrid approach, with the comparison shown in Table 4.
It can be seen that, compared to the FEA-determined results, the average error of the calculated results using the two-step homogenization method is 0.82%.Hence, the homogenized mechanical properties of the SFRPC RVEs with the considered fiber orientations derived by the two-step homogenization framework are validated as reliable and effective.

Evaluation Metrics
The coefficient of determination R 2 , mean squared error (MSE), and mean absolute percentage error (MAPE) are adopted as evaluation metrics, with the following formulas [64]: where y i is the ground-truth of the i-th sample point in the database, ŷi is the predicted values by the EML model, ȳ is the average value of all sample, and n is the number of sampels.

Accuracy Comparison between the EML and Base Learners
The EML model combines three basic learners to make predictions of the matrix with components from 12 input features.Table 5 summarizes the comparison of the prediction accuracy of the EML model with three base learners on both the training and testing set.On an overall basis, the model's accuracy on the training dataset is higher.Among the base learners, the LGBM model achieves the highest R 2 values of 0.984 and 0.946 on the train and test datasets, respectively.It also performs the best in the two other evaluation metrics of MSE and MAPE.The three base learners are outperformed by the EML model in each evaluation metric (R 2 = 0.988, MSE = 2.545 × 10 6 , MAPE = 0.567% on train, and R 2 = 0.952, MSE = 1.260 × 10 16 , MAPE = 0.906% on test), indicating an effective enhancement of the prediction accuracy.Figure 7 further visualizes the computed R 2 with the scatter plots of predicted vs. ground truth values of matrix component C 15 using different machine learning models.It can be seen that the ET model is the least accurate predictor of the three base learners, with R 2 values on the train and test sets of 0.951 and 0.894, respectively.In contrast, the R 2 value for the EML model on the train and test sets, respectively, is promoted to 0.969 and 0.908, making it the best fitter.

Model Performance of the EML Model on a Testing Sample
To fully evaluate the model's efficacy in making predictions on the 21 components of matrix C, Table 6 summarizes the evaluation metrics of R 2 , MSE, and MAPE of the EML model's outputs on a testing sample.It can be observed that the EML model predicts 9 features with high precision, such as C 11 , with a R 2 value greater than 0.999.The prediction accuracy is slightly reduced but still reaches a high level of R 2 > 0.90 for the remaining 12 outputs, such as C 14 , maintaining the appearance of a well-fitted model.The variation in prediction accuracy is commonly seen in developing data-driven models, which is attributed to the dataset construction, and the selection of input features [37].Overall, the EML model showcases superior accuracy in terms of using 12 fed-in features and predicting the matrix of the polymer composites.

Model Interpretation via SHAP Analysis
In order to provide a comprehensive overview of how anisotropic mechanical properties are influenced by microstructural variables, the original output stiffness matrix C in Equation ( 7) has been translated into homogenized Young's modulus E, Poisson's ratio ν, and Shear modulus G.Most concerned Young's modulus E 11 , E 22 , and E 33 , which characterize the anisotropy of SFRPCs, are specifically subjected to global and local SHAP analysis [11].

Global Interpretation
Figure 8 presents a global interpretation of feature importance analysis for E 11 , E 22 , and E 33 .Figure 8a-c are bar plots of SHAP values, which illustrate the importance rank, and direction of each input feature.As shown, the homogenized Young's modulus is primarily influenced by three factors: the Young's modulus of E m , E f , and the fiber content VF.Notably, all of these factors exert a positive influence on the homogenized Young's modulus.This phenomenon arises from the fact that E m and E f represent vital mechanical properties of composite material constituents, while VF governs the proportion of strengthening components.In general, enhancing both the mechanical properties of constituents and increasing the content of reinforcement components contributes to the improved homogenization performance of the composite material.The orientation tensor components a 11 , a 22 , and a 33 play a pivotal role in determining the anisotropy of composites.It can be observed that Young's modulus E 11 in Figure 8 is markedly and positively impacted by the feature a 11 .This occurs because a 11 characterizes the distribution of fiber orientation in the 1-direction, with larger values of a 11 indicating a greater presence of fibers aligned in the 1-direction.Similar findings show that the features defining fiber orientation a 22 and a 33 appropriately determine anisotropy for E 22 in Figure 8b and E 33 in Figure 8c.It is worth noting that, in Figure 8c, a 11 and a 22 display negative influences on E 33 , which attributes to the relation between the diagonal tensor of a 33 = 1 − a 11 − a 22 , as described in Figure 4.
Figure 8d-f are represented as beeswarm plots that highlight the significance of the feature in relation to the actual relationships with the predicted outcomes.The beeswarm plots reveal the same feature information as the bar plots.The horizontal position of the dot is determined by the SHAP value of that feature, and the dots "pile up" along each feature row to rank the feature's importance.The color code represents the original numerical value of a feature.For example, the high value of E m and E f will increase the value of Young's modulus in each direction.
Both the bar plots and beeswarm plots indicate that the other features, including Poisson's ratio, fiber length-diameter ratio, and the orientation tensor's deviatoric components, have been identified to support the predictions while having fewer influencing effects than the other aspects.
The SHAP-based explanation is consistent with theories and knowledge from mechanism analysis in simulations [11] and experimental findings [46].

Local Interpretation
The goal of SHAP local interpretability is to describe the contributions made by each input feature in terms of how individual predictions are made.
We select two testing samples, and Figure 9 visualizes the positive and negative contributors in predicting E 11 , E 22 , and E 33 .The red arrows indicate a positive impact of the corresponding input feature on the output, while the blue arrows signify a negative impact.According to Equation ( 9), the length of the arrows reflects how much the input feature has influenced the model.For Sample (a), the small values of E m = 5.590 GPa, E f = 18.630GPa, VF = 0.110 show negative impact and results in comparable small magnitudes of E 11 , E 22 , and E 33 .In comparison, the diagonal tensor component of a 11 , a 22 , and a 33 accordingly promotes Young's modulus which signifies a positive impact.
Sample (b) contains a scenario that is the opposite of Sample (a).For Sample (b), the local analysis quantifies the large inputs of E m = 11.970GPa, E f = 94.960GPa, VF = 0.275, leading to rather small magnitudes of E 11 , E 22 , and E 33 .By contrast, features that denote the fiber orientation (a 11 , a 22 , and a 22 ) show negative impacts.
It can be found that the local analysis' findings are consistent with the global interpretations, while it details the modulating mechanism of the influencing features in each individual case.Understanding the feature contributions and their underlying modulating mechanism can also provide an optimized strategy to customize the composite properties by following the local analysis [65].

Model Generalization on Experimental Data
On the prepared simulation dataset, the trained EML model exhibits excellent accuracy and efficiency with the predictions being interpretable.To determine the model's generalizability, we further apply the EML model to the unknown dataset obtained from experiments.
The polymer composite, known as PA15 due to 15% fiber weight fractions orginiates from [46], is a composite made of glass fibers and a polymer matrix.Another set of experimental data is derived from [47], and the material is referred to as PA6GF35, a polyamide-6 matrix with 35% (by weight) glass-fiber-reinforcement. Table 7 lists the microstructural parameters of the two sets of polymer composites, including the elastic properties of matrix and fiber, as well as the measured fiber orientation tensor.Using the given parameters, we employ the EML model to predict the macroscopic mechanical properties of those short fiber-reinforced composites.For PA6GF35, Figure 11 illustrates a comparison between the experimental results of Young's modulus E 11 and E 22 and the EML-predicted results.Meanwhile, two high-fidelity computational models reported from [47] are included for prediction comparison, namely the "Mean data" and "Fixed length".It can be noted that the EML model agrees well with the experimental results with a maximum error of less than 3%, with the predicted E 11 and E 22 closely aligned with the high-fidelity models.
In terms of prediction efficiency, as shown in Table 8, using a mesh of 19,937 elements, the Digital-FE requires 2911 s to complete relevant calculations.In contrast, the EML model significantly lowers the cost of computing and produces predictions in less than one second.
The EML technique displays a highly generalized machine learning model inside the two experimental scenarios with polymer composite reinforced with various fillings of glass fibers.Furthermore, it has been demonstrated to be significantly more cost-effective than high-fidelity computational models while maintaining excellent accuracy.

Conclusions and Outlook
This study proposes a novel interpretable and high-generalizability ensemble learning model to efficiently predict the homogenized properties of SFRPCs.A micromechanical model of a two-step homogenization algorithm is integrated to construct a diverse and representative dataset, in which the random-distributed fibers are efficiently implemented.We perform hyperparameter optimization for each base learner of ET, XGBboost, and LGBM.The EML model is comprehensively assessed using metrics of accuracy, efficiency, interpretability, and generalization capabilities.The following conclusions can be drawn: (1) The two-step homogenization algorithm is validated as an effective approach with which to consider different fiber orientations, which favors the creation of a trustworthy dataset with a variety of the chosen features.
(2) The EML model outperforms the base members of ET, XGBoost, and LBGM on prediction accuracy, achieving R 2 values of 0.988 and 0.952 on the train and test datasets, respectively.(3) According to the SHAP global analysis, the homogenized elastic properties are significantly influenced by E m , E f , and VF, whereas the anisotropy is predominantly determined by a 11 , a 22 , and a 33 .The SHAP local interpretation distinguishes the key modulating mechanism between the key features for individual predictions.(4) The EML algorithm showcases a highly generalized machine learning model on experimental data, and it is more efficient than high-fidelity computational models by drastically reducing computational expenses and preserving high accuracy.
The paradigm proposed in this study holds potential applications to customize the fiber arrangements with a different filling of fibers in polymer structure and, as a result, offers manufacturing process optimization techniques.The model is first evaluated for generalizability on the basis of two experimental cases.It can be further packaged as a Graphical User Interface (GUI) in order to improve its practical utility [66,67] on the topic of analyzing homogenized polymer composites with short fibers.
The limitations are also acknowledged.Current work focuses on analyzing the elastic properties of the SFRPCs by considering one type of reinforcing fiber in the polymer structure.The two-step homogenization method utilizes a statistical tensor to effectively involve the fiber orientations in homogenization calculations, while the cross-linking of fibers, the mechanical interactions, and the morphology of fiber curvatures are ignored [68,69].Hybrid composites contain more than two types of fiber fillings in polymer composites [70] , mechanical interactions between fibers and polymer matrix, matrix defects, voids, and failure should be considered in future investigations [71].The PCC has a number between -1 and 1 that measures the strength and direction of the relationship between two variables [36].

Appendix A. Input Feature Distribution and Correlation Analysis
in [77]).Given that the parameter space for some parameters in the machine-learning method may include spaces with real or unlimited values, it is exceedingly likely that users will need to set a boundary in order to use a grid search.Grid search suffers from high dimensional spaces, but often can easily be parallelized since the hyperparameter values that the algorithm works with are usually independent of each other.
Please see [36,77] for more information on the grid search method for machine learning models.
• n_estimators controls the number of trees in the model.Increasing this value generally improves model performance, but can also lead to overfitting.• max_depth is used to limit the tree depth explicitly.• learning_rate determines the step size at each iteration while moving toward a minimum of a loss function.• gamma is the minimum loss reduction to create a new tree-split.• min_child_weight is the minimum sum of instance weight needed in a child node.• subsample denotes the proportion of random sampling for each tree.• colsample_bytree is the subsample ratio of columns when constructing each tree.
The three main hyperparameters in XGBoost are the learning rate, the number of trees, and the maximum depth of each tree.These hyperparameters control the overall behavior and performance of the model.
• num_leaves is the maximum number of leaves, the main parameter to control the complexity of the tree model.• min_data_in_leaf very important parameter to prevent over-fitting in a leaf-wise tree.• max_depth is used to limit the tree depth explicitly.• n_estimators is the number of boosted trees to fit.• learning_rate controls the step size at which the algorithm makes updates to the model weights.• colsample_bytree is the subsample ratio of columns when constructing each tree.• subsample dictates the proportion of random sampling for each tree.

Figure 2 .
Figure 2. Illustration of the two-step homogenization method: calculation of homogenized mechanical properties of RVEs considering.(a) unidirectional fiber distribution in two dimensions; (b) unidirectional fiber distribution in three dimensions; (c) random fiber distributions in two dimensions; (d) random fiber distributions in three dimensions.

Figure 3 .
Figure 3. Parameter pairs for UD RVE simulations: (a) elastic properties of E m and ν m ; (b) elastic properties of E f and ν f ; (c) fiber diameter d and aspect ratio a; (d) fiber volume fraction VF.

Figure 5 .
Figure 5. Illustration of the stacking algorithm.

Figure 7 .
Figure 7. Scatter plots of R 2 of C 15 between the FEA-determined ground truth values and the predictions from models of ET, XGBoost, LGBM, and the EML.

Figure 8 .
Figure 8.The global interpretation of feature importance analysis for Young's modulus of E 11 in (a,d), E 22 in (b,e), and E 33 in (c,f).Bar plots in (a-c) rank the order of feature importance with the color code of red indicating positive impact and blue implying negative impact.Beeswarm plots in (d-f) highlight the significance of the feature in relation to predicted outcomes.

Figure 9 .
Figure 9. Local SHAP analysis for the E 11 , E 22 , and E 33 predictions of two testing samples: (a) Sample a and (b) Sample b.

Figure 10 .
Figure 10.Model generalization on PA15: comparison between EML predictions with experimental results and Digimat-FE calculations.

Figure 11 .
Figure 11.Model generalization on PA6GF35: comparison between EML predictions with experimental results and two high-fidelity computational models.

Figure A1 .
Figure A1.Distribution of input features for 10,800 samples in the dataset.

Figure A2 .
Figure A2.Correlation matrix analysis of the input features in the dataset via PCC measurement.The PCC has a number between -1 and 1 that measures the strength and direction of the relationship between two variables[36].

Table 1 .
The optimal hyperparameters for the base learner models using grid search.

Table 3 .
The simulated homogenized mechanical properties for the UD RVE in Figure6a.

Table 4 .
Validation of simulations obtained via the two-step homogenization method and full FEA simulations on composite RVEs with different fiber orientations.

Table 6 .
Accuracy evaluation of predictions by the EML model on a testing sample.

Table 7 .
The microstructural parameters of SFRPCs of PA15 and PA6GF35.Figure10compares the experimental measurements of E 11 , E 22 , E 33 and G 23 , with the predictions from the EML model, and FEA results using Digimat software.The proposed EML model performs well in predicting elastic properties, with a maximum relative error of less than 10%.

Table 8 .
Model efficiency comparison between the EML model and Digimat-FE calculations.