1. Introduction
Seismic design is one of the most important aspects of the design of high-rise buildings. Various theories and technologies have been put forward to improve the seismic performance of high-rise buildings, e.g., conceptual design and performance-based design methods [1], the application of energy dissipation devices [2], steel–concrete composite structures [3], etc. At the same time, the assessment of the seismic performance of the design outcome is of equal importance, as it serves as a criterion to judge whether the design target can be achieved under specific seismic scenarios.
The next-generation seismic performance assessment method was published by the US Federal Emergency Management Agency (FEMA) as FEMA P-58 [4] in 2012. This methodology incorporates a probabilistic approach to characterize performance at the component level, considering both structural and non-structural components, and the seismic performance of the whole structure is quantified with economic and social losses such as repair cost, repair time, injuries and casualties through the loss functions of component damage. Following the FEMA P-58 methodology, several assessment standards were developed, including the REDi Rating System by ARUP [5], the USRC Rating System by the US Resiliency Council [6], and the Chinese Standard for seismic resilience assessment of buildings (GB/T 38591-2020) [7]. These standards enable detailed analyses of the consequences of earthquakes, and provide useful tools for risk management in designing new buildings or renovating existing ones [8].
It is noted that in the current seismic resilience assessment systems, the determination of damage states of structural components and the corresponding losses primarily relies on their deformation under seismic actions. Consequently, establishing the deformation limits and developing the fragility curves for components under different damage states are essential for this procedure [9]. In FEMA P-58 and GB/T 38591-2020, the values for reinforced concrete components and steel components have been provided, based on a collection of experimental results. A methodology for developing such parameters using a test database is presented in FEMA P-58.
Methods for predicting the deformation capacity of reinforced concrete (RC) members date back to the plastic hinge analogy of the 1950s [10], which is still used in the New Zealand Engineering Assessment Guidelines [11]. Subsequently, regression-based empirical formulas were proposed by Pujol et al. [12], Haselton [13], and others. These mechanics-based or empirical formulas generally exhibit significant scatter when predicting the drift ratio [14]. Moreover, for steel–concrete composite columns (with a steel section encased in concrete) widely used in high-rise structures, such data are still lacking.
Earlier research has employed a parametric regression approach to study this problem, based on either large-scale parametric finite element analysis or a collection of test databases. For example, Hu [15] established a database of 6379 concrete-filled steel plate composite shear walls by finite element analysis and derived simplified formulas for calculating the ultimate curvature with data regression; Cui [16] established a database of 103 RC beams, 469 RC columns and 236 RC shear walls, and developed failure mode classification and deformation capacity regression formulas based on data fitting. Fu [17] investigated the deformation capacity of SRC members by establishing a database including 246 SRC columns. However, due to the complex and high-dimensional coupling effects of the structural parameters, these semi-empirical models often lead to a compromise between physical simplicity and predictive accuracy.
Machine learning (ML) algorithms excel at multi-dimensional nonlinear classification and fitting problems, providing a powerful tool for the analysis of the highly complex and stochastic experimental results in civil engineering. In recent years, ML methods have been extensively investigated for multiple purposes, including failure mode classification [18,19,20,21], strength prediction [22,23,24], deformation capacity prediction [25,26], backbone curve prediction [27,28], analysis of the full loading process [29,30,31], etc. It has been proven that ML models not only offer higher predictive accuracy, but also provide deep insights into the underlying failure mechanisms, effectively bridging the gap between data-driven black-box models and structural mechanics.
To facilitate the seismic resilience assessment of steel–concrete composite structures, this paper investigates the failure mode classification and lateral deformation capacity prediction of the most widely used type of composite component, i.e., steel-reinforced concrete columns. In this study, a comprehensive experimental database of SRC columns is established, covering a wide range of design parameters such as axial load ratio, encased steel ratio, and confinement configurations. Two ML frameworks and four mainstream ML models are developed to handle the complex nonlinear mappings between structural features and deformation at critical characteristic points. Statistical analysis, feature selection and hyperparameter optimization are conducted to ensure the predictive accuracy and robustness of models. Furthermore, the SHapley Additive exPlanations (SHAP) and feature importance analysis are employed to investigate the governing factors across different loading stages. This research offers a rich dataset of experimental results concerning the deformation capacity of SRC columns and a robust and reliable tool for the seismic resilience evaluation of high-rise composite structures.
3. Preliminary Analysis of Inter-Variable Correlation
3.1. Pearson’s Correlation Analysis
To preliminarily assess the interrelationships among the parameters in the database and the influence of the input parameters on the output parameters, the Pearson correlation coefficients between all parameter pairs were calculated with Equation (1):

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y} \tag{1}$$

where X and Y are the two parameters under consideration, cov(X, Y) denotes their covariance, σX and σY are the standard deviations of X and Y, respectively, μX and μY are the means of X and Y, and E represents the expectation operator. The Pearson coefficients are presented in the heatmap in Figure 4 to highlight significant parameter interactions (only variable pairs with absolute Pearson coefficients greater than 0.1 are displayed). A larger absolute value indicates a stronger linear correlation between two parameters, where positive and negative values represent positive and negative linear relationships, respectively.
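A heatmap of this kind can be reproduced with standard Python tooling. The following is a minimal sketch, assuming the database has been loaded into a pandas DataFrame of numerical parameters (the file name and variable names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("src_column_database.csv")  # hypothetical file name

corr = df.corr(method="pearson")  # Pearson coefficients, Equation (1)

# Hide weak interactions, mirroring Figure 4: only |r| > 0.1 is displayed.
mask = corr.abs() <= 0.1
sns.heatmap(corr, mask=mask, cmap="coolwarm", vmin=-1, vmax=1,
            annot=True, fmt=".2f", square=True)
plt.title("Pearson correlation heatmap (|r| > 0.1)")
plt.tight_layout()
plt.show()
```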
From the Pearson correlation coefficients, the deformation limits at both the yielding and ultimate points (θy and θu) increase with the strength of the encased steel and longitudinal reinforcement (fak and fyl), while decreasing with higher concrete strength and axial compression ratio (fck and nt); moreover, enlarging the cross-sectional dimensions (b and h) and increasing the stirrup confinement (λv) are beneficial for improving the ultimate deformation capacity of SRC columns. These observations align with widely accepted domain knowledge, indicating that the experimental data collected from different literature sources reflect a consistent influence pattern of the input parameters on deformation capacity.
On the other hand, it is observed that the composite parameters formed by combining multiple input parameters often exhibit high correlation with their constituent parameters, and both composite and individual parameters may significantly influence the output parameters. For example, the Pearson coefficients between αl and its constituent parameters (ρl, fyl and fck) are notably high, as αl is mathematically derived from these three parameters. Meanwhile, similar to its constituent parameters, αl also exhibits high correlation with the deformation limits at both the yielding and ultimate points. This phenomenon is known as multicollinearity in ML, where highly correlated features introduce redundancy into the training process of models, often adversely affecting model performance. To mitigate this issue, feature selection techniques must be employed for dimensionality reduction, as detailed in Section 4.2.
3.2. Statistical Analysis of the Influence of Failure Modes on Deformation Limits
Since the failure mode is a categorical parameter rather than a numerical one, it cannot be directly used to calculate Pearson correlation coefficients. Therefore, this section separately analyzes the influence of failure modes on deformation limits. The boxplots of the deformation limits of the specimens at the yielding and ultimate points, categorized by failure mode, are plotted in Figure 5. It is observed that the deformation limits of shear-bond failure, shear-compression failure, and flexural-shear failure show minor differences, while the deformation limits of flexural failure exhibit significantly greater variation compared with the other three failure modes. Notably, the deformation limits of flexural-shear failure at the yielding and ultimate points do not fall between those of shear failure and flexural failure, but instead are close to the values of shear failure; this is primarily because the encased steel section effectively enhances the shear capacity and energy dissipation capability of SRC columns, mitigating the brittle failure characteristics typically associated with shear-dominated failure modes.
Furthermore, quantitative analysis of the differences in deformation limits among various failure modes is conducted using analysis of variance (ANOVA) followed by Tukey’s honestly significant difference (HSD) post hoc test.
ANOVA is a parametric statistical procedure to compare means across three or more independent groups. By partitioning the total variance into between-group and within-group components, the test statistic F is calculated as the ratio of the between-group mean squares (MSB) to the within-group mean squares (MSW), i.e.,

$$F = \frac{\mathrm{MSB}}{\mathrm{MSW}} = \frac{\mathrm{SSB}/df_{\mathrm{between}}}{\mathrm{SSW}/df_{\mathrm{within}}}, \qquad \mathrm{SSB} = \sum_{i=1}^{k} n_i \left(\bar{X}_i - \bar{X}\right)^2, \quad \mathrm{SSW} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(X_{ij} - \bar{X}_i\right)^2$$

in which SSB is the sum of squares between groups, df_between is the degree of freedom between groups (df_between = k − 1, where k is the number of groups), n_i is the sample size of group i, X̄_i is the mean of group i and X̄ is the grand mean of all data; SSW is the sum of squares within groups, df_within is the degree of freedom within groups (df_within = N − k, where N is the total sample size), and X_ij is the j-th sample in group i. A significantly large F-value (typically for p < 0.05) indicates rejection of the null hypothesis (H0: μ1 = μ2 = … = μk) in favor of the alternative (H1: at least one mean differs).
When ANOVA indicates significant differences among the means of different failure modes, Tukey's HSD test is employed to conduct pairwise comparisons of group means. The critical HSD value is calculated by

$$\mathrm{HSD} = q_{\alpha,k,N-k}\sqrt{\frac{\mathrm{MSW}}{n}}$$

where q_{α,k,N−k} is the critical value from the studentized range distribution (depending on the number of groups k, the degrees of freedom df = N − k and the significance level α); n is the sample size per group (the harmonic mean is used if the sizes of the groups under comparison are unequal). If the absolute difference between the means of two failure modes exceeds the critical HSD value (|X̄_i − X̄_j| > HSD), the difference is statistically significant (p < α).
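For reference, this two-step procedure can be carried out with SciPy and statsmodels. The sketch below assumes the specimen data are held in a pandas DataFrame df with a categorical column failure_mode and a numerical column theta_u (both column names are hypothetical):

```python
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Group the ultimate drift ratios by failure mode (e.g., F, FS, SC, SB).
groups = [g["theta_u"].values for _, g in df.groupby("failure_mode")]

# One-way ANOVA: F = MSB / MSW; reject H0 if p < 0.05.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey's HSD pairwise comparisons, run only if ANOVA is significant.
if p_value < 0.05:
    tukey = pairwise_tukeyhsd(endog=df["theta_u"],
                              groups=df["failure_mode"], alpha=0.05)
    print(tukey.summary())
```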
The results of the above analysis are presented in Table 3. The results of ANOVA indicate that at least one failure mode exhibits significantly different deformation limits, for both the yielding and ultimate points, compared with the other failure modes. Furthermore, Tukey's HSD test reveals that the deformation limits of flexural failure show statistically significant differences from those of shear-compression failure and flexural-shear failure, for both the yielding and ultimate points. This result suggests that distinguishing failure modes may be beneficial for predicting deformation limits, especially for specimens exhibiting a flexural failure mode. For the other failure modes, although their deformation limits are relatively close, the fact that different failure modes reflect distinct failure mechanisms makes it necessary for the classification model to predict each failure mode.
3.3. Integrated Analysis on the Influence of Input Parameters and Failure Modes on Deformation Capacity and Implications on the Prediction Procedure
To visualize the effects of the input parameters and failure modes on the deformation capacity, the scatter plots of θy and θu against some of their highly correlated parameters are shown in Figure 6, in which the data points of different failure modes are distinguished by different markers. The Pearson correlation coefficient r between each input parameter and the target parameter is also displayed above each plot.
From Figure 6, it can be observed that, without distinguishing failure modes, although the data exhibit a certain degree of dispersion, the general variation trends of θy and θu with respect to the input parameters are relatively clear, reflecting the influence mechanisms of these variables on θy and θu.
On the other hand, however, it can be seen from each plot that the distinction between data points of different failure modes is comparatively minor, indicating that the influence of a single input parameter on the deformation capacity seems only weakly dependent on the failure mode, which is inconsistent with the conclusions drawn in the preceding section. Therefore, it is necessary to further discuss the importance of distinguishing failure modes when predicting the deformation capacity.
Summarizing the results of the preliminary statistical analysis in this section, two ML frameworks will be discussed in this study: the first is a two-stage framework, which predicts the failure mode of SRC columns using classification algorithms and then predicts the deformation limits using regression algorithms, taking the predicted failure mode as an additional input; the second is an end-to-end framework, which predicts the deformation limits directly from the input parameters, skipping the failure mode classification step.
4. Machine Learning Framework
The flowchart of the ML framework implemented in this study is shown in Figure 7. First, feature selection is executed to eliminate redundant variables and prevent overfitting induced by multicollinearity. Then, different prediction procedures and ML models are evaluated to determine the best-performing configuration. Afterwards, the hyperparameters of the chosen model are optimized using grid search with five-fold cross-validation to improve accuracy and stability. Finally, model interpretation is performed to verify the consistency of the data-driven predictions with domain knowledge.
4.1. Introduction to the Machine Learning Models and Performance Metrics
Considering the relatively small data volume (around 300 SRC column test specimens), the numerous input parameters (17 input parameters, including both individual and composite ones) and the ambiguous nonlinear relationships among variables (the test results show great dispersion due to coupled parameter interactions), four ML models, i.e., Support Vector Machine (SVM), Artificial Neural Network (ANN), eXtreme Gradient Boosting (XGBoost) and Random Forest (RF), are implemented in this study. The technical backgrounds of these models are extensively documented in ML-related research works and thus are not detailed here; only the main reasons for choosing the four typical models are listed:
SVM [74] is a supervised learning algorithm for classification and regression, which seeks an optimal hyperplane that maximizes the margin between classes. It is efficient for high-dimensional small-scale datasets and can handle nonlinear problems via kernel tricks.
ANN [75] is included due to its capacity for modeling complex nonlinear relationships through layered architectures, which makes it suitable for automatically capturing the nonlinear relationships among the various input parameters of SRC columns when predicting the output variables.
XGBoost [76] and RF [77] represent the two main branches of ensemble learning, which seeks to improve the accuracy and generalization capability of models by combining multiple simple base learners. XGBoost is an optimized gradient boosting algorithm that sequentially trains decision trees to correct the errors of previous models, thus improving prediction precision, while RF trains parallel decision trees independently via bootstrap sampling (random subsets of the data drawn with replacement), aggregating their predictions through voting or averaging to enhance robustness and reduce overfitting.
To compare the performance of the aforementioned models, the evaluation metrics shown in Table 4 are employed.
4.2. Feature Selection
As mentioned in Section 3.1, since the original input parameter set of the collected experimental data includes both single features and composite features, feature selection combining Pearson correlation coefficients and the variance inflation factor (VIF) is conducted to avoid a reduction in model accuracy caused by multicollinearity.
Based on the Pearson correlation coefficients shown in Figure 4, pairwise selection is conducted first. For any pair of features with a correlation coefficient exceeding a predefined threshold (0.85 is used in this study), only the feature with the higher correlation with the target parameter is retained. This step aims to identify pairwise information redundancy, where two variables provide nearly identical information. Then, the VIF is calculated for the remaining features to detect more complex redundancy. The VIF is calculated with the following steps: the considered input feature X_i is fitted by a linear combination of the remaining input features, the coefficient of determination R_i^2 of this linear fit is calculated, and the VIF is then obtained by

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

Features exhibiting a high VIF can be considered largely explainable by linear combinations of the other features, thereby constituting redundant information, and removing such features is therefore a reasonable strategy. This strategy can be applied iteratively to obtain a more concise feature set.
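This iterative screening can be implemented compactly with statsmodels. The following sketch assumes the candidate features are held in a pandas DataFrame X; the helper function name and the threshold of 10 (the screening value applied to the geometric parameters) are chosen for illustration:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the largest VIF above `threshold`."""
    X = X.copy()
    while True:
        Xc = add_constant(X)  # intercept column, so R_i^2 is centered
        vif = pd.Series(
            [variance_inflation_factor(Xc.values, i)
             for i in range(1, Xc.shape[1])],  # skip the constant itself
            index=X.columns)
        if vif.max() <= threshold:
            return X
        X = X.drop(columns=[vif.idxmax()])  # remove the most redundant feature
```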
Figure 8a shows the VIF of all input features before selection.
Combining the information from Figure 4 and Figure 8a, a systematic feature selection is conducted. Although d, h and b provide complete geometric information, d and b are excluded due to their excessive VIF (>10), which indicates redundant geometric information. Similarly, ρsv and αl are excluded, as their physical roles are effectively represented by ρv and by the combination of ρl and fyl, respectively. The final feature set prioritizes dimensionless mechanical indices (e.g., nt and λ), ensuring that the model is both statistically robust and physically consistent. After feature selection, 10 input features are retained to train the ML classification and regression models. These features are listed in Table 5. The VIF values of the remaining features are shown in Figure 8b. It is observed that the VIF values of all parameters have been reduced to below 5.0, indicating negligible multicollinearity.
4.3. Training and Results of Machine Learning Models
4.3.1. Failure Mode Classification
In this study, the ML models are implemented with Scikit-learn (sklearn) [78], a robust and comprehensive open-source framework for predictive data analysis. For the classification models, the hyperparameters listed in Table 6 are initially used, while the rest are kept at the default values of the program.
It is noted that the distribution of failure modes in the test database is highly imbalanced (Figure 1), with nearly half of the specimens exhibiting flexural failure, while the numbers of specimens exhibiting shear-bond failure and flexural-shear failure are relatively small. This is consistent with the actual proportions in engineering practice, where engineers prefer flexural failure with its good ductility. To address this problem, the Synthetic Minority Over-sampling Technique (SMOTE) is employed on the training set. This technique generates new instances of the minority classes (e.g., SB and FS) by linear interpolation between a real minority sample and one of its nearest neighbors, thereby filling the feature space of the minority class so that the classifier can learn a more robust decision boundary.
For the SVM and ANN algorithms, distance calculations or gradient-based optimization are conducted during model training. These models are therefore dependent on the geometry of the feature space, and standardization of the input and output features is necessary to prevent features with larger numerical scales from dominating the loss function or causing gradient saturation. Z-score standardization is conducted with the StandardScaler function provided in the sklearn package to transform the data of each parameter to have zero mean and unit variance, i.e.,

$$Z = \frac{x - \mu}{\sigma}$$

where Z and x are the data after and before standardization, and μ and σ are the mean and standard deviation of the data before standardization.
The test database is randomly partitioned into a training set and a testing set at a ratio of 8:2. The ML models are trained using the training set (249 samples), while the remaining 20% of the data (63 samples), previously unseen by the models, is used to independently validate the models' performance. The confusion matrices of the four models on the testing set are shown in Figure 9, with the accuracy and macro-F1 score marked on top of each matrix.
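A minimal sketch of this classification workflow (8:2 split, z-score standardization, SMOTE on the training set only, followed by an XGBoost classifier) is given below. The DataFrame df, the feature list selected_features and the integer label encoding are assumptions, and the hyperparameters of Table 6 are omitted for brevity:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from xgboost import XGBClassifier

X = df[selected_features].values    # the 10 features of Table 5
y = df["failure_mode_code"].values  # integer-coded F/FS/SC/SB labels (assumed)

# 8:2 random partition of the database.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Z-score standardization, fitted on the training set only. (Required for
# SVM/ANN; harmless for the tree-based XGBoost and RF models.)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# SMOTE oversampling of the minority classes, applied to the training set only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

clf = XGBClassifier(random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, y_pred))
print("macro-F1:", f1_score(y_te, y_pred, average="macro"))
print(confusion_matrix(y_te, y_pred))
```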
From Figure 9, it is observed that all four models reach a relatively high global accuracy and macro-F1 score, indicating that ML models can effectively distinguish the failure modes of SRC columns with the selected input features. Notably, the XGBoost model demonstrates the best predictive capability, achieving the highest classification accuracy of 0.968 and a macro-F1 score of 0.957. Specifically, the ensemble models (XGBoost and RF) exhibit exceptional precision in identifying ductile failure (F), with 100% accuracy. For the brittle failure modes (SB and SC), the XGBoost model outperforms the others, accurately capturing the minority-class features, facilitated by the SMOTE algorithm. The trained XGBoost model is therefore used in the two-stage framework, and its prediction result serves as an additional input feature in the regression models of the second step.
4.3.2. Deformation Limit Prediction
For the regression models, the hyperparameters listed in Table 7 are used. The standardization process for both the input and output features, as well as the random partition ratio of the training and testing sets, is identical to that of the classification step.
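Continuing the sketch above, the regression step of the two-stage framework can be outlined as follows, with the failure mode predicted by the trained classifier clf appended as an additional input feature; the target arrays theta_y_tr and theta_y_te are assumed to hold the experimental yielding drifts, and the hyperparameters of Table 7 are again omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Append the predicted failure mode to the feature matrices (the original,
# non-resampled training features are used here, so rows stay aligned with
# the experimental targets).
X_tr_aug = np.column_stack([X_tr, clf.predict(X_tr)])
X_te_aug = np.column_stack([X_te, clf.predict(X_te)])

# RF regressor for the yielding drift theta_y.
reg = RandomForestRegressor(random_state=42).fit(X_tr_aug, theta_y_tr)
pred = reg.predict(X_te_aug)

print("R2  :", r2_score(theta_y_te, pred))
print("RMSE:", mean_squared_error(theta_y_te, pred) ** 0.5)
print("MAPE:", 100 * mean_absolute_percentage_error(theta_y_te, pred), "%")
```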
Figure 10 and Table 8 show the regression results for θy and θu generated with the four models in the two-stage framework. The red dashed line in the figure represents where the predicted values are equal to the experimental values. Overall, the proposed two-stage framework exhibits high predictive performance, with the majority of data points located closely around the equality line (y = x). For θy, all four models achieve R2 values on the testing set ranging from 0.837 to 0.882, and the RF model outperforms the others with the highest R2 of 0.882 and the lowest RMSE of 0.00150; for θu, the models also show high predictive capacity, with R2 values on the testing set exceeding 0.837 across all models, and the XGBoost model reaches an R2 of 0.893 and a MAPE of 16.40%, providing a highly reliable tool for ultimate deformation capacity analysis.
The robustness of the two optimal models is further evaluated by analyzing the R2 degradation between the training and testing sets. For θy, the RF model exhibited an R2 degradation of 0.085, and for θu, the XGBoost model exhibited an R2 degradation of only 0.101; both show satisfactory robustness. Together with the high testing performance (R2 near 0.9), it is confirmed that the model has effectively captured the mechanical laws governing the yielding and ultimate limit of SRC columns rather than overfitting to specific training samples.
4.3.3. End-to-End Framework
This study also proposes an end-to-end framework to directly predict the deformation capacity of SRC columns from the input parameters, with the results shown in Figure 11 and Table 9. Again, the models show high predictive accuracy and satisfactory robustness. Similar to the case of the two-stage framework, the RF and XGBoost models are the optimal models for θy and θu, with R2 values on the testing set of 0.879 and 0.895, respectively.
4.4. Comparison of the Two Frameworks
The performance of the optimal models in the two-stage and end-to-end frameworks is compared in Table 10. In contrast to the existing literature, where a two-stage framework often yields a significant accuracy improvement, only a slight improvement is observed for the two-stage framework in this study. This can be explained by the fact that the encased steel section largely enhances the deformation capacity under shear failure, thereby blurring the boundary between brittle and ductile failure modes. On the other hand, the end-to-end framework avoids error propagation from the classification stage to the regression stage, and is also capable of implicitly capturing the underlying mechanism of deformation, thus giving highly accurate and robust predictions. Therefore, the end-to-end framework developed in this study is recommended for predicting the deformation capacity of SRC columns, and is discussed in the following part.
4.5. Hyperparameter Optimization
To ensure the predictive stability and to mitigate the risk of overfitting, the hyperparameter optimization process is conducted on the optimal machine learning models determined in the end-to-end framework, i.e., the RF model for θy and the XGBoost model for θu prediction. The tuning strategy employs a combination of five-fold cross-validation (CV) and grid search.
For each model, a multi-dimensional search space (grid) is defined, including key hyperparameters such as the learning rate, tree depth, and regularization coefficients. During the five-fold CV process, the training dataset is partitioned into five subsets; for each combination of hyperparameters, four subsets are used for model fitting while the remaining subset serves as a validation set. This procedure is repeated five times for every parameter combination to obtain a stable average performance metric (R2 is used in this study as the optimization target). Finally, the hyperparameters leading to the best performance are used to train a regression model on the whole training set. The search spaces and the finally chosen hyperparameters for the two models are listed in Table 11.
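A minimal sketch of this tuning procedure, applied to the XGBoost model for θu, is given below; the grid values shown are illustrative placeholders rather than the actual search space of Table 11, and theta_u_tr is assumed to hold the training-set targets:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid; the real search space is listed in Table 11.
param_grid = {
    "n_estimators":  [100, 300, 500],
    "max_depth":     [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "reg_lambda":    [0.1, 1.0, 10.0],  # L2 regularization coefficient
}

# Five-fold CV over the grid, maximizing the average R2 on validation folds.
search = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                      scoring="r2", cv=5, n_jobs=-1)
search.fit(X_tr, theta_u_tr)  # training set only; the test set stays held out

print("best CV R2 :", search.best_score_)
print("best params:", search.best_params_)
tuned_model = search.best_estimator_  # refit on the whole training set
```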
The performance of the models with tuned hyperparameters is compared with that of the initial hyperparameters (baseline models), and the main results are listed in Table 12. It is found that the tuned models yield a higher R2 in CV, together with a slight degradation in predictive accuracy on the testing set (e.g., a decrease in R2 and an increase in RMSE). In fact, the reduction in R2 observed on the testing set (e.g., from 0.878 to 0.856 for θy) indicates a successful mitigation of the initial overfitting. The narrowing gap between the CV and testing performance suggests that the tuned models are more robust against the inherent stochasticity of the SRC experimental database. With the MAPE remaining consistently below 17%, the tuned models demonstrate a high level of predictive accuracy. Therefore, the tuned models are finally used to predict the seismic deformation capacity of SRC columns.
Moreover, from Table 12, it is also observed that both the RF and XGBoost models achieve a higher R2 on the independent testing set than the average R2 obtained during the five-fold CV on the training set. This typically suggests that some of the models trained in the five-fold partitions may have been adversely affected by experimental outliers, penalizing the average CV score, i.e., the evaluation remains sensitive to the specific data split. To mitigate the bias introduced by a single K-fold partition and to provide a thorough assessment of the models' stability, ShuffleSplit CV is subsequently implemented. By executing 100 independent random shuffles and train–test splits, ShuffleSplit provides a sufficiently large sample size for observing the probability distribution of the model's performance on diverse data combinations.
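The ShuffleSplit evaluation can be sketched as follows, assuming X_all and theta_u_all hold the features and ultimate drifts of the full database:

```python
from sklearn.model_selection import ShuffleSplit, cross_val_score

# 100 independent random 8:2 splits; each iteration retrains the tuned model
# and scores it on the corresponding held-out portion.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
scores = cross_val_score(tuned_model, X_all, theta_u_all, scoring="r2", cv=cv)

print(f"mean R2 = {scores.mean():.3f}, "
      f"min = {scores.min():.3f}, max = {scores.max():.3f}")
```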
The results of ShuffleSplit are shown in Figure 12. The RF and XGBoost models show satisfactory predictive accuracy, with mean R2 values of 0.777 and 0.855, respectively. In particular, for the tuned XGBoost model for θu, while occasional unfavorable data partitions yield R2 values as low as 0.65, the majority of the 100 iterations are concentrated within the 0.82 to 0.90 range, indicating high accuracy and generalization capacity. For the RF model for θy, the wider distribution of R2 indicates that the prediction of the yielding drift is more sensitive to noise in the data.