Gaussian Process Regression-Based Structural Response Model and Its Application to Regional Damage Assessment

: Seismic activities are serious disasters that induce natural hazards resulting in an incalcu-lable amount of damage to properties and millions of deaths. Typically, seismic risk assessment can be performed by means of structural damage information computed based on the maximum displacement of the structure. In this study, machine learning models based on GPR are developed in order to estimate the maximum displacement of the structures from seismic activities and then used to construct fragility curves as an application. During construction of the models, 13 features of seismic waves are considered, and six wave features are selected to establish the seismic models with the correlation analysis normalizing the variables with the peak ground acceleration. Two models for six-floor and 13-floor buildings are developed, and a sensitivity analysis is performed to identify the relationship between prediction accuracy and sampling size. A 10-fold cross-validation method is used to evaluate the model performance, using the R-squared, root mean squared error, Nash criterion, and mean bias. Results of the six-parameter-based model apparently indicate a similar performance to that of the 13-parameter-based model for the two types of buildings. The model for the six-floor building affords a steadily enhanced performance by increasing the sampling size, while the model for the 13-floor building shows a significantly improved performance with a sampling size of over 200. The results indicate that the heighted structure requires a larger sampling size because it has more degrees of freedom that can influence the model performance. Finally, the proposed models are successfully constructed to estimate the maximum displacement, and applied to obtain fragility curves with various performance levels. Then, the regional seismic damage is assessed in Gyeonjgu city of South Korea as an application of the developed models. The damage assessment with the fragility curve provides the structural response from the seismic activities, which can assist in minimizing damage.

learning technique was applied for a six-floor building and a 13-floor building to improve the estimation models. The proposed models are based on the peak ground acceleration (PGA). Different sampling sets, such as 100, 150, 200, 250, 300, 350, and 400, for the number of the waves were investigated to determine the appropriate sampling size in establishing the models using various statistical indices.

Estimation Model Based on Gaussian Process Regression
In this study, we employed GPR (also known as Kriging) to develop a seismic model and to analyze the model performance in estimating the maximum displacement of the structure and the seismic fragility curves. GPR analysis is a type of regression analysis that is adopted when the dependent variable follows the Gaussian process [20,25,26]. Phan et al. [27] and Hoang et al. [28] used the GPR technique for seismic vulnerability and seismic fragility analysis for storage tanks and reinforced concrete highway bridges. Jung et al. [20] also investigated several methods in estimating the maximum displacement and found that the GPR approach showed the best performance. GPR has the structural characteristic of machine learning techniques, assuming that the model output, Y(x), indicates a realization of a Gaussian process. The formulation of the model can be obtained as follows: where Y(x) means the unknown function of interest, indicates a vector of unknown repression coefficient, f(x) implies a vector of known regression function, and Z(x) is a realization of the Gaussian process based on zero mean and nonzero covariance. The Z(x) can be expressed as where indicates the process variance and − implies the spatial correlation function using unknown and known correlation parameters . The correlation between ( ) and ( ) means the function of a distance between and . The output, ( ), of the model can be obtained using the input points = ( , , … , ) and the corresponding computer output = ( ( ), ( ), … , ( ) ) . The related equation can be formulated as where f = ( ) is the regression function vector for the estimated data and F = ( ) is the regression function matrix of the training data. r is the vector of the correlation function between Y and Y(x ) and R indicates the correlation matrix of Y . The model can interpolate the n data points based on the various basic functions and correlation functions. With Equation (3), the best linear unbiased predictor of Y(x ) can be gained as follows: Additionally, the mean standardized error for the estimate, Y (x ), can be obtained from The unknown parameter (θ) should be estimated to obtain the model by solving an optimization problem. The maximum likelihood estimation approach is used to determine the parameters (β, σ , and θ) maximizing the quantity in Equation (3). The input variables and the output variable derived from the model are described in Section 3.1. The cross-validation method considering the training data and testing data for the model process is explained in Section 2.3. A comprehensive description of the GPR procedure was presented by Williams and Rasmussen [29].

Seismic Probabilistic Risk Assessment
Once the maximum displacement is estimated using the GPR method, the seismic probability risk assessment can be performed based on the seismic fragility curve. The fragility curve is applied for the structural safety analysis [30]. These curves can be used to evaluate seismic structural performance and obtain the probability of a structure's safety or failure against natural disasters.
To generate the fragility curve, we performed a simulation analysis by increasing the magnitude of earthquakes in buildings. Then, the maximum displacement derived from the simulation was validated using the immediate occupancy (IO), life safety (LS), and collapse prevention (CP) levels. With the maximum displacement, the fragility curve is expressed as a log-normal distribution function, which can indicate the probability of seismic damages related to seismic intensity. The distribution function within the intensity ranges implies the function of the seismic intensity, indicating the performance of the seismic analysis. In the analysis, the maximum likelihood estimation method is used to obtain statistical parameters of the fragility curve, i.e., mean and standard deviation of log-normal distribution function. The maximum likelihood function can be expressed as follows: where (•) represents the fragility curve generated from the maximum displacement and implies the seismic intensity, the peak ground acceleration (PGA) adopted in this study. Here, = 1, if a seismic damage occurs at the i th site, whereas = 0 when there is no damage to a structure. N indicates the total number of sites observed for the investigation. In the equation, (•) can be written as follows based on the following log-normal assumption: where represents the PGA, c denotes the median value in the fragility curve, ζ indicates the standard deviation of the curve, and Φ[•] denotes the function of the standardized normal distribution.

Evaluation Criteria
To evaluate the model performance, cross-validation resampling techniques have been commonly used by comparing observed variables with estimated variables [20,[31][32][33]. In this study, the k-fold cross-validation approach is applied for validating the displacement estimation by obtaining boundaries of errors and estimated errors [34]. When k represents a value of 10 as used in this study, 10% of the data are temporarily excluded for testing and the remaining data is used for training by structuring a model. In other words, the data are divided into 10 groups of the same or similar size. One of the groups is designated as the test set, and remaining groups are used for the training set. Each group is repeatedly excluded in the 10 groups considering the error variability. The estimated displacement is then evaluated based on statistical indices with the measured displacement to conduct a model performance analysis. Several studies were conducted for sensitivity analysis based on k-fold cross-validation, including 10-fold cross-validation [35,36]. This 10-fold cross validation was also used in this study because other cross-validation methods, such as Jackknife, apparently reduce the accuracy of the model owing to the large number of data sets.
The evaluation methods employing statistical analysis used in the study are the Rsquared (R 2 ), the root mean squared error (RMSE), the Nash criterion (NASH), and the mean bias (BIAS). We attempted to evaluate the estimated values based on four criteria-R 2 , RMSE, NASH, and BIAS. These criteria can assess the accuracy and bias of the model's results. In this study, we used the RMSE for the model accuracy instead of the MAE, because the RMSE were applied for the evaluation of model performance in several studies as one of the general methods in evaluation criteria [20,32,37,38]. The indices are calculated using the following equations: where RSS is the residual sum of squares and TSS is the total sum of squares. The total number of data is expressed as n, the at-site estimate for location i is denoted as , and the estimate based on the proposed model for location i is presented as . A simple diagram for estimating the maximum displacement and obtaining the fragility curves is shown in Figure 1.

Feature Selection of Seismic Waves
Thirteen seismic wave properties were considered in establishing the GPR model to estimate the maximum displacement. The seismic waves obtained using the magnitude of over 6.5 with far-fault and near-fault in the Pacific Earthquake Engineering Research Center (PEER) were investigated. The considered features include Arias intensity, cumulative absolute velocity (CAV), characteristic intensity, the difference between the maximum seismic wave and the minimum seismic wave, total cumulative energy (ECUM), Fourier amplitude spectrum (FM), PGA, peak ground velocity (PGV), peak ground displacement (PGD), peak velocity and acceleration ratio (PVA), duration (5% and 95%), significant duration (Td), and mean period (Tm). The total number of the seismic waves was 400, and the seismic wave characteristics were calculated based on the scaling of PGA in the present study. The 400 waves are normalized and scaled using the PGA. The nonlinear examination with the normalized PGA is then conducted for each of the seismic intensity levels ranging from 0.0g to 3.5g for 150 points. Figure 2 presents the 400 seismic waves for the response spectrum used in the study. Table 1 shows the 13 seismic wave features for the statistical analysis. Generally, there are two types of seismic intensity, namely, PGA and spectral acceleration (SA). In this study, PGA was selected as earthquake intensity. The PGA is a magnitude of the ground acceleration that occurred during the earthquake, and SA is a magnitude of the degree of shaking considering a building's natural frequency. The PGA is adopted in this study because it can easily express how much the seismic wave shakes the ground. Detailed procedures for the estimation of the wave feature are presented in the analysis by Papazafeiropoulos and Plevris [37].  Based on the aforementioned 13 wave properties obtained for this study, a correlation analysis among the features was performed to determine important variables for structuring the seismic model. The Pearson correlation coefficient was used for the correlation analysis. The Pearson correlation coefficient is used to examine the correlation between different variables. Jung et al. [20] carried out the principal component analysis (PCA) by extracting new features to estimate the maximum displacement, and they identified that the PCA provides the best estimation. Therefore, we focused on the correlation analysis based on the Pearson correlation in this study. Table 2 presents these 13 seismic wave characteristics and Table 3 shows the results based on the Pearson correlation coefficient. In Table 3, O indicates "use", × means "not use", and ▲ implies "not use due to the extra simulation". The feature with the symbol of ▲ is not used although the correlation is high because the extra simulation is required. On the basis of the correlation analysis, we selected Arias intensity, CAV, ECUM, FM, PGA, and Tm. The PGV and PGD are excluded because additional calculation (i.e., time history dynamic analysis) is required to use these variables. The selected six features were applied for the development of the seismic model, and the proposed model was compared with the model based on 13 features to assess the performance of the former. Additionally, sampling sets based on various numbers of the seismic waves, including 100, 150, 200, 250, 300, 350, and 400, were used for sensitivity analysis. The bold values indicate the high correlation (more than 0.80). O indicates "use", × means "not use", and ▲ implies "not use due to the extra sim-ulation".

Six-and Thirteen-Floor Buildings
For the maximum displacement estimation and development of fragility curve, we used two types of steel moment-frame buildings with six floors and 13 floors. The sixfloor building was designed with the 1973 Uniform Building Code (UBC) requirements in 1976 [34,39]. The building is characterized by a rectangular plan of 36.6×36.6 meters with 25.3 meters in height, and an 8.2 centimeter thick concrete slab. Features of sections are estimated using A-36 steel and the yield stress of 303MP. The weight of the structure is approximately 34,644kN. The plan view and member types for the six-floor building is shown in Figure 3a. Detailed descriptions of the target building were presented by Kunnath et al. [40] and Kalkan and Kunnath [41].
The thirteen-story structure is considered as a second illustrative example in this study. It is composed of a basement and 13 floors above the ground. This structure was built in 1975 and designed according to the 1973 UBC requirements. The floor plan of the structure is shown in Figure 3b with elevation view. The building has a rectangular plan of 48.8×48.8 meters. The exterior frames are moment-resisting frames, and the interior frames are for load bearing. It has been instrumented as part of the Strong Motion Instrumentation Program for six-floor and 13-floor buildings [41][42][43]. Additionally, Song et al. [44] examined ground motion intensity measures and component seismic demand parameters to conduct a probabilistic seismic demand analysis of bridges. In the study, we used the Opensees to carry out the nonlinear dynamic analysis and perform the Newmark beta method for structural analysis. This study focused on developing a structural response estimation model. Detailed structural analysis, including the implementation of a nonlinear time history analysis, the type of method used, nonlinear material parameters and its model, and the nonstructural components, plays an important role in the structural analysis. However, these aspects are beyond the scope of this study. Therefore, the details of related information are briefly mentioned herein. The data for the analysis are obtained from https://www.quakelogic.net/research (accessed on 2020.06.01).

Six-Floor Building Analysis for Development of the Rapid Prediction Model
In this study, the simulation model of the buildings was developed based on the designed data, and the movement of the real building can be represented by the model by adjusting the parameters. The purpose of the study was to develop the estimation model using the maximum displacement, which does not focus on the accuracy analysis and analysis methods. With this data, we estimated the maximum displacement and developed fragility curve using different sampling sets, including 100, 150, 200, 250, 300, 350, and 400. By changing the sampling sizes, the appropriate sampling size can be identified to provide the best performance with the proposed seismic model. In this section, we focus on the six-floor building to estimate the displacement and the fragility curve. To analyze the performance of the selected seismic features in the process, we obtain results from the models based on six selected properties and 13 (full) properties. The variables applied for the simulation are also transformed to obtain standardization and normality; then, the transformed variables are used in the six-parameter-based model and the 13-parameterbased model. The performance of the model is validated using the statistical indices, such as RMSE, R 2 , NASH, and BIAS. Figure 4 shows the results from various statistical indices for the different sampling size based on the model with six features and 13 features. The six-parameter-based model presents a relatively high performance compared to the 13-parameter-based model. For the RMSE, R 2 , and NASH, the performance seems to slightly decrease between 150 and 300 sampling sizes with the six features. However, the overall performance appears to be improved when the sampling size increases for the 13 parameter-based model. Based on the observation, we can conclude that the six-parameter-based model for estimating the maximum displacement in seismic analysis tends to provide a satisfactory performance with more accuracy. Table 4 presents the values of the statistical indices for six-parameterbased model and 13-parameter-based model. From the table, we can observe that both models with 400 sampling size produce the best performance, while a sampling size of 100 in both models generates the worst performance.  In addition, we developed the seismic fragility curve derived from the maximum displacement for the six-parameter-based model and the 13-parameter-based model. Fragility curves based on IO, LS, and CP levels are shown in Figure 4 by comparing their performance. In the work, we used the MIDR as a performance criterion, while the fragility curve is calculated based on the two data sets (the exact data from simulation and estimation data from the proposed model) with the seismic waves. The levels for IO, LS, and CP determine the acceptance range of MIDR using the maximum height. The performance levels have threshold values considering the IO (0.7%), LS (2.5%), and CP (5.0%) to provide the MIDR limits [27,28,45,46]. Generally, the structural analysis that produces the fragility curve is conducted based on the PGA of 0.0g in the fragility function. For the analysis, we used the three performance levels (IO, LS, and CP); among the three levels, the CP indicates the entire destruction that we want to obtain by increasing the PGA to 3.5g. The fragility curve is then estimated based on the log-normal distribution. In other words, the CP requires the point of the entire collapse of the target structures. Therefore, the simulations (e.g., nonlinear dynamic time history analysis) are conducted by increasing the PGA until the structure is collapsed. In the study, the PGA of 3.5g is selected for the upper limits of the PGA range for the simulation.

Thirteen Floor Building Analysis for Development of the Rapid Prediction Model
The GPR model is also examined for the 13-floor building, which has more degrees of freedom than the six-floor building. The six-parameter-based and 13-parameter-based models are used to produce the maximum displacement with fragility curve. To assess the model performance, four statistical indices are applied for the results obtained from the seismic model. Figure 6 shows the statistical results using RMSE, R 2 , NASH, and BIAS, and they show a better estimation as the sampling sizes increase from 100 to 400. As shown in the figure, the model with six features seems to have similar results to the model with 13 features. Furthermore, we can identify that model performance tends to be largely enhanced with the model based on 150 sampling size for the six and 13 features. These results conclude that the six-parameter-based model would be sufficient for estimating the displacement of the 13-floor structure in the seismic analysis. Table 5 presents the values of the indices for both models with various sampling sizes. Among the different sizes, 400 sampling size shows the best performance with the proposed models.  The analysis for the fragility curve is carried out for the 13-floor building based on the IO, LS, and CP levels with the exact data and the estimated data, as performed in the model for the six-floor building. Figure 7 represents the curves for the different levels and different data sampling sizes, including 100 and 400. The curves appear to be the S-curve, and the model with the IO level seems to produce the best performance, compared to the LS and CP levels. The values of the fragility curve function using the IO level with 100 sampling size have the exact data of −0.9051 and 0.9779, the six-feature data of −0.9735 and 1.1005, and the 13-feature data of −1.0502 and 1.1109 for the median and standard deviation statistics, respectively. The values for 400 sampling size have the exact data of −0.9051 and 0.9779, the six-feature data of −0.9045 and 0.9774, and the 13-feature data of −0.9062 and 0.9791 for the median and standard deviation statistics, respectively. Based on the model with six features, we compare the performance of the six-floor structure to the 13-floor structure in estimating the maximum displacement to identify their trend and the appropriate sampling size. The results for the RMSE, R 2 , NASH, and BIAS based on the two types of buildings are presented in Figure 8 for different sampling sizes, ranging from 100 to 400. For the six-floor structure, the model performance improves gradually with increasing sampling sizes from the four statistical indices. In estimating the displacement of the six-floor structure, represented as the low height building, a smaller sampling size tends to be sufficient. On the other hand, for the 13-floor structure, we can clearly determine that the model performance largely increases from 100 to 200 sampling size based on the RMSE, NASH, and BIAS. The value of R 2 significantly increases from 250 to 350 sampling sizes. This behavior shows that the 13-floor structure, representing a high height building, requires a large sampling size of more than 200. This result is expected because a structure with more complexity (e.g., the 13-floor structure) has more degrees of freedom, which can affect the model performance compared to a lower structure. Table 6 shows the values of the statistical indices for the six-floor and 13floor structures with various sampling sizes.

Discussion with Regional Seismic Damage Assessment
We investigated the structural response (i.e., the maximum displacement) and then performed a seismic vulnerability assessment at Gyeongju by applying the proposed prediction model. The GIS analysis and estimation of structural response were performed.
There is a lack of information, such as a prediction model of other structures. However, the proposed approach can be applied to different structural types, and the omitted information can be easily obtained. The US building data can be used as training to conduct the seismic analysis in Gyeongju, because it is based on the results from the simulation model.

GIS Analysis
The regional seismic damage assessment is performed using the data set provided by Korea National Spatial Data Infrastructure Portal (http://www.nsdi.go.kr/). Based on the GIS data derived from National Hub Open Data, we combine topographical information of geo-spatial information with GIS integrated building information. Then, the topographical information and the integrated/combined building information are extracted for the analysis of the target region, Gyeongju. The estimation model is developed, in the present study, based on a steel-frame structure, and it is expected to achieve structural responses with maximum displacement for the structures in the area. Among various types of structures in Gyeongju, the steel-frame structure occupies 26.52%. Other structures, such as masonry-frame, concrete-frame, and wood-frame structures, occupy 29.64, 19.53, and 24.19%, respectively. Figure 9 shows the locations of all types of structures in Gyeongju and the locations of the steel-frame structure used for the analysis.

Estimation of Structural Responses
To estimate the structural response, we use the seismic information based on the earthquake at Gyeongju in 2016. The reduction in magnitude for seismic activities is conducted to consider the location of earthquakes and individual buildings in the study region [47]. The reduction equations considered for the analysis follow the equations of Noh and Lee [48], Toro et al. [49], and Jo and Baag [50]. Each equation is applied considering the weight based on 20, 40, and 40%, respectively, to solve uncertainly problems. The entire process for the structural response estimation is shown in Figure 10. Based on the presented procedure, we estimate the damages from the seismic waves at Gyeongju with the GPR model. From the national and architectural seismic design standard, Gyeongju characterizes the seismic district I, the seismic district coefficient of 0.11 with the return period of 500, and the traditional design standard for earthquake of 0.147g. The result of the maximum displacement estimation based on the traditional standard is shown in Figure 11 for the steel-frame structures in Gyeongju. Note that there are 32 buildings with more than six floors and 19,984 buildings with fewer than six floors in a total number of 20,016 buildings at the study area. The maximum PGA derived from the occurrence of the Gyeongju earthquake in 2016 is 0.404g. In this study, the maximum displacement estimated using this maximum PGA is obtained, and the result is presented in Figure 12. From this figure, we can identify that the displacement from the earthquake provides information on the magnitude of the damages occurred due to natural disasters. As a result, the proposed model can be used to generate accurate and rapid responses based on the regional seismic damage assessment utilizing the maximum displacement estimation. In this study, we developed two estimation models, and we applied these models to structures in Gyeongju. The results demonstrate the effectiveness of the proposed models pertaining to vulnerability assessment, and they can be used to developing estimated models in future studies. Additionally, the HAZUS and other tools can be applied for vulnerability assessment after further developing the estimation models.

Conclusions
The machine learning framework for the estimation of the maximum displacement and fragility curve is presented in this paper. The GPR model was applied to two types of buildings, i.e., six-floor and 13-floor structures. Additionally, 13 seismic wave features were considered to establish the model, and six features were selected from the correlation analysis using the Pearson correlation coefficient. The six-parameter-based model and 13parameter-based model were compared to assess the model performance in estimating the maximum displacement. The performance was validated using the RMSE, R 2 , NASH, and BIAS with 10-fold cross validation, and sensitivity analyses with various sampling sizes of 100, 150, 200, 250, 300, 350, and 400 were performed to determine the proper sampling number for the two structure types.
In the analysis of the models with different numbers of features, the six-parameterbased model presented a similar performance to that of a 13-parameter-based model for both types of the buildings, based on the four statistical indices. The six-parameter-based model seems to provide an acceptable performance in estimating the seismic maximum displacement by reducing the computing cost for the development of estimation models. In the sensitivity analysis of the six-floor building, the model performance improves steadily with the increase in the sampling size. For the 13-floor building, the performance significantly increases for sampling sizes greater than 200. A short building tends to require a smaller sampling size, whereas a tall building tends to require a large sampling size. This might be because a high height structure has more degrees of freedom, which can affect the sampling size.
Furthermore, we examined the development of seismic response models to rapidly estimate seismic damages based on the fragility curve and regional damage assessment. The seismic fragility curve was examined based on the three performance levels, namely, IO, LS, and CP in MIDR. We compared the fragility curves derived from the actual data and estimated data and verified the accuracy of the estimated curve. Regional seismic damage assessment was also investigated by determining the structural response with the maximum displacement. The GPR model based on seismic information on Gyeongju earthquake estimated the maximum displacement considering the maximum PGA of 0.404g. The result apparently indicates that the estimation of structural response can establish advanced seismic models with the displacement estimation and can assist in political decisions by providing robust and prompt responses from seismic activities. The maximum structural displacement, playing an important role in seismic risk assessment, can be quickly and accurately estimated using the proposed models based on seismic activities and then easily applied for constructing fragility curves by providing a probabilistic damage information and regional seismic damage prediction. Potentially, the estimated structural damage information can be helpful for decision-makers, city planners, etc. Future work should focus on various structural types by applying for the GPR or other techniques to obtain an accurate estimation of the maximum seismic displacement.