Application of Support Vector Machine Modeling for the Rapid Seismic Hazard Safety Evaluation of Existing Buildings

Abstract: Economic losses from earthquakes can severely affect a national economy; models capable of estimating the vulnerability of buildings and the losses from future earthquakes are therefore highly valuable to emergency planners for risk mitigation. This calls for a mass prioritization screening of structures to identify vulnerable buildings for retrofitting. Applying advanced structural analysis to every building to study its earthquake response is impractical because of the complex calculations, long computational time, and high cost involved. This highlights the need for a fast and reliable method, commonly known as Rapid Visual Screening (RVS). The method serves as a preliminary screening platform that uses an optimum number of seismic parameters of the structure and predefined output damage states. In this study, the efficacy of Machine Learning (ML) for damage prediction is investigated using a Support Vector Machine (SVM) model as the damage classification technique. The developed model was trained and tested on damage data from the 1999 Düzce earthquake in Turkey, where each building is described by 22 performance modifiers used as inputs to the supervised machine learning model.


Introduction
The world has experienced innumerable catastrophic earthquakes in the history of mankind, leading to enormous numbers of fatalities and severe property damage. Old structures still in service, structures of historical heritage, highly important buildings, and buildings not compliant with current seismic codes are the most vulnerable to seismic damage. In that case, seismic prioritization of structures is arguably the best way to adopt preventive measures or to prepare post-disaster management schemes. A method called Rapid Visual Screening (RVS) has therefore been proposed to determine the damage index for different types of structures [1,2]. Sinha and Goyal [3] have provided a very effective and concise discussion as a motivation for a novice reader. The demand for an elementary and rapid vulnerability assessment was identified first in the United States of America, and the first RVS method was proposed in 1988 as "Rapid Visual Screening of Buildings for Potential Seismic Hazards: A Handbook" [4], which was later revised in 2002 to incorporate the latest research advancements in seismic sciences. The approach was then adopted by other countries with modifications and considerations reflecting their local conditions [5]; for instance, the Indian RVS (IITK-GSDMA) [6] or the Philippine RVS [7]. RVS typically involves a walk-down survey to record the seismic parameters of the building under evaluation.

Choice of Building's Damage Inducing Parameters
To proceed with any RVS method, some primary information has to be obtained. Several studies have investigated the usefulness of building characteristics as inputs to seismic vulnerability assessment [15,32,33]. They show that the most useful data, and the parameters used by many of the methods, are in congruence with FEMA 154 [4]: (i) system type, (ii) vertical irregularity, (iii) plan irregularity, (iv) year of construction, and (v) construction quality. Further parameters are considered by Yakut et al. [34] for the vulnerability assessment of structures. Based on the attributes of the damaged structures and the enormous size of the existing building stock, they suggested the following parameters, which were also adopted as the primary evaluation parameters for this research. The following subsections briefly introduce the most critical and significant of the 22 parameters used for the study and for the development of the RVS model. A detailed investigation, including a rigorous discussion of the impact of these components on the observed damage, is provided elsewhere [12,34-36].

System Type
The type of frame action and the load transfer and bearing pattern contribute significantly to the outcome of RVS. For instance, a masonry or load-bearing structure may prove more vulnerable to seismic ground motions than a moment-resisting RC frame [37].

Reinforced Concrete Frame
A framework consisting of horizontal and vertical elements connected through rigid joints is called a reinforced concrete frame structure. These elements, known as beams and columns, are cast monolithically in reinforced concrete. Reinforced concrete frames resist gravity as well as lateral loads, transferring them across the frame elements [38].

Reinforced Concrete Frame with Shear Walls
A shear wall is a rigid vertical component in structures, which can withstand lateral forces in the direction parallel to the plane of the shear wall via bending and shear. Shear walls are specially built for high-rise structures to minimize the earthquake damage to the structure and mainly alleviate lateral sway [39].

Year of Construction
This parameter is typically essential in seismic analysis because it indicates the service time-span of the structure. The seismic performance of a structure depends significantly on its age. Comparative studies clearly indicate that older buildings are more easily affected during an earthquake and may suffer severe damage or collapse, whereas newly constructed RC structures, i.e., moment-resisting frames, tend to have better resistance during such events.

Number of Stories (NS)
A story is the portion of a building between one floor level and the next. The number of stories classifies buildings according to their height. Multistory structures tend to be highly vulnerable during an earthquake and face severe damage risks, whereas low-rise structures are comparatively less vulnerable and typically carry low or moderate damage risks [40].

Ground Floor
The ground floor area of the building is measured at the horizontal level of the ground story, within its overall exterior dimensions, excluding open spaces, balconies, and stairways. In modern constructions, the ground floor is generally occupied by shops or commercial spaces [41].

Total Floor Area
Floor area, sometimes called plan area, is the area of one floor plane. If the building is multistory, the total floor area can be estimated by multiplying the floor area of one story by the number of floors. The purpose of this datum is to determine the occupancy loads, the importance, and the cost or value of the building. It is also an important parameter for estimating the value of the damaged area and assigning a proper retrofitting method.

Overhang Area
The area of heavy projections that extend beyond the building's outermost frame lines is called the overhang area. Overhangs can be dangerous because they are subjected to higher seismic forces during intense ground motions.

Ground and Normal Story Height
When the ground story of a building is significantly taller than the stories above, the piers at the first floor are taller than those at the upper stories, resulting in a soft story. This is considered a severe vertical irregularity. Some structures are characterized by tall story heights with thin walls, which can lead to severe out-of-plane buckling when exposed to lateral load. In addition, the total height of the structure influences its natural period [12,42].

Irregularities
Irregularity in plan and elevation refers to the assessment of the shape and configuration of the structure in plan and elevation view [8]. Irregularities are typically classified as:

Horizontal Plan Irregularity
It has been shown that buildings with regular and straightforward plan configurations, such as rectangular, square, or circular plans, behave effectively in resisting earthquakes. A box-shaped building is more durable than an L- or U-shaped building or a building with wings. Any building with an irregular plan shape can experience heavy torsional moments and twisting motion during ground motions. The following plan irregularities are considered here:
• A1: Torsional irregularity,
• A2: Floor irregularity,
• A3: Discontinuity in plan,
• A4: Non-parallel axes of structural elements.

Vertical Irregularity
The structural deficiency of a building indicated by any irregular shape from an elevation perspective can be defined as vertical irregularity. The presence of step-backs or setbacks and other architectural provisions for aesthetic purposes can make a building highly vulnerable. Buildings constructed on steeply sloping ground, especially in hilly areas, have unequal column heights within the same story, resulting in severe stiffness irregularities. The following vertical irregularities are considered in this study:
• B1: Strength irregularity (weak story),
• B2: Stiffness irregularity (soft story),
• B3: Discontinuity of vertical structural elements.

Number of Continuous Frames in X-direction and Y-direction
This parameter indicates the number of continuous frames in the X and Y directions for the structures.

Normalized Redundancy Score (NRS)
Redundancy indicates the degree of continuity of multiple frame lines which allocate lateral forces for the entire structural system [43].

Soft Story Index (SSI)
A soft story forms in situations where there are fewer partition walls in the ground story than in the story above [44]. The soft story index is defined as the ratio of the height of the first story (H_1) to that of the second story (H_2):

SSI = H_1 / H_2 (1)

Overhang Ratio (OR)
The overhang area (OA) is the area that falls beyond the outermost frame lines on all sides in any floor plan [44]. The ratio of the aggregated overhang area (A_overhang) to the ground story area (A_gf) gives the overhang ratio:

OR = A_overhang / A_gf (2)
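As an illustration of these two indices, a minimal sketch computing SSI and OR is given below; the dimensions used are made-up example values, not data from the Düzce database.

```python
# Minimal sketch: Soft Story Index (SSI) and Overhang Ratio (OR)
# for a hypothetical building. The dimensions below are illustrative only.

def soft_story_index(h_first: float, h_second: float) -> float:
    """SSI = H_1 / H_2: first-story height over second-story height."""
    return h_first / h_second

def overhang_ratio(a_overhang: float, a_ground_floor: float) -> float:
    """OR = A_overhang / A_gf: overhang area over ground story area."""
    return a_overhang / a_ground_floor

ssi = soft_story_index(h_first=4.0, h_second=2.8)             # heights in m
over = overhang_ratio(a_overhang=15.0, a_ground_floor=120.0)  # areas in m^2
print(f"SSI = {ssi:.2f}, OR = {over:.2f}")
```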

Minimum Normalized Lateral Strength Index (MNLSI)
MNLSI indicates the base shear capacity of the critical story [44]. In the calculation of this index, the presence of unreinforced masonry infill walls is considered in addition to the existing columns and structural walls. In doing so, unreinforced masonry infill walls are assumed to carry 10 percent of the shear force that can be carried by a structural wall with the same cross-sectional area.

Minimum Normalized Lateral Stiffness Index (MNLSTFI)
The MNLSTFI indicates the lateral rigidity of the ground story, which is usually the most critical story [44]. If the story height, the boundary conditions of the individual columns, and the properties of the materials used are kept constant, this index is calculated by considering the columns and the structural walls at the ground story.

ML Modelling Approach
This section elaborates on the procedure used to predict the post-earthquake seismic damage vulnerability of RC buildings with an ML technique. ML models gain experience by learning through pre-defined algorithms and are mainly used for classification and regression problems [45]. Figure 1 illustrates the steps of the problem formulation, and the individual stages are described in detail in the following subsections.

Input Dataset
The first stage in designing the ML model is the selection of the input data. The input data here are the independent variables, or feature points. The ML classifier explains the effect that the features have on the outcome.

Classification of Damage Data
SVM is the classification technique that is implemented in this task. The building's damage class is assigned based on the susceptibility to different damage levels depending on the required purpose.

Data Pre-Processing
In statistical data there are three main data types: numeric, categorical, and ordinal. However, ML models can only handle numeric features. To make the model work properly, the categorical and ordinal features are converted into numeric features. The conversion of the different data forms into numerical values is possible using the "pandas" library by creating dummy features, which are set to 1 for observations belonging to a category and 0 for all other observations. The "OneHotEncoder" class in the "scikit-learn" library works in the same way.
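A minimal sketch of this encoding step is shown below; the column name is hypothetical and only illustrates the two options mentioned above (pandas dummy features and scikit-learn's OneHotEncoder).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature: structural system type.
df = pd.DataFrame({"system_type": ["frame", "frame_with_shear_wall", "frame"]})

# Option 1: pandas dummy features (1 for the matching category, 0 otherwise).
dummies = pd.get_dummies(df, columns=["system_type"])

# Option 2: scikit-learn's OneHotEncoder, producing the same 0/1 encoding.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["system_type"]]).toarray()

print(dummies)
print(encoded)
```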
Dataset standardization is a common requirement for many ML classifiers implemented in scikit-learn; performance may suffer if the individual features do not approximately follow a standard normal distribution. Feature scaling is an alternative to standardization; it scales the features to an interval between a minimum and a maximum value, commonly between 0 and 1, or brings the maximum absolute value of each feature to unit size, using classes such as "MinMaxScaler" or "MaxAbsScaler". Scaling adds robustness to very small standard deviations of features and preserves zero entries in sparse data.
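The following sketch illustrates the scaling options named above on a made-up feature matrix (the values are not from the study's dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, StandardScaler

# Made-up feature matrix: three buildings, two numeric features.
X = np.array([[3.0, 120.0],
              [5.0, 450.0],
              [2.0,  80.0]])

# Scale each feature to the [0, 1] interval.
print(MinMaxScaler().fit_transform(X))

# Scale each feature so its maximum absolute value is 1 (keeps zeros intact).
print(MaxAbsScaler().fit_transform(X))

# Standardization: zero mean, unit variance per feature.
print(StandardScaler().fit_transform(X))
```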
Another critical aspect to consider is missing data. In the given task, all the data are valuable because they come from post-earthquake observations, so discarding incomplete records is undesirable. The alternative is to replace the missing values with the mean, median, or most frequent value of the corresponding feature. In this study, any missing data are filled with the mean value of the respective feature.
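A minimal sketch of such mean imputation, assuming a small made-up feature table with missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with missing values (np.nan) in two features.
X = np.array([[4.0,    np.nan],
              [3.2,    150.0],
              [np.nan, 95.0]])

# Replace each missing entry with the mean of its feature (column).
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```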

Selection of Input Parameters
ML is capable of handling multi-parametric problems. Depending on the area of study, the selection of parameters may differ. For the given task, the selected parameters are the structural parameters characterizing the seismic behavior of the buildings after an earthquake. These structural parameters are observed during the field survey.

Splitting of Dataset
The ML algorithm uses a labeled dataset segregated into a training subset and a testing subset. The training subset is used to construct the predictive models, and the testing subset assesses the efficiency of the models. Every data point consists of two attributes: the predictors and the respective class; the predictors are independent variables with labeled categories. The damage class for each structure is determined by the safety and risk associated with the building. The training subset contains the sample buildings' damage scale for learning purposes, while the test subset keeps the damage scale hidden so that the classifier's predictions can be assessed.
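A minimal sketch of such a split using scikit-learn is shown below; the feature matrix X and damage labels y are random placeholders standing in for the 22-parameter building data and the five damage classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 484 buildings, 22 features, 5 damage classes (0-4).
rng = np.random.default_rng(0)
X = rng.normal(size=(484, 22))
y = rng.integers(0, 5, size=484)

# Hold out 20% of the buildings as the test subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```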

Model Selection
SVM, the model selected in this study, was first proposed by Cortes and Vapnik [46]. It belongs to supervised learning and is used for classification, regression, and outlier detection. The purpose of the SVM design is to find a hyperplane in an N-dimensional feature space that cleanly segregates the points of the separate classes. Support vectors are the feature points nearest to the hyperplane, and they determine the hyperplane's location and orientation; the support vectors define the margin of the classifier. Hyperplanes act as decision boundaries that label the feature data: points lying on either side of the hyperplane are assigned to distinct classes. The number of features decides the dimension of the hyperplane.
SVM is a comparatively new learning algorithm, and the primary difference between SVM and most other ML algorithms is that SVM minimizes the structural risk rather than only lowering the classification error on the training data. The model works by segregating the feature points into the classes they belong to using hyperplanes, while keeping the largest possible margin between the classes. A nonlinear mapping transfers the feature points from the original space into a higher-dimensional feature space, where the search for the optimum hyperplane is carried out. Figure 2 shows the mechanism of a linear SVM classifying different classes. For a two-dimensional space, the separator between the two categories is called the discriminator; for an N-dimensional space, the classifier is a hyperplane. It is assumed that the distance between the closest data points of each class and the classifier equals 1 [47].
Let x be the feature vector, where x ∈ R^n; y the class, where y ∈ {1, −1}; w and b the SVM parameters, which are learned from the training set; and (x^(i), y^(i)) the i-th sample of an N-sample training set. The class y^(i) for vector x^(i) is then determined by:

w · x^(i) + b ≥ 1 if y^(i) = 1,  and  w · x^(i) + b ≤ −1 if y^(i) = −1 (3)

In order to segregate the classes of data points, many hyperplanes are possible. The main objective is to find the plane with the maximum margin, i.e., the optimal separation (M) between the data points of each class. Using Equation (3), the optimal margin (M) is given by:

M = 2 / ||w|| (4)

Furthermore, SVMs can be extended to solve multi-class problems using the "one-against-one" approach [48]. The same technique is applied here for the damage classification of the buildings into five damage classes. In the "one-against-one" approach, for n classes, n × (n − 1)/2 classifiers are built, and each one is trained on the data of two classes.
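A minimal sketch of a multi-class SVM built this way with scikit-learn is shown below; SVC applies the one-against-one strategy internally for multi-class targets, and the inputs here are random placeholders rather than the actual survey data.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 484 buildings, 22 features, 5 damage classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(484, 22))
y = rng.integers(0, 5, size=484)

# SVC trains one-against-one binary classifiers internally for multi-class y:
# for 5 classes this amounts to 5 * (5 - 1) / 2 = 10 pairwise classifiers.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
clf.fit(X, y)

# The pairwise decision values confirm the 10 one-vs-one classifiers.
print(clf.decision_function(X[:1]).shape)  # (1, 10)
```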

Evaluating the Performance of Predicted Model
The aim of this section is to evaluate the classification competence of the created model on the unseen samples in the test subset. To enhance the model performance, SVM offers several tunable parameters, such as the kernel, the degree of the kernel function, the scaling parameter, and the cost of constraint violation (C).

Model Utilization
The satisfactory model is then used to evaluate unseen examples. The model accuracy depends on many factors, such as the quantity and quality of the input data, the selection of features, and the number of outliers. In this study, a model achieving at least 50% overall accuracy is considered useful for the data assessment.

Methodology and Database
In this study, the database was collected from the archival material of SERU (Structural Engineering Research Unit) [49]. It was recorded during post-earthquake damage evaluations conducted after the 1999 Düzce earthquake in Turkey and contains detailed information on 484 selected buildings affected by various degrees of damage. The dataset contains twenty-two feature parameters, such as system type, year of construction, number of stories, ground floor area, total floor area, overhang area, ground story height, normal story height, irregularities in plan and elevation, X- and Y-direction frames, MNLSTFI, MNLSI, NRS, SSI, and OR. Table 1 states the damage grade and the corresponding damage scale used for the risk assessment. Figure 3 shows the distribution of buildings with respect to their damage categories. The vulnerability of the RC buildings is measured by anticipating the damage caused after an earthquake.

Data Pre-Processing
The dataset contained all numeric features, but many of them had missing data points. The SimpleImputer class from scikit-learn handles this well by replacing the missing values with the mean or median of the feature; for the task dataset, the unavailable data points were replaced using the "mean" strategy. The feature parameter data were then standardized in order to bring the complete input vector of the dataset to a common scale without distorting the differences in the ranges of values. Figure 4 shows the distribution of data points for each feature parameter. These parameters, as introduced previously, are ground floor area, irregularities (here Irr-A1 to Irr-B3), MNLSI, MNLSTFI, number of stories, OR, OA, NRS, SSI, ground and normal story heights, system type, total floor area, frames in the X and Y directions, and year of construction. It is clearly observable that not all input variables are normally distributed. Most parameters, such as ground floor area, MNLSI, OR, OA, ground story height, total floor area, year of construction, X-direction frames, Y-direction frames, normal story height, number of stories, and SSI, approximately follow a Gaussian distribution, whereas the irregularities Irr-A2, Irr-A3, Irr-A4, Irr-B1, and Irr-B3 are skewed to the right and the remaining ones are skewed to the left.
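A sketch of this pre-processing chain (mean imputation followed by standardization), assuming the building features are held in a plain NumPy array, might look as follows:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for the 484 x 22 feature matrix with some missing entries.
rng = np.random.default_rng(2)
X = rng.normal(size=(484, 22))
X[rng.random(X.shape) < 0.05] = np.nan  # introduce ~5% missing values

# Mean imputation followed by standardization, chained in one pipeline.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_ready = preprocess.fit_transform(X)
print(np.isnan(X_ready).any(), X_ready.mean(axis=0).round(2)[:3])
```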

Splitting of Dataset
As mentioned above, the dataset is divided into training and test subsets. As good practice, 80% of the data is used for training and 20% for the test subset. The training set contains the known output, and the classifier gains experience by learning on these data so that it can classify further unseen examples. The test subset is used to evaluate the predictive performance of the model.

SVM: Feature Selection and Kernels
Feature selection is important to maximize the performance of the model in terms of accuracy. By tuning certain parameters, the model accuracy can be improved. The major parameters include:
• C: The misclassification or error term tells the SVM optimization how much error is tolerable. When C is high, the optimizer classifies all the training points correctly, but there is often a risk of over-fitting. In contrast, when C is low, the optimizer looks for a larger-margin separating hyperplane, even though that hyperplane misclassifies more points.
• Gamma: Gamma is specific to the RBF kernel and does not apply to the linear or polynomial kernel. The gamma parameter characterizes how far the influence of a single training sample reaches, where a lower gamma means "far" and a higher gamma means "close-by". Gamma thus decides the curvature of the decision boundary (see the sketch after this list).
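As referenced above, a minimal sketch of how these tunable parameters appear in a scikit-learn SVM; the values are illustrative, not the tuned ones from this study.

```python
from sklearn.svm import SVC

# kernel: 'linear', 'poly', 'rbf', or 'sigmoid'
# C: cost of constraint violation (high C -> fewer training errors, risk of over-fitting)
# gamma: reach of a single training sample, RBF kernel only (low -> far, high -> close-by)
# degree: only used by the polynomial kernel
clf = SVC(kernel="rbf", C=10.0, gamma=0.01)
```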

Result and Discussion
This task covers four significant kernels: the polynomial kernel, RBF kernel, sigmoid kernel, and linear kernel. The SVM model is trained with each of these kernels and generates a predicted output. To train the model optimally and attain the best possible accuracy, the model is trained ten times with each algorithm, and the highest accuracy among all runs is selected.
The classifier is evaluated with each kernel, and various measures such as accuracy, precision, and recall are calculated. Table 2 shows the percentage accuracy achieved by each kernel. The RBF, sigmoid, and linear kernels have an identical accuracy of 45%, whereas the polynomial kernel performs worse, with an accuracy of 26%. All the kernels show an accuracy of less than 50%, which does not make for a good model. To improve accuracy, the model hyper-parameters, such as C, gamma, and the kernel, are tuned. Hyper-parameters are not learned directly; instead, they are passed as arguments to the estimator class's constructor. Grid search in scikit-learn helps to tune the hyper-parameters by evaluating the SVM model for every combination of inputs specified in the grid. Grid search returns the best estimator; for this task, the resulting best estimator showed an accuracy of 52% using the RBF kernel, whereas the previous study by Tesfamariam [14] in this area of research reported an accuracy of 45%. Figure 5 presents the confusion matrix, which visualizes the performance of the classification model on the test set, for which the correct labels are known. Based on the number of classes, the representation is a [5 × 5] matrix. The diagonal entries of the confusion matrix give the number of occurrences in which the model predicted the class correctly; the off-diagonal entries are misclassified cases. Therefore, large diagonal numbers are desirable for an accurate model. The performance of the model is also visualized using the Receiver Operating Characteristic (ROC) curve. The ROC curve is a graphical representation of the diagnostic ability of a binary classifier. For the classification of multiple classes, the problem is broken down into all pair-wise comparisons, and the area under the curve is calculated for each class pair (i.e., class O vs. class L, class L vs. class M, etc.). The average of the pair-wise AUC values (area under the ROC curve) reflects the overall quality of the classification. Figure 6 shows the ROC curves for the test data of the classifier. The area under the ROC curve (AUC) shows a decent value, which means the classifier fits the test set well. The micro-average and macro-average ROC curve areas are 0.75 and 0.70, respectively. The graph shows that the curve for Class 3 lies mostly below the others, and hence the accuracy for Class 3 is also lower. Class 0 and Class 1 attain the best ROC values, meaning that the maximum number of buildings in these categories is correctly classified.
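A sketch of the grid search and evaluation workflow described above, using scikit-learn with placeholder data and an illustrative parameter grid (the exact grid used in the study is not specified here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC

# Placeholder data standing in for the 484 buildings x 22 features dataset.
rng = np.random.default_rng(3)
X = rng.normal(size=(484, 22))
y = rng.integers(0, 5, size=484)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative hyper-parameter grid over kernel, C, and gamma.
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the best estimator on the held-out test subset.
y_pred = search.best_estimator_.predict(X_test)
print(search.best_params_)
print(confusion_matrix(y_test, y_pred))       # 5 x 5 matrix, as in Figure 5
print(classification_report(y_test, y_pred))  # accuracy, precision, recall
```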

Conclusions
Damage to buildings caused by recent earthquakes clearly shows the need to identify and retrofit vulnerable buildings. However, it is a monumental challenge to strengthen all existing buildings to the required design level. Instead, existing buildings can be classified according to their associated risk factor. Analyzing the expected damage and its associated uncertainty is essential for risk estimation and risk management. The study considered 22 different features as inputs to the method, which include: system type, year of construction, ground floor area, total floor area, overhang area, ground and normal story height, vertical and plan irregularities, X- and Y-direction frames, number of stories, MNLSTFI, MNLSI, NRS, SSI, and OR. The performance of the model classifier depends on the selection of the input parameters, the classification technique, and the dataset. A classifier's performance will degrade when the feature variables are not sufficient to discriminate the output classes explicitly or when outliers and wrong variables are present in the dataset.
The SVM method provided a good damage classification of the buildings according to their respective damage classes. The technique could be used to support strategic risk management decisions and risk assessment for earthquake-prone buildings prior to events. The results showed an accuracy of 52% when using all 22 parameters, which is an acceptable rate for the sample size used. In the future, the model accuracy should be improved by training the model with the most useful parameters. In addition, the use of the k-fold cross-validation technique is advised to verify the performance of the model classifier.