Terrain Proxy-Based Site Classiﬁcation for Seismic Zonation in North Korea within a Geospatial Data-Driven Workﬂow

: Numerous seismic activities occur in North Korea. However, it is difﬁcult to perform seismic hazard assessment and obtain zonal data in the Korean Peninsula, including North Korea, when applying parametric or nonparametric methods. Remote sensing can be implemented for soil characterization or spatial zonation studies on irregular, surﬁcial, and subsurface systems of inaccessible areas. Herein, a data-driven workﬂow for extracting the principal features using a digital terrain model (DTM) is proposed. In addition, geospatial grid information containing terrain features and the average shear wave velocity in the top 30 m of the subsurface ( V S 30 ) are employed using geostatistical interpolation methods; machine learning (ML)-based regression models were optimized and V S 30 -based seismic zonation in the test areas in North Korea were forecasted. The interrelationships between V S 30 and terrain proxy (elevation, slope, and landform class) in the training area in South Korea were veriﬁed to deﬁne the input layer in regression models. The landform class represents a new proxy of V S 30 and was subgrouped according to the correlation with grid-based V S 30 . The geospatial grid information was generated via the optimum geostatistical interpolation method (i.e., sequential Gaussian simulation (SGS)). The best-ﬁtting model among four ML methods was determined by evaluating cost function-based prediction performance, performing uncertainty analysis for the empirical correlations of V S 30 , and studying spatial correspondence with the borehole-based V S 30 map. Subsequently, the best-ﬁtting regression models were designed by training the geospatial grid in South Korea. Then, DTM and its terrain features were constructed along with V S 30 maps for three major cities (Pyongyang, Kaesong, and Nampo) in North Korea. A similar distribution of the V S 30 grid obtained using SGS was shown in the multilayer perceptron-based V S 30 map. sites: Pyongyang, Kaesong, and Nampo. The best-ﬁtting models of LR, K-NN, SVR, and MLP were determined based on


Introduction
Numerous seismic activities such as earthquakes, collapses, and explosions have been recorded in North Korea. However, assessing the seismic hazards is challenging because of the lack of and/or discrepancies in the data from public seismic stations, damage inventories of earthquakes, and site investigation results. Historically, the largest earthquake (M W 6.2;19 March 1952) in the Korean Peninsula occurred near Pyongyang, North Korea [1]. Moreover, nuclear tests between 2006 and 2017 triggered events such as mine collapses [2]. In North Korea, the Phyongnam and Taebaeksan Basins are two major Paleozoic basins in the Rangnim massif range and the Gyeonggi and Yeongnam massifs, where seismic amplification and ground shaking tend to occur [3]. Therefore, seismic site effects due to the amplification of seismic waves have been suspected to cause aggravated earthquake hazards in the basin area.
The damage caused by earthquakes is enhanced by the reflection and dispersion of seismic waves near the surface or shallow subsurface [4]. In city-scale areas with high spatial variability of terrain or geology, certain wave propagation phenomena related to local amplification have been investigated via numerical simulations and seismological observations [5]. In nationwide areas comprising sedimentary basins, valleys, and mountain ridges or tops, empirical cross-correlations with terrain, geology, and geophysical characteristics have been estimated, while site response indexes have been developed by dividing these areas via macro-or micro-seismic zonation.
Many counties have seismic codes to compute the seismic site effects at local sites that are based on the observed average shear wave velocity (V S ) in the top 30 m of the subsurface (V S30 ) [6]. However, in the absence of site surface data or seismological observations, the validity of V S30 in tectonically active complex territories, particularly in geological contexts (such as the existence of inversions in the V S profiles), has been challenged by certain researchers [7][8][9]. To indirectly predict V S30 distribution, Ahdi et al. [10] developed V S30 maps that included local proxy types, namely topographical slope and geomorphological landscape containing high spatial variability. The terrain features (curvature and alluvial fan) influence the subsurface strata and their stiffness. Accordingly, the geomorphological, landform, and land cover features should be incorporated into the subsoil investigation results for geotechnical or geology engineering to visualize the V S30 map. The digital elevation model (DEM) and digital terrain model (DTM) are useful for developing V S30 maps based on geographic information system (GIS)-based regression analysis [11]. DEMs of various resolutions for several regions across the United States were used to examine the use of optimum resolution DEMs in developing V S30 maps [12,13]. A method of generating the automatic topography classification using a 50 m DEM was developed using multiple linear regression analysis of the logarithm correlations between the observed V S30 and the topographic attributes reported by Iwahashi and Pike [14]. However, seismic amplification associated with proxies is too nuanced for determining a single synthetic parameter such as V S30 [15,16].
Therefore, to integrate the proxies for determining V S30 and other site response parameters, a data-driven approach including a random effect procedure, based on artificial intelligence (AI), is essential. This data-driven approach is suitable for the identification of geomechanical risks and facilitates the avoidance of pitfalls (e.g., low correlation coefficient); hence, it can be standardized into a widely consistent workflow [17]. AI technology uses spatial, temporal, and anomalous features of systems for classification, regression, and clustering. In addition, it has been used extensively owing to its efficiency in producing useful information and finding hidden variables in various datasets [18]. Machine learning (ML) models have become increasingly prevalent in the simulation of the spatial distribution of soil properties [19][20][21]. The ML-based subsurface or soil texture mapping includes the use of artificial neural networks, boosted regression trees, random forests, and artificial neuro-fuzzy inference systems [22,23]. For early disaster response to floods, landslides, and earthquake risks, a combination of the ML workflow and geospatial information such as DEM [24], synthetic aperture radar (SAR) imagery [25], and area intensity [26] was proposed. Moreover, following the incorporation of stochastic methods and geological hazard assessment with nonlinear data handling by using varying scales/types of sources, the application of data mining and ML algorithms increased [27,28]. The bestfitting model using only surficial proxy presents a spatial distribution of V S30 based on the borehole measurements.
In this study, the V S30 of major cities in North Korea (Pyongyang, Kaesong, and Nampo) were mapped. In the training areas in South Korea, the borehole-based V S30 values and DTM-based terrain characteristics were identified as labels and features, respectively; these parameters were constructed as geospatial grid information for input into the regression models. The elevation, slope, and topographic position index (TPI)-based landform classes were extracted using the DTM. The V S30 grids from the borehole dataset were interpolated using geostatistical methods with the lowest residuals. The best-fitting model for V S30 regression, with the grids incorporating V S30 and terrain-based features, visualized the V S30 map of the test areas in North Korea. The novelty of the proposed methodology for developing V S30 is the ML-based multivariate regression that integrates a large number of grids as interpreted from the DTM and borehole datasets.

Study Area
Geographically, the Korean peninsula is located on the eastern borders (33)(34)(35)(36)(37)(38)(39)(40)(41)(42)(43) • N and 124-132 • E) of the Eurasian continent. According to Sun et al. [29], different geologic strata from the Pre-Cambrian period to the Cenozoic era have been distributed throughout the peninsula. Most regions consist of plutonic and metamorphic rocks, mainly comprising Rangnim, Gyeonggi, and Yeongnam massifs [30] as well as a partial sedimentary rock distribution. The Choogaryong rift valley clearly divides geology into two regions in the north and south, which is mostly similar with the Military Demarcation Line in the Korean Peninsula. In South Korea, various geological rock types (e.g., granite and Mesozoic sedimentary rock) are distributed. However, North Korea is mostly formed by Pre-Cambrian metamorphic rocks, Paleozoic sedimentary rocks, and Quaternary volcanic rocks. From the geomorphological perspective, the Korean Peninsula is defined as an old landform due to its continuous erosion. Steep and high mountains are commonly distributed in North Korea, whereas low mountains and alluvial plains are common in South Korea. Nationally, geological and geomorphological differences are observed between South and North Korea; however, local similarities are observed in certain areas at a city level scale as the shallow soil layers above the engineering bedrock mostly influence the effects of the seismic site. Geographical, geomorphological, and geological similarities during the formation of the layers should be determined prior to building datasets. Therefore, the study area comprised one training area in South Korea and three test areas in North Korea for the development of the best-fitting model for site classification by considering the similarities shown in Figure 1. The training area in South Korea included Seoul city, part of Incheon city, and Gyeonggi Province (Figure 1a). Seoul spans 605.25 km 2 and is approximately 15 km long; it is bisected by the Han River into the northern and southern halves. The city is bordered by eight mountains and comprises more lowlands in the Han River plains and in the west. Incheon, bordering Seoul and Gyeonggi Province to the east, is located in northwestern South Korea. Gyeonggi Province, which encompasses Seoul, covers 10% of the territory (10,171 km 2 ) in South Korea. Hence, the training area between the east longitude of 126 • 39 N and 127 • 0 N and north latitude of 37 • 27 N and 37 • 31 12"N is defined by considering the spatial proximity and connectivity of the largest metropolitan area in the northwestern region of South Korea. In the training area, the geology characteristic is largely granite, wherein the bedrock is widely intruded into the Jurassic Period of Mesozoic era [31]. In the northwest direction, granite is present in Seoul.
For V S30 mapping in North Korea, sections of Pyongyang, Kaesong, and Nampo were selected as the test areas based on the archived DTM. The minimum distances between the training and test areas are 176.3, 42, and 172.5 km for Pyongyang, Kaesong, and Nampo, respectively. Pyongyang is located in the west-central region of North Korea in a flat plain, which is one of the two largest plains (50 km east of Korea Bay) on the western coast of the Korean Peninsula (Figure 1b) [32]. Kaesong is the southernmost city of North Korea and is located close to the border of South Korea. The city center encompasses the significantly smaller mountainous northeastern region (average elevation of 103 m), and most areas consist of low hills with a height of less than 100 m [33]. Nampo is a seaport and lies on the northern shore of the Taedong River at a distance of 15 km east from the river's mouth [26,34]. The Imjingang Belt lies between the Rangnim and Gyeonggi massifs; therefore, were grouped the training area (Seoul, Incheon Gyeonggi Province) and Kaesong [30].
With respect to geography, the Pyongyang area located on the Taedong River has a plain area that is similar to that of Seoul. The mountainous region in the Kaesong area resembles the southern and northern parts of Seoul. The Nampo area is a seaport and can be regarded as the western coast of Incheon. The training and test areas are located For VS30 mapping in North Korea, sections of Pyongyang, Kaesong, and Nampo were selected as the test areas based on the archived DTM. The minimum distances between the training and test areas are 176.3, 42, and 172.5 km for Pyongyang, Kaesong, and Nampo, respectively. Pyongyang is located in the west-central region of North Korea in a flat plain, which is one of the two largest plains (50 km east of Korea Bay) on the western coast of the Korean Peninsula ( Figure 1b) [32]. Kaesong is the southernmost city of North Korea and is located close to the border of South Korea. The city center encompasses the significantly smaller mountainous northeastern region (average elevation of 103 m), and most areas consist of low hills with a height of less than 100 m [33]. Nampo is a seaport and lies on the northern shore of the Taedong River at a distance of 15 km east from the river's mouth [26,34]. The Imjingang Belt lies between the Rangnim and Gyeonggi massifs; therefore, were grouped the training area (Seoul, Incheon Gyeonggi Province) and Kaesong [30].
With respect to geography, the Pyongyang area located on the Taedong River has a plain area that is similar to that of Seoul. The mountainous region in the Kaesong area resembles the southern and northern parts of Seoul. The Nampo area is a seaport and can be regarded as the western coast of Incheon. The training and test areas are located in the same Paleozoic basins (i.e., Phyongnam basin) and exhibit large distributions of Mesozoic igneous rocks. The shallow soil layer was mostly formed as Quaternary alluvial soil and weathered residual soil by fluvial or weathering processes. Despite the uncertainty in literature-based geological investigations in North Korea owing to the huge knowledge gap, the formation similarities between the shallow subsurface in the training and test areas were identified.

Digital Terrain Model
DTMs in the training area of South Korea were archived from NASA's Shuttle Radar Topography Mission (SRTM, http://www2.jpl.nasa.gov/srtm/instr.htm, accessed date: 3 March 2020) and other local radar mission projects to render surficial terrain information for the target areas. Despite the archiving of the terrain information in the test areas, North

Digital Terrain Model
DTMs in the training area of South Korea were archived from NASA's Shuttle Radar Topography Mission (SRTM, http://www2.jpl.nasa.gov/srtm/instr.htm, accessed date: 3 March 2020) and other local radar mission projects to render surficial terrain information for the target areas. Despite the archiving of the terrain information in the test areas, North Korea is inaccessible due to military security. The global DTM was acquired by the Advanced Land Observing Satellite from the Japan Aerospace Exploration Agency.
To integrate the boreholes and DEM precisely, the transverse Mercator spatial coordinate system in the DEM was simultaneously transformed based on the geographic coordinate system (latitude and longitude). The DTM with a 5 m × 5 m resolution (0.15 arcsec) was available; subsequently, interferometry was used to gather topographic (elevation) data. The horizontal and vertical accuracies of the root mean squared error (RMSE) were 5 m, and global site condition maps for deriving the topographic slope data as a proxy for the site amplification were collected from the SRTM 30 and 3 arcsec [12,13]. Stewart et al. [35] developed geologic and terrain-based site condition maps for California by using a geological map and 1, 3, 9, and 30 arcsec DEMs. Conversely, high-resolution topographic data for developing local site condition maps are essential, particularly for North Korea, where direct land surveys are impossible. The slope was extracted via remotely sensed applications using DTMs based on DTM-derived longitudinal profile analysis [36] implemented in ESRI GIS products. Figure 2 shows the DTM-based elevation and slope distribution maps. topographic data for developing local site condition maps are essential, particularl North Korea, where direct land surveys are impossible. The slope was extracted vi motely sensed applications using DTMs based on DTM-derived longitudinal profile ysis [36] implemented in ESRI GIS products. Figure 2 shows the DTM-based elevation slope distribution maps.  The maximum elevation is approximately 852 m in the northern and southern mountainous regions of the training areas, and the average standard deviation is 6.2 m (Figure 2a). The highest slope is approximately 72 • . The maximum elevations in Pyongyang, Kaesong, and Nampo (the test areas) are approximately 363, 478, and 501 m, respectively, and their maximum slopes are approximately 63.3, 90, and 90 • , respectively. In the Pyongyang area, the elevation and slope along the watershed were approximately 0-25 m and 0-10 • , respectively. In the Kaesong area, the surrounding mountains, spatially occupying approximately 65% of the area, have an elevation of 110-478 m and a slope of 30-70 • , excluding the central basin. In the Nampo area, most of the plain (70% area) exhibits an elevation of 0 to 100 m and a slope of 0 to 8 • , except for certain mountainous areas in the northern region.

Geotechnical Datasets
In the test area, the borehole log datasets (31,033 points), including the engineering strata and standard penetration test (SPT) results, were collected from the Korea Civil Engineering and Building Technology Institute's Geo-Info system. Additionally, site visits for 350 locations were carried out to gather surface geo-knowledge information, mainly in regions that lacked borehole data. The site survey was conducted at locations where borehole data were not obtained using cone penetration, a global positioning system, and other techniques to validate the rock outcrop or cut slope and cross-check with the engineering strata from the adjacent borehole data [37]. The engineering strata were classified into five groups: fill layer, alluvial soil, weathered soil, weathered rock, and bedrock (soft rock).
In situ geophysical measurement of V S such as downhole, cross-hole, and spectral analysis of surface waves is the preferred approach for calculating the small-strain shear properties [25]. It was also used in the site classification systems and ground motion equations. When the direct measurement of the V S profile is not determined, the use of multiple indirect methods (i.e., penetration-based V S correlations) is recommended; these methods require selecting a V S30 design according to the seismic design codes in many countries [38]. The site-specific correlations between the V S and penetration tests, such as the cone penetration test, SPT, and Becker penetration test, have been obtained previously [39][40][41][42][43]. However, the correlations exhibit certain limitations: (1) low coefficient of determination values, (2) local uncertainty based on low quality data, and (3) disregard of the characteristic deformation behaviors of soils.
In this study, the direct V S profiles were not stored in the Geo-Info system, and large-scale SPT results were yielded. SPT, detailed in ASTM D 1586 [44], provides penetration resistance related to the blow count (n value), which is widely used to determine engineering properties and soil design parameters. For uncertainty analysis with regard to the empirical relation of SPT-N versus V S , the V S30 from each borehole location was computed using the representative transformation formulas for soil types (Table 1). If the n value was not registered above the engineering bedrock, the representative V S values were calculated to be 190 m/s for the fill layer, 280 m/s for alluvial soil, 350 m/s for weathered soil, 650 m/s for weathered rock, and 1300 m/s for bedrock [45]. Correlation #1 [46], which was reasonably consistent with correlations from previous studies in other countries, was adopted in the domestic seismic design [4,47] and applied as the preferred approach. Correlation #2 [38] was recommended for N 60 (normalized to 60% energy ratio), effective stress (σ v ), and the geological era, and applied in the Pacific Earthquake Engineering Center (PEER), USA. Correlation #3 was derived by using N 60 that included previous correlations [42]. The performance of ML models was evaluated depending on the N versus V S correlations. Table 1. N-V S correlation equations (modified after related studies by Sun et al. [46], Wair et al. [38], and Hasancebi and Ulsay [42]).

. Topographic Position Index and Land Formation
A strong correlation is observed between the topographic position and physical and biological processes that influence landscape, such as soil erosion and deposition, hydrological balance, wind exposure, and cold air drainage [48]. The TPI is used to compare the elevation of each cell with the average elevation of the neighborhood surrounding each cell to classify the landforms using the DTM or DEM [48][49][50]. In the center of the local window, the local mean elevation is subtracted and calculated via algorithm analysis using a geographic information system (GIS) [51]. The algorithm consists of a local neighborhood and a central point, representing the vertical position in length above or below the mean elevation of the neighborhood as follows: where Z 0 is the elevation of the model point under elevation, and Z n is the elevation of the cell within the local window. Neighborhood sizes of 25 and 500 m were used for small and large elements of landform classes [52]. The TPI contributes to the classification of the DEM into 10 landform categories with a grid threshold TPI (25 or 500) ( Table 2). TPI values are categorized into pitch positions depending on the pitch and how intense they are at each point. TPI values above a certain level may represent mountain tops or hilltops, while those below the threshold could signify bottoms of the valleys or depressions. TPI values of approximately 0 can be defined as plains (when the path is above 0) or midslope ridges (if the slope is above a certain threshold). For a better categorization of the landform class appropriate for the terrain proxy, the grouped classes are presented in Table 2.

Geospatial Interpolation for Surficial and Subsurface Grid Building
The geotechnical datasets, including the borehole data and results gathered from on-site studies, mainly comprised building lots, roads, and railroads. Grid generation, in conjunction with engineering strata and the SPT-N value based on geostatistical interpolation, is essential for comparing the geotechnical properties with the DEM-based terrain features under the same grid (or resolution). Kriging is a representative interpolation method for smoothing out the inherent distribution of the original dataset. However, in practice, the variance of the estimates is lower and, therefore, on average, high values are underestimated and low values are overestimated. Conditional simulation is an efficient method of resolving such limitations of the kriging estimates. Table 2. Definition of Topographic Position Index (TPI)-based landform classes and grouped landform classes (modified after Weiss [48]).
The 5 m × 5 m resolution of V S was established by interpolating the borehole-based V S value for unit depth (i.e., 1 m) and visualizing along a two-and-a-half-dimensional (2.5D) geospatial grid. The selection procedure for the optimized interpolation method was proposed for borehole data using a cross-validation technique [55,56]. Therefore, the optimal interpolation method with the lowest prediction error (residuals) was applied.

Multivariate Site Response Parameters and Classification System
Seismic zonation based on the multivariate site response parameters was assessed with a spatially assigned grid including elevation, slope, TPI-based landform class, and V S . V S profiles can be obtained using downhole, cross-hole, and surface wave methods from different sources for general estimates. Conventional parameters are primarily used to quantify the seismic site effect and define the criteria for site classification. The site classification scheme was applied to V S30 in the US seismic code after 1994 as the main categorization parameter, and the average V S up to 30 m below ground level [57] can be calculated as follows: where D i and V Si represent the thickness and the V S of the i th layer, respectively. The engineering bedrock (H), also known as the analytical brief index, only represents the geometrical features without considering the stiffness of the V S profile, if only the engineering strata are given. The natural site period results from the sharp contrast between the soil and engineering bedrock and is measured using strata thickness and V S . V S topographical calculations also positively correlate with DEMs, geological maps, and other geomorphic indicators [12,13,58]. The scientific concept of a terrain proxy criterion was established in a metropolitan area in South Korea, including mountains, valleys, plains, and sites with diverse geological features [37,45,59]. Empirical correlations with site response parameters were suggested in South Korea with respect to site-specific geospatial conditions to regional zoning of seismic site effects ( Table 3). As V S30 decreases, the stiffness of the strata weakens, and the elevation and slope decrease. The best-fitting regression algorithms were compared with approaches focused on cost functions to predict the V S30 value in this analysis. The V S30 value as the label attribute and other original features are included in the grid properties: elevation, slope, and landform class. The elevation and slope are subgrouped with the landform class for predicting the V S30 and validation of the best-fitting models. It is imperative to preprocess the data before inputting them into regression models. There are five basic stages of data preprocessing: (1) handling missing values, (2) scaling, (3) one-hot encoding, (4) feature selection, and (5) splitting the dataset into training and test datasets [21]. In this study, the input attributes of the training area in Seoul and Incheon were defined as the training datasets. After preprocessing, the grouped attributes with landform classes in the training datasets were input into the fitted regression models. V S30 was predicted using the test datasets and terrain information in the three test areas in North Korea.
Given the different numbers of attributes inputted into the training datasets, a resample method, such as K-fold cross-validation, was required to determine the best-fitting model. K-fold cross-validation removes the convergence of the K-fold data partition and yields K. Each K-fold can be used for end-of-line testing, and all other folds are used simultaneously for training. In this study, 10 folds are generated for the subsurface grid in the training area. Prior to splitting the grids, shuffles and a robust pseudorandom number generator, denoted as Mersenne Twister [61] in Python, are applied. We designed four representative regression algorithms of AI: logistic regression (LR), neighbors K-nearest (K-NN), support vector regression (SVR), and multilayer perceptron (MLP). The models were trained using each algorithm, while hyperparameters were incorporated and optimized model outputs on the K-fold validation dataset were contrasted. In this study, the top-performing model pipeline was determined based on the Auto-Sklearn [62], which is an open-source library for data transformation and machine learning algorithms based on the Bayesian optimization search procedures [63] in Python.
The classical statistical approach used for binary classification is LR and has been adopted as a simple ML model. The LR model uses a logistic feature to compress the output of a linear equation between 0 and 1, rather than fitting a straight line or hyperplane. As linear regression assumes that the data are linear, LR uses a sigmoid to model the data.
For the K-NN procedure, the predicted values for the neighboring measurements of the variables were obtained as weighted averages. One of the main challenges of this approach is the need to choose an ad hoc similarity metric, particularly for heterogeneous datasets of different types and sizes that exhibit varying interrelationships between the extracted features [64]. The proximity to a certain spot in a high-dimensional space is scarce and leads to significant variety. K-NN regression is a nonparametric method that intuitively approximates the relations between the independent and continuous effects by integrating observations in the same neighborhood. K-NN regression uses distance functions and an important and less concerned approach for producing K-NN rankings (e.g., Euclidean distance, Manhattan distance, and Minkowski distance). Manhattan distance (denoted as block distance or taxicab geometry), which is the distance (d) between two points (x i and y i ) at right angles to the axes, has been calculated as follows: SVR is a linear or nonlinear regression method based on the concept of support vector machines (SVMs). When formulating SVMs as convex optimization problems [65][66][67], SVM resolves binary classification problems. The SVR methodology uses linear quadratic programming methods to handle data in high-dimensional space [68]. We also sought to reduce the LR error rate. During SVR, we attempted to match the error under any threshold. The former condition generates the objective function in equation (Min 1 2 w 2 ) where the size of the normal vector to the surface is approximated.
MLP is a static neural structure consisting of layers that transmit and transfer information through synaptic links represented by weight adjustment. The MLP has also been used as a regression approximation tool. A conversion function is used to transfer the weighted number of inputs and the terms of bias to the activating step [69,70]. A feed-forward neural network is an artificial and non-cyclic neural network. Every perception is completely linked to all the nodes in a single layer. The input, hidden, and output layers are usually composed of MLP modeling. Hyperparameter tuning was performed for the best-fitting model of the MLP classifier by adjusting numerous parameters such as the number of hidden layers, number of nodes of the hidden layer, activation function, optimizer, number of epochs, batch size, and learning rate. In particular, the process of finding the optimal network was applied, namely grid search; meanwhile, various cases in the hyperparameter were altered. In this analysis, two hidden layers in 80 nodes were linked to the link layers.
The input values were weighted and generated according to the activation function of the layer above [71]. As an activation function, rectified linear units (ReLU) were used for the hidden layers and the Adam optimizer was set up. If the input was less than 0 and the raw output of ReLU was different, the output was considered to be 0. The random state and solver are eight, and the stochastic gradient descent, respectively [70].

Topographic Position Index-Based Classification of Landforms
The TPI was computed, and its grid-based landform class was determined based on the relative height difference and slope inclination (Figure 3). The zonal proportion based on the occupied grid was determined against 10 TPI classes for each training and test area (Figure 4). TPI-based landform maps support the site-specific characterization of topographic site effects on seismic waves to induce important gradients during the ground acceleration in grids while elucidating their relationship with V S30 .
In the training area, the plain (class 5) that accounted for approximately 33% of the total area was mainly distributed along the riverside and reclaimed land. The widest landform was formed by the U-shaped valleys (class 4) that accounted for approximately 35% of the test area; these valleys were mainly observed on the northern hillside and lower slopes. The upper slopes (class 7) and open slopes (class 6) were regionalized in the northern and southern mountainous areas. The U-shaped valleys and upper slopes accounted for approximately 32% and 24%, respectively, of the test area in Pyongyang. The partial river basin (plain, class 5) occupying 17% of the area was classified as a flat relief or lowland along the Taedong River. The upper slope (class 6), which accounted for approximately 25% of the test area in Pyongyang, was also distributed and blended with the plain area of an urban district. The test area of Kaesong exhibited a major landform that was classified as U-shaped valleys (class 4), which accounted for approximately 35% of the total test area; these valleys comprised most of the urban and industrial zones; a plain area that accounted for only 7% of the test area in Pyongyang was also identified. For the test area in Nampo, the U-shaped valleys (class 4) accounted for approximately 28% of the northern mountainous region. The plain (class 5), excluding the sea area, accounted for approximately 28% of the test area and was distributed in the blended area with upper slopes (class 7) that accounted for 18% of the test area.
The test area in Pyongyang was relatively similar to the training area when compared to the U-shaped valleys and plain area. For the U-shaped valleys, the test area in Kaesong had a terrain characteristic resembling that of the northern mountainous area in the training area. Kaesong exhibited more terrain features (classes 1, 2, 3, 8, 9, and 10) of highland regions compared to the others. The test area in Nampo was similar to the northwestern area in the test area with regard to only the plain area.
The TPI was computed, and its grid-based landform class was determined based on the relative height difference and slope inclination (Figure 3). The zonal proportion based on the occupied grid was determined against 10 TPI classes for each training and test area (Figure 4). TPI-based landform maps support the site-specific characterization of topographic site effects on seismic waves to induce important gradients during the ground acceleration in grids while elucidating their relationship with VS30.

Topographic Position Index-Based Classification of Landforms
The TPI was computed, and its grid-based landform class was determined based on the relative height difference and slope inclination (Figure 3). The zonal proportion based on the occupied grid was determined against 10 TPI classes for each training and test area (Figure 4). TPI-based landform maps support the site-specific characterization of topographic site effects on seismic waves to induce important gradients during the ground acceleration in grids while elucidating their relationship with VS30.

Geospatial Grids Assigned with V S30 and in Test Area
Geospatial grids (5 m × 5 m resolution) assigned with V S30 (Figure 5b-j) were developed using the borehole datasets (Figure 5a), based on the proposed geostatistical method (IDW, SK, OK, UK, EBK, and SGS). V S was transformed from the SPT-N value using three correlations (Table 1), and N-V S correlation #1 was used for selecting the optimized interpolation method. The predicted V S30 was generally high when focusing on the outside mountainous region of the administrative divisions of Seoul. A deep stiff soil with a V S30 value lower than 360 m/s was found to be distributed along the Han River and the southern part of Incheon, including the Incheon International Airport. On the V S30 maps in the test area, there were minor spatial variations according to the interpolation methods. sequential Gaussian simulation-5th realization; (h) sequential Gaussian simulation-50th realization; (i) sequential Gaussian simulation-100th realization; (j) sequential Gaussian simulation-E-type. VS30 was calculated using VS that was transformed from the SPT-N value by using correlation #1 ( Table 1). The unit of VS30 is m/s. The proposed optimization procedure was also conducted by considering the cost function of cross-validation. The residuals and scatter of the plots between the measured and predicted VS30 values were analyzed ( Figure 6). When the actual VS30 values were below 760 m/s, the threshold of the engineering bedrock depth [72] implied that VS30 values (g) sequential Gaussian simulation-5th realization; (h) sequential Gaussian simulation-50th realization; (i) sequential Gaussian simulation-100th realization; (j) sequential Gaussian simulation-E-type. V S30 was calculated using V S that was transformed from the SPT-N value by using correlation #1 ( Table 1). The unit of V S30 is m/s. The proposed optimization procedure was also conducted by considering the cost function of cross-validation. The residuals and scatter of the plots between the measured and predicted V S30 values were analyzed ( Figure 6). When the actual V S30 values were below 760 m/s, the threshold of the engineering bedrock depth [72] implied that V S30 values were overestimated. When the actual V S30 values exceeded 760 m/s, V S30 was underestimated. As the 70% boring log was mostly investigated prior to the engineering bedrock, soil strata were largely scattered, leading to over-fitting and poor validation due to the smoothing effects of kriging [73]. Although most outcrop bedrocks in mountainous regions were studied and their V S values were assumed to be 1300 m/s as per the site visiting survey, the measured V S for bedrock was lower than 1300 m/s. Moreover, the high stochastic/spatial variation of V S for bedrock in local areas induced low spatial correlation coefficients and under-fitting of the V S30 predictions. The residuals of UK are larger than those of the other methods because UK was developed precisely to recognize and quantify the linear spatial trends in datasets [74]. were overestimated. When the actual VS30 values exceeded 760 m/s, VS30 was underestimated. As the 70% boring log was mostly investigated prior to the engineering bedrock, soil strata were largely scattered, leading to over-fitting and poor validation due to the smoothing effects of kriging [73]. Although most outcrop bedrocks in mountainous regions were studied and their VS values were assumed to be 1300 m/s as per the site visiting survey, the measured VS for bedrock was lower than 1300 m/s. Moreover, the high stochastic/spatial variation of VS for bedrock in local areas induced low spatial correlation coefficients and under-fitting of the VS30 predictions. The residuals of UK are larger than those of the other methods because UK was developed precisely to recognize and quantify the linear spatial trends in datasets [74]. To better compare the residuals in regression analysis, this study evaluated four error indices: mean absolute error (MAE), RMSE, root relative squared error (RRSE), and coefficient of determination (R 2 ) ( Table 4). The metrics of the indices are calculated as follows: To better compare the residuals in regression analysis, this study evaluated four error indices: mean absolute error (MAE), RMSE, root relative squared error (RRSE), and coefficient of determination (R 2 ) ( Table 4). The metrics of the indices are calculated as follows: whereŷ is the predicted value of y, and y is the mean value of y. The indices indicate the difference between the measured and predicted V S30 ; the higher the accuracy, the closer the difference value is to 0. R 2 shows how closely the values complement the original value; the higher the value, the better the model. The SGS-E-type method showed the optimal prediction results when comparing the RMSE and RRSE values; these were the lowest for the SGS-E-type method. The MAE for the SK method had the lowest values. The R 2 values of the SK, OK, and SGS-E-types were closer to 1, indicating a higher precision. When the error distribution is assumed to be Gaussian, RMSE and RRSE are more suitable for reflecting the model results than MAE. In particular, RMSE is more useful if significant errors are avoided, and it is considered that RRSE reduces the error to the same dimensions as the quantity being predicted. Accordingly, the SGS-E-type as the best-fitting method was used to construct the geospatial grid assigned with V S30 in the test area. The proportions of grids, which were built by interpolation methods and allocated for the site class (Table 3), have been summarized in Table 4. In general, a similar distribution for proportion is presented; the ratio of the B, C, D, and E classes is 42%, 31%, 13.3%, and 13.7% in the training area, respectively.

Adaptation of Terrain Proxy-Based Site Class
To investigate the interrelationship between V S30 and landform, the combination of landform classes, which are clearly defined as the principal components for the regression model, have been examined using the box and whisker plots. The grids assigned with V S30 in the training area were built using SGS and grouped by the TPI-based landform class, which has been denoted as LF (Figure 7a). Class 1 (LF1) had the broadest range (400-1100 m/s between the first and third quartiles) of V S30 as well as the highest V S30 (approximately 1300 m/s). LF2, LF3, LF8, LF9, and LF10 represented the drainages or ridges in the mountainous area and exhibited a similar range (approximately 400-800 m/s between the first and third quartiles) of V S30 . LF4, LF6, and LF7, which were classified as valleys and slopes, ranged from approximately 400 to 600 m/s between the first and third quartiles of V S30 . LF5 (plain area) was verified as having the lowest range (approximately 300-420 m/s between the first and third quartiles) of V S30 . Accordingly, the LFs were grouped into four classes (LF-I, -II, -III, and -IV) with similar ranges of V S30 . (Figure 7b, Table 2). LF-I represented deeply incised streams characterized as metamorphic rocks [75] and exhibited the highest V S30 . LF-II was defined to include drainage and ridge sites (TPI 25 ≤ −1). LF-III represented valleys and slopes, with TPI 25 > −1 and < 1, and a slope > 2 • . LF-IV denoted a plain where the mean V S30 was 400 m/s. LF5 (plain area) was verified as having the lowest range (approximately 300-420 m/s between the first and third quartiles) of VS30. Accordingly, the LFs were grouped into four classes (LF-I, -II, -III, and -IV) with similar ranges of VS30. (Figure 7b, Table 2). LF-I represented deeply incised streams characterized as metamorphic rocks [75] and exhibited the highest VS30. LF-II was defined to include drainage and ridge sites (TPI25 ≤ −1). LF-III represented valleys and slopes, with TPI25 > −1 and < 1, and a slope > 2°. LF-IV denoted a plain where the mean VS30 was 400 m/s. The coinciding of grids of different sites classified by VS30 (Table 3) was evaluated along with the four groups of the LFs by comparing the elevation and slope as a proxy of VS30. The minimum first quartile of the VS30 value for each class was mostly 400 m/s because the grids with VS30 values of 400 m/s had a wide range of surface elevations and slopes (Figure 7c,d). No grid was determined as a class E. The grid area associated with the elevation and slope of LF-IV accounted for approximately 32% and 89% of the total area for the D class, respectively. Among class C, the grid area associated with the elevation and slope of LF-III accounted for approximately 83% and 65% of the total area, respectively. The grid area associated with the elevation and slope of LF-II accounted for approximately The coinciding of grids of different sites classified by V S30 (Table 3) was evaluated along with the four groups of the LFs by comparing the elevation and slope as a proxy of V S30 . The minimum first quartile of the V S30 value for each class was mostly 400 m/s because the grids with V S30 values of 400 m/s had a wide range of surface elevations and slopes (Figure 7c,d). No grid was determined as a class E. The grid area associated with the elevation and slope of LF-IV accounted for approximately 32% and 89% of the total area for the D class, respectively. Among class C, the grid area associated with the elevation and slope of LF-III accounted for approximately 83% and 65% of the total area, respectively. The grid area associated with the elevation and slope of LF-II accounted for approximately 68% and 59% of the total area for B class, respectively. The grids classified as LF-I accounted for only 2% of the entire test site, although they exhibited scattered relations with V S30 . The grid area associated with the elevation and slope of LF-I accounted for 95% and 89% of the total area for B class, respectively. Therefore, LF-II (including LF-I), LF-III, and LF-IV were strongly related to the site classes B, C, and D, respectively. The statistical characteristics of terrain proxies (elevation and slope) and V S30 values are summarized using the grouped TPI-based landform classes (Table 5). From LF-I to LF-IV, V S30 increases; otherwise, the elevation and slope generally increase.

Multivariate Regression Model for Terrain Proxy-Based V S30 Classification
Using the grids (171,000 cells) in the training area constructed by SGS, the elevation and slope were subgrouped using the proposed four LFs and defined as input feature datasets for the training procedure in the proposed four regression models. The labeled data were V S30 . Using the four modeled regression algorithms for each LF group, the four input combinations (elevation and slope) were trained and verified using K (i.e., 10-fold cross-validation). The averaged residuals between the predicted and actual V S30 values ( Figure 8) were calculated for each cross-validation using the metrics of the indices: MAE, RMSE, RRSE, R 2 . The validation data for the model were then considered to be a single subsample (i.e., 17,100 cells), and the remaining K-1 samplings (i.e., 153,900 cells) were used as the training data. The prediction performance of N-V S correlations (Table 3) for deriving the V S30 grid were also compared with the cost functions.  Table 3).
The best-fitting model for each LF group was determined using the cost functions. The MAE for SVR had the lowest value and was considered to be a more optimized method. When RMSE and RRSE exhibited their lowest values, LR and MLP showed optimal prediction performance. In particular, MAE and RMSE were compared with the minimum deviation (140 m/s) of VS30 threshold in the site classification system. The indices were lower than 140 m/s except for K-NN, indicating the low possibility of encountering errors during site classification when using the regression algorithms. The R 2 values of LR and MLP were also closer to 1, indicating relatively high precision. In particular, K-NN was the least reliable in predicting VS30. The optimal predictable LF was determined by  Table 3).
The best-fitting model for each LF group was determined using the cost functions. The MAE for SVR had the lowest value and was considered to be a more optimized method. When RMSE and RRSE exhibited their lowest values, LR and MLP showed optimal prediction performance. In particular, MAE and RMSE were compared with the minimum deviation (140 m/s) of V S30 threshold in the site classification system. The indices were lower than 140 m/s except for K-NN, indicating the low possibility of encountering errors during site classification when using the regression algorithms. The R 2 values of LR and MLP were also closer to 1, indicating relatively high precision. In particular, K-NN was the least reliable in predicting V S30 . The optimal predictable LF was determined by comparing the cost functions. Similar to the optimization of geospatial interpolation models, the LR presented performed optimally in the prediction of V S30 because using the RMSE is more appropriate for realizing a Gaussian error distribution. To select the eclectic regression model that almost completely satisfies the cost functions, the priorities of the optimal performing regression algorithm are as follows: LR, MLP, SVR, and K-NN.
The predicted grids classified as LF-III and comprising maximum data (103,704 cells) yielded the lowest metrics of residual indices for all the models. The input feature of the landform class that represents valleys and slopes may be consistent with the slope and elevation. Otherwise, LP-I, which is characterized by its deeply incised streams and identified as class B accruing to site classification, shows the highest level of uncertainty in predicting the V S30 in the training area. Characterizing the representative V S30 in bedrock is complicated when using the sparse distribution of investigated rock outcrops in the training area. After comparing the N-V S correlations in the cost functions, correlation #1 was selected and used for the transformation of N to V S to calculate the V S30 as it yielded the lowest MAE, RMSE, and RRSE values and the highest R 2 . Nevertheless, the reasons these values when compared to other stochastic relationships are as follows: (1) low number of geotechnical investigations were performed in mountain and rock outcrop areas; (2) uncertainty in the geospatial interpolation of sparsely distributed and low correlated features; and (3) possible underfitting in the auto fitting procedures of hyperparameters.
In this study, the predicted residuals were elucidated and the difference between the predicted V S30 values with the measured V S30 values ( Figure 9) was evaluated based on the N-V S correlation #1. The best-fitting regression model out of the four algorithms was applied to the training areas to compare the spatial variability in the predicted V S30 map ( Figure 10). The scatter patterns in LR, SVR, and MLP were very similar, whereas the residual application of the K-NN was significantly scattered. For all grouped LFs, the V S30 value was under approximately 400 m/s; at this value, the site classes D or E were overestimated. The grids assigned with low V S30 were mainly located at LF-IV, where elevation and slope were the smallest (5.73 m and 0.01 • , respectively) ( Table 5). However, these grids in LF-I or LF-II may be a source of noise parameters in the regression models. Conversely, when the V S30 value exceeded approximately 400 m/s, site classes B and C were underestimated.
The grids allocated in the high V S30 map ( Figure 5) mostly exhibited top or ridge characteristics: LF-I or LF-II, high elevation and slope, and outcrop bedrock or stiff soil. The grid with a high V S30 in the plain would be a source of noise parameters. When the V S30 value was either higher or lower than 400 m/s, the number of residuals increased. V S30 was mapped in the training area ( Figure 10) and test areas ( Figure 11) based on the best-fitting model of the four regression methods. The maps from LR and SVR presented similar spatial trends, particularly in the northern and southern mountainous regions where V S30 exceeded 800 m/s. The grids were defined as class B using LR, K-NN, SVR, and MLP, which occupied 32%, 39%, 31%, 38% of the training area, respectively. The grid areas assigned with class C for the application of LR, K-NN, SVR, and MLP occupied 19%, 25%, 20%, and 30% of the training area, respectively. Class D was distributed at 49%, 46%, 49%, and 42%, respectively, in the four V S30 maps. The MLP-based V S30 map had a similar proportion of site class with the V S30 grid, according to the geospatial interpolation models.     tial trends, particularly in the northern and southern mountainous regions where VS30 exceeded 800 m/s. The grids were defined as class B using LR, K-NN, SVR, and MLP, which occupied 32%, 39%, 31%, 38% of the training area, respectively. The grid areas assigned with class C for the application of LR, K-NN, SVR, and MLP occupied 19%, 25%, 20%, and 30% of the training area, respectively. Class D was distributed at 49%, 46%, 49%, and 42%, respectively, in the four VS30 maps. The MLP-based VS30 map had a similar proportion of site class with the VS30 grid, according to the geospatial interpolation models. Figure 11. VS30 mapping for the test areas in Pyongyang, Kaesong, and Nampo, using regression models: logistic regression; K-nearest neighbors; support vector regression; multilayer perceptron. Figure 11. V S30 mapping for the test areas in Pyongyang, Kaesong, and Nampo, using regression models: logistic regression; K-nearest neighbors; support vector regression; multilayer perceptron.

V S30 Mapping in the Test Area, North Korea
Based on the best-fitting regression model, the V S30 values in four test areas in North Korea were mapped using elevation, slope, and the grouped LF from the DTM (Figure 11). In Pyongyang, the results of LF and K-NN were similar to those of the training area. K-NN predicted the discontinuous zone, which was classified as class B, along the Taedong River. In the MLP-based V S30 map, the C3 and C4 classes, representing the intermediate stiff soil, were distributed on the basin-edge area and occupied a wider area (28%) than those in the LR, K-NN, and SVR-based maps (10%, 19%, and 9%, respectively). The D and E classes were distributed in the basin and embankment areas, except in the K-NN-based map.
The LF-I zone in the test area, Kaesong, was mainly classified as class B. The LF-II zone was classified as class C based on the MLP-based map, whereas other maps mostly identified the zone as class D. As class C denotes the weathered rock or stiff soil and is generally distributed in the drainage and ridge areas , the MLP-based map provided reasonable distribution of the site class. In every map, classes D or E were determined by focusing on LF-IV. The V S30 maps of the test area in Nampo were obtained from the four regression models that generally represented a similar pattern, with LF-I and LF-IV being classified as classes B and D, respectively. Class D was widely distributed, exclusively in the northern and northwestern mountains, where the elevation exceeded 350 m and the slope exceeded 50 • .
The LR-and SVR-based maps showed a similar V S30 distribution to that observed in the training area. The K-NN-based V S30 maps showed discontinuous zoning despite having the same LF. The MLP-based map determined class C for a wider area than the maps based on other models. The normal distribution of V S30 obtained from the best-fitting model in the four regression methods was derived for the four test areas (Figure 12). The probability density of the SGS-E-type in the training area was also compared based on the distribution patterns (skewness and kurtosis). The probability density distributions of LF, SVR, and MLP-based V S30 maps were more peaked (leptokurtic) than that of the SGS-E-type. If V S30 was leptokurtic, most of the V S30 grids were distributed mainly in the V S30 range of 600-800 m/s as compared to that observed in a mesokurtic or platykurtic distribution (i.e., SGS-E-type and K-NN applications). The kurtosis of V S30 developed from the K-NN was similar to that of the SGS-E-type. In the training area, the skewness coefficients of LF and SVR were zero, while K-NN and MLP were typically negative. In the test areas of Pyongyang and Nampo, the distribution patterns were similar to those in the training area. Similar to the resemblance in geography and terrain of the training and test areas, the predicted V S30 grid conserved the surficial and subsurface characteristics. In contrast, the negative skewness of V S30 for the four models was estimated for the test area in Kaesong. As most areas of Kaesong are low hills, a higher V S30 value (approximately 700-900 m/s) than that for the training area was possible.

Discussion
The large-scale VS30 map predicted using only the terrain features supports site classification against seismic site effects. In particular, the configuration of subsurface characteristics was complicated by the given confined terrain information, despite the important global or local spots in the Korean Peninsula. The TPI-based landform characteristics were newly defined as the principal components of the learning model to reclassify the conventional criteria of various terrain proxies. The terrain-based proxy usually uses the slope

Discussion
The large-scale V S30 map predicted using only the terrain features supports site classification against seismic site effects. In particular, the configuration of subsurface characteristics was complicated by the given confined terrain information, despite the important global or local spots in the Korean Peninsula. The TPI-based landform characteristics were newly defined as the principal components of the learning model to reclassify the conventional criteria of various terrain proxies. The terrain-based proxy usually uses the slope gradient and elevation, and in combination with LF, adequately predicts V S30 . The training area in South Korea was representative of the geography and terrain characteristics of the Korean Peninsula. The subareas (basins, ridges, and valleys) in the training area resembled those in the three test areas in North Korea based on the LF. As the LFs were classified into 10 landform categories with a grid threshold TPI (25 or 500 m), the subgrouping of LFs was appropriate for clearly categorizing the relations between LF and V S30 .
The previous methodologies suggested the individual proxy dependence of V S30 related to geomorphological, terrain, and geological properties through a stochastic approach, such as a lognormal linear regression model [11][12][13][14]44]. The best-fitting models for the training area had the lowest residuals for various cost function metrics and presented a good performance by applying K-fold cross-validation. However, determining the optimal regression model by considering only the cost functions is challenging. Hence, geospatial interpolation was applied to verify the reliability of the models based on the site class proportion of V S30 grids. The V S30 maps in North Korea primally visualized site classes against seismic site effects using the best-fitting models of the four regression methods. Although verification is impossible without geotechnical or geophysical investigations, the coefficients of the site amplification can be preliminarily interpreted for seismic hazard assessment or resistance designs.
Prior to the site classification, regression, and mapping workflow, the archiving of the V S30 profile and its site response from the geophysical survey is essential. However, if the V S profile is not distributed for a large-scale area, the empirical correlations with other physical value can be assumed as the best alternative approach for determining the V S profile. Nevertheless, the applicability of N-V S correlations for regression modeling were validated; the direct V S profile should be archived or supplemented for the site-specific characterization of seismic amplification. Therefore, the proposed regression model is recommended only in n-value-based V S30 mapping.
Geological unit-based proxies for seismic site characterization maps in high seismicity regions have been effectively used to demonstrate the shallow shear wave profile and its index properties. Accordingly, the integrated proxy index is possible considering the local cross-correlation between terrain and geological features, given the geological map of North Korea. An ensemble learning approach is necessary to combine several regression models to produce an optimal predictive model for an extended follow-up research.

Conclusions
In this study, the DTM was used to map large-scale V S30 in the test site in North Korea according to the proposed procedures: (1) TPI-based landform classification; (2) geostatistical interpolation for building a geospatial grid with geotechnical and DTM features; (3) calculation of V S30 grids; (4) reclassification and subgrouping of the TPIbased landform class with respect to V S30 grids; (5) uncertainty analysis with regression modes, landform class, and N-V S correlations; and (6) multivariate regression for V S30 grid prediction using terrain proxy-based principal components. The reclassified TPI-based landform was strongly correlated with V S30 , and its site response coefficients were similar to the correlations between V S30 and elevation or slope. In particular, the grouped LF divided the sedimentary site with a lower V S30 from the outcrop bedrock site in the training and test sites. The geospatial grid, which has been developed based on the SGS-E-type, also supported the characterization of shallow subsoil and land cover conditions in addition to V S30 . The ML techniques for predicting the V S30 grid, which were initially allocated with the DTM, determined the site-specific V S30 map at the test sites: Pyongyang, Kaesong, and Nampo. The best-fitting models of LR, K-NN, SVR, and MLP were determined based on K-fold cross-validation. The MLP-based V S30 map represented a similar distribution of the geospatial grid by using the SGS-E-type. On the V S30 map in the test areas of North Korea, feasible site classification against the seismic site effect was visualized, which will probably be used as a reference map for preliminary seismic hazard assessments.

Data Availability Statement:
The data that support the findings of this study are openly available in GEO Big Data Open Platform at https://data.kigam.re.kr/.