1. Introduction
Sea level rise exacerbates the risk of coastal disasters by amplifying flooding, inundation and shoreline erosion [1]. From a coastal engineering perspective, sea level can be broadly decomposed into several components: the mean sea level, the astronomical tide, the meteorological tide (i.e., storm surge), and other residual effects [2,3,4,5,6]. Among these, storm surges have emerged as a major threat to coastal regions worldwide due to their capacity to cause extensive inundation and damage [7,8,9,10,11,12,13,14,15]. In addition, the risk of coastal inundation from storm surges has been observed to increase with the ongoing rise in sea levels attributable to climate change [16,17,18,19].
Storm surges are abnormal rises in sea level primarily driven by meteorological factors, particularly low atmospheric pressure and strong winds [2,3,4,5,6]. Storm surge research is often conducted in relation to tropical cyclones, which are accompanied by both low atmospheric pressure and strong winds [15,18,19]. In general, two main approaches have been adopted for predicting storm surge: numerical models and statistical models. While numerical hydrodynamic models offer detailed physical representation, they are computationally expensive and time-consuming (e.g., [16,17,18,19]). On the other hand, statistical models, particularly regression-based approaches, enable rapid predictions and are more suitable for operational forecasting (e.g., [20,21,22,23,24,25]). Given the importance of speed and reliability in early warning systems for coastal hazards, statistical models provide a practical alternative for storm surge prediction.
To date, various statistical methods, including machine learning techniques, have been applied to the development of storm surge height prediction models [26,27,28,29,30,31,32,33,34,35]. However, Multiple Linear Regression (MLR) remains the most widely used approach, demonstrating strengths in predictive accuracy, interpretability, and computational efficiency. Roberts et al. [26,27] developed an MLR-based storm surge prediction model for the New York-New Jersey coastal region. They used meteorological inputs from the NARR and CFSR reanalysis datasets as independent variables and tide gauge observations as the dependent variable. Their model achieved a high level of accuracy (within 0.1 m of the observed peak water level) for extreme events such as Hurricane Sandy (2012), performing comparably to the physics-based numerical model NYHOPS (New York Harbor Observing and Prediction System). Notably, due to its low computational cost, the MLR model could be used to run dozens of ensemble forecasts, making it advantageous for both real-time and long-term scenario-based predictions.
However, some studies utilizing MLR have tended to focus primarily on estimating optimal regression coefficients and achieving predictive performance, without applying systematic data partitioning methods or establishing a clear model development strategy. This tendency is particularly evident in studies that use synthetic typhoon scenarios or data derived from numerical models. Al Kajbaf and Bensi [31] developed an MLR-based surrogate model for storm surge prediction along the U.S. East Coast, using thousands of synthetic typhoon scenarios generated by the ADCIRC (Advanced Circulation Model for Oceanic, Coastal and Estuarine Waters). The input variables included central pressure deficit, radius to maximum winds, forward speed, and track direction of the typhoon, while the output variable was the peak storm surge height. Although the model enabled rapid surge estimation through a simple linear regression formula, without requiring complex hydrodynamic simulations, the absence of a well-defined data splitting and modeling strategy may lead to overfitting, reduced model robustness, and limited applicability to real-world storm events (e.g., [36]).
In recent studies, numerical model-based datasets such as ERA5 (European Centre for Medium-Range Weather Forecasts Reanalysis v5), GTSR (Global Tide and Surge Reanalysis), and ADCIRC have been increasingly used to overcome the limitations of observational data, including spatial sparsity and the lack of extreme event records. For example, Tadesse and Wahl [35] developed a global storm surge prediction model for coastal regions worldwide using ERA-Interim, satellite data, and GTSR, a reanalysis dataset based on numerical modeling. They applied and compared several statistical methods, including MLR, KNN (K-Nearest Neighbors) and Random Forest, and found that atmospheric pressure, particularly lagged pressure, was the most significant predictor in more than 70% of the study areas. While their study demonstrated the feasibility of global-scale surge prediction using only model-based data, the direct application of such an approach to coastal disaster risk planning at the national level may be limited. This is because the performance of the global model represents an average over broad spatial domains, which makes it difficult to optimize the underlying numerical models for every individual coastline. As a result, model-driven errors are inevitable (e.g., [15]), and certain regions within a given country may not be suitable for the application of such generalized methods.
In addition, compared to the extensive body of research conducted in the United States and Europe, studies focusing on the Korean coastline remain extremely limited. Choo et al. [37] applied logistic regression and MLR techniques to assess sea level anomalies at three sites along the southeastern coast of Korea: Busan, Geoje, and Gadeokdo. Using meteorological and oceanographic observation variables as inputs, they developed a statistical model to predict anomalous sea level events. Their results showed that the MLR-based model improved predictive performance by approximately 4.9% in terms of the coefficient of determination (R²) compared to an existing empirical approach. Their study is one of the few cases in which MLR has been applied to the Korean coastal area, highlighting both the potential for region-specific statistical model development and the relative scarcity of such research in the region.
Taken together, these previous studies underscore the need for storm surge prediction models that not only utilize reliable statistical techniques such as multiple linear regression (MLR), but also incorporate physically meaningful variables, observational data, and region-specific characteristics. Addressing these gaps, the present study aims to develop a storm surge height prediction model tailored to the southeastern coast of Korea using the multiple linear regression technique.
Figure 1 illustrates the overall workflow of this study. The storm surge height prediction model was developed using typhoon characteristics, such as typhoon location and intensity extracted from best-track data, as independent variables, while observed storm surge heights were used as the dependent variable. The model’s predictive performance was evaluated using the Root Mean Square Error (RMSE) and the coefficient of determination (R²). To enhance model reliability and interpretability, this study implemented a structured model configuration strategy that distinguishes it from previous studies. Specifically, models were developed separately based on threshold criteria, including (1) the distance between the typhoon center and the observation point, and (2) the magnitude of the observed storm surge height. This approach contributes to both improved model performance and structural simplicity while addressing key limitations of prior research. In general, higher storm surges occur at locations situated to the right side of the typhoon track. However, because this study developed an MLR model applicable to 11 stations, the relative position between the typhoon and each observation station varies for the same typhoon. Therefore, the variation in storm surge height according to the relative position was not considered in this study.
2. Materials and Methods
2.1. Research Area
As shown in Figure 2a, the southeastern coast of the Korean Peninsula (KP) was selected as the research site. This region has frequently experienced typhoon-related damage associated with high storm surges in the past [15]. The southeastern coastline is characterized by complex coastal topography and is relatively less affected by astronomical tides and wind waves compared to the western and eastern coasts of the KP. In contrast, the southwestern coast exhibits a large tidal range, necessitating an analysis of the nonlinear interactions between tides and storm surges. The eastern coast of the KP, adjacent to the deep waters of the East Sea, is subject to significant wave transformations, requiring a detailed examination of the nonlinear interactions between wind waves and storm surges.
Eleven locations along the southeastern coast of the KP were selected as the points of interest for the development of storm surge prediction models using the multiple linear regression (MLR) technique. The selected locations include Geomundo, Goheung, Yeosu, Gwangyang, Tongyeong, Masan, Geojedo, Gadeokdo, Busan, Ulsan and Pohang. Figure 2b presents the locations of tide gauge stations installed within the area of interest for this study, and Table 1 provides their detailed coordinates.
2.2. Data
2.2.1. Independent Variable (Predictors)
In this study, typhoons that passed through the region spanning from 32° N to 40° N and from 122° E to 132° E over the period from 1979 to 2020, as shown in Figure 3, were defined as typhoons that affected the KP. A total of 155 typhoons were considered in this study, and their key characteristics are presented in Table A1. The characteristics of typhoons were used as input variables. Typhoon data were obtained from the best track dataset provided by IBTrACS.
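The selection rule above can be sketched in a few lines of Python. This is a minimal illustration only: the track lists and storm names are hypothetical, the box edges are treated as inclusive, and a storm is counted as "passing through" when any reported best-track fix lies inside the box, both of which are assumptions not stated in the text.

```python
# Study-area bounds taken from the text (degrees N and degrees E).
LAT_MIN, LAT_MAX = 32.0, 40.0
LON_MIN, LON_MAX = 122.0, 132.0


def passes_through_region(track):
    """Return True if any (lat, lon) fix of the track lies inside the box.

    Assumption: edges are inclusive and only reported fixes are tested
    (no interpolation between fixes).
    """
    return any(
        LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX
        for lat, lon in track
    )


# Hypothetical tracks: lists of (latitude, longitude) best-track fixes.
tracks = {
    "storm_A": [(25.0, 128.0), (30.5, 127.0), (34.2, 128.5)],  # enters the box
    "storm_B": [(20.0, 135.0), (26.0, 138.0), (31.0, 142.0)],  # stays outside
}

affecting = [name for name, trk in tracks.items() if passes_through_region(trk)]
print(affecting)
```

In practice the same test would be applied to each storm in the IBTrACS record for 1979 to 2020 before the predictors are extracted.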
IBTrACS is an international database that integrates best track data of tropical cyclones worldwide [38]. It is a project developed by the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA), initiated to unify tropical cyclone track records that were previously managed independently by Regional Specialized Meteorological Centers (RSMCs) and Tropical Cyclone Warning Centers (TCWCs) across different ocean basins [38,39]. The primary objective of IBTrACS is to standardize these records, originally archived using different formats and criteria by each agency, into a consistent dataset that is easily accessible and usable by researchers. The IBTrACS dataset [40] covers the global ocean region from 70° N to 70° S and from 180° W to 180° E, and classifies data into seven basins: North Atlantic (NA), Eastern Pacific (EP), Western Pacific (WP), North Indian Ocean (NI), South Indian Ocean (SI), South Pacific (SP), and South Atlantic (SA). IBTrACS contains records dating back to 1842 and is continuously updated; as of 24 July 2025, the most recent version is v04r01. The dataset generally provides data at 3-h intervals and focuses on storm center position, maximum sustained wind speed, and minimum central pressure. Additionally, depending on the source agency, it may include other tropical cyclone-related parameters such as radius of maximum winds, environmental pressure, and storm classification.
In this study, among the various tropical cyclone characteristics provided by IBTrACS, the independent variables for the multiple linear regression model were selected from the “TOKYO” data. These data are issued by the World Meteorological Organization Regional Specialized Meteorological Center in Tokyo, operated by the Japan Meteorological Agency, the official forecasting authority for typhoons in the western North Pacific, with records available from 1951 to the present. The selected variables include the latitude and longitude of the typhoon center, the maximum sustained wind speed and central pressure at the typhoon center, the translational speed of the typhoon, the distance between the typhoon center and a specific location, and the angle of approach of the typhoon relative to that location. These variables were selected based on their known influence on storm surge dynamics and their availability from typhoon track datasets [29,30,31,32,33,34]. The angle of approach was calculated clockwise from true north. For instance, if the point of interest is located in the upper-right quadrant relative to the typhoon center, the angle of approach falls within the range of 0° to 90°.
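As an illustration of the two geometric predictors, the sketch below computes the typhoon-to-station distance and the angle of approach. It assumes a spherical Earth, the haversine formula for distance, and the initial great-circle bearing from the typhoon center to the station as the angle; the source does not state which formulas were used, and the coordinates are illustrative rather than taken from Table 1.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical-Earth assumption)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def approach_angle_deg(ty_lat, ty_lon, st_lat, st_lon):
    """Bearing from typhoon center to station, clockwise from true north (0-360)."""
    p1, p2 = math.radians(ty_lat), math.radians(st_lat)
    dlmb = math.radians(st_lon - ty_lon)
    east = math.sin(dlmb) * math.cos(p2)
    north = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(east, north)) % 360.0


# Hypothetical typhoon center and an illustrative station position.
ty_lat, ty_lon = 33.0, 127.0
st_lat, st_lon = 35.1, 129.0

dis = distance_km(ty_lat, ty_lon, st_lat, st_lon)
ang = approach_angle_deg(ty_lat, ty_lon, st_lat, st_lon)
print(f"dis = {dis:.0f} km, ang = {ang:.1f} deg")
```

Because the station lies to the north-east of the typhoon center in this example, the angle falls between 0° and 90°, matching the convention described above.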
2.2.2. Dependent Variable (Predictand)
To better understand and predict various oceanographic phenomena occurring along the Korean coast, such as tides, storm surges, and sea level rise, a nationwide tide gauge network consisting of approximately 50 stations has been established and is operated primarily by the Korea Hydrographic and Oceanographic Agency (KHOA) [41,42]. The temporal resolution of tide observations varies by station, with data available at intervals of 1 min, 10 min, or 1 h. KHOA provides two types of tide observation datasets with a temporal resolution of 1 h, both of which have undergone quality control procedures, including raw time series processing, gap filling, and outlier removal [43].
In this study, as mentioned in Section 2.1 Research area, the dependent variable for the multiple linear regression model was the storm surge height observed at eleven tide gauge stations located along the southeastern coast of the KP, the designated study area. The storm surge height was calculated by removing the astronomical tide components and the annual mean sea level from the quality-controlled water level records at one-hour intervals at each tide gauge station. The astronomical tide components were estimated using the default settings of the T_tide MATLAB (R2025a) toolbox [44].
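The residual calculation described above can be sketched as follows. This is a simplified illustration, not the authors' T_tide workflow: it assumes the astronomical tide has already been predicted as an anomaly about zero, and that the annual mean sea level is the calendar-year mean of the observed record. The function name and the sample values are hypothetical.

```python
from statistics import mean


def storm_surge_height(water_level, tide, years):
    """Residual = observed level - astronomical tide - annual mean sea level.

    water_level : hourly observed water levels (m)
    tide        : predicted astronomical tide, as an anomaly about zero (m)
    years       : calendar year of each sample
    """
    # Group observations by year to obtain the annual mean sea level.
    by_year = {}
    for wl, yr in zip(water_level, years):
        by_year.setdefault(yr, []).append(wl)
    annual_mean = {yr: mean(vals) for yr, vals in by_year.items()}

    return [wl - td - annual_mean[yr]
            for wl, td, yr in zip(water_level, tide, years)]


# Tiny hypothetical record (a real record would span a full year of hours).
water_level = [1.0, 1.5, 2.0, 1.5]
tide = [-0.5, 0.0, 0.3, 0.0]
years = [2020, 2020, 2020, 2020]

ssh = storm_surge_height(water_level, tide, years)
print([round(v, 3) for v in ssh])
```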
2.3. Multiple Linear Regression
Multiple linear regression (MLR) is a widely adopted statistical approach for quantifying the relationship between a set of independent (explanatory) variables and a dependent variable. By constructing a mathematical model based on observed data, MLR enables the prediction of the dependent variable’s behavior as a function of several explanatory factors. The general structure of an MLR model is expressed as follows:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
where $y$ denotes the dependent variable, $x_1, x_2, \ldots, x_n$ represent the independent variables, and $\beta_0, \beta_1, \ldots, \beta_n$ are the regression coefficients estimated by the least squares method. Although MLR assumes linear and additive relationships among variables, it has demonstrated robust applicability in modeling complex real-world phenomena [45,46,47].
In this study, the dependent variable is the observed storm surge height (SSH) at each station. The set of independent variables used in the regression analysis comprises the typhoon’s latitude (TOKYO_LAT), longitude (TOKYO_LON), the angle (ang) and distance (dis) between the typhoon center and the observation station, typhoon speed (speed_km), and central pressure (TOKYO_PRES).
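A least-squares fit with these six predictors might look like the sketch below. The data are synthetic random numbers generated only to make the example runnable, the coefficient values are arbitrary, and the use of `numpy.linalg.lstsq` is an assumption; the source does not say which Python routine was used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictor names follow the text; the values are synthetic, not real data.
names = ["TOKYO_LAT", "TOKYO_LON", "ang", "dis", "speed_km", "TOKYO_PRES"]
n = 200
X = rng.normal(size=(n, len(names)))

# Arbitrary coefficients used only to generate a synthetic SSH series.
true_beta = np.array([0.5, -0.2, 0.1, -0.8, 0.3, -0.6])
ssh = 0.4 + X @ true_beta + rng.normal(scale=0.05, size=n)

# Prepend a column of ones for the intercept and solve by least squares.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, ssh, rcond=None)

for name, b in zip(["intercept"] + names, beta):
    print(f"{name:>10s}: {b:+.3f}")
```

With little noise in the synthetic series, the fitted coefficients land close to the values used to generate it, which is a quick sanity check that the design matrix is assembled correctly.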
Prior to model development, the dataset was rigorously preprocessed to ensure statistical validity and robustness. Timestamps with missing values or with storm surge heights less than zero were removed, together with the corresponding values of the dependent and independent variables. This exclusion avoids the phenomenon in which the storm surge height becomes negative as a typhoon approaches, known as a “negative storm surge” or “reverse storm surge”, which occurs through a mechanism opposite to that of a typical storm surge [48,49,50]. It should be noted that, due to the linear formulation of MLR, the regression occasionally yielded slightly negative predicted SSH values. These outputs are physically unrealistic because negative storm surges (reverse surges) were excluded from the dataset and are not the focus of this study. In practical applications, such negative predictions can be safely truncated to zero to ensure physical consistency and to prevent false positives.
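The two steps in this paragraph, dropping unusable records before fitting and truncating negative predictions afterwards, can be sketched as below. The record layout and the helper name `truncate_negative` are hypothetical, and records with SSH exactly equal to zero are kept, following the "less than zero" wording.

```python
# Hypothetical hourly records; None marks a missing value.
records = [
    {"ssh": 0.35, "dis": 420.0},
    {"ssh": -0.10, "dis": 610.0},  # negative surge: dropped
    {"ssh": None, "dis": 300.0},   # missing predictand: dropped
    {"ssh": 0.05, "dis": None},    # missing predictor: dropped
]

# Keep only complete records with non-negative storm surge height.
clean = [
    r for r in records
    if all(v is not None for v in r.values()) and r["ssh"] >= 0.0
]


def truncate_negative(predictions):
    """Clip physically unrealistic negative SSH predictions to zero."""
    return [max(0.0, p) for p in predictions]


print(len(clean), truncate_negative([-0.03, 0.20]))
```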
Two different data splitting strategies were applied. In the first approach, all available time series data, regardless of typhoon event, were pooled and randomly split into training and testing sets at a 7:3 ratio. In the second approach, data were grouped by individual typhoon events, and the list of typhoons was split into training and testing groups (7:3 ratio), after which each station’s time series data were merged accordingly. This dual strategy was designed to test whether pooling the time series of all typhoon events influences the regression results, allowing model performance to be evaluated both within and across typhoon events.
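The difference between the two strategies is easiest to see in code. The sketch below is an illustration under stated assumptions: the field name `typhoon_id`, the fixed random seed, and the rounding rule for the 7:3 cut are all hypothetical choices, not details from the source.

```python
import random


def random_split(rows, train_frac=0.7, seed=0):
    """Strategy 1: pool all hourly rows and split them at random."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(train_frac * len(rows)))
    return rows[:cut], rows[cut:]


def event_split(rows, train_frac=0.7, seed=0):
    """Strategy 2: split the list of typhoons, then assign rows by event."""
    events = sorted({r["typhoon_id"] for r in rows})
    random.Random(seed).shuffle(events)
    cut = int(round(train_frac * len(events)))
    train_events = set(events[:cut])
    train = [r for r in rows if r["typhoon_id"] in train_events]
    test = [r for r in rows if r["typhoon_id"] not in train_events]
    return train, test


# Hypothetical data set: 10 typhoons with 5 hourly rows each.
rows = [{"typhoon_id": f"T{e:02d}", "hour": h} for e in range(10) for h in range(5)]

train_r, test_r = random_split(rows)
train_e, test_e = event_split(rows)
print(len(train_r), len(test_r), len(train_e), len(test_e))
```

Under the random split, hours from the same typhoon can appear in both sets, so the test measures within-event skill. Under the event split, no typhoon is shared between the sets, so the test measures skill on unseen events.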
All input variables were standardized using z-score scaling prior to regression modeling to ensure comparability among predictors and improve numerical stability. To address potential multicollinearity among explanatory variables, variance inflation factors (VIFs) were computed for all predictors. The VIF is a commonly used diagnostic that quantifies the extent to which the variance of a regression coefficient is inflated due to multicollinearity with other variables. Generally, a VIF greater than 5 or 10 is considered indicative of problematic multicollinearity. Accordingly, in this study, only independent variables with a VIF value less than 5 were used in the regression analysis, thereby ensuring that multicollinearity does not pose a significant problem.
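For illustration, z-score scaling and the VIF can be computed directly from their definitions, with VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below uses synthetic columns and a population standard deviation for scaling; both are assumptions, and the source does not state whether high-VIF variables were removed in one pass or iteratively.

```python
import numpy as np


def zscore(X):
    """Standardize each column to zero mean and unit (population) variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)  # nearly collinear with column a

X = zscore(np.column_stack([a, b, c]))
v = vif(X)
keep = v < 5.0  # threshold used in the text
print(np.round(v, 1), keep)
```

In this synthetic case the two nearly collinear columns both exceed the threshold, while the independent column stays close to 1.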
All analyses, including data preprocessing, scaling, VIF-based feature selection, model training, and performance evaluation, were performed using Python 3.9. Executable code and the corresponding datasets have been provided as supplementary data to ensure transparency and reproducibility.
2.4. Objective Functions
To objectively evaluate the predictive performance of the MLR models developed in this study, four widely used statistical criteria were employed: mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and coefficient of determination (R²).
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is given by:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_i - P_i\right|$$
where $O_i$ and $P_i$ are the observed and simulated (predicted) values, respectively, and $n$ is the number of data points. Lower MAE values indicate higher model accuracy. MAE is widely used because it provides a straightforward interpretation of the average prediction error in the original units of the target variable.
MSE represents the average of the squared differences between observed and predicted values:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2$$
MSE penalizes larger errors more than smaller ones due to the squaring of the residuals, making it particularly sensitive to outliers. It is commonly used in regression analysis for model calibration and comparison.
RMSE is the square root of the mean squared error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2}$$
This criterion expresses the model error in the same units as the observed variable, aiding in interpretation and practical assessment of the model’s predictive performance. A lower RMSE value indicates better model performance.
R² quantifies the proportion of the variance in the observed data that is explained by the model. It is defined as:
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2 \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}$$
where $\bar{O}$ and $\bar{P}$ are the mean observed and predicted values, respectively. R² values range from 0 to 1, with higher values indicating better model fit. In general, R² values greater than 0.5 are considered acceptable for applications [49,50,51].
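The four criteria can be written as short Python functions. This sketch assumes that R² is the squared correlation between observed and predicted values, which is consistent with the text's reference to both mean observed and mean predicted values and with the stated 0 to 1 range; the sample numbers are hypothetical.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mse(obs, pred):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(mse(obs, pred))


def r2(obs, pred):
    """Squared correlation between observed and predicted values (assumed form)."""
    o_bar = sum(obs) / len(obs)
    p_bar = sum(pred) / len(pred)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


# Hypothetical observed and predicted storm surge heights (m).
obs = [0.1, 0.3, 0.5, 0.7]
pred = [0.2, 0.3, 0.4, 0.8]

print(f"MAE={mae(obs, pred):.4f}  MSE={mse(obs, pred):.4f}  "
      f"RMSE={rmse(obs, pred):.4f}  R2={r2(obs, pred):.4f}")
```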
4. Discussion
This study demonstrated that storm surge prediction using MLR can be significantly improved by careful data selection and preprocessing, particularly with respect to the typhoon–station distance and the application of SSH thresholds. Although Tadesse and Wahl [36] reported that atmospheric pressure, particularly lagged pressure, was a dominant predictor in many regions globally, typhoons approaching the Korean Peninsula generally weaken in intensity, making pressure- or wind-based thresholds less effective. Instead, distance- and SSH-based thresholds were adopted, as they better capture the regional characteristics of storm surges along the Korean Peninsula. Yang et al. [15] also demonstrated that storm surge heights are strongly influenced by the distance between the typhoon center and the point of interest, supporting the validity of the present findings. Our results indicated that when the distance between a typhoon center and the observation station was restricted to within 900–1000 km, and when the dataset was partitioned according to SSH ranges, the overall predictive performance of the regression models was enhanced. This improvement can be attributed to the exclusion of data points corresponding to events with negligible storm surge response, thereby reducing noise and focusing model training on physically meaningful cases.
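The threshold-based partitioning discussed above can be sketched as a simple filter. The 1000 km distance limit is the upper end of the range quoted in the text, whereas the SSH split value, the field names, and the sample rows are hypothetical and chosen only for illustration.

```python
DIST_MAX_KM = 1000.0  # upper end of the 900-1000 km range given in the text
SSH_SPLIT_M = 0.2     # hypothetical SSH threshold, not taken from the paper


def partition(rows, dist_max=DIST_MAX_KM, ssh_split=SSH_SPLIT_M):
    """Drop distant records, then split the rest into low and high SSH sets."""
    near = [r for r in rows if r["dis"] <= dist_max]
    low = [r for r in near if r["ssh"] < ssh_split]
    high = [r for r in near if r["ssh"] >= ssh_split]
    return low, high


# Hypothetical records: typhoon-station distance (km) and observed SSH (m).
rows = [
    {"dis": 300.0, "ssh": 0.45},
    {"dis": 850.0, "ssh": 0.05},
    {"dis": 1400.0, "ssh": 0.30},  # beyond the distance threshold: excluded
    {"dis": 990.0, "ssh": 0.20},
]

low, high = partition(rows)
print(len(low), len(high))
```

A separate regression would then be fitted to each subset, which is the sense in which models are developed separately by threshold.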
Despite these improvements, the model’s predictive skill in the low SSH range remained relatively limited. The results suggest that storm surge heights in this regime may be influenced by factors or interactions not fully captured by the linear and additive assumptions of MLR. This is further supported by the observed plateauing of predicted values and the limited alignment with the 1:1 reference line in scatter plots, indicating that the linear model may not adequately capture the underlying dynamics of storm surge events, particularly for lower-magnitude occurrences.
In contrast, in the high SSH range, the regression models achieved notably higher R² values, with several stations exhibiting values greater than 0.5. This can be ascribed to the increased variance and dynamic range of SSH in this interval, which facilitates the identification of linear relationships among predictors. Nevertheless, it must be acknowledged that as the SSH threshold increases, the number of data points in the high range decreases, potentially resulting in overfitting and reduced statistical robustness. Although this study implemented rigorous checks of statistical significance (p-values) and multicollinearity (VIFs), these limitations should be considered when interpreting the results.
Given these findings, future research should explore the use of nonlinear or ensemble modeling approaches, such as random forest regression, gradient boosting, or neural networks, which may better accommodate the inherent nonlinearity and complex interactions in storm surge processes. In addition, expanding the range of predictor variables to include real-time sea level observations, additional meteorological parameters, and local bathymetric features may further enhance predictive performance [15,35,36]. The integration of physics-based numerical models with data-driven statistical approaches also warrants investigation as a potential pathway for achieving greater accuracy and reliability. In this context, storm surge height is affected not only by typhoon characteristics but also by the site-specific conditions of the location of interest (e.g., [15,56,57,58,59]). As the threshold was defined based on the distance between the typhoon and the observation site, it may be difficult to ensure the same model performance in other regions, even if the threshold value is identical. Therefore, when applying the model development approach of this study to other regions with the same threshold values, the predictive performance should be examined separately for those regions.
From an operational perspective, the methodology and findings presented here offer valuable insights for the development of storm surge early warning systems and coastal risk management. Nevertheless, additional validation and calibration across a broader range of sites, as well as under diverse climatic and typhoon scenarios, will be required to generalize the applicability of these results.