Article

Development of a Storm Surge Prediction Model Using Typhoon Characteristics and Multiple Linear Regression

1
Division of Civil and Environmental Engineering, College of Engineering, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
2
Asia Infrastructure Research Center, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(9), 1655; https://doi.org/10.3390/jmse13091655
Submission received: 1 August 2025 / Revised: 25 August 2025 / Accepted: 28 August 2025 / Published: 29 August 2025
(This article belongs to the Section Marine Environmental Science)

Abstract

Storm surges pose a significant threat to coastal regions worldwide, particularly as sea levels continue to rise due to climate change. This study aims to develop a storm surge height prediction model for the southeastern coast of Korea using a multiple linear regression (MLR) approach. Typhoon characteristics, including location and intensity derived from best-track data, were used as independent variables, while observed storm surge heights served as the dependent variable. The model’s predictive performance was assessed using the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE) and the coefficient of determination (R2). To enhance model accuracy and interpretability, a threshold-based model configuration strategy was implemented by categorizing data according to (1) the distance between the typhoon center and the observation point, and (2) the magnitude of the observed storm surge height. The results indicate that restricting typhoon events to within 900–1000 km of the observation site and segmenting surge heights into low and high ranges significantly improves predictive skill, especially for extreme surge events. For example, at Masan station, the model achieved an R2 of 0.82 for high storm surge height (>0.2 m), and Gwangyang station showed an R2 of 0.57 at a 500 km distance threshold, demonstrating substantial skill in predicting extreme surges. However, limitations remain in capturing the variability of lower-magnitude surges, suggesting the need for future research incorporating nonlinear and ensemble methods. This study provides a foundation for improving coastal hazard prediction and contributes to the development of more effective early warning systems and risk management strategies.

1. Introduction

Sea level rise exacerbates the risk of coastal disasters by amplifying flooding, inundation and shoreline erosion [1]. From a coastal engineering perspective, sea level can be broadly decomposed into several components: the mean sea level, the astronomical tide, the meteorological tide (i.e., storm surge), and other residual effects [2,3,4,5,6]. Among these, storm surges have emerged as a major threat to coastal regions worldwide due to their capacity to cause extensive inundation and damage [7,8,9,10,11,12,13,14,15]. In addition, the risk of coastal inundation from storm surges has been observed to increase with the ongoing rise in sea levels attributable to climate change [16,17,18,19].
Storm surges are abnormal rises in sea level primarily driven by meteorological factors, particularly low atmospheric pressure and strong winds [2,3,4,5,6]. Storm surge research is often conducted in relation to tropical cyclones, which are accompanied by both low atmospheric pressure and strong winds [15,18,19]. In general, two main approaches have been adopted for predicting storm surge: numerical models and statistical models. While numerical hydrodynamic models offer detailed physical representation, they are computationally expensive and time-consuming (e.g., [16,17,18,19]). On the other hand, statistical models, particularly regression-based approaches, enable rapid predictions and are more suitable for operational forecasting (e.g., [20,21,22,23,24,25]). Given the importance of speed and reliability in early warning systems for coastal hazards, statistical models provide a practical alternative for storm surge prediction.
To date, various statistical methods, including machine learning techniques, have been applied to the development of storm surge height prediction models [26,27,28,29,30,31,32,33,34,35]. However, Multiple Linear Regression (MLR) remains the most widely used approach, demonstrating strengths in predictive accuracy, interpretability, and computational efficiency. Roberts et al. [26,27] developed an MLR-based storm surge prediction model for the New York-New Jersey coastal region. They used meteorological inputs from the NARR and CFSR reanalysis datasets as independent variables and tide gauge observations as the dependent variable. Their model achieved a high level of accuracy—within 0.1 m of the observed peak water level—for extreme events such as Hurricane Sandy (2012), performing comparably to the physics-based numerical model, NYHOPS (New York Harbor Observing and Prediction System). Notably, due to its low computational cost, the MLR model could be used to operate dozens of ensemble forecasts, making it advantageous for both real-time and long-term scenario-based predictions.
However, some studies utilizing MLR have tended to focus primarily on estimating optimal regression coefficients and achieving predictive performance, without applying systematic data partitioning methods or establishing a clear model development strategy. This tendency is particularly evident in studies that use synthetic typhoon scenarios or data derived from numerical models. Al Kajbaf and Bensi [31] developed an MLR-based surrogate model for storm surge prediction along the U.S. East Coast, using thousands of synthetic typhoon scenarios generated by the ADCIRC (Advanced Circulation Model for Oceanic, Coastal and Estuarine Waters). The input variables included central pressure deficit, radius to maximum winds, forward speed, and track direction of the typhoon, while the output variable was the peak storm surge height. Although the model enabled rapid surge estimation through a simple linear regression formula—without requiring complex hydrodynamic simulations—the absence of a well-defined data splitting and modeling strategy may lead to overfitting, reduced model robustness, and limited applicability to real-world storm events (e.g., [36]).
In recent studies, numerical model-based datasets such as ERA5 (European Centre for Medium-Range Weather Forecasts Reanalysis v5), GTSR (Global Tide and Surge Reanalysis), and ADCIRC have been increasingly used to overcome the limitations of observational data, including spatial sparsity and the lack of extreme event records. For example, Tadesse and Wahl [35] developed a global storm surge prediction model using ERA-Interim, satellite data, and GTSR (a reanalysis dataset based on numerical modeling) for coastal regions worldwide. They applied and compared several statistical methods, including MLR, KNN (K-Nearest Neighbors) and Random Forest, and found that atmospheric pressure, particularly lagged pressure, was the most significant predictor in more than 70% of the study areas. While their study demonstrated the feasibility of global-scale surge prediction using only model-based data, the direct application of such an approach to coastal disaster risk planning at the national level may be limited. This is because the performance of the global model represents an average over broad spatial domains, which makes it difficult to optimize the underlying numerical models for every individual coastline. As a result, model-driven errors are inevitable (e.g., [15]), and certain regions within a given country may not be suitable for the application of such generalized methods.
In addition, compared to the extensive body of research conducted in the United States and Europe, studies focusing on the Korean coastline remain extremely limited. Choo et al. [37] applied logistic regression and MLR techniques to assess sea level anomalies at three sites along the southeastern coast of Korea: Busan, Geoje, and Gadeokdo. Using meteorological and oceanographic observation variables as inputs, they developed a statistical model to predict anomalous sea level events. Their results showed that the MLR-based model improved predictive performance by approximately 4.9% in terms of the coefficient of determination (R2) compared to an existing empirical approach. Their study is one of the few cases in which MLR has been applied to the Korean coastal area, highlighting both the potential for region-specific statistical model development and the relative scarcity of such research in the region.
Taken together, these previous studies underscore the need for storm surge prediction models that not only utilize reliable statistical techniques such as multiple linear regression (MLR), but also incorporate physically meaningful variables, observational data, and region-specific characteristics. Addressing these gaps, the present study aims to develop a storm surge height prediction model tailored to the southeastern coast of Korea, using the multiple linear regression technique.
Figure 1 illustrates the overall workflow of this study. The storm surge height prediction model was developed using typhoon characteristics such as typhoon location and intensity, extracted from best-track data as independent variables, while observed storm surge heights were used as the dependent variable. The model’s predictive performance was evaluated using the Root Mean Square Error (RMSE) and the coefficient of determination (R2). To enhance model reliability and interpretability, this study implemented a structured model configuration strategy that distinguishes it from previous studies. Specifically, models were separately developed based on threshold criteria, including (1) the distance between the typhoon center and the observation point, and (2) the magnitude of the observed storm surge height. This approach contributes to both improved model performance and structural simplicity while addressing key limitations of prior research. In general, higher storm surges occur at locations situated to the right side of the typhoon track. However, because this study developed an MLR model applicable to 11 stations, the relative position between the typhoon and each observation station varies for the same typhoon. Therefore, the variation in storm surge height according to the relative position was not considered in this study.

2. Materials and Methods

2.1. Research Area

As shown in Figure 2a, the southeastern coast of the Korean Peninsula (KP) was selected as the research site. This region has frequently experienced typhoon-related damage associated with high storm surges in the past [15]. The southeastern coastline is characterized by complex coastal topography and is relatively less affected by astronomical tides and wind waves compared to the western and eastern coasts of the KP. In contrast, the southwestern coast exhibits a large tidal range, necessitating an analysis of the nonlinear interactions between tides and storm surges. The eastern coast of the KP, adjacent to the deep waters of the East Sea, is subject to significant wave transformations, requiring a detailed examination of the nonlinear interactions between wind waves and storm surges.
Eleven locations along the southeastern coast of the KP were selected as the points of interest for the development of storm surge prediction models using the multiple linear regression (MLR) technique. The selected locations include Geomundo, Goheung, Yeosu, Gwangyang, Tongyeong, Masan, Geojedo, Gadeokdo, Busan, Ulsan and Pohang. Figure 2b presents the locations of tide gauge stations installed within the area of interest for this study, and Table 1 provides their detailed coordinates.

2.2. Data

2.2.1. Independent Variable (Predictors)

In this study, typhoons that passed through the region spanning from 32° N to 40° N and from 122° E to 132° E over the period from 1979 to 2020, as shown in Figure 3, were defined as typhoons that affected the KP. A total of 155 typhoons were considered in this study, and their key characteristics are presented in Table A1. The characteristics of typhoons were used as input variables. Typhoon data were obtained from the best track dataset provided by IBTrACS.
IBTrACS is an international database that integrates best track data of tropical cyclones worldwide [38]. It is a project developed by the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA), initiated to unify tropical cyclone track records that were previously managed independently by Regional Specialized Meteorological Centers (RSMCs) and Tropical Cyclone Warning Centers (TCWCs) across different ocean basins [38,39]. The primary objective of IBTrACS is to standardize these records, originally archived using different formats and criteria by each agency, into a consistent dataset that is easily accessible and usable by researchers. The IBTrACS dataset [40] covers the global ocean region from 70° N to 70° S and from 180° W to 180° E, and classifies data into seven basins: North Atlantic (NA), Eastern Pacific (EP), Western Pacific (WP), North Indian Ocean (NI), South Indian Ocean (SI), South Pacific (SP), and South Atlantic (SA). IBTrACS contains records dating back to 1842 and is continuously updated; as of 24 July 2025, the most recent version is v04r01. The dataset generally provides data at 3-h intervals and focuses on storm center position, maximum sustained wind speed, and minimum central pressure. Additionally, depending on the source agency, it may include other tropical cyclone-related parameters such as radius of maximum winds, environmental pressure, and storm classification.
In this study, among the various tropical cyclone characteristics provided by IBTrACS, the independent variables for the multiple linear regression model were selected based on the “TOKYO” data. These data are issued by the World Meteorological Organization Regional Specialized Meteorological Center in Tokyo, operated by the Japan Meteorological Agency, the official forecasting authority for typhoons in the western North Pacific, with records available from 1951 to the present. The selected variables include the latitude and longitude of the typhoon center, the maximum sustained wind speed and central pressure at the typhoon center, the translational speed of the typhoon, the distance between the typhoon center and a specific location, and the angle of approach of the typhoon relative to that location. These variables were selected based on their known influence on storm surge dynamics and their availability from typhoon track datasets [29,30,31,32,33,34]. The angle of approach was calculated clockwise from true north. For instance, if the point of interest is located in the upper-right quadrant relative to the typhoon center, the angle of approach falls within the range of 0° to 90°.
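The two geometric predictors described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a spherical Earth with a mean radius of 6371 km, a great-circle (haversine) distance, and an initial bearing measured from the typhoon center toward the station. The function names and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def approach_angle_deg(typhoon_lat, typhoon_lon, station_lat, station_lon):
    """Bearing from the typhoon center to the station, clockwise from true north, in [0, 360)."""
    p1, p2 = math.radians(typhoon_lat), math.radians(station_lat)
    dlam = math.radians(station_lon - typhoon_lon)
    east = math.sin(dlam) * math.cos(p2)
    north = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlam)
    return math.degrees(math.atan2(east, north)) % 360.0


# Hypothetical example: a typhoon center south-west of a station near Busan.
dis = distance_km(30.0, 125.0, 35.1, 129.0)
ang = approach_angle_deg(30.0, 125.0, 35.1, 129.0)  # station lies to the north-east
```

With the station to the north-east of the typhoon center, the computed angle falls between 0° and 90°, matching the quadrant convention stated in the text.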

2.2.2. Dependent Variable (Predictand)

To better understand and predict various oceanographic phenomena occurring along the Korean coast—such as tides, storm surges, and sea level rise—a nationwide tide gauge network consisting of approximately 50 stations has been established and is operated primarily by the Korea Hydrographic and Oceanographic Agency (KHOA) [41,42]. The temporal resolution of tide observations varies by station, with data available at intervals of 1 min, 10 min, or 1 h. KHOA provides two types of tide observation datasets with a temporal resolution of 1 h, both of which have undergone quality control procedures, including raw time series processing, gap filling, and outlier removal [43].
In this study, as mentioned in Section 2.1, the dependent variable for the multiple linear regression model was the storm surge height observed at eleven tide gauge stations located along the southeastern coast of the KP, the designated study area. The storm surge height was calculated by removing the astronomical tide components and the annual mean sea level from the quality-controlled water level records at one-hour intervals at each tide gauge station. The astronomical tide components were estimated using the default setting of the T_tide MATLAB (R2025a) toolbox [44].
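The residual calculation can be illustrated with a short sketch. It assumes the astronomical tide has already been predicted by a separate harmonic analysis (T_tide in the study; not reproduced here) and that the annual mean is taken over the de-tided record of each calendar year. The function name and the synthetic series are hypothetical.

```python
import numpy as np


def storm_surge_height(water_level, tide, year):
    """Residual = observed water level - astronomical tide - annual mean sea level."""
    water_level = np.asarray(water_level, dtype=float)
    tide = np.asarray(tide, dtype=float)
    year = np.asarray(year)
    surge = np.empty_like(water_level)
    for y in np.unique(year):
        mask = year == y
        detided = water_level[mask] - tide[mask]
        # Remove that year's mean level (assumption: mean of the de-tided record).
        surge[mask] = detided - np.nanmean(detided)
    return surge


# Synthetic hourly example: an M2-like tide plus a different mean level in each "year".
t = np.arange(48)
year = np.repeat([2019, 2020], 24)
tide = np.cos(2 * np.pi * t / 12.42)
level = tide + np.where(year == 2019, 0.80, 0.85)
residual = storm_surge_height(level, tide, year)  # close to zero everywhere
```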

2.3. Multiple Linear Regression

Multiple linear regression (MLR) is a widely adopted statistical approach for quantifying the relationship between a set of independent (explanatory) variables and a dependent variable. By constructing a mathematical model based on observed data, MLR enables the prediction of the dependent variable’s behavior as a function of several explanatory factors. The general structure of an MLR model is expressed as follows:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$
where $Y$ denotes the dependent variable, $X_i$ ($i = 1, \ldots, n$) represent the independent variables, and $\beta_i$ are the regression coefficients estimated by the least squares method. Although MLR assumes linear and additive relationships among variables, it has demonstrated robust applicability in modeling complex real-world phenomena [45,46,47].
In this study, the dependent variable is the observed storm surge height (SSH) at each station. The set of independent variables used in the regression analysis comprises the typhoon’s latitude (TOKYO_LAT), longitude (TOKYO_LON), the angle (ang) and distance (dis) between the typhoon center and the observation station, typhoon speed (speed_km), and central pressure (TOKYO_PRES).
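As a minimal sketch of the least-squares fit, the coefficients can be estimated with NumPy as shown below. The helper names `fit_mlr` and `predict_mlr` are hypothetical, and the data are synthetic. In the study, the columns of `X` would be the predictors listed above (TOKYO_LAT, TOKYO_LON, ang, dis, speed_km, TOKYO_PRES) and `y` the observed SSH.

```python
import numpy as np


def fit_mlr(X, y):
    """Return least-squares estimates [beta_0, beta_1, ..., beta_n]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # leading column of ones for beta_0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_mlr(beta, X):
    """Evaluate Y = beta_0 + sum_i beta_i * X_i for each row of X."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


# Synthetic example with two predictors and known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 0.1 + 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1]
beta_demo = fit_mlr(X_demo, y_demo)  # recovers approximately [0.1, 0.5, -0.2]
```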
Prior to model development, the dataset was rigorously preprocessed to ensure statistical validity and robustness. Timestamps with missing values or with storm surge heights less than zero were removed, together with the corresponding values of the dependent and independent variables. Negative values were excluded to avoid the phenomenon in which the storm surge height becomes negative as a typhoon approaches, known as a “negative storm surge” or “reverse storm surge”, which occurs through a mechanism opposite to that of a typical storm surge [48,49,50]. It should be noted that, due to the linear formulation of MLR, the regression occasionally yielded slightly negative predicted SSH values. These outputs are physically unrealistic because negative storm surges (reverse surges) were excluded from the dataset and are not the focus of this study. In practical applications, such negative predictions can be safely truncated to zero to ensure physical consistency and to prevent false positives.
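The row filtering and the truncation of negative predictions might look like the following sketch. The function names are hypothetical and missing values are assumed to be encoded as NaN.

```python
import numpy as np


def drop_invalid_rows(X, ssh):
    """Keep rows with no missing values and a non-negative observed SSH."""
    X = np.asarray(X, dtype=float)
    ssh = np.asarray(ssh, dtype=float)
    keep = np.isfinite(ssh) & np.isfinite(X).all(axis=1)
    # Treat missing SSH as negative so that it fails the >= 0 test.
    keep &= np.nan_to_num(ssh, nan=-1.0) >= 0.0
    return X[keep], ssh[keep]


def clip_negative_predictions(pred):
    """Truncate physically unrealistic negative SSH predictions to zero."""
    return np.maximum(np.asarray(pred, dtype=float), 0.0)


# Synthetic example: only the first row survives the filter.
X_raw = [[1.0, 2.0], [np.nan, 1.0], [3.0, 4.0], [5.0, 6.0]]
ssh_raw = [0.10, 0.20, -0.05, np.nan]
X_clean, ssh_clean = drop_invalid_rows(X_raw, ssh_raw)
```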
Two different data splitting strategies were applied. In the first approach, all available time series data, regardless of typhoon event, were pooled and randomly split into training and testing sets at a 7:3 ratio. In the second approach, data were grouped by individual typhoon events, and the list of typhoons was split into training and testing groups (7:3 ratio), after which each station’s time series data were merged accordingly. This dual strategy was designed to test whether pooling the time series of all typhoon events influences the regression results, allowing for the evaluation of model performance both within and across typhoon events.
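The difference between the two strategies is easiest to see in code. The sketch below is an illustration under stated assumptions rather than the study's implementation: the function names, the random seed, and the rounding rule for the test fraction are all hypothetical.

```python
import numpy as np


def pooled_split(n_rows, test_frac=0.3, seed=0):
    """Strategy 1: shuffle all rows, ignoring which typhoon they belong to."""
    idx = np.random.default_rng(seed).permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    return idx[n_test:], idx[:n_test]  # train indices, test indices


def event_split(event_ids, test_frac=0.3, seed=0):
    """Strategy 2: split the list of typhoons, then assign all rows of each typhoon together."""
    event_ids = np.asarray(event_ids)
    events = np.random.default_rng(seed).permutation(np.unique(event_ids))
    n_test = int(round(test_frac * len(events)))
    test_mask = np.isin(event_ids, events[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Synthetic example: 10 typhoons with 5 hourly records each.
ids = np.repeat(np.arange(10), 5)
train_idx, test_idx = event_split(ids)
```

With the event-based split, no typhoon contributes rows to both sets, so the test score reflects performance on unseen events. Under this rounding rule, 76 events yield 23 test events and 53 training events.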
All input variables were standardized using z-score scaling prior to regression modeling to ensure comparability among predictors and improve numerical stability. To address potential multicollinearity among explanatory variables, variance inflation factors (VIFs) were computed for all predictors. The VIF is a commonly used diagnostic that quantifies the extent to which the variance of a regression coefficient is inflated due to multicollinearity with other variables. Generally, a VIF greater than 5 or 10 is considered indicative of problematic multicollinearity. Accordingly, in this study, only independent variables with a VIF value less than 5 were used in the regression analysis, thereby ensuring that multicollinearity does not pose a significant problem.
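A compact sketch of the scaling and VIF screening is given below. The VIF of predictor j is computed as 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The text states only that predictors with VIF below 5 were retained. The iterative rule used here, dropping the predictor with the largest VIF until all are below the limit, is an assumption, as are the function names.

```python
import numpy as np


def zscore(X):
    """Standardize each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def vif(X):
    """Variance inflation factor for each column of X (needs at least two columns)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ beta).var() / y.var()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def select_by_vif(X, names, limit=5.0):
    """Assumed rule: repeatedly drop the predictor with the largest VIF until all are below limit."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] < limit:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Synthetic example: x3 is nearly a copy of x1, so one of the two is removed.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
X_std = zscore(np.column_stack([x1, x2, x3]))
X_sel, kept = select_by_vif(X_std, ["x1", "x2", "x3"])
```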
All analyses, including data preprocessing, scaling, VIF-based feature selection, model training, and performance evaluation were performed using Python 3.9. Executable code and the corresponding datasets have been provided as supplementary data to ensure transparency and reproducibility.

2.4. Objective Functions

To objectively evaluate the predictive performance of the MLR models developed in this study, four widely used statistical criteria were employed: mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and coefficient of determination (R2).
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is given by:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| O_i - S_i \right|$$
where $O_i$ and $S_i$ are the observed and simulated (predicted) values, respectively, and $n$ is the number of data points. Lower MAE values indicate higher model accuracy. MAE is widely used because it provides a straightforward interpretation of the average prediction error in the original units of the target variable.
MSE represents the average of the squared differences between observed and predicted values:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( O_i - S_i \right)^2$$
MSE penalizes larger errors more than smaller ones due to the squaring of the residuals, making it particularly sensitive to outliers. It is commonly used in regression analysis for model calibration and comparison.
RMSE is the square root of the mean squared error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( O_i - S_i \right)^2}$$
This criterion expresses the model error in the same units as the observed variable, aiding in interpretation and practical assessment of the model’s predictive performance. A lower RMSE value indicates better model performance.
R2 quantifies the proportion of the variance in the observed data that is explained by the model. It is defined as:
$$R^2 = \left[ \frac{\sum_{i=1}^{n} (O_i - \bar{O})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2} \, \sqrt{\sum_{i=1}^{n} (S_i - \bar{S})^2}} \right]^2$$
where $\bar{O}$ and $\bar{S}$ are the mean observed and predicted values, respectively. R2 values range from 0 to 1, with higher values indicating better model fit. In general, R2 values greater than 0.5 are considered acceptable for applications [49,50,51].
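The four criteria can be written directly from the definitions above. The sketch uses NumPy and hypothetical function names. Note that R2 as defined here is the squared Pearson correlation between observed and predicted values, so it is bounded by 0 and 1 and does not penalize a constant bias or a scaling error in the predictions.

```python
import numpy as np


def mae(obs, sim):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(obs, float) - np.asarray(sim, float))))


def mse(obs, sim):
    """Mean squared error."""
    return float(np.mean((np.asarray(obs, float) - np.asarray(sim, float)) ** 2))


def rmse(obs, sim):
    """Root mean square error, in the units of the observations."""
    return float(np.sqrt(mse(obs, sim)))


def r2(obs, sim):
    """Squared Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(np.asarray(obs, float), np.asarray(sim, float))[0, 1] ** 2)


# Small worked example: errors are [0, 0, 0, 2].
obs_demo = [0.0, 1.0, 2.0, 3.0]
sim_demo = [0.0, 1.0, 2.0, 5.0]
scores_demo = (mae(obs_demo, sim_demo), mse(obs_demo, sim_demo), rmse(obs_demo, sim_demo))
```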

3. Results

3.1. Effect of Typhoon Event Grouping on Model Performance

3.1.1. Without Consideration of Individual Typhoon Events

To evaluate the impact of typhoon–station distance on regression model performance without explicit consideration of individual typhoon events, all available station data were pooled and randomly split into training and testing sets at a 7:3 ratio. The regression models were then constructed and evaluated across a range of distance thresholds, from 2000 km down to 500 km (Table 2).
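The threshold sweep can be sketched as a loop that filters records by typhoon–station distance, refits the regression, and records the test R2. Everything below is illustrative: the function name is hypothetical, and the data are synthetic, constructed so that the surge responds to the pressure deficit only when the typhoon is within 1000 km.

```python
import numpy as np


def r2_by_distance_threshold(X, ssh, dist_km, thresholds, test_frac=0.3, seed=0):
    """Fit one MLR per distance threshold and return {threshold: test R2}."""
    X = np.asarray(X, dtype=float)
    ssh = np.asarray(ssh, dtype=float)
    dist_km = np.asarray(dist_km, dtype=float)
    results = {}
    for thr in thresholds:
        keep = dist_km <= thr  # drop records from distant typhoons
        Xk, yk = X[keep], ssh[keep]
        idx = np.random.default_rng(seed).permutation(len(yk))
        n_test = int(round(test_frac * len(yk)))
        test, train = idx[:n_test], idx[n_test:]  # pooled random 7:3 split
        A_train = np.column_stack([np.ones(len(train)), Xk[train]])
        beta, *_ = np.linalg.lstsq(A_train, yk[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), Xk[test]]) @ beta
        results[thr] = float(np.corrcoef(yk[test], pred)[0, 1] ** 2)
    return results


# Synthetic data (not from the study): distant typhoons produce no surge signal.
rng = np.random.default_rng(1)
n = 4000
dist = rng.uniform(0.0, 2000.0, n)
deficit = rng.uniform(0.0, 60.0, n)
ssh_syn = np.where(dist <= 1000.0, 0.005 * deficit, 0.0) + rng.normal(0.0, 0.02, n)
X_syn = np.column_stack([deficit, dist])
scores = r2_by_distance_threshold(X_syn, ssh_syn, dist, thresholds=[2000, 1000, 500])
```

In this synthetic setting the 1000 km threshold scores higher than the 2000 km threshold, because the records beyond 1000 km carry no linear relationship with the predictor.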
The results show that model predictive performance, as measured by test R2, is lowest when the distance threshold is set to 2000 km, with a general improvement as the threshold is reduced. The highest predictive performance is typically achieved when the distance threshold is set at either 1000 km or 900 km, after which further reductions in the threshold do not lead to substantial improvement and may even slightly decrease model accuracy. Among individual stations, however, Gwangyang exhibited the highest R2 value (0.5588) at a threshold of 500 km, while Geojedo station recorded the second-highest (0.4220) at the same distance. When all stations were combined into a single dataset (“Total” row), the best R2 values (0.19) were observed at 1000 km and 900 km thresholds.
This trend suggests that the inclusion of data from distant typhoon events (i.e., those with typhoon centers located more than 1000 km from the observation site) introduces substantial noise to the regression model, likely because such events do not meaningfully affect local storm surge heights. By excluding these data points, the model focuses on physically relevant typhoon events, thereby increasing explanatory power and prediction accuracy. This finding aligns with the physical expectation that storm surge responses are most pronounced when a typhoon is within a certain proximity to the observation site.
The results further indicate spatial variability in model performance across stations. Notably, the Gwangyang and Geojedo stations consistently display higher R2 values at shorter distance thresholds compared to other locations. This may be attributed to local geographic and bathymetric conditions—such as coastline orientation, shallow continental shelves, or exposure to typhoon tracks—that amplify the sensitivity of these sites to nearby typhoon-induced surges [52,53,54].
Figure 4 illustrates the predictive performance of the regression model using all station data at each distance threshold. In panels (a) through (c) (2000 km to 1000 km), red ellipses highlight regions where predicted values remain near zero despite variation in observed values. These points, which are gradually eliminated as the threshold decreases, do not contribute to the predictive capability of the model. In panels (e) through (h) (800 km to 500 km), blue ellipses indicate areas where the observed values exhibit sharp increases, but predicted values fail to exceed approximately 0.2 m, reflecting the model’s limited ability to capture extreme observed SSHs in these cases.
In summary, the distance-based thresholding of input data was found to be a critical factor in optimizing regression model performance, and the observed spatial differences among stations underscore the importance of local factors in storm surge prediction.

3.1.2. With Typhoon Event Grouping

To assess whether grouping data by individual typhoon events can improve regression model performance, we applied a data partitioning strategy in which the entire set of 76 typhoon events was randomly split into training and testing groups in a 7:3 ratio. Specifically, 53 typhoon events were used for model training, while the remaining 23 events were reserved for independent testing (Table 3).
Overall, the event-based grouping approach led to modest improvements in predictive performance compared to the results without grouping (Section 3.1.1, Table 2). For example, at Goheung station, the R2 value exceeded 0.4 at the 700–600 km threshold, which was not observed in the non-grouped analysis. Similarly, stations such as Gadeokdo, Geomundo, Yeosu, Tongyeong, and Pohang showed R2 increases of approximately 0.02–0.05 under event-based grouping. However, the effect was not uniformly positive. In some cases, such as Goheung at 2000–1000 km thresholds, R2 decreased by 0.03–0.08, and Masan and Geojedo stations exhibited both improvements and reductions in R2 values depending on the distance threshold.
These findings indicate that event-based grouping helps preserve the temporal coherence and intra-event variability inherent in typhoon time series data, reducing potential information leakage between training and testing sets. This leads to a more realistic estimate of the model’s generalization ability for future, unseen typhoon events. The observed improvements are particularly notable at stations more exposed to typhoon impacts and at shorter distance thresholds, which aligns with the expectation that local bathymetry, coastline orientation, and storm track proximity all play significant roles in storm surge predictability.
Direct comparison of Table 2 and Table 3 reveals that the greatest R2 improvements from event grouping were achieved at Goheung (up to +0.07 at a 700 km threshold) and Gwangyang (up to +0.03 at a 500 km threshold). For some stations, such as Masan and Geojedo, R2 values fluctuated, reflecting the complex interaction of local factors and data partitioning method.
From an operational standpoint, the event-based grouping approach is advantageous because it better simulates real-world forecasting scenarios, in which the model must predict storm surge heights for entirely new typhoon events, thereby supporting more robust early warning system development. However, it should be noted that grouping by event reduces the effective sample size in both training and testing datasets, which may affect the statistical stability of regression results, especially at shorter distance thresholds where sample sizes are inherently smaller.
Figure 5 presents the regression model test results for all stations under event-based grouping across various typhoon–station distance thresholds. In each panel, most data points with observed SSH values below 0.3 m cluster densely, with close correspondence between observed and predicted values. For observed SSH values exceeding 0.3 m, however, the model persistently underestimates, with predicted values rarely surpassing 0.2 m. This limitation remains even as the distance threshold decreases, although the range of predicted values becomes somewhat broader at lower thresholds. This suggests that while event-based grouping enhances performance for moderate SSH events, the regression model continues to struggle with predicting extreme storm surge events.
In summary, grouping by typhoon events can yield slight improvements in regression model accuracy and generalizability, particularly for stations and thresholds most affected by typhoon surges. Nevertheless, the model’s limitations in capturing the highest observed SSH values persist. To address this issue, the following section presents an additional analysis applying threshold values to the observed SSH data in order to further investigate and potentially improve the model’s predictive capability.

3.2. Model Performance According to SSH Threshold Values

Since the regression analysis in Section 3.1 indicated that a distance threshold of 1000 km yielded the best model performance, the distance was fixed at 1000 km for subsequent analyses. Consistent with the approach in Section 3.1.2, the dataset was grouped by individual typhoon events prior to partitioning.
For the SSH thresholds, values were empirically determined based on the distribution of observed SSHs shown in Figure 5. Specifically, the threshold was estimated by identifying the point at which data began to cluster along the lower bound (red line), yielding an initial cut-off of approximately 0.2 m. The dataset was then divided into two groups, a low SSH range (0 ≤ SSH ≤ threshold) and a high SSH range (threshold < SSH), and regression modeling was performed separately for each group. To further assess model sensitivity, the threshold was subsequently increased in increments of 0.05 m, and the two-range regression was repeated at each threshold value.
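The partition into low and high ranges can be expressed as two boolean masks, as in the sketch below. The function names are hypothetical. The number of threshold values generated (`count`) is an assumption, since only the starting value of 0.2 m and the step of 0.05 m are stated in the text.

```python
import numpy as np


def split_by_ssh(ssh, threshold):
    """Return boolean masks for the low (0 <= SSH <= threshold) and high (SSH > threshold) ranges."""
    ssh = np.asarray(ssh, dtype=float)
    low = (ssh >= 0.0) & (ssh <= threshold)
    high = ssh > threshold
    return low, high


def candidate_thresholds(start=0.2, step=0.05, count=4):
    """Thresholds in metres, rounded to avoid floating-point drift. `count` is an assumed value."""
    return [round(start + k * step, 2) for k in range(count)]


# Synthetic example: a separate regression would then be fitted on each mask.
ssh_demo = [0.0, 0.1, 0.2, 0.21, 0.5]
low_mask, high_mask = split_by_ssh(ssh_demo, threshold=0.2)
```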

3.2.1. Model Performance in the Low SSH Range

For the low SSH range, the results showed that as the threshold increased—thereby broadening the low SSH interval—the test R2 values generally improved across most stations (Table 4). This trend suggests that expanding the low range to include a wider range of SSH values, not only the smallest but also those approaching the threshold, enables the regression model to better capture the underlying linear relationship between storm surge height and the explanatory variables. In very narrow low ranges (e.g., threshold = 0.2 or 0.25), the regression outcomes are often more scattered, and linearity is not clearly observed, due to statistical noise and limited variability in the observed SSH values. As the threshold increases, the inclusion of more SSH cases closer to the threshold leads to a clearer linear structure and stronger model fit, as reflected in the increasing R2.
It is noteworthy that a few stations, such as Geojedo and Goheung, exhibited higher R2 values when the low range was more restricted. This phenomenon may be associated with a limited number of data points or the presence of a more pronounced linear relationship within a narrower SSH interval at these locations. In such cases, the calculated model performance becomes highly sensitive to both data range and distribution, which may yield higher R2 values for small but well-structured datasets.
Figure 6 presents the scatter plots of predicted versus observed SSH for all stations across various threshold values. For lower thresholds, predicted SSH values are largely confined within a narrow range, exhibiting minimal sensitivity to variations in the observed SSH. As the threshold increases, this pattern persists: the predicted values remain concentrated within a limited interval, even as the observed SSH demonstrates greater variability. Consequently, the regression model systematically underestimates higher observed SSH values within the low range, resulting in a plateau effect rather than improved alignment with the 1:1 reference line. These findings indicate that increasing the threshold does not substantively enhance the model’s ability to reproduce the full spectrum of observed SSH values. The improvement in R2 at higher thresholds is therefore primarily attributable to the statistical effects of expanding the data range, rather than to genuine advancements in predictive performance.
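The dependence of R2 on the spread of the data can be illustrated with a small synthetic example. The sketch below is an illustration under stated assumptions rather than a reproduction of the study: it uses a single hypothetical predictor in place of the full MLR, and the data are constructed so that the scatter about the underlying trend is identical at every threshold. All names and values are invented for the example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_and_rmse(x, y):
    """Fit a line to (x, y) and return (R2, RMSE) of that fit."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot, (ss_res / len(y)) ** 0.5


# Synthetic data (not from the study): a linear trend with a fixed +/-0.025 m scatter.
x_all = list(range(60))
y_all = [0.01 * i + (0.05 if i % 2 == 0 else 0.0) for i in x_all]

results = {}
for threshold in (0.2, 0.3, 0.4):
    # Keep only the "low range" observations, as in the threshold-based split.
    kept = [(xi, yi) for xi, yi in zip(x_all, y_all) if 0.0 <= yi <= threshold]
    xs, ys = zip(*kept)
    results[threshold] = r2_and_rmse(xs, ys)
    r2, rmse = results[threshold]
    print(f"threshold = {threshold:.1f} m: n = {len(xs)}, R2 = {r2:.2f}, RMSE = {rmse:.3f} m")
```

In this construction the RMSE is nearly the same at every threshold while R2 climbs, because the total variance of the retained observations grows with the range. A rising R2 on its own therefore does not show that individual predictions have become more accurate.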
Overall, these results highlight the limitations of the regression model in accurately capturing the variability of SSH in the low range, regardless of the threshold setting, and suggest that alternative modeling strategies or the inclusion of additional explanatory variables may be required to improve predictive skill.

3.2.2. Model Performance in the High SSH Range

For the high SSH range (threshold < SSH), the regression model’s predictive performance generally improved as the threshold value increased, indicating that more selective data curation led to enhanced model skill (Table 5). With the exception of Masan and Pohang, all stations exhibited increasing R2 values with higher thresholds. This trend reflects that, for more extreme storm surge events, the model is better able to capture linear relationships among the input variables, possibly because the variance and dynamic range of SSH are higher in this subset. This observation is supported by the calculated variance and range of storm surge height values across different threshold datasets (see Table S1 in the Supplementary Materials), which demonstrate that both variance and range increase with higher thresholds. This finding confirms that more selective data curation in the high storm surge height range enhances the linear relationship between predictors and storm surge heights.
However, it is important to note that as the threshold increases, the number of available data points within the high SSH range decreases. In several cases, particularly at the higher thresholds, the sample size became insufficient to conduct regression analysis, as indicated by empty cells in Table 5. This reduction in sample size may introduce risks of overfitting and statistical instability, as regression estimates become more sensitive to individual data points when the dataset is small. Although both p-values and variance inflation factors (VIFs) were carefully checked in this study to mitigate these risks, the potential for overfitting remains a consideration that warrants caution in the interpretation of these results.
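For readers who wish to repeat a multicollinearity check of this kind, one common way to obtain VIFs is from the diagonal of the inverse of the predictors' correlation matrix, which is equivalent to VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. The pure-Python sketch below assumes that definition. The function names and predictor values are invented for illustration, and the often-quoted cut-off of 10 is a general rule of thumb, not a value reported in this study.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long predictor columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor of each predictor column."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors: minimum pressure, maximum wind speed, typhoon-station distance (km).
p_min = [950.0, 960.0, 970.0, 980.0, 990.0, 1000.0]
u_max = [85.0, 78.0, 70.0, 61.0, 50.0, 42.0]
distance_km = [300.0, 900.0, 500.0, 1200.0, 700.0, 400.0]

vifs = vif([p_min, u_max, distance_km])
for name, value in zip(("p_min", "u_max", "distance_km"), vifs):
    flag = "check for collinearity" if value > 10 else "ok"  # rule-of-thumb cut-off
    print(f"{name}: VIF = {value:.1f} ({flag})")
```

In this invented dataset the pressure and wind columns are almost perfectly anti-correlated, so both receive a very large VIF, while the distance column does not.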
Table 5 summarizes the test R2 values for each station at varying SSH thresholds in the high range. Figure 7 presents the scatter plots for each station at the threshold yielding the highest R2 value. Compared to the low SSH range, the high range results demonstrate clearer linearity and improved model performance. For example, the Masan station achieved the highest R2 value of 0.82. With the exception of Goheung, Yeosu, Ulsan, and Tongyeong, most stations exhibited R2 values exceeding 0.5, indicating a substantial gain in explanatory power within this range.
Although Figure 7c,e shows relatively high R2 values, the predicted SSH values are more widely scattered when the observed values cluster around 0.2–0.3 m. This dispersion arises because global linear regression attempts to preserve a single linear trend across the entire dataset, leading to systematic deviations in ranges where the true relationship is nearly flat. Consequently, small variations in predictors translate into disproportionately large differences in fitted SSH. Such behavior reflects the inherent limitation of global linear regression, which cannot fully capture localized nonlinear relationships, and the increased residual sensitivity near thresholds with high sample density, potentially inducing heteroscedasticity [55]. In addition, nonlinear tide–surge interactions and site-specific coastal geometry may contribute to the variability in this range [52,53,54]. In future research, we plan to incorporate geographical factors such as the relative position of observation stations to typhoon tracks, coastline configuration, and the presence of bays to reduce such dispersion and improve model performance.
Overall, these findings suggest that, although the regression model shows limited performance for smaller storm surge heights, it is able to capture the relationships among typhoon characteristics and SSH more effectively for higher surge events, provided sufficient data are available.

4. Discussion

This study demonstrated that storm surge prediction using MLR can be significantly improved by careful data selection and preprocessing, particularly with respect to the typhoon–station distance and the application of SSH thresholds. Although Tadesse and Wahl [36] reported that atmospheric pressure, particularly lagged pressure, was a dominant predictor in many regions globally, typhoons approaching the Korean Peninsula generally weaken in intensity, making pressure- or wind-based thresholds less effective. Instead, distance- and SSH-based thresholds were adopted, as they better capture the regional characteristics of storm surges along the Korean Peninsula. Yang et al. [15] also demonstrated that storm surge heights are strongly influenced by the distance between the typhoon center and the point of interest, supporting the validity of the present findings. Our results indicated that when the distance between a typhoon center and the observation station was restricted to within 900–1000 km, and when the dataset was partitioned according to SSH ranges, the overall predictive performance of the regression models was enhanced. This improvement can be attributed to the exclusion of data points corresponding to events with negligible storm surge response, thereby reducing noise and focusing model training on physically meaningful cases.
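A distance restriction of this kind can be sketched as follows. The text here does not state how the typhoon–station distance was computed, so the haversine formula on a spherical Earth of radius 6371 km is an assumption, as are the function names, the station coordinates, and the track points, all of which are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical-Earth assumption


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def within_distance(track_points, station, max_km):
    """Keep only the track points lying within max_km of the station."""
    return [pt for pt in track_points
            if great_circle_km(pt[0], pt[1], station[0], station[1]) <= max_km]


# Hypothetical station and typhoon-center positions as (latitude, longitude) in degrees.
station = (35.2, 128.6)
track = [(20.0, 130.0), (30.0, 129.0), (34.5, 128.8)]

kept = within_distance(track, station, max_km=1000.0)
print(f"{len(kept)} of {len(track)} track points lie within 1000 km of the station")
```

Only the retained track points would then be passed on to the regression, which removes distant positions that produce little or no surge response at the station.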
Despite these improvements, the model’s predictive skill in the low SSH range remained relatively limited. The results suggest that storm surge heights in this regime may be influenced by factors or interactions not fully captured by the linear and additive assumptions of MLR. This is further supported by the observed plateauing of predicted values and the limited alignment with the 1:1 reference line in scatter plots, indicating that the linear model may not adequately capture the underlying dynamics of storm surge events, particularly for lower-magnitude occurrences.
In contrast, in the high SSH range, the regression models achieved notably higher R2 values, with several stations exhibiting values greater than 0.5. This can be ascribed to the increased variance and dynamic range of SSH in this interval, which facilitates the identification of linear relationships among predictors. Nevertheless, it must be acknowledged that as the SSH threshold increases, the number of data points in the high range decreases, potentially resulting in overfitting and reduced statistical robustness. Although this study implemented rigorous checks of statistical significance (p-values) and multicollinearity (VIFs), these limitations should be considered when interpreting the results.
Given these findings, future research should explore the use of nonlinear or ensemble modeling approaches, such as random forest regression, gradient boosting, or neural networks, which may better accommodate the inherent nonlinearity and complex interactions in storm surge processes. In addition, expanding the range of predictor variables to include real-time sea level observations, additional meteorological parameters, and local bathymetric features may further enhance predictive performance [15,35,36]. The integration of physics-based numerical models with data-driven statistical approaches also warrants investigation as a potential pathway for achieving greater accuracy and reliability. In this context, storm surge height is affected not only by typhoon characteristics but also by the site-specific conditions of the location of interest (e.g., [15,56,57,58,59]). Because the threshold was defined based on the distance between the typhoon and the observation site, the same model performance cannot be guaranteed in other regions, even if the threshold value is identical. Therefore, when the model development approach of this study is applied to other regions with the same threshold values, the predictive performance should be examined separately for those regions.
From an operational perspective, the methodology and findings presented here offer valuable insights for the development of storm surge early warning systems and coastal risk management. Nevertheless, additional validation and calibration across a broader range of sites, as well as under diverse climatic and typhoon scenarios, will be required to generalize the applicability of these results.

5. Conclusions

This study evaluated the performance of multiple linear regression models for storm surge prediction, focusing on the influence of typhoon–station distance and SSH thresholds on model accuracy. The results demonstrated that careful data partitioning—restricting typhoon events to within 900–1000 km of the observation site and separating data into low and high SSH ranges—significantly improved predictive skill, especially for high surge events. However, the linear regression model exhibited limitations in accurately capturing the variability of lower-magnitude storm surges, highlighting the complexity and inherent nonlinearity of storm surge processes.
The methodology and findings presented herein provide valuable insights for the development of coastal risk management strategies and early warning systems. Nevertheless, further research is needed to address the identified limitations, particularly by applying nonlinear and ensemble modeling approaches, expanding the set of predictor variables, and validating the models across broader spatial and climatic contexts. Continued advancements in this field will support more reliable storm surge forecasting and enhanced coastal resilience in the face of extreme weather events.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jmse13091655/s1, Table S1: Storm surge height variance range by threshold.

Author Contributions

Conceptualization, J.-A.Y. and Y.L.; methodology, J.-A.Y. and Y.L.; software, Y.L.; validation, Y.L.; formal analysis, Y.L.; investigation, J.-A.Y. and Y.L.; resources, J.-A.Y. and Y.L.; data curation, J.-A.Y. and Y.L.; writing—original draft preparation, J.-A.Y. and Y.L.; writing—review and editing, J.-A.Y. and Y.L.; visualization, J.-A.Y. and Y.L.; supervision, J.-A.Y.; funding acquisition, J.-A.Y. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Korea Meteorological Administration (grant number RS-2024-00404973). This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00249996).

Data Availability Statement

All data and code used in the manuscript are openly available at the Zenodo repository: https://doi.org/10.5281/zenodo.16916353. The repository titled “MLR_code_and_data_for_SSH” contains all datasets and code necessary to reproduce the results presented in the manuscript.

Acknowledgments

Special thanks to Yoojin Song for her support in compiling the References and Abbreviations. During the preparation of this manuscript, the author utilized Perplexity Pro to assist with the literature search and analytical synthesis, and ChatGPT-4o to generate and refine the literature review text. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ADCIRC: Advanced Circulation Model for Oceanic, Coastal and Estuarine Waters
CFSR: Climate Forecast System Reanalysis
ERA5: European Centre for Medium-range Weather Forecasts Reanalysis v5
GTSR: Global Tide and Surge Reanalysis
IBTrACS: International Best Track Archive for Climate Stewardship
KNN: K-Nearest Neighbors
MAE: Mean Absolute Error
MLR: Multiple Linear Regression
MSE: Mean Squared Error
NARR: North American Regional Reanalysis
NCEI: National Centers for Environmental Information
NYHOPS: New York Harbor Observing and Prediction System
R2: Coefficient of Determination
RMSE: Root Mean Square Error
RSMC: Regional Specialized Meteorological Centre
SSH: Storm Surge Height
TCWC: Tropical Cyclone Warning Center
VIFs: Variance Inflation Factors

Appendix A

Table A1. List of typhoons that affected Korea, defined as those that passed through the region from 32° N to 40° N and from 122° E to 132° E.
No. | Typhoon Name | Pmin | Umax | Typhoon Lifetime
1 | IRVING | 958 | 75 | 197987-1979820
2 | JUDY | 980 | 50 | 1979815-1979827
3 | KEN | 991 | 43 | 1979830-1979910
4 | IDA | 996 | NaN | 198075-1980715
5 | NORRIS | 1002 | NaN | 1980823-1980831
6 | ORCHID | 967 | 70 | 198091-1980916
7 | IKE | 1006 | NaN | 198167-1981617
8 | JUNE | 990 | 45 | 1981615-1981626
9 | OGDEN | 983 | NaN | 1981726-198181
10 | AGNES | 970 | 55 | 1981825-198196
11 | CLARA | 1004 | NaN | 1981913-1981102
12 | CECIL | 975 | 55 | 198281-1982819
13 | ELLIS | 955 | 70 | 1982817-198294
14 | FORREST | 968 | 70 | 1983916-1983930
15 | ALEX | 1004 | NaN | 1984628-198476
16 | HOLLY | 965 | 70 | 1984812-1984823
17 | GERALD | 1002 | NaN | 1984814-1984824
18 | JUNE | 1002 | NaN | 1984825-198493
19 | HAL | 996 | NaN | 1985611-1985628
20 | JEFF | 992 | 45 | 1985718-198583
21 | KIT | 970 | 70 | 1985730-1985817
22 | LEE | 980 | 60 | 198588-1985816
23 | ODESSA | 985 | 55 | 1985819-198592
24 | PAT | 965 | 70 | 1985824-198592
25 | BRENDAN | 980 | 70 | 1985925-1985108
26 | NANCY | 994 | 45 | 1986618-1986627
27 | VERA | 960 | 70 | 1986813-198692
28 | ABBY | 996 | NaN | 198699-1986924
29 | THELMA | 960 | 78 | 198776-1987718
30 | ALEX | 994 | NaN | 1987721-198782
31 | DINAH | 940 | 85 | 1987819-198793
32 | ELLIS | 990 | 40 | 1989618-1989625
33 | JUDY | 970 | 65 | 1989720-1989729
34 | VERA | 1002 | NaN | 1989911-1989919
35 | OFELIA | 996 | NaN | 1990615-1990626
36 | ROBYN | 992 | 40 | 1990629-1990714
37 | ABE | 996 | NaN | 1990822-199093
38 | CAITLIN | 945 | 80 | 1991718-1991730
39 | GLADYS | 975 | 50 | 1991813-1991824
40 | UNNAMED | 994 | 35 | 1991821-1991831
41 | KINNA | 965 | 70 | 199198-1991916
42 | MIREILLE | 935 | 95 | 1991913-1991101
43 | JANIS | 965 | 70 | 1992730-1992813
44 | IRVING | 994 | 40 | 1992730-199285
45 | KENT | 980 | 50 | 199283-1992820
46 | POLLY | 1000 | NaN | 1992823-199294
47 | TED | 992 | 45 | 1992914-1992927
48 | OFELIA | 990 | 40 | 1993724-1993729
49 | PERCY | 980 | 55 | 1993725-199381
50 | ROBYN | 945 | 85 | 1993730-1993814
51 | YANCY | 955 | 75 | 1993827-199397
52 | RUSS | 1004 | NaN | 199462-1994612
53 | WALT | 992 | 40 | 1994711-1994728
54 | BRENDAN | 992 | 45 | 1994725-199483
55 | DOUG | 985 | 48 | 1994730-1994813
56 | ELLIE | 970 | 65 | 199483-1994819
57 | FRED | 1004 | NaN | 1994812-1994826
58 | SETH | 975 | 55 | 1994930-19941016
59 | FAYE | 950 | 75 | 1995712-1995725
60 | JANIS | 990 | NaN | 1995817-1995830
61 | RYAN | 985 | 60 | 1995914-1995925
62 | EVE | 980 | 60 | 1996710-1996727
63 | KIRK | 960 | 75 | 1996728-1996818
64 | PETER | 975 | 60 | 1997615-199774
65 | TINA | 975 | 60 | 1997721-1997810
66 | OLIWA | 970 | 65 | 1997828-1997919
67 | YANNI | 975 | 55 | 1998924-1998102
68 | NEIL | 980 | 50 | 1999722-1999728
69 | OLGA | 975 | 60 | 1999726-199985
70 | PAUL | 992 | 35 | 1999731-199989
71 | RACHEL | 1000 | NaN | 199985-1999811
72 | SAM | 1004 | NaN | 1999817-1999827
73 | WENDY | 1006 | NaN | 1999829-199997
74 | ZIA | 990 | 40 | 1999911-1999917
75 | ANN | 994 | 38 | 1999914-1999920
76 | BART | 940 | 85 | 1999917-1999929
77 | DAN | 1012 | NaN | 1999101-19991012
78 | KAI-TAK | 994 | 35 | 200072-2000712
79 | BOLAVEN | 985 | 40 | 2000719-200082
80 | BILIS | 1001 | NaN | 2000817-2000827
81 | PRAPIROON | 965 | 70 | 2000824-200094
82 | SAOMAI | 970 | 60 | 2000831-2000919
83 | XANGSANE | 1003 | NaN | 20001024-2000112
84 | CHEBI | 1000 | NaN | 2001619-2001625
85 | RAMMASUN | 965 | 65 | 2002626-200277
86 | NAKRI | 996 | NaN | 200277-2002713
87 | FENGSHEN | 980 | 50 | 2002713-2002728
88 | RUSA | 960 | 70 | 2002822-200293
89 | KUJIRA | 1000 | NaN | 200348-2003425
90 | SOUDELOR | 975 | 60 | 200367-2003624
91 | MAEMI | 935 | 90 | 200394-2003916
92 | MINDULLE | 984 | 45 | 2004621-200475
93 | NAMTHEUN | 996 | 40 | 2004724-200483
94 | MEGI | 970 | 65 | 2004813-2004822
95 | CHABA | 955 | 80 | 2004817-200495
96 | SONGDA | 945 | 75 | 2004826-2004910
97 | MEARI | 975 | 60 | 2004918-2004102
98 | MATSA | 998 | NaN | 2005729-200589
99 | NABI | 955 | 75 | 2005828-200599
100 | KHANUN | 1000 | NaN | 200595-2005913
101 | CHANCHU | 996 | NaN | 200657-2006519
102 | EWINIAR | 975 | 60 | 2006629-2006712
103 | WUKONG | 980 | 45 | 2006812-2006821
104 | SHANSHAN | 950 | 80 | 200699-2006919
105 | MAN-YI | 955 | 70 | 200776-2007723
106 | USAGI | 960 | 80 | 2007727-200784
107 | PABUK | 995 | NaN | 200784-2007815
108 | NARI | 960 | 75 | 2007911-2007918
109 | WIPHA | 1005 | NaN | 2007914-2007920
110 | KROSA | 1010 | NaN | 2007101-20071014
111 | KALMAEGI | 994 | NaN | 2008711-2008724
112 | LINFA | 998 | NaN | 2009613-2009630
113 | MORAKOT | 998 | NaN | 200982-2009813
114 | DIANMU | 985 | 50 | 201086-2010813
115 | KOMPASU | 970 | 70 | 2010827-201096
116 | MALOU | 992 | 50 | 2010831-2010910
117 | MERANTI | 1003 | NaN | 201096-2010914
118 | MEARI | 980 | 55 | 2011620-2011627
119 | MUIFA | 973 | 63 | 2011726-2011815
120 | KULAP | 1012 | NaN | 201195-2011911
121 | KHANUN | 991 | 43 | 2012713-2012720
122 | DAMREY | 965 | 70 | 2012727-201284
123 | TEMBIN | 980 | 55 | 2012817-201291
124 | BOLAVEN | 960 | 65 | 2012818-201291
125 | SANBA | 940 | 85 | 2012910-2012918
126 | LEEPI | 1002 | NaN | 2013616-2013623
127 | DANAS | 965 | 65 | 2013101-2013109
128 | NEOGURI | 975 | 50 | 201472-2014713
129 | MATMO | 994 | NaN | 2014716-2014726
130 | NAKRI | 980 | 50 | 2014727-201484
131 | FUNG-WONG | 998 | 35 | 2014917-2014925
132 | VONGFONG | 975 | 60 | 2014101-20141016
133 | CHAN-HOM | 973 | 58 | 2015629-2015713
134 | HALOLA | 994 | 45 | 201576-2015726
135 | SOUDELOR | 998 | 35 | 2015729-2015812
136 | GONI | 945 | 85 | 2015813-2015830
137 | NAMTHEUN | 994 | 45 | 2016830-201695
138 | MERANTI | 1004 | NaN | 201698-2016917
139 | CHABA | 965 | 70 | 2016924-2016107
140 | NANMADOL | 985 | 55 | 201771-201778
141 | PRAPIROON | 965 | 60 | 2018627-201875
142 | JONGDARI | 992 | 45 | 2018723-201884
143 | LEEPI | 998 | 40 | 2018810-2018815
144 | SOULIK | 963 | 73 | 2018815-2018830
145 | KONG-REY | 975 | 65 | 2018927-2018107
146 | DANAS | 985 | 43 | 2019714-2019723
147 | FRANCISCO | 975 | 65 | 201981-2019811
148 | LINGLING | 963 | 73 | 2019830-2019912
149 | TAPAH | 975 | 60 | 2019917-2019923
150 | MITAG | 988 | 50 | 2019924-2019105
151 | HAGUPIT | 996 | NaN | 2020730-2020812
152 | JANGMI | 996 | 40 | 202086-2020814
153 | BAVI | 950 | 85 | 2020820-2020829
154 | MAYSAK | 950 | 80 | 2020826-202097
155 | HAISHEN | 945 | 85 | 2020830-2020910

References

  1. Masson-Delmotte, V.; Zhai, P.; Pirani, A.; Connors, S.L.; Péan, C.; Chen, Y.; Goldfarb, L.; Gomis, M.I.; Matthews, J.B.R.; Berger, S.; et al. Climate Change 2021: The Physical Science Basis. In Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2021; pp. 1–2391.
  2. Papadopoulos, N.; Gikas, V. Combined Coastal Sea Level Estimation Considering Astronomical Tide and Storm Surge Effects: Model Development and Its Application in Thermaikos Gulf, Greece. J. Mar. Sci. Eng. 2023, 11, 2033.
  3. Muis, S.; Verlaan, M.; Winsemius, H.C.; Aerts, J.C.J.H.; Ward, P.J. A global reanalysis of storm surges and extreme sea levels. Nat. Commun. 2016, 7, 11969.
  4. Antunes, C.; Lemos, G. A probabilistic approach to combine sea level rise, tide and storm surge into representative return periods of extreme total water levels: Application to the Portuguese coastal areas. Estuar. Coast. Shelf Sci. 2025, 323, 109060.
  5. Palmer, K.; Watson, C.S.; Power, H.E.; Hunter, J.R. Quantifying the Mean Sea Level, Tide, and Surge Contributions to Changing Coastal High Water Levels. J. Geophys. Res. Oceans 2024, 129, e2023JC020737.
  6. Goring, D.G.; Stephens, S.A.; Bell, R.G.; Pearson, C.P. Estimation of Extreme Sea Levels in a Tide-Dominated Environment Using Short Data Records. J. Waterw. Port Coast. Ocean Eng. 2011, 137, 150–156.
  7. Bernier, N.B.; Hemer, M.; Mori, N.; Appendini, C.M.; Breivik, O.; Camargo, R.D.; Casas-Prat, M.; Duong, T.M.; Haigh, I.D.; Howard, T.; et al. Storm surges and extreme sea levels: Review, establishment of model intercomparison and coordination of surge climate projection efforts (SurgeMIP). Weather. Clim. Extrem. 2024, 45, 100689.
  8. Yoon, J.J.; Kim, S.I. Analysis of Long Period Sea Level Variation on Tidal Station around the Korea Peninsula. J. Korean Soc. Coast. Disaster Prev. 2012, 12, 299–305.
  9. Kim, A.J.; Lee, M.H.; Suh, S.W. Effect of Summer Sea Level Rise on Storm Surge Analysis. J. Korean Soc. Coast. Ocean Eng. 2021, 33, 298–307.
  10. Hague, B.S.; Talke, S.A. The Influence of Future Changes in Tidal Range, Storm Surge, and Mean Sea Level on the Emergence of Chronic Flooding. Earth’s Future 2024, 12, e2023EF003993.
  11. Haigh, I.D.; Wadey, M.P.; Wahl, T.; Ozsoy, O.; Nicholls, R.J.; Brown, J.M.; Horsburgh, K.; Gouldby, B. Spatial and temporal analysis of extreme sea level and storm surge events around the coastline of the UK. Sci. Data 2016, 3, 160107.
  12. Jin, H.Y.; Hwang, T.G.; Kim, H.J.; Min, B.I.; Lee, W.D. Storm surge simulations using hypothetical scenarios based on historical typhoons impacting the Korean Peninsula: Analysis of storm surge and overtopping volumes. J. Korea Water Resour. Assoc. 2024, 57, 1037–1051.
  13. Park, J.K.; Kim, M.K.; Kim, D.C.; Yoon, J.S. Study on Development of Surge-Tide-Wave Coupling Numerical Model for Storm Surge Prediction. J. Ocean Eng. Technol. 2013, 27, 33–44.
  14. Heo, D.S.; Yeom, K.S.; Kim, J.M.; Kim, D.S.; Bae, K.S. Estimation of Storm Surges on the Coast of Busan. J. Ocean Eng. Technol. 2006, 20, 37–44.
  15. Yang, J.A.; Kim, S.Y.; Mori, N.; Mase, H. Bias correction of simulated storm surge height considering coastline complexity. Hydrol. Res. Lett. 2017, 11, 121–127.
  16. Muis, S.; Aerts, J.C.J.H.; Antolinez, J.A.A.; Dullaart, J.C.; Duong, T.M.; Erikson, L.; Haarsma, R.J.; Apecechea, M.I.; Mengel, M.; Bars, D.L.; et al. Global Projections of Storm Surges Using High-Resolution CMIP6 Climate Models. Earth’s Future 2023, 11, e2023EF003479.
  17. Fernández-Montblanc, T.; Vousdoukas, M.I.; Ciavola, P.; Voukouvalas, E.; Mentaschi, L.; Breyiannis, G.; Feyen, L.; Salamon, P. Towards robust pan-European storm surge forecasting. Ocean Model. 2019, 133, 129–144.
  18. Yang, J.A.; Kim, S.Y.; Mori, N.; Mase, H. Assessment of long-term impact of storm surges around the Korean Peninsula based on a large ensemble of climate projections. Coast. Eng. 2018, 142, 1–8.
  19. Yang, J.A.; Kim, S.Y.; Son, S.; Mori, N.; Mase, H. Assessment of uncertainties in projecting future changes to extreme storm surge height depending on future SST and greenhouse gas concentration scenarios. Clim. Chang. 2020, 162, 425–442.
  20. Salmun, H.; Molod, A.; Wisniewska, K.; Buonaiuto, F.S. Statistical Prediction of the Storm Surge Associated with Cool-Weather Storms at the Battery, New York. J. Appl. Meteorol. Clim. 2011, 50, 273–282.
  21. Costa, W.; Idier, D.; Rohmer, J.; Menendez, M.; Camus, P. Statistical Prediction of Extreme Storm Surges Based on a Fully Supervised Weather-Type Downscaling Model. J. Mar. Sci. Eng. 2020, 8, 1028.
  22. Xie, W.; Xu, G.; Zhang, H.; Dong, C. Developing a deep learning-based storm surge forecasting model. Ocean Model. 2023, 182, 102179.
  23. Harris, D.L.; Angelo, A. A regression model for storm surge prediction. Mon. Weather. Rev. 1963, 91, 710–726.
  24. Ohz, A.; Klein, A.H.F.; Franco, D. A Multiple Linear Regression-Based Approach for Storm Surge Prediction Along South Brazil. In Climate Change, Hazards and Adaptation Options; Springer: Cham, Switzerland, 2020; pp. 27–50.
  25. Rajasekaran, S.; Gayathri, S.; Lee, T.-L. Support vector regression methodology for storm surge predictions. Ocean Eng. 2008, 35, 1578–1587.
  26. Roberts, K.J.; Colle, B.A.; Georgas, N.; Munch, S.B. A Regression-Based Approach for Cool-Season Storm Surge Predictions along the New York–New Jersey Coast. J. Appl. Meteorol. Clim. 2015, 54, 1773–1791.
  27. Roberts, K.J.; Colle, B.A.; Korfe, N. Impact of Simulated Twenty-First-Century Changes in Extratropical Cyclones on Coastal Flooding at the Battery, New York City. J. Appl. Meteorol. Clim. 2017, 56, 415–432.
  28. Schaffer, L.; Boesch, A.; Baehr, J.; Kruschke, T. Development of a wind-based storm surge model for the German Bight. Nat. Hazards Earth Syst. Sci. 2025, 25, 2081–2096.
  29. Sahoo, B.; Bhaskaran, P.K. Prediction of Storm Surge and Inundation Using Climatological Datasets for the Indian Coast Using Soft Computing Techniques. Soft Comput. 2019, 23, 12363–12383.
  30. Kim, S.; Pan, S.; Mase, H. Artificial Neural Network-Based Storm Surge Forecast Model: Practical Application to Sakai Minato, Japan. Appl. Ocean Res. 2019, 91, 101871.
  31. Al Kajbaf, A.A.; Bensi, M. Application of Surrogate Models in Estimation of Storm Surge: A Comparative Assessment. Appl. Soft Comput. 2020, 91, 106184.
  32. Chen, K.; Kuang, C.; Wang, L.; Chen, K.; Han, X.; Fan, J. Storm Surge Prediction Based on Long Short-Term Memory Neural Network in the East China Sea. Appl. Sci. 2022, 12, 181.
  33. Lee, J.-W.; Irish, J.L.; Bensi, M.T.; Marcy, D.C. Rapid Prediction of Peak Storm Surge from Tropical Cyclone Track Time Series Using Machine Learning. Coast. Eng. 2021, 170, 104024.
  34. Wei, Z.; Nguyen, H.C. Storm Surge Forecast Using an Encoder–Decoder Recurrent Neural Network Model. J. Mar. Sci. Eng. 2022, 10, 1980.
  35. Tadesse, M.; Wahl, T.; Cid, A. Data-Driven Modeling of Global Storm Surges. Front. Mar. Sci. 2020, 7, 260.
  36. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–745.
  37. Choo, T.H.; Kim, J.G.; Park, W.S.; Choi, H.G. A Study on the Evaluation of Tidal Prediction Capacity of Busan, Gadeokdo, and Geoje Island using Logistic Regression Analysis and Multiple Regression Analysis. J. Korea Acad.-Ind. Coop. Soc. 2023, 24, 466–473.
  38. Knapp, K.R.; Kruk, M.C.; Levinson, D.H.; Diamond, H.J.; Neumann, C.J. The International Best Track Archive for Climate Stewardship (IBTrACS): Unifying Tropical Cyclone Data. Bull. Am. Meteorol. Soc. 2010, 91, 363–376.
  39. National Centers for Environmental Information. Available online: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc%3AC00834 (accessed on 25 July 2025).
  40. Gahtan, J.; Knapp, K.R.; Schreck, C.J.; Diamond, H.J.; Kossin, J.P.; Kruk, M.C. International Best Track Archive for Climate Stewardship (IBTrACS) Project, Version 4r01; (subset used: [specify]); NOAA National Centers for Environmental Information: Asheville, NC, USA, 2024.
  41. Korea Hydrographic and Oceanographic Agency. Available online: https://www.khoa.go.kr (accessed on 25 July 2025).
  42. Suk, M.J.; Hwang, C.S.; Lee, S.H.; Lee, J.S.; Song, D.H.; Park, S.P. Quality Improvement Measures for Sea Level Observation Data Using Near-Real Time Quality Control of Processing Techniques. J. Korean Soc. Mar. Environ. Saf. 2023, 12, 21–35.
  43. Ocean Data in Grid Framework. Available online: https://www.khoa.go.kr/oceangrid/koofs/kor/observation/obs_real.do (accessed on 24 July 2025).
  44. Pawlowicz, R.; Beardsley, B.; Lentz, S. Classical tidal harmonic analysis including error estimates in MATLAB using T_TIDE. Comput. Geosci. 2002, 28, 929–937.
  45. Sousa, S.I.V.; Martins, F.G.; Alvim-Ferraz, M.C.; Pereira, M.C. Multiple linear regression and artificial neural networks based on principal components to predict ozone concentrations. Environ. Model. Softw. 2007, 22, 97–103.
  46. Prieto, A.J.; Silva, A.; de Brito, J.; Macías-Bernal, J.M.; Alejandre, F.J. Multiple linear regression and fuzzy logic models applied to the functional service life prediction of cultural heritage. J. Cult. Herit. 2017, 27, 20–35.
  47. Lee, Y.; Jung, C.; Kim, S. Spatial distribution of soil moisture estimates using a multiple linear regression model and Korean geostationary satellite (COMS) data. Agric. Water Manag. 2019, 213, 580–593.
  48. Zhang, K.; Li, Y.; Liu, H.; Xu, H.; Shen, J. Comparison of three methods for estimating the sea level rise effect on storm surge flooding. Clim. Chang. 2013, 118, 487–500.
  49. Jensen, C.; Mahavadi, T.; Schade, N.H.; Hache, I.; Kruschke, T. Negative Storm Surges in the Elbe Estuary—Large-Scale Meteorological Conditions and Future Climate Change. Atmosphere 2022, 13, 1634.
  50. Dinápoli, M.G.; Simionato, C.G.; Alonso, G.; Bodnariuk, N.; Saurral, R. Negative storm surges in the Río de la Plata Estuary: Mechanisms, variability, trends and linkage with the Continental Shelf dynamics. Estuar. Coast. Shelf Sci. 2024, 305, 108844.
  51. Kutner, M.H.; Nachtsheim, C.J.; Neter, J.; Li, W. Applied Linear Statistical Models, 5th ed.; McGraw-Hill Irwin: Boston, MA, USA, 2005; pp. 1–1396.
  52. Resio, D.T.; Westerink, J.J. Modeling the physics of storm surges. Phys. Today 2008, 61, 33–38.
  53. Irish, J.L.; Resio, D.T.; Ratcliff, J.J. The influence of storm size on hurricane surge. J. Phys. Oceanogr. 2008, 38, 2003–2013.
  54. Irish, J.L.; Resio, D.T.; Divoky, D. Statistical properties of hurricane surge along a coast. J. Geophys. Res. Oceans 2011, 116.
  55. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: Berlin/Heidelberg, Germany, 2013; Volume 103, pp. 265–301.
  56. Santhi, C.; Arnold, J.G.; Williams, J.R.; Dugas, W.A.; Srinivasan, R.; Hauck, L.M. Validation of the SWAT model on a large river basin with point and nonpoint sources. JAWRA J. Am. Water Resour. Assoc. 2001, 37, 1169–1188.
  57. Van Liew, M.W.; Arnold, J.G.; Garbrecht, J.D. Hydrologic simulation on agricultural watersheds: Choosing between two models. Trans. ASAE 2003, 46, 1539–1551.
  58. Moriasi, D.N.; Arnold, J.G.; Van Liew, M.W.; Bingner, R.L.; Harmel, R.D.; Veith, T.L. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE 2007, 50, 885–900.
  59. Mori, N.; Kato, M.; Kim, S.; Mase, H.; Shibutani, Y.; Takemi, T.; Tsuboki, K.; Yasuda, T. Local amplification of storm surge by Super Typhoon Haiyan in Leyte Gulf. Geophys. Res. Lett. 2014, 41, 5106–5113.
Figure 1. The research flow of this study.
Figure 2. (a) Area of interest for this study; (b) Location of tide gauge stations within the area of interest.
Figure 3. Tracks of typhoons that affected Korea, identified by their passage through 32° N to 40° N and 122° E to 132° E region, shown as a blue solid square in the figure.
Figure 4. Scatter plots of regression models using all station data according to the typhoon–station distance: (a) 2000 km, (b) 1500 km, (c) 1000 km, (d) 900 km, (e) 800 km, (f) 700 km, (g) 600 km, and (h) 500 km. Red ellipses indicate regions where predicted values remain unchanged despite variation in the observed values, which are eliminated as the distance threshold decreases. This demonstrates that these points do not contribute to the predictive performance of the regression model. Blue ellipses highlight regions where the observed values increase sharply, but the predicted values fail to exceed approximately 0.2, indicating the model’s limited predictive ability for rapidly changing observed SSHs. Note that the regression model occasionally produces slightly negative predictions due to the mathematical characteristics of MLR. These values are not physically meaningful and can be treated as zero in practical applications.
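The Figure 4 caption notes that slightly negative predictions can be treated as zero in practical applications. A minimal sketch of that post-processing step follows, assuming the predictions are held in a plain list of surge heights in metres; the function name and the sample values are hypothetical and are not taken from the study.

```python
def clip_negative_surge(predicted_ssh):
    """Replace negative predicted surge heights (m) with zero."""
    return [max(0.0, value) for value in predicted_ssh]


# Hypothetical regression outputs in metres, for illustration only.
raw_predictions = [-0.03, 0.00, 0.12, 0.27]
clipped = clip_negative_surge(raw_predictions)
print(clipped)  # [0.0, 0.0, 0.12, 0.27]
```

Clipping changes only the reported values; the fitted regression coefficients are left untouched.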
Figure 5. Scatter plots of regression models using all station data according to the typhoon–station distance under typhoon event-based data grouping: (a) 2000 km, (b) 1500 km, (c) 1000 km, (d) 900 km, (e) 800 km, (f) 700 km, (g) 600 km, and (h) 500 km. The red vertical lines indicate the observed SSH value of 0.3, while the blue horizontal lines represent the predicted SSH value of 0.2. As the distance threshold decreases, the predicted values increasingly exceed 0.2, indicating a wider range of predicted SSHs. Notably, observed SSH values below approximately 0.3 tend to cluster densely in the range of 0–0.3 for both observed and predicted values, whereas for observed SSH values exceeding 0.3, the regression model fails to capture the increases, resulting in underestimation of high observed SSH events. Note that the regression model occasionally produces slightly negative predictions due to the mathematical characteristics of MLR. These values are not physically meaningful and can be treated as zero in practical applications.
Figure 6. Scatter plots of regression models using all station data in the low SSH range (0 ≤ SSH ≤ threshold), according to different observed SSH thresholds: (a) 0.2, (b) 0.25, (c) 0.3, (d) 0.35, (e) 0.4, (f) 0.45, (g) 0.5, (h) 0.55, and (i) 0.6.
Figure 7. Scatter plots of regression results for each station at the optimal observed SSH threshold in the high SSH range (threshold < SSH): (a) Gadeokdo, (b) Geomundo, (c) Geojedo, (d) Goheung, (e) Masan, (f) Busan, (g) Yeosu, (h) Ulsan, (i) Tongyeong, and (j) Pohang.
Table 1. Geographic coordinates of the tide gauge stations (points of interest).
| Point Name | Longitude | Latitude |
|---|---|---|
| Gadeokdo | 128°48′39″ E | 35°01′27″ N |
| Geomundo | 127°18′32″ E | 34°01′42″ N |
| Geojedo | 128°41′57″ E | 34°48′05″ N |
| Goheung | 127°20′34″ E | 34°28′52″ N |
| Gwangyang | 127°45′17″ E | 34°54′13″ N |
| Masan | 128°35′20″ E | 35°12′36″ N |
| Busan | 129°02′07″ E | 35°05′47″ N |
| Yeosu | 127°45′57″ E | 34°44′50″ N |
| Ulsan | 129°23′14″ E | 35°30′07″ N |
| Tongyeong | 128°26′05″ E | 34°49′40″ N |
| Pohang | 129°23′02″ E | 36°02′50″ N |
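The distance thresholds used in this study depend on the separation between the typhoon center and each station in Table 1. The sketch below shows one way that separation could be computed: it converts degrees/minutes/seconds to decimal degrees and applies the haversine great-circle formula. The choice of the haversine formula, the mean Earth radius, and the typhoon position are assumptions made for illustration; the source does not state how the distance was calculated.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to decimal degrees (N and E positive)."""
    return degrees + minutes / 60.0 + seconds / 3600.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Masan station coordinates as listed in Table 1.
masan_lat = dms_to_decimal(35, 12, 36)
masan_lon = dms_to_decimal(128, 35, 20)

# Hypothetical typhoon center position, for illustration only.
typhoon_lat, typhoon_lon = 30.0, 127.0

distance = haversine_km(masan_lat, masan_lon, typhoon_lat, typhoon_lon)
within_1000_km = distance <= 1000.0
print(f"{distance:.0f} km, within 1000 km: {within_1000_km}")
```

A record would then be kept or dropped for a given model by comparing `distance` with the chosen threshold (for example 1000 km or 500 km).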
Table 2. Regression model test R2 values for each station by typhoon–station distance. The “Total” row represents the R2 value derived from regression using the combined dataset from all stations.
| Station | 2000 km | 1500 km | 1000 km | 900 km | 800 km | 700 km | 600 km | 500 km |
|---|---|---|---|---|---|---|---|---|
| Gadeokdo | 0.1396 | 0.1542 | 0.2132 | 0.2152 | 0.2087 | 0.1995 | 0.2104 | 0.2057 |
| Geomundo | 0.2303 | 0.2289 | 0.2494 | 0.2556 | 0.2599 | 0.2590 | 0.2528 | 0.2590 |
| Geojedo | 0.2425 | 0.2662 | 0.3205 | 0.3292 | 0.4012 | 0.3956 | 0.3968 | 0.4220 |
| Goheung | 0.1919 | 0.2361 | 0.1972 | 0.2724 | 0.2870 | 0.3354 | 0.3743 | 0.3722 |
| Gwangyang | 0.3627 | 0.2460 | 0.5359 | 0.5317 | 0.5264 | 0.5300 | 0.5440 | 0.5588 |
| Masan | 0.1438 | 0.1985 | 0.1546 | 0.2394 | 0.2307 | 0.2245 | 0.2918 | 0.3012 |
| Busan | 0.1461 | 0.1969 | 0.2267 | 0.2137 | 0.2156 | 0.2130 | 0.2126 | 0.2098 |
| Yeosu | 0.1423 | 0.1581 | 0.2106 | 0.2108 | 0.2053 | 0.2013 | 0.2165 | 0.2304 |
| Ulsan | 0.1659 | 0.1820 | 0.2404 | 0.2366 | 0.2289 | 0.2373 | 0.2276 | 0.1920 |
| Tongyeong | 0.1359 | 0.1615 | 0.2037 | 0.2017 | 0.1972 | 0.2000 | 0.2202 | 0.2155 |
| Pohang | 0.1333 | 0.1573 | 0.2462 | 0.2428 | 0.2360 | 0.2363 | 0.2230 | 0.2011 |
| Total | 0.1330 | 0.1651 | 0.1930 | 0.1930 | 0.1897 | 0.1866 | 0.1864 | 0.1736 |
Table 3. Regression model test R2 values for each station by typhoon–station distance according to typhoon event-based data grouping. The “Total” row represents the R2 value derived from regression using the combined dataset from all stations.
| Station | 2000 km | 1500 km | 1000 km | 900 km | 800 km | 700 km | 600 km | 500 km |
|---|---|---|---|---|---|---|---|---|
| Gadeokdo | 0.1527 | 0.1990 | 0.2411 | 0.2268 | 0.2208 | 0.2103 | 0.2228 | 0.2179 |
| Geomundo | 0.2145 | 0.2666 | 0.2781 | 0.2778 | 0.2809 | 0.2851 | 0.2899 | 0.2992 |
| Geojedo | 0.2757 | 0.2548 | 0.3823 | 0.3817 | 0.3539 | 0.3751 | 0.3608 | 0.3235 |
| Goheung | 0.1060 | 0.1865 | 0.1607 | 0.2939 | 0.3127 | 0.4089 | 0.4039 | 0.3632 |
| Gwangyang | 0.1620 | 0.1741 | 0.5339 | 0.5313 | 0.5265 | 0.5041 | 0.5335 | 0.5696 |
| Masan | 0.0962 | 0.1490 | 0.2179 | 0.1966 | 0.1870 | 0.1858 | 0.1735 | 0.2339 |
| Busan | 0.1590 | 0.2514 | 0.2445 | 0.2263 | 0.2244 | 0.2192 | 0.2219 | 0.2098 |
| Yeosu | 0.1636 | 0.2159 | 0.2622 | 0.2610 | 0.2540 | 0.2558 | 0.2779 | 0.2867 |
| Ulsan | 0.1720 | 0.2270 | 0.2503 | 0.2477 | 0.2315 | 0.2285 | 0.2096 | 0.1638 |
| Tongyeong | 0.1523 | 0.2036 | 0.2362 | 0.2334 | 0.2300 | 0.2337 | 0.2544 | 0.2480 |
| Pohang | 0.1591 | 0.2632 | 0.3025 | 0.2879 | 0.2706 | 0.2577 | 0.2201 | 0.1815 |
| Total | 0.1316 | 0.2110 | 0.2261 | 0.2189 | 0.2119 | 0.2066 | 0.2050 | 0.1863 |
Table 4. Test R2 values of regression models for each station in the low SSH range (0 ≤ SSH ≤ threshold) according to observed SSH thresholds. The “Total” row represents the R2 value derived from regression using the combined dataset from all stations.
| Station | 0.2 m | 0.25 m | 0.3 m | 0.35 m | 0.4 m | 0.45 m | 0.5 m | 0.55 m | 0.6 m |
|---|---|---|---|---|---|---|---|---|---|
| Gadeokdo | 0.1646 | 0.1857 | 0.2153 | 0.2250 | 0.2325 | 0.2355 | 0.2368 | 0.2392 | 0.2412 |
| Geomundo | 0.1649 | 0.1916 | 0.2199 | 0.2461 | 0.2608 | 0.2708 | 0.2759 | 0.2765 | 0.2781 |
| Geojedo | 0.4361 | 0.4013 | 0.3948 | 0.3853 | 0.4033 | 0.4095 | 0.4095 | 0.4095 | 0.4064 |
| Goheung | 0.2992 | 0.2125 | 0.1692 | 0.1670 | 0.1670 | 0.1606 | 0.1607 | 0.1607 | 0.1607 |
| Gwangyang | 0.4355 | 0.4796 | 0.4945 | 0.4941 | 0.5034 | 0.5283 | 0.5241 | 0.5339 | 0.5339 |
| Masan | 0.1354 | 0.1299 | 0.1332 | 0.1244 | 0.1313 | 0.1458 | 0.1535 | 0.1637 | 0.1637 |
| Busan | 0.1644 | 0.1869 | 0.2255 | 0.2334 | 0.2378 | 0.2421 | 0.2440 | 0.2469 | 0.2458 |
| Yeosu | 0.1819 | 0.2055 | 0.2302 | 0.2598 | 0.2634 | 0.2664 | 0.2743 | 0.2758 | 0.2776 |
| Ulsan | 0.1615 | 0.2089 | 0.2291 | 0.2469 | 0.2497 | 0.2503 | 0.2503 | 0.2503 | 0.2503 |
| Tongyeong | 0.1407 | 0.1726 | 0.1997 | 0.2158 | 0.2245 | 0.2288 | 0.2334 | 0.2344 | 0.2371 |
| Pohang | 0.2038 | 0.2390 | 0.2645 | 0.2845 | 0.2973 | 0.3012 | 0.3012 | 0.3025 | 0.3025 |
| Total | 0.1461 | 0.1759 | 0.2009 | 0.2167 | 0.2241 | 0.2275 | 0.2299 | 0.2306 | 0.2310 |
Table 5. Test R2 values of regression models for each station in the high SSH range (threshold < SSH) according to observed SSH thresholds. Cells marked “-” indicate cases where regression analysis was not performed due to insufficient data for model construction. The “Total” row represents the R2 value derived from regression using the combined dataset from all stations.
| Station | 0.2 m | 0.25 m | 0.3 m | 0.35 m | 0.4 m | 0.45 m |
|---|---|---|---|---|---|---|
| Gadeokdo | 0.4131 | 0.4331 | 0.5048 | 0.5417 | - | - |
| Geomundo | 0.3373 | 0.3175 | 0.3370 | 0.4959 | 0.5828 | 0.5191 |
| Geojedo | 0.5892 | - | - | - | - | - |
| Goheung | 0.1621 | 0.3508 | - | - | - | - |
| Gwangyang | - | - | - | - | - | - |
| Masan | 0.8215 | 0.7534 | 0.5696 | - | - | - |
| Busan | 0.3810 | 0.4250 | 0.3740 | 0.5088 | - | - |
| Yeosu | 0.2767 | 0.3408 | 0.3875 | 0.4463 | 0.4396 | 0.3248 |
| Ulsan | 0.3732 | 0.4002 | - | - | - | - |
| Tongyeong | 0.2946 | 0.3024 | 0.3246 | 0.4575 | 0.2631 | 0.4640 |
| Pohang | 0.3604 | 0.5284 | 0.3442 | - | - | - |
| Total | 0.2299 | 0.2550 | 0.2476 | 0.2513 | 0.2586 | 0.1335 |
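The missing entries in Table 5 arise because fewer observations remain in the high SSH range as the threshold rises. The sketch below illustrates that bookkeeping: it splits observed surge heights into low (0 ≤ SSH ≤ threshold) and high (threshold < SSH) ranges and flags thresholds whose high range is too small to fit. The minimum sample count, the function name, and the sample values are hypothetical; the source does not state the criterion used to decide that data were insufficient.

```python
MIN_SAMPLES = 3  # hypothetical cut-off; the source does not give the criterion


def split_by_threshold(surge_heights_m, threshold_m):
    """Split surge heights (m) into low (0..threshold) and high (>threshold)."""
    low = [h for h in surge_heights_m if 0.0 <= h <= threshold_m]
    high = [h for h in surge_heights_m if h > threshold_m]
    return low, high


# Hypothetical observed surge heights in metres, for illustration only.
observed = [0.05, 0.12, 0.18, 0.22, 0.27, 0.31, 0.36, 0.48]

fittable = []
for threshold in (0.2, 0.25, 0.3, 0.35, 0.4, 0.45):
    low, high = split_by_threshold(observed, threshold)
    if len(high) >= MIN_SAMPLES:
        fittable.append(threshold)  # enough data to build a high-range model
    else:
        print(f"threshold {threshold} m: only {len(high)} samples, skipped")

print(fittable)  # [0.2, 0.25, 0.3]
```

In this toy example the three highest thresholds are skipped, which mirrors how the upper-threshold columns of Table 5 are blank for several stations.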
Yang, J.-A.; Lee, Y. Development of a Storm Surge Prediction Model Using Typhoon Characteristics and Multiple Linear Regression. J. Mar. Sci. Eng. 2025, 13, 1655. https://doi.org/10.3390/jmse13091655