Article

Advances in Imputation Strategies Supporting Peak Storm Surge Surrogate Modeling

by WoongHee Jung 1, Christopher Irwin 1, Alexandros A. Taflanidis 1,*, Norberto C. Nadal-Caraballo 2, Luke A. Aucoin 2 and Madison C. Yawn 2
1 Department of Civil and Environmental Engineering and Earth Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
2 U.S. Army Corps of Engineers, Engineer Research and Development Center, Coastal and Hydraulics Laboratory, Vicksburg, MS 39180, USA
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(9), 1678; https://doi.org/10.3390/jmse13091678
Submission received: 18 July 2025 / Revised: 15 August 2025 / Accepted: 28 August 2025 / Published: 31 August 2025
(This article belongs to the Special Issue Machine Learning in Coastal Engineering)

Abstract

Surrogate models are widely recognized as effective, data-driven predictive tools for storm surge risk assessment. For such applications, surrogate models (also referenced as emulators or metamodels) are typically developed using existing databases of synthetic storm simulations, and once calibrated can provide fast-to-compute approximations of the storm surge for a variety of downstream analyses. The storm surge predictions need to be established for different geographic locations of interest, typically corresponding to the computational nodes of the original numerical model. A number of inland nodes will remain dry for some of the database storm scenarios, requiring an imputation for them to estimate the so-called pseudo-surge in support of the surrogate model development. Past work has examined the adoption of kNN (k-nearest neighbor) spatial interpolation for this imputation. The enhancement of kNN with hydraulic connectivity information, using the grid or mesh of the original numerical model, was also previously considered. In this enhancement, neighboring nodes are treated as hydraulically connected only if they are directly linked within the grid. This work revisits the imputation of peak storm surge within a surrogate modeling context and examines three distinct advancements. First, a response-based correlation concept is considered for the hydraulic connectivity, replacing the previous notion of connectivity based on the numerical model grid. Second, a Gaussian Process interpolation (GPI) is examined as an alternative spatial imputation strategy, integrating a recently established adaptive covariance tapering scheme to accommodate an efficient implementation for large datasets (large number of nodes). Third, a data completion approach is examined for imputation, treating dry instances as missing data and establishing imputation using probabilistic principal component analysis (PPCA). The combination of spatial imputation with PPCA is also examined.
In this instance, spatial imputation is first deployed, followed by PPCA for the nodes that were misclassified in the first stage. Misclassification corresponds to the instances for which imputation provides surge estimates higher than ground elevation, creating the illusion that the node is inundated even though the original predictions correspond to the node being dry. In the illustrative case study, different imputation variants established based on the aforementioned advancements are compared, with comparison metrics corresponding to the predictive accuracy of the surrogate models developed using the imputed databases. Results show that incorporating hydraulic connectivity based on response similarity into kNN enhances the predictive performance, that GPI provides a competitive (to kNN) spatial interpolation approach, and that the combination of data completion and spatial interpolation emerges as the recommended approach.

1. Introduction

The prediction of storm surge hazard and its consequences has become a priority when examining the resilience of coastal communities in planning (pre-disaster), emergency management, and post-disaster contexts [1,2,3]. To address this priority, researchers have made substantial efforts to develop high-fidelity numerical models to establish accurate predictions for the anticipated surge levels driven by tropical cyclones and other storm events [4,5]. These models entail, however, a large computational burden, due to the need to resolve complex hydrodynamic processes in the nearshore region [5,6]. To address this challenge, and improve computational efficiency for regional coastal hazard assessment [7,8] or real-time surge forecasting [9,10], researchers have examined the use of surrogate modeling techniques [11,12,13,14,15,16,17,18,19,20]. Surrogate models, also referenced as metamodels or emulators, correspond to data-driven predictive tools developed using a database of simulation results from a high-fidelity numerical model, and offer a computationally efficient approximation of the input/output relationship of that model. In the context of storm surge emulation, regional flood study databases, consisting of surge predictions for synthetic storm scenarios, are typically used for the surrogate model development. Once properly calibrated, these surrogate models can provide highly accurate and computationally efficient predictions with substantially reduced computational cost, and can be employed to either complement the original high-fidelity model for accelerating the hazard estimation [9] or to fully replace that model for any type of future surge predictions [8], accommodating even the development of real-time decision support tools [21].
The surrogate model surge predictions need to be established for different geographic locations of interest, corresponding typically to the computational nodes of the underlying numerical simulation model or to some subset of them. These nodes cover coastal areas consisting of offshore, onshore, and inland locations, and their number may vary from a few thousand up to more than one million. Frequently, a number of inland nodes will remain dry for some of the database storm scenarios, and therefore, a storm surge estimate will not be provided for them by the numerical model. If the intent of a metamodeling implementation is to provide a binary classification of the node condition (i.e., wet or dry), then the information within the database is sufficient for the emulator development [22]. If, on the other hand, the intent is to provide storm surge predictions, an imputation stage is first needed [23] to fill in the missing data with the so-called pseudo-surge [24]. Spatial interpolation, specifically using weighted k-nearest neighbor (kNN) interpolation, has been shown to be effective, especially for imputing peak storm surge responses [23]. This approach uses the surge values of neighboring inundated nodes to infer the missing data, with the distances to the neighbors (representing spatial correlation) used as weights in the interpolation scheme. The enhancement of kNN by using the connectivity of the underlying numerical grids as a proxy for the hydraulic connectivity was further examined in [22], reflecting the observation that in domains with complex geomorphologies (e.g., ridges and sharp elevation changes) or flood protection measures (e.g., levees, floodwalls, or seawalls), spatial proximity does not necessarily mean correlated surge responses.
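The weighted kNN idea described above can be illustrated with a minimal sketch. The function name and the inverse-distance weighting below are illustrative assumptions (the cited works detail the exact weighting scheme used in the studies):

```python
import numpy as np

def knn_pseudo_surge(target_xy, wet_xy, wet_surge, k=4, eps=1e-9):
    """Weighted kNN spatial interpolation of the pseudo-surge at a dry node.

    target_xy : (2,) coordinates of the dry node
    wet_xy    : (m, 2) coordinates of inundated (wet) candidate neighbors
    wet_surge : (m,) peak-surge values at those wet nodes
    Inverse distances serve as weights, so closer wet nodes dominate.
    """
    d = np.linalg.norm(wet_xy - target_xy, axis=1)
    idx = np.argsort(d)[:k]              # k nearest wet neighbors
    w = 1.0 / (d[idx] + eps)             # inverse-distance weights
    return float(np.sum(w * wet_surge[idx]) / np.sum(w))
```

In the connectivity-enhanced variant, `wet_xy` would be restricted to nodes deemed hydraulically connected to the target, rather than all wet nodes.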
This work revisits the imputation of peak storm surge databases when couched within a specific objective—the development of surrogate models for emulation of the peak-surge. This objective dictates both the specific imputation strategies examined, advancing techniques previously considered for this application, as well as the accuracy metrics utilized to evaluate the performance of the different imputation strategies. Also, the focus is specifically on imputation for the peak-surge and not for the surge time-series. The imputation task for these two types of responses has very different characteristics. For example, time-series databases offer richer information due to the inclusion of an additional dimension (the temporal variation), which permits the consideration of an expanded range of imputation strategies that are appropriate for spatiotemporal problems [25,26,27]. In contrast, imputation of peak storm surge values may pose greater challenges, though it has significant relevance due to the wide use of peak storm surge emulation techniques for coastal hazard estimation [8,14,16].
Specifically, three distinct advancements are considered building upon previous efforts in peak storm surge database imputation. For the first advancement, the use of response-based correlation is examined for infusing hydraulic connectivity information within the spatial interpolation, replacing the previously used proxy based on the numerical model grid connections [22]. This replacement constitutes the first novel contribution of the study. The response-based correlation between nodes is defined using the binary response information for the condition of each node (i.e., inundated/wet or not). Only nodes with correlation higher than a specific threshold are considered as candidate neighbors in the kNN implementation. For the second advancement, a Gaussian Process interpolation (GPI) [28] is examined as an alternative (to the previously used kNN [22]) spatial interpolation-based imputation strategy. GPIs have shown great promise in filling in missing values for spatially correlated data [29], but their application to surge imputation has not been explored to date as they have been hindered by computational challenges due to the large size of the databases, originating from the large number of nodes (e.g., millions of nodes per study). Here, a recently established adaptive covariance tapering scheme [30] is adopted to accommodate an efficient GPI implementation for large datasets, and its advantages are examined compared to kNN-based spatial interpolation. This consideration of GPI for the peak storm surge database imputation represents another novel contribution of the study. For the third advancement, imputation using data completion is examined as an alternative to the aforementioned spatial interpolation techniques. Data completion techniques have been shown to be viable candidates for imputing missing data [31,32], especially for large datasets containing sufficient information for the respective technique to learn the underlying patterns. 
For storm surge applications, they have been recently shown to be effective for imputing surge time-series predictions [26], but their reliability for peak-surge imputations remains an open question. Here this research gap is addressed, establishing the third novel contribution of the study. Specifically, data completion through a latent space projection using probabilistic principal component analysis (PPCA) [33] is examined for imputation, an approach that has demonstrated high scalability and robust performance [34,35] in similar problems and has connections to the principal component analysis-based dimensionality reduction widely used in storm surge emulation [12]. The combination of spatial imputation with PPCA is also considered, with the objective of enriching the information first before performing PPCA. Establishing an appropriate combination approach constitutes the last novel contribution of this study.
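The response-based connectivity concept from the first advancement can be sketched as follows: the binary inundation histories of the nodes are correlated across the storm suite, and only sufficiently correlated pairs remain admissible kNN neighbors. The function name and the Pearson-correlation choice with threshold `rho_min` are illustrative assumptions (the sketch also assumes only once-dry nodes are passed in, since constant columns have undefined correlation):

```python
import numpy as np

def candidate_neighbor_mask(Y, rho_min=0.5):
    """Candidate-neighbor mask from response-based (binary) correlation.

    Y : (n, nz) binary inundation matrix (storms x nodes), 1 = wet, 0 = dry;
        columns should be once-dry nodes (non-constant inundation history).
    Nodes i and j are admissible neighbors only if the correlation of their
    inundation histories exceeds rho_min.
    """
    C = np.corrcoef(Y.T)                # (nz, nz) node-to-node correlation
    np.fill_diagonal(C, 0.0)            # a node is not its own neighbor
    return C > rho_min
```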
The first two advancements, representing spatial interpolation-based imputation strategies, are discussed in Section 5, while the remaining advancement, using data completion (or a combination approach), is presented in Section 6. Before elaborating on these advancements, the problem formulation, the database used in the case study, and the peak storm surge surrogate model development are reviewed in Section 2, Section 3 and Section 4, respectively.

2. Problem Formulation

The development of the surrogate model relies on the availability of a synthetic storm database, typically developed during regional coastal flood hazard studies [36]. These databases consist of $n$ synthetic storms and provide storm surge predictions for a total of $n_z$ nodes within the respective computational domains. The objective of the surrogate model developed utilizing a synthetic storm database is to predict the surge at the available nodes (output) for new storms (input). To support this objective, each synthetic storm is parameterized through the $n_x$-dimensional vector $\mathbf{x} \in \mathbb{R}^{n_x}$ that will serve as the emulator input, with $x_i$ denoting the $i$th input component. Additional details on the selection of $\mathbf{x}$ can be found in [23] and in Section 3 that discusses the case study database. Superscript notation will be utilized to distinguish the storms in the database, with $\mathbf{x}^h$ representing the input vector for the $h$th storm.
For the $i$th node within the domain, let $s_i^{lat}$ and $s_i^{lon}$ correspond to the latitude and longitude, $e_i$ to the node elevation, $z_i$ to the peak-surge, and $y_i$ to the inundation state indicator, with $y_i = 1$ meaning that the node is wet (inundated) and $y_i = 0$ meaning that the node is dry. When the relationship to the input vector needs to be explicit, the notation $z_i(\mathbf{x})$ [or $y_i(\mathbf{x})$] will also be used for the surge response [or inundation state] for the $i$th node. Define $\mathbf{s}_i = [s_i^{lat} \; s_i^{lon}] \in \mathbb{R}^2$ as the geo-location coordinate vector for the $i$th node and $\mathbf{S} \in \mathbb{R}^{2 \times n_z}$ as the matrix assembling all nodal coordinates. Let $\mathbf{z} \in \mathbb{R}^{n_z}$ denote the $n_z$-dimensional surge vector, with its components corresponding to the surge values for all individual nodes of interest, and let $\mathbf{y} \in \mathbb{R}^{n_z}$ denote the corresponding inundation state vector. All notations, including the subscript, superscript, and input dependencies, extend to $\mathbf{z}$ and $\mathbf{y}$ and to their individual components. For example, $z_i^h = z_i(\mathbf{x}^h)$ corresponds to the peak-surge for the $i$th node for the $h$th storm [described through the input vector $\mathbf{x}^h$] and $\mathbf{z}(\mathbf{x}^h)$ corresponds to the vector of peak-surge values for all nodes for the same storm.
The original database includes missing data, i.e., no information available for the storm surge, when inland nodes remain dry. These correspond to dry instances $y_i^h = 0$ and have no corresponding value $z_i^h$. Herein we will reference as always-wet the nodes that have been inundated across the entire database and as once-dry the nodes with at least one missing value, for which $y_i^h = 0$ at least once across the storm suite database. We will assume that $n_r$ such (once-dry) nodes exist within the database.
Finally, after the synthetic storm parameterization, the database provides the storm input matrix $\mathbf{X} = [\mathbf{x}^1 \; \mathbf{x}^2 \; \cdots \; \mathbf{x}^n]^T \in \mathbb{R}^{n \times n_x}$ and the output matrix for the surge responses $\mathbf{Z} = [\mathbf{z}^1 \; \mathbf{z}^2 \; \cdots \; \mathbf{z}^n]^T \in \mathbb{R}^{n \times n_z}$. Based on the latter matrix, the binary inundation state condition matrix $\mathbf{Y} = [\mathbf{y}^1 \; \mathbf{y}^2 \; \cdots \; \mathbf{y}^n]^T \in \mathbb{R}^{n \times n_z}$ is derived, with 0 representing the missing surge values in $\mathbf{Z}$ (node dry). The $h$th rows of all these matrices correspond to the characteristics (parametric input or predictions) for the $h$th synthetic storm and will be distinguished (when needed) by superscript $h$, with $\mathbf{Z}^h$ corresponding, for example, to the row vector with the surge responses for the $h$th storm. Note that the missing data in the $\mathbf{Z}$ matrix ultimately correspond to the elements in $\mathbf{Y}$ equal to 0.
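The database structure above can be made concrete with a toy example. The matrix sizes and values are illustrative assumptions; missing surge values (dry instances) are stored as NaN, mirroring the convention that dry nodes carry no surge value in $\mathbf{Z}$:

```python
import numpy as np

# Toy database: n = 3 storms, nx = 2 storm parameters, nz = 4 nodes.
X = np.array([[1.0, 0.5],
              [2.0, 0.7],
              [1.5, 0.9]])                       # (n, nx) storm inputs
Z = np.array([[1.2, np.nan, 0.8,    2.0],
              [1.5, 0.4,    np.nan, 2.2],
              [1.1, np.nan, 0.9,    1.9]])       # (n, nz) peak surge, NaN = dry

Y = (~np.isnan(Z)).astype(int)                   # (n, nz) inundation states
always_wet = Y.all(axis=0)                       # wet across all storms
once_dry = ~always_wet                           # at least one missing value
n_r = int(once_dry.sum())                        # number of once-dry nodes
```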

3. Database Description

The high-fidelity storm surge database utilized in this study is part of the U.S. Army Corps of Engineers (USACE) Coastal Hazards System (CHS; https://chs.erdc.dren.mil accessed 27 August 2025) [37]. It specifically corresponds to the CHS Louisiana (CHS-LA) coastal flood hazard study [36], developed for quantifying storm hazards and coastal compound flooding in Louisiana. CHS-LA includes areas in the vicinity of the Greater New Orleans Hurricane Storm Damage Risk Reduction System (HSDRRS), which encompasses the district of focus here. A detailed description of the database has been presented in [22]. Here, a basic overview is presented.
The CHS-LA tropical cyclone (TC) suite consists of $n = 645$ synthetic storms separated into 85 master tracks (MTs), shown in Figure 1. All TCs are characterized by unique combinations of the following parameters: landfall location, defined by the latitude $x^{lat}$ and longitude $x^{lon}$ of the storm's eye; heading direction of the storm track during the final approach to landfall $\beta$; central pressure deficit $\Delta P$; forward translation speed $v_t$; and radius of maximum wind speed $R_{mw}$. Each MT is initially constructed as a unique combination of the storm's heading direction $\beta$ and its landfall location $[x^{lat} \; x^{lon}]$. The remaining parameters defining the synthetic storms dictate the strength ($\Delta P$), size ($R_{mw}$), and translational speed ($v_t$) characteristics of each synthetic storm, further distinguishing storms that might correspond to the same MT. Table 1 summarizes the ranges of the TC parameters that constitute the database.
The simulation of the 645 synthetic TCs was performed using high-resolution, high-fidelity atmospheric and hydrodynamic numerical models. The forcing of these models for each storm corresponded to the wind and pressure fields obtained using a nested-grid Planetary Boundary Layer (PBL) model with input defined through the parametric description of the storm. The hydrodynamic simulations were performed by coupling the ADCIRC [ADvanced CIRCulation model] [38] and the STWAVE [Steady-State Spectral Wave model] [39]. The ADCIRC mesh consisted of close to 1.6 million nodes and 3.1 million triangular elements. The geographic domain includes various flood protection systems around the New Orleans area. A subset of the entire domain will be considered for the metamodel development, focusing on areas around New Orleans, constrained by latitude [28.5°, 40°] N and longitude [86°, 93.5°] W. This corresponds to a total of $n_z$ = 1,179,179 nodes, with $n_r$ = 488,216 being dry in at least one storm. Figure 2 presents the histogram of the percentage of storms for which each node is inundated. Note that a percentage equal to 100% corresponds to the set of always-wet nodes. The spatial distribution of the $n_r$ nodes, with the percentage of storms for which these nodes are inundated, will be illustrated later in Section 7.1.
With respect to the input parameterization, as shown in [22], the storm strength, size, and translational speed characteristics for the CHS-LA suite have been held relatively constant starting at approximately 250 km prior to landfall. This incentivizes the use of the characteristics at a reference landfall location to define $\mathbf{x}$. Following the recommendations in [13], a piece-wise linear coastal boundary is utilized for defining this reference landfall location, to avoid an ambiguous definition of landfall due to the existence of bays. This boundary is also shown in Figure 1 (white solid line). The reference landfall is defined at the intersection of this boundary and the synthetic storm track. A single parameter is adopted for describing this landfall, corresponding to the distance (in km) of the landfall location along this boundary from a reference point (any reference point can be utilized here). This leads to a five-dimensional ($n_x = 5$) input vector that includes the distance for the reference landfall along the linearized boundary, along with $\beta$, $\Delta P$, $R_{mw}$, and $v_t$ at final approach to landfall.
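The single landfall parameter described above, the arc-length along the linearized boundary, can be sketched as follows. The sketch works in planar coordinates for simplicity (the study uses geographic coordinates and distances in km) and assumes the landfall point is already known to lie on or near the boundary; the function name is illustrative:

```python
import numpy as np

def distance_along_boundary(boundary, p):
    """Arc-length of point p along a piece-wise linear boundary.

    boundary : (m, 2) polyline vertices (the linearized coastal boundary)
    p        : (2,) landfall point, assumed on (or near) the boundary
    Returns the cumulative distance from the first vertex to the projection
    of p onto the nearest boundary segment; units follow the coordinates.
    """
    best = (np.inf, 0.0)                 # (distance to segment, arc-length)
    cum = 0.0                            # cumulative length walked so far
    for a, b in zip(boundary[:-1], boundary[1:]):
        ab = b - a
        t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        proj = a + t * ab                # closest point on this segment
        d = np.linalg.norm(p - proj)
        if d < best[0]:
            best = (d, cum + t * np.linalg.norm(ab))
        cum += np.linalg.norm(ab)
    return best[1]
```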

4. Surrogate Modeling Overview

This section briefly reviews different aspects of the surrogate model implementation: (i) the database imputation; (ii) the development and calibration of two surrogate models, one for regression and one for classification, and their combination to establish the final storm surge predictions; and (iii) the validation of the established models. Additional details can be found in [22]. Readers familiar with the surrogate model implementation may proceed directly to Section 5 and Section 6.

4.1. Database Imputation

The development of the regression surrogate model (discussed in Section 4.2) requires imputation of the original database to provide the pseudo-surge [24] for any missing data in $\mathbf{Z}$. For the $i$th node and $h$th storm, this pseudo-surge is denoted as $\tilde{z}_i^h$ and can be obtained through any of the approaches discussed in detail in Section 5 and Section 6. As discussed in detail in [23], imputation is not guaranteed to lead to pseudo-surge values $\tilde{z}_i^h$ that are smaller than the node elevation $e_i$, with nodes for which $\tilde{z}_i^h > e_i$ falsely classified as wet based on the imputed surge values. The solutions for accommodating such erroneous information in the surrogate model development are as follows [23]: (a) adjust the estimated $\tilde{z}_i^h$ to be smaller (for example, 10 cm smaller) than $e_i$, guaranteeing that $\tilde{z}_i^h < e_i$ and that the estimated pseudo-surge provides a correct classification of the node as dry; or (b) retain the (erroneously classified) pseudo-surge values in the development of the regression surrogate model but supplement that regression model with an additional surrogate model for the classification of the inundation condition of each node, established using the original condition matrix $\mathbf{Y}$ (which does not include any erroneous information). The term problematic nodes and the notation $A_{pn}$ are introduced herein to describe the subset of nodes that are at least once misclassified during the imputation process. Following [22], the terminology pseudo-surge database and corrected pseudo-surge database will be used to describe, respectively, the imputed database with no modification [case (b)] or with adjustments that guarantee all imputed values lead to a correct classification of nodes [case (a)].
As discussed in detail in [22,23], the corrected pseudo-surge database includes discontinuities (jumps) in the imputed observation matrix $\mathbf{Z}$. These discontinuities are represented by the surge gap $\eta_i$ for the once-dry nodes, defined using the information in the original database as follows [22]:

$$\eta_i = \min_h (z_i^h) - e_i \quad (1)$$
where $\min_h(\cdot)$ denotes the minimum of the quantity inside the parentheses across all the storms in the database. The larger the value of this gap, the greater the challenges for the regression surrogate model in achieving accurate predictions when the corrected pseudo-surge database is utilized. As discussed in detail in [23], this is the main reason for promoting the use of the pseudo-surge database in the surge surrogate model development. It is expected to provide a higher accuracy predictive model, as long as the erroneous information in this database (originating from the imputed data) can be corrected when supplementing predictions with the ones from the secondary classification surrogate model.
To assess the quality of information associated with the imputed database, the following measures are introduced. The maximum surge misinformation $ms_i$ corresponds to the largest difference between the imputed pseudo-surge value and the node elevation:

$$ms_i = \max_h \left\{ (\tilde{z}_i^h - e_i) \, |y_i^h - 1| \right\} \quad (2)$$
and the percentage of misinformation $mp_i$ corresponds to the relative ratio of storms (with respect to the total database) for which a node is misclassified at the imputation stage:

$$mp_i = \frac{1}{n} \sum_{h=1}^{n} \left\{ I[\tilde{z}_i^h > e_i] \, |y_i^h - 1| \right\} \quad (3)$$
where $I[\cdot]$ denotes the indicator function, corresponding to one if the expression inside the brackets is true and to zero otherwise. Note that the term $|y_i^h - 1|$ restricts the operations in the above equations to only the originally dry instances, i.e., when $y_i^h = 0$. For each problematic node, the maximum surge misinformation and the percentage of misinformation represent, respectively, the magnitude and the extent of erroneous information. A modification of the problematic node definition was proposed in [22] using $ms_i$ and $mp_i$. Any problematic nodes that correspond to small values for both $ms_i$ and $mp_i$ can be removed from the set $A_{pn}$; the information quality for them is deemed sufficiently high (the erroneous information is negligible). This modified problematic node set is denoted as $\bar{A}_{pn}$ herein. In the case study presented later, the thresholds used for $ms_i$ and $mp_i$ are 5 cm and 2.0%, respectively, following the recommendation in [22].
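The misinformation measures and the modified problematic set can be computed vectorially. This is a minimal sketch under assumed array conventions (storms along rows, nodes along columns, elevations in meters); the function name is illustrative, while the threshold defaults follow the 5 cm and 2% values quoted in the text:

```python
import numpy as np

def imputation_quality(Z_imp, Y, e, ms_thresh=0.05, mp_thresh=0.02):
    """Misinformation metrics and modified problematic node set.

    Z_imp : (n, nz) imputed (pseudo-surge) database
    Y     : (n, nz) original binary inundation states (1 = wet, 0 = dry)
    e     : (nz,) node elevations
    """
    dry = (Y == 0)                                  # originally dry instances
    excess = (Z_imp - e) * dry                      # (z~ - e) |y - 1|
    ms = excess.max(axis=0)                         # max surge misinformation
    mp = ((Z_imp > e) & dry).mean(axis=0)           # misclassification ratio
    problematic = mp > 0                            # misclassified at least once
    modified = problematic & ~((ms <= ms_thresh) & (mp <= mp_thresh))
    return ms, mp, modified
```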

4.2. Surrogate Model Development

4.2.1. Overview of Data-Driven Predictive Model Formulation

The storm surge predictive model is composed of two different surrogate models [22]: (a) the primary regression surrogate model, denoted as Ss herein, for approximating the storm surge $z(\mathbf{x})$; and (b) the secondary classification model, denoted as Sc herein, for approximating the inundation state $y(\mathbf{x})$. The regression surrogate Ss is established using a database with input and output matrices corresponding to $\mathbf{X}$ and the imputed $\mathbf{Z}$, respectively. Either the pseudo-surge database or the corrected pseudo-surge database can be used for $\mathbf{Z}$. The classification surrogate Sc is established using a database with input and output matrices corresponding to $\mathbf{X}$ and $\mathbf{Y}$, respectively. Note that the classification surrogate can be established only for the once-dry nodes [a total of $n_r$ nodes], since for the always-wet nodes there is no useful information in the database to support a classification metamodel development (all training points correspond to the same value of 1). The metamodel predictions will be denoted as $\tilde{z}_i(\mathbf{x})$ and $\tilde{y}_i^c(\mathbf{x})$ for the $i$th node and as $\tilde{\mathbf{z}}(\mathbf{x})$ and $\tilde{\mathbf{y}}^c(\mathbf{x})$ for the entire output vector, for the Ss and Sc surrogate models, respectively. Note that the Ss metamodel also establishes predictions for the inundation state, denoted by $\tilde{y}_i^s(\mathbf{x})$ for the $i$th node and given by the following:
$$\tilde{y}_i^s(\mathbf{x}) = I[\tilde{z}_i(\mathbf{x}) > e_i] \quad (4)$$
where, as defined earlier, $I[\cdot]$ denotes the indicator function. The predictions for the entire output vector, with elements $\tilde{y}_i^s(\mathbf{x})$, will be denoted as $\tilde{\mathbf{y}}^s(\mathbf{x})$. Note that superscripts $c$ and $s$ were introduced to distinguish the predictions for the node inundation state according to the Sc and Ss surrogate models, respectively. The final approximation for the inundation state, denoted as $\tilde{y}_i(\mathbf{x})$ for the $i$th node and as $\tilde{\mathbf{y}}(\mathbf{x})$ for the entire output vector, can be obtained, as will be discussed in Section 4.2.3, by the appropriate combination of the predictions of each of these metamodels.

4.2.2. Classification and Regression Surrogate Model Details

Any type of surrogate model [15,16] can be utilized for establishing the Ss and Sc metamodels. In the illustrative case study, a Gaussian Process (GP) metamodel [40,41] is considered as the surrogate model for both Sc and Ss, coupled with a dimensionality reduction step, using principal component analysis (PCA), to deal with the potentially large dimension of the output (i.e., large number of nodes). This implementation has been shown to be highly efficient for storm surge surrogate model predictions [12,23]. Details for the surrogate models are presented in Appendix A and Appendix B for Ss and Sc, respectively, with Appendix C providing an overview of the GP formulation. This surrogate modeling implementation ultimately establishes probabilistic predictions for the node inundation state. Instead of providing a binary classification, i.e., predicting the node as wet or dry, these predictions provide the probability of the node being inundated. For the $i$th node these predictions will be denoted herein as $p_i^s(\mathbf{x})$ and $p_i^c(\mathbf{x})$ for the Ss and Sc surrogate models, respectively. The binary classification can be established as $y_i^s(\mathbf{x}) = I[p_i^s(\mathbf{x}) > 0.5]$ and $y_i^c(\mathbf{x}) = I[p_i^c(\mathbf{x}) > 0.5]$, respectively, though the probabilities can be explicitly utilized in the combination of the surrogate model predictions, as detailed next.
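The PCA-coupled GP idea can be sketched compactly with scikit-learn. This is an illustrative simplification of the Ss regression surrogate (one independent GP per retained latent coordinate, default RBF kernel), not the exact formulation detailed in the appendices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_pca_gp(X, Z, n_components=2):
    """PCA-reduced GP regression surrogate (a sketch of the Ss idea).

    X : (n, nx) storm inputs; Z : (n, nz) imputed surge outputs.
    One GP is fit per retained principal component; predictions are mapped
    back to node space with the inverse PCA transform.
    """
    pca = PCA(n_components=n_components)
    L = pca.fit_transform(Z)                        # (n, n_components) latent
    gps = [GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                    normalize_y=True).fit(X, L[:, j])
           for j in range(L.shape[1])]

    def predict(Xq):
        Lq = np.column_stack([gp.predict(Xq) for gp in gps])
        return pca.inverse_transform(Lq)            # (nq, nz) surge estimates
    return predict
```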

4.2.3. Combination of Individual Surrogate Model Predictions

The final storm surge surrogate model predictions are established by combining the Ss and Sc metamodels, with both of them leveraged to establish the predictions for the inundation state $\tilde{y}_i(\mathbf{x})$, and the first one [Ss] used to provide the final surge predictions $\tilde{z}_i(\mathbf{x})$ for the nodes that are predicted to be inundated, i.e., correspond to $\tilde{y}_i(\mathbf{x}) = 1$. A detailed discussion on alternative options for the metamodel combination is presented in [22]. Here, a quick review of the recommended implementation is provided. To accommodate this combination, the following classes of nodes are defined:
  • The always-wet nodes (inundated across the entire database), with no missing information. These will be denoted as class $C_1$. The node condition classification for them is based entirely on Ss.
  • The modified problematic nodes $\bar{A}_{pn}$, for which imputation was needed and led to sufficiently large values of $ms_i$ or $mp_i$. These will be denoted as class $C_2$.
  • The remaining nodes, for which imputation was needed but did not lead to significant misclassification. These will be denoted as class $C_3$.
Additionally, the combined surrogate model probabilistic predictions are defined as follows:
$$p_i^{cb}(\mathbf{x}) = w_i^{cb} \, p_i^s(\mathbf{x}) + (1 - w_i^{cb}) \, p_i^c(\mathbf{x}) \quad (5)$$
where $w_i^{cb}$ (with $0 \le w_i^{cb} \le 1$) and $(1 - w_i^{cb})$ correspond to the weights assigned to the Ss (regression) and Sc (classification) metamodels, respectively. Note that if the established metamodels for Sc and Ss do not provide probabilistic predictions, the binary classification can be utilized in the combination. The selection of the weights $w_i^{cb}$ reflects the degree of confidence in each metamodel class, with a balanced implementation recommended in [22]. This implementation weighs equally ($w_i^{cb} = 1/2$) the Ss and Sc predictions for all instances for which these predictions can be deemed reliable. This reliability should be evaluated under the following consideration. Ss predictions for class $C_2$ are biased towards false positives (dry nodes characterized as wet) due to the underlying erroneous information in the pseudo-surge database (see also the Section 4.1 discussion). Based on this tendency, if Ss predictions classify nodes as dry, they can be assessed as reliable, as they run opposite to the underlying metamodel bias. In such cases, both models can be combined to provide $p_i^{cb}(\mathbf{x})$. On the other hand, if Ss predictions classify nodes as wet, i.e., $\tilde{z}_i(\mathbf{x}) \ge e_i$, then due to the underlying false positive bias they should be ignored as not credible. The recommendation for the surrogate model combination is finally as follows:
$$\tilde{y}_i(\mathbf{x}) = \begin{cases} C_1: & y_i^s(\mathbf{x}) = I[\tilde{z}_i(\mathbf{x}) > e_i] \\ C_2: & \begin{cases} I[p_i^{cb}(\mathbf{x}) > 0.5 \,|\, w_i^{cb}] & \text{if } \tilde{z}_i(\mathbf{x}) < e_i \\ y_i^c(\mathbf{x}) = I[p_i^c(\mathbf{x}) > 0.5] & \text{else} \end{cases} \\ C_3: & I[p_i^{cb}(\mathbf{x}) > 0.5 \,|\, w_i^{cb}] \end{cases} \quad (6)$$
It should be noted that for the Ss surrogate, the relationship $y_i^s(\mathbf{x}) = I[\tilde{z}_i(\mathbf{x}) > e_i] = I[p_i^s(\mathbf{x}) > 0.5]$ also holds.
Finally, the surge predictions are established as follows. For the instances corresponding to $\tilde{y}_i(\mathbf{x}) = 1$, the surge estimates are provided directly by the Ss metamodel, $\tilde{z}_i(\mathbf{x})$, as long as $\tilde{z}_i(\mathbf{x}) > e_i$. If $\tilde{z}_i(\mathbf{x}) < e_i$, meaning the node would have been classified as dry based on Ss alone, but the combined surrogate model predictions yield $\tilde{y}_i(\mathbf{x}) = 1$, then instead of directly using $\tilde{z}_i(\mathbf{x})$ the predictions are set equal to some margin (2.0 cm used in this study) above the node elevation $e_i$. This is necessary so that the assigned surge predictions correspond to nodes classified as inundated for all $\tilde{y}_i(\mathbf{x}) = 1$ instances.
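The combination rule described in this subsection can be sketched as a vectorized function. The function name, array conventions, and the balanced weight default are illustrative assumptions; the class-dependent logic follows the recommendation reviewed above:

```python
import numpy as np

def combined_prediction(z_s, p_s, p_c, e, node_class, w=0.5, margin=0.02):
    """Combine the Ss (regression) and Sc (classification) predictions.

    z_s : (nz,) Ss surge predictions; p_s, p_c : (nz,) wet probabilities
    e   : (nz,) node elevations
    node_class : (nz,) entries 1, 2, 3 for classes C1, C2, C3
    For C2 nodes the Ss output enters the combination only when it predicts
    dry (i.e., against its false-positive bias).
    """
    p_cb = w * p_s + (1.0 - w) * p_c                # combined probability
    y = np.where(node_class == 1, (z_s > e).astype(int), 0)   # C1: Ss only
    c2 = node_class == 2
    y = np.where(c2 & (z_s < e), (p_cb > 0.5).astype(int), y) # C2, Ss says dry
    y = np.where(c2 & (z_s >= e), (p_c > 0.5).astype(int), y) # C2, Ss says wet
    y = np.where(node_class == 3, (p_cb > 0.5).astype(int), y)
    # Surge for predicted-wet nodes; lift to e + margin if z_s fell below e.
    z = np.where(y == 1, np.maximum(z_s, e + margin), np.nan)
    return y, z
```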

4.3. Surrogate Model Validation

For validation, a k-fold cross-validation (CV) is adopted. The process for establishing it is as follows: (s.1) the original database is randomly partitioned into k different and equally sized groups; (s.2) sequentially, each group is removed from the database, and the surrogate model is calibrated using the remaining storms as observations; (s.3) predictions for the surge of the removed storms are established using this surrogate model; (s.4) comparing these predictions to the actual storm output (of the removed storms) quantifies the prediction error for them; (s.5) accuracy statistics are estimated by combining error information across all folds. Note that when imputation relying on data completion is utilized for the metamodel development, the imputation is repeated for each fold in step (s.2). The k-fold CV implementation ultimately provides, for the $i$th node, estimates for the surge $\tilde{z}_i(\mathbf{x}^h)$ and inundation condition $\tilde{y}_i(\mathbf{x}^h)$ established for the $h$th storm by using as the training database the folds that do not include that storm.
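Steps (s.1) through (s.3) can be sketched as follows. The function names and the `fit_fn` callback interface are illustrative assumptions; any fold-specific imputation would happen inside `fit_fn`, as noted in the text:

```python
import numpy as np

def kfold_cv_predictions(X, Z, fit_fn, k=5, seed=0):
    """Cross-validated surge predictions, one held-out fold at a time.

    fit_fn(X_train, Z_train) must return a predictor Xq -> (nq, nz) surge.
    Each storm's prediction comes from a surrogate trained without it.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # (s.1) random partition
    Z_hat = np.empty_like(Z)
    for test_idx in folds:                          # (s.2)-(s.3)
        train = np.setdiff1d(np.arange(n), test_idx)
        predict = fit_fn(X[train], Z[train])
        Z_hat[test_idx] = predict(X[test_idx])
    return Z_hat        # compare against Z to obtain (s.4)-(s.5) statistics
```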
As validation metrics, the misclassification and surge score are utilized. For a specific node and storm, the total misclassification is given by the following:
$$MC_i^h = \left| \tilde{y}_i(\mathbf{x}^h) - y_i^h \right|.$$
This total misclassification can be decomposed into the false positive and false negative components, with the first corresponding to the instances in which the node is predicted wet when dry, and the second to the instances in which the node is predicted dry when wet. These misclassification types are described, respectively, by the following equations:
$$MC_i^{h+} = \max\left(0,\ \tilde{y}_i(\mathbf{x}^h) - y_i^h\right), \qquad MC_i^{h-} = \max\left(0,\ y_i^h - \tilde{y}_i(\mathbf{x}^h)\right)$$
with max(a,b) corresponding to the maximum of the two arguments a and b. Average misclassification statistics per node or across all nodes can then be estimated to describe the accuracy. These are separately estimated for the false positive/false negative/total misclassification components. For the latter, they are denoted as $MC_i$ for the statistics per node and as $\overline{MC}$ for the statistics across all nodes (across the entire database) and are given, respectively, by the following:
$$MC_i = \frac{1}{n} \sum_{h=1}^{n} MC_i^h; \qquad \overline{MC} = \frac{1}{n\, n_z} \sum_{h=1}^{n} \sum_{i=1}^{n_z} MC_i^h.$$
The equations for the respective statistics for the false positive or the false negative misclassification are identical, with the only difference that, instead of $n_z$, the number of nodes that were dry (for the false positive statistics) or wet (for the false negative statistics) is utilized for each storm. Additionally, statistics can be provided for specific groups instead of the entire database; this is accomplished by examining in the averaging in Equation (9) only the nodes belonging to that group (i.e., summation is performed only with respect to the indices i included in that group).
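As a minimal sketch, the misclassification statistics can be computed from binary prediction/observation matrices as below. For simplicity, the false positive and false negative rates here are averaged over all node-storm pairs, rather than normalized by the per-storm counts of dry or wet nodes as described in the text:

```python
import numpy as np

def misclassification_stats(Y_pred, Y_true):
    """Total, false-positive and false-negative misclassification rates.

    Y_pred, Y_true: (n_storms, n_nodes) binary inundation matrices.
    """
    mc_plus = np.maximum(0, Y_pred - Y_true)   # predicted wet, actually dry
    mc_minus = np.maximum(0, Y_true - Y_pred)  # predicted dry, actually wet
    mc = mc_plus + mc_minus                    # equals |Y_pred - Y_true|
    return mc.mean(), mc_plus.mean(), mc_minus.mean()
```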
The surge score for a specific node and storm is given by the following [10,24]:
$$SC_i^h = \begin{cases} \left| \tilde{z}_i(\mathbf{x}^h) - z_i^h \right| & \text{if } \tilde{z}_i(\mathbf{x}^h) > e_i \ \&\ z_i^h > e_i \\ \tilde{z}_i(\mathbf{x}^h) - e_i & \text{if } \tilde{z}_i(\mathbf{x}^h) > e_i \ \&\ z_i^h \le e_i \\ z_i^h - e_i & \text{if } \tilde{z}_i(\mathbf{x}^h) \le e_i \ \&\ z_i^h > e_i \\ 0 & \text{if } \tilde{z}_i(\mathbf{x}^h) \le e_i \ \&\ z_i^h \le e_i \end{cases} = \left| \max(e_i, \tilde{z}_i(\mathbf{x}^h)) - \max(e_i, z_i^h) \right|.$$
This surge score quantifies the absolute discrepancy between the predicted and actual surge, further incorporating the node inundation state: if the node is wet and is also predicted wet, then the absolute value of the predicted surge discrepancy is used as the penalty; if the node is wet but it is predicted dry, or vice versa, then the difference between the surge (or predicted surge) and the node elevation is used as the penalty; if the node is actually dry and it is predicted dry, then the penalty is zero. Note that the expression $|\max(e_i, \tilde{z}_i(\mathbf{x}^h)) - \max(e_i, z_i^h)|$ represents the compact mathematical form of the four-branch definition of the surge score. The averaged statistics per node, or across the entire database, can then be obtained, respectively, as the following:
$$SC_i = \frac{1}{n} \sum_{h=1}^{n} SC_i^h; \qquad \overline{SC} = \frac{1}{n\, n_z} \sum_{h=1}^{n} \sum_{i=1}^{n_z} SC_i^h.$$
Statistics for the surge score can be examined for specific groups instead of the entire node set, similar to the misclassification case.
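The compact single-expression form of the surge score is straightforward to vectorize; the snippet below is a direct transcription of $|\max(e_i, \tilde{z}) - \max(e_i, z)|$, which is equivalent to the four-branch definition:

```python
import numpy as np

def surge_score(z_pred, z_true, e):
    """Surge score per node and storm, compact form:
    SC = |max(e, z_pred) - max(e, z_true)|. Inputs may be scalars or arrays."""
    return np.abs(np.maximum(e, z_pred) - np.maximum(e, z_true))
```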

5. Imputation Based on Spatial Interpolation

Spatial interpolation is a popular and practical tool for estimating spatially varying variables for geographical locations where they have not been observed, based on observations from neighboring locations [28]. For the application examined here, spatial interpolation techniques can be calibrated utilizing the always-wet nodes and can be subsequently utilized to estimate the pseudo-surge values (i.e., impute the database) based on the storm surge values of other inundated nodes in close proximity. Imputation is separately performed for each storm. As discussed in the introduction, two separate spatial interpolation techniques are examined: weighted kNN and GPI. Weighted kNN has been previously considered for imputation of peak storm surge databases [22,23], and here an extension is proposed integrating a response-based correlation measure to better describe the hydraulic connectivity between nodes. GPI is an attractive spatial interpolation approach [28,42] but, as discussed in Section 1, to date it has not been considered for the application examined here due to computational challenges associated with the large dimension of the database.

5.1. Imputation Using kNN with Enhanced Hydraulic Connectivity

To formalize the kNN implementation, define d i j = d ( s i , s j ) as the geo-distance between nodes i and j and A k h [ i ] as the set of k closest nodes to the ith node for the hth storm. Importantly, the only nodes included in set A k h [ i ] correspond to nodes with known surge values; these correspond to originally inundated nodes for the hth storm (nodes for which y i h = 1 in the original database) and nodes with already imputed values based on the iterative implementation proposed in [23] (and also reviewed next). The pseudo-surge estimate z ˜ i h for the ith node and the hth storm based on the weighted kNN interpolation is given by the following:
$$\tilde{z}_i^h = \frac{\sum_{j \in A_k^h[i]} w(d_{ij})\, z_j^h}{\sum_{j \in A_k^h[i]} w(d_{ij})}$$
where w ( d i j ) is a distance-dependent weight taken as a power exponential with a cut-off distance d t :
$$w(d_{ij}) = \begin{cases} \exp\!\left[-\left(d_{ij}/\psi_l\right)^{\psi_e}\right] & \text{if } d_{ij} < d_t \\ 0 & \text{if } d_{ij} \ge d_t. \end{cases}$$
The set of [ k   d t   ψ l   ψ e ] corresponds to the kNN hyper-parameters. These can be calibrated to optimize interpolation accuracy across the always-wet nodes [23]. The calibration procedure is reviewed in Appendix D.
The kNN-based imputation is progressively performed [23] within an iterative implementation. At each iteration, only dry nodes that have at least k wet (both imputed and genuine) neighbors in a larger set of kc nodes are imputed. For nodes for which fewer than k wet neighbors are available, imputation is not performed at the current iteration (these will be imputed in a future iteration). The value of kc is chosen equal to twice the value of k in this study, adopting the same recommendation as in [23]. Additionally, the use of hydraulic connectivity information within the kNN imputation was introduced in [22] and shown to be advantageous. Instead of the closest nodes based purely on $d_{ij}$, the closest connected neighbors are used, defined by the number of element edges between nodes i and j in the graph representing the underlying numerical grid [for example, ADCIRC in the database discussed in Section 3]. Nodes corresponding to fewer edge connections are given higher priority in estimating the k nearest neighbors, independent of the actual distance. This approach ultimately prioritizes ADCIRC grid connectivity over spatial proximity in selecting the closest neighbors.
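A simplified, single-pass version of the weighted-kNN imputation for one storm might look as follows; this sketch omits the iterative scheme and the connectivity screening described above, and the hyper-parameter values are arbitrary placeholders rather than calibrated ones:

```python
import numpy as np

def knn_impute_storm(coords, z, wet, k=8, psi_l=5000.0, psi_e=1.0, d_t=20000.0):
    """Single-pass weighted-kNN pseudo-surge for the dry nodes of one storm.

    coords: (n_nodes, 2) node coordinates; z: surge values (valid where wet);
    wet: boolean mask. Weight w(d) = exp(-(d/psi_l)**psi_e) with cut-off d_t.
    """
    z_imp = z.copy()
    wet_idx = np.flatnonzero(wet)
    for i in np.flatnonzero(~wet):
        d = np.linalg.norm(coords[wet_idx] - coords[i], axis=1)
        order = np.argsort(d)[:k]
        nearest, dn = wet_idx[order], d[order]
        w = np.where(dn < d_t, np.exp(-(dn / psi_l) ** psi_e), 0.0)
        if w.sum() > 0:
            z_imp[i] = np.dot(w, z[nearest]) / w.sum()
    return z_imp
```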
The edge connections within the underlying numerical grid offer, though, only a proxy for the hydraulic connectivity between nodes. Here, this concept is advanced to use the inferred connectivity based on the response database Y. Given the fact that the responses are binary, the cosine similarity is adopted as a response-based correlation measure [43], defined for nodes i and j as follows:
$$C_s(i,j) = \frac{\sum_{h=1}^{n} y_i^h y_j^h}{\sqrt{\sum_{h=1}^{n} (y_i^h)^2}\,\sqrt{\sum_{h=1}^{n} (y_j^h)^2}}.$$
This cosine similarity represents the normalized degree of inundation state alignment between nodes i and j, with a value of 1 representing perfect alignment and a value of 0 representing no alignment (no storms for which both nodes are wet). Another way to interpret the cosine similarity for the binary variables is the following. Consider the partitioning of the outcomes of the binary vectors $\{y_i^h; h = 1, \ldots, n\}$ and $\{y_j^h; h = 1, \ldots, n\}$ with: $N_{11}$ representing the total number of storms where both $y_i^h$ and $y_j^h$ are equal to 1; $N_{10}$ the total number of storms where $y_i^h = 1$ and $y_j^h = 0$; $N_{01}$ the total number of storms where $y_i^h = 0$ and $y_j^h = 1$; and $N_{00}$ the total number of storms where both $y_i^h$ and $y_j^h$ are equal to 0. Then the cosine similarity can be expressed as follows:
$$C_s(i,j) = \frac{N_{11}}{\sqrt{N_{11} + N_{10}}\,\sqrt{N_{11} + N_{01}}}.$$
Alternative measures for quantifying correlation between binary-valued vectors, for example, Jaccard similarity or Hamming similarity, could also be considered instead of the cosine similarity. Finally, a threshold $C_s^t$ is imposed for the cosine similarity, with nodes considered connected only if $C_s(i,j) > C_s^t$. The nearest neighbors for node i are then chosen only from the candidate connected neighbors. This implementation avoids creating connections between nodes in close spatial proximity that do not appear to be correlated based on the response database Y.
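A vectorized computation of the pairwise cosine similarity, with thresholding to obtain the response-based connectivity, can be sketched as below; the threshold value used here is purely illustrative:

```python
import numpy as np

def cosine_connectivity(Y, threshold=0.8):
    """Response-based connectivity from binary inundation histories.

    Y: (n_storms, n_nodes) binary matrix. Returns a boolean matrix
    C[i, j] = (Cs(i, j) > threshold).
    """
    norms = np.sqrt((Y ** 2).sum(axis=0))   # per-node L2 norm, sqrt(N11 + N10)
    G = Y.T @ Y                             # pairwise N11 counts
    with np.errstate(divide="ignore", invalid="ignore"):
        Cs = G / np.outer(norms, norms)
    Cs = np.nan_to_num(Cs)                  # never-wet nodes -> similarity 0
    return Cs > threshold
```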

5.2. Imputation Using GPI with Adaptive Covariance Tapering

The GPI is established based on the formulation of Appendix C for input and output choices u = s [node spatial coordinates] and v = z^h [spatially varying surge for each storm], respectively, leading to $n_u = 2$. It is separately implemented for each storm. To formalize the implementation, let $i_w^h$ and $i_d^h$ represent the indices corresponding to the wet and dry nodes for the hth storm, with the number of elements in $i_w^h$ denoted as $n_w^h$ and the number of elements in $i_d^h$ as $n_d^h$. The training data for the hth storm are the $n_w^h$ observations, with input matrix U and output vector V (based on Appendix C terminology) corresponding to the columns of S and Z^h for indices $i_w^h$, with $n_{GP} = n_w^h$. The metamodel predictions given by Equation (A6) for each of the inputs corresponding to the remaining columns of S, for indices $i_d^h$, provide the desired pseudo-surge estimates. Ultimately, this corresponds to predictions for $n_d^h$ new inputs. Similarly to the kNN implementation, a progressive imputation can be achieved for each storm instead of estimating all pseudo-surge values in one iteration. A common correlation kernel used for spatial interpolation is the Matérn function [28,44], expressed for nodes i and j as a function of their distance $d_{ij} = d(\mathbf{s}_i, \mathbf{s}_j)$ and given by the following:
$$R(\mathbf{s}_i, \mathbf{s}_j \mid \boldsymbol{\gamma}) = \frac{1}{2^{v-1}\,\Gamma(v)} \left(\frac{d_{ij}}{d_r}\right)^{v} K_v\!\left(\frac{d_{ij}}{d_r}\right)$$
where K v is the modified Bessel function (of the second kind) of order v , and d r controls the rate of decay with distance while v controls the GP smoothness. The hyper-parameter vector of the correlation kernel is γ = [ d r   v ] . Note that the correlation kernel is strictly a function of the distance between nodes, i.e., R ( d i j | γ ) = R ( s i , s j | γ ) .
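Using SciPy's modified Bessel function, the Matérn kernel can be transcribed directly; the snippet below is a minimal sketch that handles the limiting value of 1 at zero distance explicitly:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, d_r, v):
    """Matern correlation R(d) = (2**(1-v)/Gamma(v)) * (d/d_r)**v * K_v(d/d_r).

    d_r controls the rate of decay with distance; v controls the smoothness.
    """
    d = np.asarray(d, dtype=float)
    scaled = d / d_r
    with np.errstate(invalid="ignore"):
        out = (2 ** (1 - v) / gamma(v)) * scaled ** v * kv(v, scaled)
    return np.where(d == 0.0, 1.0, out)   # R(0) = 1 by definition
```

For v = 1/2 this reduces to the exponential kernel exp(-d/d_r), a convenient sanity check.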
The challenge in the GPI implementation originates from the large number of wet nodes for each storm, since the formulation (see details in Appendix C) requires the inversion of a correlation matrix R [with elements $R(d_{ij} \mid \boldsymbol{\gamma})$] of dimension $n_{GP} = n_w^h$. This task has computational complexity of order $O[n_{GP}^3]$ [41]. A variety of solutions [45,46,47,48] have been established for promoting computational efficiency in GP implementations dealing with such large datasets. For spatial interpolation, covariance tapering is an especially appealing solution due to its intuitive interpretation, as it relies on spatial correlation/connectivity between nodes. Additionally, some of its variants [46] are appropriate for irregular grids, like the ones encountered for storm surge surrogate modeling applications. It should be noted that the foundation of covariance tapering resembles that of kNN: only a reduced number of (close proximity) neighbors should impact the GPI.
Mathematically, covariance tapering leverages inversion algorithms for sparse matrices [49] to reduce the computational burden of inverting large-dimensional covariance matrices. Sparsity is infused by setting unimportant correlations to zero in the GP correlation matrix R, such that the resultant matrix R consists of a very large number of zeros. A smooth taper function with compact support 0 T ( d i j | φ i , φ j ) 1 is introduced to achieve this objective, with φ i representing the hyper-parameter defining the compact support for each node, the so-called taper range (or taper length). When the nodal distance exceeds the compact support (defined through the respective taper ranges), zero correlation is presumed. By replacing R ( d i j | γ ) with R ( d i j | γ ) T ( d i j | φ i , φ j ) the desired sparsity in R can be achieved. To distinguish the resultant sparse matrix when tapering is used, the notation Rs will be used herein. Let ρ and ρ i represent the degrees of global and local sparsity, respectively. These correspond, respectively, to the proportion of non-zero entries in the overall matrix Rs, or in its ith column. The computational cost of inverting Rs for large degrees of sparsity is reduced to approximately O ( n G P ( ρ n G P ) 3 ) [50]. Note that the exact computational burden depends on the details of the implementation. The taper range can be selected to achieve the desired degree of global sparsity to achieve some targeted reduction in the computational burden.
The simplest implementation of covariance tapering is a constant tapering approach using the same taper range across all nodes φ i = φ [48], chosen based on the aforementioned target global sparsity ρ. Unfortunately for irregular grids, like the ones frequently encountered in storm surge applications [38], the constant tapering approach promotes non-uniform local sparsity, with some nodes having too few neighbors and others having too many, something that can adversely affect predictive accuracy. An adaptive covariance tapering formulation [46], explicitly trying to achieve the target local sparsity for each node, circumvents this challenge. An adaptive taper is utilized in this case for which φ i φ j for ij. An example of an adaptive taper, and the one utilized in this study, is as follows [46]:
$$T(d_{ij} \mid \varphi_i, \varphi_j) = \frac{1}{\pi R_{ij} r_{ij}} \begin{cases} \pi r_{ij}^2 & \text{if } d_{ij} < R_{ij} - r_{ij} \\[4pt] V_2\!\left(R_{ij}, \dfrac{d_{ij}^2 + R_{ij}^2 - r_{ij}^2}{2 d_{ij}}\right) + V_2\!\left(r_{ij}, \dfrac{d_{ij}^2 + r_{ij}^2 - R_{ij}^2}{2 d_{ij}}\right) & \text{if } R_{ij} - r_{ij} \le d_{ij} < R_{ij} + r_{ij} \\[4pt] 0 & \text{otherwise,} \end{cases}$$
where $r_{ij} = \min(\varphi_i, \varphi_j)/2$, $R_{ij} = \max(\varphi_i, \varphi_j)/2$, and
$$V_2(r, x) = \begin{cases} r^2 \cos^{-1}(x/r) - x\sqrt{r^2 - x^2} & \text{if } x < r \\ 0 & \text{else.} \end{cases}$$
This selection leads to compact support $d_{ij} \le (\varphi_i + \varphi_j)/2$ for the taper, and to a non-stationary, sparse covariance matrix that can have uniform local sparsity per node (i.e., constant $\rho_i$) even for highly irregular datasets.
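The taper and the segment-area function $V_2$ can be transcribed as below. This is a sketch under the disk-overlap interpretation assumed here (two disks of radii $\varphi_i/2$ and $\varphi_j/2$ at distance $d_{ij}$, overlap area normalized by $\pi R_{ij} r_{ij}$); the normalization should be checked against the original reference [46]:

```python
import numpy as np

def V2(r, x):
    """Area of the circular segment of a disk of radius r cut at distance x."""
    if x >= r:
        return 0.0
    return r ** 2 * np.arccos(x / r) - x * np.sqrt(r ** 2 - x ** 2)

def adaptive_taper(d, phi_i, phi_j):
    """Adaptive taper: normalized overlap of disks with radii phi_i/2, phi_j/2."""
    r, R = min(phi_i, phi_j) / 2.0, max(phi_i, phi_j) / 2.0
    if d < R - r:                       # smaller disk fully contained
        return (np.pi * r ** 2) / (np.pi * R * r)
    if d < R + r:                       # partial overlap: sum of two segments
        a = V2(R, (d ** 2 + R ** 2 - r ** 2) / (2 * d))
        b = V2(r, (d ** 2 + r ** 2 - R ** 2) / (2 * d))
        return (a + b) / (np.pi * R * r)
    return 0.0                          # beyond compact support
```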
A key aspect in the implementation of adaptive covariance tapering is the selection of the taper ranges per node $\{\varphi_i; i = 1, \ldots, n_z\}$ to achieve the desired local sparsity $\{\rho_i; i = 1, \ldots, n_z\}$. An iterative optimization was established in [46] to support this selection, but that procedure cannot scale to large datasets (beyond a few thousand nodes). Recently [30], a computationally efficient implementation of adaptive covariance tapering was established that can seamlessly scale to applications even with a few hundred thousand nodes. This is the formulation adopted in this study for performing imputation using GPI with adaptive covariance tapering. According to this formulation, a small subset of nodes, termed inducing points, is chosen for guiding the identification of the taper ranges. The ranges are only explicitly chosen for the inducing points, something that greatly simplifies the corresponding optimization, achieving the desired computational efficiency. The taper ranges for the remaining points are inferred from the taper range values chosen for the inducing points. The corresponding algorithm established in [30] for the identification of the taper ranges, termed M-IAT (iterative adaptive taper selection), is reviewed in Appendix E. M-IAT can only directly control the local sparsity of the inducing points, with the local sparsity for the remaining points influenced indirectly by the distribution of the inducing point set. The selection of the inducing points becomes, therefore, critical so that the decisions made based on the inducing points are compatible with the decisions that would have been made if all the points were used. An adaptive, iterative selection was proposed in [30] for the inducing points, and is the one adopted here.
This selection starts with a small number of inducing points, performs the taper identification utilizing them (step 1), and then augments them with additional points (step 2) corresponding to regions for which the discrepancy from the target local sparsity (based on the step 1 tapers) is the largest. The latter points represent the ones (among the remaining points not already chosen as inducing points) with the expected largest information value in guiding the taper range selection for achieving the target local sparsity throughout the entire domain. Steps 1 and 2 are repeated until convergence, defined as establishing a discrepancy from the target local sparsity below a specific threshold, is achieved. The corresponding algorithm established in [30] for selecting the inducing points, termed IIP (iterative inducing point identification), is reviewed in Appendix F.

5.3. Computational Complexity and Scalability of Spatial Interpolation Techniques

The computational complexity of kNN is of order $O(h\, n_z k + n_z^2)$ if the implementation relies on the estimation of distances and cosine similarity measures between all pairs of nodes. With some intelligent implementation to restrict this estimation to only a smaller number of $k_{pr}$ candidate neighbors (for example, within a rectangular domain centered at each node), this complexity may be reduced to $O(h\, n_z k + n_z k_{pr})$. As such, the computational burden of kNN may be considered to scale practically linearly with the size of the database (either storms or nodes), showing that it can be easily extended to larger datasets. On the other hand, the computational complexity of GPI is of order $O(n_z (\rho n_z)^3 + h\, n_z^2)$. Depending on the adopted value of ρ, this means that the computational burden of GPI scales linearly with the number of storms and at best quadratically with the number of nodes. As such, extension to very large datasets is expected to face challenges, depending on the available computational resources. A partial remedy to these challenges can be provided by employing GPU resources for the relevant matrix manipulations [51].
Note that the iterative implementation of the spatial interpolation techniques to support the progressive imputation of the database will increase the aforementioned computational costs, but the exact cost (or total number of iterations) is database dependent as it depends on the exact pattern of the missing data.

6. Imputation Based on Data Completion

6.1. Motivation for Imputation Utilizing Data Completion

Instead of trying to establish a predictive model by inferring spatial correlations, imputation approaches based on data completion utilize solely the original observation matrix Z. They infer patterns based on all observed data and then project these patterns into the missing data to accomplish the imputation task. As they do not utilize any spatial information, they avoid the pitfalls that might exist for spatial interpolation approaches, when spatial proximity does not necessarily translate into surge response correlation. For imputation of storm surges, they will be implemented utilizing all data, trying to infer patterns across storms and between nodes, to fill in the missing values.
A range of imputation methods relying on data completion exist; for example, variants of low-rank matrix completion [52,53] or principal component analysis (PCA) [54,55]. In general, such approaches perform well when there is sufficient information in the original data to infer the underlying patterns. For peak-surge imputation, they are expected to face challenges due to the limited information that is available for some nodes. This is evident in the results shown in Figure 2 for the case study database, with a significant portion of the nodes (over 15%) being inundated in less than 10% of the storms (i.e., over 90% missing data rate). For this reason, a two-stage imputation approach will also be explored, filling originally missing values based on spatial interpolation, to enrich information for the implementation of the data completion-based imputation in the second stage. The specific data completion imputation method utilized here is probabilistic principal component analysis (PPCA) [33], as it can easily scale to large datasets, like the ones frequently encountered in storm surge imputation, and is well aligned with the PCA-based dimensionality reduction utilized when Sc and Ss metamodels are established based on GP surrogate modeling principles.

6.2. Imputation Using PPCA

PPCA-based imputation [33,56] establishes first a dimensionality reduction by projecting the original data (with missing values) onto its principal components (latent outputs) and subsequently estimates the missing values by recovering the data from its dimension-reduced form. The specific PPCA-based imputation variant utilized here follows the implementation adopted in [56]. A succinct description for it is provided next.
The formulation assumes a linear relationship between the original surge vector and the vector of principal components as the following:
$$\mathbf{z} = \mathbf{W}\mathbf{t} + \boldsymbol{\mu} + \boldsymbol{\varepsilon}$$
where $\mathbf{t} \in \mathbb{R}^{n_t}$ denotes the $n_t$-dimensional principal component vector, $\mathbf{W} \in \mathbb{R}^{n_z \times n_t}$ the projection matrix, $\boldsymbol{\mu} \in \mathbb{R}^{n_z}$ the mean vector with the ith component representing the mean of $z_i$ (mean of the ith column of Z), and $\boldsymbol{\varepsilon} \in \mathbb{R}^{n_z}$ the isotropic noise. Within the PPCA implementation, the probabilistic distributions for $\mathbf{t}$ and $\boldsymbol{\varepsilon}$ are assumed, respectively, as $\mathbf{t} \sim N(\mathbf{0}, \mathbf{I})$ and $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, where $N(\mathbf{m}, \boldsymbol{\Sigma})$ represents the multivariate Gaussian distribution with mean vector $\mathbf{m}$ and covariance matrix $\boldsymbol{\Sigma}$, $\mathbf{0}$ the vector of zeros, $\mathbf{I}$ the identity matrix, and $\sigma^2$ the magnitude of the error. These assumptions lead to the conditional distribution of the original surge vector $\mathbf{z}$ given $\mathbf{t}$ as the following:
$$\mathbf{z} \mid \mathbf{t} \sim N(\mathbf{W}\mathbf{t} + \boldsymbol{\mu},\ \sigma^2 \mathbf{I}).$$
Based on the probability models in Equations (19) and (20), and using the surge response data (matrix Z) as samples of $\mathbf{z}$, the complete-data log-likelihood of the n observations (corresponding to the different storms) can be expressed as the following:
$$L = \sum_{h=1}^{n} \ln\{p(\mathbf{z}^h, \mathbf{t}^h)\} = \sum_{h=1}^{n} \left[\ln\{p(\mathbf{z}^h \mid \mathbf{t}^h)\} + \ln\{p(\mathbf{t}^h)\}\right]$$
where $\mathbf{t}^h = \mathbf{t}(\mathbf{x}^h)$ denotes the latent outputs for the hth storm. For maximizing this complete-data log-likelihood and, ultimately, facilitating the data imputation, an Expectation-Maximization (EM)-based [57] iterative formulation is adopted for computational efficiency. The expectation step estimates the latent outputs using the observed portion of Z only, and then fills in missing entries in Z based on the estimated latent outputs (along with fixed $\mathbf{W}$ and $\sigma^2$). The maximization step updates the estimates for $\mathbf{W}$ and $\sigma^2$ based on the updated complete matrix Z. Vector $\boldsymbol{\mu}$ can be either updated at each iteration using the imputed values or fixed to some reasonably initialized values at the beginning.
Following recommendations in [56], the adopted EM implementation also incorporates prior knowledge about expected patterns during the data completion process. This is critical for providing imputed data that exhibit the expected patterns. Specifically, for the storm surge imputation problem, the pattern to retain is that the missing data in the storm surge database fall below the node elevation (i.e., pseudo-surge values corresponding to a dry inundation state). For this reason, the missing entries in Z are initialized within the EM algorithm slightly (5 cm) below the corresponding node elevation.
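A simplified EM-style version of the imputation can be sketched as follows. This stand-in replaces the full PPCA updates of W and σ² with repeated truncated-SVD reconstructions (iterative PCA), but retains the initialization slightly below node elevation described above; it is an illustration, not the authors' implementation:

```python
import numpy as np

def ppca_impute(Z, e, n_t=5, n_iter=50, init_offset=0.05):
    """Iterative low-rank imputation of a surge matrix Z (storms x nodes).

    Missing entries (NaN) are initialized init_offset (5 cm) below the node
    elevations e, then repeatedly refilled with their rank-n_t reconstruction.
    Observed entries are never modified.
    """
    Z = Z.copy()
    miss = np.isnan(Z)
    Z[miss] = np.broadcast_to(e - init_offset, Z.shape)[miss]
    for _ in range(n_iter):
        mu = Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        recon = (U[:, :n_t] * s[:n_t]) @ Vt[:n_t] + mu
        Z[miss] = recon[miss]           # expectation-like refill of missing data
    return Z
```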

6.3. Two-Stage Imputation Combining Spatial Interpolation and Data Completion Techniques

To enrich the information available for imputation relying on data completion, a two-stage formulation is promoted. In the first stage, spatial interpolation is performed separately for each storm; for example, using kNN or GPI as detailed in Section 5. The incorrectly classified imputed data, corresponding to pseudo-surge values $\tilde{z}_i^h > e_i$, are still treated as missing data. This leads to a partially imputed database that is subsequently imputed in the second stage using a data completion imputation technique. Through the introduction of the first stage, the database utilized in the data completion imputation is substantially enriched, while the enriched data are also guaranteed to facilitate correct classification based on the pseudo-surge.
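The two-stage logic can be expressed generically, with the spatial-interpolation and data-completion steps passed in as callables; the function name and interfaces are illustrative:

```python
import numpy as np

def two_stage_impute(Z, e, spatial_impute, data_complete):
    """Two-stage imputation: spatial interpolation, then data completion.

    Z: (n_storms, n_nodes) surge matrix with NaN for dry nodes; e: elevations.
    spatial_impute(Z) and data_complete(Z) are user-supplied callables (e.g.
    kNN/GPI and PPCA) returning a filled matrix.
    """
    miss = np.isnan(Z)
    Z1 = spatial_impute(Z)
    # stage-1 pseudo-surge above elevation would classify the node as wet:
    # discard it, i.e., keep the entry missing for stage 2
    bad = miss & (Z1 > e)
    Z1[bad] = np.nan
    Z1[~miss] = Z[~miss]            # original observations are always kept
    return data_complete(Z1)
```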

6.4. Computational Complexity and Scalability of PPCA Imputation

Though the computational complexity of PPCA depends on the specifics of the EM algorithm used [58], with an appropriate EM implementation it can be reduced to order $O(h\, n_z n_t T)$, where T is the total number of EM iterations. This shows that, similar to kNN, PPCA may be considered to scale practically linearly with the size of the database (either the number of storms or nodes), meaning that it can be easily extended to larger datasets. Compared to the computational complexity of the spatial interpolation techniques presented in Section 5.3, PPCA is expected to outperform GPI and have a computational burden similar to that of kNN. In fact, PPCA can outperform even kNN if the progressive imputation of the latter requires a significant number of iterations.

7. Illustrative Case Study

The illustrative case study considers implementations of the proposed advances for establishing surrogate models for the CHS-LA database described in Section 3. The advantages of the different imputation advancements are examined with respect to the accuracy of the surrogate models that are subsequently calibrated using the imputed databases. The validation adopts a 10-fold CV implementation, following guidelines in Section 4.3.

7.1. Variants Examined

Results are presented for the entire database consisting of $n_z$ = 1,179,179 nodes, of which $n_r$ = 488,216 are dry in at least one storm, as well as for a smaller domain around New Orleans, shown with the red boundary in Figure 3, with $n_z$ = 200,200 nodes, of which $n_r$ = 79,536 are dry in at least one storm. The smaller database is introduced in order to accommodate the efficient implementation of the GPI-based imputation. Even with adaptive covariance tapering, this implementation cannot scale to the $n_z$ = 1,179,179 nodes case unless a large-memory GPU-based workstation is utilized (see discussions in Section 5.3). A partitioning of the entire domain into smaller subdomains is needed to facilitate the GPI-based imputation. Without loss of generality, in the case study, the focus is placed on one of these subdomains only, with the understanding that implementation across the entire domain can be facilitated by repeating the process for other subdomains. The consideration of the subdomain additionally allows examination of the robustness of the different imputation strategies [especially of the data-completion strategies, as will be showcased in the discussions in Section 7.4] with respect to the size of the considered database. The notations Df (full domain) and Ds (subdomain) are introduced to distinguish the different domains (and corresponding datasets). The spatial distribution of nodes, along with the proportion of instances in which they are inundated, is presented for Df and Ds in Figure 3a and Figure 3b, respectively. Additionally, surge gaps of once-dry nodes within Df and Ds are presented, respectively, in Figure 4a and Figure 4b. It is evident from these figures that the large storm surge gaps appear in regions with predominantly dry nodes, located next to flood protection systems around New Orleans. Note that Figure 3 and Figure 4 will be revisited in Section 7.5 when examining the spatial distribution of surrogate model prediction errors.
Imputation is performed using (i) spatial interpolation (kNN, GPI) or (ii) data completion (PPCA), as well as a combination of (i) and (ii) through the two-stage implementation presented in Section 6.3. Notations kNN, GPI, PPCA or, for the combined implementations, kNN-PPCA and GPI-PPCA will be used to distinguish the different cases. Note that the GPI variants (GPI and GPI-PPCA) will be examined only for the Ds domain. Additionally, for the kNN implementation, two formulations will be explored for incorporating hydraulic connectivity: the previous strategy [22] using the ADCIRC grid to define this connectivity and the newly proposed approach using a response-based connectivity. Notation kNNrc will be used to differentiate the approach when the response-based connectivity is utilized, with notation kNN used for the original [22] formulation. The GPI implementation corresponds to sparsity $\rho_i$ = 0.12% [250 connected neighbors per node]. The surrogate models are established using either the pseudo-surge database or the corrected pseudo-surge database. Results are presented for the combination of Ss and Sc surrogate models for the pseudo-surge database and for the use of only the Ss surrogate model for the corrected pseudo-surge database, as the intention of the modification of the pseudo-surge within this database is to avoid the use of the secondary classification surrogate.

7.2. Imputation Results

Results are initially discussed for the imputation statistics. To better frame the discussion, some statistics for the databases are first reviewed in Table 2. This table presents the proportion of the nodes belonging to different groups: the once-dry nodes as well as nodes with a surge gap over specific thresholds. The respective percentages (%) of instances in which these nodes are inundated within the original database are also shown in this table. Table 3 and Table 4 then present similar statistics for the different imputation approaches, with Table 3 presenting results for the individual imputation variants, and Table 4 presenting results for the two-stage combination variants. Specifically, these tables present the proportion of the once-dry nodes that belong to the problematic node set $A_{pn}$, as well as the nodes that belong to groups $C_2$ and $C_3$. Recall that group $C_2$ corresponds to the modified problematic node set $\bar{A}_{pn}$ and $C_3$ to its complement. Similarly to Table 2, the respective percentages (%) of instances in which these nodes are inundated within the original database are also shown.
Results show that the problematic nodes (group A p n or C 2 ) correspond to predominantly dry nodes in the original database, a trend agreeing with the one illustrated in [22]. Comparing across the imputation techniques relying on spatial interpolation (kNN variants and GPI), the kNN with response-based connectivity between nodes (kNNrc) shows some small degradation of performance over kNN, with percentages of nodes belonging to groups A p n or C 2 increasing. GPI demonstrates an additional, larger performance degradation over the kNN variants. It should be pointed out that these preliminary results should not be interpreted as preference for kNN over kNNrc and GPI; the performance of the imputation strategies needs to be examined with respect to the accuracy of the supported surrogate model and not with respect to small increases in the number of problematic nodes. On the other hand, the data completion strategy, PPCA, leads to a significant reduction in imputation quality, with almost all imputed nodes falling into the problematic category according to the results in Table 3. Though not definitive, this pattern provides a strong indication that this imputation approach will not be beneficial for peak-surge imputation.
Results for the two-stage combination variants in Table 4 show a significant improvement compared to PPCA, and moderate improvement for the kNN and GPI variants with respect to both A p n and C 2 . For PPCA-based imputation, comparisons across Table 3 and Table 4 verify the expectations presented in Section 6.3. Enriching the information for predominantly dry nodes through initial spatial interpolation is highly beneficial for the data completion imputation strategies. The two-stage combination also illustrates improvements over the pure spatial interpolation strategies, especially GPI. To further examine the improvement achieved through the combination, the histograms of the misclassification of the imputation for the once-dry nodes, with (subplots in the left column) or without the data completion (subplots in the right column), are presented in Figure 5 and Figure 6, respectively, for Df (full domain) and Ds (subdomain). In Figure 6, the results for kNN and kNN-PPCA are not presented as they are almost identical to those for kNNrc and kNNrc-PPCA. The results in the figures clearly demonstrate that the second stage of implementing PPCA-based data completion helps to reduce the portion of nodes with especially high misclassification (close to or over 80%), which is consistent with the observations from Table 3 and Table 4. As discussed in the previous paragraph, the important question is whether these benefits translate to improved predictive accuracy for the surrogate model developed based on the imputed datasets. It is important to note that the trends across all tables are very comparable between the domains Df and Ds, illustrating a degree of similarity between them.

7.3. Surrogate Model Performance for the Entire Domain

The performance of the imputation strategies is evaluated next with respect to the accuracy of the established surrogate models. In this section, results for the entire database Df are discussed, meaning that only kNN variants are examined for imputation using spatial interpolation. Results for the subdomain Ds will be examined in the next section. The validation metrics correspond to the misclassification and surge score discussed in Section 4.3. Results are presented for different node groupings: the entire domain, the once-dry nodes, node sets for different surge gaps, and groups $C_2$ and $C_3$. It should be pointed out that groups $C_2$ and $C_3$ change across the variants, as illustrated earlier in Table 3 and Table 4, meaning that comparisons of their statistics across the variants do not yield any meaningful trends. Results for these groups are only presented for examining trends within each variant. Table 5 shows results for the surge score $\overline{SC}$, Table 6 for the total misclassification $\overline{MC}$, Table 7 for the false positive misclassification $\overline{MC}^+$, and Table 8 for the false negative misclassification $\overline{MC}^-$. As discussed in Section 7.1, results are presented for the combination of Ss and Sc surrogate models for the pseudo-surge database and for the use of only the Ss surrogate model for the corrected pseudo-surge database. To examine the imputation quality in detail, corresponding results when using the Ss surrogate model for the pseudo-surge database are presented in Table A1, Table A2, Table A3 and Table A4 in Appendix G. The presentation mirrors the one in Table 5, Table 6, Table 7 and Table 8. It should be pointed out that this formulation is not a recommended implementation for storm surge surrogate modeling [22]; the two formulations whose results are presented in Table 5, Table 6, Table 7 and Table 8 are the promoted alternatives.
Before discussing the results, it is important to stress that, to evaluate and ultimately compare the performance of the imputation strategies, all the tables need to be examined together, with primary focus on the surge score (Table 5), which directly represents the predictive accuracy of the resultant surrogate models, and secondary focus on the total misclassification (Table 6), which quantifies the quality of the inundation state predictions. The false positive and false negative misclassification results can be used to interpret the trends in the performance and should be linked to the surge score and total misclassification, and to the characteristics of each imputation strategy, as discussed in Section 5 or Section 6. For example, a low false negative rate that is not combined with a low total misclassification rate does not indicate superiority, but rather an overprediction bias. Such a bias will naturally be reflected in a higher surge score and will therefore be identified when results across all tables are comprehensively evaluated.
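For concreteness, the per-node misclassification metrics referenced above can be sketched as follows. The exact definitions follow Section 4.3 (not reproduced here), so the convention below (false positive = node predicted wet while actually dry) is an assumed but standard reading:

```python
import numpy as np

def misclassification_rates(pred_wet, true_wet):
    """Per-node misclassification rates across storms.
    Rows = storms, columns = nodes; entries are boolean inundation states.
    Assumed convention: false positive = predicted wet, actually dry."""
    pred_wet = np.asarray(pred_wet, dtype=bool)
    true_wet = np.asarray(true_wet, dtype=bool)
    n_storms = pred_wet.shape[0]
    mc_plus = np.sum(pred_wet & ~true_wet, axis=0) / n_storms   # MC_i^+
    mc_minus = np.sum(~pred_wet & true_wet, axis=0) / n_storms  # MC_i^-
    return mc_plus + mc_minus, mc_plus, mc_minus                # MC_i, MC_i^+, MC_i^-
```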
Performance across the different databases and combinations of regression and classification options verifies the trends observed in [22]. The combination of Ss and Sc provides the best overall performance, with results relying on the use of Ss only consistently yielding lower accuracy (larger surge score values in Table 5 and Table A1 and larger misclassification in Table 6 and Table A2), predominantly originating from false positive predictions, as evident from Table 7 and Table A3. The performance deteriorates significantly when only Ss is implemented utilizing the pseudo-surge database without any corrections (Table A1, Table A2, Table A3 and Table A4). As discussed in Section 4, these trends, especially the propensity for false positives, are expected and originate from the wrong information introduced into the database during the imputation process supporting the development of the regression surrogate model. Note that this propensity still exists when the corrected pseudo-surge database is utilized (Table 5, Table 6, Table 7 and Table 8). The propensity, in this case, can be attributed [22] to the inability of the correction (setting the surge value to be always smaller than the node elevation) to capture the appropriate underlying variation characteristics for the pseudo-surge across the storms with different characteristics in the original database. However, the predictive accuracy of all considered surrogate model implementations deteriorates consistently for the problematic nodes corresponding to large surge gaps, demonstrating the challenges associated with this type of node.
Moving now to the comparison across the different imputation variants, focusing first on the different kNN strategies, both exhibit similarly good performance in general (Table 5, Table 6, Table 7 and Table 8). However, the use of the response-based connectivity between nodes, i.e., the kNNrc case, results in the best performance (lower surge score and best misclassification results) for the surrogate model implementation corresponding to the combination of Ss and Sc, especially for the problematic nodes with larger surge gaps (Table 5). These are important results, as they demonstrate that incorporating hydraulic connectivity based on surge response similarity into the imputed database can improve the predictive performance of the surrogate model for more challenging regions without compromising its accuracy for other regions, which was the motivation for introducing the new concept of hydraulic connectivity. When predictions rely strictly on the use of the regression surrogate model (Ss), utilizing either the pseudo-surge database or the corrected pseudo-surge database, kNN demonstrates superior performance compared to kNNrc, primarily originating from better false positive misclassification accuracy (Table 7 and Table A3). This pattern agrees with the trends reported earlier in Table 3, with kNN yielding a smaller percentage of problematic nodes. In this case, however, these trends suggest that while Ss developed from the kNN-imputed database tends to underestimate surge levels for nodes with large surge gaps, the higher false positive rate of Ss developed from the kNNrc-imputed database can actually improve final prediction accuracy for these nodes when combined with Sc. This stresses, again, an advantage of introducing response-based connectivity information in the imputation process, as the approach can ultimately promote more accurate predictions for the storm surge.
Imputation based entirely on data completion (PPCA) shows superior performance compared to the other imputation strategies when considering the implementation that uses Ss only, based on the pseudo-surge database (Table A1, Table A2, Table A3 and Table A4). Its superiority originates from better false positive misclassification for the highly problematic nodes with larger surge gaps. This should be attributed to the ability of the promoted data completion strategy to better fit the underlying trends of the original database. Still, this performance is substantially worse compared to strategies relying on the combination of Ss and Sc surrogates or the use of the corrected pseudo-surge database (Table 5, Table 6, Table 7 and Table 8), and so it should not be interpreted as superiority of the PPCA imputation strategy. In fact, in Table 5, Table 6, Table 7 and Table 8, PPCA slightly underperforms the spatial interpolation variants, with vulnerabilities originating from the false positive misclassification for the corrected pseudo-surge database strategy (use of Ss only) and from the false negative misclassification for the pseudo-surge database strategy (use of the combination of Ss and Sc). This change in the patterns should be attributed to the larger portion of nodes identified as problematic for the PPCA implementation (Table 3). Overall, the discussions indicate a small preference for the spatial imputation strategies over the pure data completion ones.
Finally, examining the two-stage imputation strategies, kNN-PPCA and kNNrc-PPCA, they demonstrate the best overall performance and the most robust trends across the different surrogate modeling strategies. This is not surprising, as they combine the two imputation approaches to fully leverage their complementary strengths and include an intermediate check that ultimately reduces the number of problematic nodes, as evident in the results in Table 4 (compare values to those in Table 3). This intermediate check and the enrichment of the information for PPCA contribute to a remarkable improvement, compared to the alternative imputation variants, when considering the implementation using Ss only, based on the pseudo-surge database (Table A1, Table A2, Table A3 and Table A4). More importantly, they yield some improvements, though admittedly smaller than for the other imputation variants, for the promoted surrogate modeling alternatives presented in Table 5, Table 6, Table 7 and Table 8. The improved performance seems to originate from better misclassification rates. Overall, the discussion shows that for peak storm surge predictions, the combination of spatial interpolation (stage 1) and data completion (stage 2) imputation strategies supports higher accuracy surrogate modeling predictions, a trend that agrees with the results reported in [26] for time-series predictions.

7.4. Surrogate Model Performance for the Subdomain

This section evaluates the performance of the imputation strategies with respect to the accuracy of the established surrogate models for the subdomain Ds. Results are presented for the two promoted surrogate model alternative implementations in Table 9, Table 10, Table 11 and Table 12, mirroring the presentation format of Table 5, Table 6, Table 7 and Table 8. Additionally, results are presented in Table A5, Table A6, Table A7 and Table A8 for the implementation using only the Ss surrogate model with the pseudo-surge database, mirroring the presentation format of Table A1, Table A2, Table A3 and Table A4. In all instances, results for the GPI imputation strategy are also reported in this section.
With respect to the remaining imputation strategies, excluding the GPI implementation, the trends are similar to the ones presented in Section 7.3, with the primary difference being a relatively small reduction in the performance of the imputation strategies utilizing data completion, especially PPCA, but also kNN-PPCA and kNNrc-PPCA. This should be attributed to the 6-fold reduction in the database size (from Df to Ds). This negatively influences the quality of imputation by data completion, and subsequently the performance of the established regression surrogate model. The consistency of the remaining trends, and the justified small reduction in the performance of the approaches relying on data completion imputation, provide a strong validation of the results and arguments presented in Section 7.3, demonstrating that they are not database dependent.
Focusing now on the performance of GPI imputation, it offers comparable results to the other imputation approaches relying on spatial interpolation, kNN and kNNrc, demonstrating that it is a viable alternative. This holds true both on its own (GPI) and when combined with data completion within the two-stage imputation strategy (GPI-PPCA). The trends vary across the different surrogate model strategies, i.e., whether the combination of Ss and Sc is used or whether predictions are established using solely Ss based on the pseudo-surge database or the corrected pseudo-surge database. The best overall performance is established when utilizing the combination of Ss and Sc, as expected, and performing imputation based on GPI or GPI-PPCA. For the highly problematic nodes, corresponding to larger surge gaps, kNNrc still remains the preferred choice.
Overall, these comparisons, coupled with the discussions in Section 7.3, demonstrate that spatial interpolation with global characteristics (i.e., GPI) tends to better capture the global trends for the surge behavior compared to spatial interpolation based on localized characteristics (kNN variants). Still, the latter have the ability to better capture the discontinuous behavior around problematic nodes corresponding to large surge gaps, especially when connectivity information based on flooding patterns is leveraged within the imputation process, as evident by the superiority of kNNrc over kNN. Combining spatial interpolation with data completion imputation strategies offers additional robustness in the predictions and has the potential to further improve accuracy, depending on the quality of the information (size of database) available for the data completion approaches.

7.5. Trends for Distribution of Errors

This section examines the distribution of errors (surge score and misclassification) for the best performing variants that were identified in Section 7.3 and Section 7.4. These correspond to kNNrc and GPI, as well as their counterparts when coupled with PPCA, kNNrc-PPCA and GPI-PPCA (all variants corresponding to those using the pseudo-surge database with the surrogate model combination of Ss and Sc). Figure 7 and Figure 8 present results for the spatial distribution of the surge score across the nodes, with Figure 7 offering results for the entire domain Df and Figure 8 for the subdomain Ds. Figure 9 and Figure 10 present the histograms of misclassification for the same implementation variants illustrated in Figure 7 and Figure 8, corresponding to the entire domain Df and the subdomain Ds, respectively.
Focusing first on the spatial distributions of the surge score (Figure 7 and Figure 8), the surge scores of nodes that are closer to those with large surge gaps are generally larger (see the locations of once-dry nodes and their surge gaps in Figure 4), whereas predominantly dry nodes (those with lower wetness in Figure 3) have small surge score values regardless of their proximity to the nodes with large surge gaps. The former trend is expected, as nodes with larger surge gaps pose greater challenges in achieving accurate surge predictions, both for themselves and for nearby nodes. The better accuracy for the predominantly dry nodes can be attributed to the incorporation of the classification surrogate model Sc, which enhances the predictive accuracy of node conditions significantly. When comparing surge score distributions between variants, the spatial distribution trends remain largely the same, with very small differences across the different variants. This is true even when focusing on the comparisons of the variants with (subplots in the right column) and without data completion (subplots in the left column); the two-stage imputation strategy contributes to a slight general improvement in the accuracy, consistent with observations reported from Table 5 and Table 9, though the distributions do not significantly change. The patterns are the same for both the entire domain Df and the subdomain Ds.
Examining next the histograms of the misclassification (Figure 9 and Figure 10), small misclassification is reported for most of the domain, with nodes that exhibit higher than 5% misclassification corresponding to only a small portion of the database: less than 15% for the entire domain Df and less than 10% for the subdomain Ds. This result shows a remarkable robustness with respect to appropriately classifying the correct inundation state (dry or wet) for the vast majority of the nodes within the domain of interest. Again, small differences are exhibited across the variants, verifying the trends exhibited in Figure 7 and Figure 8 for similarity of the predictive performance across the nodes within the broader domain. This shows that the different variant implementations promoted here share the same robustness when evaluated with respect to the worst-case performance for individual nodes within the domain, either with respect to surge score (Figure 7 and Figure 8) or misclassification (Figure 9 and Figure 10) of the node inundation state.

8. Conclusions

The development of surrogate models or metamodels for the prediction of peak storm surge requires imputation of the original simulation data for inland nodes that have remained dry in some of the synthetic storm simulations, to estimate the so-called pseudo-surge. This paper examined three distinct advancements for the imputation of peak storm surge databases. In the first advancement, the use of response-based correlation was examined for infusing hydraulic connectivity information within a kNN-based spatial interpolation. The response-based correlation between nodes was defined using the binary response information representing the condition of each node (i.e., inundated or not), with only nodes whose correlation with the target node is higher than a specified threshold considered as candidate neighbors in the kNN implementation. In the second advancement, a Gaussian Process interpolation (GPI) was examined as a spatial imputation strategy, relying on a recently established adaptive covariance tapering scheme for accommodating an efficient GPI implementation for large datasets. In the third advancement, imputation using data completion, specifically utilizing PPCA, was examined as an alternative to spatial interpolation techniques. Additionally, the combination of spatial imputation with PPCA was also considered, with the objective to enrich the information before performing PPCA. The advantages of the advancements were examined with respect to the accuracy of the surrogate model that is subsequently calibrated using the imputed databases.
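The response-based connectivity idea in the first advancement can be sketched compactly: correlations are computed on the binary wet/dry indicators across storms, and only sufficiently correlated nodes are admitted as kNN candidates. The correlation estimator (Pearson) and the threshold value below are illustrative assumptions, as the calibrated settings are given in the main text:

```python
import numpy as np

def candidate_neighbors(wet, target, threshold=0.7):
    """Return indices of nodes whose binary inundation history correlates
    with the target node above `threshold` (Pearson correlation on the
    storm-by-node wet/dry matrix). The threshold value is illustrative."""
    wet = np.asarray(wet, dtype=float)           # rows: storms, cols: nodes
    t = wet[:, target]
    tc = t - t.mean()
    wc = wet - wet.mean(axis=0)
    denom = np.linalg.norm(wc, axis=0) * np.linalg.norm(tc)
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = (wc.T @ tc) / denom               # correlation with each node
    corr[target] = -np.inf                       # exclude the target itself
    return np.flatnonzero(corr >= threshold)
```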
An illustrative case study regarding surrogate modeling predictions for Louisiana, focusing on New Orleans, was considered utilizing (and imputing) the Coastal Hazards System-Louisiana (CHS-LA) coastal storm hazards study database. A comprehensive k-fold cross validation implementation was established, and the key outcomes of the investigation are as follows:
  • As reported in previous studies, the best overall performance can be established by combining a regression and a classification surrogate model, with the regression model developed utilizing directly the imputed database without any corrections.
  • Spatial interpolation using response-based connectivity information can better capture discontinuous patterns in surge responses, demonstrating superiority in supporting surrogate model development with improved predictive accuracy at problematic nodes. Note that such nodes can be identified by the larger surge gaps in the original data.
  • Imputation utilizing GPI is a viable alternative to kNN as a spatial interpolation-based imputation strategy, and can better capture the global surge trends, offering superior overall performance. However, it cannot capture discontinuous patterns in surge responses, and extending GPI with response-based connectivity information within the covariance kernel is not straightforward. Nevertheless, this is a topic that should be investigated in the future.
  • Data completion imputation for the peak-surge demonstrates very poor performance, originating from the fact that for many (predominantly dry) nodes, very limited information is available in the original database to establish the underlying patterns to fill in the missing data. An enrichment stage of the original data, using spatial interpolation in the first stage, is needed for data completion to offer any advantages.
  • The two-stage imputation strategy combining spatial interpolation (stage 1) with data completion (stage 2) for the misclassified instances from the first stage offers additional robustness in the predictions and has the potential to further improve accuracy compared to the approach implemented in stage 1. The degree of improvement depends on the quality of information available for the second stage, which is a function of both the size of the database as well as the degree of enrichment established in the first stage for the originally predominantly dry nodes.
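The two-stage strategy summarized in the last bullet can be sketched end-to-end. The snippet below uses inverse-distance kNN for stage 1 and a truncated-SVD completion as a simplified stand-in for the stage 2 PPCA step; all function names, the neighbor count, and the rank are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def two_stage_impute(Z, dry, coords, elev, k=8, pca_rank=3, n_iter=30):
    """Stage 1: fill dry (missing) entries by inverse-distance kNN over wet
    neighbors. Stage 2: re-impute entries that stage 1 left at or above the
    node elevation (misclassified) with a low-rank PCA completion, a
    simplified stand-in for the PPCA step. All settings are illustrative.
    Z, dry, elev: (n_storms, n_nodes); coords: (n_nodes, n_dims)."""
    Z = Z.copy()
    # Stage 1: spatial interpolation, storm by storm
    for h in range(Z.shape[0]):
        wet = ~dry[h]
        if wet.sum() < k:
            continue
        wtree = cKDTree(coords[wet])
        d, j = wtree.query(coords[dry[h]], k=k)
        w = 1.0 / np.maximum(d, 1e-9)
        Z[h, dry[h]] = (w * Z[h][wet][j]).sum(axis=1) / w.sum(axis=1)
    # Stage 2: PCA completion for stage-1 values implying a wet state
    bad = dry & (Z >= elev)
    if not bad.any():
        return Z
    for _ in range(n_iter):
        mu = Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        recon = mu + (U[:, :pca_rank] * s[:pca_rank]) @ Vt[:pca_rank]
        Z[bad] = recon[bad]
    return Z
```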
It is important to recognize that a limitation of the present study is that the quantification of performance improvement focused solely on the predictive accuracy of the established surrogate models (calibrated using the imputed databases). However, as coastal hazard is ultimately represented and communicated through risk descriptions derived from these predictions, such as storm surge thresholds with prescribed annual exceedance probabilities, future work should examine the impact of the different imputation strategies on the accuracy of such downstream hazard estimates.

Author Contributions

Conceptualization, A.A.T., W.J., C.I., and N.C.N.-C.; methodology, W.J., C.I., and A.A.T.; software, W.J. and C.I.; validation, W.J., C.I., and A.A.T.; data curation, N.C.N.-C., L.A.A., and M.C.Y.; writing—original draft preparation, A.A.T., W.J., and C.I.; writing—review and editing, A.A.T., W.J., C.I., N.C.N.-C., L.A.A., and M.C.Y.; funding acquisition, A.A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been done under contract with the U.S. Army Corps of Engineers (USACE), Engineer Research and Development Center, Coastal and Hydraulics Laboratory (ERDC-CHL), under grant number W912HZ-22-C-0041.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The database of synthetic storms used in this study is part of the U.S. Army Corps of Engineers (USACE) Coastal Hazards System (CHS) program (https://chs.erdc.dren.mil, accessed 27 August 2025).

Acknowledgments

The support of the USACE Civil Works R&D and the Coastal Hazards System (CHS) program is gratefully acknowledged.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

The following nomenclature and abbreviations are used throughout this manuscript:
Terminologies
always-wet nodes: Nodes that have been inundated across all storms in the database
once-dry nodes: Nodes that have been dry at least once (i.e., with missing data) in the database
problematic nodes: Nodes whose imputed pseudo-surge values were misclassified (above node elevations) at least once
pseudo-surge database: Database in which all missing data are filled with pseudo-surge values through imputation
corrected pseudo-surge database: Database in which all missing data are filled with pseudo-surge values through imputation, with any misclassified pseudo-surge values corrected to be below node elevations
Notations
$A_{pn}$: Group of problematic nodes
$\bar{A}_{pn}$: Modified group $A_{pn}$ containing only problematic nodes with high misclassification rates and large discrepancies from node elevations
$C_1$: Class of always-wet nodes
$C_2$: Class of once-dry nodes whose pseudo-surge values exhibit high misclassification rates and large discrepancies from node elevations (corresponding to $\bar{A}_{pn}$)
$C_3$: Class of once-dry nodes whose pseudo-surge values exhibit low misclassification rates and small discrepancies from node elevations
Ss: Regression surrogate model used to predict surge levels
Sc: Classification surrogate model used to predict node condition (dry or wet)
Df: Full domain containing 1,179,179 nodes
Ds: Subdomain of the full domain (Df) containing 200,200 nodes
Validation metrics
$SC_i$: Averaged surge score for the ith node across all storms in the database
$MC_i$: Misclassification rate for the ith node across all storms in the database
$MC_i^+$: False positive rate for the ith node across all storms in the database
$MC_i^-$: False negative rate for the ith node across all storms in the database
$\bar{V}$: Averaged validation metric $V_i$ across all nodes ($V_i = SC_i$, $MC_i$, $MC_i^+$, or $MC_i^-$)
Examined imputation strategies
kNN: Spatial interpolation-based imputation using weighted k-nearest neighbor (kNN) interpolation, where the connectivity of the original numerical model grid is considered as hydraulic connectivity
kNNrc: Spatial interpolation-based imputation using weighted kNN interpolation, where response similarity between nodes is considered as hydraulic connectivity, as proposed in this study
GPI: Spatial interpolation-based imputation using Gaussian Process interpolation (GPI) with an adaptive covariance tapering scheme
PPCA: Data completion-based imputation using probabilistic principal component analysis (PPCA)

Appendix A. Review of GP-Based Regression Ss Metamodel

This appendix reviews the GP-based Ss metamodel that is utilized in the case study. This metamodel is formulated [12] using as input matrix X and as output matrix the imputed Z, and is established through the following steps:
  • Step 1: Dimensionality reduction. To address the high dimensionality of the output z, principal component analysis (PCA) [59] is used as a dimensionality reduction technique [12]. PCA is performed on Z and identifies, through a linear projection matrix $P_s$, a small number of $m_s$ latent outputs (principal components) $\{\underline{z}_t;\ t = 1, \dots, m_s\}$ that best explain the variance of the original observation matrix Z, with $m_s < n \ll n_z$. For each component, it also provides the vector $\underline{Z}_t \in \mathbb{R}^n$ of observations, whose hth row corresponds to the latent output for the hth storm, and its relative importance $\omega_t$, corresponding to the portion of the variance of the original observation matrix Z that can be explained by the respective component. The latter information is leveraged to select the number of principal components to retain [59]. The PCA transformation also utilizes the mean (across the storm database) of the observations for each $z_i$, denoted herein as $\mu_i$. This simply corresponds to the mean of each column of Z.
  • Step 2: Principal component GP calibration. For each of the $m_s$ principal components, a separate GP is developed using the input–output observation pair X–$\underline{Z}_t$. This is accomplished through the formulation reviewed in Appendix C for definitions $u = x$, $v = \underline{z}_t$, $U = X$, and $V = \underline{Z}_t$, with $n_{GP} = n$ and $n_u = n_x$. The GP provides the predictive mean $\tilde{\underline{z}}_t(x)$ and variance $(\sigma_t^{\underline{z}}(x))^2$ for each latent component, given by Equations (A6) and (A7), respectively. Note that the dimensionality reduction established in Step 1 accommodates a computationally efficient GP calibration separately for each principal component, since $m_s$ is typically small (relative to both n and, especially, $n_z$).
  • Step 3: Storm surge predictions. Combining the predictions for all principal components, the metamodel approximation for the storm surge is obtained using the linear PCA transformation. Adopting the notation $[\cdot]_{it}$ to denote the $\{i, t\}$ element of a matrix (corresponding to the ith row and tth column), the mean and variance predictions for the ith node are as follows [12]:
$$\tilde{z}_i(x) = \mu_i + \sum_{t=1}^{m_s} [P_s]_{it}\, \tilde{\underline{z}}_t(x),$$
$$(\sigma_i(x))^2 = \sum_{t=1}^{m_s} [P_s]_{it}^2\, (\sigma_t^{\underline{z}}(x))^2.$$
Utilizing the Gaussian probabilistic nature of the GP estimates under the PCA linear transformation [12], the inundation probability $p_i^s(x)$ according to Ss is finally given by the following [22]:
$$p_i^s(x) = \Phi\left[ \frac{\tilde{z}_i(x) - e_i}{\sigma_i(x)} \right]$$
where $\Phi[\cdot]$ denotes the standard Gaussian cumulative distribution function.
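The three steps above can be sketched with standard scientific Python tools. This is a minimal illustration, not the study's implementation: the kernel, nugget (alpha), and number of retained components are placeholder choices, and the paper's pattern-search MLE calibration is replaced by scikit-learn's default optimizer:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_pca_gp(X, Z, n_components=3):
    """Step 1: PCA on the surge matrix Z (storms x nodes); Step 2: one GP
    per retained principal component (illustrative kernel and settings)."""
    pca = PCA(n_components=n_components)
    latents = pca.fit_transform(Z)               # columns are the latent outputs
    gps = [GaussianProcessRegressor(kernel=RBF(), alpha=1e-6).fit(X, latents[:, t])
           for t in range(n_components)]
    return pca, gps

def predict_pca_gp(pca, gps, x):
    """Step 3: recombine per-component GP means/variances through the
    linear PCA map, mirroring Equations (A1)-(A2)."""
    means, stds = zip(*(gp.predict(x, return_std=True) for gp in gps))
    zlat = np.column_stack(means)                # (n_pred, m_s)
    var_lat = np.column_stack(stds) ** 2
    z_mean = pca.mean_ + zlat @ pca.components_      # mu_i + sum_t [P_s]_it ztilde_t
    z_var = var_lat @ pca.components_ ** 2           # sum_t [P_s]_it^2 (sigma_t)^2
    return z_mean, z_var
```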

Appendix B. Review of GP-Based Classification Sc Metamodel

This appendix reviews the GP-based Sc metamodel that is utilized in the case study. This metamodel is formulated [23] using as input matrix X and as output matrix Y, and is established through the following steps:
  • Step 1: Dimensionality reduction and projection to continuous output. To address the high dimensionality of the output y and accommodate a projection to a continuous space, logistic principal component analysis (LPCA) [60] is used as a dimensionality reduction technique [23]. LPCA is performed by maximizing the likelihood of the observations Y given the compact representation for the natural parameter (log-odds) vector $\theta$ of the underlying logistic function describing y [60]. It identifies a small number of $m_c$ latent outputs $\{\underline{\theta}_t;\ t = 1, \dots, m_c\}$, along with the projection matrix $P_c$, the bias $\Delta_t$ for each component, and the latent observation vector $\underline{\Theta}_t \in \mathbb{R}^n$, whose hth row corresponds to the latent output for the hth storm. The value of $m_c$ can be selected based on parametric sensitivity analysis to avoid overfitting [23]. The latent outputs describing the natural parameters correspond to a continuous variable, even though the original data corresponded to a binary variable. This facilitates the use of a regression surrogate model for predicting $\underline{\theta}_t$ in Step 2.
  • Step 2: Principal component GP calibration. For each of the $m_c$ principal components, a separate GP is developed using the input–output observation pair X–$\underline{\Theta}_t$. This is accomplished through the formulation reviewed in Appendix C for definitions $u = x$, $v = \underline{\theta}_t$, $U = X$, and $V = \underline{\Theta}_t$, with $n_{GP} = n$ and $n_u = n_x$. The GP provides the predictive mean $\tilde{\underline{\theta}}_t(x)$ given by Equation (A6). Note that the dimensionality reduction established in Step 1 accommodates a computationally efficient GP calibration separately for each principal component, since $m_c$ is typically very small (relative to both n and, especially, $n_z$).
  • Step 3: Classification predictions. Combining the predictions for all principal components, the metamodel approximation for the inundation state for the ith node is first established by obtaining the natural parameter approximation for that node through the linear transformation:
$$\tilde{\theta}_i(x) = \Delta_i + \sum_{t=1}^{m_c} [P_c]_{it}\, \tilde{\underline{\theta}}_t(x),$$
and then utilizing the logistic function:
$$p_i^c(x) = \frac{1}{1 + e^{-\tilde{\theta}_i(x)}}.$$
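Step 3 above is simple to express in code. The sketch below takes latent natural-parameter predictions (from the per-component GPs) and applies the recombination of Equations (A4) and (A5); the LPCA fitting itself (Step 1) is omitted, and all array names are illustrative:

```python
import numpy as np

def inundation_probability(theta_latent, P_c, delta):
    """Recombine latent natural-parameter predictions into per-node
    inundation probabilities: theta_i = Delta_i + sum_t [P_c]_it theta_t,
    followed by the logistic map p_i = 1 / (1 + exp(-theta_i)).
    theta_latent: (n_pred, m_c); P_c: (n_nodes, m_c); delta: (n_nodes,)."""
    theta = delta + theta_latent @ P_c.T         # (n_pred, n_nodes)
    return 1.0 / (1.0 + np.exp(-theta))
```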

Appendix C. Review of GP Formulation

This appendix reviews the GP implementation for a generic input $u \in \mathbb{R}^{n_u}$ and scalar output v. For appropriate selection of input and output, this formulation accommodates the different GP variants discussed in this paper. For the metamodel development, we assume that a training set with $n_{GP}$ data points is available, with superscript o utilized to distinguish the input $u^o$ and output $v^o$ for the oth data point, and corresponding input matrix $U \in \mathbb{R}^{n_{GP} \times n_u}$ [each observation represents a different row] and output matrix (vector in this instance) $V \in \mathbb{R}^{n_{GP}}$.
The fundamental building blocks for the GP are the $n_b$-dimensional basis vector f(u) and the correlation function between inputs $R(u, u'|\gamma)$, with $\gamma$ denoting the hyper-parameter vector that needs to be calibrated. Selection of these functions is discussed in the main manuscript. Let $F = [f(u^1) \cdots f(u^{n_{GP}})]^T \in \mathbb{R}^{n_{GP} \times n_b}$ denote the basis matrix over the database U, $r(u) = [R(u, u^1|\gamma) \cdots R(u, u^{n_{GP}}|\gamma)]^T \in \mathbb{R}^{n_{GP}}$ the correlation vector between u and each of the elements of U, and $R \in \mathbb{R}^{n_{GP} \times n_{GP}}$ the correlation matrix over the database U, with its ojth element defined as $R(u^o, u^j|\gamma)$; $o, j = 1, \dots, n_{GP}$. To improve the surrogate model's numerical stability, or even its accuracy when fitting noisy data [40], a nugget is included in the formulation of the correlation matrix, $\underline{R} = R + \delta I$, with $\delta$ denoting the nugget value and I the identity matrix. The GP predictive mean is given by the following [40,61]:
$$\tilde{v}(u) = f(u)^T \beta^* + r(u)^T \underline{R}^{-1} (V - F \beta^*)$$
where $\beta^* = (F^T \underline{R}^{-1} F)^{-1} F^T \underline{R}^{-1} V$ is the weighted least squares regression solution. The probabilistic characteristics of the GP are quantified through the GP predictive variance, given by the following [40,61]:
$$(\sigma^v(u))^2 = \tilde{\sigma}_v^2 \left[ 1 + \gamma(u)^T \left\{ F^T \underline{R}^{-1} F \right\}^{-1} \gamma(u) - r(u)^T \underline{R}^{-1} r(u) \right]$$
where $\gamma(u) = F^T \underline{R}^{-1} r(u) - f(u)$ and the process variance is as follows:
$$\tilde{\sigma}_v^2 = \frac{(V - F \beta^*)^T \underline{R}^{-1} (V - F \beta^*)}{n_{GP}}.$$
The calibration of the GP pertains to the selection of the hyper-parameters $[\gamma\ \delta]$, and can be performed using maximum likelihood estimation (MLE), leading to the following [40,61]:
$$[\gamma\ \delta]^* = \arg\min_{[\gamma\ \delta]} \left\{ \ln(\det(\underline{R})) + n_{GP} \ln \tilde{\sigma}_v^2 \right\}$$
where det(·) stands for the determinant of a matrix. The optimization of Equation (A9) is well known to have multiple local minima [62] and non-smooth characteristics for small values of $\delta$ [63]. To address these challenges, all numerical optimizations performed in this study use a pattern-search optimization algorithm [64].
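As a concrete illustration, the predictive equations above can be implemented in a few lines of NumPy/SciPy. The sketch below computes the predictive mean and variance for fixed hyper-parameters $\gamma$ and $\delta$; the MLE calibration of Equation (A9) would simply wrap this in an optimizer and is omitted. All function and variable names here are illustrative, not tied to any particular code base.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_fit_predict(U, V, u_new, corr, f_basis, gamma, delta=1e-8):
    """Predictive mean and variance of the GP formulation above.

    U: (n_gp, n_u) training inputs; V: (n_gp,) training outputs;
    corr(a, b, gamma): correlation function R(a, b | gamma);
    f_basis(u): basis vector f(u); delta: nugget value."""
    n_gp = U.shape[0]
    # Nugget-regularized correlation matrix: R_bar = R + delta * I
    R_bar = np.array([[corr(U[o], U[j], gamma) for j in range(n_gp)]
                      for o in range(n_gp)]) + delta * np.eye(n_gp)
    F = np.array([f_basis(u) for u in U])               # (n_gp, n_b)
    cR = cho_factor(R_bar)                              # Cholesky of R_bar
    RiF = cho_solve(cR, F)
    # Weighted least-squares coefficients beta*
    beta = np.linalg.solve(F.T @ RiF, F.T @ cho_solve(cR, V))
    resid = V - F @ beta
    Ri_resid = cho_solve(cR, resid)
    sigma2 = resid @ Ri_resid / n_gp                    # process variance
    # Predictive mean and variance at u_new
    r = np.array([corr(u_new, U[o], gamma) for o in range(n_gp)])
    f = f_basis(u_new)
    mean = f @ beta + r @ Ri_resid
    g = F.T @ cho_solve(cR, r) - f                      # gamma(u) of Eq. (A7)
    var = sigma2 * (1.0 + g @ np.linalg.solve(F.T @ RiF, g)
                    - r @ cho_solve(cR, r))
    return float(mean), float(var)
```

A convenient sanity check: at a training point and for a vanishing nugget, the predictive mean reproduces the training value and the predictive variance collapses to (essentially) zero, reflecting the interpolation property of the GP.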

Appendix D. Weighted k-Nearest Neighbor (kNN) Calibration

This appendix briefly reviews the kNN interpolation calibration [23]. Let $A_{wt}$, with $n_{wt}$ nodes, denote a subset of the always-wet nodes that the calibration is based upon. $A_{wt}$ may be chosen identical to the set of always-wet nodes, denoted $A_{wf}$ herein, though it is recommended to include only nodes corresponding to smaller depths, so that the calibration is based on predictions for nearshore nodes only. The kNN surge estimates for each node in $A_{wt}$ and for each storm are obtained using Equation (12), considering the node's respective neighbors. Denoting by $\tilde{z}_i^h$ the prediction for the $i$th node and the $h$th storm, the calibration is finally expressed through the optimization of the hyper-parameters [23]:
$$[k, d, \psi_l, \psi_e]^* = \arg\min \sum_{h=1}^{n} \sum_{i \in A_{wt}} \left| z_i^h - \tilde{z}_i^h \right|, \qquad 1 \le k \le k^{max}, \quad 0 < d \le d^{max}, \quad 0 < \psi_l \le \psi_l^{max}, \quad \psi_e^{min} \le \psi_e \le \psi_e^{max}$$
with appropriate box-bounded constraints for the minimum (superscript min) and maximum (superscript max) values of each of the hyper-parameters (note that $k_c$ from Section 5.1 has been replaced by $k^{max}$ here). Numerical details for efficiently performing this calibration are discussed in [23].
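The structure of this calibration can be sketched as follows. Since Equation (12) appears in the main text rather than in this excerpt, the sketch substitutes a generic inverse-distance weighting with exponent `p` as a stand-in for the actual weighting (which also involves $\psi_l$ and $\psi_e$, omitted here), and replaces the optimization of Equation (A10) with a plain grid search over leave-one-out predictions on the calibration set $A_{wt}$. All names are illustrative.

```python
import numpy as np
from itertools import product

def knn_estimate(coords_query, coords_wet, z_wet, k, p):
    """Stand-in weighted kNN spatial interpolation: inverse-distance
    weights with exponent p over the k nearest wet neighbors."""
    est = np.empty(len(coords_query))
    for i, c in enumerate(coords_query):
        d = np.linalg.norm(coords_wet - c, axis=1)
        nn = np.argsort(d)[:k]                      # k nearest neighbors
        w = 1.0 / np.maximum(d[nn], 1e-12) ** p     # guard zero distances
        est[i] = w @ z_wet[nn] / w.sum()
    return est

def calibrate(coords_t, z_t, k_max, p_grid):
    """Grid-search analogue of Equation (A10): minimize the summed absolute
    error of leave-one-out kNN predictions over the calibration set."""
    best_err, best_hp = np.inf, None
    for k, p in product(range(1, k_max + 1), p_grid):
        err = 0.0
        for i in range(len(coords_t)):
            mask = np.arange(len(coords_t)) != i    # leave node i out
            zi = knn_estimate(coords_t[i:i + 1], coords_t[mask],
                              z_t[mask], k, p)
            err += abs(z_t[i] - zi[0])
        if err < best_err:
            best_err, best_hp = err, (k, p)
    return best_hp
```

In practice the exhaustive grid search would be replaced by the efficient scheme discussed in [23]; the sketch only illustrates the leave-one-out objective driving the hyper-parameter selection.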

Appendix E. Adaptive Taper Selection Using Inducing Points

This appendix reviews the M-IAT algorithm [30] for the estimation of the taper ranges per node $\{\varphi_i; i = 1, \ldots, n_z\}$ to achieve the desired local sparsity $\{\rho_i; i = 1, \ldots, n_z\}$ within adaptive covariance tapering, with the taper function selection given by Equation (17). To achieve computational efficiency, a set of inducing points is leveraged to guide the selection, whereas an iterative optimization is adopted to select the taper values.
To formalize the implementation, define the distance matrix $D \in \mathbb{R}^{n_z \times n_z}$ with $ij$th element $[D]_{ij} = d_{ij}$ and the taper range matrix $C \in \mathbb{R}^{n_z \times n_z}$ with elements $[C]_{ij} = (\varphi_i + \varphi_j)/2$. Herein, the notation $[A]_{ij}$ indicates the element located in the $i$th row and $j$th column of matrix $A$, with an asterisk for either $i$ or $j$ indicating the selection of all rows or columns, respectively. Also, let $n_t^i = \rho_i n_z$ define the target local sparsity for the $i$th node based on the number of neighbors. Let $i_d \in \mathbb{N}^{n_d}$ denote the set of utilized inducing points, where $n_d$ is the number of inducing points in the set. Define the matrices $\bar{D} = [D]_{i_d *} \in \mathbb{R}^{n_d \times n_z}$ and $\bar{C} = [C]_{i_d *} \in \mathbb{R}^{n_d \times n_z}$ corresponding to the subsets of $D$ and $C$, respectively, for the rows pertaining to the inducing points only. The full vector of taper ranges and the corresponding subset for the inducing points are denoted $\varphi \in \mathbb{R}^{n_z}$ and $\bar{\varphi} = [\varphi]_{i_d} \in \mathbb{R}^{n_d}$, respectively. Since the M-IAT algorithm adjusts only $\bar{\varphi}$, a projection mapping $\varphi = g(\bar{\varphi})$ is established (for example, using scattered interpolation based on Delaunay triangulation) to estimate the remaining taper ranges from $\bar{\varphi}$.
To accommodate the iterative implementation, the superscript $(k)$ is used in this appendix to distinguish quantities pertaining to the $k$th iteration of the M-IAT algorithm. The iterative formulation is established as follows. First, set the variables defining convergence: the number of allowable iterations $k_{max}$, the tolerance threshold $n_{tol}$ for the maximum discrepancy, the tolerance $N_{tol}$ for the number of nodes with considerable discrepancy, and the value $n_{ds}$ defining what constitutes a considerable discrepancy. Next, initialize the vector of taper ranges $\bar{\varphi}^{(0)} = \{\bar{\varphi}_i^{(0)}; i = 1, \ldots, n_d\}$, select the mapping $\varphi = g(\bar{\varphi})$, and estimate $\varphi^{(0)} = g(\bar{\varphi}^{(0)})$. After the initialization, calculate $\bar{C}^{(0)}$ and the initial number of non-zero elements for each inducing point $\{n_{nz}^{i(0)}; i \in i_d\}$ by identifying the entries satisfying $[\bar{D}]_{i*} - [\bar{C}^{(0)}]_{i*} < 0$. Note that updating $\varphi = g(\bar{\varphi})$ has a non-negligible computational burden for large $n_z$ values; for this reason it is not repeated at every iteration of M-IAT, but rather performed every $n_{sc}$ iterations. At the $k$th iteration of the M-IAT algorithm, perform the following steps:
  • Step 1: Check the total number of neighbors. Evaluate the appropriate correction to obtain the greatest benefit for the overall number of non-zero elements. For the inducing points, perform the following check:
    If $\sum_{i \in i_d} n_{nz}^{i(k-1)} > \sum_{i \in i_d} n_t^i$, proceed to Step 2a. If $\sum_{i \in i_d} n_{nz}^{i(k-1)} < \sum_{i \in i_d} n_t^i$, proceed to Step 2b. If $\sum_{i \in i_d} n_{nz}^{i(k-1)} = \sum_{i \in i_d} n_t^i$, then if $\max_{i \in i_d} \left( \left| n_{nz}^{i(k-1)} - n_t^i \right| \right) = \max_{i \in i_d} \left( n_{nz}^{i(k-1)} - n_t^i \right)$, proceed to Step 2a; else proceed to Step 2b.
  • Step 2: Adjust the most influential taper, depending on the outcome of Equation (A11).
Step 2a: Adjust the most influential taper when the total number of non-zeros is too large. Select the inducing point that corresponds to the maximum of $\{n_{nz}^{i(k-1)} - n_t^i; i \in i_d\}$. If multiple inducing points satisfy this criterion, select one of them at random. Let $m$ denote the chosen inducing point.
Step 2b: Adjust the most influential taper when the total number of non-zeros is too small. Select the inducing point that corresponds to the minimum of $\{n_{nz}^{i(k-1)} - n_t^i; i \in i_d\}$. If multiple inducing points satisfy this criterion, select one of them at random. Let $m$ denote the chosen inducing point.
  • Step 3: Taper range adjustment. Compute the taper range adjustment for the $m$th node so that $n_{nz}^{m(k)} = n_t^m$. This is achieved by ordering the elements of $[\bar{D}]_{m*} - [\bar{C}^{(k-1)}]_{m*}$ and then randomly choosing a number, denoted $\varphi_{adj}$, between the $n_t^m$th and $(n_t^m + 1)$th ordered values. Finally, update the $m$th taper range as $\bar{\varphi}_m^{(k)} = \bar{\varphi}_m^{(k-1)} + 2 \varphi_{adj}$.
  • Step 4: Update the taper vector and taper thresholds. The new taper range vector $\bar{\varphi}^{(k)}$ is obtained by replacing $\bar{\varphi}_m^{(k-1)}$ with $\bar{\varphi}_m^{(k)}$. If $k - k_{last} = n_{sc}$, proceed to Step 5 and update $k_{last} \leftarrow k$. Otherwise, update the taper threshold matrix based on the new $\bar{\varphi}_m^{(k)}$ by setting $[\bar{C}^{(k)}]_{m*} = [\bar{C}^{(k-1)}]_{m*} + \varphi_{adj}$ (updating the $m$th row) and $[\bar{C}^{(k)}]_{* i_d(m)} = [\bar{C}^{(k-1)}]_{* i_d(m)} + \varphi_{adj}$ (updating the $i_d(m)$th column), and proceed to the next iteration by returning to Step 1.
  • Step 5: Use interpolation to update the entire vector. Update the entire taper vector $\varphi^{(k)}$ based on the projection mapping $\varphi^{(k)} = g(\bar{\varphi}^{(k)})$. This provides the taper ranges for the remaining (i.e., non-inducing) points based on the taper ranges for the inducing points.
  • Step 6: Global update of the taper thresholds and number of non-zeros. Using the new vector $\varphi^{(k)}$, calculate $\bar{C}^{(k)}$ and the new number of non-zero elements for each inducing point $\{n_{nz}^{i(k)}; i \in i_d\}$ by identifying the entries satisfying $[\bar{D}]_{i*} - [\bar{C}^{(k)}]_{i*} < 0$.
  • Step 7: Assess the stopping criteria. If $k = k_{max}$, set $\varphi = \varphi^{(k)}$ and stop iterating, since the allowable computational burden has been exceeded. If $k < k_{max}$, examine the convergence criteria. For the inducing points, examine the following conditions:
    $$\max_{i \in i_d} \left( \left| n_{nz}^{i(k)} - n_t^i \right| \right) < n_{tol} \quad \text{and} \quad \sum_{i \in i_d} I \left[ \left| n_{nz}^{i(k)} - n_t^i \right| > n_{ds} \right] < N_{tol}.$$
    If both conditions are satisfied, the algorithm has converged and the iterations may stop (setting $\varphi = \varphi^{(k)}$). Otherwise, set $k \leftarrow k + 1$ and proceed to the next iteration by returning to Step 1.
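The iteration above can be condensed into a short sketch for small problems. To keep it self-contained, every node is treated as an inducing point (so the projection $g(\cdot)$ is the identity and Steps 5–6 become implicit), the random draw of $\varphi_{adj}$ in Step 3 is replaced by the interval midpoint, and the tolerance-based stopping rule is replaced by an exact-count check. These are simplifying assumptions for illustration, not the implementation of [30].

```python
import numpy as np

def miat_all_nodes(D, n_t, phi0, k_max=500):
    """Condensed M-IAT sketch (all nodes as inducing points).

    D: (n, n) distance matrix; n_t: target neighbor counts per node;
    phi0: initial taper ranges."""
    n = D.shape[0]
    phi = phi0.astype(float).copy()
    for _ in range(k_max):
        C = 0.5 * (phi[:, None] + phi[None, :])    # [C]_ij = (phi_i + phi_j)/2
        diff = np.sum(D - C < 0.0, axis=1) - n_t   # n_nz - n_t per node
        if np.all(diff == 0):
            break                                  # all targets met exactly
        s = diff.sum()
        # Steps 1-2: pick the node with the worst discrepancy, favoring the
        # excess side (Step 2a) when the total count is too high, and the
        # deficit side (Step 2b) otherwise
        if s > 0 or (s == 0 and diff.max() >= -diff.min()):
            m = int(np.argmax(diff))
        else:
            m = int(np.argmin(diff))
        # Step 3: shift the m-th threshold so exactly n_t[m] gaps are negative
        gaps = np.sort(D[m] - C[m])
        lo = gaps[n_t[m] - 1]
        hi = gaps[n_t[m]] if n_t[m] < n else gaps[-1] + 1.0
        phi_adj = 0.5 * (lo + hi)                  # midpoint of feasible interval
        phi[m] += 2.0 * phi_adj                    # since [C]_mj = (phi_m+phi_j)/2
    return phi
```

Note that adjusting $\varphi_m$ perturbs the thresholds of every other node through the shared averages $[C]_{ij}$, which is precisely why the procedure must iterate (and why the full algorithm relies on the tolerances $n_{tol}$, $N_{tol}$, and $n_{ds}$ rather than exact matching).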

Appendix F. Iterative Identification of Inducing Points

This appendix reviews the IIP algorithm [30] for the iterative identification of inducing points. To formalize the implementation, let $S_d^{[l]}$ denote the set of inducing points at the $l$th iteration of the IIP algorithm. The superscript $[l]$ (in brackets) is used to distinguish quantities pertaining to the $l$th iteration of the IIP algorithm. Use a modest number $n_d^{[0]}$ of nodes to define an initial set $S_d^{[0]}$; for example, $S_d^{[0]}$ may be obtained via k-means clustering. The iterative implementation proceeds as follows. First, select the maximum number of allowable inducing points $n_d^{max}$, the number of inducing points to be added in each iteration $n_d^{add}$, the maximum allowable discrepancy $n_{pr}$, and the (relative) weight $c_n$ used to prioritize nodes with too few non-zero elements. At the $l$th iteration of the IIP algorithm, perform the following steps:
  • Step 1: Perform M-IAT. Perform M-IAT using $S_d^{[l]}$ as the current set of inducing points. If $l > 1$, the taper range vector $\varphi^{[l]}$ used within M-IAT may be initialized based on the taper range values identified in the previous iteration of IIP.
  • Step 2: Check convergence. Estimate the number of non-zero elements $\{n_{nz}^{i[l]}; i = 1, \ldots, n_z\}$ for all nodes in the domain by identifying the entries satisfying $[D]_{i*} - [C^{[l]}]_{i*} < 0$. If $\max_i \left( \left| n_{nz}^{i[l]} - n_t^i \right| \right) < n_{pr}$ or $n_d^{[l]} > n_d^{max}$ is satisfied, convergence has been achieved and $S_d^{[l]}$ is the final set of inducing points. Otherwise, proceed to Step 3.
  • Step 3: Select new inducing points. The remaining $n_z - n_d^{[l]}$ nodes (i.e., the non-inducing points) are separated into two groups: (a) those with sufficiently too many non-zero elements (i.e., $n_{nz}^{i[l]} - n_t^i > n_z^{+[l]}$); and (b) those with sufficiently too few non-zero elements (i.e., $n_t^i - n_{nz}^{i[l]} > n_z^{-[l]}$). These groups are denoted $G_+^{[l]}$ and $G_-^{[l]}$ and contain $n_{G+}$ and $n_{G-}$ nodes, respectively. Perform clustering separately within each group ($G_+^{[l]}$ and $G_-^{[l]}$), selecting the number of clusters within each group to be proportional to the total number of nodes in the group while incorporating the weight $c_n$. This yields $n_+ = \mathrm{round}[n_{G+}/(n_{G+} + c_n n_{G-}) \cdot n_d^{add}]$ clusters for $G_+^{[l]}$ and $n_- = n_d^{add} - n_+$ clusters for $G_-^{[l]}$. The $n_{G+}$ nodes within $G_+^{[l]}$ are clustered into $n_+$ clusters based on spatial location, and the worst-performing node (i.e., the node with the largest discrepancy $|n_{nz}^{i[l]} - n_t^i|$) within each cluster is selected as the representative point to be added to the existing set of inducing points; denote these nodes $S_{d+}^{[l]}$. Similarly, the $n_{G-}$ nodes within $G_-^{[l]}$ are clustered into $n_-$ clusters based on spatial location, and the worst-performing node within each cluster is selected as the representative point; denote these nodes $S_{d-}^{[l]}$. To obtain the updated set of inducing points, augment the existing set such that $S_d^{[l+1]} = \{S_d^{[l]}, S_{d+}^{[l]}, S_{d-}^{[l]}\}$. The new set of inducing points contains $n_d^{[l+1]}$ nodes. Finally, set $l \leftarrow l + 1$ and proceed to the next iteration by returning to Step 1.
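The group-splitting and cluster-budget allocation of Step 3 can be sketched compactly. The fragment below only covers the splitting and the $n_+$/$n_-$ allocation (the spatial clustering and worst-node selection within each cluster are noted in comments); the thresholds and names are illustrative.

```python
import numpy as np

def split_and_allocate(n_nz, n_t, thresh_plus, thresh_minus, n_add, c_n):
    """Sketch of the IIP grouping step: split the non-inducing nodes into
    'too many' / 'too few' groups and allocate the n_add new inducing
    points between them, weighting the deficit group by c_n.

    After this allocation, each group would be clustered spatially into its
    budgeted number of clusters, and the node with the largest discrepancy
    |n_nz - n_t| in each cluster would be promoted to an inducing point."""
    diff = n_nz - n_t
    g_plus = np.where(diff > thresh_plus)[0]      # too many non-zeros
    g_minus = np.where(-diff > thresh_minus)[0]   # too few non-zeros
    n_gp, n_gm = len(g_plus), len(g_minus)
    if n_gp + n_gm == 0:
        return g_plus, g_minus, 0, 0
    # n_+ = round[n_G+ / (n_G+ + c_n * n_G-) * n_d_add], n_- = n_d_add - n_+
    n_plus = int(round(n_gp / (n_gp + c_n * n_gm) * n_add))
    n_minus = n_add - n_plus
    return g_plus, g_minus, n_plus, n_minus
```

Setting $c_n > 1$ shifts the budget toward the deficit group, matching the stated role of $c_n$ as a weight prioritizing nodes with too few non-zero elements.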

Appendix G. Performance of the Regression Surrogate Model Developed Based on the Pseudo-Surge Database Without Correction

This appendix presents the predictive performance of the Ss surrogate model, developed using the pseudo-surge database imputed by different imputation variants. Table A1, Table A2, Table A3 and Table A4 provide results for the entire domain (Df), mimicking Table 5, Table 6, Table 7 and Table 8 in Section 7.3, respectively, while Table A5, Table A6, Table A7 and Table A8 show results for the subdomain (Ds), mimicking Table 9, Table 10, Table 11 and Table 12 in Section 7.4, respectively.
Table A1. Surge score $\overline{SC}$ (cm) for different node groups for Df domain for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | PPCA | kNN-PPCA | kNNrc-PPCA |
|---|---|---|---|---|---|
| All nodes | 13.39 | 13.90 | 9.25 | 7.91 | 7.98 |
| Once-dry | 19.08 | 20.28 | 8.93 | 6.08 | 6.22 |
| Surge gap > 0.25 m | 59.78 | 72.37 | 5.39 | 4.11 | 4.90 |
| Surge gap > 0.5 m | 70.61 | 88.33 | 4.68 | 3.64 | 4.85 |
| Surge gap > 0.75 m | 87.09 | 106.73 | 4.72 | 3.73 | 5.30 |
| Surge gap > 1 m | 109.14 | 122.55 | 5.26 | 4.21 | 5.82 |
| Surge gap > 1.5 m | 139.46 | 142.92 | 7.95 | 6.43 | 7.67 |
| Node group C2 | 38.95 | 36.06 | 9.03 | 7.88 | 7.73 |
| Node group C3 | 5.04 | 5.16 | 7.30 | 4.82 | 4.88 |
Table A2. Misclassification $\overline{MC}$ (%) for different node groups from Df for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | PPCA | kNN-PPCA | kNNrc-PPCA |
|---|---|---|---|---|---|
| All nodes | 7.19 | 7.68 | 10.82 | 3.70 | 4.17 |
| Once-dry | 17.22 | 18.38 | 25.99 | 8.79 | 9.91 |
| Surge gap > 0.25 m | 46.94 | 47.91 | 19.90 | 11.68 | 15.86 |
| Surge gap > 0.5 m | 53.76 | 55.55 | 18.12 | 11.34 | 17.78 |
| Surge gap > 0.75 m | 63.20 | 65.39 | 17.94 | 12.09 | 20.34 |
| Surge gap > 1 m | 71.31 | 74.12 | 18.81 | 12.99 | 21.25 |
| Surge gap > 1.5 m | 72.73 | 78.47 | 23.42 | 15.48 | 21.87 |
| Node group C2 | 35.15 | 32.99 | 27.33 | 15.15 | 16.33 |
| Node group C3 | 4.55 | 4.38 | 2.97 | 4.36 | 4.23 |
Table A3. False positive misclassification $\overline{MC}^+$ (%) for different node groups from Df for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | PPCA | kNN-PPCA | kNNrc-PPCA |
|---|---|---|---|---|---|
| All nodes | 6.46 | 7.00 | 10.36 | 2.96 | 3.47 |
| Once-dry | 15.61 | 16.90 | 25.02 | 7.14 | 8.39 |
| Surge gap > 0.25 m | 46.79 | 47.81 | 19.85 | 11.46 | 15.71 |
| Surge gap > 0.5 m | 53.65 | 55.49 | 18.06 | 11.16 | 17.67 |
| Surge gap > 0.75 m | 63.12 | 65.34 | 17.87 | 11.93 | 20.23 |
| Surge gap > 1 m | 71.26 | 74.09 | 18.73 | 12.85 | 21.12 |
| Surge gap > 1.5 m | 72.73 | 78.47 | 23.33 | 15.37 | 21.74 |
| Node group C2 | 34.28 | 32.18 | 26.40 | 14.18 | 15.41 |
| Node group C3 | 2.41 | 2.26 | 1.36 | 2.25 | 2.16 |
Table A4. False negative misclassification $\overline{MC}^-$ (%) for different node groups from Df for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | PPCA | kNN-PPCA | kNNrc-PPCA |
|---|---|---|---|---|---|
| All nodes | 0.73 | 0.68 | 0.46 | 0.74 | 0.70 |
| Once-dry | 1.61 | 1.48 | 0.97 | 1.65 | 1.53 |
| Surge gap > 0.25 m | 0.15 | 0.10 | 0.06 | 0.22 | 0.15 |
| Surge gap > 0.5 m | 0.11 | 0.06 | 0.06 | 0.18 | 0.11 |
| Surge gap > 0.75 m | 0.08 | 0.05 | 0.07 | 0.16 | 0.12 |
| Surge gap > 1 m | 0.05 | 0.03 | 0.08 | 0.14 | 0.13 |
| Surge gap > 1.5 m | 0.01 | 0.01 | 0.09 | 0.11 | 0.13 |
| Node group C2 | 0.87 | 0.81 | 0.93 | 0.97 | 0.91 |
| Node group C3 | 2.14 | 2.13 | 1.61 | 2.12 | 2.07 |
Table A5. Surge score $\overline{SC}$ (cm) for different node groups from Ds for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | GPI | PPCA | kNN-PPCA | kNNrc-PPCA | GPI-PPCA |
|---|---|---|---|---|---|---|---|
| All nodes | 26.31 | 31.17 | 29.54 | 11.11 | 9.96 | 10.13 | 9.91 |
| Once-dry | 48.82 | 61.01 | 57.15 | 9.78 | 7.22 | 7.54 | 7.43 |
| Surge gap > 0.25 m | 108.97 | 151.33 | 129.24 | 4.73 | 3.75 | 5.04 | 3.92 |
| Surge gap > 0.5 m | 118.19 | 164.75 | 140.34 | 3.96 | 3.21 | 4.78 | 3.37 |
| Surge gap > 0.75 m | 131.17 | 177.32 | 153.55 | 3.62 | 2.98 | 4.76 | 3.11 |
| Surge gap > 1 m | 145.46 | 178.84 | 165.69 | 3.73 | 3.01 | 4.60 | 3.17 |
| Surge gap > 1.5 m | 177.34 | 190.90 | 195.68 | 5.88 | 4.66 | 5.42 | 5.09 |
| Node group C2 | 87.01 | 104.95 | 84.83 | 10.28 | 7.86 | 8.45 | 8.22 |
| Node group C3 | 6.93 | 6.87 | 6.64 | 6.49 | 6.60 | 6.55 | 6.27 |
Table A6. Misclassification $\overline{MC}$ (%) for different node groups from Ds for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | GPI | PPCA | kNN-PPCA | kNNrc-PPCA | GPI-PPCA |
|---|---|---|---|---|---|---|---|
| All nodes | 13.83 | 13.55 | 15.01 | 9.49 | 3.77 | 4.77 | 4.40 |
| Once-dry | 34.73 | 34.03 | 37.68 | 23.81 | 9.41 | 11.90 | 11.00 |
| Surge gap > 0.25 m | 76.92 | 78.04 | 79.88 | 15.58 | 11.29 | 19.14 | 11.19 |
| Surge gap > 0.5 m | 83.44 | 85.15 | 86.48 | 13.86 | 10.87 | 20.20 | 10.33 |
| Surge gap > 0.75 m | 87.79 | 89.20 | 89.31 | 12.81 | 10.76 | 20.51 | 9.81 |
| Surge gap > 1 m | 88.66 | 89.75 | 89.41 | 13.34 | 11.05 | 18.59 | 9.94 |
| Surge gap > 1.5 m | 87.58 | 88.15 | 88.15 | 18.53 | 14.16 | 15.26 | 13.74 |
| Node group C2 | 62.61 | 58.46 | 56.41 | 27.14 | 15.18 | 19.35 | 16.27 |
| Node group C3 | 4.14 | 3.93 | 3.51 | 2.00 | 3.83 | 3.70 | 3.32 |
Table A7. False positive misclassification $\overline{MC}^+$ (%) for different node groups from Ds for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | GPI | PPCA | kNN-PPCA | kNNrc-PPCA | GPI-PPCA |
|---|---|---|---|---|---|---|---|
| All nodes | 13.32 | 13.09 | 14.60 | 9.29 | 3.25 | 4.29 | 3.97 |
| Once-dry | 33.52 | 32.95 | 36.75 | 23.39 | 8.19 | 10.79 | 10.00 |
| Surge gap > 0.25 m | 76.79 | 77.95 | 79.84 | 15.50 | 11.08 | 18.99 | 11.07 |
| Surge gap > 0.5 m | 83.35 | 85.08 | 86.47 | 13.78 | 10.69 | 20.07 | 10.23 |
| Surge gap > 0.75 m | 87.74 | 89.15 | 89.30 | 12.72 | 10.61 | 20.38 | 9.71 |
| Surge gap > 1 m | 88.63 | 89.73 | 89.41 | 13.25 | 10.91 | 18.46 | 9.85 |
| Surge gap > 1.5 m | 87.57 | 88.15 | 88.15 | 18.45 | 14.04 | 15.13 | 13.66 |
| Node group C2 | 62.14 | 58.02 | 56.00 | 26.78 | 14.62 | 18.85 | 15.77 |
| Node group C3 | 2.13 | 2.07 | 1.63 | 1.15 | 1.97 | 1.93 | 1.60 |
Table A8. False negative misclassification $\overline{MC}^-$ (%) for different node groups from Ds for surrogate models (Ss only) based on pseudo-surge database imputed by different variants.

| Node group | kNN | kNNrc | GPI | PPCA | kNN-PPCA | kNNrc-PPCA | GPI-PPCA |
|---|---|---|---|---|---|---|---|
| All nodes | 0.51 | 0.46 | 0.41 | 0.20 | 0.52 | 0.48 | 0.43 |
| Once-dry | 1.21 | 1.08 | 0.93 | 0.42 | 1.22 | 1.11 | 1.00 |
| Surge gap > 0.25 m | 0.13 | 0.09 | 0.04 | 0.08 | 0.21 | 0.15 | 0.12 |
| Surge gap > 0.5 m | 0.09 | 0.07 | 0.01 | 0.08 | 0.18 | 0.13 | 0.11 |
| Surge gap > 0.75 m | 0.05 | 0.05 | 0.01 | 0.09 | 0.16 | 0.13 | 0.10 |
| Surge gap > 1 m | 0.03 | 0.03 | 0.00 | 0.09 | 0.14 | 0.13 | 0.10 |
| Surge gap > 1.5 m | 0.01 | 0.01 | 0.00 | 0.09 | 0.12 | 0.13 | 0.09 |
| Node group C2 | 0.48 | 0.45 | 0.41 | 0.36 | 0.56 | 0.50 | 0.51 |
| Node group C3 | 2.00 | 1.86 | 1.88 | 0.85 | 1.85 | 1.77 | 1.72 |

References

  1. Shepard, C.C.; Agostini, V.N.; Gilmer, B.; Allen, T.; Stone, J.; Brooks, W.; Beck, M.W. Assessing future risk: Quantifying the effects of sea level rise on storm surge risk for the southern shores of Long Island, New York. Nat. Hazards 2012, 60, 727–745. [Google Scholar] [CrossRef]
  2. Zachry, B.C.; Booth, W.J.; Rhome, J.R.; Sharon, T.M. A national view of storm surge risk and inundation. Weather Clim. Soc. 2015, 7, 109–117. [Google Scholar] [CrossRef]
  3. Lin, N.; Shullman, E. Dealing with hurricane surge flooding in a changing environment: Part I. Risk assessment considering storm climatology change, sea level rise, and coastal development. Stoch. Environ. Res. Risk Assess. 2017, 31, 2379–2400. [Google Scholar] [CrossRef]
  4. Resio, D.T.; Westerink, J.J. Modeling of the physics of storm surges. Phys. Today 2008, 61, 33–38. [Google Scholar] [CrossRef]
  5. Woodruff, J.; Dietrich, J.; Wirasaet, D.; Kennedy, A.; Bolster, D. Storm surge predictions from ocean to subgrid scales. Nat. Hazards 2023, 117, 2989–3019. [Google Scholar] [CrossRef]
  6. Cialone, M.A.; Grzegorzewski, A.S.; Mark, D.J.; Bryant, M.A.; Massey, T.C. Coastal-storm model development and water-level validation for the North Atlantic Coast Comprehensive Study. J. Waterw. Port Coast. Ocean Eng. 2017, 143, 04017031. [Google Scholar] [CrossRef]
  7. Hsu, C.-H.; Olivera, F.; Irish, J.L. A hurricane surge risk assessment framework using the joint probability method and surge response functions. Nat. Hazards 2018, 91, 7–28. [Google Scholar] [CrossRef]
  8. Jung, W.; Taflanidis, A.A.; Nadal-Caraballo, N.C.; Yawn, M.C.; Aucoin, L.A. Regional storm surge hazard quantification using Gaussian process metamodeling techniques. Nat. Hazards 2024, 120, 755–783. [Google Scholar] [CrossRef]
  9. Jung, W.; Taflanidis, A.A.; Kyprioti, A.P.; Zhang, J. Adaptive Multi-fidelity Monte Carlo for real-time probabilistic storm surge predictions. Reliab. Eng. Syst. Saf. 2024, 247, 109994. [Google Scholar] [CrossRef]
  10. Plumlee, M.; Asher, T.G.; Chang, W.; Bilskie, M.V. High-fidelity hurricane surge forecasting using emulation and sequential experiments. Ann. Appl. Stat. 2021, 15, 460–480. [Google Scholar] [CrossRef]
  11. Irish, J.; Resio, D.; Cialone, M. A surge response function approach to coastal hazard assessment. Part 2: Quantification of spatial attributes of response functions. Nat. Hazards 2009, 51, 183–205. [Google Scholar] [CrossRef]
  12. Jia, G.; Taflanidis, A.A. Kriging metamodeling for approximation of high-dimensional wave and surge responses in real-time storm/hurricane risk assessment. Comput. Methods Appl. Mech. Eng. 2013, 261-262, 24–38. [Google Scholar] [CrossRef]
  13. Kyprioti, A.P.; Taflanidis, A.A.; Nadal-Caraballo, N.C.; Campbell, M. Storm hazard analysis over extended geospatial grids utilizing surrogate models. Coast. Eng. 2021, 168, 103855. [Google Scholar] [CrossRef]
  14. Jiang, W.; Zhong, X.; Zhang, J. Surge-NF: Neural Fields inspired peak storm surge surrogate modeling with multi-task learning and positional encoding. Coast. Eng. 2024, 193, 104573. [Google Scholar] [CrossRef]
  15. Qin, Y.; Su, C.; Chu, D.; Zhang, J.; Song, J. A Review of Application of Machine Learning in Storm Surge Problems. J. Mar. Sci. Eng. 2023, 11, 1729. [Google Scholar] [CrossRef]
  16. Al Kajbaf, A.; Bensi, M. Application of surrogate models in estimation of storm surge: A comparative assessment. Appl. Soft Comput. 2020, 91, 106184. [Google Scholar] [CrossRef]
  17. Bass, B.; Bedient, P. Surrogate modeling of joint flood risk across coastal watersheds. J. Hydrol. 2018, 558, 159–173. [Google Scholar] [CrossRef]
  18. Kim, S.-W.; Lee, A.; Mun, J. A surrogate modeling for storm surge prediction using an artificial neural network. J. Coast. Res. 2018, 85, 866–870. [Google Scholar] [CrossRef]
  19. Pachev, B.; Arora, P.; del-Castillo-Negrete, C.; Valseth, E.; Dawson, C. A framework for flexible peak storm surge prediction. Coast. Eng. 2023, 186, 104406. [Google Scholar] [CrossRef]
  20. Gharehtoragh, M.A.; Johnson, D.R. Using surrogate modeling to predict storm surge on evolving landscapes under climate change. npj Nat. Hazards 2024, 1, 33. [Google Scholar] [CrossRef]
  21. Kijewski-Correa, T.; Taflanidis, A.; Vardeman, C.; Sweet, J.; Zhang, J.; Snaiki, R.; Wu, T.; Silver, Z.; Kennedy, A. Geospatial environments for hurricane risk assessment: Applications to situational awareness and resilience planning in New Jersey. Front. Built Environ. 2020, 6, 549106. [Google Scholar] [CrossRef]
  22. Kyprioti, A.P.; Taflanidis, A.A.; Nadal-Caraballo, N.C.; Yawn, M.C.; Aucoin, L.A. Integration of node classification in storm surge surrogate modeling. J. Mar. Sci. Eng. 2022, 10, 551. [Google Scholar] [CrossRef]
  23. Kyprioti, A.P.; Taflanidis, A.A.; Plumlee, M.; Asher, T.G.; Spiller, E.; Luettich, R.A.; Blanton, B.; Kijewski-Correa, T.L.; Kennedy, A.; Schmied, L. Improvements in storm surge surrogate modeling for synthetic storm parameterization, node condition classification and implementation to small size databases. Nat. Hazards 2021, 109, 1349–1386. [Google Scholar] [CrossRef]
  24. Shisler, M.P.; Johnson, D.R. Comparison of Methods for Imputing Non-Wetting Storm Surge to Improve Hazard Characterization. Water 2020, 12, 1420. [Google Scholar] [CrossRef]
  25. Adeli, E.; Zhang, J.; Taflanidis, A.A. Convolutional generative adversarial imputation networks for spatio-temporal missing data in storm surge simulations. arXiv 2021, arXiv:2111.02823. [Google Scholar] [CrossRef]
  26. Jung, W.; Taflanidis, A.A.; Nadal-Caraballo, N.C.; Aucoin, L.A.; Yawn, M.C. Advances in spatiotemporal storm surge emulation: Database imputation and multi-mode latent space projection. Coast. Eng. 2025, 201, 104762. [Google Scholar] [CrossRef]
  27. Yang, C.-H.; Wu, C.-H.; Hsieh, C.-M.; Wang, Y.-C.; Tsen, I.-F.; Tseng, S.-H. Deep learning for imputation and forecasting tidal level. IEEE J. Ocean. Eng. 2021, 46, 1261–1271. [Google Scholar] [CrossRef]
  28. Cressie, N. Statistics for Spatial Data; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  29. Yang, H.; Yang, J.; Han, L.D.; Liu, X.; Pu, L.; Chin, S.-M.; Hwang, H.-L. A Kriging based spatiotemporal approach for traffic volume data imputation. PLoS ONE 2018, 13, e0195957. [Google Scholar] [CrossRef]
  30. Irwin, C.; Taflanidis, A.A.; Nadal-Caraballo, N.C.; Aucoin, L.A.; Yawn, M.C. Adaptive covariance tapering for large datasets and application to spatial interpolation of storm surge. Coast. Eng. 2025, 201, 104768. [Google Scholar] [CrossRef]
  31. Choi, Y.-Y.; Shon, H.; Byon, Y.-J.; Kim, D.-K.; Kang, S. Enhanced application of principal component analysis in machine learning for imputation of missing traffic data. Appl. Sci. 2019, 9, 2149. [Google Scholar] [CrossRef]
  32. Kornelsen, K.; Coulibaly, P. Comparison of interpolation, statistical, and data-driven methods for imputation of missing values in a distributed soil moisture dataset. J. Hydrol. Eng. 2014, 19, 26–43. [Google Scholar] [CrossRef]
  33. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999, 61, 611–622. [Google Scholar] [CrossRef]
  34. Hegde, H.; Shimpi, N.; Panny, A.; Glurich, I.; Christie, P.; Acharya, A. MICE vs PPCA: Missing data imputation in healthcare. Inform. Med. Unlocked 2019, 17, 100275. [Google Scholar] [CrossRef]
  35. Sportisse, A.; Boyer, C.; Josse, J. Estimation and imputation in probabilistic principal component analysis with missing not at random data. Adv. Neural Inf. Process. Syst. 2020, 33, 7067–7077. [Google Scholar]
  36. Nadal-Caraballo, N.C.; Yawn, M.C.; Aucoin, L.A.; Carr, M.L.; Taflanidis, A.A.; Kyprioti, A.P.; Melby, J.A.; Ramos-Santiago, E.; Gonzalez, V.M.; Cobell, Z.; et al. Coastal Hazards System–Louisiana (CHS-LA); ERDC/CHL TR-22-16; US Army Engineer Research and Development Center: Vicksburg, MS, USA, 2022. [Google Scholar] [CrossRef]
  37. Nadal-Caraballo, N.C.; Campbell, M.O.; Gonzalez, V.M.; Torres, M.J.; Melby, J.A.; Taflanidis, A.A. Coastal Hazards System: A Probabilistic Coastal Hazard Analysis Framework. J. Coast. Res. 2020, 95, 1211–1216. [Google Scholar] [CrossRef]
  38. Luettich, R.A., Jr.; Westerink, J.J.; Scheffner, N.W. ADCIRC: An Advanced Three-Dimensional Circulation Model for Shelves, Coasts, and Estuaries. Report 1. Theory and Methodology of ADCIRC-2DDI and ADCIRC-3DL; Coastal Engineering Research Center: Vicksburg, MS, USA, 1992. [Google Scholar]
  39. Smith, J.M.; Sherlock, A.R.; Resio, D.T. STWAVE: Steady-State Spectral Wave Model User’s Manual for STWAVE, Version 3.0; Engineer Research and Development Center, Coastal and Hydraulics Laboratory: Vicksburg, MS, USA, 2001. [Google Scholar]
  40. Gramacy, R.B. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences; Chapman and Hall/CRC: Boca Raton, FL, USA, 2020. [Google Scholar]
  41. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  42. Räty, M.; Kangas, A. Comparison of k-MSN and kriging in local prediction. For. Ecol. Manag. 2012, 263, 47–56. [Google Scholar] [CrossRef]
  43. Lesot, M.-J.; Rifqi, M.; Benhadda, H. Similarity measures for binary and numerical data: A survey. Int. J. Knowl. Eng. Soft Data Paradig. 2009, 1, 63–84. [Google Scholar] [CrossRef]
  44. Matérn, B. Spatial Variation; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 36. [Google Scholar]
  45. Cressie, N.; Johannesson, G. Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol. 2008, 70, 209–226. [Google Scholar] [CrossRef]
  46. Bolin, D.; Wallin, J. Spatially adaptive covariance tapering. Spat. Stat. 2016, 18, 163–178. [Google Scholar] [CrossRef]
  47. Katzfuss, M.; Guinness, J. A general framework for Vecchia approximations of Gaussian processes. Stat. Sci. 2021, 36, 124–141. [Google Scholar] [CrossRef]
  48. Furrer, R.; Genton, M.G.; Nychka, D. Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Stat. 2006, 15, 502–523. [Google Scholar] [CrossRef]
  49. Davis, T.A. Algorithm 849: A concise sparse Cholesky factorization package. ACM Trans. Math. Softw. 2005, 31, 587–591. [Google Scholar] [CrossRef]
  50. Davis, T.A. Direct Methods for Sparse Linear Systems; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2006. [Google Scholar]
  51. Gardner, J.; Pleiss, G.; Weinberger, K.Q.; Bindel, D.; Wilson, A.G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  52. Dray, S.; Josse, J. Principal component analysis with missing values: A comparative survey of methods. Plant Ecol. 2015, 216, 657–667. [Google Scholar] [CrossRef]
  53. Audigier, V.; Husson, F.; Josse, J. A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 2016, 10, 5–26. [Google Scholar] [CrossRef]
  54. Chen, X.; Yang, J.; Sun, L. A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation. Transp. Res. Part C Emerg. Technol. 2020, 117, 102673. [Google Scholar] [CrossRef]
  55. Liu, X.; Wang, X.; Zou, L.; Xia, J.; Pang, W. Spatial imputation for air pollutants data sets via low rank matrix completion algorithm. Environ. Int. 2020, 139, 105713. [Google Scholar] [CrossRef]
  56. Qu, L.; Li, L.; Zhang, Y.; Hu, J. PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Trans. Intell. Transp. Syst. 2009, 10, 512–522. [Google Scholar] [CrossRef]
  57. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60. [Google Scholar] [CrossRef]
  58. Roweis, S. EM algorithms for PCA and SPCA. In Proceedings of the Advances in Neural Information Processing Systems 10 (NIPS 1997), Denver, CO, USA, December 1997; Volume 10. [Google Scholar]
  59. Jolliffe, I.T. Principal Component Analysis, 2nd ed.; Springer: New York, NY, USA, 2002; pp. 1094–1096. [Google Scholar]
  60. Schein, A.I.; Saul, L.K.; Ungar, L.H. A generalized linear model for principal component analysis of binary data. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 3–6 January 2003; p. 10. [Google Scholar]
  61. Sacks, J.; Welch, W.J.; Mitchell, T.J.; Wynn, H.P. Design and analysis of computer experiments. Stat. Sci. 1989, 4, 409–435. [Google Scholar] [CrossRef]
  62. Lophaven, S.N.; Nielsen, H.B.; Sondergaard, J. Aspects of the MATLAB Toolbox DACE; Technical University of Denmark: Lyngby, Denmark, 2002. [Google Scholar]
  63. Bostanabad, R.; Kearney, T.; Tao, S.; Apley, D.W.; Chen, W. Leveraging the nugget parameter for efficient Gaussian process modeling. Int. J. Numer. Methods Eng. 2018, 114, 501–516. [Google Scholar] [CrossRef]
  64. Audet, C.; Dennis, J.E., Jr. Analysis of generalized pattern searches. SIAM J. Optim. 2002, 13, 889–903. [Google Scholar] [CrossRef]
Figure 1. All the master tracks for the considered database, along with the linearized coastal boundary (white line) for the reference landfall definition. Tracks corresponding to the same heading are shown with the same color.
Figure 2. Histograms of the percentage of storms for which each node is inundated, presented as a relative frequency plot.
Figure 3. Spatial distribution of the percentage of storms for which each node is inundated (wetness) for domains: (a) full domain (Df) and (b) subdomain (Ds). The red-colored box represents the boundary for Ds.
Figure 4. Spatial distribution of the surge gap at each once-dry node for domains: (a) full domain (Df) and (b) subdomain (Ds).
Figure 5. Histograms of the percentage of once-dry nodes for which pseudo-surge is misclassified within full domain (Df), presented as a relative frequency plot, for imputation variants: (a) kNN, (b) kNN-PPCA, (c) kNNrc, and (d) kNNrc-PPCA.
Figure 6. Histograms of the percentage of once-dry nodes for which pseudo-surge is misclassified for subdomain (Ds), presented as a relative frequency plot, for imputation variants: (a) kNNrc, (b) kNNrc-PPCA, (c) GPI, and (d) GPI-PPCA.
Figure 7. Spatial distribution of the surge score across full domain (Df) for implementation variants: (a) kNN, (b) kNN-PPCA, (c) kNNrc, and (d) kNNrc-PPCA.
Figure 8. Spatial distribution of the surge score across subdomain (Ds) for implementation variants: (a) kNNrc, (b) kNNrc-PPCA, (c) GPI, and (d) GPI-PPCA.
Figure 9. Histograms of the percentage of nodes for which surrogate model prediction is misclassified within full domain (Df), presented as a relative frequency plot, for imputation variants: (a) kNN, (b) kNN-PPCA, (c) kNNrc, and (d) kNNrc-PPCA.
Figure 10. Histograms of the percentage of nodes for which surrogate model prediction is misclassified within subdomain (Ds), presented as a relative frequency plot, for imputation variants: (a) kNNrc, (b) kNNrc-PPCA, (c) GPI, and (d) GPI-PPCA.
Table 1. Storm characteristics per master track (MT) group for CHS-LA. The last three columns give the [min max] range of each storm parameter.

| ID | Storm heading β (°) | Number of different MTs (landfall locations) | Number of storms per MT | ΔP (mbar) | Rmw (km) | vt (m/s) |
|----|----|----|----|----|----|----|
| 1 | −80 | 6 | 48 | [8 148] | [9.3 115.5] | [2.4 10.60] |
| 2 | −60 | 9 | 70 | [18 138] | [11.8 127.5] | [2.4 12.50] |
| 3 | −40 | 13 | 104 | [8 148] | [8.5 133.1] | [2.2 13.90] |
| 4 | −20 | 15 | 105 | [18 138] | [9.1 116.5] | [2.4 11.80] |
| 5 | 0 | 15 | 120 | [8 148] | [8.0 130.0] | [2.4 13.05] |
| 6 | 20 | 14 | 98 | [18 138] | [8.6 138.2] | [2.3 12.65] |
| 7 | 40 | 9 | 72 | [8 148] | [9.6 141.3] | [2.3 12.85] |
| 8 | 60 | 4 | 28 | [18 138] | [9.4 119.4] | [2.5 13.40] |
Table 2. Database statistics: proportion of nodes belonging to different groups as well as the respective percentages (%) of instances that these nodes are inundated within the original database.

| Database | Statistic | Once-dry | Surge gap > 0.075 m | > 0.15 m | > 0.25 m | > 0.5 m | > 0.75 m | > 1 m | > 1.5 m |
|----|----|----|----|----|----|----|----|----|----|
| Df | % of nodes | 41.40 | 21.77 | 13.91 | 8.86 | 3.77 | 1.86 | 0.99 | 0.35 |
| Df | % inundated in database | 31.26 | 6.94 | 4.40 | 2.90 | 1.94 | 1.88 | 2.04 | 2.89 |
| Ds | % of nodes | 39.73 | 37.18 | 27.67 | 20.26 | 10.59 | 5.99 | 3.55 | 1.28 |
| Ds | % inundated in database | 42.54 | 4.75 | 3.18 | 2.27 | 1.46 | 1.21 | 1.16 | 1.63 |
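As a hedged illustration of how the wetness statistics summarized above (and mapped in Figures 2 and 3) might be computed, the following Python sketch assumes a hypothetical storm-by-node peak surge matrix with dry instances encoded as NaN; the actual CHS database format and dry/wet conventions may differ.

```python
import numpy as np

# Hypothetical peak-surge matrix: rows = storms, columns = save points (nodes).
# Dry instances are encoded as NaN here purely for illustration; the database
# used in the paper may follow a different convention.
rng = np.random.default_rng(0)
surge = rng.normal(1.0, 0.8, size=(200, 5000))
surge[surge < 0.2] = np.nan  # mark low responses as dry for this example

# Wetness: percentage of storms for which each node is inundated
# (the quantity shown in Figures 2 and 3).
wetness = 100.0 * np.mean(~np.isnan(surge), axis=0)

# Nodes that are dry for at least one storm (the "once-dry" group of Table 2).
once_dry = np.isnan(surge).any(axis=0)
pct_once_dry = 100.0 * once_dry.mean()

print(f"{pct_once_dry:.1f}% of nodes are once-dry")
```

Grouping nodes by surge-gap thresholds, as in the remaining columns of Table 2, would follow the same pattern once a per-node surge gap has been defined.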
Table 3. Imputation statistics for each of the individual imputation variants: proportion of nodes belonging to different groups as well as the respective percentages (%) of instances that these nodes are inundated within the original database.

| Database | Statistic | kNN Apn | kNN C2 | kNN C3 | kNNrc Apn | kNNrc C2 | kNNrc C3 | PPCA Apn | PPCA C2 | PPCA C3 | GPI Apn | GPI C2 | GPI C3 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Df | % of nodes | 63.87 | 41.40 | 58.60 | 69.56 | 48.92 | 51.08 | 99.24 | 94.50 | 5.50 | N/A | N/A | N/A |
| Df | % inundated in database | 33.35 | 25.75 | 35.15 | 32.09 | 24.93 | 37.31 | 31.17 | 29.16 | 67.30 | N/A | N/A | N/A |
| Ds | % of nodes | 69.95 | 52.31 | 47.69 | 72.27 | 55.20 | 44.80 | 98.17 | 89.18 | 10.82 | 78.52 | 64.61 | 35.39 |
| Ds | % inundated in database | 26.40 | 15.07 | 44.08 | 26.37 | 15.97 | 44.84 | 28.78 | 25.15 | 59.82 | 55.51 | 55.18 | 55.10 |
Table 4. Imputation statistics for the two-stage imputation variants: proportion of nodes belonging to different groups as well as the respective percentages (%) of instances that these nodes are inundated within the original database.

| Database | Statistic | kNN-PPCA Apn | kNN-PPCA C2 | kNN-PPCA C3 | kNNrc-PPCA Apn | kNNrc-PPCA C2 | kNNrc-PPCA C3 | GPI-PPCA Apn | GPI-PPCA C2 | GPI-PPCA C3 |
|----|----|----|----|----|----|----|----|----|----|----|
| Df | % of nodes | 54.58 | 41.00 | 59.00 | 60.45 | 46.98 | 53.02 | N/A | N/A | N/A |
| Df | % inundated in database | 33.34 | 26.81 | 34.34 | 32.13 | 26.23 | 35.70 | N/A | N/A | N/A |
| Ds | % of nodes | 63.49 | 52.85 | 47.15 | 65.55 | 55.20 | 44.80 | 73.09 | 59.27 | 40.73 |
| Ds | % inundated in database | 24.49 | 16.77 | 42.50 | 25.08 | 17.74 | 42.66 | 55.55 | 55.10 | 55.23 |
Table 5. Surge score S̄C (cm) for different nodal groups for full domain (Df), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Nodal group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 7.69 | 7.69 | 7.79 | 7.51 | 7.53 | 7.85 | 7.87 | 8.10 | 7.69 | 7.73 |
| Once-dry | 5.32 | 5.27 | 5.40 | 5.13 | 5.12 | 5.55 | 5.60 | 5.83 | 5.42 | 5.49 |
| Surge gap > 0.25 m | 2.27 | 2.18 | 2.11 | 2.17 | 2.10 | 2.88 | 3.13 | 2.94 | 2.79 | 3.05 |
| Surge gap > 0.5 m | 1.85 | 1.75 | 1.74 | 1.81 | 1.72 | 2.52 | 2.91 | 2.59 | 2.44 | 2.83 |
| Surge gap > 0.75 m | 1.76 | 1.68 | 1.71 | 1.76 | 1.69 | 2.58 | 3.10 | 2.64 | 2.47 | 3.02 |
| Surge gap > 1 m | 1.89 | 1.81 | 1.86 | 1.90 | 1.84 | 2.90 | 3.47 | 2.93 | 2.77 | 3.35 |
| Surge gap > 1.5 m | 2.67 | 2.56 | 2.72 | 2.75 | 2.70 | 4.38 | 4.83 | 4.40 | 4.24 | 4.66 |
| Node group C2 | 5.76 | 5.41 | 5.29 | 5.62 | 5.43 | 6.21 | 6.00 | 5.74 | 6.24 | 6.14 |
| Node group C3 | 5.00 | 5.13 | 7.29 | 4.79 | 4.85 | 5.08 | 5.22 | 7.45 | 4.86 | 4.92 |
Table 6. Misclassification M̄C (%) for different nodal groups for full domain (Df), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Nodal group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 1.54 | 1.55 | 1.68 | 1.53 | 1.54 | 2.60 | 2.79 | 3.53 | 2.46 | 2.62 |
| Once-dry | 3.57 | 3.57 | 3.90 | 3.54 | 3.55 | 6.14 | 6.58 | 8.38 | 5.79 | 6.18 |
| Surge gap > 0.25 m | 1.01 | 1.00 | 1.02 | 1.00 | 0.99 | 5.04 | 7.15 | 6.11 | 4.38 | 6.01 |
| Surge gap > 0.5 m | 0.77 | 0.76 | 0.77 | 0.76 | 0.75 | 4.60 | 7.72 | 5.52 | 4.00 | 6.41 |
| Surge gap > 0.75 m | 0.69 | 0.68 | 0.69 | 0.69 | 0.68 | 4.55 | 8.53 | 5.24 | 4.00 | 7.03 |
| Surge gap > 1 m | 0.70 | 0.68 | 0.69 | 0.69 | 0.68 | 4.59 | 8.65 | 5.17 | 4.14 | 6.96 |
| Surge gap > 1.5 m | 0.86 | 0.84 | 0.84 | 0.86 | 0.84 | 5.24 | 7.96 | 6.32 | 5.25 | 6.63 |
| Node group C2 | 3.63 | 3.67 | 4.03 | 3.72 | 3.75 | 8.37 | 8.86 | 8.69 | 7.83 | 8.38 |
| Node group C3 | 3.52 | 3.47 | 1.67 | 3.42 | 3.37 | 4.56 | 4.39 | 2.94 | 4.38 | 4.24 |
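The wet/dry misclassification statistics reported here can be sketched as follows. This is a plausible reconstruction, assuming M̄C counts node-storm instances where the surrogate's wet/dry classification disagrees with the numerical model, with M̄C+ (predicted wet, actually dry) and M̄C− (predicted dry, actually wet) reported separately; the paper's exact definitions may differ.

```python
import numpy as np

def misclassification_rates(actual_wet, predicted_wet):
    """Return (overall, false-positive, false-negative) rates in percent.

    A false positive is an instance predicted wet but actually dry; a false
    negative is predicted dry but actually wet. Overall misclassification is
    the sum of both, normalized by the total number of instances.
    """
    actual_wet = np.asarray(actual_wet, dtype=bool)
    predicted_wet = np.asarray(predicted_wet, dtype=bool)
    n = actual_wet.size
    fp = np.sum(predicted_wet & ~actual_wet)   # predicted wet, actually dry
    fn = np.sum(~predicted_wet & actual_wet)   # predicted dry, actually wet
    return (100.0 * (fp + fn) / n, 100.0 * fp / n, 100.0 * fn / n)

# Tiny illustration on five node-storm instances.
mc, mc_pos, mc_neg = misclassification_rates(
    [True, True, False, False, True],   # numerical model: wet/dry
    [True, False, True, False, True],   # surrogate prediction: wet/dry
)
# mc = 40.0, mc_pos = 20.0, mc_neg = 20.0
```

In practice such rates would be averaged over storms and reported per nodal group, as in the tables.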
Table 7. False positive misclassification M̄C+ (%) for different nodal groups for full domain (Df), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Nodal group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 0.68 | 0.67 | 0.62 | 0.66 | 0.66 | 1.82 | 2.05 | 2.90 | 1.68 | 1.89 |
| Once-dry | 1.63 | 1.61 | 1.50 | 1.60 | 1.59 | 4.39 | 4.94 | 7.01 | 4.07 | 4.56 |
| Surge gap > 0.25 m | 0.31 | 0.30 | 0.32 | 0.30 | 0.30 | 4.82 | 7.00 | 6.02 | 4.15 | 5.84 |
| Surge gap > 0.5 m | 0.21 | 0.20 | 0.21 | 0.21 | 0.20 | 4.41 | 7.60 | 5.43 | 3.81 | 6.29 |
| Surge gap > 0.75 m | 0.19 | 0.18 | 0.19 | 0.19 | 0.18 | 4.38 | 8.42 | 5.15 | 3.83 | 6.91 |
| Surge gap > 1 m | 0.22 | 0.20 | 0.20 | 0.22 | 0.20 | 4.43 | 8.52 | 5.06 | 3.99 | 6.82 |
| Surge gap > 1.5 m | 0.34 | 0.28 | 0.28 | 0.34 | 0.28 | 5.12 | 7.85 | 6.22 | 5.13 | 6.48 |
| Node group C2 | 1.36 | 1.39 | 1.52 | 1.37 | 1.40 | 7.22 | 7.78 | 7.34 | 6.69 | 7.28 |
| Node group C3 | 1.83 | 1.82 | 1.12 | 1.76 | 1.75 | 2.40 | 2.23 | 1.24 | 2.24 | 2.15 |
Table 8. False negative misclassification M̄C− (%) for different nodal groups for full domain (Df), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Nodal group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 0.86 | 0.88 | 1.06 | 0.87 | 0.88 | 0.78 | 0.74 | 0.63 | 0.78 | 0.73 |
| Once-dry | 1.93 | 1.96 | 2.41 | 1.94 | 1.97 | 1.74 | 1.63 | 1.37 | 1.73 | 1.62 |
| Surge gap > 0.25 m | 0.70 | 0.70 | 0.70 | 0.70 | 0.69 | 0.22 | 0.15 | 0.09 | 0.24 | 0.16 |
| Surge gap > 0.5 m | 0.55 | 0.55 | 0.56 | 0.55 | 0.55 | 0.19 | 0.12 | 0.08 | 0.19 | 0.13 |
| Surge gap > 0.75 m | 0.50 | 0.50 | 0.50 | 0.49 | 0.50 | 0.17 | 0.11 | 0.09 | 0.17 | 0.12 |
| Surge gap > 1 m | 0.47 | 0.49 | 0.49 | 0.47 | 0.49 | 0.16 | 0.12 | 0.11 | 0.15 | 0.14 |
| Surge gap > 1.5 m | 0.52 | 0.56 | 0.56 | 0.52 | 0.56 | 0.12 | 0.11 | 0.10 | 0.12 | 0.14 |
| Node group C2 | 2.27 | 2.28 | 2.51 | 2.34 | 2.35 | 1.15 | 1.09 | 1.35 | 1.14 | 1.09 |
| Node group C3 | 1.69 | 1.66 | 0.54 | 1.67 | 1.62 | 2.16 | 2.15 | 1.71 | 2.13 | 2.09 |
Table 9. Surge score S̄C (cm) for different node groups for subdomain (Ds), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Node group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | GPI (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | GPI-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | GPI (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) | GPI-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 9.26 | 9.25 | 9.13 | 9.63 | 9.39 | 9.40 | 9.20 | 9.61 | 9.67 | 10.72 | 10.08 | 9.58 | 9.83 | 9.91 |
| Once-dry | 5.91 | 5.82 | 5.85 | 6.06 | 5.77 | 5.69 | 5.64 | 6.48 | 6.62 | 7.29 | 6.78 | 6.25 | 6.52 | 7.43 |
| Surge gap > 0.25 m | 1.98 | 1.86 | 1.86 | 1.84 | 1.92 | 1.81 | 1.84 | 2.73 | 3.34 | 2.82 | 2.70 | 2.60 | 3.24 | 3.92 |
| Surge gap > 0.5 m | 1.58 | 1.46 | 1.47 | 1.47 | 1.55 | 1.46 | 1.49 | 2.31 | 3.06 | 2.35 | 2.28 | 2.20 | 2.97 | 3.37 |
| Surge gap > 0.75 m | 1.36 | 1.29 | 1.30 | 1.32 | 1.36 | 1.30 | 1.33 | 2.11 | 2.97 | 2.15 | 2.09 | 2.01 | 2.91 | 3.11 |
| Surge gap > 1 m | 1.35 | 1.28 | 1.29 | 1.33 | 1.35 | 1.31 | 1.34 | 2.13 | 2.93 | 2.22 | 2.15 | 2.03 | 2.86 | 3.17 |
| Surge gap > 1.5 m | 1.98 | 1.87 | 1.89 | 1.99 | 2.03 | 1.98 | 2.01 | 3.15 | 3.61 | 3.40 | 3.29 | 3.12 | 3.57 | 5.09 |
| Node group C2 | 5.01 | 4.99 | 5.62 | 6.00 | 4.95 | 4.94 | 5.22 | 5.91 | 6.30 | 6.70 | 6.81 | 5.95 | 6.41 | 8.22 |
| Node group C3 | 6.90 | 6.84 | 6.38 | 6.48 | 6.56 | 6.52 | 6.26 | 7.09 | 7.02 | 8.37 | 6.57 | 6.53 | 6.64 | 6.27 |
Table 10. Misclassification M̄C (%) for different node groups for subdomain (Ds), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Node group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | GPI (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | GPI-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | GPI (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) | GPI-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 1.13 | 1.13 | 1.15 | 1.21 | 1.12 | 1.13 | 1.14 | 2.37 | 2.97 | 3.78 | 3.35 | 2.21 | 2.78 | 4.40 |
| Once-dry | 2.78 | 2.76 | 2.80 | 2.96 | 2.75 | 2.74 | 2.79 | 5.90 | 7.40 | 9.40 | 8.34 | 5.49 | 6.90 | 11.00 |
| Surge gap > 0.25 m | 0.78 | 0.78 | 0.77 | 0.78 | 0.78 | 0.77 | 0.78 | 5.47 | 10.37 | 4.83 | 5.03 | 4.80 | 9.21 | 11.19 |
| Surge gap > 0.5 m | 0.60 | 0.59 | 0.59 | 0.59 | 0.60 | 0.59 | 0.59 | 5.21 | 11.11 | 4.03 | 4.38 | 4.53 | 9.93 | 10.33 |
| Surge gap > 0.75 m | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 5.02 | 11.51 | 3.55 | 3.99 | 4.40 | 10.43 | 9.81 |
| Surge gap > 1 m | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.50 | 0.51 | 4.83 | 10.80 | 3.61 | 4.11 | 4.41 | 9.57 | 9.94 |
| Surge gap > 1.5 m | 0.67 | 0.65 | 0.66 | 0.65 | 0.66 | 0.65 | 0.66 | 5.20 | 8.41 | 4.77 | 5.48 | 5.28 | 6.92 | 13.74 |
| Node group C2 | 2.15 | 2.25 | 2.63 | 3.21 | 2.22 | 2.29 | 2.70 | 7.46 | 10.20 | 9.00 | 9.33 | 7.23 | 9.77 | 16.27 |
| Node group C3 | 3.47 | 3.38 | 3.11 | 1.28 | 3.26 | 3.23 | 2.92 | 4.18 | 3.95 | 10.15 | 1.90 | 3.81 | 3.74 | 3.32 |
Table 11. False positive misclassification M̄C+ (%) for different node groups for subdomain (Ds), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Node group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | GPI (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | GPI-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | GPI (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) | GPI-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 0.50 | 0.50 | 0.49 | 0.51 | 0.49 | 0.50 | 0.48 | 1.82 | 2.47 | 3.49 | 3.07 | 1.68 | 2.28 | 3.97 |
| Once-dry | 1.26 | 1.26 | 1.22 | 1.28 | 1.25 | 1.25 | 1.22 | 4.57 | 6.22 | 8.77 | 7.72 | 4.23 | 5.75 | 10.00 |
| Surge gap > 0.25 m | 0.23 | 0.23 | 0.23 | 0.24 | 0.23 | 0.23 | 0.24 | 5.19 | 10.19 | 4.62 | 4.87 | 4.54 | 9.04 | 11.07 |
| Surge gap > 0.5 m | 0.17 | 0.16 | 0.16 | 0.16 | 0.17 | 0.16 | 0.16 | 4.98 | 10.95 | 3.84 | 4.23 | 4.30 | 9.79 | 10.23 |
| Surge gap > 0.75 m | 0.14 | 0.13 | 0.14 | 0.13 | 0.14 | 0.13 | 0.14 | 4.83 | 11.37 | 3.37 | 3.84 | 4.21 | 10.29 | 9.71 |
| Surge gap > 1 m | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 4.65 | 10.65 | 3.45 | 3.97 | 4.22 | 9.43 | 9.85 |
| Surge gap > 1.5 m | 0.22 | 0.20 | 0.21 | 0.20 | 0.21 | 0.20 | 0.20 | 5.05 | 8.27 | 4.63 | 5.35 | 5.13 | 6.78 | 13.66 |
| Node group C2 | 0.76 | 0.84 | 1.06 | 1.34 | 0.78 | 0.85 | 1.05 | 6.79 | 9.60 | 8.39 | 8.74 | 6.58 | 9.21 | 15.77 |
| Node group C3 | 1.80 | 1.79 | 1.52 | 0.87 | 1.69 | 1.69 | 1.46 | 2.13 | 2.06 | 9.48 | 1.02 | 1.96 | 1.94 | 1.60 |
Table 12. False negative misclassification M̄C− (%) for different node groups for subdomain (Ds), for surrogate models supported by different imputation variants. Columns labeled (Ss+Sc) use the pseudo-surge database (combination of Ss and Sc); columns labeled (Ss) use the corrected pseudo-surge database (Ss only). Lower values indicate better accuracy.

| Node group | kNN (Ss+Sc) | kNNrc (Ss+Sc) | GPI (Ss+Sc) | PPCA (Ss+Sc) | kNN-PPCA (Ss+Sc) | kNNrc-PPCA (Ss+Sc) | GPI-PPCA (Ss+Sc) | kNN (Ss) | kNNrc (Ss) | GPI (Ss) | PPCA (Ss) | kNN-PPCA (Ss) | kNNrc-PPCA (Ss) | GPI-PPCA (Ss) |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| All nodes | 0.63 | 0.63 | 0.66 | 0.70 | 0.63 | 0.63 | 0.66 | 0.56 | 0.50 | 0.30 | 0.28 | 0.53 | 0.49 | 0.43 |
| Once-dry | 1.52 | 1.50 | 1.58 | 1.50 | 1.50 | 1.49 | 1.57 | 1.32 | 1.18 | 0.63 | 0.62 | 1.26 | 1.15 | 1.00 |
| Surge gap > 0.25 m | 0.55 | 0.54 | 0.54 | 0.54 | 0.54 | 0.54 | 0.54 | 0.28 | 0.19 | 0.20 | 0.16 | 0.26 | 0.17 | 0.12 |
| Surge gap > 0.5 m | 0.43 | 0.43 | 0.43 | 0.43 | 0.43 | 0.43 | 0.43 | 0.23 | 0.16 | 0.19 | 0.16 | 0.22 | 0.14 | 0.11 |
| Surge gap > 0.75 m | 0.38 | 0.38 | 0.38 | 0.38 | 0.38 | 0.37 | 0.38 | 0.20 | 0.14 | 0.18 | 0.15 | 0.19 | 0.14 | 0.10 |
| Surge gap > 1 m | 0.37 | 0.37 | 0.37 | 0.37 | 0.37 | 0.36 | 0.37 | 0.18 | 0.14 | 0.17 | 0.14 | 0.18 | 0.14 | 0.10 |
| Surge gap > 1.5 m | 0.45 | 0.45 | 0.45 | 0.45 | 0.45 | 0.45 | 0.45 | 0.15 | 0.14 | 0.14 | 0.12 | 0.16 | 0.14 | 0.09 |
| Node group C2 | 1.38 | 1.42 | 1.57 | 1.87 | 1.44 | 1.44 | 1.64 | 0.67 | 0.60 | 0.61 | 0.59 | 0.64 | 0.57 | 0.51 |
| Node group C3 | 1.67 | 1.59 | 1.58 | 0.41 | 1.57 | 1.54 | 1.45 | 2.05 | 1.89 | 0.67 | 0.88 | 1.85 | 1.79 | 1.72 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
