Prediction of Wave Transmission Characteristics of Low-Crested Structures with a Comprehensive Analysis of Machine Learning Models

The adoption of low-crested and submerged structures (LCS) reduces the wave height behind the structure, depending on the freeboard, and induces calmer wave conditions shoreward. We aimed to estimate the wave transmission coefficient behind LCS structures to determine their wave mitigation characteristics. Various empirical formulas based on regression analysis have been proposed to quantitatively predict wave attenuation characteristics for field applications; however, the inherent variability of wave attenuation limits linear statistical approaches such as linear regression analysis. Herein, to develop an optimization model for the hydrodynamic behavior of the LCS, we performed a comprehensive analysis of 10 types of machine learning models, whose prediction accuracy was compared and reviewed against the existing empirical formulas. We found that, among the 10 models, the gradient boosting model showed the highest prediction accuracy, with an MSE of 1.0 × 10⁻³, an index of agreement of 0.996, a scatter index of 0.065, and a correlation coefficient of 0.983, indicating a performance improvement over the existing empirical formulas. In addition, a variable importance analysis using explainable artificial intelligence identified the relative freeboard (Rc/H0) and the relative freeboard to water depth ratio (Rc/h) as the most significant input variables, confirming that the relative freeboard was the most dominant factor influencing wave attenuation in the hydraulic behavior around the LCS. Thus, we concluded that a prediction method using a machine learning model can be applied to various predictive studies in the field of coastal engineering, moving beyond existing empirical approaches.


Introduction
Artificial structures for wave mitigation, such as breakwaters, headlands, detached breakwaters, and submerged breakwaters, are utilized to control coastal erosion by reducing incident wave energy and sediment transport. Recently, shoreline deformation from beach erosion and scouring caused by coastal development has been increasing rapidly, along with sea level rise and increases in storm-wave forcing due to climate change [1]. Coastal erosion and sedimentation caused by such morphological change can alter the natural environment and ecosystems of coastal areas [2,3]. These problems directly and indirectly affect local economic activities in related fields such as fisheries and tourism.
Low-crested submerged structures (LCS), such as detached breakwaters and artificial reefs, reduce the wave height behind the structure according to the change in freeboard at the still water level, thereby protecting the onshore environment [4]. Since the geometrical specifications of an LCS must be set to achieve the target wave transmission coefficient, calculating or predicting the transmission coefficient is essential at the design stage.

Linear Regression
The linear regression model has the advantage that its parameters are linear and can be interpreted easily and analyzed quickly. Linear regression models were developed over 100 years ago and have been widely used over the past decades. However, their very restrictive shape results in low accuracy for data with nonlinear relationships. Linear regression builds a regression model from one or more features and finds the parameters w and b that minimize the mean squared error (MSE) between the experimental value (y) and the predicted value (ŷ) (Equations (1) and (2)).
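As a minimal sketch of this minimization (with synthetic stand-in data, not the LCS dataset), scikit-learn's LinearRegression finds w and b that minimize the MSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, not the LCS experiments):
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = 0.7 * X[:, 0] - 0.3 * X[:, 1] + 0.05 * rng.normal(size=100)

model = LinearRegression().fit(X, y)   # finds w (coef_) and b (intercept_) minimizing MSE
y_hat = model.predict(X)
mse = np.mean((y - y_hat) ** 2)        # mean squared error between y and y-hat
```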

Lasso Regression
In the existing linear regression method, overfitting with poor predictive performance may occur when new data are provided. To solve this problem, lasso regression was developed, which uses L1 regularization to forcibly constrain the model (Equation (3)).
Here, m is the number of weights, and α is a penalty parameter; w and b are determined so as to minimize the sum of the MSE and the penalty term.

Ridge Regression
Ridge regression is a model with an added L2 constraint to solve the overfitting problem of the linear regression model. The model not only fits the data of the learning algorithm, but also keeps the weights of the model as small as possible (Equation (4)).
The weights can become exactly zero in lasso regression, whereas in ridge regression they become close to zero but not zero. Consequently, if only some of the input variables are important, lasso regression tends to achieve higher accuracy, whereas if the input variables are of similar importance overall, the ridge model tends to perform better.
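The contrast between the two penalties can be seen on hypothetical data in which only the first feature matters (the α values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only the first feature matters -> lasso should zero out the rest.
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives irrelevant weights exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights toward (but not to) zero

print(np.round(lasso.coef_, 3))      # irrelevant weights are exactly 0
print(np.round(ridge.coef_, 3))      # irrelevant weights are small but typically nonzero
```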

SVM
The SVM was introduced by Boser et al. [22], inspired by statistical learning theory. The SVM finds a hyperplane, defined by support vectors, that separates vectors of different classes with the maximum margin [23]. The algorithm maps data that cannot be classified linearly in a low-dimensional space into a high-dimensional space using a kernel function, and classifies it there with a hyperplane. Representative kernel functions include the polynomial, sigmoid, and radial basis function (RBF) kernels. In this study, the Gaussian RBF kernel was used [24].
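A hedged sketch of an RBF-kernel support vector regressor on a synthetic nonlinear target (the hyperparameter values are illustrative, not those tuned in this study):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0])                       # nonlinear target a linear model cannot fit

# The Gaussian RBF kernel implicitly maps the data into a high-dimensional space
svr = SVR(kernel="rbf", C=10.0, gamma="scale").fit(X, y)
print(round(svr.score(X, y), 3))          # R^2 on this smooth function
```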

Gaussian Process Regression (GPR)
The GPR model is a probabilistic model based on nonparametric kernels. Gaussian process regression can be performed when the wave attenuation coefficient, the dependent variable, follows a Gaussian distribution [25]. Specifically, if a specific wave attenuation coefficient (Kt*) is assumed to be a random variable that includes an error (Kt* = Kt + ε), the expected wave attenuation coefficient with the error removed can be expressed through a covariance function between the mean and the error. Assuming this error covariance can be interpreted as a kernel function, a Bayesian analysis model can predict the wave attenuation characteristics [26].
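A brief illustration, assuming a made-up decaying target in place of the measured transmission data; the kernel combines a smooth RBF covariance with a white-noise term standing in for the error ε:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(60, 1))
y = np.exp(-X[:, 0]) + 0.01 * rng.normal(size=60)   # hypothetical decaying curve

# Kernel = smooth covariance + white-noise term modeling the error
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
mean, std = gpr.predict([[2.0]], return_std=True)   # posterior mean and uncertainty
```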

Ensemble Method
The ensemble method was created to improve the performance of the classification and regression tree (CART). It builds a more accurate prediction model by creating several classifiers and combining their predictions; in other words, it derives a highly accurate model by combining several weak models rather than using a single strong model. Ensemble models can be broadly divided into bagging and boosting. The bagging method reduces variance by averaging or voting over the results predicted by various models, and the boosting method combines weak classifiers into a strong classifier. In this study, we performed a predictive study using boosting and random forest (RF) ensemble methods.
(1) Random Forest (RF)
RF is a method employed to remedy defects of the decision tree, such as its large variance and wide performance fluctuation range. RF combines the concept of bagging with randomized node optimization to overcome the shortcomings of existing decision trees and improve generalization performance. As in bagging, bootstrap samples are extracted and a decision tree is created for each bootstrap; however, instead of selecting the optimal partition among all predictors at each node, RF randomly extracts a subset of predictors and creates the optimal partition within the extracted variables [27]. In other words, RF creates several weak learners by drawing slightly different training data through the bootstrap to give maximum randomness, while simultaneously randomizing the predictors. Important hyperparameters of RF include max_features, the bootstrap option, and n_estimators. The max_features parameter is the maximum number of features considered at each node, bootstrap controls whether data are sampled with replacement for each tree, and n_estimators is the number of trees created in the model [28].
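The three hyperparameters above can be set as follows; the data and values here are illustrative placeholders, not the tuned configuration of this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 7))                    # seven stand-in dimensionless inputs
y = np.exp(-2 * X[:, 0]) + 0.3 * X[:, 4] + 0.02 * rng.normal(size=300)

rf = RandomForestRegressor(
    n_estimators=200,      # number of trees in the forest
    max_features=3,        # predictors randomly considered at each split
    bootstrap=True,        # sample the training data with replacement per tree
    random_state=0,
).fit(X, y)
print(round(rf.score(X, y), 3))
```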
(2) Boosting Method
Boosting is used to create a strong classifier from several weak classifiers by increasing the weights on the data near the decision boundary. AdaBoost is the most common and widely used boosting algorithm in the ensemble learning family. Its main feature is that, after generating a weak classifier from the initial training data, the distribution of the training data is adjusted according to the prediction performance of that weak classifier: the weights of training samples with low prediction accuracy are increased using the information received from the classifier in the previous stage. In other words, the method improves learning accuracy by adaptively reweighting samples that the previous classifier predicted poorly, and it combines these weak classifiers into a strong classifier with better performance. The gradient boosting method applied in this study also adds multiple models sequentially, in the same way as AdaBoost [29]. The biggest difference between the two algorithms is how the weak learners' errors are treated: AdaBoost emphasizes hard-to-classify samples by weighting them, whereas gradient boosting fits each new model using a loss function to characterize the errors. The loss function is an indicator that evaluates the performance of the model on the data, and the model result can differ depending on which loss function is used. The correlation coefficient, which indicates the correlation between the predicted and measured values of the model, is an important factor for evaluating the predictive performance of a machine learning model.
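A minimal gradient boosting sketch on synthetic stand-in data; the default squared-error loss is assumed, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(300, 7))
y = np.exp(-2 * X[:, 0]) + 0.3 * X[:, 4] + 0.02 * rng.normal(size=300)

# Each new tree is fitted to the gradient of the loss (the residuals,
# for the default squared-error loss), then added to the ensemble.
gbr = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    random_state=0,
).fit(X, y)
print(round(gbr.score(X, y), 3))
```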
To analyze the predictive performance of the models, we measured performance using the mean squared error (MSE), index of agreement (I), scatter index (SI), and the coefficient of determination (R²). For each dataset, these measures of agreement between the experimental and predicted values are given in Equations (5)-(8).
Here, xi and yi are the experimental and predicted values, respectively, x̄ and ȳ are their mean values, and n is the number of samples. Statistically, the closer R² and I are to 1, and the smaller the MSE and SI, the higher the reliability.

Analysis Method of Feature Importance
(1) eXplainable Artificial Intelligence (XAI)
XAI was developed to help users understand the overall characteristics of how an AI system works and to correctly interpret its final result. XAI provides surrogate models that make it possible to explain how results are calculated from the correlation between the input variables and the dependent variable, by determining the major factors that affect the prediction of a machine learning model. Such interpretation of a machine learning model is important both for deriving a suitable learning model under various conditions and for increasing the prediction stability of the model through quantitative analysis of how the predicted values follow from the input variables.
To interpret the model learning and prediction process, the characteristics of the input variables are analyzed, and this analysis is divided into global and local interpretation. Global interpretation interprets the overall analysis process and results of a model, whereas local interpretation interprets the model's predictions for a single observation or a part of the dataset, i.e., the result derived from the model for one specific input.
(2) SHapley Additive exPlanations (SHAP)
The Shapley value is the mean marginal contribution of one feature over all possible feature subsets, and it is grounded in game theory (which analyzes the decisions or actions of multiple players who influence each other). Lundberg et al. [30] developed the SHAP analysis model, which achieves the highest accuracy with a solid theoretical background among the machine learning interpretation methods released to date; the Shapley value is given in Equation (9).
Here, S is a subset of the features used in the model, i is the feature whose contribution is being explained, and n is the number of features. f(S) is the value obtained by subtracting the average prediction over the data from the prediction for one observation with the feature combination S. SHAP values are obtained by applying the conditional expectation function of the machine learning model to the Shapley values. The Shapley values for all input features are obtained, and the SHAP values can be interpreted locally and globally via the SHAP mean of each feature over the observations. The input feature importance can be visualized through the average or sum of the absolute SHAP values over the dataset. In the partial dependence plot provided by SHAP, the value of the input feature of each instance and the corresponding SHAP value are plotted as dots for all instances, and the average of the predicted values is calculated by varying the specific feature value of each instance. In this study, we analyzed the characteristics of the built machine learning model using SHAP.
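Equation (9) can be evaluated exactly for a toy value function by enumerating all feature subsets; the value function below is hypothetical and only illustrates how a main effect plus an interaction term is shared between features:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets (Equation (9))."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                # Weight |S|! (n - |S| - 1)! / n! for each coalition S
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(set(s) | {i}) - value(set(s)))
        phi[i] = total
    return phi

# Hypothetical "model": main effects for two features plus an interaction.
def value(subset):
    v = 0.0
    if "Rc/H0" in subset:
        v += 0.6
    if "B/H0" in subset:
        v += 0.2
    if "Rc/H0" in subset and "B/H0" in subset:
        v += 0.1   # interaction is split evenly between the two features
    return v

phi = shapley_values(["Rc/H0", "B/H0", "xi"], value)
```

Note that the Shapley values sum to the value of the full feature set (the efficiency property), which is what makes the attribution additive.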

Empirical Formula of Wave Transmission Coefficient
The wave transmission coefficient represents the ratio of the average wave height after passing through the LCS to the incident wave height before the structure (Figure 1).

Existing theoretical and empirical equations for the LCS are based on hydraulic model experiments, and many researchers have proposed empirical equations to predict the wave transmission coefficient using experimental data [31-33].
The equation for the wave transmission coefficient suggested by D'Angremond et al. [32] is as follows (Equations (10) and (11)). Here, Rc is the crest freeboard, Hi is the incident wave height, B is the crown width, and ξ is the surf similarity parameter for the breakwater (ξ = tan α/√(Hi/L0)). However, in this equation, the effective range of the wave transmission coefficient is limited to 0.075-0.8. Van der Meer [31] suggested a wave transmission coefficient equation based on the breakwater coefficient to improve the accuracy of the wave transmission coefficient (Equations (12) and (13)).
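For reference, the D'Angremond et al. formula as commonly quoted in the literature for permeable structures can be coded as below; the coefficient values (−0.4, 0.64, −0.31) are taken from the published formula and should be checked against Equations (10) and (11):

```python
import numpy as np

def kt_dangremond(Rc, Hi, B, tan_alpha, L0):
    """Wave transmission coefficient after D'Angremond et al. [32],
    permeable-structure coefficients as commonly quoted in the literature,
    clipped to the 0.075-0.8 effective range noted in the text."""
    xi = tan_alpha / np.sqrt(Hi / L0)                  # surf similarity parameter
    kt = -0.4 * Rc / Hi + 0.64 * (B / Hi) ** -0.31 * (1.0 - np.exp(-0.5 * xi))
    return float(np.clip(kt, 0.075, 0.8))              # effective range limit
```

As a sanity check, a more deeply submerged crest (more negative Rc) yields a larger transmission coefficient, consistent with the physics described in the text.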
Bleck and Oumeraci [33] proposed an exponential decay equation for the wave transmission coefficient of the LCS as a function of the relative freeboard (Equation (14)).
In previous studies, factors such as the relative freeboard (Rc/Hi), relative crest width (B/Hi), front slope of the structure (tan α), and wave steepness (Hi/L0) have been identified as the factors governing wave decay around the LCS, and empirical formulas based on them have been presented. Figure 1 shows the cross-section of the LCS structure. Herein, the crest freeboard (Rc = hc − h) indicates the difference between the structure crest height and the water depth, taking a positive value when the crest emerges above the still-water level and a negative value when it is submerged.
In this study, we compared and reviewed the results of calculating the wave transmission coefficient using the existing empirical formulas (Equations (10)-(14)) against the prediction results of a machine learning model.

Machine Learning Automatic Pipeline Model
In this study, we applied 10 machine learning models, namely linear regression, kernel ridge (KR), ridge, lasso, GPR, SVM, RF, artificial neural network (ANN), gradient boosting regressor (GBR), and AdaBoost, to compare and review how performance depends on the characteristics of each model. To determine the optimal conditions of the automatic pipeline model for the given input data characteristics, we tuned hyperparameters using GridSearchCV and constructed automatic models for the 10 machine learning models using the scikit-learn pipeline. The optimal machine learning model, selected through the automatic pipeline, was then analyzed with the machine learning interpretation package SHAP to determine the importance of the variables affecting the wave control of the LCS.
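A minimal sketch of such a pipeline with GridSearchCV, using synthetic stand-in data and an illustrative (not the study's actual) parameter grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in regression data in place of the LCS experiments
X, y = make_regression(n_samples=300, n_features=7, n_informative=5,
                       noise=0.1, random_state=0)

# Scaling and model chained so GridSearchCV tunes everything as one object
pipe = Pipeline([("scale", MinMaxScaler()),
                 ("gbr", GradientBoostingRegressor(random_state=0))])
grid = {"gbr__n_estimators": [100, 200],
        "gbr__max_depth": [2, 3]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2").fit(X, y)
print(search.best_params_)
```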

Machine Learning Model Configuration and Input Conditions
The 260 items of input data applied in this study were obtained with reference to the results of hydraulic model experiments on LCS by Seelig [34], Daemrich and Kahle [35], van der Meer [20], and Daemen [36]. Data on the wave transmission coefficient were obtained from the DELOS database for permeable structures; the 260 data points are drawn from these four experimental campaigns.
In the data applied to the model, the wave attenuation characteristics behind the structure were analyzed using random waves; the applied wave heights ranged from 0.021 to 0.231 m, and the wave periods from 0.91 to 3.66 s (Table A1). Various studies have reported on the wave attenuation mechanisms of LCS and on the factors that dominate wave attenuation; for example, Van der Meer [31] proposed a wave transmission equation, and later work [38] examined porosity and hc/h. Therefore, based on these previous studies, we used seven dimensionless numbers (X = {X1, X2, ..., X7}) as input variables (Table 1). Here, Rc/H0 is the relative freeboard, B/H0 is the relative crest width, ξ is the surf similarity parameter, B/L0 is the ratio of the crest width to the wavelength, Rc/h is the relative freeboard to water depth ratio, Dn50/hc is the ratio of the nominal diameter to the crest height, and hc/h is the relative structure height.
The nominal diameter Dn50 is obtained from the median unit mass (M50) and the mass density of the rock (ρr) as Dn50 = (M50/ρr)^(1/3); the Dn50/hc parameter thus relates the effect of voids to the structure height. The surf similarity parameter, ξ = tan α/√(Hi/L0), is the ratio of the front slope (tan α) to the wave steepness (Hi/L0) and is an important parameter in relation to wave breaking. The front slopes of the structures applied in this study were in the range 1:1.38 to 1:4, so various slope conditions were considered.
Figure 2 depicts the statistical distributions of the input and output variables. To place all features on the same scale, each input variable was converted to the range 0 to 1 using max-min normalization.
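The max-min normalization step corresponds to scaling each column by (x − min)/(max − min); the rows below are illustrative values only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw feature rows (not actual experimental records)
X = np.array([[-0.5, 0.021, 0.91],
              [ 0.0, 0.120, 2.10],
              [ 0.5, 0.231, 3.66]])

scaler = MinMaxScaler()          # applies (x - min) / (max - min) per column
X_scaled = scaler.fit_transform(X)
```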

Comparison of Machine Learning Model and Model Selection
Recently, research on the development and application of various machine learning techniques has been active in the field of computer science. The performance of these models differs according to the characteristics of the input variables. Therefore, to model the fluid mechanical behavior of the LCS, we analyzed the performance of 10 linear and nonlinear regression models. Figure 3 presents the performance results for the LCS (artificial reef) data derived from the machine learning pipeline model. Among the 10 machine learning models, GBR showed the highest performance with R² = 0.983, and linear regression showed the lowest with R² = 0.814. Table 2 shows the performance results for the 10 machine learning methods. The ensemble methods (AdaBoost, GBR, and RF), together with the ANN, showed the highest model accuracy (MSE under 1.3 × 10⁻³) and model performance (R² > 0.979). Among them, GBR yielded the highest prediction accuracy, reflecting how the boosting method reinforces weak learners. In addition, the linear regression models (linear, ridge, and lasso) showed low accuracy, indicating that a linear model cannot reflect the nonlinear characteristics of the data: a linear model shows low prediction performance when the relationship between the input variables and the dependent variable is nonlinear, although its performance could be raised somewhat by regulating the L1 and L2 weights. We present the wave transmission prediction results for the LCS in Figure A1, which shows the distribution of the experimental and predicted values of the test set. In designing coastal structures, a highly accurate estimate of the wave transmission coefficient is most important.
Comparing the machine learning models, the GBR model shows the highest accuracy in predicting the wave transmission coefficient for the LCS structure. Therefore, we performed the subsequent analysis with the GBR model, which best predicted the hydraulic characteristics around the LCS. To determine the most accurate parameters of the GBR model, we divided the collected data into training and test data. Traditionally in machine learning, when the number of data is small, the training and test data are split 7:3; however, when the number of data is large, the dataset can be split 9:1. We divided the dataset under 7:3, 8:2, and 9:1 conditions to perform a sensitivity analysis on model accuracy. Table 3 shows the model performance for each data-splitting condition; the highest R² was obtained under the 9:1 and 8:2 conditions. As the training set ratio increases, the number of training data increases, which allows the GBR model to produce a stronger learner. However, since the insufficient test set under the 9:1 condition risks overfitting and poor generalization, we built the model using the 8:2 condition. Figure 4 shows the prediction results for the training and test data considering the seven input variables using gradient boosting; the horizontal axis represents the experimental values, and the vertical axis the distribution of predicted values.
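The three splitting conditions can be sketched as follows, with make_regression standing in for the 260-sample LCS dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 260-sample LCS dataset
X, y = make_regression(n_samples=260, n_features=7, n_informative=5,
                       noise=0.1, random_state=0)

scores = {}
for test_frac in (0.3, 0.2, 0.1):        # the 7:3, 8:2, and 9:1 conditions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac,
                                              random_state=0)
    gbr = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    scores[test_frac] = gbr.score(X_te, y_te)   # test-set R^2 per condition
```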
As for the results, I was 0.999, SI was 0.032, and R² was 0.999 for the training data set, while the MSE was 0.8 × 10⁻³, I was 0.997, SI was 0.058, and R² was 0.988 for the test data set, indicating excellent prediction performance for the wave transmission coefficient. It is therefore deemed that such a machine-learning-based prediction method can be applied to various predictive studies in the field of coastal engineering, moving beyond existing empirical approaches.

10-Fold Validation Analysis
To verify the performance of the GBR model, we utilized 10-fold cross-validation. This method was developed to minimize the bias associated with random sampling of the training set. The entire data sample was divided into 10 parts: nine were used for training and one for validation, and the process was repeated ten times consecutively so that each part served once as the validation set. The 10-fold cross-validation method ensured the generalization and reliability of the model performance. Figure 5 shows the model performance results obtained using 10-fold cross-validation. Figure 5a shows R² for each fold; it fluctuates slightly, with minimum and maximum values of 0.958 and 0.987, respectively. Figure 5b shows a minimum MSE of 0.97 × 10⁻³ and a maximum of 2.70 × 10⁻³, showing that all errors are minimal and a high level of accuracy is maintained. Table 4 summarizes the model performance and statistics of the 10-fold cross-validation: the mean R² is 0.973 with a standard deviation of 0.009, demonstrating a small spread, and the MAE and MAPE are 0.027 and 0.080, respectively, indicating small prediction errors.
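A sketch of the 10-fold procedure with scikit-learn, again using synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the 260-sample LCS dataset
X, y = make_regression(n_samples=260, n_features=7, n_informative=5,
                       noise=0.1, random_state=0)

# 10-fold CV: each of the ten folds serves once as the validation set
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X, y, cv=10, scoring="r2")
print(scores.mean(), scores.std())
```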

Feature Importance Analysis
Around coastal structures, the wave attenuation effect is not independent of the input variables but results from complex interactions among them. Thus, the relative importance of each variable to the model over all observations should also be analyzed. Since the importance of an input variable measures how much it affects the dependent variable, analyzing the correlation between the input and dependent variables is important. Therefore, we analyzed the importance of the variables that affect the wave attenuation of the LCS. Figure 7 shows the importance of the input variables for the dependent variable (wave transmission coefficient) when the 260 hydraulic model experiment results are applied to the GBR model. Figure 7a shows the variable importance; the x-axis represents the average of the absolute Shapley values of each input variable over the data. In short, the x-axis value is the average influence of the input variable on the dependent variable, and the larger the value, the greater the influence on wave attenuation.
As a result of the variable importance analysis, the SHAP value of the relative freeboard (Rc/H0) was 0.116, which verifies that Rc/H0 was the most dominant parameter for wave control and wave energy reduction in the hydrodynamic behavior around the LCS. Next, the relative freeboard to water depth ratio (Rc/h) and relative structure height (hc/h) were 0.062 and 0.042, respectively, showing that the freeboard-related input variables account for over 80% of the influence on the total wave height attenuation behind the structure. The freeboard should therefore be prioritized for wave control in the design of the structure, as the attenuation of wave energy passing over the LCS, along with wave breaking, increases as the freeboard increases. The SHAP values of the ratio of the crest width to wavelength (B/L0) and the relative crest width (B/H0) were 0.023 and 0.017, respectively, indicating that the crest-width-related input variables account for more than 8.6% of the total influence on wave attenuation. Figure 7b shows a summary plot combining the feature importance and feature effects of the input variables, arranged so that the feature with the highest importance is at the top. The stronger the red shading of a feature value, the more positive its influence on the wave transmission coefficient (Kt); the stronger the blue shading, the more negative the influence. As a result, as Rc/H0, Rc/h, hc/h, B/L0, and B/H0 increased, the wave transmission coefficient decreased, and as the surf similarity coefficient (ξ) increased, the wave transmission coefficient tended to increase. This sensitivity trend is in line with engineering practice and the physical background.

Influence of Input Variable Number
In this study, we analyzed the model accuracy by applying an input variable set consisting of seven dimensionless numbers (X = {X1: Rc/H0, X2: B/H0, X3: ξ, X4: B/L0, X5: Rc/h, X6: Dn50/hc, and X7: hc/h}). If a model with high accuracy can be built by excluding insignificant input variables and retaining only the important ones, computational complexity can be reduced and good results obtained in terms of time efficiency. Accordingly, we analyzed the effect on model performance when some input variables or data were withheld, using various combinations. Table 5 presents the model performance for the eight combinations of input variables, and Figure 8 compares the predicted and experimental values for the eight combinations. Combination 1, which applied all seven dimensionless input variables, showed the highest accuracy, with an MSE of 0.8 × 10−3 and an R2 of 0.988. In contrast, combination 7, which applied the input variables X2: B/H0, X3: ξ, and X6: Dn50/hc, showed the lowest accuracy, with an MSE of 22.9 × 10−3 and an R2 of 0.668. In addition, combination 8, which applied three input variables (X1: Rc/H0, X5: Rc/h, X7: hc/h), showed relatively high accuracy, with an MSE of 3.7 × 10−3 and an R2 of 0.947, despite the small number of input variables. It is worth noting that model accuracy does not simply increase with the number of input variables, as the combination 1-8 results show. Moreover, combination 2 did not include the relative freeboard (X1: Rc/H0), which the sensitivity analysis had identified as the most important factor; nevertheless, combination 2 achieved relatively high accuracy, with an MSE of 1.3 × 10−3 and an R2 of 0.981.
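A combination study of this kind amounts to retraining the model on column subsets and comparing test errors. A minimal sketch with synthetic stand-in data (the subsets shown are illustrative, not the paper's exact eight combinations):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(260, 7))
# Toy stand-in response dominated by the freeboard-related columns 0, 4, 6
y = -0.5 * X[:, 0] - 0.25 * X[:, 4] - 0.15 * X[:, 6] + 0.03 * rng.normal(size=260)

# Hypothetical subsets mirroring the spirit of the combination study
combos = {
    "all seven":      [0, 1, 2, 3, 4, 5, 6],
    "freeboard only": [0, 4, 6],   # Rc/H0, Rc/h, hc/h
    "no freeboard":   [1, 2, 5],   # B/H0, xi, Dn50/hc
}

# 80/20 split, as in the paper's training/prediction protocol
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
mse = {}
for name, cols in combos.items():
    model = GradientBoostingRegressor(random_state=0).fit(Xtr[:, cols], ytr)
    mse[name] = mean_squared_error(yte, model.predict(Xte[:, cols]))
    print(f"{name:>14s}: MSE = {mse[name]:.4f}")
```

Under this toy response, dropping the freeboard terms degrades the test MSE far more than dropping everything else, echoing the paper's finding for combinations 6-8.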
Even though the relative freeboard (X1: Rc/H0) was omitted, combination 2 achieved high accuracy because it retained the other freeboard-related factors (X5: Rc/h, X7: hc/h). However, combinations 6 and 7, which omitted all freeboard-related factors (X1: Rc/H0, X5: Rc/h, X7: hc/h), showed low accuracy. In summary, the freeboard-related factors (X1: Rc/H0, X5: Rc/h, X7: hc/h) are the most important input variables for obtaining high-accuracy predictions. Figure 9a,b show the results of substituting all experimental data into the empirical formulas for the wave transmission coefficient of low-crested submerged breakwaters suggested by Van der Meer [31] and D'Angremond [32]. Results outside the effective range (0.075 < Kt < 0.8) of the empirical formulas were excluded. For the Van der Meer empirical formula, the MSE was 0.009 and the coefficient of determination (R2) was 0.81, indicating that the formula overall overestimated the experimental values (Figure 9a). For the D'Angremond empirical formula, the MSE was 0.006 and R2 was 0.84, showing fewer errors than the Van der Meer formula and higher prediction accuracy (Figure 9b). However, for the empirical formulas proposed by Van der Meer [31] and D'Angremond [32], the applicable formula is selected according to the surf similarity parameter (ξ = tan α/√(Hi/L0)) and the relative crest width (B/H0), and the effective range of the wave transmission coefficient is limited to 0.075 < Kt < 0.8, so uncertainty increases outside this range.
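For reference, a commonly cited form of the D'Angremond et al. transmission formula for permeable low-crested structures can be sketched as below; the clipping to the stated validity range is exactly the limitation discussed above (treat this as an illustrative rendering, not the paper's implementation):

```python
import math

def kt_dangremond(rc_hi, b_hi, xi):
    """Commonly cited D'Angremond et al. (1996) estimate for permeable LCS:
    Kt = -0.4*(Rc/Hi) + 0.64*(B/Hi)^(-0.31)*(1 - exp(-0.5*xi)),
    clipped to the stated validity range 0.075 <= Kt <= 0.8."""
    kt = -0.4 * rc_hi + 0.64 * b_hi ** (-0.31) * (1.0 - math.exp(-0.5 * xi))
    return min(max(kt, 0.075), 0.8)

# A submerged crest (negative freeboard) transmits more than an emergent one
kt_sub = kt_dangremond(-0.5, 2.0, 3.0)
kt_em = kt_dangremond(0.5, 2.0, 3.0)
print(kt_sub, kt_em)
```

The hard clip at 0.075 and 0.8 is what forces the exclusion of out-of-range data in Figure 9a,b, whereas the GBR model needs no such restriction.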
Figure 9c shows the results of substituting all experimental data into the empirical formula for the wave transmission coefficient of a low-crested submerged breakwater proposed by Bleck and Oumeraci [33]. For this formula, the MSE was 0.017 and the coefficient of determination (R2) was 0.71, indicating low overall agreement with the experimental values. In addition, for experimental values below 0.32, the predicted transmission coefficient was 0.170, giving low prediction accuracy. The wave transmission coefficient of the LCS should account for various factors, such as the crest freeboard, crest width, and porosity; however, the empirical formula of Bleck and Oumeraci [33] considers only the relative freeboard (Rc/H0), which led to lower prediction accuracy than the other empirical formulas. Table 6 compares the statistical indicators of the existing empirical formulas and the GBR model. All statistical indicators showed that the boosting model achieved higher prediction accuracy than the existing empirical formulas. Furthermore, unlike the empirical formulas, the boosting model requires neither an effective range for the wave transmission coefficient nor a separate formula dependent on the input variables (Figure 9d). In summary, a highly accurate wave transmission coefficient can be predicted from the seven input variables required by the machine learning model, because machine learning models can capture the non-linear relationships between the independent and dependent variables. The empirical formulas permit analysis only within the effective range of the wave transmission coefficient, whereas the GBR model shows good predictive performance over all ranges. Table A1 lists the ranges of the parameters.

Conclusions
In this study, we investigated hydrodynamic performance modeling of a low-crested structure using 10 machine learning models, including linear and non-linear models. To construct the models, we used 260 hydraulic model test records, with 80% for training and 20% for prediction. To predict the wave transmission coefficient behind the structure, we applied seven dimensionless parameters (Rc/H0, B/H0, ξ, B/L0, Dn50/hc, Rc/h, and hc/h). In addition, we evaluated the correlation between the input variables and the dependent variable by analyzing the main factors affecting the machine learning predictions using XAI. The linear models (M8, M9, and M10) showed low prediction accuracy for the wave transmission coefficient; however, among the ensemble techniques, the GBR model (M2) in particular showed the highest accuracy in predicting the wave transmission coefficient of a structure from the given input variables. To validate the machine learning models, we performed a 10-fold cross-validation, which yielded an R2 of 0.973 and a mean MAPE of 2.7%, confirming a significantly low prediction error. This small error reasonably supports the generalization capability of the model. Based on the sensitivity analysis, we confirmed that the relative freeboard (Rc/H0) and the relative freeboard to water depth ratio (Rc/h) were the most significant independent variables. As a result, the freeboard was found to be the most dominant factor influencing wave attenuation in the hydraulic behavior around the LCS. In addition, we comprehensively compared the results of the empirical formulas and the machine learning models.
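The 10-fold cross-validation described above can be sketched with scikit-learn; the data here are a synthetic stand-in for the 260 hydraulic test records, so the scores are illustrative only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in for 260 records of the seven dimensionless inputs
X = rng.uniform(size=(260, 7))
y = -0.5 * X[:, 0] - 0.25 * X[:, 4] - 0.15 * X[:, 6] + 0.03 * rng.normal(size=260)

# 10-fold cross-validation of the gradient boosting model, scored by R2
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=cv, scoring="r2")
print(f"mean R2 over 10 folds: {scores.mean():.3f}")
```

Shuffled k-fold splitting guards against any ordering in the test records; swapping `scoring` to `"neg_mean_absolute_percentage_error"` would reproduce the MAPE-based check.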
In the wave transmission prediction of the trained gradient boosting model, the MSE was 0.8 × 10−3, the index of agreement (I) was 0.997, the scatter index (SI) was 0.058, and R2 was 0.988, indicating high prediction accuracy and improved wave transmission coefficient prediction performance compared with the existing empirical results. Since machine learning can model non-linear relationships, the wave transmission coefficient of an LCS can be predicted precisely and efficiently, in contrast to the regression methods adopted by the existing empirical formulas. The constructed automated machine learning pipeline can be applied not only to wave attenuation studies on LCS, but also to various other applications in coastal engineering.
Author Contributions: Conceptualization, T.K. and Y.K.; development, T.K. and S.K.; writing, T.K., S.K. and Y.K.; data analysis, T.K. and Y.K. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.