Nearshore Wave Predictions Using Data Mining Techniques during Typhoons: A Case Study near Taiwan’s Northeastern Coast

: Seasonal typhoons provide energy into the wave ﬁeld in summer and autumn in Taiwan. Typhoons lead to abundant wave energy near the coastal area and cause storm surges that can destroy offshore facilities. The potential for wave energy can be obtained from analyzing the wave height. To develop an effective model for predicting typhoon-induced wave height near coastal areas, this study employed various popular data mining models—namely k-nearest neighbors (kNN), linear regressions (LR), model trees (M5), multilayer perceptron (MLP) neural network, and support vector regression (SVR) algorithms—as forecasting techniques. The principal component analysis (PCA) was then performed to reduce the potential variables from the original data at the ﬁrst stage of data preprocessing. The experimental site was the Longdong buoy off the northeastern coast of Taiwan. Data on typhoons that occurred during 2002–2011 and 2012–2013 were collected for training and testing, respectively. This study designed four PCA cases, namely EV1, TV90, TV95, and ORI: EV1 used eigenvalues higher than 1.0 as principal components; TV90 and TV95 used the total variance percentages of 90% and 95%, respectively; and ORI used the original data. The forecast horizons varying from 1 h to 6 h were evaluated. The results show that (1) in the PCA model’ cases, when the number of attributes decreases, computing time decreases and prediction error increases; (2) regarding classiﬁed wave heights, M5 provides excellent outcomes at the small wavelet wavelet level; MLP has favorable outcomes at the large wavelet and small/moderate wave levels; meanwhile, SVR gives optimal outcomes at the long wave and high/very high wave levels; and (3) for performance of lead times, MLP and SVR achieve more favorable relative weighted performance without consideration of computational complexity; however, MLP and SVR might obtain lower performance when computational complexity is considered.


Introduction
Taiwan is situated on one of the main paths of western North Pacific typhoons and is affected by an average of four typhoons each year [1]. As a typhoon approaches Taiwan, the sea level rises abnormally due to storm surges caused by strong winds and atmospheric pressure disturbances [2]. Typhoons lead to abundant wave energy near the coastal area and cause storm surges that can destroy offshore facilities.
The central mountain range (CMR, shown in Figure 1), which is 340 km long and 80 km wide with an average height of 2500 m, is Taiwan's principal mountain range. As a typhoon approaches Taiwan, the topography significantly influences its track and intensity [3,4]. Consequently, typhoon circulation interaction with the topography produces considerable mesoscale variations in pressure, wind, precipitation, and offshore waves near Taiwan [5][6][7][8]. Typhoon-complicated wind distribution often causes storm surges that can destroy offshore facilities and coastal infrastructure as well as A useful scheme for nearshore wave forecasts during typhoons is highly desirable in Taiwan. Data mining, which has popularized data-driven models, has been in widespread use in coastal and oceanic applications. These methods are based on the analysis of all data characterizing the system under study to determine an unknown mapping or dependencies between the system's input and output from the available data [11]. Numerous applications are used in water-related modeling, including some in wind-wave modeling [12][13][14][15]. For example, Londhe [16] used artificial neural networks (ANNs) and genetic programming for the estimation of missing wave heights in the eastern Gulf of Mexico. Etemad-Shahidi and Mahjoobi [17] used model trees (M5) for the same purpose, and the results were compared with those of ANNs. Tsai and Tsai [18] established a backpropagation (BP) neural network model to predict four surface waves by using acoustic wave and pressure data measured by ultrasonic wave gauges. Deka and Prahlada [19] developed a hybrid model of wavelets and an ANN to forecast significant wave heights for higher lead times up to 48 h on the west coast of India. Asma et al. [20] used multiple linear regressions (LR) and ANNs to describe the significant wave height off Goa, located on the west coast of India. The aforementioned studies have been widely used in hindcasting and forecasting of wave parameters. They indicated that neural networks can provide a good alternative to statistical regression, time series analysis and numerical methods. The advantages are due to the improved accuracy, simplicity, smaller computational efforts and in some cases less data requirements. However, the complicated meteorological conditions resulting from interactions between typhoons and terrain effects in Taiwan continue to pose tremendous challenges to current wave models in the predicting of accurate wave heights.
The presented data mining models are designed for hindcasting significant wave heights for nearshore during tropical cyclones (typhoons) in the paper. As mentioned, when a typhoon approaches Taiwan, the nearshore waves are significantly interacted by a typhoon's circulation and topography factor. This study developed a methodology to address the complicated problem of wave forecasts during typhoons, and designed a series of steps to deal with the modeling process (see Section 3). The novelty of this research has two main points. First, this study investigated several observations from meteorological stations, which are near the study site (a buoy), and collected the typhoon characteristics, and buoy atmospheric and maritime properties. Second, because a large number of attributes would be employed as model inputs, this study developed an improved approach that combines the principal component analysis (PCA) technique and data mining models. Some researchers, such as [21][22][23][24], concluded that PCA can be employed to find a A useful scheme for nearshore wave forecasts during typhoons is highly desirable in Taiwan. Data mining, which has popularized data-driven models, has been in widespread use in coastal and oceanic applications. These methods are based on the analysis of all data characterizing the system under study to determine an unknown mapping or dependencies between the system's input and output from the available data [11]. Numerous applications are used in water-related modeling, including some in wind-wave modeling [12][13][14][15]. For example, Londhe [16] used artificial neural networks (ANNs) and genetic programming for the estimation of missing wave heights in the eastern Gulf of Mexico. Etemad-Shahidi and Mahjoobi [17] used model trees (M5) for the same purpose, and the results were compared with those of ANNs. Tsai and Tsai [18] established a backpropagation (BP) neural network model to predict four surface waves by using acoustic wave and pressure data measured by ultrasonic wave gauges. Deka and Prahlada [19] developed a hybrid model of wavelets and an ANN to forecast significant wave heights for higher lead times up to 48 h on the west coast of India. Asma et al. [20] used multiple linear regressions (LR) and ANNs to describe the significant wave height off Goa, located on the west coast of India. The aforementioned studies have been widely used in hindcasting and forecasting of wave parameters. They indicated that neural networks can provide a good alternative to statistical regression, time series analysis and numerical methods. The advantages are due to the improved accuracy, simplicity, smaller computational efforts and in some cases less data requirements. However, the complicated meteorological conditions resulting from interactions between typhoons and terrain effects in Taiwan continue to pose tremendous challenges to current wave models in the predicting of accurate wave heights.
The presented data mining models are designed for hindcasting significant wave heights for nearshore during tropical cyclones (typhoons) in the paper. As mentioned, when a typhoon approaches Taiwan, the nearshore waves are significantly interacted by a typhoon's circulation and topography factor. This study developed a methodology to address the complicated problem of wave forecasts during typhoons, and designed a series of steps to deal with the modeling process (see Section 3). The novelty of this research has two main points. First, this study investigated several observations from meteorological stations, which are near the study site (a buoy), and collected the typhoon characteristics, and buoy atmospheric and maritime properties. Second, because a large number of attributes would be employed as model inputs, this study developed an improved approach that combines the principal component analysis (PCA) technique and data mining models. Some researchers, such as [21][22][23][24], concluded that PCA can be employed to find a set of orthogonal components that minimize the error in the reconstructed data. This paper presents popular data mining techniques, including k-nearest neighbors (kNN), LR, M5, multilayer perceptron (MLP) neural network, and support vector regressions (SVR) that are often used for ensemble forecasting problems. The paper selected these models because the kNN, a typical non-parametric classifier, involves using neighbor search algorithms to achieve computational tractability [25]. LR is a traditional regression analysis modeling technique, and can be used as a benchmark model. M5, developed by [26], is a technique for dealing with continuous class learning problems that provides a structural representation of the data and a piecewise linear fit of the class. An ANN-based MLP network is extensively used to model an unknown system with observable inputs and outputs, and widely employed because of its flexibility and ease of use. Support vector machines (SVM), developed by [27], adhere to the principle of structural risk minimization seeking to minimize an upper bound of generalization error. SVM has been extended to solve nonlinear regression estimation problems, known as SVR. These models were employed for forecasting hourly wave heights during typhoon invasions.
The remainder of this paper is organized as follows. Section 2 describes the study area and selected typhoons and data. Section 3 describes the proposed prediction model. Section 4 outlines the model development regarding PCA approach and data mining models. Section 5 evaluates the performance of predicted results, and various lead time forecasts. Finally, Section 6 offers some conclusions. Figure 1 shows the experimental area and location (121.92 • E, 25.10 • N) of the Longdong buoy near Taiwan's northeastern coast. As mentioned, when a typhoon approaches Taiwan, the terrain of the CMR significantly affects a typhoon's circulation and its track. Therefore, this study investigated three meteorological stations, namely Keelung, Penjia Islet, and I-lan, which are near the Longdong buoy, and collected their meteorological observations.

Location and Typhoons
The typhoon data used in this study were recorded in the western Pacific Ocean, northeast of Taiwan. Table A1 of Appendix A lists 67 typhoons affecting offshore northeast Taiwan in the years 2002-2013 and their corresponding wind intensity scales. The Saffir-Simpson wind scale (Table 1) classifies tropical cyclones into seven categories (i.e., categories 1-5, tropical storm, and tropical depression) depending on the intensities of their maximum sustained winds. Table 1 shows the classification of the 67 typhoons into these seven categories. In total, 24 typhoons were classified as tropical storms, and 10 as category 3, the strongest among the studied typhoons. Whenever a typhoon approached the study area, various meteorological and oceanic characteristics are measured by the Central Weather Bureau (CWB) of Taiwan.

Typhoon Characteristics
The attributes of typhoon characteristics, denoted as {H}, are latitude and longitude of the typhoon center (unit: deg) (denoted as h 1 and h 2 ), pressure at the typhoon center (mb) (h 3 ), radius of the typhoon (km) (h 4 ), moving speed of the typhoon (km/h) (h 5 ), and intensity of the typhoon (m/s) (h 6 ). For the 10 category 3 typhoons, the mean value of typhoon radius is 237.1 km (which is 1.18 times the average of all records) and the mean value of typhoon intensity is 45.7 m/s (1.36 times).

Buoy Atmospheric Properties
The atmospheric properties of the Longdong buoy at 3 m height, denoted as {P}, are wind speed at the buoy (m/s) (p 1 ), wind direction at the buoy (deg) (p 2 ), gusty wind speed at the buoy (m/s) (p 3 ), temperature above the buoy ( • C) (p 4 ), and atmospheric pressure above the buoy (hPa) (p 5 ). For the 10 category 3 typhoons, the mean value of wind speed at the buoy is 8.5 m/s (which is 1.21 times the average of all records).

Buoy Maritime Properties
The maritime properties of the Longdong buoy, denoted as {B}, are wave height (h) at the buoy (m) (b 1 ), maximum wave period (s) (b 2 ), average wave period (s) (b 3 ), wave direction (deg) (b 4 ), and sea surface temperature ( • C) (b 5 ). Attribute b 1 refers to significant wave height, h 1/3 . Significant wave height is defined as the average of the highest one-third of the crest to trough waves in that segment of time series by sifting through each individual trough to crest waves in the data [28]. For the 10 category 3 typhoons, the mean value of wave height at the buoy is 2.8 m (which is 1.31 times the average of all records).

Methodology
This section presents a methodology for developing a usable scheme for forecasting wave heights during typhoon periods. As described in the previous section, the model inputs comprise the six attributes {H} for typhoon characteristics, 33 attributes {M} at three ground meteorological stations (11 attributes per station), and 10 attributes at a buoy (five attributes {P} for buoy atmospheric properties and five attributes {B} for buoy maritime properties). In total, 50 input attributes are used for modeling. First, for lead time = 1 h, the model target is the 1-h ahead observed wave heights {b 1,t+1 } at the buoy. To reduce the dimension of model inputs, the study employed the PCA approach (theorem see Section 3.1) to generate input-output relationships. PCA can gather highly correlated independent variables into a principal component, and all principal components are independent of each other so that the analysis' only function is to transform a set of correlated variables into a set of uncorrelated principal components [29].
The main processing stages are plotted in Figure 2. First, the collected attributes are preprocessed as the model inputs. This study then employs the PCA approach to derive new variables from original attributes, and select the eigenvalues greater than a threshold as principal components. Thus, the new PCA-derived attributes can be obtained and used to formulate the wave-height data-driven prediction models using PCA-derived attributes. The five data mining algorithms, kNN, LR, M5, MLP, and SVR, are involved. Here, the height data-driven prediction models using the original attributes are employed as a benchmark model. To evaluate the utility of these data-driven approaches, results were compared with those from various model cases, namely EV1, TV90, TV95, and ORI: EV1 used eigenvalues higher than 1.0 as principal components; TV90 and TV95 used the total variance percentages of 90% and 95%, respectively; and ORI used the original data. To examine the robustness for the various lead times, the forecast horizons vary from 1 to 6 h in the nearshore of northeast Taiwan. Finally, the most accurate model is suggested as the prediction model. The theories of PCA technique and five data mining algorithms were described in the sequence sections.

Methodology
This section presents a methodology for developing a usable scheme for forecasting wave heights during typhoon periods. As described in the previous section, the model inputs comprise the six attributes {H} for typhoon characteristics, 33 attributes {M} at three ground meteorological stations (11 attributes per station), and 10 attributes at a buoy (five attributes {P} for buoy atmospheric properties and five attributes {B} for buoy maritime properties). In total, 50 input attributes are used for modeling. First, for lead time = 1 h, the model target is the 1-h ahead observed wave heights { 1, +1 } at the buoy. To reduce the dimension of model inputs, the study employed the PCA approach (theorem see Section 3.1) to generate input-output relationships. PCA can gather highly correlated independent variables into a principal component, and all principal components are independent of each other so that the analysis' only function is to transform a set of correlated variables into a set of uncorrelated principal components [29].
The main processing stages are plotted in Figure 2. First, the collected attributes are preprocessed as the model inputs. This study then employs the PCA approach to derive new variables from original attributes, and select the eigenvalues greater than a threshold as principal components. Thus, the new PCA-derived attributes can be obtained and used to formulate the wave-height data-driven prediction models using PCA-derived attributes. The five data mining algorithms, kNN, LR, M5, MLP, and SVR, are involved. Here, the height data-driven prediction models using the original attributes are employed as a benchmark model. To evaluate the utility of these data-driven approaches, results were compared with those from various model cases, namely EV1, TV90, TV95, and ORI: EV1 used eigenvalues higher than 1.0 as principal components; TV90 and TV95 used the total variance percentages of 90% and 95%, respectively; and ORI used the original data. To examine the robustness for the various lead times, the forecast horizons vary from 1 to 6 h in the nearshore of northeast Taiwan. Finally, the most accurate model is suggested as the prediction model. The theories of PCA technique and five data mining algorithms were described in the sequence sections.

PCA Approach
In data analysis, reducing the dimensionality of data is vital because it helps understand the data and decreases computational cost [30]. The PCA approach is used in the proposed prediction model to simplify and orthogonalize the original data set in order to optimize the models because a small set of uncorrelated variables is easier to understand and use in further analyses than is a larger set of correlated variables [22]. PCA was first introduced by Pearson [31] and developed by Hotelling [32]. PCA is a linear decomposition of data that results in a set of eigenvectors that form the basis for a new coordinate system [33]. Principal components performed on a data matrix of (N samples × P variables) yield linearly transformed random variables that exhibit special properties in terms of variances. In effect, transforming the original vector variables to the vector of principal components to a rotation of coordinate axes to a new coordinate system that possesses inherent statistical properties [34]. The first principal component (Y 1 ) can be defined as a linear combination of the elements of the data matrix, that is [23]: where coefficients chosen to maximize the variance represented by the first principal component are simply the eigenvectors of the symmetric covariance matrix. The eigenvalues of the covariance matrix represent the variation of each principal component, where Var(Y i ) = λ i . Ideally, a PCA yields several components that describe the majority of the total variation of the data set.

kNN
The difficulty in using kNN is determining the nearest k neighbors for a query point from a given data set. This difficulty occurs in scientific and engineering applications, including pattern recognition, data clustering, and function approximation [35]. kNN is a type of instance-based learning in which a function is only approximated locally, and all calculations are deferred until classification [36]. This method can be used for regression by simply assigning the property value for an object as the average of the values of its kNN [37]. The kNN can be defined by [38]. Given a set of N points, [p i ] N i=1 , defined in real d-dimensional space, S ⊂ d , and a query point, q ∈ d , the k points of S with a minimal Euclidean distance to q are determined, where k ≥ 1. This problem is solved simply by calculating the distances between the N points in S to the query point q. The k points with the smallest distances comprise the desired subset of S; that is, the kNN to q.

LR
LR is the most widely used of all statistical techniques. LR estimates the most efficient-fitting linear equation for predicting the output field according to the input fields. The regression equation represents a straight line or plane that minimizes the squared differences between predicted and actual output values [39,40]. The multiple LR can be defined as where Y is the (p × 1) vector of observations, X is the (p × q) vector containing q input factors, β is the (q × 1) vector containing the regression coefficients, and E is the (p × 1) vector containing the noise terms.

M5
M5 combines a conventional decision tree with the possibility of LR functions at the leaves [41]. An M5 tree-based model is constructed using the divide-and-conquer method. Splitting in M5 ceases when the class values of all the instances that reach a node vary only slightly, or when only a few Energies 2018, 11, 11 7 of 23 instances remain. The splitting criterion is based on treating the standard deviation of the class values that reach a node as a measure of the error at that node, and calculating the expected reduction in error as a result of testing each attribute at the node [42]. The attribute that maximizes the expected error reduction is chosen. Standard deviation reduction (SDR) is calculated by the formula where T is the set of examples that reaches the node, T i is the subset of cases that have the ith outcome of the potential test, and sd() is the function of the standard deviation.

MLP
ANNs are the information processing systems that are inspired by the models of biological neural networks. MLP is a feedforward network in which there may be one or more hidden layers besides one input and one output layer. Studies have suggested that one hidden layer may be sufficient for most problems [43][44][45]. All layers except the output layer contain a bias or threshold node whose output is set to a fixed value of 1. All the nodes of a lower layer are connected to all the nodes of an upper layer through links called weights. For MLP, the procedure used to perform the learning process is called the learning algorithm, the function of which is to modify the synaptic weights of the network in an orderly fashion to attain a desired design objective [46]. The BP algorithm, a generalized steepest descent algorithm, is the most popular learning technique used to train the MLP. The weights of the MLP are updated by using the BP algorithm during the training phase. The knowledge acquired by the network after learning is stored in its weights in a distributed manner.

SVR
SVR is a learning machine based on statistical learning theory. Let where x is the input vector; f (x) is the output vector; ω and b are the slope and offset of the regression line, respectively; and R d is d-dimensional input space. In SVR, the regression function is calculated by minimizing: where (1/2)||w|| 2 is the term characterizing the model complexity, M is the number of support vectors, and c(f (x i ), y i ) is the loss function determining how the distance between f (x i ) and the target values y i should be penalized. The constrained optimization problem can be reformulated into dual problem formalism by using Lagrange multipliers [27], which leads to the solution: where α i and α * i are the Lagrange multipliers, and K(x i , x) represents the kernel function. A suitable kernel function makes it possible to map a nonlinear input space to a high-dimensional feature space, in which LR can be performed [47]. The researcher adopted the prevalent radial basis function (RBF) kernel. The Euclidean distance-based RBF-kernel function is defined as follows: Energies 2018, 11, 11 8 of 23 where σ is the standard deviation that determines the width of the RBF kernel.

Model Development
The collected typhoons were classified into training and testing sets when modeling. The training data (years 2002-2011, 56 typhoons) were used to build the models, and the testing data (years 2012-2013, 11 typhoons) were used to evaluate the performance levels of the built models.

PCA
PCA was performed on the original data to explore possibilities for data dimension reduction. Figure 3 plots the eigenvalue of components and total variance explained for wave-height predictions. On the basis of [48], the components with an eigenvalue greater than one are suggested to replace the original attribute values for further analysis. Thus, the researcher selected eigenvalues greater than 1.0 as principal components. As Figure 3 shows, 10 components are greater than 1.0, and the corresponding total variance percentage of these components is 77.29%. Through PCA, the PCA-derived components (variables) then replaced the original ones. This model case is called EV1 in the subsequent section. The relationships between new variables and original inputs can be expressed in a matrix form, such as where h 1 is the normalized input h 1 , and y j are the PCA-derived inputs (j = 1, 10).

Model Development
The collected typhoons were classified into training and testing sets when modeling. The training data (years 2002-2011, 56 typhoons) were used to build the models, and the testing data (years 2012-2013, 11 typhoons) were used to evaluate the performance levels of the built models.

PCA
PCA was performed on the original data to explore possibilities for data dimension reduction. Figure 3 plots the eigenvalue of components and total variance explained for wave-height predictions. On the basis of [48], the components with an eigenvalue greater than one are suggested to replace the original attribute values for further analysis. Thus, the researcher selected eigenvalues greater than 1.0 as principal components. As Figure 3 shows, 10 components are greater than 1.0, and the corresponding total variance percentage of these components is 77.29%. Through PCA, the PCA-derived components (variables) then replaced the original ones. This model case is called EV1 in the subsequent section. The relationships between new variables and original inputs can be expressed in a matrix form, such as where ℎ 1  is the normalized input h1, and yj are the PCA-derived inputs (j = 1, 10). To compare with EV1, the researcher also selected the total variance percentage of 90% and 95%, whose corresponding components are 19 and 25, respectively (Figure 3). The two model cases were named TV90 and TV95. For TV90, the relationships between new variables and original inputs was expressed as For TV95, the relationships between new variables and original inputs was expressed as To compare with EV1, the researcher also selected the total variance percentage of 90% and 95%, whose corresponding components are 19 and 25, respectively (Figure 3). The two model cases were named TV90 and TV95. For TV90, the relationships between new variables and original inputs was expressed as Additionally, the model case that used the original data was designed as another case, namely, ORI.

Model Constructions
The parameter calibrations of various model cases were subjected to sensitivity analysis. The root mean square error (RMSE) was used to measure the errors of the results, that is, where O pre i is the prediction for record i, O obs i is the observation for record i, and n r is the number of records. The smaller the RMSE criteria, the more favorable is the performance of the predicted outcome.
First, the processes of modeling the EV1 are described. For kNN, the favorable choice of k depends on the data. In this study, the k neighbor points were chosen by performing sensitivity analysis. Figure 4a plots the RMSE calculations using the testing data set. In the figure, the optimal RMSE value was at k = 14. For LR, the researcher used a stepwise regression method and specified selection criteria based on the statistical probability (p value) associated with each field. The criteria and stepwise estimation were used to add and remove fields. The p value used was 0.05.
Additionally, the model case that used the original data was designed as another case, namely, ORI.

Model Constructions
The parameter calibrations of various model cases were subjected to sensitivity analysis. The root mean square error (RMSE) was used to measure the errors of the results, that is, where O pre is the prediction for record i, O obs is the observation for record i, and nr is the number of records. The smaller the RMSE criteria, the more favorable is the performance of the predicted outcome. First, the processes of modeling the EV1 are described. For kNN, the favorable choice of k depends on the data. In this study, the k neighbor points were chosen by performing sensitivity analysis. Figure 4a plots the RMSE calculations using the testing data set. In the figure, the optimal RMSE value was at k = 14. For LR, the researcher used a stepwise regression method and specified selection criteria based on the statistical probability (p value) associated with each field. The criteria and stepwise estimation were used to add and remove fields. The p value used was 0.05. For M5, the minimum number of instances to allow at a leaf node were chosen by performing sensitivity analysis. According to Figure 4b, the suitable minimum number of instances can be chosen between 1 and 4. To train an MLP network, a three-layer feedforward network was created. This study introduced sigmoid and linear activity functions in the hidden and output layers, respectively. The number of neurons in the hidden layer was determined according to the method For M5, the minimum number of instances to allow at a leaf node were chosen by performing sensitivity analysis. According to Figure 4b, the suitable minimum number of instances can be chosen between 1 and 4. To train an MLP network, a three-layer feedforward network was created. This study introduced sigmoid and linear activity functions in the hidden and output layers, respectively. The number of neurons in the hidden layer was determined according to the method proposed by [49], namely adding up the number of neurons in the input and output payers, minusing the sum by 1, and dividing this number by 2. In addition, the parameters of learning rate and momentum were validated according to a sensitivity analysis. According to Figure 4c,d, the suitable learning rate and momentum can be set at 0.6 and 0.3, respectively. For SVR, the parameters involve the penalty parameter and small positive number, in which both parameters are embedded in dual problem formalism. Here, this study sets penalty parameter and small positive number at 1.0 and 0.001, respectively. As mentioned, the RBF kernel was employed. According to sensitivity analysis (Figure 4e), the favor σ value can be set at 2. The other model cases (ORI, TV95, and TV90) were calibrated by using the same processes as those in EV1.

Evaluation
As stated in the previous section, the various trained models were validated by the 11 testing typhoons (years 2012-2013). Figure 5 plots the prediction results of testing typhoons for four PCA cases using five models.
Energies 2018, 11, 11 10 of 23 proposed by [49], namely adding up the number of neurons in the input and output payers, minusing the sum by 1, and dividing this number by 2. In addition, the parameters of learning rate and momentum were validated according to a sensitivity analysis. According to Figure 4c,d, the suitable learning rate and momentum can be set at 0.6 and 0.3, respectively. For SVR, the parameters involve the penalty parameter and small positive number, in which both parameters are embedded in dual problem formalism. Here, this study sets penalty parameter and small positive number at 1.0 and 0.001, respectively. As mentioned, the RBF kernel was employed. According to sensitivity analysis (Figure 4e), the favor σ value can be set at 2. The other model cases (ORI, TV95, and TV90) were calibrated by using the same processes as those in EV1.

Evaluation
As stated in the previous section, the various trained models were validated by the 11 testing typhoons (years 2012-2013). Figure 5 plots the prediction results of testing typhoons for four PCA cases using five models.

Effect of Dimension Reduction
First, the researcher evaluated the effect of data dimension reduction using a PCA method on the prediction ability and computation efficiency. Figure 6 illustrates the relationships among RMSE prediction errors, computational time, and four designed cases by using a testing set. The total numbers of attributes (comprising input attributes and a target) are 11, 20, 26, and 51 for EV1, TV90, TV95, and ORI, respectively. As seen in these subfigures, when the number of model attributes decreases, the computing efficiency increases because of a decrease in computational time. For the kNN model,  kNN, LR, M5, MLP, and SVR models, respectively. LR takes advantage of computational time, because LR whose regressions depend linearly on their unknown parameters are easier to fit than are models (such as MLP and SVR) that are nonlinearly related to their parameters. Moreover, the researcher found that when the number of model attributes decreases, the number of prediction errors increase. The reason for this might be that although PCA processes data dimension reduction, it may partially lose pattern information.

Effect of Dimension Reduction
First, the researcher evaluated the effect of data dimension reduction using a PCA method on the prediction ability and computation efficiency. Figure 6 illustrates the relationships among RMSE prediction errors, computational time, and four designed cases by using a testing set. The total numbers of attributes (comprising input attributes and a target) are 11, 20, 26, and 51 for EV1, TV90, TV95, and ORI, respectively. As seen in these subfigures, when the number of model attributes decreases, the computing efficiency increases because of a decrease in computational time. For the kNN model, the computing time increased from 0.8 to 2.0 s using the four cases. For other models, the computing times range of 0.02-0.78 s, 2.95-4.70 s, 7.0-237.94 s, and 22.31-570.43 s were for the LR, M5, MLP, and SVR models, respectively. This study found that the computing times for kNN, LR, and M5 are short regardless of input attribute amounts. By contrast, the computing time on MLP and SVR rise considerably when the number of input attributes increases. For prediction errors, the results show that the RMSE values are in the range 0.988-1.032 m, 0.555-0.816 m, 0.553-0.793 m, 0.553-0.766 m, and 0.557-0.775 m for the kNN, LR, M5, MLP, and SVR models, respectively. LR takes advantage of computational time, because LR whose regressions depend linearly on their unknown parameters are easier to fit than are models (such as MLP and SVR) that are nonlinearly related to their parameters. Moreover, the researcher found that when the number of model attributes decreases, the number of prediction errors increase. The reason for this might be that although PCA processes data dimension reduction, it may partially lose pattern information. To easily present the overall performance with respect to the prediction errors and computational time, Table 3 lists the average of RMSE measures and computational times of the four PCA cases. As Table 3 shows, (1) in terms of prediction errors, the researcher determined that the optimal one occurs at MLP (RMSE = 0.691 m), and the secondary favorable one occurs at SVR (RMSE = 0.694 m), in which the RMSE is close to that in MLP; meanwhile, the least favorable one appears at kNN (RMSE = 1.034 m). (2) In terms of computational time, the favorable one is LR (0.2 s), and the most time-consuming is SVR (169.6 s). To easily present the overall performance with respect to the prediction errors and computational time, Table 3 lists the average of RMSE measures and computational times of the four PCA cases. As Table 3 shows, (1) in terms of prediction errors, the researcher determined that the optimal one occurs at MLP (RMSE = 0.691 m), and the secondary favorable one occurs at SVR (RMSE = 0.694 m), in which the RMSE is close to that in MLP; meanwhile, the least favorable one appears at kNN (RMSE = 1.034 m). (2) In terms of computational time, the favorable one is LR (0.2 s), and the most time-consuming is SVR (169.6 s).

Error Measures
To evaluate the performance of different wave strengths which are defined by CWB, the wave heights were classified into five levels, namely, small wavelets (<0.6 m), large wavelets (0.6-1.5 m), small and moderate waves (1.5-2.5 m), long waves (2.5-5.0 m), and high and very high waves (>5.0 m). The criterion used to assess the wave strengths were the mean absolute error (MAE), RMSE and correlation coefficient (r). MAE is defined as: Then, the r is defined as:  The analysis shows that the error measures from MAE and RMSE belong to "absolute errors." To assess the "relative errors" regarding the classified wave-height predictions, the relative MAE (rMAE) and coefficient of variation of the RMSE (CVRMSE) were calculated. rMAE is computed using Then, the CVRMSE is computed using Generally, precise predictions are those whose rMAE and CVRMSE are nearly 0. Figure 8 plots the performance rMAE and CVRMSE measures of the classified wave heights for the four model cases. Results from rMAE (Figure 8a-d) and RMSE (Figure 8h) show that the highest relative predicted errors seem to appear at the small wavelet level, whereas the lowest predicted errors occur at the long wave level. The analysis shows that the error measures from MAE and RMSE belong to "absolute errors." To assess the "relative errors" regarding the classified wave-height predictions, the relative MAE (rMAE) and coefficient of variation of the RMSE (CV RMSE ) were calculated. rMAE is computed using Then, the CV RMSE is computed using Generally, precise predictions are those whose rMAE and CV RMSE are nearly 0. Figure 8 plots the performance rMAE and CV RMSE measures of the classified wave heights for the four model cases. Results from rMAE (Figure 8a-d) and RMSE (Figure 8h) show that the highest relative predicted errors seem to appear at the small wavelet level, whereas the lowest predicted errors occur at the long wave level.

Performance Graphically Using Taylor Diagrams
In this section, the researcher employed Taylor diagrams to graphically indicate which of various prediction models is most realistic. The Taylor diagram, invented by [50], is an alternative means of understanding the relative merits of the forecast models. The diagram can provide a concise statistical summary of how well patterns match each other in terms of their correlation, their root-mean-square (RMS) difference, and the standard deviation.
The Taylor diagram shown in Figure 9 provides a summary of the relative skill with which five models simulate the various wave levels: (a) small wavelet, (b) large wavelet, (c) small/moderate wave, (d) long wave, and (e) high/very high wave. For Figure 9a, M5 agrees best with observations, with the lowest centered RMS error. Of the models, kNN has the lowest pattern correlation. For Figure 9b, MLP has a slightly higher correlation with observations and a slightly lower standard deviation as the observed. Although kNN and LR have about the same centered RMS errors, LR

Performance Graphically Using Taylor Diagrams
In this section, the researcher employed Taylor diagrams to graphically indicate which of various prediction models is most realistic. The Taylor diagram, invented by [50], is an alternative means of understanding the relative merits of the forecast models. The diagram can provide a concise statistical summary of how well patterns match each other in terms of their correlation, their root-mean-square (RMS) difference, and the standard deviation.
The Taylor diagram shown in Figure 9 provides a summary of the relative skill with which five models simulate the various wave levels: (a) small wavelet, (b) large wavelet, (c) small/moderate wave, (d) long wave, and (e) high/very high wave. For Figure 9a, M5 agrees best with observations, with the lowest centered RMS error. Of the models, kNN has the lowest pattern correlation. For Figure 9b, MLP has a slightly higher correlation with observations and a slightly lower standard Energies 2018, 11, 11 15 of 23 deviation as the observed. Although kNN and LR have about the same centered RMS errors, LR predicts the wave heights much better than kNN, resulting in a higher correlation with observations. For Figure 9c, MLP has the lowest standard deviation, and the highest correlation with observations. For Figure 9d, SVR has a slightly higher correlation with observations and a slightly lower standard deviation as the observed. Finally, in Figure 9e, SVR, MLP, M5 and LR agree best with observations, each with about the same RMS errors. SVR, however, has a slightly higher correlation with observations. Hand et al. [51] reported that ANN-based models (e.g., MLP and SVR) can more easily overcome the high dimensionality problem, that is, the problem that arises when a high number of input variables are present relative to the number of available observations. Energies 2018, 11, 11 15 of 23 predicts the wave heights much better than kNN, resulting in a higher correlation with observations. For Figure 9c, MLP has the lowest standard deviation, and the highest correlation with observations. For Figure 9d, SVR has a slightly higher correlation with observations and a slightly lower standard deviation as the observed. Finally, in Figure 9e, SVR, MLP, M5 and LR agree best with observations, each with about the same RMS errors. SVR, however, has a slightly higher correlation with observations. Hand et al. [51] reported that ANN-based models (e.g., MLP and SVR) can more easily overcome the high dimensionality problem, that is, the problem that arises when a high number of input variables are present relative to the number of available observations.

Error Measures
In this section, the researcher examines the forecast horizons varying from 1 to 6 h, and using the PCA case of TV90 as a demonstration. Figure 10 shows the observations and predictions at lead

Error Measures
In this section, the researcher examines the forecast horizons varying from 1 to 6 h, and using the PCA case of TV90 as a demonstration. Figure 10 shows the observations and predictions at lead time from 2 h to 6 h (predictions of 1 h can see Figure 5c). In addition, Figure 11 shows the scattered plots contrasting the observations with the predictions at the lead time at 1, 3, and 6 h. The plots that the slopes calculated are greater in the MLP model than in other models, and the kNN was the least correlation among the models. Figure 12 shows the model performance measures with corresponding rMAE, CV RMSE , and r values to enable a comparison of forecast horizons varying from 1-6 h. According to figure, as expected, the researcher found that the 1-h-ahead predictions were more accurate than the 2-h-ahead predictions, and the 6-h-ahead predictions were the least accurate among the predictions. This indicated that increased forecasting horizons yielded increased errors in the prediction of wave heights. time from 2 h to 6 h (predictions of 1 h can see Figure 5c). In addition, Figure 11 shows the scattered plots contrasting the observations with the predictions at the lead time at 1, 3, and 6 h. The plots that the slopes calculated are greater in the MLP model than in other models, and the kNN was the least correlation among the models. Figure 12 shows the model performance measures with corresponding rMAE, CVRMSE, and r values to enable a comparison of forecast horizons varying from 1-6 h. According to figure, as expected, the researcher found that the 1-h-ahead predictions were more accurate than the 2-h-ahead predictions, and the 6-h-ahead predictions were the least accurate among the predictions. This indicated that increased forecasting horizons yielded increased errors in the prediction of wave heights.      To determine the prediction ability, the researcher calculated the average performance measure for 1-6 h predictions (Table 4). The results showed that the MLP achieved more favorable performance (lower rMAE and CV RMSE , and higher r values) than did other models. To further determine the most suitable model for different lead times, the model accuracy and complexity were estimated and then the relative weighted performance was computed. The relative weighted performance comprises two terms: the ranking performance, which was estimated for each of the models based on r, MAE and RMSE, and the computational complexity, which comprises both the model training time and the test set evaluation time [52]. For the ranking performance, the best performing model on each of these measures is assigned the rank of 1 and the worst is 0. For MAE and RMSE, the rank of the model i is calculated as [53,54]: where e i is the measured value. For r, the rank is calculated as The total number of the best and worst ranking models is computed by where ρ = 2 is the weight shifting parameter, s i is the total number of successes or best cases for the model i, f i is the total number of failures or worst cases for the same model, and n is the total number of datasets. Subsequently, the relative weighted performance can be measured by where α and β are the weight parameters for ranking average accuracy against computational complexity. The average accuracy and computational complexity are denoted by a i and t i respectively.
Higher Z values indicate superior metrics. Figure 13 shows the relative weighted performance for the 1, 3, and 6 h lead forecast with respect to different β values. The Z values were calculated by assuming α = 1 and varying β from 0 to 2. From Figure 13a, MLP obtained the highest Z values when β = 0 (without consideration of computational complexity); however, M5 obtained the highest Z values when the β value increases. Similarly, Figure 13b,c, MLP and SVR obtained the higher Z values than M5, LR, and kNN when β = 0; however, MLP and SVR might obtained lower Z values than others when considering computational complexity. This indicates that the training periods of the MLP and SVR were much longer than those of the kNN, LR, and M5, because fitting all the training data and generating a global approximation in the MLP and SVR training process is typically time consuming.

Conclusions
This paper reports a comparative study of PCA-derived data-driven wave-height prediction models during typhoon periods. The used data mining models are kNN, LR, M5, MLP, and SVR. The experimental site was the Longdong buoy, off the northeastern coast of Taiwan. The processing stages of model development were undertaken. First, the researcher performed PCA and used the results to reduce the potential attributes from the original data at the data preprocessing stage. On the basis of the PCA, the researcher designed four model cases: EV1, TV90, TV95, and ORI. Data concerning typhoons occurring during 2002-2011 and 2012-2013 were collected for training and testing, respectively. After calibrating the parameters of various model cases by using a training set, the simulation results made by a testing set were evaluated and discussed. The researcher obtained the following findings: First, comparisons among four PCA cases: The results show that (1) when the number of attributes decreases, computing time decreases, and the prediction error increases. Reasons for this could be that although PCA performs the processing of data dimension reduction, pattern information is partially lost. (2) MLP and SVR result in the optimal ones compared with other models when averaging the RMSE measures of the four model cases.
Second, comparisons among the classified wave heights: The results show that (1) M5 provide the superior outcomes at small wavelet level compared with other models, (2) MLP has the optimum outcomes at the large wavelet and small/moderate wave levels compared with other models, and (3) SVR provides the optimal outcomes at the long wave and high/very high wave levels of all models.

Conclusions
This paper reports a comparative study of PCA-derived data-driven wave-height prediction models during typhoon periods. The used data mining models are kNN, LR, M5, MLP, and SVR. The experimental site was the Longdong buoy, off the northeastern coast of Taiwan. The processing stages of model development were undertaken. First, the researcher performed PCA and used the results to reduce the potential attributes from the original data at the data preprocessing stage. On the basis of the PCA, the researcher designed four model cases: EV1, TV90, TV95, and ORI. Data concerning typhoons occurring during 2002-2011 and 2012-2013 were collected for training and testing, respectively. After calibrating the parameters of various model cases by using a training set, the simulation results made by a testing set were evaluated and discussed. The researcher obtained the following findings: First, comparisons among four PCA cases: The results show that (1) when the number of attributes decreases, computing time decreases, and the prediction error increases. Reasons for this could be that although PCA performs the processing of data dimension reduction, pattern information is partially lost. (2) MLP and SVR result in the optimal ones compared with other models when averaging the RMSE measures of the four model cases.
Second, comparisons among the classified wave heights: The results show that (1) M5 provide the superior outcomes at small wavelet level compared with other models, (2) MLP has the optimum outcomes at the large wavelet and small/moderate wave levels compared with other models, and (3) SVR provides the optimal outcomes at the long wave and high/very high wave levels of all models. Third, evaluation of various lead times: The results show that MLP and SVR achieved more favorable relative weighted performance without consideration of computational complexity; however, MLP and SVR might obtained lower performance when computational complexity is considered.