Building Suitable Datasets for Soft Computing and Machine Learning Techniques from Meteorological Data Integration: A Case Study for Predicting Significant Wave Height and Energy Flux

Gómez-Orellana, Antonio Manuel; Fernández, Juan Carlos; Dorado-Moreno, Manuel; Gutiérrez, Pedro Antonio; Hervás-Martínez, César

doi:10.3390/en14020468


by Antonio Manuel Gómez-Orellana *, Juan Carlos Fernández *, Manuel Dorado-Moreno *, Pedro Antonio Gutiérrez * and César Hervás-Martínez *

Department of Computer Science and Numerical Analysis, University of Cordoba, 14071 Córdoba, Spain

* Authors to whom correspondence should be addressed.

Energies 2021, 14(2), 468; https://doi.org/10.3390/en14020468

Submission received: 24 December 2020 / Revised: 11 January 2021 / Accepted: 13 January 2021 / Published: 17 January 2021

(This article belongs to the Special Issue Soft Computing Techniques in Energy System)


Abstract

Meteorological data are extensively used to perform environmental learning. Soft Computing (SC) and Machine Learning (ML) techniques represent a valuable support in many research areas, but require datasets containing information related to the topic under study. Such datasets are not always available in an appropriate format, and their preparation and pre-processing demand considerable time and effort from researchers. This paper presents a novel software tool with a user-friendly GUI to create datasets by managing and integrating meteorological observations from two data sources: the National Data Buoy Center and the National Centers for Environmental Prediction/National Center for Atmospheric Research Reanalysis Project. Such datasets can be created from buoy and reanalysis data through procedures that are customisable in terms of temporal resolution and of predictive and objective variables, and can be used by SC and ML methodologies for prediction tasks (classification or regression). The objective is to provide the research community with an automated and versatile system that handles the casuistry of well-formed, quality data integration, potentially leading to better prediction models. The software tool can be used as a supporting tool for coastal and ocean engineering applications, sustainable energy production, or environmental modelling, as well as for decision-making in the design and building of coastal protection structures, marine transport, ocean energy converters, and the well-planned running of offshore and coastal engineering activities. Finally, to illustrate the applicability of the proposed tool, a case study to classify waves depending on their significant height and to predict energy flux in the Gulf of Alaska is presented.

Keywords: environmental prediction; renewable energy resource evaluation; meteorological data; reanalysis data; marine energy; soft computing

1. Introduction

A better understanding of the environment is of vital importance for science, contributing not only to more efficient exploitation of natural resources but also to the development of new strategies aimed at its protection. In that sense, meteorological observations provide an essential and valuable source of information, widely used by researchers to address environmental learning, comprehension, prediction and conservation in numerous oceanic and atmospheric studies across a wide variety of areas (e.g., energy, climate change, agriculture, etc.). Some specific examples of the diversity of fields in which meteorological data can be used are: global solar radiation estimation [1], directional analysis of sea storms [2], estimation of hybrid energy systems taking into account economic and environmental objectives [3], wind power ramp events prediction [4], sea surface temperature prediction [5], study of the responses exhibited by plankton to fluid motions [6], trends in solar radiation [7] or simulation of extreme near shore sea conditions [8]. All these studies require prior data collection and the adaptation of the data to a specific format that allows them to be interpreted.

Once quality and well-formed data are obtained, they can be used to extract information and build prediction models that explain the behavior of a certain problem. In addition to the data, the choice of an appropriate model is also an important factor, in engineering problems as in any other [9,10]. In this sense, public and research organisations are increasingly promoting the use of open and robust datasets to boost policies coherent with environmental exploitation, protection and conservation, as well as modelling tools available to the scientific community and decision-makers [11]. Although Soft Computing (SC) and Machine Learning (ML) techniques are able to handle uncertainty in data and are extensively used for modelling purposes, the real challenge in modelling studies is the inadequacy of data: the adequacy of the models depends mainly on the quality of the information used, so that without quality data there will be no quality models.

Along the same lines, special purpose software is usually developed to help researchers advance in their studies related to energy and environmental modelling, becoming a great support for decision-making in the exploitation and protection of the environment. In [12], a software package in R called “ForecastTB” is developed for comparing the performance of distinct prediction methods, presenting the software as a stepping stone in ML automation modelling. In [13], an integrated simulation tool to optimise the design of bifacial solar panels with reflectors is presented. This tool can also be applied to study the efficiency of the solar cells. A framework for integrating information from offshore wind farms is implemented in [14] in order to ease data interchange and enhance operation and maintenance practices. In [15], a new software tool named “Storage LCA Tool” is presented for comparing phase change material (PCM) storage systems with conventional systems that do not involve energy storage, which is beneficial for supporting decision-making on energy concepts for buildings. A risk assessment tool to improve safety standards and emergency management in onshore wind farms is presented in [16]. Raabe et al. [17] developed two software tools, Model of Equilibrium of Bay Beaches (MEPBAY) and Coastal Modelling System (SMC), for supporting distinct operational levels of headland-bay beach in coastal engineering projects, and Motahhir et al. [18] developed an open hardware/software test bench for a solar tracker.

Marine energy prediction is currently a hot topic where meteorological data are used. Marine Renewable Energy (MRE) is one of the most important renewable and sustainable energy sources available in our environment [19], and it includes ocean thermal energy, marine tidal current energy and wave energy, among others. Its benefits and great potential [20] make it one of the most relevant natural resources, playing a crucial role not only in the reduction of the emission of greenhouse gases but also in all other aspects involved in the difficult challenge of the transition to a low carbon footprint society [21,22,23]. Wave energy exhibits a more stable power supply than wind energy and even solar energy. In recent years, Wave Energy Converters (WECs) [24] have been developed and widely installed, even arranged in array form [25], to transform this wave energy into electricity, which can be injected into the electric network or supplied to existing offshore oil and gas platforms [26] or seawater desalination plants [27], among others. WECs are mechanical devices that convert kinetic energy into electrical energy through either the vertical oscillation or the linear motion of waves. Nevertheless, waves are difficult to characterise due to their stochastic nature, which results from the large number of environmental factors that act on them [28]. As a consequence of this complexity, many aspects of WEC design, deployment and operation [29,30,31] need a proper prediction of waves [32,33], in order to maximise the wave energy extraction [34]. For this purpose, WECs use the wave flux of energy ($F_{e}$), which can be calculated from the two most relevant wave parameters related to this aspect: significant wave height ($H_{s}$) and wave energy period ($T_{e}$).

Currently, and as a support to traditional study procedures, SC and ML techniques [35,36] are being widely used in numerous research fields related to classification, regression, and optimisation tasks, obtaining significant improvements in performance in engineering [10], energy, or environmental problems [37,38,39]. SC and ML methodologies can be used not only by experienced computer scientists but also by other researchers. For example, the well-known Waikato Environment for Knowledge Analysis (WEKA) [40] software tool provides researchers with a wide collection of ML algorithms. ML techniques have already been applied to tackle wave characterisation, accurately estimating the $H_{s}$ and $T_{e}$ parameters [41,42], given that the robustness of ML methods can cope with the previously explained difficulties in wave energy prediction. In [43], a reliable ML model based on multiple linear regression and covariant-weighted least square estimation for $H_{s}$ modelling is presented in order to predict significant wave height 30 min in advance. In [44], an approach for feature selection problems is developed and applied for $H_{s}$ and $F_{e}$ prediction in oceanic buoys, obtaining very good results. In [45], a Bayesian Network system provides a helpful tool to support the decision-making process of installation and maintenance operations in offshore wind farms using predictions of $H_{s}$, among others. In [46], several ML methods are implemented and compared for the prediction of $H_{s}$ in the Persian Gulf, with the extreme learning machine (ELM) providing the best results. The problem is that, in order to apply ML and SC techniques, it is essential to obtain datasets with relevant information about the issue under study, from which knowledge is inferred. Usually, these datasets are not publicly available in a friendly format, and their generation is the first step needed.

The information to create these datasets related to MRE can be obtained from meteorological observations, but such information may be available in an inappropriate format and even contain missing values or measurements. Consequently, it is usually required to perform pre-processing tasks for improving the quality of the data, such as the replacement of missing values, outlier detection, or data normalisation, among others. Furthermore, if more than one source of information is used to achieve a better characterisation of the problem under study [47,48,49], then a data integration process, referred to as the matching process in this document, has to be carried out by researchers to manually create the datasets with the needed information. Given that such a process is of great relevance and has an extensive casuistry, the present work focuses especially on it. Moreover, depending on the subject and the SC and ML technique to be applied, or if the researcher considers other factors in order to enhance the performance obtained or reach more in-depth conclusions, the datasets would have to be updated afterwards. In summary, many important details and different intermediate steps have to be considered when creating suitable datasets, especially when data integration is required, resulting in an extremely tedious task.

The main purpose of this paper is to present a new open source tool for the creation of datasets that integrate meteorological variables from two sources of information. Given that the tool provides a user-friendly graphical interface, no knowledge of programming languages is needed. It also spares researchers the tedious work mentioned above and greatly simplifies all the steps involved in it, avoiding possible errors in the intermediate steps, at least as a preliminary study in certain areas where some kind of environmental prediction is needed. The meteorological data used by the tool come from two well-known sources of information: the National Oceanic and Atmospheric Administration (NOAA) National Data Buoy Center (NDBC) [50] and the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) Reanalysis Project (NNRP or R1) [51,52]. The open source software tool presented in this work is named SPAMDA (Software for Pre-processing and Analysis of Meteorological DAta to build datasets); it is provided in the Supplemental Materials and is available at https://github.com/ayrna. As SPAMDA performs all this data processing, it reduces the time these tasks involve and allows researchers to focus on the study of the meteorological aspects of the observations. The datasets obtained are ready to be used as input for SC and ML techniques in prediction tasks (classification or regression), although researchers can use them for other purposes. These datasets contain one or more meteorological variables as inputs and one variable as target (variable to be predicted). The format of the generated datasets will be Attribute-Relation File Format (ARFF) [53], which is the one used by WEKA. In addition, the datasets can also be generated in Comma-Separated Values (CSV) format, enabling researchers to use other tools.

In order to address the problem previously discussed, namely meteorological data integration from NDBC and NNRP and the casuistry that it entails, SPAMDA offers researchers novelties and functionalities that will be detailed in Section 3, although some of them are briefly summarised below:

The generation of datasets becomes a very easy and customisable task by means of the selection of different input parameters, such as predictive and objective variables, classification and regression, output discretisation (useful for ordinal regression) or prediction horizon, among others.
The created datasets can be easily used by SC and ML tools.
It lets the researcher focus on environmental modelling, without having to worry about the development of scripts or mechanical tasks, avoiding laborious pre-processing procedures that demand a great deal of time and effort in the early stages of the research.
It avoids possible researcher errors in the intermediate steps of the process, such as geographical coordinates conversion, missing values handling (dates or measurements not recorded) or different temporal resolution of the data collected, among others.
It provides information about the quality and quantity of the data. SPAMDA allows preliminary studies of missing values (dates or measurements not recorded) in buoys managed by NDBC, so that the researcher can have an idea of the quality of the data recorded by the buoys and about their suitability for the intended purpose. In any case, SPAMDA allows data integration taking into account such missing values when needed by the user.
Estimation of the amount of energy flux that can be produced at different prediction horizons: short-term, mid-term or long-term. Although this work does not focus on model performance, it should be taken into account that models tend to generalise worse with greater prediction horizons.
It manages the extensive casuistry of data integration which can lead to incomplete datasets, described in Appendix A.
Possibility of selecting one or more reanalysis nodes near the localisation under study, which could provide a better description of the problem to achieve more accurate models.
Although pre-processing is not the main objective of SPAMDA, the tool also provides some basic pre-processing filters on buoy measurements, such as normalisation and missing data recovery.
It facilitates data management and well-organised storage of the datasets. Environmental studies in different geographical locations can be carried out by merely introducing and using other collected data.
SPAMDA is distributed as an open source tool, and its modular design allows the implementation of new modules for managing meteorological data from other sources, benefiting future renewable energy and environmental research.
It includes a user-friendly GUI, facilitating and greatly simplifying data management, and it is integrated with the Explorer environment of WEKA.
It is multi-platform, and it can be used on any computer with Java regardless of the operating system.

Therefore, the functionalities and characteristics that SPAMDA offers make it a supporting tool for researchers, which could be used in applications related to coastal and ocean engineering, and also in marine energy prediction. In [3], the estimation of energy supply sources in hybrid energy systems is based on the amount of energy that can be obtained by a marine energy system within a prediction horizon. Regulation of WECs to avoid malfunction or breakage, depending on the significant wave height and/or energy flux expected, as well as the possibility of reconfiguring them in order to maximise the wave energy extraction, is studied in [29,30]. The prediction of the energy that could be obtained from a certain maritime location is considered in [26,27] in order to know whether it is convenient to install WECs as power supply in marine structures, such as offshore oil and gas platforms or seawater desalination plants. In [54], significant wave height forecasting is applied for decision-making in exploitation and environmental protection for the construction of marine energy storage plants, future strategies on renewable energy and coastal planning. Other examples of application are: design of offshore structures and ports [55], decision-making and risk assessment about operational works in the sea [56], security systems for structures or naval security [57].

This paper is organised as follows: Section 2 describes the sources of information used by SPAMDA for creating datasets. Section 3 describes in detail the features of the software tool. Section 4 shows a case study describing the use of SPAMDA in a practical approach. Section 5 provides the final conclusions and future work.

2. Meteorological Data Sources

The data provided by the two sources of information used by SPAMDA are described below:

NDBC belongs to the National Weather Service (NWS) and operates and supports a network of marine and ocean buoys. The mission of the network is to record marine and ocean meteorological data, such as $H_{s}$, dominant wave period, or wind speed and direction, among others.
The buoys maintained by NDBC are located in coastal and offshore waters, and they are equipped with specific sensors and devices which allow them to perform measurements. The information collected by the buoys is available on the NDBC website [58], and it is divided into different groups. One of them corresponds to the standard meteorological information of the historical data collected by each buoy, which can be downloaded as annual text files whose format has been used by NDBC since January 2007 [59]. These files contain hourly measurements, taken each day from 00:50 to 23:50 UTC (Coordinated Universal Time), and each file spans from 23:50 on 31 December of the previous year to 22:50 on 31 December of the desired year. In Table 1, a comprehensive measurement description and the corresponding units are provided as a summary for the reader. A fragment of one of these files, which contains the measurements collected during year 2017 by the buoy identified as Station 46001 in NDBC, is shown in Figure 1. Each column corresponds to a meteorological variable or attribute, and each row or instance corresponds to the values of the measurements collected by the buoy for each attribute at a specific date and time.
Note that the data collected by the network of buoys may be incomplete due to diverse circumstances such as the weather conditions in which the buoys have to operate, failures or malfunctioning elements of the buoys, among others. Accordingly, it may be the situation that some of the measurements are completely missing (missing date or instance) or partially missing (some measurements not recorded), by a buoy or by a set of buoys, once in a while or over a period of time. It may be also possible that the measurements have been recorded at a time different from the expected one. These aspects have to be taken into account when creating the datasets. This casuistry is explained in detail in Appendix A.
NNRP provides three-dimensional global reanalysis of numerous meteorological observations (e.g., zonal and meridional components of wind velocity, relative humidity, pressure, etc.), which is available monthly, daily, and every six hours at 00 Z (Zulu time), 06 Z, 12 Z, and 18 Z, from 1948 onwards, on a global $2.5° \times 2.5°$ grid. Weather observations come from different sources, such as ships, satellites, and radar, among others. Reanalysis data are created by assimilating such observations with the same climate model throughout the whole reanalysis period, in order to decrease the impact of modelling changes on climate statistics. Such information has become a substantial support for the research community, even more so in locations where instrumental (real time) data are not available.
The reanalysis data are available on the NNRP website [61], which is accessible through different sections. Such data can be fully (a global $2.5° \times 2.5°$ grid) or partially (only the desired reanalysis nodes or sub-grid) downloaded as Network Common Data Form (NetCDF) files [62], a special binary format for representing scientific data, which provides a description of the file contents and also includes the spatial and temporal properties of the data. Each reanalysis file contains the values of a meteorological variable estimated by a mathematical model for each reanalysis node. For the sake of clarity, Figure 2 approximately illustrates a sub-grid containing six reanalysis nodes surrounding the geographic location of a buoy (obtained from NDBC).
Therefore, with both sources of information, which complement each other, and carrying out a matching process, SPAMDA will create datasets for prediction tasks. In this way, the dataset input variables will be one or more reanalysis variables from NNRP and one or more measurements from NDBC. The dataset output variable will always be one measurement from NDBC.
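As an illustration of the annual NDBC text files described above, the following minimal Python sketch reads such a file into a list of records. It is not SPAMDA's implementation. It assumes that the first header line (starting with `#`) names the columns, that the first five columns hold the date and time in UTC, and that values such as `99.00`, `999` or `9999.0` mark measurements that were not recorded; these sentinels, the function name `parse_stdmet` and the sample values are assumptions made for the example and should be checked against the NDBC measurement descriptions.

```python
from datetime import datetime, timezone

# Assumed sentinel strings for measurements that were not recorded.
MISSING = {"99.0", "99.00", "999", "999.0", "9999.0"}

# Illustrative fragment (hypothetical values) in the layout of an annual file.
SAMPLE = """
#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS  TIDE
#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC   mi    ft
2016 12 31 23 50 119  6.8  8.6  2.10  8.33  6.10 999 1003.5   4.5   5.2 999.0 99.0 99.00
2017 01 01 00 50 121  7.1  9.0  2.25  8.33  6.20 999 1003.1   4.6   5.2 999.0 99.0 99.00
"""


def parse_stdmet(text, missing=MISSING):
    """Parse standard meteorological text into a list of dict records."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The first header line carries the attribute names.
    header = lines[0].lstrip("#").split()
    records = []
    for ln in lines[1:]:
        if ln.startswith("#"):
            continue  # skip the units line
        fields = ln.split()
        year, month, day, hour, minute = (int(x) for x in fields[:5])
        record = {
            "time": datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
        }
        # Remaining columns are measurements; sentinels become None.
        for name, raw in zip(header[5:], fields[5:]):
            record[name] = None if raw in missing else float(raw)
        records.append(record)
    return records


records = parse_stdmet(SAMPLE)
```

Each record keeps its timestamp as a timezone-aware `datetime`, which makes later comparison with reanalysis dates straightforward.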

3. SPAMDA

SPAMDA combines meteorological information from NDBC and NNRP to obtain new datasets for oceanic and atmospheric studies. In order to do so, SPAMDA manages three different types of datasets, which are described in the following sections but are briefly introduced below to give the reader a better general understanding:

Intermediate datasets: They contain the meteorological observations from NDBC.
Pre-processed datasets: They are obtained as a result of pre-processing tasks performed on the intermediate datasets.
Final datasets: They are created by merging an intermediate or pre-processed dataset (which contains the information from NDBC) with the reanalysis data from NNRP. This procedure is referred to in SPAMDA as the matching process and will be carried out according to the study to be performed (classification or regression).

SPAMDA consists of three main functional modules, whose main features, represented in Figure 3, are the following:

Manage buoys data: The aim of this module is to provide features for the management and analysis of the information related to the buoys from NDBC. This includes:
- Entering and updating the information of each buoy.
- Creation of intermediate datasets with the collected measurements.
- Pre-processing tasks for obtaining the pre-processed datasets.
- Matching process to merge the information from NDBC and NNRP.
- Creation of the final datasets according to the ML technique to use (classification or regression).
Manage reanalysis data: This module is used for the management of the reanalysis data provided by the NNRP. In this way, researchers can keep the reanalysis data files updated for their studies. Such files will be used, depending on researchers’ needs, in the matching process when obtaining the final datasets.
Tools: This module includes features for converting intermediate or pre-processed datasets to ARFF or CSV format and for opening ARFF files with WEKA software.
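To make the two output formats concrete, the following Python sketch serialises a small table of numeric attributes as ARFF and as CSV. It is only an illustration of the file layouts, not SPAMDA's converter: the function names `to_arff` and `to_csv` are hypothetical, all attributes are assumed to be numeric, and attribute names are assumed to contain no spaces (ARFF would otherwise require quoting). Missing values are written as `?` in ARFF and as empty fields in CSV.

```python
import csv
import io


def to_arff(relation, attributes, rows):
    """Return an ARFF document with numeric attributes; None becomes '?'."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


def to_csv(attributes, rows):
    """Return a CSV document with a header row; None becomes an empty field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(attributes)
    for row in rows:
        writer.writerow(["" if v is None else v for v in row])
    return buffer.getvalue()


attributes = ["WVHT", "APD"]
rows = [[2.1, 6.1], [None, 5.0]]
arff_text = to_arff("buoy", attributes, rows)
csv_text = to_csv(attributes, rows)
```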

In the following subsections, each integrated functional module is described in detail.

3.1. Buoys

When a new buoy is included in SPAMDA, the following information, which can be obtained from NDBC, is requested:

Station ID: An alphanumeric identifier that allows easy identification of the buoy.
Description: A short description of the buoy.
Latitude: North or South geographical localisation (degrees) of the buoy.
Longitude: West or East geographical localisation (degrees) of the buoy.
Measurements files: The above-mentioned annual text files of the standard meteorological information recorded by the buoy and downloaded from the NDBC website. This will be used for the creation of the intermediate datasets. One file per year is expected.

For clarification, an example is presented in Figure 4, where the buoy ID1 has three annual text files and the buoy ID2 has two annual text files.

3.2. Datasets

Once a buoy has been included as described in Section 3.1, it is possible to create datasets with one or more annual text files, which are referenced in SPAMDA as intermediate datasets. In this module, researchers can manage intermediate datasets of each buoy, which are the baseline for their studies, by creating new ones or deleting the unnecessary ones.

When an intermediate dataset is created, it is associated with its corresponding buoy. In addition, a summary of its content is also created, providing relevant information such as the number of instances, the dates of the first and last measurements, the annual text files included and the missing and duplicated dates.
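A summary of this kind can be sketched in a few lines of Python. The function below is a hypothetical illustration, not the tool's code: it assumes that measurements are expected exactly once per hour on a regular grid starting at the first timestamp, and reports the number of instances, the first and last dates, and the missing and duplicated dates.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarise_dates(timestamps, step=timedelta(hours=1)):
    """Summarise a list of measurement timestamps (assumed hourly)."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    counts = Counter(timestamps)
    first, last = min(counts), max(counts)
    # Walk the expected regular grid and collect the dates never observed.
    missing = []
    current = first
    while current <= last:
        if current not in counts:
            missing.append(current)
        current += step
    duplicated = sorted(t for t, c in counts.items() if c > 1)
    return {
        "instances": len(timestamps),
        "first": first,
        "last": last,
        "missing": missing,
        "duplicated": duplicated,
    }


stamps = [
    datetime(2017, 1, 1, 0, 50),
    datetime(2017, 1, 1, 1, 50),
    datetime(2017, 1, 1, 1, 50),  # duplicated date
    datetime(2017, 1, 1, 3, 50),  # 02:50 is missing
]
summary = summarise_dates(stamps)
```

Measurements recorded at an unexpected minute would be reported as missing by this simple grid check, so a real summary would need a tolerance.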

An example where three intermediate datasets have been created is presented in Figure 5. The two intermediate datasets of the buoy ID1 contain meteorological data of different years, and the intermediate dataset of the buoy ID2 contains meteorological data of two years. For each buoy, as many intermediate datasets as needed can be created.

3.3. Pre-Process

Data pre-processing prepares the raw data (intermediate datasets) to be able to be treated correctly by ML algorithms. This action can enhance the quality of data before the learning phase, by applying pre-processing tasks (filters). The result will be referenced as pre-processed datasets.

SPAMDA provides several filters grouped in three categories, Attribute, Instance, and Recover missing data, including the configuration of their parameters and a short description of them:

Attribute: All of these filters can be applied to the attributes (variables of the buoy from NDBC) of the intermediate dataset.
- Normalize: This filter normalises all numeric values of each attribute. The resulting values are by default in the interval [0,1].
- Remove: It removes an attribute or a range of them.
- RemoveByName: It removes attributes based on a regular expression matched against their names.
- ReplaceMissingValues: For each attribute, all the missing values will be replaced by the average value of the attribute.
- ReplaceMissingWithUserConstant: This filter replaces all the missing values of the attributes with a user-supplied constant value.
Instance: All these filters can be applied to the instances (hourly measurements of the buoy from NDBC) of the intermediate dataset.
- RemoveDuplicates: With this filter, all duplicated instances are removed.
- RemoveWithValues: This filter removes all the instances that match the attribute and the value supplied by the user.
- SubsetByExpression: It removes all the instances that do not match a user-specified expression.
Recover missing data: All these filters can be applied to the instances of the intermediate dataset.
- Replace missing values with next nearest hour: The missing values of each attribute are replaced with the next nearest non-missing value.
- Replace missing values with previous nearest hour: This filter replaces the missing values of each attribute with the previous nearest non-missing value.
- Replace missing values with next n hours mean: The missing values of each attribute are replaced with the mean of the next n nearest non-missing values, where n can be configured by the user.
- Replace missing values with previous n hours mean: This filter replaces the missing values of each attribute in the intermediate dataset with the mean of the previous n nearest non-missing values.
- Replace missing values with symmetric n hours mean: The missing values of each attribute in the intermediate dataset are replaced with the mean of the n previous and n next non-missing values.
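The "Recover missing data" filters can be illustrated with a short Python sketch for a single attribute. It is an interpretation of the descriptions above rather than the tool's actual code: missing values are represented by `None`, the means are computed from the original non-missing values only (previously filled values are not reused), and a value is left missing when no neighbour is available. The function names are hypothetical.

```python
from statistics import mean


def fill_with_previous_mean(values, n):
    """Replace each None with the mean of the previous n non-missing values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            previous = [x for x in values[:i] if x is not None][-n:]
            if previous:
                filled[i] = mean(previous)
    return filled


def fill_with_symmetric_mean(values, n):
    """Replace each None with the mean of the n previous and n next non-missing values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            previous = [x for x in values[:i] if x is not None][-n:]
            following = [x for x in values[i + 1:] if x is not None][:n]
            window = previous + following
            if window:
                filled[i] = mean(window)
    return filled


wave_heights = [1.0, 2.0, None, 4.0, None, 6.0]
previous_filled = fill_with_previous_mean(wave_heights, n=2)
symmetric_filled = fill_with_symmetric_mean(wave_heights, n=1)
```

With the sample series, the first gap is filled from the values 1.0 and 2.0 in the "previous" variant and from 2.0 and 4.0 in the "symmetric" variant.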

SPAMDA allows researchers to undo the last filter applied or to restore the initial content of the intermediate dataset. In addition, the content and relevant statistical information of the intermediate and the pre-processed datasets can be visualised in this module, for example: minimum and maximum values, mean, standard deviation, or even the number of instances with missing values.

Figure 6 shows an example where the intermediate datasets 1 and 2 of the buoy ID1 have been pre-processed, obtaining as a result the pre-processed dataset 1 of each one. The intermediate dataset 1 of the buoy ID2 has also been pre-processed. Pre-processed dataset n indicates that researchers can create as many pre-processed datasets as they consider appropriate.

Nevertheless, further pre-processing tasks can be performed after obtaining the final datasets by means of the Explorer environment of WEKA or other tools.

3.4. Matching Configuration

The automatic integration of the data provided by the two sources of information described in Section 2, which merges and formats such data, is referred to as the matching process in this document. This process is one of the most powerful and remarkable features of the software tool due to its great relevance and extensive casuistry. In this sense, SPAMDA has been developed to provide great flexibility to researchers.

The matching procedure is performed using an intermediate or pre-processed dataset, which includes the measurements collected by a buoy from NDBC, and the needed reanalysis data files from NNRP. Note that SPAMDA is able to manage the NetCDF binary format for handling the information stored in the reanalysis files.

This process merges the information from both sources that matches in time. However, given that the reanalysis data are available with a minimum temporal resolution of 6 h (at 00 Z, 06 Z, 12 Z and 18 Z), while the measurements of the buoys are recorded at hourly intervals, from 00:50 to 23:50 UTC, the matching can only be carried out every six hours (discarding the rest of the measurements from the buoy data). In addition, since there is still a difference of 10 min, the matching with the reanalysis data will be performed with the nearest buoy measurement (before or after) within a maximum difference of 60 min. Finally, the matched instances of both sources will form the final datasets.
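The time-matching rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in SPAMDA: the function name `match_nearest` is hypothetical, a difference of exactly 60 min is assumed to be acceptable, and a simple linear search is used for clarity rather than efficiency.

```python
from datetime import datetime, timedelta


def match_nearest(reanalysis_times, buoy_times, max_gap=timedelta(minutes=60)):
    """Pair each reanalysis time with the nearest buoy time within max_gap."""
    pairs = []
    for r_time in reanalysis_times:
        # Nearest buoy measurement, before or after the reanalysis time.
        nearest = min(buoy_times, key=lambda b: abs(b - r_time), default=None)
        if nearest is not None and abs(nearest - r_time) <= max_gap:
            pairs.append((r_time, nearest))
    return pairs


# Hourly buoy measurements at minute 50, with 11:50 and 12:50 not recorded.
buoy = [
    datetime(2017, 1, 1, h, 50) for h in range(0, 24) if h not in (11, 12)
]
reanalysis = [datetime(2017, 1, 1, h) for h in (6, 12, 18)]
matched = match_nearest(reanalysis, buoy)
```

In this example the 06 Z and 18 Z reanalysis instances are matched with the buoy measurements taken 10 min earlier, while the 12 Z instance is discarded because the nearest measurement is 70 min away.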

Figure 7 presents an example of matching with the measurements collected during 2017 by Station 46001 (NDBC) and the reanalysis data (NNRP) of the variable pressure for reanalysis nodes 57.5 N × 147.5 W and 55.0 N × 147.5 W in the same year. In this way, only the instances from both sources that are linked with arrows (highlighted in green) will be used in the creation of the final datasets. Although the reanalysis dates have been presented in a human readable format, note that reanalysis dates are stored in hours from 01-01-1800, and they have to be transformed for comparison taking into account the time zone. Such transformation is automatically done by SPAMDA when matching the instances.
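The date transformation mentioned above can be sketched in Python as follows. This is an illustration only: it assumes that the reference instant is 00:00 UTC on 1 January 1800 and that dates are handled as timezone-aware `datetime` objects; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed reference instant for reanalysis dates: 1800-01-01 00:00 UTC.
REANALYSIS_EPOCH = datetime(1800, 1, 1, tzinfo=timezone.utc)


def reanalysis_hours_to_datetime(hours):
    """Convert hours since the reference instant to an aware UTC datetime."""
    return REANALYSIS_EPOCH + timedelta(hours=hours)


def datetime_to_reanalysis_hours(moment):
    """Convert a timezone-aware datetime to hours since the reference instant."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return (moment.astimezone(timezone.utc) - REANALYSIS_EPOCH) / timedelta(hours=1)


start_of_2017 = datetime(2017, 1, 1, tzinfo=timezone.utc)
hours = datetime_to_reanalysis_hours(start_of_2017)
round_trip = reanalysis_hours_to_datetime(hours)
```

Requiring timezone-aware values avoids silently interpreting a buoy timestamp in the local time zone of the machine running the conversion.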

See Appendix A for a more complex example of the procedure.

SPAMDA allows researchers to perform a customisable matching process, for obtaining as many different versions of the same meteorological data as needed. Prediction tasks are based on the estimation of the output attribute using the information provided by the input attributes. Depending on the task, the datasets must be prepared and configured differently:

Classification: The final datasets will be ready to use as an input for ML classifiers, requiring a nominal output attribute, whose specific preparation is detailed in Section 3.5.
Regression: The final datasets will be ready to use as input for regression methods, requiring a real output attribute, whose preparation is also explained in Section 3.5.
Direct matching: In this case, the input attributes have a direct correspondence with the output attribute, and it is not necessary to perform any additional preparation. Both input and target attributes are synchronised in time, so the final dataset is not intended for prediction purposes. For example, the final datasets may be used in missing-data recovery tasks, correlation studies, descriptive analyses, etc.

The following parameters can be specified for the matching process:

Flux of energy [48]: When the $F_{e}$ is selected, it will be used as output. This attribute is not collected by the buoys, but there are two parameters from which it can be computed: $H_{s}$ and $T_{e}$ , which are collected as WVHT and APD attributes, respectively, and were described in Table 1. In this way, SPAMDA obtains the $F_{e}$ (measured in kilowatts per meter) of each instance using the following equation:

$F_{e} = 0.49 \cdot H_{s}^{2} \cdot T_{e},$

(1)

where $H_{s}$ is measured in meters and $T_{e}$ in seconds. $F_{e}$ is referred to as flux of energy, but it is defined as an average energy flux because $H_{s}$ is an average wave height (see descriptions of the measurements on the NDBC website).
Attribute to predict: Instead of using $F_{e}$ , researchers can select any of the attributes collected by the buoys as output (e.g., significant wave height, WVHT, wind direction, WDIR, sea level pressure, PRES, etc.). Therefore, they can conduct different studies by selecting one attribute or another.
Reanalysis data files: To obtain a potentially better description of the problem under study, more than one reanalysis variable can be considered as input. Remember that these files have to be previously downloaded from the NNRP website [61], where the range of dates (temporal properties) and the desired sub-grid (spatial properties, see Figure 2) should be set for each reanalysis variable.
In that sense, the reanalysis data files must have the same spatial and temporal properties but relate to different variables. SPAMDA simplifies this task by showing the reanalysis data files that are compatible with each other, and checking that the selection made by the researcher meets that condition.
Buoy attributes: In addition to the reanalysis variables, the final datasets will also include the selected attributes as inputs (of the intermediate or pre-processed dataset used), providing a possible better characterisation of the problem under study, although it will depend on how correlated the attributes are.
Include missing dates: As mentioned above, the information collected by a buoy may be incomplete due to measurements not recorded by it. As a consequence, the matching of instances between both sources of information may not be possible (missing dates). In that situation, researchers can consider two options: (1) discard the instances affected or (2) include them. In the latter case, the final datasets will contain the affected instances, but the measurements of the buoy will be stored as missing values in WEKA format, denoted as «?».
Nearest reanalysis nodes to consider: As already shown in Figure 2 (which represents six reanalysis nodes), the reanalysis data files may contain information of several reanalysis nodes. In this way, researchers can:
− Consider all the reanalysis nodes contained in each file: in this case, the information provided by each reanalysis node contained in each selected reanalysis data file will be used.
− Consider only some of the reanalysis nodes contained in each file: in this case, only the information corresponding to the nodes closest to the buoy is used (the number of nodes, N, is indicated by the user). To do that, SPAMDA uses the Haversine equation [63] (or the great-circle distance) to calculate the distance from the location of the buoy to each reanalysis node and obtain the closest ones. The Haversine equation computes the distance from an origin point to a destination point using trigonometric functions:

$d (p_{0}, p_{j}) = \arccos \left( \sin (lat_{0}) \cdot \sin (lat_{j}) + \cos (lat_{0}) \cdot \cos (lat_{j}) \cdot \cos (lon_{0} - lon_{j}) \right),$

(2)

where $p_{0}$ is the geographical location of the buoy and $p_{j}$ is the position of each node. Finally, $l a t$ and $l o n$ represent the latitude and longitude of the positions of the points.
Number of final datasets: Depending on the number of nearest reanalysis nodes to consider, the number of final datasets to create and the content of them can be configured according to the following options:
− One (using weighted mean of the N nearest reanalysis nodes): Only one final dataset will be created, which will contain the attributes (the selected one as output and the selected ones as inputs) of the intermediate or pre-processed dataset used, along with a weighted mean of each variable of the reanalysis data used (one per selected reanalysis data file). This weighted mean is obtained by SPAMDA and uses Equation (2) to calculate the distance from the geographical position of the buoy to each node of reanalysis. Once the distances have been computed, they are normalised and inverted as shown in the following equation:

$w_{i} = \frac{d (p_{0}, p_{i})}{\sum_{j = 1}^{N} d (p_{0}, p_{j})}, i = 1, \dots, N .$

(3)

Then, with these calculated weights, a weighted mean of each reanalysis variable is obtained over the N nodes. In this way, the reanalysis nodes closest to the geographical position of the buoy will provide more information.
Considering as an example the two nearest reanalysis nodes represented in Figure 2 and the reanalysis variables air temperature and pressure, the weighted mean of each reanalysis variable will be calculated using the reanalysis nodes $57.5$ N × $147.5$ W and $55.0$ N × $147.5$ W.
− ‘N’ (one per each reanalysis node): As many final datasets as the number of nearest N reanalysis nodes configured by the researcher will be created. Therefore, each final dataset will contain the value of each reanalysis variable used at the corresponding nearest reanalysis node, along with the selected attributes of the intermediate or pre-processed dataset used. In this way, researchers can perform comparison studies depending on the reanalysis node considered, to achieve better performance for the problem under study.
In this case, and considering as example the four closest reanalysis nodes (see Figure 2) and the reanalysis variables air temperature and pressure, four final datasets will be created, each one containing the information of both reanalysis variables of the corresponding reanalysis node: $57.5$ N × $147.5$ W, $55.0$ N × $147.5$ W, $57.5$ N × $150.0$ W and $55.0$ N × $150.0$ W, along with the selected attributes of the intermediate or pre-processed dataset used.
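The node selection and the weighted mean described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of SPAMDA: the great-circle distance is computed with the standard haversine form and returned as a central angle in radians (the ranking and the weights do not depend on the Earth radius); the inversion of the normalised distances is assumed here to be an inverse-distance weighting, since the exact inversion step is not detailed; and the buoy position and pressure values are approximate, hypothetical numbers.

```python
import math

def central_angle(lat0, lon0, lat1, lon1):
    """Great-circle central angle (radians) between two points given in degrees."""
    phi0, phi1 = math.radians(lat0), math.radians(lat1)
    dphi = phi1 - phi0
    dlmb = math.radians(lon1 - lon0)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi0) * math.cos(phi1) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))

def nearest_nodes(buoy, nodes, n):
    """Return the n reanalysis nodes closest to the buoy, nearest first."""
    return sorted(nodes, key=lambda node: central_angle(*buoy, *node))[:n]

def inverse_distance_weights(buoy, nodes):
    """Assumed weighting: proportional to 1/distance, normalised to sum to 1.

    Raises ZeroDivisionError if the buoy coincides exactly with a node.
    """
    inverse = [1.0 / central_angle(*buoy, *node) for node in nodes]
    total = sum(inverse)
    return [value / total for value in inverse]

def weighted_mean(values, weights):
    """Weighted mean of one reanalysis variable over the selected nodes."""
    return sum(v * w for v, w in zip(values, weights))

# (latitude, longitude) in degrees; west longitudes are negative.
BUOY = (56.3, -148.0)  # approximate position, assumed for the example
NODES = [(57.5, -147.5), (55.0, -147.5), (57.5, -150.0), (55.0, -150.0)]

closest = nearest_nodes(BUOY, NODES, 2)
weights = inverse_distance_weights(BUOY, closest)
pres = weighted_mean([101200.0, 101000.0], weights)  # hypothetical values (Pa)
print(closest, weights, pres)
```

With these assumed coordinates, the two selected nodes are $57.5$ N × $147.5$ W and $55.0$ N × $147.5$ W, and the closer of the two receives the larger weight.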

Once the matching parameters have been described, for a better understanding of them, Figure 8 presents an example of the data integration considering the data shown in Figure 7 and using the following configuration (Note that the date is shown just for a better understanding, but it will not be included in the final dataset):

Attribute to predict: variable WVHT (Figure 8a) / flux of energy (Figure 8b).
Variable Pres as reanalysis input attribute.
Variable WSPD as buoy input attribute.
Not including missing dates.
Considering the closest reanalysis node.
Task to be used: Direct matching.

3.5. Final Datasets

Once the matching process has been performed with the desired configuration, it is necessary to prepare the matched information for the desired prediction task (Regression or Classification), obtaining as a result the final datasets. Remember that Direct matching, as described in Section 3.4, establishes a direct correspondence between the attributes used as inputs and the output one, so it is not necessary to carry out any preparation.

SPAMDA allows researchers to make such preparation by means of the following options:

Prediction horizon (Classification and Regression): This option indicates the time gap for moving backward the attribute to predict (output attribute). In this way, the input attributes (variables of the buoy and reanalysis data) will be used to predict the output attribute in a specific future time (e.g., +6 h, +12 h, +18 h, +1 day, etc.).
The minimum interval for increasing and decreasing the prediction horizon is 6 h (due to reanalysis data temporal resolution) [4], the same interval used when the matching process is carried out. Therefore, for each increment of the prediction horizon, an instance of the dataset is lost (as this future information is not available). As the minimum prediction horizon is 6 h, at least one instance will be lost. The relation between the inputs and the output (attribute to predict) is defined as follows:

$o_{t + Δ t} = ϕ (b_{t}, r_{t}),$

(4)

where t is the time instant to study, $Δ t$ is the prediction horizon, o is the attribute to be predicted, $b_{t}$ represents the vector that contains the selected NDBC variables, and, finally, $r_{t}$ represents the vector that contains the selected reanalysis variables. In this way and considering the matched information shown in Figure 8a, WVHT is o, the vector $b$ contains the variable WSPD, and the vector $r$ contains Pres.
Optionally, the reanalysis variables can be synchronised with the attribute to predict. Given that these variables are estimated by a mathematical model, we can obtain very good future estimations, which can improve the performance of the results. In this case, the relation between the inputs and the attribute to predict would be:

$o_{t + Δ t} = ϕ (b_{t}, r_{t + Δ t}) .$

(5)

Note that the selected NDBC variables as input cannot be synchronised with the attribute to predict.
For the sake of clarity, considering the matched information shown in Figure 8a, an example of building a dataset for a Regression task is shown in Figure 9a. As mentioned earlier, this prediction task requires a real output variable (in this case, WVHT, the last one). The options considered for the preparation of each final dataset are the following:
− Do not synchronise the reanalysis data (see Equation (4) for the relation between the inputs and the output).
− A prediction horizon of 6 h.
Note that, because the prediction horizon is 6 h, the values of the WVHT attribute are moved backward one instance (up). As a consequence, the last instance (31 December 2017 18:00) is lost and is not included in the final dataset. In addition, and because the reanalysis data have not been synchronised, the values of the Pres and WSPD variables are at the same time instant (t in Equation (4)).
Moreover, considering again the matched information shown in Figure 8a, an example of the creation of the same dataset but applying synchronisation (see Equation (5)) is shown in Figure 9b.
Again, and due to the prediction horizon selected (6 h), the values of the WVHT attribute are moved backward one instance (up) and the last instance (31 December 2017 18:00) is not included in the final dataset. However, now, the values of the Pres variable are also moved backward one instance (due to the synchronisation). Therefore, in this case, Pres is at the same time instant as the attribute to predict ( $t + Δ t$ in Equation (5)).
Thresholds of the output attribute (Classification): Since the values of the variables collected by the buoys are real numbers, it is necessary to discretise them (convert them from real to nominal values) for the attribute selected as output (attribute to be predicted). SPAMDA allows researchers to perform this process by defining the necessary classes with their thresholds, which will be used to carry out such discretisation.
Considering again the matched information shown in Figure 8a, an example of the creation of a Classification dataset is shown in Figure 10. The options considered for the preparation of the final dataset are the following:
− Do not synchronise the reanalysis data.
− A prediction horizon of 6 h.
− The thresholds shown in Table 2.
Note that the attribute to be predicted has been renamed to Class_WVHT to show that it is now a nominal variable because its values have been discretised according to the thresholds (usually defined by an expert). In addition, and due to the 6 h prediction horizon, the last instance is lost (31 December 2017 18:00), and the values of the attribute Class_WVHT are moved backward one instance (up). As the reanalysis data have not been synchronised, the values of the Pres and WSPD variables are at the same time instant (t in Equation (4)).
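The effect of the prediction horizon and of the optional synchronisation (Equations (4) and (5)) can be sketched as follows. This is a minimal illustration with hypothetical values, not the implementation of SPAMDA; the function name `build_instances` and the use of plain lists ordered in time (one element every 6 h) are assumptions.

```python
def build_instances(buoy_inputs, reanalysis_inputs, output, steps=1, synchronise=False):
    """Build (b_t, r, o_{t + steps}) instances from time-ordered series.

    steps is the prediction horizon expressed in 6 h increments. Without
    synchronisation r is r_t (Equation (4)); with it, r is r_{t + steps}
    (Equation (5)). The last `steps` instances are lost in both cases.
    """
    if steps < 1:
        raise ValueError("the minimum prediction horizon is one 6 h step")
    instances = []
    for t in range(len(output) - steps):
        r = reanalysis_inputs[t + steps] if synchronise else reanalysis_inputs[t]
        instances.append((buoy_inputs[t], r, output[t + steps]))
    return instances

# Hypothetical matched data at 00:00, 06:00, 12:00 and 18:00.
wspd = [5.0, 6.0, 7.0, 8.0]          # buoy input (b)
pres = [1000.0, 1001.0, 1002.0, 1003.0]  # reanalysis input (r)
wvht = [1.0, 1.5, 2.0, 2.5]          # attribute to predict (o)

plain = build_instances(wspd, pres, wvht, steps=1)
synced = build_instances(wspd, pres, wvht, steps=1, synchronise=True)
print(plain)
print(synced)
```

In both outputs the instance for 18:00 is missing, because its future value of the output attribute is not available.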

The content of the final datasets, obtained as the result of the preparation of the matched data, can be visualised to check everything before saving them on disk. Such preparation can be performed as many times as required and considering the different options in each moment. Although the date will not be included in the final datasets, it can be shown to properly check the matching.

Finally, it is necessary to define the output configuration to create the final datasets:

Output path file: Name of the final datasets and folder to save them on disk.
Final datasets format:
− ARFF: Attribute-Relation File Format [53], which is used by WEKA. SPAMDA allows researchers to directly open the final datasets in the Explorer environment of WEKA (in the same context of work), enabling them to choose the most appropriate ML method to tackle the problem under study.
− CSV: Comma-Separated Values. This format is included so that the final datasets can also be used with other software tools or for other tasks.
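To make the ARFF output more concrete, the following minimal sketch serialises a small dataset with numeric inputs, a nominal output and a missing value. The function `to_arff`, the relation name and the class labels are hypothetical and chosen only for illustration; the files written by SPAMDA may differ in detail.

```python
def to_arff(relation, attributes, rows):
    """Serialise rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of nominal labels. None is written as "?",
    the missing-value marker used by WEKA.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

attributes = [("Pres", "numeric"), ("WSPD", "numeric"), ("Class_WVHT", ["low", "high"])]
rows = [(101325.0, 7.2, "low"), (101300.0, None, "high")]
text = to_arff("wvht_6h", attributes, rows)
print(text)
```

The second instance shows how a buoy measurement that could not be matched is kept in the dataset as a missing value.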

A text file that summarises the configuration used in the matching process and in the preparation of the matched data is also generated. It can be saved and loaded, enabling researchers to resume their studies at any other time.

3.6. Manage Reanalysis Data

As mentioned in Section 2, the reanalysis data files provided by NNRP contain the estimated values by a mathematical model of one meteorological variable.

In this module (see Figure 3), SPAMDA includes features for entering new files and deleting the unnecessary ones. In addition, useful information about the content of each reanalysis file can be consulted such as name of the file and the reanalysis variable, number of instances and reanalysis nodes, initial and final time, latitude and longitude. All of these fields summarise the temporal and spatial properties of the data. Thus, researchers can quickly and easily identify each reanalysis file entered in SPAMDA.

An example where two reanalysis data files have been entered in SPAMDA is shown in Figure 11.

3.7. Tools

SPAMDA also contains another module that provides two utilities: the Dataset converter, used for converting the desired intermediate or pre-processed datasets to ARFF or CSV formats, and a utility for opening ARFF files with the WEKA Explorer environment, which is useful for easily checking the results of different pre-processing configurations.

4. A Case Study Applied to Gulf of Alaska

This section describes how SPAMDA works in a practical approach showing two examples to create fully processed datasets (final datasets) starting from the raw data. The objective of these final datasets is to be used with SC and ML algorithms for environmental modelling, in this case, to classify waves depending on their height and to predict energy flux in the Gulf of Alaska.

On the one hand, wave classification is addressed as a multi-class approach, given that a continuous attribute can be discretised, using different thresholds, in distinct classes. Such wave modelling can be applied with different purposes, such as missing buoy data reconstruction, extreme significant wave heights detection or decision-making and risk assessment about operational works in the sea.

On the other hand, the prediction of the energy flux is addressed as a regression problem. Energy flux prediction is related to marine energy, and it is useful to characterise the wave energy production from WEC facilities, which could be injected into the electric network or supplied to existing marine platforms.

4.1. Gathering the Information and Introducing it in SPAMDA

The data collected to perform this case study are:

The measurements obtained from 2013 to 2017 by the buoy with ID 46001, placed in the Gulf of Alaska, which are provided by NDBC as annual text files. These data are publicly available at the NDBC website.
Complementary information collected from reanalysis data containing air temperature (air), pressure (pres) and two components of wind speed measurements, South–North (vwind) and West–East (uwind). This information can be downloaded from the NNRP website in NetCDF format for the four closest nodes of reanalysis surrounding the position of the buoy. Concretely, the closest reanalysis nodes downloaded are $57.5$ N × $147.5$ W, $57.5$ N × 150 W, 55 N × $147.5$ W, and 55 N × 150 W. However, as will be seen later, only the information from the nearest node will be used in the data integration process.

After gathering the information described above, researchers can open SPAMDA. In Figure 12, the main view is shown. In order to input the reanalysis data which will be used in further steps for creating the final dataset, researchers have to select the option Manage reanalysis data.

Then, the view of Figure 13 is shown. Here, using the buttons located at the bottom, it is possible to add, delete, or consult any data from the different reanalysis files. Once the information has been introduced in the application, this view can be closed and the user can go back to the main view to continue entering the information related to the buoy under study.

After that, the researcher has to select Manage buoys data to open the view shown in Figure 14, where several tabs are available. In the Buoys tab, the researcher can consult, modify, add, or delete different data related to the buoy.

In order to enter such data, click on the New button, and then the view shown in Figure 15 pops up.

Here, the information about the buoy has to be included: the Station ID, its description, geographical localisation, and the corresponding annual text files. In this case, the files containing the data from year 2013 to 2017 are inserted by clicking on the Add file button. Once the data have been introduced, it is necessary to click on the Save button to insert the buoy in SPAMDA database. After that, the view can be closed.

To create the intermediate dataset, the researcher has to double-click on the buoy under study or click on the Datasets tab (see Figure 14) to switch to the corresponding view (see Figure 16). In this view, the researcher can delete or consult a summary of each intermediate or pre-processed dataset by selecting it from the corresponding list, and can also create new ones. To proceed with the creation of the intermediate dataset, the user clicks on the New button, and the view shown in Figure 17 appears.

Here, the researcher can select the annual text files to be included in the intermediate dataset, by clicking on the -> and <- buttons. In this case, all the files introduced before, which correspond to the buoy under study, are selected. When the file selection is finished, the Create button has to be clicked in order to introduce the description and the file name of the current intermediate dataset. Then, with the Save button, the creation process starts, and its status is shown while it runs. After that, in order to prepare the intermediate dataset, the dataset is selected (see Figure 16), and then the Open button is clicked to jump to the Pre-process tab (shown in Figure 18).

In the Pre-process tab, relevant statistical information about the selected dataset is shown, and the content of the dataset can also be consulted, giving the researcher the ability to evaluate the pre-processing being performed. Here, the researcher can apply (and configure) the necessary filters (explained in Section 3.3) to the selected dataset, and, in the bottom part, the main statistics of the dataset are displayed, which can be used to observe the changes produced when applying a filter. As mentioned earlier, this case study is focused on classifying waves considering their height, so any missing data from wave height (376 values) and the remaining attributes are recovered, using the filter Replace missing values with symmetric 3 h mean. Furthermore, the attributes MWD, DEWP, VIS and TIDE are removed from the dataset by applying the filter RemoveByName, since the first two had more than 92% of missing data and the last two 100%. After finishing the pre-processing of the dataset, the researcher can click on the Save button, to introduce the description and file name for the current pre-processed dataset.
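The two pre-processing decisions taken above (recovering missing values and removing attributes that are mostly missing) can be sketched as follows. This is only an approximation under explicit assumptions: the filter is interpreted here as replacing each missing value with the mean of the measurements available within 3 hourly steps before and after it, which may differ from the exact definition given in Section 3.3; the function names and the sample values are hypothetical.

```python
def fill_symmetric_mean(values, window=3):
    """Replace None by the mean of the available neighbours within +/- window steps.

    Assumed interpretation of a 'symmetric 3 h mean' for hourly data. Values
    with no available neighbour in the window are left as None.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbours = [x for x in values[lo:hi] if x is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled

def missing_fraction(values):
    """Fraction of missing measurements of one attribute."""
    return sum(v is None for v in values) / len(values)

wvht = [1.0, 1.2, None, 1.6, 1.8]   # hypothetical hourly wave heights (m)
tide = [None, None, None, None]     # attribute that is never recorded

recovered = fill_symmetric_mean(wvht)
to_remove = [name for name, series in {"WVHT": wvht, "TIDE": tide}.items()
             if missing_fraction(series) > 0.92]
print(recovered, to_remove)
```

Only the original measurements are used as neighbours, so a recovered value never feeds into the recovery of another one.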

At this point, the researcher has registered the buoy in SPAMDA, then entered its raw data and selected the required data for the problem (intermediate dataset). Finally, the data have been pre-processed in order to be ready for their future use in ML algorithms. Then, a data integration process can be carried out to merge the processed data from NDBC with the reanalysis data (also included previously) from NNRP.

The next step is to customise (or load) the parameters of the matching process according to the problem being studied and to select the prediction task (described in Section 3.4) that the final dataset will be used for, in this case, waves classification or energy flux prediction.

4.2. Waves Classification

As mentioned above, the objective of the final dataset is to be used with SC and ML algorithms to classify waves depending on their significant height. The following sections describe the procedure of performing the data integration provided by SPAMDA, modelling wave height by using classification algorithms available in WEKA.

4.2.1. Obtaining the Final Dataset

By clicking on the Matching configuration tab, the view shown in Figure 19 will be opened. In this view, the researcher can configure the parameters of the data integration process. For this problem, the following parameters were selected:

Attribute to predict: WVHT.
Reanalysis data: Air, pressure, u-wind and v-wind.
Buoy attributes to be used as inputs: WDIR, WSPD, GST, DPD, APD, PRES, ATMP and WTMP (see Table 1 or descriptions of the measurements in the NDBC website).
Reanalysis nodes to consider: 1 (only the closest reanalysis node will be used).
Number of final datasets: In this example, that option is disabled because only one reanalysis node is considered.
Prediction task: Classification.

After configuring the matching process, the researcher can click on the Run button to jump to the view shown in Figure 20 and proceed to define the final dataset structure according to the selected prediction task. Given that, in the previous view (Figure 19), Classification was selected, the researcher can now add, modify, or delete the thresholds (usually defined by an expert) for discretising the output variable (top left of Figure 20). After this, the next step is to set the time horizon desired (6 h by default) and also to activate (if desired) the synchronisation (in time) of reanalysis variables with the output (top right of Figure 20), as explained in Section 3.5. Then, the researcher can click on the Update final dataset button to see the content shown in the bottom left corner (NDBC observations, NNRP variables, missing values, dates). Finally, after checking that everything is correct, the last step would be to select the name and path of the dataset file, and its output format (CSV or ARFF) and click on the Create final datasets button (bottom right of Figure 20). For this example, the following configuration was applied:

Thresholds: see Table 2.
Prediction horizon: 6 h.
Synchronisation: Disabled.
Final dataset format: ARFF.
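The discretisation of the output attribute can be sketched as follows. The thresholds and class labels used here are hypothetical placeholders chosen only for illustration (the values actually applied in this example are those of Table 2), and the function name `discretise` is an assumption.

```python
import bisect

def discretise(value, upper_bounds, labels):
    """Map a real value to a nominal class.

    upper_bounds must be sorted in increasing order and labels must have one
    more entry than upper_bounds. Class i covers the values that are below
    upper_bounds[i] and not below the previous bound.
    """
    if len(labels) != len(upper_bounds) + 1:
        raise ValueError("labels must have one more entry than upper_bounds")
    return labels[bisect.bisect_right(upper_bounds, value)]

# Hypothetical thresholds (m) and labels, not the ones of Table 2.
BOUNDS = [1.5, 2.5, 4.0]
LABELS = ["low", "average", "big", "huge"]

classes = [discretise(h, BOUNDS, LABELS) for h in (1.0, 1.5, 3.0, 5.0)]
print(classes)
```

A value equal to a threshold is assigned to the upper class in this sketch; the opposite convention would only require replacing `bisect_right` with `bisect_left`.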

At this point, the final dataset is created according to the tailored configuration and stored on the researcher's computer, and the researcher can then apply ML techniques to address the problem of wave classification. Specifically, the final dataset consists of 7302 instances, whose distribution is shown in Table 3.

4.2.2. Obtaining Classification Models with ML Algorithms

Now, the process to obtain wave classification models is described using the final dataset previously created with SPAMDA. The modelling will be performed using WEKA as the SC and ML tool, which can be opened through SPAMDA, as shown in Figure 21. Nevertheless, as mentioned above, the researcher can create the final dataset in CSV format in order to use any other ML tools, such as KEEL, Python, or R, among others.

Since the final dataset is a time series of meteorological data (collected from 2013 to 2017), a hold-out scheme (60% train/40% test) will be used. In this way, the years from 2013 to 2015 will be used for the training phase (4380 instances), whereas the years 2016 and 2017 will be used for the test phase (2922 instances). Prior to the learning phase, the attributes are normalised to prevent attributes with a larger scale from dominating the others.
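For researchers working with the CSV output outside WEKA, the chronological hold-out and the normalisation can be sketched as follows. This is a minimal illustration under assumptions: the normalisation is taken to be a min-max scaling to the range [0, 1], the scaling ranges are estimated on the training years only, and each instance is represented as a dictionary with a hypothetical `year` field.

```python
def split_by_year(rows, train_years):
    """Chronological hold-out: instances of train_years go to train, the rest to test."""
    train = [r for r in rows if r["year"] in train_years]
    test = [r for r in rows if r["year"] not in train_years]
    return train, test

def min_max_fit(rows, keys):
    """Estimate the (min, max) range of each attribute on the given rows."""
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in keys}

def min_max_apply(rows, ranges):
    """Scale the attributes with previously estimated ranges."""
    scaled_rows = []
    for r in rows:
        scaled = dict(r)
        for k, (lo, hi) in ranges.items():
            scaled[k] = 0.0 if hi == lo else (r[k] - lo) / (hi - lo)
        scaled_rows.append(scaled)
    return scaled_rows

# Hypothetical instances, one per year for brevity.
rows = [{"year": y, "WSPD": v}
        for y, v in [(2013, 2.0), (2014, 6.0), (2015, 10.0), (2016, 4.0), (2017, 12.0)]]

train, test = split_by_year(rows, {2013, 2014, 2015})
ranges = min_max_fit(train, ["WSPD"])
train_n, test_n = min_max_apply(train, ranges), min_max_apply(test, ranges)
print([r["WSPD"] for r in train_n], [r["WSPD"] for r in test_n])
```

Estimating the ranges on the training years keeps information from the test years out of the learning phase, at the cost of test values that may fall slightly outside [0, 1].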

The classification algorithms that will be considered for wave modelling are Logistic Regression [64], C4.5 [65], Random Forest [66], Support Vector Machine [67] and Multilayer Perceptron [68], which will be applied with the default values of the parameters provided by WEKA. Given that Logistic Regression and C4.5 algorithms are deterministic, only one run will be considered for each one. However, Random Forest, Support Vector Machine and Multilayer Perceptron algorithms have a stochastic component, so, in this case, 30 executions for each one will be carried out. Table 4 shows the results of this experimentation.

As can be seen, Random Forest and Multilayer Perceptron algorithms have achieved similar accuracy, but the performance of the latter is slightly better. Although this is an illustrative classification example using datasets built with SPAMDA, both models have obtained good performance, despite the fact that the problem being tackled is difficult (prediction is approached six hours in advance).

4.3. Energy Flux Prediction

As mentioned above, the final dataset of this example is also used with SC and ML algorithms to predict flux of energy. The following sections explain the process of performing the data integration provided by SPAMDA to build the final dataset, modelling the flux of energy by using regression algorithms available in WEKA.

4.3.1. Obtaining the Final Dataset

The researcher can configure the parameters of the data integration by clicking on the Matching configuration tab. For this problem, as shown in Figure 22, the following parameters were selected:

Attribute to predict: Flux of energy.
Reanalysis data: Air, pressure, u-wind and v-wind.
Buoy attributes to be used as inputs: WDIR, WSPD, GST, DPD, APD, PRES, ATMP and WTMP (see Table 1 or descriptions of the measurements in the NDBC website).
Reanalysis nodes to consider: 1 (only the closest reanalysis node will be used).
Number of final datasets: In this example, this option is disabled because only one reanalysis node is considered.
Prediction task: Regression.
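When Flux of energy is selected as the attribute to predict, the output is derived from the WVHT and APD attributes of each instance by means of Equation (1). A minimal sketch of that computation is given below; the function name `energy_flux` and the handling of missing measurements with `None` are assumptions made for the example.

```python
def energy_flux(hs_m, te_s):
    """Average energy flux in kW/m from H_s (m) and T_e (s), following Equation (1).

    Returns None when any of the two measurements is missing.
    """
    if hs_m is None or te_s is None:
        return None
    return 0.49 * hs_m ** 2 * te_s

# Hypothetical buoy instance: WVHT = 2.0 m and APD = 8.0 s.
fe = energy_flux(2.0, 8.0)
print(fe)  # approximately 15.68 kW/m
```

Because the wave height enters the equation squared, errors in WVHT have a larger effect on the computed flux than errors in APD.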

After configuring the parameters of the matching process, the next step is to define the final dataset structure according to the selected prediction task. Researchers can click on the Run button to jump to the view shown in Figure 23. Note that the thresholds for discretising the output variable (top left of Figure 23) are disabled because, in this case, energy flux prediction is a regression problem.

By default, the time horizon is set to six hours, that is, the energy flux prediction will be performed six hours in advance (top right of Figure 23), but researchers can increase such time horizon depending on their needs. The synchronisation (in time) of reanalysis variables with the output (explained in Section 3.5) can be set in this view. By clicking on the Update final dataset button, researchers can preview the content of the final dataset (bottom left corner of Figure 23). Finally, the last step would be to set the name, path and output format (CSV or ARFF) of the dataset file, and then the user should click on the Create final datasets button (bottom right of Figure 23). For this example, the following configuration was applied:

Prediction horizon: 6 h.
Synchronisation: Disabled.
Final dataset format: ARFF.

After that, the final dataset would be created and stored in the computer of the researcher according to the introduced configuration, ready to be used as input for SC and ML techniques to tackle the problem of energy flux prediction. The number of instances (7302) and the distribution of the final dataset (Table 3) are the same as in the previous example (waves classification) since the data used to create the final dataset and the time horizon selected (6 h) are the same.

4.3.2. Obtaining Prediction Models with ML Algorithms

In this example, WEKA is used as the SC and ML tool to obtain energy flux prediction models, as shown in Figure 24. Nonetheless, the final dataset can be created in CSV format so that the researcher can use any other SC and ML tool.

For this problem, the same partitioning scheme used in the wave classification problem is considered (60% train/40% test), that is, from 2013 to 2015 for the training phase (4380 instances) and 2016 and 2017 for the test phase (2922 instances). Again, the attributes are normalised prior to the learning phase.

To perform the energy flux modelling, one execution will be run for the deterministic algorithm Linear Regression [36], whereas 30 executions will be considered for the stochastic ones: Random Forest [66], Support Vector Machine [67] and Multilayer Perceptron [68]. Table 5 shows the experimental results obtained using the default values for the parameters of the algorithms provided by WEKA.

As can be seen, Random Forest has achieved the best performance for the Root mean squared error. The standard deviation of the results obtained by Multilayer Perceptron indicates that this algorithm may have been slightly affected by its stochastic component. However, both Multilayer Perceptron and Random Forest have obtained an excellent Correlation coefficient.
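The two measures discussed above can be reproduced outside WEKA with a few lines of code. The sketch below assumes that the Correlation coefficient is the Pearson correlation between observed and predicted values; the function names and the sample values are hypothetical and do not correspond to the results of Table 5.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical energy flux values (kW/m): observed and predicted.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

error = rmse(observed, predicted)
correlation = pearson_r(observed, predicted)
print(error, correlation)
```

The two measures are complementary: the Root mean squared error is expressed in the units of the output attribute, whereas the Correlation coefficient is scale-free and only reflects how well the predictions follow the observed trend.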

In this case study, the use of datasets created with SPAMDA has been shown to address an energy flux prediction problem. An exhaustive comparison of regression algorithms is not the purpose of this work. However, note that Multilayer Perceptron and Random Forest algorithms have achieved very good results despite the fact that the energy flux prediction has been performed with a time horizon of 6 h.

4.4. Important Remarks

In this section, we have described how to use SPAMDA to create final datasets for classifying waves and predicting energy flux. However, using the same data described in Section 4.1, the researcher can quickly address other objectives or different studies by merely tailoring the matching configuration of the data integration process. For example, longer-term wave or energy flux prediction can be addressed by changing the time horizon, wave modelling can be approached from another perspective by creating the final dataset for regression, or environmental modelling can be focused on diverse fields by changing the output meteorological variable.

Furthermore, environmental modelling in other geographical locations can be carried out by merely using other collected data.

As SPAMDA performs all data processing and management to create the datasets, it not only prevents researchers from performing repetitive tasks but also prevents them from making possible errors. In this way, researchers can focus on the studies they are carrying out.
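The effect of changing the time horizon mentioned above can be illustrated with a short Python sketch that pairs the inputs available at time $t$ with the output observed at $t + \Delta t$. This is a simplified assumption about the pairing, not SPAMDA's actual code; `build_dataset` and the toy values are hypothetical.

```python
from datetime import datetime, timedelta

def build_dataset(inputs, targets, horizon):
    """Pair the features at time t with the target observed at t + horizon.

    inputs:  dict mapping timestamp -> list of predictive variables
    targets: dict mapping timestamp -> value of the variable to be predicted
    Instants whose future target is unavailable are skipped.
    """
    dataset = []
    for t in sorted(inputs):
        future = t + horizon
        if future in targets:
            dataset.append((t, inputs[t], targets[future]))
    return dataset

# Toy series with a 6 h temporal resolution (values are made up).
start = datetime(2017, 5, 1, 0, 0)
step = timedelta(hours=6)
heights = [1.2, 1.5, 1.9, 2.4]
fluxes = [10.0, 14.0, 22.0, 35.0]
inputs = {start + k * step: [h] for k, h in enumerate(heights)}
targets = {start + k * step: f for k, f in enumerate(fluxes)}

six_hours = build_dataset(inputs, targets, timedelta(hours=6))
twelve_hours = build_dataset(inputs, targets, timedelta(hours=12))
```

Only the `horizon` argument changes between the two calls; a longer horizon leaves fewer usable instances at the end of the series.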

5. Conclusions

Studies on marine energy using ML and SC methodologies apply specific algorithms (extreme learning machine, metaheuristics, Bayesian networks, neural networks, etc.) on data using custom-made implementations or scripts developed in some programming language; but they do not allow for building datasets in an automated way ready to be used as input for prediction tasks (classification or regression). In this sense, a new open source tool named SPAMDA has been presented in this work, with a user-friendly GUI for creating datasets using meteorological data from NDBC and NNRP. The aim of the tool is to provide the research community with an automated, customisable and robust integration for NDBC and NNRP data, serving as a tool for analysis and decision support in marine energy and engineering applications, among others.

Such datasets can be easily obtained with SPAMDA by selecting different input parameters, such as predictive and objective variables, output discretisation or prediction horizon. As a result, researchers will benefit from significant support when carrying out environmental modelling related to energy, atmospheric or oceanic studies, among others. Moreover, given that SPAMDA simplifies all the intermediate steps involved in the creation of datasets and manages the extensive casuistry of the data integration (such as specifying the meteorological information, managing incomplete data, pre-processing tasks, the customisable matching process to merge the data and the preparation of the datasets according to the SC or ML technique to use), it avoids errors and reduces the time needed. In this way, researchers will be able to carry out more in-depth analyses, which could result in more complete conclusions about the issue under study.

The case study described in Section 4 illustrates how SPAMDA can be used by researchers in a practical approach for environmental modelling, specifically, to classify waves in the Gulf of Alaska depending on their height. The case study also covers an example of energy flux prediction, to predict the wave energy that could be exploited by WEC facilities six hours in advance, although this time horizon is customisable. Given that this work does not focus on model performance, a more extensive validation or comparison study of the results obtained in both examples has not been carried out. The final datasets obtained with SPAMDA can be replicated by researchers using the same meteorological data from NDBC and NNRP (publicly available) and applying the same parameters for the pre-processing tasks and the data integration process. After that, the models and results obtained using such final datasets will depend on the SC or ML tool used.

In order to improve SPAMDA, future work could focus on: new functional modules for managing meteorological data of different formats [69], so that the developed tool can be extended to any other research; new pre-processing functionalities, such as filters to analyse the correlation between attributes; or new functional modules for recovering missing values using data from nearby buoys [70]. Furthermore, the developed software could manage other sources of reanalysis data (with different spatial and temporal resolution), and new output formats for the datasets, which could be used as input by other ML tools such as KEEL (Knowledge Extraction based on Evolutionary Learning) [71]. However, such new functionalities can be developed with a reasonable effort to manage each particular casuistry, for example, when dealing with incomplete data, interpreting different data and file structures or carrying out the matching process of two environmental data sources.

Supplementary Materials

The source code and the software tool are available at https://github.com/ayrna.

Author Contributions

Conceptualization, Formal analysis and Investigation, A.M.G.-O., J.C.F., M.D.-M., P.A.G. and C.H.-M.; Funding acquisition, Project administration, Resources and Supervision, P.A.G. and C.H.-M.; Methodology, A.M.G.-O., J.C.F., M.D.-M., P.A.G. and C.H.-M.; Software, A.M.G.-O., J.C.F. and M.D.-M.; Validation, A.M.G.-O., J.C.F., M.D.-M., P.A.G. and C.H.-M.; Writing—original draft, A.M.G.-O., J.C.F. and M.D.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially subsidised by the projects with references TIN2017-85887-C2-1-P of the Spanish Ministry of Economy and Competitiveness (MINECO), UCO-1261651 of the “Consejería de Economía, Conocimiento, Empresas y Universidad” of the “Junta de Andalucía” (Spain) and FEDER funds of the European Union.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at the National Data Buoy Center (NDBC) and the NOAA Physical Sciences Laboratory (PSL).

Acknowledgments

The authors thank NOAA/OAR/ESRL PSD, Boulder, CO, USA, for the NCEP Reanalysis data provided from their website at https://www.esrl.noaa.gov/psd/; NOAA/NDBC for the data that were collected and made freely available; the University of Waikato for the WEKA (Waikato Environment for Knowledge Analysis) software tool; the University Corporation for Atmospheric Research/Unidata for the NetCDF (network Common Data Form) Java library; and QOS.ch for the SLF4J (Simple Logging Facade for Java) library.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

Abbreviations used in this manuscript:

$F_{e}$	Flux of energy
$H_{s}$	Significant wave height
$T_{e}$	Wave energy period
$p_{0}$	Geographical location of the buoy
$p_{j}$	Geographical location of each reanalysis node
$lat$	Latitude of the point
$lon$	Longitude of the point
$o_{t}$	The attribute to be predicted at the time instant to study
$\Delta t$	The prediction horizon
$b_{t}$	The vector containing the selected NDBC variables
$r_{t}$	The vector containing the selected reanalysis variables

Appendix A. Managing the Casuistry of Incomplete Data

In this appendix, we describe how SPAMDA deals with incomplete data when creating intermediate datasets and performing the matching process.

The measurements collected by the buoys may be incomplete or recorded at a different time than the expected one, due to the weather conditions in which the buoys have to operate. To illustrate this casuistry, the following examples are shown in Figure A1:

In the instance marked with (a), the measurement of 17:50 was collected at 17:45, 5 min earlier.
In the instance marked with (b), the measurement of 23:50 was collected at 23:30, 20 min earlier.
In the instance marked with (c), the measurement of 05:50 is duplicated.
In the instance marked with (d), the measurement of 11:50 is missing (missing date or instance).
In the instance marked with (e), the measurements of 17:50 and 18:50 are missing (missing dates or instances).
Missing values are highlighted in red.

Figure A1. A fragment of an annual text file with different missing value examples.

SPAMDA has been designed to tackle these situations, and it informs researchers of any incident found while reading the annual text files for creating the intermediate datasets. For measurements that were recorded at a different time than expected, a time gap of 6 min ($10\%$ of an hour) has been established. Therefore, if the time difference exceeds this value, the date will be considered unexpected.
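The 6 min rule can be expressed as a small Python check. This is a sketch under stated assumptions: `is_unexpected` is a hypothetical helper, the comparison is taken to be strict (a difference of exactly 6 min is still accepted), and the dates are chosen only for illustration.

```python
from datetime import datetime, timedelta

# Tolerance stated in the text: 6 min, i.e. 10% of an hour.
TOLERANCE = timedelta(minutes=6)

def is_unexpected(recorded, expected, tolerance=TOLERANCE):
    """Return True when the recording time deviates from the expected one by more than the tolerance."""
    return abs(recorded - expected) > tolerance

# Cases (a) and (b) of Figure A1; the calendar dates are hypothetical.
case_a = is_unexpected(datetime(2017, 4, 30, 17, 45), datetime(2017, 4, 30, 17, 50))
case_b = is_unexpected(datetime(2017, 4, 30, 23, 30), datetime(2017, 4, 30, 23, 50))
print(case_a, case_b)
```

Under this rule, case (a), recorded 5 min early, is accepted, whereas case (b), recorded 20 min early, is flagged as an unexpected date.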

Figure A2 shows the status of the creation of an intermediate dataset with the information of Figure A1. Note that the instance marked with (a) has not been reported by SPAMDA as an unexpected date because its time difference is less than six minutes. Depending on the affected attribute, NDBC uses a specific value [69] to indicate the presence of lost data (e.g., 99 for VIS and TIDE attributes, 999 for DEWP, MWD and WDIR, etc.). SPAMDA interprets these specific values and, after creating the intermediate dataset, researchers can check if it contains missing values by visualising its statistical information or content. Remember that SPAMDA provides several filters for recovering missing data, which were described in Section 3.2.
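The interpretation of these attribute-specific codes can be sketched in Python as follows. The mapping contains only the codes quoted above; the codes of the remaining attributes are deliberately left out, and `decode_missing` is a hypothetical helper rather than part of SPAMDA.

```python
# Missing-data codes quoted in the text; other attributes are intentionally omitted.
MISSING_CODES = {
    "VIS": 99.0,
    "TIDE": 99.0,
    "DEWP": 999.0,
    "MWD": 999.0,
    "WDIR": 999.0,
}

def decode_missing(record, codes=MISSING_CODES):
    """Replace attribute-specific missing-data codes with None."""
    return {
        name: None if name in codes and value == codes[name] else value
        for name, value in record.items()
    }

# Toy measurement (values are made up).
raw = {"WDIR": 999.0, "VIS": 99.0, "DEWP": 4.2, "WTMP": 7.5}
clean = decode_missing(raw)
print(clean)
```

Keeping the codes per attribute matters because the same number can be a valid reading for one variable and a missing-data marker for another.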

Figure A2. Status of the creation of the intermediate dataset for the example of Figure A1.

SPAMDA takes into account this casuistry when carrying out the matching process. An example is given in Figure A3. As mentioned above, the matching process is performed with the nearest measurement (previous or next) within a maximum of 60 min of difference. However, in the instance marked with (e), given that the measurement dates 1 May 2017 17:50 and 1 May 2017 18:50 are missing, the reanalysis date 1 May 2017 18:00 cannot be matched with buoy data (this date is highlighted in mauve in Figure A3). Depending on the selection made by researchers in the parameter Include missing dates, this instance will be included in the final dataset (with missing values for buoy variables) or not.

Figure A3. Matching the measurements (left) and the reanalysis data (right).
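The matching rule described above can be approximated by the following Python sketch. It is an illustration, not SPAMDA's implementation: `match_nearest` and its `include_missing` flag (mirroring the Include missing dates parameter) are hypothetical names, the 60 min limit is assumed to be inclusive, and ties are resolved in favour of the previous measurement.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def match_nearest(reanalysis_times, buoy_times,
                  max_gap=timedelta(minutes=60), include_missing=True):
    """Match each reanalysis date with the nearest buoy measurement date.

    Returns (reanalysis_time, buoy_time) pairs; buoy_time is None when no
    measurement lies within max_gap and include_missing is True.
    """
    buoy_sorted = sorted(buoy_times)
    matches = []
    for t in reanalysis_times:
        i = bisect_left(buoy_sorted, t)
        # Candidates: the previous and the next measurement around t.
        candidates = buoy_sorted[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda b: abs(b - t), default=None)
        if best is not None and abs(best - t) <= max_gap:
            matches.append((t, best))
        elif include_missing:
            matches.append((t, None))
    return matches

# Buoy measurements on 1 May 2017 with 17:50 and 18:50 missing, as in case (e).
day = datetime(2017, 5, 1)
buoy = [day.replace(hour=h, minute=50) for h in (11, 12, 13, 14, 15, 16, 19)]
reanalysis = [day.replace(hour=12), day.replace(hour=18)]

with_missing = match_nearest(reanalysis, buoy, include_missing=True)
without_missing = match_nearest(reanalysis, buoy, include_missing=False)
```

In this example, the 12:00 reanalysis date is matched with the 11:50 measurement, whereas the 18:00 date has no measurement within 60 min and is either kept with missing buoy values or dropped.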

References

[1] Anis, M.S.; Jamil, B.; Ansari, M.A.; Bellos, E. Generalized models for estimation of global solar radiation based on sunshine duration and detailed comparison with the existing: A case study for India. Sustain. Energy Technol. Assess. 2019, 31, 179–198.
[2] Laface, V.; Arena, F.; Soares, C.G. Directional analysis of sea storms. Ocean Eng. 2015, 107, 45–53.
[3] Shivam, K.; Tzou, J.C.; Wu, S.C. Multi-Objective Sizing Optimization of a Grid-Connected Solar–Wind Hybrid System Using Climate Classification: A Case Study of Four Locations in Southern Taiwan. Energies 2020, 13, 2505.
[4] Dorado-Moreno, M.; Cornejo-Bueno, L.; Gutiérrez, P.A.; Prieto, L.; Hervás-Martínez, C.; Salcedo-Sanz, S. Robust estimation of wind power ramp events with reservoir computing. Renew. Energy 2017, 111, 428–437.
[5] He, Q.; Zha, C.; Song, W.; Hao, Z.; Du, Y.; Liotta, A.; Perra, C. Improved Particle Swarm Optimization for Sea Surface Temperature Prediction. Energies 2020, 13, 1369.
[6] Fuchs, H.L.; Gerbi, G.P. Seascape-level variation in turbulence- and wave-generated hydrodynamic signals experienced by plankton. Prog. Oceanogr. 2016, 141, 109–129.
[7] Da Silva, V.D.P.R.; Araújo e Silva, R.; Cavalcanti, E.P.; Braga Campos, C.; Vieira de Azevedo, P.; Singh, V.P.; Rodrigues Pereira, E.R. Trends in solar radiation in NCEP/NCAR database and measurements in northeastern Brazil. Sol. Energy 2010, 84, 1852–1862.
[8] Gouldby, B.; Méndez, F.J.; Guanche, Y.; Rueda, A.; Mínguez, R. A methodology for deriving extreme nearshore sea conditions for structural design and flood risk analysis. Coast. Eng. 2014, 88, 15–26.
[9] Alizadeh, R.; Jia, L.; Nellippallil, A.B.; Wang, G.; Hao, J.; Allen, J.K.; Mistree, F. Ensemble of surrogates and cross-validation for rapid and accurate predictions using small data sets. Artif. Intell. Eng. Des. Anal. Manuf. 2019, 33, 484–501.
[10] Alizadeh, R.; Allen, J.K.; Mistree, F. Managing computational complexity using surrogate models: A critical review. Res. Eng. Des. 2020, 31, 275–298.
[11] Manfren, M.; Groppi, D.; Astiaso Garcia, D. Open data and energy analytics—An analysis of essential information for energy system planning, design and operation. Energies 2020, 13, 2334.
[12] Dhanraj Bokde, N.; Mundher Yaseen, Z.; Bruun Andersen, G. ForecastTB—An R Package as a Test-Bench for Time Series Forecasting—Application of Wind Speed and Solar Radiation Modeling. Energies 2020, 13, 2578.
[13] Lo, C.K.; Lim, Y.S.; Rahman, F.A. New integrated simulation tool for the optimum design of bifacial solar panel with reflectors on a specific site. Renew. Energy 2015, 81, 293–307.
[14] Nguyen, T.H.; Prinz, A.; Friisø, T.; Nossum, R.; Tyapin, I. A framework for data integration of offshore wind farms. Renew. Energy 2013, 60, 150–161.
[15] Di Bari, R.; Horn, R.; Nienborg, B.; Klinker, F.; Kieseritzky, E.; Pawelz, F. The Environmental Potential of Phase Change Materials in Building Applications. A Multiple Case Investigation Based on Life Cycle Assessment and Building Simulation. Energies 2020, 13, 3045.
[16] Astiaso Garcia, D.; Bruschi, D. A risk assessment tool for improving safety standards and emergency management in Italian onshore wind farms. Sustain. Energy Technol. Assess. 2016, 18, 48–58.
[17] Raabe, A.L.A.; Klein, A.H.d.F.; González, M.; Medina, R. MEPBAY and SMC: Software tools to support different operational levels of headland-bay beach in coastal engineering projects. Coast. Eng. 2010, 57, 213–226.
[18] Motahhir, S.; EL Hammoumi, A.; EL Ghzizal, A.; Derouich, A. Open hardware/software test bench for solar tracker with virtual instrumentation. Sustain. Energy Technol. Assess. 2019, 31, 9–16.
[19] Cascajo, R.; García, E.; Quiles, E.; Correcher, A.; Morant, F. Integration of Marine Wave Energy Converters into Seaports: A Case Study in the Port of Valencia. Energies 2019, 12, 787.
[20] Zeyringer, M.; Fais, B.; Keppo, I.; Price, J. The potential of marine energy technologies in the UK—Evaluation from a systems perspective. Renew. Energy 2018, 115, 1281–1293.
[21] De Jong, M.; Hoppe, T.; Noori, N. City Branding, Sustainable Urban Development and the Rentier State. How do Qatar, Abu Dhabi and Dubai present Themselves in the Age of Post Oil and Global Warming? Energies 2019, 12, 1657.
[22] Brede, M.; de Vries, B.J.M. The energy transition in a climate-constrained world: Regional vs. global optimization. Environ. Model. Softw. 2013, 44, 44–61.
[23] Alizadeh, R.; Lund, P.D.; Soltanisehat, L. Outlook on biofuels in future studies: A systematic literature review. Renew. Sustain. Energy Rev. 2020, 134, 110326.
[24] Falcão, A.F.D.O. Wave energy utilization: A review of the technologies. Renew. Sustain. Energy Rev. 2010, 14, 899–918.
[25] Amini, E.; Golbaz, D.; Amini, F.; Majidi Nezhad, M.; Neshat, M.; Astiaso Garcia, D. A Parametric Study of Wave Energy Converter Layouts in Real Wave Models. Energies 2020, 13, 6095.
[26] Oliveira-Pinto, S.; Rosa-Santos, P.; Taveira-Pinto, F. Electricity supply to offshore oil and gas platforms from renewable ocean wave energy: Overview and case study analysis. Energy Convers. Manag. 2019, 186, 556–569.
[27] Fernández Prieto, L.; Rodríguez Rodríguez, G.; Schallenberg Rodríguez, J. Wave energy to power a desalination plant in the north of Gran Canaria Island: Wave resource, socioeconomic and environmental assessment. J. Environ. Manag. 2019, 231, 546–551.
[28] Ochi, M.K. Ocean Waves: The Stochastic Approach; Cambridge Ocean Technology Series; Cambridge University Press: Cambridge, UK, 1998.
[29] Crowley, S.; Porter, R.; Taunton, D.J.; Wilson, P.A. Modelling of the WITT wave energy converter. Renew. Energy 2018, 115, 159–174.
[30] Abdelkhalik, O.; Robinett, R.; Zou, S.; Bacelli, G.; Coe, R.; Bull, D.; Wilson, D.; Korde, U. On the control design of wave energy converters with wave prediction. J. Ocean. Eng. Mar. Energy 2016, 2, 473–483.
[31] Ringwood, J.V.; Bacelli, G.; Fusco, F. Energy-Maximizing Control of Wave-Energy Converters: The Development of Control System Technology to Optimize Their Operation. IEEE Control Syst. 2014, 34, 30–55.
[32] Wei, C.C. Nearshore Wave Predictions Using Data Mining Techniques during Typhoons: A Case Study near Taiwan’s Northeastern Coast. Energies 2018, 11, 11.
[33] Kaloop, M.R.; Kumar, D.; Zarzoura, F.; Roy, B.; Hu, J.W. A wavelet—Particle swarm optimization—Extreme learning machine hybrid modeling for significant wave height prediction. Ocean Eng. 2020, 213, 107777.
[34] Rusu, L. Assessment of the Wave Energy in the Black Sea Based on a 15-Year Hindcast with Data Assimilation. Energies 2015, 8, 10370–10388.
[35] Rhee, S.Y.; Park, J.; Inoue, A. (Eds.) Soft Computing in Machine Learning; Springer: Berlin, Germany, 2014.
[36] Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
[37] Chang, F.J.; Hsu, K.; Chang, L.C. (Eds.) Flood Forecasting Using Machine Learning Methods; MDPI: Basel, Switzerland, 2019.
[38] Dineva, A.; Mosavi, A.; Faizollahzadeh Ardabili, S.; Vajda, I.; Shamshirband, S.; Rabczuk, T.; Chau, K.W. Review of Soft Computing Models in Design and Control of Rotating Electrical Machines. Energies 2019, 12, 1049.
[39] Guo, Y.; Wang, J.; Chen, H.; Li, G.; Liu, J.; Xu, C.; Huang, R.; Huang, Y. Machine learning-based thermal response time ahead energy demand prediction for building heating systems. Appl. Energy 2018, 221, 16–27.
[40] Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Cambridge, MA, USA, 2016.
[41] Durán-Rosal, A.M.; Fernández, J.C.; Gutiérrez, P.A.; Hervás-Martínez, C. Detection and prediction of segments containing extreme significant wave heights. Ocean Eng. 2017, 142, 268–279.
[42] Kumar, N.K.; Savitha, R.; Al Mamun, A. Regional ocean wave height prediction using sequential learning neural networks. Ocean Eng. 2017, 129, 605–612.
[43] Ali, M.; Prasad, R.; Xiang, Y.; Deo, R.C. Near real-time significant wave height forecasting with hybridized multiple linear regression algorithms. Renew. Sustain. Energy Rev. 2020, 132, 110003.
[44] Cornejo-Bueno, L.; Nieto-Borge, J.; García-Díaz, P.; Rodríguez, G.; Salcedo-Sanz, S. Significant wave height and energy flux prediction for marine energy applications: A grouping genetic algorithm—Extreme Learning Machine approach. Renew. Energy 2016, 97, 380–389.
[45] Emmanouil, S.; Aguilar, S.G.; Nane, G.F.; Schouten, J.J. Statistical models for improving significant wave height predictions in offshore operations. Ocean Eng. 2020, 206, 107249.
[46] Shamshirband, S.; Mosavi, A.; Rabczuk, T.; Nabipour, N.; Wing Chau, K. Prediction of significant wave height; comparison between nested grid numerical model, and machine learning models of artificial neural networks, extreme learning and support vector machines. Eng. Appl. Comput. Fluid Mech. 2020, 14, 805–817.
[47] Johansson, L.; Epitropou, V.; Karatzas, K.; Karppinen, A.; Wanner, L.; Vrochidis, S.; Bassoukos, A.; Kukkonen, J.; Kompatsiaris, I. Fusion of meteorological and air quality data extracted from the web for personalized environmental information services. Environ. Model. Softw. 2015, 64, 143–155.
[48] Fernández, J.C.; Salcedo-Sanz, S.; Gutiérrez, P.A.; Alexandre, E.; Hervás-Martínez, C. Significant wave height and energy flux range forecast with machine learning classifiers. Eng. Appl. Artif. Intell. 2015, 43, 44–53.
[49] Adams, J.; Flora, S. Correlating seabird movements with ocean winds: Linking satellite telemetry with ocean scatterometry. Mar. Biol. 2010, 157, 915–929.
[50] National Data Buoy Center. National Oceanic and Atmospheric Administration of the USA (NOAA). Available online: http://www.ndbc.noaa.gov/ (accessed on 10 December 2020).
[51] Kalnay, E.; Kanamitsu, M.; Kistler, R.; Collins, W.; Deaven, D.; Gandin, L.; Iredell, M.; Saha, S.; White, G.; Woollen, J.; et al. The NCEP/NCAR 40-Year Reanalysis Project. Bull. Am. Meteorol. Soc. 1996, 77, 437–471.
[52] Kistler, R.; Collins, W.; Saha, S.; White, G.; Woollen, J.; Kalnay, E.; Chelliah, M.; Ebisuzaki, W.; Kanamitsu, M.; Kousky, V.; et al. The NCEP–NCAR 50–Year Reanalysis: Monthly Means CD–ROM and Documentation. Bull. Am. Meteorol. Soc. 2001, 82, 247–267.
[53] The WEKA Data Mining Software: Attribute-Relation File Format (ARFF). Available online: https://www.cs.waikato.ac.nz/ml/weka/arff.html (accessed on 10 December 2020).
[54] Ali, M.; Prasad, R. Significant wave height forecasting via an extreme learning machine model integrated with improved complete ensemble empirical mode decomposition. Renew. Sustain. Energy Rev. 2019, 104, 281–295.
[55] Chatziioannou, K.; Katsardi, V.; Koukouselis, A.; Mistakidis, E. The effect of nonlinear wave-structure and soil-structure interactions in the design of an offshore structure. Mar. Struct. 2017, 52, 126–152.
[56] Dalgic, Y.; Lazakis, I.; Dinwoodie, I.; McMillan, D.; Revie, M. Advanced logistics planning for offshore wind farm operation and maintenance activities. Ocean Eng. 2015, 101, 211–226.
[57] Spaulding, M.L.; Grilli, A.; Damon, C.; Crean, T.; Fugate, G.; Oakley, B.A.; Stempel, P. STORMTOOLS: Coastal Environmental Risk Index (CERI). J. Mar. Sci. Eng. 2016, 4, 54.
[58] National Data Buoy Center. NDBC—Historical NDBC Data. Available online: http://www.ndbc.noaa.gov/historical_data.shtml (accessed on 10 December 2020).
[59] National Data Buoy Center. NDBC—Important NDBC Web Site Changes. Available online: http://www.ndbc.noaa.gov/mods.shtml (accessed on 10 December 2020).
[60] National Data Buoy Center. Measurement Descriptions and Units. Available online: https://www.ndbc.noaa.gov/measdes.shtml#stdmet (accessed on 10 December 2020).
[61] NOAA/OAR/ESRL PSD. ESRL: PSD: NCEP/NCAR Reanalysis 1. Available online: https://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html (accessed on 15 January 2019).
[62] Unidata. Network Common Data Form (NetCDF) Version 4.6.10 [Software]; UCAR/Unidata: Boulder, CO, USA, 2017.
[63] De Smith, M.J.; Goodchild, M.F.; Longley, P.A. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools, 3rd ed.; Matador: Leicester, UK, 2009.
[64] Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2013.
[65] Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1992.
[66] Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
[67] Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
[68] Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall PTR: Hoboken, NJ, USA, 1994.
[69] National Data Buoy Center. NDBC—Measurement Descriptions and Units. Available online: https://www.ndbc.noaa.gov/measdes.shtml (accessed on 10 December 2020).
[70] Durán-Rosal, A.M.; Hervás-Martínez, C.; Tallón-Ballesteros, A.J.; Martínez-Estudillo, A.C.; Salcedo-Sanz, S. Massive missing data reconstruction in ocean buoys with evolutionary product unit neural networks. Ocean Eng. 2016, 117, 292–301.
[71] Alcalá-Fdez, J.; Sánchez, L.; García, S.; del Jesús, M.J.; Ventura, S.; Garrell, J.M.; Otero, J.; Romero, C.; Bacardit, J.; Rivas, V.M.; et al. KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 2009, 13, 307–318.

Figure 1. A fragment of an annual text file of the Station 46001.

Figure 2. Sub-grid representation of six nodes of reanalysis surrounding the Station 46001.

Figure 3. Brief outline of the functionality provided by SPAMDA.

Figure 4. Example of entering two buoys with its annual text files.

Figure 5. Example of the creation of the intermediate datasets.

Figure 6. Example of the creation of pre-processed datasets.

Figure 7. An example of matching the data from NDBC (left) and NNRP (right).

Figure 8. Example of data integration for Direct matching.

Figure 9. Example of the creation of a Regression dataset with a prediction horizon of 6 h.

Figure 10. An example of the creation a Classification dataset, with a prediction horizon of 6 h and without synchronisation.

Figure 11. Example of entering two reanalysis data files.

Figure 12. SPAMDA main view.

Figure 13. Manage reanalysis data view: downloaded files containing the four closest reanalysis nodes.

Figure 14. Buoys tab: buoy ID 46001.

Figure 15. New buoy view: information of the buoy ID 46001.

Figure 16. Datasets tab: intermediate datasets of the buoy ID 46001.

Figure 17. New intermediate dataset view: creating the intermediate dataset with five annual text files.

Figure 18. Pre-process tab: pre-processing the created intermediate dataset.

Figure 19. Matching configuration tab: parameters for the data integration of the intermediate dataset and the reanalysis files (waves classification).

Figure 20. Final datasets tab: content of the final dataset created after data integration and discretisation of the output variable in four classes.

Figure 21. Final dataset opened with the environment Explorer of WEKA (waves classification).

Figure 22. Matching configuration tab: parameters for the data integration of the intermediate dataset and the reanalysis files (energy flux prediction).

Figure 23. Final datasets tab: content of the final dataset created after data integration.

Figure 24. Final dataset opened with the environment Explorer of WEKA (energy flux prediction).

Table 1. Measurement descriptions and units of each meteorological variable or attribute collected by the buoys (a detailed description can be found on the NDBC website [60]).

Attribute	Units	Description
WDIR	degT	The direction the wind is coming from, in degrees clockwise from true North.
WSPD	m/s	The speed of the wind.
GST	m/s	Peak of gust speed.
WVHT	m	Significant wave height.
DPD	sec	Dominant wave period (maximum wave energy).
APD	sec	Average wave period of all waves.
MWD	degT	The direction from which the waves at the dominant period are coming.
PRES	hPa	Sea level pressure.
ATMP	degC	Air temperature.
WTMP	degC	Sea surface temperature.
DEWP	degC	Dewpoint temperature.
VIS	nmi	Visibility of the station.
TIDE	ft	The water level.

Table 2. Thresholds for the classification example represented in Figure 10.

Class	Description	Lower End (m, inclusive)	Upper End (m, exclusive)
Low	Low wave height	$0.36$	$1.5$
Average	Average wave height	$1.5$	$2.5$
Big	Big wave height	$2.5$	$4.0$
Huge	Huge wave height	$4.0$	$9.9$
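The discretisation defined by these thresholds can be sketched in Python as follows. The half-open intervals follow the table; `wave_class` is a hypothetical helper, and returning None for heights outside the tabulated range is an assumption.

```python
from bisect import bisect_right

# Interval edges and class labels taken from Table 2 (wave heights in metres).
EDGES = [0.36, 1.5, 2.5, 4.0, 9.9]
LABELS = ["Low", "Average", "Big", "Huge"]

def wave_class(height):
    """Map a significant wave height to its class; intervals are [lower, upper)."""
    if not (EDGES[0] <= height < EDGES[-1]):
        return None  # outside the range covered by the thresholds
    return LABELS[bisect_right(EDGES, height) - 1]

classes = [wave_class(h) for h in (0.36, 1.49, 1.5, 2.5, 4.0, 9.89)]
print(classes)
```

Because the lower end of each interval is inclusive, a height of exactly 1.5 m falls in the Average class rather than in the Low one.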

Table 3. Distribution of instances of the final dataset.

Year	Number of Instances
2013	1460
2014	1460
2015	1460
2016	1464
2017	1458
Total	7302

Table 4. Results (mean ± SD) obtained by the algorithms.

Algorithm	Accuracy (CCR, %)	Kappa
Logistic Regression	$59.0691$	$0.44447$
C4.5	$61.7385$	$0.47852$
Random Forest	$68.6516 \pm 0.3083$	$0.57040 \pm 0.0042$
Support Vector Machine	$61.0016 \pm 0.0522$	$0.46770 \pm 0.0007$
Multilayer Perceptron	$69.7045 \pm 1.3033$	$0.58576 \pm 0.0178$
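For readers unfamiliar with the two measures in Table 4, the following Python sketch computes the correctly classified rate (CCR) and Cohen's kappa from a confusion matrix using their standard definitions. The matrix is a made-up example, not a result of this study, and `ccr_and_kappa` is a hypothetical helper.

```python
def ccr_and_kappa(confusion):
    """Return (CCR in %, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return 100 * observed, kappa

# Toy two-class confusion matrix (values are made up).
ccr, kappa = ccr_and_kappa([[40, 10], [10, 40]])
print(f"CCR = {ccr:.1f}%, kappa = {kappa:.2f}")
```

Kappa discounts the agreement that would be expected by chance, which is why it is lower than the accuracy expressed as a fraction.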

Table 5. Results (mean ± SD) obtained by the algorithms.

Algorithm	Root Mean Squared Error	Correlation Coefficient
Linear Regression	$29.6368$	$0.7296$
Random Forest	$23.4353 \pm 0.1313$	$0.8408 \pm 0.0021$
Support Vector Machine	$31.3008 \pm 0.1197$	$0.7275 \pm 0.0015$
Multilayer Perceptron	$27.1151 \pm 7.5536$	$0.8444 \pm 0.0193$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
