Building Suitable Datasets for Soft Computing and Machine Learning Techniques from Meteorological Data Integration: A Case Study for Predicting Significant Wave Height and Energy Flux

Abstract: Meteorological data are extensively used to perform environmental learning. Soft Computing (SC) and Machine Learning (ML) techniques represent a valuable support in many research areas, but they require datasets containing information related to the topic under study. Such datasets are not always available in an appropriate format, and their preparation and pre-processing demand considerable time and effort from researchers. This paper presents a novel software tool with a user-friendly GUI to create datasets by managing and integrating meteorological observations from two data sources: the National Data Buoy Center and the National Centers for Environmental Prediction/National Center for Atmospheric Research Reanalysis Project. Such datasets can be created using buoys and reanalysis data through procedures that are customisable in terms of temporal resolution and predictive and objective variables, and they can be used by SC and ML methodologies for prediction tasks (classification or regression). The objective is to provide the research community with an automated and versatile system for the casuistry that entails well-formed and quality data integration, potentially leading to better prediction models. The software tool can be used as a supporting tool for coastal and ocean engineering applications, sustainable energy production, or environmental modelling, as well as for decision-making in the design and building of coastal protection structures, marine transport, ocean energy converters, and the well-planned running of offshore and coastal engineering activities. Finally, to illustrate the applicability of the proposed tool, a case study to classify waves depending on their significant height and to predict energy flux in the Gulf of Alaska is presented.


Introduction
A better understanding of the environment is of vital importance for science, contributing not only to more efficient exploitation of natural resources but also to the development of new strategies aimed at its protection. In that sense, meteorological observations provide an essential and valuable source of information, which is widely used by researchers to address environmental learning, comprehension, prediction and conservation in numerous oceanic and atmospheric studies across a wide variety of areas (e.g., energy, climate change, agriculture, etc.). Some specific examples of the diversity of fields in which meteorological data can be used are, among others: global solar radiation estimation [1], directional analysis of sea storms [2], estimation of hybrid energy systems taking into account economic and environmental objectives [3], wind power ramp events prediction [4], sea surface temperature prediction [5], study of the responses exhibited by plankton to fluid motions [6], trends in solar radiation [7] or simulation of extreme near-shore sea conditions [8]. All these and optimisation tasks, obtaining significant improvements in the performance of the results, either in engineering [10], energy, or environmental problems [37][38][39]. SC and ML methodologies can be used not only by experienced computer scientists but also by other researchers. For example, the well-known Waikato Environment for Knowledge Analysis (WEKA) [40] software tool provides researchers with a wide collection of ML algorithms. ML techniques have already been applied to tackle wave characterisation, accurately estimating the $H_s$ and $T_e$ parameters [41,42], given that the robustness of ML methods can tackle the previously explained difficulties in wave energy prediction. In [43], a reliable ML model based on multiple linear regression and covariant-weighted least square estimation for $H_s$ modelling is presented in order to predict significant wave height 30 min in advance.
In [44], an approach for feature selection problems is developed and applied to $H_s$ and $F_e$ prediction in oceanic buoys, obtaining very good results. In [45], a Bayesian Network system provides a helpful tool to support the decision-making process of installation and maintenance operations in offshore wind farms using predictions of $H_s$, among others. In [46], several ML methods are implemented and compared for the prediction of $H_s$ in the Persian Gulf, with the extreme learning machine (ELM) providing the best results. The problem is that, in order to apply ML and SC techniques, it is essential to obtain datasets with relevant information about the issue under study, from which knowledge is inferred. Usually, these datasets are not publicly available in a friendly format, and their generation is the first step needed.
The information needed to create these datasets related to MRE can be obtained from meteorological observations, but such information may be available in an inappropriate format and may even contain missing values or measurements. Consequently, pre-processing tasks are usually required to improve the quality of the data, such as the replacement of missing values, outlier detection, or data normalisation, among others. Furthermore, if more than one source of information is used to achieve a better characterisation of the problem under study [47][48][49], then a data integration process, referred to as the matching process in this document, has to be carried out manually by researchers to create the datasets with the needed information. Given that such a process is of great relevance and has an extensive casuistry, the present work focuses especially on it. Moreover, depending on the subject and the SC and ML technique to be applied, or if the researcher considers other factors in order to enhance the performance obtained or to reach more in-depth conclusions, the datasets would have to be updated afterwards. In summary, many important details and different intermediate steps have to be considered when creating suitable datasets, especially when data integration is required, resulting in an extremely tedious task.
The main purpose of this paper is to present a new open source tool for the creation of datasets that integrate meteorological variables from two sources of information. Given that the tool provides a user-friendly graphical interface, no knowledge of programming languages is needed. It also spares researchers the tedious work mentioned above and greatly simplifies all the steps involved, avoiding possible errors in the intermediate steps, at least as a preliminary study in certain areas where some kind of environmental prediction is needed. The meteorological data used by the tool come from two well-known sources of information: the National Oceanic and Atmospheric Administration (NOAA) National Data Buoy Center (NDBC) [50] and the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) Reanalysis Project (NNRP or R1) [51,52]. The open source software tool presented in this work is named SPAMDA (Software for Pre-processing and Analysis of Meteorological DAta to build datasets; see Supplementary Materials), and it is available at https://github.com/ayrna. As SPAMDA performs all this data processing, it reduces the time spent on these tasks and allows researchers to focus on the study of the meteorological aspects of the observations. The datasets obtained are ready to be used as input for SC and ML techniques in prediction tasks (classification or regression), although researchers can use them for other purposes. These datasets contain one or more meteorological variables as inputs and one variable as target (the variable to be predicted). The generated datasets use the Attribute-Relation File Format (ARFF) [53], which is the one used by WEKA. In addition, the datasets can also be generated in Comma-Separated Values (CSV) format, enabling researchers to use other tools.
In order to address the problem previously discussed, namely meteorological data integration from NDBC and NNRP and the casuistry that it entails, SPAMDA offers researchers novelties and functionalities that will be detailed in Section 3, although some of them are briefly summarised below:
• The generation of datasets becomes a very easy and customisable task by means of the selection of different input parameters, such as predictive and objective variables, classification or regression, output discretisation (useful for ordinal regression) or prediction horizon, among others.
• It provides information about the quality and quantity of the data. SPAMDA allows preliminary studies of missing values (dates or measurements not recorded) in buoys managed by NDBC, so that the researcher can have an idea of the quality of the data recorded by the buoys and of their suitability for the intended purpose. In any case, SPAMDA allows data integration taking into account such missing values when needed by the user.
• Estimation of the amount of energy flux that can be produced at different prediction horizons: short-term, mid-term or long-term. Although this work does not focus on model performance, it should be taken into account that models tend to generalise worse with greater prediction horizons.
• It manages the extensive casuistry of data integration which can lead to incomplete datasets, described in Appendix A.
• Possibility of selecting one or more reanalysis nodes near the location under study, which could provide a better description of the problem and lead to more accurate models.
• Although pre-processing is not the main objective of SPAMDA, the tool also provides some basic pre-processing filters on buoy measurements, such as normalisation and missing data recovery.
• It facilitates data management and well-organised storage of the datasets. Environmental studies in different geographical locations can be carried out by merely introducing and using other collected data.
• SPAMDA is distributed as an open source tool, and its modular design allows the implementation of new modules for managing meteorological data from other sources, benefiting future renewable energy and environmental research.
• It includes a user-friendly GUI, facilitating and greatly simplifying data management, and it is integrated with the Explorer environment of WEKA.
• It is multi-platform, and it can be used on any computer with Java regardless of the operating system.
Therefore, the functionalities and characteristics that SPAMDA offers make it a supporting tool for researchers, which could be used in applications related to coastal and ocean engineering, and also in marine energy prediction. In [3], the estimation of energy supply sources in hybrid energy systems is based on the amount of energy that can be obtained by a marine energy system within a prediction horizon. Regulation of WECs to avoid malfunction or breakage, depending on the significant wave height and/or energy flux expected, as well as the possibility of reconfiguring them in order to maximise the wave energy extraction, is studied in [29,30]. The prediction of the energy that could be obtained from a certain maritime location is considered in [26,27] in order to know whether it is convenient to install WECs as power supply in marine structures, such as offshore oil and gas platforms or seawater desalination plants. In [54], significant wave height forecasting is applied for decision-making in exploitation and environmental protection for the construction of marine energy storage plants, future strategies on renewable energy and coastal planning. Other examples of application are: design of offshore structures and ports [55], decision-making and risk assessment about operational works in the sea [56], security systems for structures or naval security [57].
This paper is organised as follows: Section 2 describes the sources of information used by SPAMDA for creating datasets. Section 3 describes in detail the features of the software tool. Section 4 shows a case study describing the use of SPAMDA in a practical approach. Section 5 provides the final conclusions and future work.

Meteorological Data Sources
The data provided by the two above-mentioned sources of information used by SPAMDA are described below:
• NDBC belongs to the National Weather Service (NWS) and operates and supports a network of marine and ocean buoys. The mission of the network is to record marine and ocean meteorological data, such as $H_s$, dominant wave period, or wind speed and direction, among others. The buoys maintained by NDBC are located in coastal and offshore waters, and they are provided with specific sensors and devices which allow them to perform measurements. The information collected by the buoys is available on the NDBC website [58], and it is divided into different groups. One of them corresponds to the standard meteorological information of the historical data collected by each buoy, which can be downloaded as annual text files whose format has been used by NDBC since January 2007 [59]. These files contain hourly measurements, from 00:50 to 23:50 UTC (Coordinated Universal Time) each day, spanning from 23:50 on 31 December of the year preceding the desired one to 22:50 on 31 December of the desired year. In Table 1, a comprehensive measurement description and the corresponding units are provided as a summary for the reader. A fragment of one of these files, which contains the measurements collected during 2017 by the buoy identified as Station 46001 in NDBC, is shown in Figure 1. Each column corresponds to a meteorological variable or attribute, and each row or instance corresponds to the values of the measurements collected by the buoy for each attribute at a specific date and time. Note that the data collected by the network of buoys may be incomplete due to diverse circumstances such as the weather conditions in which the buoys have to operate, or failures or malfunctioning elements of the buoys, among others. Accordingly, some of the measurements may be completely missing (missing date or instance) or partially missing (some measurements not recorded), for a single buoy or for a set of buoys, occasionally or over a period of time. It is also possible that measurements have been recorded at a time different from the expected one. These aspects have to be taken into account when creating the datasets. This casuistry is explained in detail in Appendix A.
• NNRP provides a three-dimensional global reanalysis of numerous meteorological observations (e.g., zonal and meridional components of the wind velocity, relative humidity, pressure, etc.), which is available monthly, daily, and every six hours at 00 Z (Zulu time), 06 Z, 12 Z, and 18 Z from 1948 on a global 2.5° × 2.5° grid. Weather observations come from different sources, such as ships, satellites, and radar, among others. Reanalysis data are created by assimilating such observations with the same climate model along the whole period of reanalysis, in order to decrease the impact of modelling changes on climate statistics. Such information has become a substantial support for the needs of the research community, even more so in locations where instrumental (real time) data are not available. The reanalysis data are available on the NNRP website [61], which is accessible through different sections. Such data can be fully (a global 2.5° × 2.5° grid) or partially (only the desired reanalysis nodes or sub-grid) downloaded as Network Common Data Form (NetCDF) files [62], a special binary format for representing scientific data, which provides a description of the file contents and also includes the spatial and temporal properties of the data. Each reanalysis file contains the values of a meteorological variable estimated by a mathematical model for each reanalysis node. For the sake of clarity, Figure 2 shows an example that approximately illustrates a sub-grid containing six reanalysis nodes surrounding the geographic location of a buoy (obtained from NDBC).
Therefore, with both sources of information, which complement each other, and by carrying out a matching process, SPAMDA will create datasets for prediction tasks. In this way, the dataset input variables will be one or more reanalysis variables from NNRP and one or more measurements from NDBC. The dataset output variable will always be one measurement from NDBC.
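As a rough illustration of how the instances of an annual text file could be read, the following sketch parses a few rows into timestamps and attribute values. It is only a sketch under stated assumptions: the layout (two header lines starting with `#`, the first with attribute names and the second with units, followed by whitespace-separated rows whose first five columns are year, month, day, hour and minute) is assumed here, the sample values are made up, and `parse_ndbc` is a hypothetical helper, not part of SPAMDA.

```python
from datetime import datetime, timezone

# Made-up sample in the layout assumed for this sketch: two '#' header lines
# (attribute names, then units) followed by whitespace-separated rows.
SAMPLE = """\
#YY  MM DD hh mm WDIR WSPD WVHT  APD   PRES
#yr  mo dy hr mn degT m/s     m  sec    hPa
2016 12 31 23 50  254  9.7 3.29 7.23 1002.3
2017 01 01 00 50  250  9.1 3.10 7.05 1002.9
"""

def parse_ndbc(text):
    """Return a list of (timestamp, {attribute: value}) instances."""
    names, rows = None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if names is None:          # first header line: attribute names
                names = line.lstrip("#").split()
            continue                   # second header line: units, ignored
        fields = line.split()
        year, month, day, hour, minute = (int(v) for v in fields[:5])
        stamp = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
        values = {n: float(v) for n, v in zip(names[5:], fields[5:])}
        rows.append((stamp, values))
    return rows

instances = parse_ndbc(SAMPLE)
```

Each instance keeps its timestamp separate from its attributes, which is convenient for the time-based matching with the reanalysis data.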

SPAMDA
SPAMDA combines meteorological information from NDBC and NNRP to obtain new datasets for oceanic and atmospheric studies. In order to do so, SPAMDA manages three different types of datasets, which are described in the following sections but are briefly introduced below to give the reader a better general understanding:
• Intermediate datasets: They contain the meteorological observations from NDBC.
• Pre-processed datasets: They are obtained as a result of pre-processing tasks performed on the intermediate datasets.
• Final datasets: Created by merging an intermediate or pre-processed dataset (which contains the information from NDBC) with the reanalysis data from NNRP. This procedure is referred to in SPAMDA as a matching process and will be carried out according to the study to be performed (classification or regression).
SPAMDA consists of three main functional modules, whose main features, represented in Figure 3, are the following:
• Manage buoys data: The aim of this module is to provide features for the management and analysis of the information related to the buoys from NDBC. This includes:
1. Entering and updating the information of each buoy.
2. Creation of intermediate datasets with the collected measurements.
3. Pre-processing tasks for obtaining the pre-processed datasets.
4. Matching process to merge the information from NDBC and NNRP.
5. Creation of the final datasets according to the ML technique to use (classification or regression).
• Manage reanalysis data: This module is used for the management of the reanalysis data provided by the NNRP. In this way, researchers can keep the reanalysis data files updated for their studies. Such files will be used, depending on researchers' needs, in the matching process when obtaining the final datasets.
• Tools: This module includes features for converting intermediate or pre-processed datasets to ARFF or CSV format and for opening ARFF files with WEKA software.
In the following subsections, each integrated functional module is described in detail.

Buoys
When a new buoy is included in SPAMDA, the following information, which can be obtained from NDBC, is requested:
• Station ID: An alphanumeric identifier that allows easy identification of the buoy.
• Description: A short description of the buoy.
• Latitude: North or South geographical location (degrees) of the buoy.
• Longitude: West or East geographical location (degrees) of the buoy.
• Measurements files: The above-mentioned annual text files of the standard meteorological information recorded by the buoy and downloaded from the NDBC website. These will be used for the creation of the intermediate datasets. One file per year is expected.
For clarification, an example is presented in Figure 4, where the buoy ID1 has three annual text files and the buoy ID2 has two annual text files.

Datasets
Once a buoy has been included as described in Section 3.1, it is possible to create datasets with one or more annual text files, which are referenced in SPAMDA as intermediate datasets. In this module, researchers can manage intermediate datasets of each buoy, which are the baseline for their studies, by creating new ones or deleting the unnecessary ones.
When an intermediate dataset is created, it is associated with its corresponding buoy. In addition, a summary of its content is also created, providing relevant information such as the number of instances, the dates of the first and last measurements, the annual text files included and the missing and duplicated dates.
An example where three intermediate datasets have been created is presented in Figure 5. The two intermediate datasets of the buoy ID1 contain meteorological data of different years, and the intermediate dataset of the buoy ID2 contains meteorological data of two years. For each buoy, as many intermediate datasets as needed can be created.
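The summary of an intermediate dataset described above can be sketched in a few lines. This is a minimal sketch, assuming that measurements lie exactly on an hourly grid starting at the first recorded date (it does not handle measurements recorded at unexpected times); `summarise` is a hypothetical helper and the timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

def summarise(timestamps, step=timedelta(hours=1)):
    """Summarise measurement timestamps: size, range, missing and duplicated dates."""
    counts = Counter(timestamps)
    first, last = min(counts), max(counts)
    expected, current = [], first
    while current <= last:             # hourly grid between first and last date
        expected.append(current)
        current += step
    return {
        "instances": len(timestamps),
        "first": first,
        "last": last,
        "missing": [t for t in expected if t not in counts],
        "duplicated": sorted(t for t, n in counts.items() if n > 1),
    }

stamps = [
    datetime(2017, 1, 1, 0, 50),
    datetime(2017, 1, 1, 1, 50),
    datetime(2017, 1, 1, 1, 50),   # duplicated date
    datetime(2017, 1, 1, 4, 50),   # 02:50 and 03:50 are missing
]
summary = summarise(stamps)
```

A summary of this kind gives a quick impression of how complete a buoy record is before any pre-processing or matching is attempted.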

Pre-Process
Data pre-processing prepares the raw data (intermediate datasets) so that ML algorithms can process them correctly. Applying pre-processing tasks (filters) can enhance the quality of the data before the learning phase. The results are referred to as pre-processed datasets.
SPAMDA provides several filters grouped in three categories, Attribute, Instance, and Recover missing data, including the configuration of their parameters and a short description of them:
• Attribute: All of these filters can be applied to the attributes (variables of the buoy from NDBC) of the intermediate dataset.
• Instance: All of these filters can be applied to the instances (hourly measurements of the buoy from NDBC) of the intermediate dataset.
- RemoveDuplicates: With this filter, all duplicated instances are removed.
- RemoveWithValues: This filter removes all the instances that match the attribute and the value supplied by the user.
- SubsetByExpression: It removes all the instances that do not match a user-specified expression.
• Recover missing data: All of these filters can be applied to the instances of the intermediate dataset.
SPAMDA allows researchers to undo the last filter applied or to restore the initial content of the intermediate dataset. In addition, the content and relevant statistical information of the intermediate and the pre-processed datasets can be visualised in this module, for example: minimum and maximum values, mean, standard deviation, or the number of instances with missing values. Figure 6 shows an example where the intermediate datasets 1 and 2 of the buoy ID1 have been pre-processed, obtaining as a result the pre-processed dataset 1 of each one. The intermediate dataset 1 of the buoy ID2 has also been pre-processed. Pre-processed dataset n indicates that researchers can create as many pre-processed datasets as they consider appropriate. Nevertheless, further pre-processing tasks can be performed after obtaining the final datasets by means of the Explorer environment of WEKA or other tools.
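The behaviour of the three Instance filters listed above can be illustrated with a short sketch. The functions below only mimic the described behaviour on plain dictionaries; they are hypothetical helpers with made-up data, do not reproduce the options of the corresponding WEKA filters, and use simple equality for the value match.

```python
def remove_duplicates(instances):
    """Keep the first occurrence of each instance and drop later copies."""
    seen, kept = set(), []
    for inst in instances:
        key = tuple(sorted(inst.items()))
        if key not in seen:
            seen.add(key)
            kept.append(inst)
    return kept

def remove_with_values(instances, attribute, value):
    """Drop the instances whose attribute equals the supplied value."""
    return [inst for inst in instances if inst[attribute] != value]

def subset_by_expression(instances, predicate):
    """Keep only the instances that satisfy the user-specified expression."""
    return [inst for inst in instances if predicate(inst)]

data = [
    {"WSPD": 9.7, "WVHT": 3.29},
    {"WSPD": 9.7, "WVHT": 3.29},   # duplicated instance
    {"WSPD": 4.2, "WVHT": 0.80},
    {"WSPD": 7.5, "WVHT": 2.10},
]
unique = remove_duplicates(data)
without_value = remove_with_values(unique, "WVHT", 0.80)
high_waves = subset_by_expression(unique, lambda inst: inst["WVHT"] > 2.0)
```

Each function returns a new list, which mirrors the idea that a filter can be undone by going back to the previous version of the dataset.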

Matching Configuration
The automatic integration of the data provided by the two sources of information described in Section 2, which merges and formats such data, is referred to as the matching process in this document. This process is one of the most powerful and remarkable features of the software tool due to its great relevance and extensive casuistry. In this sense, SPAMDA has been developed to provide great flexibility to researchers.
The matching procedure is performed using an intermediate or pre-processed dataset, which includes the measurements collected by a buoy from NDBC, and the needed reanalysis data files from NNRP. Note that SPAMDA is able to manage the NetCDF binary format for handling the information stored in the reanalysis files.
This process merges the information of both sources that match in time. However, given that the reanalysis data are available with a minimum temporal resolution of 6 h (at 00 Z, 06 Z, 12 Z and 18 Z), and the measurements of the buoys are recorded at hourly intervals, from 00:50 to 23:50 UTC, the matching can only be carried out every six hours (discarding the rest of the measurements from the buoy data). In addition, since there is still a difference of 10 min between both time grids, the matching with the reanalysis data will be performed with the nearest buoy measurement (before or after) within a maximum difference of 60 min. Finally, the matched instances of both sources will form the final datasets. Figure 7 presents an example of matching with the measurements collected during 2017 by Station 46001 (NDBC) and the reanalysis data (NNRP) of the variable pressure for reanalysis nodes 57.5 N × 147.5 W and 55.0 N × 147.5 W in the same year. In this way, only the instances from both sources that are linked with arrows (highlighted in green) will be used in the creation of the final datasets. Although the reanalysis dates have been presented in a human-readable format, note that they are stored as hours elapsed since 01-01-1800, and they have to be transformed for comparison taking into account the time zone. Such transformation is automatically done by SPAMDA when matching the instances.
Appendix A provides an example of a more complex case of the procedure.
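The time-based matching described above can be sketched as follows. This is a minimal sketch, not SPAMDA's implementation: it assumes that both sources are expressed in UTC, that reanalysis times are given as hours since 01-01-1800, and that reanalysis times without a buoy measurement within 60 min are simply skipped. The names `reanalysis_time`, `match` and `utc` are hypothetical, and the buoy timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1800, 1, 1, tzinfo=timezone.utc)   # origin of the reanalysis time axis
MAX_GAP = timedelta(minutes=60)                      # maximum tolerated difference

def reanalysis_time(hours_since_epoch):
    """Convert 'hours since 01-01-1800' into a UTC datetime."""
    return EPOCH + timedelta(hours=hours_since_epoch)

def match(reanalysis_hours, buoy_times):
    """Pair each reanalysis time with the nearest buoy time within MAX_GAP."""
    pairs = []
    for hours in reanalysis_hours:
        target = reanalysis_time(hours)
        nearest = min(buoy_times, key=lambda t: abs(t - target), default=None)
        if nearest is not None and abs(nearest - target) <= MAX_GAP:
            pairs.append((target, nearest))
    return pairs

def utc(*args):
    """Shorthand for building UTC datetimes in the example."""
    return datetime(*args, tzinfo=timezone.utc)

# Made-up buoy timestamps: there is no measurement near 12:00.
buoy_times = [utc(2016, 12, 31, 23, 50), utc(2017, 1, 1, 0, 50),
              utc(2017, 1, 1, 5, 50), utc(2017, 1, 1, 17, 50)]

# Reanalysis times 00 Z, 06 Z, 12 Z and 18 Z of 1 January 2017.
start = (utc(2017, 1, 1) - EPOCH) / timedelta(hours=1)
reanalysis_hours = [start + 6 * k for k in range(4)]

pairs = match(reanalysis_hours, buoy_times)
```

In this example the 00 Z reanalysis value is paired with the 23:50 measurement of the previous day (10 min away) rather than with the 00:50 one (50 min away), and the 12 Z value is left unmatched.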
SPAMDA allows researchers to perform a customisable matching process, obtaining as many different versions of the same meteorological data as needed. Prediction tasks are based on the estimation of the output attribute using the information provided by the input attributes. Depending on the task, the datasets must be prepared and configured differently:
• Classification: The final datasets will be ready to use as input for ML classifiers, requiring a nominal output attribute, whose specific preparation is detailed in Section 3.5.
• Regression: The final datasets will be ready to use as input for regression methods, requiring a real output attribute, whose preparation is also explained in Section 3.5.
• Direct matching: In this case, the input attributes have a direct correspondence with the output attribute, and it is not necessary to perform any additional preparation. Both input and target attributes are synchronised in time, in such a way that the final dataset is not intended for prediction purposes. For example, the final datasets may be used in lost data recovery tasks, in correlation studies, in descriptive analyses, etc.
The following parameters can be specified for the matching process:
• Flux of energy [48]: When $F_e$ is selected, it will be used as output. This attribute is not collected by the buoys, but there are two parameters from which it can be computed: $H_s$ and $T_e$, which are collected as the WVHT and APD attributes, respectively, and were described in Table 1. In this way, SPAMDA obtains the $F_e$ (measured in kilowatts per meter) of each instance using the following equation:
$$F_e = 0.49 \cdot H_s^2 \cdot T_e, \tag{1}$$
where $H_s$ is measured in meters and $T_e$ in seconds. $F_e$ is referred to as flux of energy, but it is defined as an average energy flux because $H_s$ is an average wave height (see descriptions of the measurements on the NDBC website).
• Attribute to predict: Instead of using $F_e$, researchers can select any of the attributes collected by the buoys as output (e.g., significant wave height, WVHT, wind direction, WDIR, sea level pressure, PRES, etc.). Therefore, they can conduct different studies by selecting one attribute or another.
• Reanalysis data files: In order to obtain a potentially better description of the problem under study, more than one reanalysis variable can be considered as input. Remember that these files have to be previously downloaded from the NNRP website [61], where the range of dates (temporal properties) and the desired sub-grid (spatial properties, see Figure 2) should be set for each reanalysis variable. In that sense, the reanalysis data files must have the same spatial and temporal properties but relate to different variables. SPAMDA simplifies this task by showing the reanalysis data files that are compatible with each other, and by checking that the selection made by the researcher meets that condition.
• Buoy attributes: In addition to the reanalysis variables, the final datasets will also include the selected attributes (of the intermediate or pre-processed dataset used) as inputs, providing a potentially better characterisation of the problem under study, although this will depend on how correlated the attributes are.
• Include missing dates: As mentioned above, the information collected by a buoy may be incomplete due to measurements not recorded by it. As a consequence, the matching of instances between both sources of information may not be possible (missing dates).
In that situation, researchers can consider two options: (1) discard the instances affected or (2) include them. In the latter case, the final datasets will contain the affected instances, but the measurements of the buoy will be stored as missing values in WEKA format, denoted as "?".
• Nearest reanalysis nodes to consider: As already shown in Figure 2 (which represents six reanalysis nodes), the reanalysis data files may contain information of several reanalysis nodes. In this way, researchers can:
- Consider all the reanalysis nodes contained in each file: in this case, the information provided by each reanalysis node contained in each selected reanalysis data file will be used.
- Consider only some of the reanalysis nodes contained in each file: in this case, the information used is only that corresponding to the closest nodes to the buoy (the number of nodes, N, is indicated by the user). To do that, SPAMDA uses the Haversine equation [63] (or the great-circle distance) to calculate the distance from the location of the buoy to each reanalysis node and obtain the closest ones.
This great-circle distance between the main point and the destination point is computed with trigonometric functions:
$$d(p_0, p_j) = \arccos\left(\sin(lat_0) \cdot \sin(lat_j) + \cos(lat_0) \cdot \cos(lat_j) \cdot \cos(lon_0 - lon_j)\right), \tag{2}$$
where $p_0$ is the geographical location of the buoy and $p_j$ is the position of each node. Finally, $lat$ and $lon$ represent the latitude and longitude of the points. The result is an angular distance, which is sufficient for ranking the nodes by proximity.
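The selection of the N closest reanalysis nodes can be sketched as follows. This is only an illustrative sketch: the distance is computed as a great-circle central angle with the spherical law of cosines, $\arccos(\sin lat_0 \sin lat_j + \cos lat_0 \cos lat_j \cos(lon_0 - lon_j))$, which is enough to rank nodes (the haversine form is numerically more robust for very short distances). The buoy position, the six-node sub-grid, the sign convention (west longitudes negative) and the names `central_angle` and `nearest_nodes` are assumptions made for the example.

```python
import math

def central_angle(lat0, lon0, lat1, lon1):
    """Great-circle central angle in radians (spherical law of cosines)."""
    p0, p1 = math.radians(lat0), math.radians(lat1)
    dlon = math.radians(lon0 - lon1)
    cos_d = (math.sin(p0) * math.sin(p1)
             + math.cos(p0) * math.cos(p1) * math.cos(dlon))
    return math.acos(max(-1.0, min(1.0, cos_d)))   # clamp rounding error

def nearest_nodes(buoy, nodes, n):
    """Return the n nodes closest to the buoy, nearest first."""
    return sorted(nodes, key=lambda node: central_angle(*buoy, *node))[:n]

# Illustrative values only: (latitude, longitude) in degrees, west negative.
buoy = (56.3, -148.0)
nodes = [(lat, lon) for lat in (57.5, 55.0) for lon in (-150.0, -147.5, -145.0)]

closest = nearest_nodes(buoy, nodes, 2)
```

Multiplying the central angle by the Earth's radius would give a distance in kilometres, but the ranking of the nodes does not change.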
• Number of final datasets: Depending on the number of nearest reanalysis nodes to consider, the number of final datasets to create and the content of them can be configured according to the following options: -One (using weighted mean of the N nearest reanalysis nodes): Only one final dataset will be created, which will contain the attributes (the selected one as output and the selected ones as inputs) of the intermediate or pre-processed dataset used, along with a weighted mean of each variable of the reanalysis data used (one per selected reanalysis data file). This weighted mean is obtained by SPAMDA and uses Equation (2) to calculate the distance from the geographical position of the buoy to each node of reanalysis. Once the distances have been computed, they are normalised and inverted as shown in the following equation: Then, with these calculated weights, a weighted mean of each variable of reanalysis is obtained for each of the N nodes. In this way, the closest reanalysis nodes to the geographical position of the buoy will provide more information.
Considering as an example the two nearest reanalysis nodes represented in Figure 2 and the reanalysis variables air temperature and pressure, the weighted mean of each reanalysis variable will be calculated using the reanalysis nodes 57.5 N × 147.5 W and 55.0 N × 147.5 W.
- 'N' (one per each reanalysis node): As many final datasets as the number N of nearest reanalysis nodes configured by the researcher will be created. Therefore, each final dataset will contain the value of each reanalysis variable used for the corresponding nearest reanalysis node, along with the selected attributes of the intermediate or pre-processed dataset used. In this way, researchers can perform comparison studies depending on the reanalysis node considered, to achieve better performance for the problem under study. In this case, and considering as an example the four closest reanalysis nodes (see Figure 2) and the reanalysis variables air temperature and pressure, four final datasets will be created, each one containing the information of both reanalysis variables of the corresponding reanalysis node: 57.
Once the matching parameters have been described, and for a better understanding of them, Figure 8 presents an example of the data integration for the data shown in Figure 7 under a particular configuration of these parameters. Note that the date is shown only for clarity and is not included in the final dataset.
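Two of the computations involved in the matching parameters of this section lend themselves to a short sketch: the flux of energy and the distance-based weighted mean of a reanalysis variable. Both functions rest on assumptions. The flux of energy uses the standard deep-water approximation $F_e = 0.49 \cdot H_s^2 \cdot T_e$ (kW/m, with $H_s$ in meters and $T_e$ in seconds). The weights use one plausible way of normalising and inverting the distances, $w_i = (\sum_j d_j - d_i) / ((N - 1) \sum_j d_j)$, which is assumed here for illustration and may differ from the scheme used by SPAMDA. All function names and values are hypothetical.

```python
def energy_flux(hs, te):
    """Flux of energy in kW/m, assuming F_e = 0.49 * H_s**2 * T_e."""
    return 0.49 * hs ** 2 * te

def inverse_distance_weights(distances):
    """Assumed scheme: w_i = (sum(d) - d_i) / ((N - 1) * sum(d)); weights sum to 1."""
    n, total = len(distances), sum(distances)
    if n == 1 or total == 0:
        return [1.0 / n] * n           # degenerate cases: equal weights
    return [(total - d) / ((n - 1) * total) for d in distances]

def weighted_mean(values, distances):
    """Weighted mean of one reanalysis variable over the N nearest nodes."""
    weights = inverse_distance_weights(distances)
    return sum(w * v for w, v in zip(weights, values))

flux = energy_flux(2.0, 8.0)                       # H_s = 2 m, T_e = 8 s
weights = inverse_distance_weights([1.0, 3.0])     # nearer node gets the larger weight
pressure = weighted_mean([1000.0, 1008.0], [1.0, 3.0])
```

With this assumed scheme, a node three times farther away than the other receives a weight of 0.25 against 0.75, so the closest node contributes more information, as intended.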

Final Datasets
Once the matching process has been performed with the desired configuration, the matched information must be prepared for the desired prediction task (Regression or Classification), yielding the final datasets. Recall that Direct matching, described in Section 3.4, performs a direct correspondence between the input attributes and the output attribute, so no such preparation is needed.
SPAMDA allows researchers to carry out this preparation by means of the following options:
• Prediction horizon (Classification and Regression): This option indicates the time gap by which the attribute to predict (output attribute) is moved backward. In this way, the input attributes (variables of the buoy and reanalysis data) will be used to predict the output attribute at a specific future time (e.g., +6 h, +12 h, +18 h, +1 day, etc.). The minimum interval for increasing and decreasing the prediction horizon is 6 h (due to the temporal resolution of the reanalysis data) [4], the same interval used when the matching process is carried out. Therefore, for each increment of the prediction horizon, an instance of the dataset is lost (as this future information is not available). As the minimum prediction horizon is 6 h, at least one instance will be lost. The relation between the inputs and the output (attribute to predict) is defined as follows:

$$o_{t+\Delta t} = f(\mathbf{b}_t, \mathbf{r}_t) \qquad (4)$$

where t is the time instant to study, ∆t is the prediction horizon, o is the attribute to be predicted, b_t represents the vector that contains the selected NDBC variables, r_t represents the vector that contains the selected reanalysis variables, and f stands for the relation to be modelled. In this way, and considering the matched information shown in Figure 8a, WVHT is o, the vector b contains the variable WSPD, and the vector r contains Pres.

Optionally, the reanalysis variables can be synchronised with the attribute to predict. Given that these variables are estimated by a mathematical model, good future estimations are available, which can improve the performance of the results. In this case, the relation between the inputs and the attribute to predict would be:

$$o_{t+\Delta t} = f(\mathbf{b}_t, \mathbf{r}_{t+\Delta t}) \qquad (5)$$

Note that the NDBC variables selected as inputs cannot be synchronised with the attribute to predict. For the sake of clarity, considering the matched information shown in Figure 8a, an example of building a dataset for a Regression task is shown in Figure 9a.
As mentioned earlier, this prediction task requires a real-valued output variable (in this case, WVHT, the last one). The options considered for the preparation of each final dataset are the following:
- Do not synchronise the reanalysis data (see Equation (4) for the relation between the inputs and the output).
- A prediction horizon of 6 h.
Note that, because the prediction horizon is 6 h, the values of the WVHT attribute are moved backward one instance (up). As a consequence, the last instance (31 December 2017 18:00) is lost and is not included in the final dataset. In addition, because the reanalysis data have not been synchronised, the values of the Pres and WSPD variables are at the same time instant (t in Equation (4)).

Considering again the matched information shown in Figure 8a, an example of the creation of the same dataset but applying synchronisation (see Equation (5)) is shown in Figure 9b. Again, due to the prediction horizon selected (6 h), the values of the WVHT attribute are moved backward one instance (up) and the last instance (31 December 2017 18:00) is not included in the final dataset. However, the values of the Pres variable are now also moved backward one instance (due to the synchronisation). Therefore, in this case, Pres is at the same time instant as the attribute to predict (t + ∆t in Equation (5)).
• Thresholds of the output attribute (Classification): Since the values of the variables collected by the buoys are real numbers, the attribute selected as output (attribute to be predicted) must be discretised, that is, converted from real to nominal values. SPAMDA allows researchers to perform this process by defining the necessary classes with their thresholds, which will be used to carry out such discretisation. Considering again the matched information shown in Figure 8a, an example of the creation of a Classification dataset is shown in Figure 10. The options considered for the preparation of the final dataset are the following:
- Do not synchronise the reanalysis data.
- A prediction horizon of 6 h.
- The thresholds shown in Table 2.
Note that the attribute to be predicted has been renamed to Class_WVHT to show that it is now a nominal variable because its values have been discretised according to the thresholds (usually defined by an expert). In addition, and due to the 6 h prediction horizon, the last instance is lost (31 December 2017 18:00), and the values of the attribute Class_WVHT are moved backward one instance (up). As the reanalysis data have not been synchronised, the values of the Pres and WSPD variables are at the same time instant (t in Equation (4)).
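The effect of the prediction horizon and of the optional synchronisation on the matched data can be illustrated with a short Python sketch. It is not SPAMDA's implementation: the function name `build_final_dataset`, its parameters and the numeric values are hypothetical, and the horizon is expressed as a number of 6 h steps.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]


def build_final_dataset(
    matched: Sequence[Row],
    output: str,
    buoy_inputs: Sequence[str],
    reanalysis_inputs: Sequence[str],
    steps: int = 1,
    synchronise: bool = False,
) -> List[Row]:
    """Pair the inputs at time t with the output at time t + steps * 6 h.

    Hypothetical helper: `matched` is a chronologically ordered list of
    matched instances, one every 6 h.
    """
    if steps < 1:
        raise ValueError("the minimum prediction horizon is one 6 h step")
    final: List[Row] = []
    # The last `steps` instances have no future output, so they are lost.
    for i in range(len(matched) - steps):
        now, future = matched[i], matched[i + steps]
        row = {name: now[name] for name in buoy_inputs}
        # Buoy inputs always stay at t; reanalysis inputs move to t + dt
        # only when synchronisation is requested.
        source = future if synchronise else now
        row.update({name: source[name] for name in reanalysis_inputs})
        row[output] = future[output]
        final.append(row)
    return final


# Invented values, for illustration only.
matched = [
    {"WSPD": 5.0, "Pres": 1010.0, "WVHT": 1.0},
    {"WSPD": 6.0, "Pres": 1008.0, "WVHT": 1.5},
    {"WSPD": 7.0, "Pres": 1005.0, "WVHT": 2.0},
]
plain = build_final_dataset(matched, "WVHT", ["WSPD"], ["Pres"])
synced = build_final_dataset(matched, "WVHT", ["WSPD"], ["Pres"], synchronise=True)
```

With a one-step horizon, both results contain one instance fewer than the matched data; they differ only in the time instant from which Pres is taken.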
The content of the final datasets, obtained as the result of the preparation of the matched data, can be visualised to check everything before saving them on disk. Such preparation can be performed as many times as required and considering the different options in each moment. Although the date will not be included in the final datasets, it can be shown to properly check the matching.
Finally, it is necessary to define the output configuration to create the final datasets:
• A text file that summarises the configuration used in the matching process and in the preparation of the matched data is also generated. It can be saved and loaded, enabling researchers to resume their studies at any other time.

Table 2. Thresholds for the classification example represented in Figure 10.

Manage Reanalysis Data
As mentioned in Section 2, each reanalysis data file provided by NNRP contains the values of one meteorological variable, as estimated by a mathematical model.
In this module (see Figure 3), SPAMDA includes features for entering new files and deleting the unnecessary ones. In addition, useful information about the content of each reanalysis file can be consulted such as name of the file and the reanalysis variable, number of instances and reanalysis nodes, initial and final time, latitude and longitude. All of these fields summarise the temporal and spatial properties of the data. Thus, researchers can quickly and easily identify each reanalysis file entered in SPAMDA.
An example where two reanalysis data files have been entered in SPAMDA is shown in Figure 11.

Tools
SPAMDA also contains another module that provides two utilities: the Dataset converter, used for converting the desired intermediate or pre-processed datasets to ARFF or CSV formats, and a utility for opening ARFF files with the WEKA Explorer environment, which is useful for easily checking the results of different pre-processing configurations.
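As an illustration of what a conversion to ARFF involves, the following Python sketch writes a small dataset in that format. It is a minimal example and not the converter shipped with SPAMDA; the function name `to_arff` and the sample values are hypothetical, and attribute names are assumed to contain no spaces.

```python
from typing import Dict, List, Mapping, Optional, Sequence, Union

Value = Union[float, str, None]


def to_arff(
    relation: str,
    rows: Sequence[Mapping[str, Value]],
    numeric: Sequence[str],
    nominal: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Serialise rows as ARFF text (hypothetical helper).

    `numeric` lists the real-valued attributes; `nominal` maps each nominal
    attribute to its allowed labels. Missing values (None) are written as '?'.
    """
    nominal = nominal or {}
    lines = ["@relation %s" % relation, ""]
    for name in numeric:
        lines.append("@attribute %s numeric" % name)
    for name, labels in nominal.items():
        lines.append("@attribute %s {%s}" % (name, ",".join(labels)))
    lines += ["", "@data"]
    columns = list(numeric) + list(nominal)
    for row in rows:
        cells = ["?" if row.get(c) is None else str(row[c]) for c in columns]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Invented values, for illustration only.
print(to_arff("buoy_example", [{"WSPD": 5.0, "WVHT": None}], ["WSPD", "WVHT"]))
```

A CSV export would differ only in the header: a single line with the attribute names replaces the `@relation`, `@attribute` and `@data` declarations.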

A Case Study Applied to Gulf of Alaska
This section describes how SPAMDA works in a practical approach showing two examples to create fully processed datasets (final datasets) starting from the raw data. The objective of these final datasets is to be used with SC and ML algorithms for environmental modelling, in this case, to classify waves depending on their height and to predict energy flux in the Gulf of Alaska.
On the one hand, wave classification is addressed as a multi-class approach, given that a continuous attribute can be discretised, using different thresholds, in distinct classes. Such wave modelling can be applied with different purposes, such as missing buoy data reconstruction, extreme significant wave heights detection or decision-making and risk assessment about operational works in the sea.
On the other hand, the prediction of the energy flux is addressed as a regression problem. Energy flux prediction is related to marine energy, and it is useful to characterise the wave energy production from WEC facilities, which could be injected into the electric network or supplied to existing marine platforms.

Gathering the Information and Introducing it in SPAMDA
The data collected to perform this case study are:
1. The measurements obtained from 2013 to 2017 by the buoy with ID 46001, placed in the Gulf of Alaska, which are provided by NDBC as annual text files. These data are publicly available at the NDBC website.
2. Complementary information collected from reanalysis data containing air temperature (air), pressure (pres) and two components of wind speed measurements, South-North (vwind) and West-East (uwind). This information can be downloaded from the NNRP website in NetCDF format for the four closest reanalysis nodes surrounding the position of the buoy. Concretely, the closest reanalysis nodes downloaded are 57.5 N × 147.5 W, 57.5 N × 150 W, 55 N × 147.5 W, and 55 N × 150 W. However, as will be seen later, only the information from the nearest node will be used in the data integration process.
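How the four nodes can be ranked by proximity to the buoy is sketched below in Python. Two assumptions are made that do not come from the text: the great-circle (haversine) distance is used as a stand-in for the distance of Equation (2), and the buoy position is an approximate, illustrative value rather than the official coordinates of station 46001.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Assumed, approximate buoy position (latitude N, longitude E), illustration only.
BUOY: Tuple[float, float] = (56.3, -148.0)

# The four reanalysis nodes named in the text (west longitudes are negative).
NODES: List[Tuple[float, float]] = [
    (57.5, -147.5),
    (57.5, -150.0),
    (55.0, -147.5),
    (55.0, -150.0),
]

# Nodes ordered from nearest to farthest from the buoy.
ranked = sorted(NODES, key=lambda node: haversine_km(BUOY[0], BUOY[1], node[0], node[1]))
nearest = ranked[0]
```

Under these assumptions the two nearest nodes are 57.5 N × 147.5 W and 55.0 N × 147.5 W, which agrees with the example given for the weighted-mean option.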
After gathering the information described above, researchers can open SPAMDA. In Figure 12, the main view is shown. In order to input the reanalysis data which will be used in further steps for creating the final dataset, researchers have to select the option Manage reanalysis data. Then, the view of Figure 13 is shown. Here, using the buttons located at the bottom, it is possible to add, delete, or consult any data from the different reanalysis files. Once the information has been introduced in the application, this view can be closed and the user can go back to the main view to continue entering the information related to the buoy under study.

After that, the researcher has to select Manage buoys data to open the view shown in Figure 14, where several tabs are available. In the Buoys tab, the researcher can consult, modify, add, or delete different data related to the buoy. To enter such data, the researcher clicks on the New button, and the view shown in Figure 15 pops up. Here, the information about the buoy has to be included: the Station ID, its description, geographical localisation, and the corresponding annual text files. In this case, the files containing the data from 2013 to 2017 are inserted by clicking on the Add file button. Once the data have been introduced, it is necessary to click on the Save button to insert the buoy in the SPAMDA database. After that, the view can be closed.
To create the intermediate dataset, the researcher has to double-click on the buoy under study or click on the Datasets tab (see Figure 14) to switch to the corresponding view (see Figure 16). In this view, the researcher can delete or consult a summary of each intermediate or pre-processed dataset by selecting it from the corresponding list. New datasets can also be created. To proceed with the creation of the intermediate dataset, the user clicks on the New button, and the view shown in Figure 17 appears.
Here, the researcher can select the annual text files to be included in the intermediate dataset by clicking on the -> and <- buttons. In this case, all the files introduced before, which correspond to the buoy under study, are selected. When the file selection is finished, the Create button has to be clicked in order to introduce the description and the file name of the current intermediate dataset. Then, with the Save button, the creation process starts, and its status is shown while it runs. After that, in order to prepare the intermediate dataset, the dataset is selected (see Figure 16), and then the Open button is clicked to jump to the Pre-process tab (shown in Figure 18).
In the Pre-process tab, relevant statistical information about the selected dataset is shown, and the content of the dataset can also be consulted, allowing the researcher to evaluate the pre-processing being performed. Here, the researcher can apply (and configure) the necessary filters (explained in Section 3.3) to the selected dataset, and, in the bottom part, the main statistics of the dataset are displayed, which can be used to observe the changes produced when applying a filter. As mentioned earlier, this case study is focused on classifying waves considering their height, so any missing data in the wave height (376 values) and in the remaining attributes are recovered using the filter Replace missing values with symmetric 3 h mean. Furthermore, the attributes MWD, DEWP, VIS and TIDE are removed from the dataset by applying the filter RemoveByName, since the first two had more than 92% of missing data and the last two 100%. After finishing the pre-processing of the dataset, the researcher can click on the Save button to introduce the description and file name for the current pre-processed dataset.
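The two pre-processing decisions above can be sketched in Python as follows. The exact definition of the filter Replace missing values with symmetric 3 h mean is given in Section 3.3 and is not restated here, so the sketch assumes one plausible reading: on an hourly series, each missing value is replaced by the mean of the observed values lying within 3 h before and after it. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing entries (None) in an attribute."""
    return sum(v is None for v in values) / len(values)


def fill_symmetric_mean(
    values: Sequence[Optional[float]], half_window: int = 3
) -> List[Optional[float]]:
    """Replace each missing value by the mean of its observed neighbours.

    Assumed semantics: `values` is an hourly series and the window spans
    `half_window` positions on each side. Only originally observed values
    are averaged; a gap with no observed neighbour stays missing.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        neighbours = [x for x in values[lo:hi] if x is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Invented values, for illustration only.
wvht = [1.0, None, 3.0, 2.0]
recovered = fill_symmetric_mean(wvht, half_window=1)
```

An attribute whose `missing_fraction` is close to 1, as reported for MWD, DEWP, VIS and TIDE, offers too few neighbours for this kind of recovery, which is consistent with removing it instead.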
At this point, the researcher has registered the buoy in SPAMDA, entered its raw data and selected the data required for the problem (intermediate dataset). Finally, the data have been pre-processed so that they are ready for their future use in ML algorithms. Then, a data integration process can be carried out to merge the processed data from NDBC with the reanalysis data (also included previously) from NNRP. The next step is to customise (or load) the parameters of the matching process according to the problem being studied and to select the prediction task (described in Section 3.4) that the final dataset will be used for, in this case, waves classification or energy flux prediction.

Waves Classification
As mentioned above, the objective of the final dataset is to be used with SC and ML algorithms to classify waves depending on their significant height. The following sections describe the procedure of performing the data integration provided by SPAMDA, modelling wave height by using classification algorithms available in WEKA.

Obtaining the Final Dataset
By clicking on the Matching configuration tab, the view shown in Figure 19 will be opened. In this view, the researcher can configure the parameters of the data integration process. For this problem, the following parameters were selected:
• Attribute to predict: WVHT.
• Reanalysis data: Air, pressure, u-wind and v-wind.
• Buoy attributes to be used as inputs: WDIR, WSPD, GST, DPD, APD, PRES, ATMP and WTMP (see Table 1 or descriptions of the measurements in the NDBC website).
• Reanalysis nodes to consider: 1 (only the closest reanalysis node will be used).
• Number of final datasets: In this example, this option is disabled because only one reanalysis node is considered.
• Prediction task: Classification.
After configuring the matching process, the researcher can click on the Run button to jump to the view shown in Figure 20 and proceed to define the final dataset structure according to the selected prediction task. Given that Classification was selected in the previous view (Figure 19), the researcher can now add, modify, or delete the thresholds (usually defined by an expert) for discretising the output variable (top left of Figure 20). After this, the next step is to set the desired time horizon (6 h by default) and to activate, if desired, the synchronisation (in time) of the reanalysis variables with the output (top right of Figure 20), as explained in Section 3.5. Then, the researcher can click on the Update final dataset button to see the content shown in the bottom left corner (NDBC observations, NNRP variables, missing values, dates). Finally, after checking that everything is correct, the last step is to select the name and path of the dataset file and its output format (CSV or ARFF), and to click on the Create final datasets button (bottom right of Figure 20). For this example, the following configuration was applied:
• Thresholds: see Table 2.

At this point, the final dataset is created according to the tailored configuration and stored in the computer of the researcher, who can now apply ML techniques to address the problem of wave classification. Concretely, the final dataset consists of 7302 instances, whose class distribution is shown in Table 3.
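The discretisation of the output attribute by thresholds can be sketched in Python as follows. The class labels and threshold values used here are hypothetical placeholders and are not those of Table 2; the convention that each threshold is the inclusive lower bound of the next class is also an assumption.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical thresholds (m) and labels, NOT the values of Table 2.
UPPER_BOUNDS = [1.5, 2.5, 4.0]
LABELS = ["low", "average", "big", "huge"]


def discretise(
    value: float,
    bounds: Sequence[float] = UPPER_BOUNDS,
    labels: Sequence[str] = LABELS,
) -> str:
    """Map a real value to a nominal class.

    `bounds` must be sorted in increasing order and there must be exactly
    one more label than bounds. A value equal to a bound falls in the
    class above it.
    """
    if len(labels) != len(bounds) + 1:
        raise ValueError("expected exactly one more label than bounds")
    return labels[bisect_right(bounds, value)]


# Invented wave heights, for illustration only.
class_wvht = [discretise(h) for h in (0.8, 1.5, 3.2, 5.0)]
```

Applying such a mapping to every value of WVHT yields the nominal attribute that the text refers to as Class_WVHT.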

Obtaining Classification Models with ML Algorithms
Now, the process to obtain wave classification models is described using the final dataset previously created with SPAMDA. The modelling will be performed using WEKA as the SC and ML tool, which can be opened through SPAMDA, as shown in Figure 21. Nevertheless, as mentioned above, the researcher can create the final dataset in CSV format in order to use any other ML tool, such as KEEL, Python, or R, among others.
Since the final dataset is a time series of meteorological data (collected from 2013 to 2017), a hold-out scheme (60% train/40% test) will be used. In this way, the years from 2013 to 2015 will be used for the training phase (4380 instances), whereas the years 2016 and 2017 will be used for the test phase (2922 instances). Prior to the learning phase, the attributes are normalised to prevent attributes with a larger scale from dominating the others.
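For researchers working outside WEKA, the chronological hold-out and the normalisation can be sketched in Python as below. The text does not specify the normalisation method, so min-max scaling is assumed here; computing the ranges on the training years only is likewise a choice of this sketch, not something stated in the paper. The function names and the `date` key are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def split_by_year(rows: Sequence[Row], last_train_year: int) -> Tuple[List[Row], List[Row]]:
    """Chronological hold-out: years up to `last_train_year` form the training set."""
    train = [r for r in rows if r["date"].year <= last_train_year]
    test = [r for r in rows if r["date"].year > last_train_year]
    return train, test


def fit_min_max(train: Sequence[Row], names: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """Minimum and maximum of each numeric attribute over the training set."""
    return {n: (min(r[n] for r in train), max(r[n] for r in train)) for n in names}


def apply_min_max(rows: Sequence[Row], ranges: Dict[str, Tuple[float, float]]) -> List[Row]:
    """Scale attributes with the given ranges; test values may fall outside [0, 1]."""
    scaled = []
    for r in rows:
        out = dict(r)
        for name, (lo, hi) in ranges.items():
            out[name] = 0.0 if hi == lo else (r[name] - lo) / (hi - lo)
        scaled.append(out)
    return scaled


# Invented instances, for illustration only.
rows = [
    {"date": datetime(2014, 1, 1, 0), "WSPD": 4.0},
    {"date": datetime(2015, 12, 31, 18), "WSPD": 2.0},
    {"date": datetime(2016, 1, 1, 0), "WSPD": 3.0},
]
train, test = split_by_year(rows, last_train_year=2015)
ranges = fit_min_max(train, ["WSPD"])
train_scaled, test_scaled = apply_min_max(train, ranges), apply_min_max(test, ranges)
```

Splitting by year rather than at random keeps every test instance later in time than the training data, which matches the time-series nature of the dataset.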
The classification algorithms that will be considered for wave modelling are Logistic Regression [64], C4.5 [65], Random Forest [66], Support Vector Machine [67] and Multilayer Perceptron [68], which will be applied with the default values of the parameters provided by WEKA. Given that Logistic Regression and C4.5 algorithms are deterministic, only one run will be considered for each one. However, Random Forest, Support Vector Machine and Multilayer Perceptron algorithms have a stochastic component, so, in this case, 30 executions for each one will be carried out. Table 4 shows the results of this experimentation.
As can be seen, Random Forest and Multilayer Perceptron algorithms have achieved similar accuracy, but the performance of the latter is slightly better. Although this is an illustrative classification example using datasets built with SPAMDA, both models have obtained good performance, despite the fact that the problem being tackled is difficult (prediction is approached six hours in advance).

Energy Flux Prediction
As mentioned above, the final dataset of this example is also used with SC and ML algorithms to predict flux of energy. The following sections explain the process of performing the data integration provided by SPAMDA to build the final dataset, modelling the flux of energy by using regression algorithms available in WEKA.

Obtaining the Final Dataset
The researcher can configure the parameters of the data integration by clicking on the Matching configuration tab. For this problem, as shown in Figure 22, the following parameters were selected:
• Attribute to predict: Flux of energy.
• Reanalysis data: Air, pressure, u-wind and v-wind.
• Buoy attributes to be used as inputs: WDIR, WSPD, GST, DPD, APD, PRES, ATMP and WTMP (see Table 1 or descriptions of the measurements in the NDBC website).
• Reanalysis nodes to consider: 1 (only the closest reanalysis node will be used).
• Number of final datasets: In this example, this option is disabled because only one reanalysis node is considered.
• Prediction task: Regression.
After configuring the parameters of the matching process, the next step is to define the final dataset structure according to the selected prediction task. Researchers can click on the Run button to jump to the view shown in Figure 23. Note that the thresholds for discretising the output variable (top left of Figure 23) are disabled because, in this case, energy flux prediction is a regression problem. By default, the time horizon is set to six hours, that is, the energy flux prediction will be performed six hours in advance (top right of Figure 23), but researchers can increase this time horizon depending on their needs. The synchronisation (in time) of the reanalysis variables with the output (explained in Section 3.5) can be set in this view. By clicking on the Update final dataset button, researchers can preview the content of the final dataset (bottom left corner of Figure 23). Finally, the last step is to set the name, path and output format (CSV or ARFF) of the dataset file and to click on the Create final datasets button (bottom right of Figure 23). For this example, the following configuration was applied: •

After that, the final dataset is created and stored in the computer of the researcher according to the introduced configuration, ready to be used as input for SC and ML techniques to tackle the problem of energy flux prediction. The number of instances (7302) and the distribution of the final dataset (Table 3) are the same as in the previous example (waves classification), since the data used to create the final dataset and the time horizon selected (6 h) are the same.
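This section does not restate how the Flux of energy attribute is obtained. As a point of reference, the standard deep-water expression for wave energy flux per unit of wave-crest length, F_e = ρ g² H_s² T_e / (64π), can be evaluated with the Python sketch below. The sea-water density, the function name, and the idea of feeding it with the buoy attributes WVHT (as H_s) and APD (as T_e) are assumptions of this sketch, not statements about SPAMDA's internals.

```python
import math

RHO_SEAWATER = 1025.0  # kg/m^3, assumed typical sea-water density
GRAVITY = 9.81  # m/s^2


def energy_flux_kw_per_m(hs_m: float, te_s: float) -> float:
    """Deep-water wave energy flux in kW per metre of wave crest.

    hs_m: significant wave height H_s in metres.
    te_s: wave energy period T_e in seconds.
    """
    watts_per_m = RHO_SEAWATER * GRAVITY ** 2 / (64.0 * math.pi) * hs_m ** 2 * te_s
    return watts_per_m / 1000.0


# Invented sea state, for illustration only: H_s = 2 m, T_e = 8 s.
example_flux = energy_flux_kw_per_m(2.0, 8.0)
```

With these constants the leading coefficient is roughly 0.49 kW·s/m³, so the flux grows with the square of the wave height and linearly with the period.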

Obtaining Prediction Models with ML Algorithms
In this example, WEKA is used as the SC and ML tool to obtain energy flux prediction models, as shown in Figure 24. Nonetheless, the final dataset can be created in CSV format so that the researcher can use any other SC and ML tool. For this problem, the same partitioning scheme used in the wave classification problem is considered (60% train/40% test), that is, 2013 to 2015 for the training phase (4380 instances) and 2016 and 2017 for the test phase (2922 instances). Again, the attributes are normalised prior to the learning phase.
To perform the energy flux modelling, one execution will be run for the deterministic algorithm Linear Regression [36], whereas 30 executions will be considered for the stochastic ones: Random Forest [66], Support Vector Machine [67] and Multilayer Perceptron [68]. Table 5 shows the experimental results obtained using the default values for the parameters of the algorithms provided by WEKA. In this case study, the use of datasets created with SPAMDA has been shown to address an energy flux prediction problem. An exhaustive comparison of regression algorithms is not the purpose of this work. However, note that Multilayer Perceptron and Random Forest algorithms have achieved very good results despite the fact that the energy flux prediction has been performed with a time horizon of 6 h.

Important Remarks
This section has described how to use SPAMDA to create final datasets with the aim of classifying waves and predicting flux of energy. However, using the same data described in Section 4.1, the researcher can quickly address other objectives or different studies by merely tailoring the matching configuration of the data integration process. For example, longer-term wave or energy flux prediction can be addressed by changing the time horizon, wave modelling can be approached from another perspective by creating the final dataset for regression, or environmental modelling can be focused on diverse fields by changing the output meteorological variable.
Furthermore, environmental modelling in other geographical locations can be carried out by merely using other collected data.
As SPAMDA performs all data processing and management to create the datasets, it not only prevents researchers from performing repetitive tasks but also prevents them from making possible errors. In this way, researchers can focus on the studies they are carrying out.

Conclusions
Studies on marine energy using ML and SC methodologies apply specific algorithms (extreme learning machine, metaheuristics, Bayesian networks, neural networks, etc.) on data using custom-made implementations or scripts developed in some programming language; but they do not allow for building datasets in an automated way ready to be used as input for prediction tasks (classification or regression). In this sense, a new open source tool named SPAMDA has been presented in this work, with a user-friendly GUI for creating datasets using meteorological data from NDBC and NNRP. The aim of the tool is to provide the research community with an automated, customisable and robust integration for NDBC and NNRP data, serving as a tool for analysis and decision support in marine energy and engineering applications, among others.
Such datasets can be easily obtained with SPAMDA by means of the selection of different input parameters, such as predictive and objective variables, output discretisation or prediction horizon. As a result, researchers will benefit from significant support when carrying out environmental modelling related to energy, atmospheric or oceanic studies, among others. Moreover, given that SPAMDA simplifies all the intermediate steps involved in the creation of datasets and manages the extensive casuistry of the data integration (such as specifying the meteorological information, managing incomplete data, pre-processing tasks, the customisable matching process to merge the data and the preparation of the datasets according to the SC or ML technique to use), it avoids errors and reduces the time needed. In this way, researchers will be able to have more in-depth analysis, which could result in more complete conclusions about the issue under study.
The case study described in Section 4 illustrates how SPAMDA can be used by researchers in a practical approach for environmental modelling, concretely, to classify waves in the Gulf of Alaska depending on their height. The case study also covers an example of energy flux prediction, to predict the wave energy that could be exploited by WEC facilities six hours in advance, although this time horizon is customisable. Given that this work does not focus on model performance, a more extensive validation or comparison study of the results obtained in both examples has not been carried out. The final datasets obtained with SPAMDA can be replicated by researchers using the same meteorological data from NDBC and NNRP (publicly available) and applying the same parameters for the pre-processing tasks and the data integration process. After that, the models and results obtained using such final datasets will depend on the SC or ML tool used.
In order to improve SPAMDA, some future work could be focused on new functional modules for managing meteorological data of different formats [69], so that the developed tool can be extended to any other research, on new pre-processing functionalities such as filters to analyse the correlation between attributes, or on new functional modules for recovering missing values using nearby buoys data [70]. Furthermore, the developed software could manage other sources of reanalysis data (with different spatial and temporal resolution), and new output formats for the datasets which could be used as input by other tools for ML such as KEEL (Knowledge Extraction based on Evolutionary Learning) [71]. However, such new functionalities can be developed with a reasonable effort to be able to manage each particular casuistry, for example, when dealing with incomplete data, interpreting different data and file structures, or carrying out the matching process of two environmental data sources.

Funding: This work has been partially subsidised by the projects with references TIN2017-85887-C2-1-P of the Spanish Ministry of Economy and Competitiveness (MINECO), UCO-1261651 of the "Consejería de Economía, Conocimiento, Empresas y Universidad" of the "Junta de Andalucía" (Spain) and FEDER funds of the European Union.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found in the National Data Buoy Center (NDBC) and the NOAA Physical Sciences Laboratory (PSL).

WEKA (Waikato Environment for Knowledge Analysis) software tool, to University Corporation for Atmospheric Research/Unidata for the NetCDF (network Common Data Form) Java library and to QOS.ch for the SLF4J (Simple Logging Facade for Java) library.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
Abbreviations used in this manuscript:
o: The attribute to be predicted at the time instant to study
∆t: The prediction horizon
b_t: The vector containing the selected NDBC variables
r_t: The vector containing the selected reanalysis variables

Appendix A. Managing the Casuistry of Incomplete Data
In this appendix, we describe how SPAMDA deals with incomplete data when creating intermediate datasets and performing the matching process.
The measurements collected by the buoys may be incomplete or recorded at a different time than expected, due to the weather conditions in which the buoys have to operate. To illustrate this casuistry, the following examples are shown in Figure A1:

SPAMDA has been designed to tackle these situations, and it informs researchers of any incident found while reading the annual text files for creating the intermediate datasets. For measurements that were recorded at a different time than expected, a tolerance of 6 min (10% of an hour) has been established. Therefore, if the time difference exceeds this value, the date will be considered unexpected. Figure A2 shows the status of the creation of an intermediate dataset with the information of Figure A1. Note that the instance marked with a) has not been reported by SPAMDA as an unexpected date because its time difference is less than six minutes.

Depending on the affected attribute, NDBC uses a specific value [69] to indicate the presence of lost data (e.g., 99 for VIS and TIDE attributes, 999 for DEWP, MWD and WDIR, etc.). SPAMDA interprets these specific values and, after creating the intermediate dataset, researchers can check if it contains missing values by visualising its statistical information or content. Remember that SPAMDA provides several filters for recovering missing data, which were described in Section 3.2.
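The two checks described above can be sketched in Python as follows. This is an illustration and not SPAMDA's code: the function names are hypothetical, and only the missing-data codes explicitly mentioned in the text are listed (NDBC defines codes for other attributes as well).

```python
from datetime import datetime, timedelta
from typing import Optional

# Missing-data codes quoted in the text; other attributes are not covered here.
MISSING_SENTINELS = {"VIS": 99.0, "TIDE": 99.0, "DEWP": 999.0, "MWD": 999.0, "WDIR": 999.0}

# Tolerance for measurements recorded at a different time than expected.
TOLERANCE = timedelta(minutes=6)


def clean_value(attribute: str, raw: float) -> Optional[float]:
    """Return None when `raw` is the missing-data code of `attribute`."""
    sentinel = MISSING_SENTINELS.get(attribute)
    if sentinel is not None and raw == sentinel:
        return None
    return raw


def is_unexpected(recorded: datetime, expected: datetime, tolerance: timedelta = TOLERANCE) -> bool:
    """True when the recording time differs from the expected one by more than the tolerance."""
    return abs(recorded - expected) > tolerance


# Invented timestamps, for illustration only.
expected = datetime(2017, 5, 1, 17, 50)
within = is_unexpected(datetime(2017, 5, 1, 17, 55), expected)   # 5 min late
outside = is_unexpected(datetime(2017, 5, 1, 17, 57), expected)  # 7 min late
```

Mapping the codes to an explicit missing marker before computing statistics avoids values such as 999 being averaged as if they were real measurements.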
SPAMDA takes this casuistry into account when carrying out the matching process. An example is given in Figure A3. As mentioned above, the matching process is performed with the nearest measurement (previous or next) within a maximum of 60 min of difference. However, in the instance marked with e), given that the measurement dates 1 May 2017 17:50 and 1 May 2017 18:50 are missing, the reanalysis date 1 May 2017 18:00 cannot be matched with buoy data (this date is highlighted in mauve in Figure A3). Depending on the selection made by researchers in the parameter Include missing dates, this instance will be included in the final dataset (with missing values for buoy variables) or not.
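The matching rule can be illustrated with the following Python sketch, which looks for the buoy measurement nearest to a reanalysis date and accepts it only if it lies within 60 min. The function name is hypothetical, and treating a difference of exactly 60 min as acceptable is an assumption of this sketch.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence

MAX_GAP = timedelta(minutes=60)


def match_nearest(
    reanalysis_time: datetime,
    buoy_times: Sequence[datetime],
    max_gap: timedelta = MAX_GAP,
) -> Optional[datetime]:
    """Return the buoy timestamp nearest to `reanalysis_time`, or None.

    None means that no measurement (previous or next) lies within `max_gap`,
    so the reanalysis date cannot be matched with buoy data.
    """
    if not buoy_times:
        return None
    nearest = min(buoy_times, key=lambda t: abs(t - reanalysis_time))
    return nearest if abs(nearest - reanalysis_time) <= max_gap else None


# Situation of instance e): the 17:50 and 18:50 measurements are missing.
target = datetime(2017, 5, 1, 18, 0)
available = [datetime(2017, 5, 1, 16, 50), datetime(2017, 5, 1, 19, 50)]
unmatched = match_nearest(target, available)
```

Here `unmatched` is `None`, because the closest remaining measurements are 70 and 110 min away; whether such an instance is kept with missing buoy values then depends on the Include missing dates parameter.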