#### *2.1. IOOS COMT Model Data Interoperability Design*

The IOOS COMT model data interoperability design used the same core strategy described in [8]: convert collections of non-standard data files to a common data model using a lightweight Extensible Markup Language (XML) layer, which then allows datasets to be distributed uniformly via standard services that can be consumed by standards-based clients (applications) (Figure 3). At the heart of the system is the Unidata Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server, which is built on the Unidata NetCDF-Java library. The NetCDF-Java library is capable of reading NetCDF, HDF5, GRIB and GRIB2 data files into a common data model, which allows a uniform representation of the data regardless of input format. In addition, it can read NetCDF Markup Language (NcML) files, simple XML files that allow the provider to define aggregations of binary files as well as provide or modify metadata. Thus collections of binary files that follow non-standard conventions can be turned into aggregated, standardized datasets without modification of the original files. This is a powerful feature that places minimal burden on the providers: they can continue to use their existing data files with their existing software while exploring the benefits of standardized tools. The new features of this system since [8] are described individually below.
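As an illustration of the NcML layer, the following minimal Python sketch writes an NcML document that joins a directory of files along an existing time dimension and adds global metadata without touching the binary files. The directory path, title and the name `build_ncml` are hypothetical; only the NcML elements (`netcdf`, `attribute`, `aggregation`, `scan`) come from the NcML schema.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def build_ncml(data_dir, suffix=".nc", conventions="CF-1.6",
               title="Example aggregated model run"):
    """Return NcML text that aggregates the files in data_dir along time."""
    ET.register_namespace("", NCML_NS)
    root = ET.Element(f"{{{NCML_NS}}}netcdf")
    # Add or override global attributes; the original files are not modified.
    for name, value in (("Conventions", conventions), ("title", title)):
        ET.SubElement(root, f"{{{NCML_NS}}}attribute", name=name, value=value)
    # Join every matching file along the existing "time" dimension.
    agg = ET.SubElement(root, f"{{{NCML_NS}}}aggregation",
                        dimName="time", type="joinExisting")
    ET.SubElement(agg, f"{{{NCML_NS}}}scan", location=data_dir, suffix=suffix)
    return ET.tostring(root, encoding="unicode")


# Hypothetical directory holding one simulation's output files.
ncml_text = build_ncml("/data/model_run_01/")
print(ncml_text)
```

The resulting text can be saved with an `.ncml` extension; which attributes actually need to be added depends on what the original files already contain.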

#### *2.2. Advances in Model Data Standards, Tools and Techniques*

#### 2.2.1. The ncSOS Service for Observational Data

The CF Conventions and the Unidata Common Data Model were originally developed only for 2D, 3D, and 4D gridded data (featureType:Grid) with two spatial dimensions (e.g., longitude, latitude), and time and/or depth dimensions. The success of this approach motivated its extension to observational data. In version 1.6 of the CF Conventions, metadata were defined to support observational data such as tide gauges, CTDs, ADCPs and ocean gliders (featureTypes: TimeSeries, Profile, TimeSeriesProfile, Trajectory). The NetCDF-Java library was updated to support these featureTypes, allowing for customized methods appropriate for these data types.

The OGC Sensor Observation Service (SOS) is an IOOS-approved web service for delivering observational data, supporting GetCapabilities, DescribeSensor, and other service requests that allow for a rich exchange between the server and client. Typically, SOS services connect to databases that store the observational data (e.g., NOAA-COOPS, NDBC, 52 North), but with the new CF-1.6 specifications allowing standardized collections of observed data in NetCDF files to be ingested into the Common Data Model, it was realized that an SOS service could also be developed relatively easily for the THREDDS Data Server. Under funding from IOOS COMT, RPS ASA developed ncSOS and released it as open source; development has continued with funding from USGS and IOOS [13]. Written in Java, it is a simple plug-in for the THREDDS Data Server, installing in minutes with no configuration necessary. This allows access to observational data by any broker or client that can formulate SOS requests, either as XML or as simple Representational State Transfer (REST) text, and process the responses (currently XML, JSON or CSV).
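A simple REST-style SOS request is just a URL with key-value parameters. The sketch below assembles such URLs in Python; the endpoint, offering name and observed property are hypothetical, and the parameter names follow common SOS 1.0.0 key-value usage rather than anything specific to ncSOS. The GetCapabilities response of a given server is the authority on what it accepts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sos_request_url(endpoint, request, **params):
    """Build a key-value SOS request URL for the given operation."""
    query = {"service": "SOS", "version": "1.0.0", "request": request}
    query.update(params)
    return endpoint + "?" + urlencode(query)


# Hypothetical endpoint for a CF-1.6 time series file served by THREDDS.
ENDPOINT = "https://example.org/thredds/sos/obs/station_timeseries.nc"

caps_url = sos_request_url(ENDPOINT, "GetCapabilities")

# Offering and property names are placeholders, not real identifiers.
obs_url = sos_request_url(
    ENDPOINT,
    "GetObservation",
    offering="hypothetical_station",
    observedProperty="sea_water_temperature",
)

print(caps_url)
print(obs_url)
```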

#### 2.2.2. Unstructured Grid (UGRID) Standards and Tools

To represent the data output from unstructured grid models in a common way, metadata conventions need to be adopted. The CF Conventions have proven very popular and effective for structured (e.g., rectilinear, curvilinear) grid model output, but had no way of specifying the grid topology (connectivity) necessary for unstructured grids, or concepts such as location of data on the grid elements (e.g., located on faces, edges or nodes). Shortly after CF version 1.0 was released in 2008 a UGRID Interoperability Google Group was formed with representatives from organizations such as Deltares, NOAA, USGS, DOE, and the FVCOM, ADCIRC, and SELFE modeling communities [14]. After several years of discussion, development and testing, unstructured grid metadata conventions were finally released in 2013 as UGRID 0.9 [15]. The conventions were developed to allow specification of data variables on fixed horizontal unstructured grids. Higher-order element representation of data variables and handling data from moving or changing meshes were left as future enhancements, with the realization that these enhancements might necessitate a different underlying data model, but leave the functionality for users intact.
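To make the ideas of topology and data location concrete, the following Python sketch describes a two-triangle mesh with UGRID-style attributes and uses the connectivity array to move a node-located variable onto faces. The mesh, the values and the helper `node_to_face` are hypothetical; the attribute names (`cf_role`, `topology_dimension`, `node_coordinates`, `face_node_connectivity`, `mesh`, `location`) are those defined by the UGRID conventions.

```python
# Node coordinates of a tiny two-triangle mesh (hypothetical values).
node_lon = [0.0, 1.0, 1.0, 0.0]
node_lat = [0.0, 0.0, 1.0, 1.0]

# Each row lists the node indices (zero-based) that form one triangular face.
face_node_connectivity = [[0, 1, 2], [0, 2, 3]]

# Attributes of the mesh topology variable, as they would appear in a file.
mesh_topology = {
    "cf_role": "mesh_topology",
    "topology_dimension": 2,
    "node_coordinates": "node_lon node_lat",
    "face_node_connectivity": "face_node_connectivity",
}

# A data variable declares which mesh it lives on and where on the elements.
water_level_attrs = {"mesh": "mesh", "location": "node"}
water_level_node = [0.10, 0.20, 0.40, 0.30]


def node_to_face(values, connectivity):
    """Average node-located values onto each face using the connectivity."""
    return [sum(values[n] for n in face) / len(face) for face in connectivity]


water_level_face = node_to_face(water_level_node, face_node_connectivity)
print(water_level_face)
```

Because the connectivity and data location are declared in metadata, the same function applies to any mesh described this way, regardless of which model produced it.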

With the new UGRID 0.9 conventions for unstructured grid data, it was possible to create a new class for NetCDF-Java to support the UGRID featureType. This Java code was also developed by RPS ASA and released as an open source plugin for NetCDF-Java and/or THREDDS Data Server [16]. This allows for unambiguous retrieval of properties such as connectivity arrays or data location on the elements (e.g., face, node), which allows interoperable clients to be developed to support any UGRID-compliant data.

The NCTOOLBOX was developed to leverage the NetCDF-Java library Common Data Model for Matlab users [17]. An evolution of the njTBX Toolbox for Matlab described in [8], it supports a wide range of operations on CF-compliant gridded data (rectilinear or curvilinear). With the new UGRID 0.9 conventions for unstructured grid data, and support in NetCDF-Java, it was possible for RPS ASA to create new tools for NCTOOLBOX to support UGRID-compliant datasets as well. As an example, water levels from three different models used in COMT (ADCIRC, SELFE and FVCOM) can be accessed and displayed without using model-specific code (Figure 4). The Matlab code to recreate this figure is the script *demos/contrib/test\_ugrid3.m* from the UGRID version of NCTOOLBOX, available at [18].

Blanton *et al.* [19] leveraged the capabilities of UGRID standards and NCTOOLBOX to build a powerful GUI-based tool (ADCIRCVIZ) for accessing and visualizing storm forecasts run on unstructured grid models from multiple remote locations. While geared toward forecasts computed with ADCIRC, any model that conforms to the UGRID standard can be visualized in this application.

The THREDDS Data Server [20] currently includes the built-in Web Map Service (WMS) provided by ncWMS, developed by the University of Reading [21]. Although this service works exceptionally well for rectilinear data, the performance is poor for curvilinear grids and there is no support for unstructured grids.

To rectify this situation, RPS ASA built a new Python-based WMS service called SciWMS [22] that uses standard Python plotting via the Matplotlib Basemap library to generate maps. This turns out to be several times faster than the approach ncWMS uses, at least for the current generation of models, and works for unstructured grids. Because it is written in Python, it cannot be bundled with THREDDS like ncWMS. It must be installed and configured separately, but the procedure is well documented, along with instructions on how to customize a THREDDS server configuration to point to the SciWMS mapping services instead of the usual THREDDS-supplied WMS services. With this configuration in place, the SciWMS services, rather than the default ncWMS services, become associated with the ISO metadata and are therefore discoverable via the catalog services. Thus tools can be developed that allow searching for relevant datasets via the ISO metadata, and then quick display of model results via the SciWMS services.
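From the client side, a map is requested from any WMS server with a GetMap URL. The Python sketch below builds one using the standard WMS 1.1.1 parameters; the endpoint and layer name are hypothetical, and whether a particular SciWMS or ncWMS deployment requires additional parameters is not covered here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def getmap_url(endpoint, layer, bbox, width=512, height=512,
               style="", fmt="image/png"):
    """Build a WMS 1.1.1 GetMap URL; bbox is (west, south, east, north)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": style,
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer name for a water level field.
map_url = getmap_url(
    "https://example.org/wms/datasets/hypothetical_run",
    layer="water_level",
    bbox=(-77.0, 34.0, -70.0, 42.0),
)
print(map_url)
```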

**Figure 4.** Water levels from three different unstructured grid models (ADCIRC, SELFE and FVCOM) displayed by the NCTOOLBOX script *demos/contrib/test\_ugrid3.m*. The script takes advantage of the UGRID conventions to access and display data from different unstructured grid models without any model-specific code. Any UGRID-compliant model could be displayed.

**Figure 5.** Comparison of ocean glider data (top panel) with forecast data from three different forecast models: the SECOORA SABGOM ROMS model from NCSTATE, the NAVY USEAST NCOM model, and the NOAA Global RTOFS HYCOM model. Tools from NCTOOLBOX were used that can extract vertical sections along time and space paths from any Climate and Forecast (CF)-compliant structured grid ocean model. The scripts that produce these plots may be found in the toolbox *demos/contrib* directory.

#### 2.2.3. Expanded Analysis Functions and Demos in NCTOOLBOX for Matlab Users

In addition to providing support for unstructured grids, more tools and demos have been added to the NCTOOLBOX for Matlab, significantly increasing the functionality over the preceding njTBX toolbox described in [8]. As an example, the *nc\_genslice.m* function takes a CF-compliant model dataset URL and an [x,y,z,t] trajectory on input, and returns an interpolated track from the selected model along that path. Instead of downloading data from the entire bounding box and temporal extent of the glider path from the model, the data is extracted in small chunks following the glider path, and the end result is typically only a few hundred KB of data. This provides an easy way to compare different models to ocean gliders, and was recently used with several IOOS forecast models and data collected during GliderPalooza, a collaborative glider campaign run on the US East Coast during Sep–Nov 2013 (Figure 5). Because users of NCTOOLBOX have the data, not just graphics, quantitative model assessment can be performed in addition to visual comparison. Wilkin and Hunter [23] leveraged the power of these new routines to objectively assess seven different forecast models in the Mid-Atlantic Bight, using IOOS community glider data collected over an 18 month period.
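The idea of extracting small chunks along a path can be sketched in a few lines of Python. This is not the *nc\_genslice.m* implementation: it assumes a regular longitude/latitude grid held in memory, ignores depth, uses the nearest model time, and interpolates bilinearly from the 2x2 block of cells around each track point. All names and values are hypothetical.

```python
from bisect import bisect_right


def bracket(axis, value):
    """Return i such that axis[i] and axis[i + 1] bracket value (clamped)."""
    i = bisect_right(axis, value) - 1
    return max(0, min(i, len(axis) - 2))


def nearest(axis, value):
    """Return the index of the axis entry closest to value."""
    return min(range(len(axis)), key=lambda k: abs(axis[k] - value))


def extract_track(lon, lat, time, field, track):
    """Interpolate field[t][j][i] at each (x, y, t) point of the track."""
    out = []
    for x, y, t in track:
        n = nearest(time, t)
        i = bracket(lon, x)
        j = bracket(lat, y)
        # Only this 2x2 chunk would need to be requested from a remote server.
        chunk = [row[i:i + 2] for row in field[n][j:j + 2]]
        wx = (x - lon[i]) / (lon[i + 1] - lon[i])
        wy = (y - lat[j]) / (lat[j + 1] - lat[j])
        south = chunk[0][0] * (1 - wx) + chunk[0][1] * wx
        north = chunk[1][0] * (1 - wx) + chunk[1][1] * wx
        out.append(south * (1 - wy) + north * wy)
    return out


lon = [0.0, 1.0, 2.0]
lat = [10.0, 11.0]
time = [0.0, 3600.0]
# Synthetic field that is linear in lon and lat, offset by 100 per time step.
field = [[[x + y + 100 * n for x in lon] for y in lat]
         for n in range(len(time))]
track = [(0.5, 10.5, 0.0), (1.25, 10.75, 3500.0)]

values = extract_track(lon, lat, time, field, track)
print(values)
```

With a remote dataset, the slicing step would become an index-range request, so the volume transferred scales with the number of track points rather than with the size of the model domain.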

#### 2.2.4. An Improved Procedure for Modelers to Create Standardized Datasets

In the COMT, many of the groups wanted to upload their data to a central server, requiring a procedure to catalog the datasets being uploaded. As is typical in the larger ocean community, the modeling groups generated output files with differing metadata and conventions. All were NetCDF, but while some were nearly CF or UGRID-compliant, others contained only minimal metadata. For a single simulation, some modeling groups produced a single NetCDF file for all variables and time steps, while others produced collections of NetCDF files, with individual files for each variable and a fixed number of time steps. To handle this situation, an approach was developed that used template NcML files and a Google Drive spreadsheet to automatically generate the THREDDS catalog.

Despite the non-uniformity of output files, NcML made it possible to virtually aggregate and standardize the datasets. For each modeling group, a template NcML file was provided that would turn their output files into a single, CF-compliant or UGRID-compliant dataset. For example, the template provided to the SELFE group aggregated each variable along the time dimension, and then aggregated all the variables together, while also aggregating a grid file that contained the lon/lat locations of the mesh, allowing the 49 different files constituting a single simulation to be accessed through a single UGRID-compliant URL. The modeling groups could use these templates without needing to understand the details of the CF or UGRID conventions, and the templates needed little or no modification to be used for each simulation performed by a particular modeling group.
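The nested aggregation described for the SELFE case can be sketched as follows: a union of the grid file with one time-joined aggregation per variable. This Python sketch only generates the NcML text; the file naming pattern, variable names, paths and the name `build_union_ncml` are hypothetical and do not reproduce the actual template.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def q(tag):
    """Qualify a tag name with the NcML namespace."""
    return f"{{{NCML_NS}}}{tag}"


def build_union_ncml(grid_file, variables, data_dir):
    """Union a grid file with per-variable aggregations along time."""
    ET.register_namespace("", NCML_NS)
    root = ET.Element(q("netcdf"))
    union = ET.SubElement(root, q("aggregation"), type="union")
    # The grid file supplies the lon/lat locations of the mesh.
    ET.SubElement(union, q("netcdf"), location=grid_file)
    for var in variables:
        member = ET.SubElement(union, q("netcdf"))
        join = ET.SubElement(member, q("aggregation"),
                             dimName="time", type="joinExisting")
        # Assumed naming pattern: files for a variable end in "_<var>.nc".
        ET.SubElement(join, q("scan"), location=data_dir,
                      regExp=rf".*_{var}\.nc$")
    return ET.tostring(root, encoding="unicode")


union_text = build_union_ncml(
    "/data/selfe/run1/grid.nc", ["elev", "salt", "temp"], "/data/selfe/run1/"
)
print(union_text)
```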

A spreadsheet on Google Drive was used by modelers to specify the location of their NcML template as well as additional custom descriptors for the model run. After completing a new simulation, they would create a directory on the testbed server and upload their output files and template NcML (preferably using GridFTP via Globus [24]). They then added a row to a shared Google Drive spreadsheet that specified a title for the run, the location of the NcML template, and a short summary statement describing the model run. Every hour a Python script running on the testbed server read the spreadsheet using the Google API and combined the metadata from the numerous NcML files and additional metadata from the Google Spreadsheet into a single THREDDS Catalog of CF- and UGRID-compliant datasets.
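The catalog-generation step can be illustrated with a simplified Python sketch. It is not the testbed script: the spreadsheet is stood in for by CSV text, the column names, paths and `urlPath` scheme are hypothetical, service definitions are omitted, and referencing each NcML template by location (rather than merging its content) is an assumption of this sketch.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

CAT_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Stand-in for rows read from the shared spreadsheet (hypothetical content).
SHEET_CSV = """title,ncml,summary
Hypothetical SELFE run 1,/data/selfe/run1/template.ncml,Example storm surge hindcast
Hypothetical FVCOM run 2,/data/fvcom/run2/template.ncml,Example tidal simulation
"""


def slug(text):
    """Turn a run title into an identifier safe for IDs and URL paths."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_catalog(sheet_csv, name="Hypothetical testbed catalog"):
    """Combine one spreadsheet row per run into a single catalog document."""
    ET.register_namespace("", CAT_NS)
    ET.register_namespace("ncml", NCML_NS)
    catalog = ET.Element(f"{{{CAT_NS}}}catalog", name=name)
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        ident = slug(row["title"])
        dataset = ET.SubElement(catalog, f"{{{CAT_NS}}}dataset",
                                name=row["title"], ID=ident,
                                urlPath="testbed/" + ident)
        doc = ET.SubElement(dataset, f"{{{CAT_NS}}}documentation",
                            type="summary")
        doc.text = row["summary"]
        # Point at the modeler's NcML template for this run.
        ET.SubElement(dataset, f"{{{NCML_NS}}}netcdf", location=row["ncml"])
    return ET.tostring(catalog, encoding="unicode")


catalog_text = build_catalog(SHEET_CSV)
print(catalog_text)
```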

#### 2.2.5. Enabling Discovery via Standardized Metadata and Catalog Services

Enabling standardized datasets is a great step forward for interoperability, but it still can be difficult for users to find these standardized datasets. In [8] and in other projects (e.g., the NOAA Unified Access Framework project [25]) the approach was to build a single catalog that points to other catalogs, basically creating a large tree of datasets organized in a particular way (e.g., by IOOS region or NOAA Line Office). Thus a user had to navigate this tree to search for datasets that might be of interest. Instead, most users would rather search on space, time, and variable to dynamically find datasets that are of interest. Thanks to advances in metadata standardization and cataloging services, this is now relatively easy to enable.
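The kind of query that standardized metadata makes possible can be sketched in Python as a filter over dataset records on space, time and variable. A real catalog service performs this on harvested metadata; here the records, the field names and the `search` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DatasetRecord:
    title: str
    bbox: tuple            # (west, south, east, north) in degrees
    time_start: datetime
    time_end: datetime
    variables: frozenset   # CF standard names present in the dataset


def bbox_intersects(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, bbox, start, end, variable):
    """Return records overlapping the box and time window with the variable."""
    return [
        r for r in records
        if bbox_intersects(r.bbox, bbox)
        and r.time_start <= end and r.time_end >= start
        and variable in r.variables
    ]


# Hypothetical records standing in for harvested metadata.
records = [
    DatasetRecord("Hypothetical coastal run", (-78.0, 33.0, -74.0, 37.0),
                  datetime(2013, 9, 1), datetime(2013, 11, 30),
                  frozenset({"sea_water_temperature", "sea_water_salinity"})),
    DatasetRecord("Hypothetical gulf run", (-98.0, 18.0, -80.0, 31.0),
                  datetime(2008, 9, 1), datetime(2008, 9, 20),
                  frozenset({"sea_surface_height_above_geoid"})),
]

hits = search(records, (-76.0, 34.0, -72.0, 40.0),
              datetime(2013, 10, 1), datetime(2013, 10, 15),
              "sea_water_temperature")
print([r.title for r in hits])
```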

With IOOS funding, NOAA NGDC developed a Java plug-in for THREDDS called NcISO that provides an ISO metadata service, converting the attributes and other metadata into ISO 19115-2 XML. Written in Java, it can also be used as a stand-alone application that scans a remote THREDDS catalog and generates ISO metadata for each dataset. This metadata, in turn, can be harvested by catalog services such as Geonetwork, GI-CAT, Geoportal Server, CKAN and PYCSW. The COMT datasets were harvested from the testbed server THREDDS catalog by the NGDC Geoportal Server that drives the IOOS Catalog. The COMT datasets are therefore discoverable by users internal and external to the COMT project using a standardized approach (Figure 6).

**Figure 6.** Results from a query for 3D FVCOM or ADCIRC datasets found within a specified bounding box (the extent of the map window). The user has selected one of the returned datasets, which displays the boundary of the dataset on the map (yellow rectangle), a summary (yellow-highlighted text), and dataset links, including "Open" to access the dataset using OPeNDAP, and "Metadata" to provide the full metadata document.


#### 2.2.6. CF Compliant Tools for Python

Matlab is one of the most popular analysis and visualization environments in the oceanographic community, so it made sense to focus initial effort on standards-based Matlab tools. To improve the efficiency for as many users as possible, however, standards-based tools need to be developed for all commonly used environments so that users can continue to use their favorite environment yet benefit from standards-enabled data.

One leading environment with similar capabilities to Matlab is Python. Python has the advantage of being open-source and free, so that tools and scripts developed for Python may be freely shared with scientists and other users without the requirement that they first buy a license. With hundreds of toolboxes giving capabilities like advanced time series, image processing, mapping and publication quality graphics, Python is becoming increasingly popular in the meteorological and oceanic research community.

**Figure 7.** Access and display of CF-compliant WaveWatch III data using the Python Iris package from the British Met Office. This demonstration was done using the IPython Notebook, which allows code, output and rich text to be combined in a web document that can be easily shared with others.

Unlike Matlab, however, Python cannot directly utilize the Unidata NetCDF-Java library to take advantage of standards-based functionality. Although Python can easily take advantage of C and Fortran modules, and Unidata began working several years ago on a C library to support CF conventions ("LibCF"), progress has been slow, and LibCF does not yet have the capability to perform fundamental tasks such as returning the geospatial coordinates from CF-compliant ocean models.
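The task of returning geospatial coordinates relies on CF metadata rather than on variable names. The Python sketch below shows the principle on plain dictionaries of attributes: latitude, longitude and time variables are recognized from their `standard_name` or `units`, following the CF rules for identifying coordinate types. The two example "models" and their variable names are hypothetical, and a real tool would read these attributes from the dataset itself.

```python
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}


def classify(attrs):
    """Return 'latitude', 'longitude', 'time' or None from CF attributes."""
    name = attrs.get("standard_name")
    units = attrs.get("units", "")
    if name == "latitude" or units in LAT_UNITS:
        return "latitude"
    if name == "longitude" or units in LON_UNITS:
        return "longitude"
    if name == "time" or " since " in units:
        return "time"
    return None


def find_coordinates(variables):
    """Map each coordinate type to the first variable that provides it."""
    found = {}
    for var_name, attrs in variables.items():
        kind = classify(attrs)
        if kind is not None:
            found.setdefault(kind, var_name)
    return found


# Two hypothetical models with different variable naming habits.
model_a = {
    "lon_rho": {"units": "degree_east"},
    "lat_rho": {"units": "degree_north"},
    "ocean_time": {"units": "seconds since 2006-01-01 00:00:00"},
    "temp": {"units": "Celsius", "standard_name": "sea_water_temperature"},
}
model_b = {
    "Longitude": {"standard_name": "longitude", "units": "degrees_east"},
    "Latitude": {"standard_name": "latitude", "units": "degrees_north"},
    "MT": {"units": "days since 1900-12-31 00:00:00"},
}

coords_a = find_coordinates(model_a)
coords_b = find_coordinates(model_b)
print(coords_a)
print(coords_b)
```

The same two functions handle both naming schemes, which is the sense in which standards-based tools avoid model-specific code.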

To fill this void, the British Met Office has created Iris, a CF-compliant package for Python [26]. The primary goal is to serve their own users, but because it is open and standards-based, Iris can support a much wider community. With Iris, as in NCTOOLBOX, users can access and work with output from different models without any model-specific code: any CF-compliant structured grid model can be easily opened, accessed and displayed in Iris (Figure 7).

With several full-time developers, government backing, a clear roadmap, and an agile and open development approach, Iris is a strong contender to be the dominant met and ocean package for standards-based data access. Although Iris currently only supports structured grids, support for UGRID-compliant unstructured grid data and CF-1.6-compliant observational data is on the development roadmap.

#### **3. Conclusions**

Significant progress has been made in the international geoscience community to develop standards, services and tools that make data search, access, analysis and visualization easier and more efficient. In the ocean modeling community, techniques originally developed for atmospheric forecast and climate models have been adapted and extended to serve the ocean community. Leveraging Unidata technologies such as NetCDF, NcML and the THREDDS Data Server, coupled with international standards development work on the CF Conventions, UGRID Conventions and the OGC Services, a system has been developed that places relatively little burden on data providers or data users.

There is still work to be done hardening and expanding the system. More providers need to be aware of existing tools that will allow them to easily serve standardized, aggregated data. WMS services for unstructured grids are functional, but need to be optimized for performance. Standards-based tools for Python need to be brought up to the same functionality as the tools for Matlab. Packages for other commonly used scientific analysis and visualization environments such as R still need to be developed.

While additional work needs to be done, the advances described here bring us closer to a future where users discover data by keyword and geospatial queries on distributed holdings, access data via standard data services, and analyze and visualize data with common, standards-based software. The basic infrastructure depends on a common data model for each data type, a system that was first demonstrated on structured gridded data, and has been expanded to work with both unstructured grid data and specific observational data types. Although this approach has been developed for atmospheric, climate and oceanographic use, it could be used for hydrology, geology or other geoscience communities that use these data types. While applied here to IOOS, it is also being applied to support other applications [19,27]. With demonstrated success for IOOS, and with support from the international geoscience community, the future looks promising for this distributed, standards-based approach.
