Advanced Cyberinfrastructure to Enable Search of Big Climate Datasets in THREDDS

Understanding the past, present, and changing behavior of the climate requires close collaboration of a large number of researchers from many scientific domains. At present, the necessary interdisciplinary collaboration is greatly limited by the difficulties in discovering, sharing, and integrating climatic data due to the tremendously increasing data size. This paper discusses the methods and techniques for solving the inter-related problems encountered when transmitting, processing, and serving metadata for heterogeneous Earth System Observation and Modeling (ESOM) data. A cyberinfrastructure-based solution is proposed to enable effective cataloging and two-step search on big climatic datasets by leveraging state-of-the-art web service technologies and crawling the existing data centers. To validate its feasibility, the big dataset served by UCAR THREDDS Data Server (TDS), which provides Petabyte-level ESOM data and updates hundreds of terabytes of data every day, is used as the case study dataset. A complete workflow is designed to analyze the metadata structure in TDS and create an index for data parameters. A simplified registration model which defines constant information, delimits secondary information, and exploits spatial and temporal coherence in metadata is constructed. The model derives a sampling strategy for a high-performance concurrent web crawler bot which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standard-compliant catalog service form an incremental search cyberinfrastructure, allowing scientists to search the big climatic datasets in near real-time. 
The proposed approach has been tested on UCAR TDS, and the results show that it achieves its design goal: crawling speed increases at least tenfold and redundant metadata shrinks from 1.85 gigabytes to 2.2 megabytes, a significant step toward making today's largely non-searchable climate data servers searchable.


Introduction
Cyberinfrastructure plays an important role in today's climate research activities [1][2][3][4][5][6]. Climate scientists search, browse, visualize, and retrieve spatial data using web systems on a daily basis, especially as data volumes from observation and model simulation grow to large amounts that personal devices cannot hold entirely [7,8]. The big data challenges of volume, velocity, variety, veracity, and value (5Vs), have pushed geoscientific research into a more collaborative endeavor that involves many observational data providers, cyberinfrastructure developers, modelers, and information stakeholders [9]. Climate science has developed for decades and produced tens of petabytes of data products, including stationary observations, hindcast, and reanalysis, which are stored in distributed data centers in different countries around the globe [10]. Individuals or small groups of scientists face big challenges when they attempt to efficiently discover the data they require. Currently, most scientists acquire their knowledge about datasets via conferences, colleague recommendations, textbooks, and search engines. They become very familiar with the datasets they use, and every time they want to retrieve the data, they go directly to the dataset website to download the data falling within the requested time and spatial windows. However, these routines are less sustainable as the sensors/datasets become more varied, models evolve more frequently, and new data pertaining to their research is available somewhere else [9].
In most scenarios, metadata is the first information that researchers see, before they access and use the actual Earth observation and modeling data the metadata describes [11,12]. Based on the metadata, they decide whether or not the actual data will be useful in their research. For big spatial data, metadata is the key component backing up all kinds of users' daily operations, such as searching, filtering, browsing, downloading, and displaying. Currently, two of the fundamental problems in accessing and using big spatial data are the volume of metadata and the velocity of processing metadata [13,14]. Manual investigation of the Unidata THREDDS data repository (a metadata source we take as an example of typical geodata storage patterns) [15,16] reveals that most of the metadata are highly redundant. The vast majority of metadata records contain identical information, and only key fields representing spatial and temporal characteristics are regularly updated. There is a regular pattern to how the redundant information is structured and how new information is added to the repository, but the pattern varies according to the data organization hierarchy and changes with the type of data being delivered (for example, radar station vs. satellite observation vs. regular forecast model output).
To overcome these big data search challenges, we must confront practical problems in the information model, information quality, and technical implementation of information systems. Our study follows the connection between fundamental scientific challenges and existing implementations of geoscience information systems. This study aims to build a cataloging model capable of fully describing real-time heterogeneous metadata whilst simultaneously reducing data volume and enabling search within big Earth data repositories. This model can be used to efficiently represent redundant data in the original metadata repository and to perform lossless compression of information for lightweight, efficient storing and searching. The model can shrink the huge amount of metadata (without sacrificing the information complexity or variety available in the original repositories) and reduce the computational burden of searching among them. The model defines two types of objects: collections and granules. It also defines their lifecycle and relationship to the upstream THREDDS repository data. A collection contains content metadata (title, description, authorship, variable/band information, etc.). Each collection contains one or more granules. Each granule contains only the spatiotemporal extent metadata. We prototyped the model as an online catalog system within EarthCube CyberConnector [17][18][19][20]. We have made the final system available online at: http://cube.csiss.gmu.edu/CyberConnector/web/covali. The system provides a near real-time replica of the source catalog (e.g., THREDDS), optimizes the metadata storage, and enables a search capability that was not previously available. The system is like a clearinghouse with its own metadata database. Currently, the system is mainly used for searching operational time-series observations/simulations collected/derived from field sensors.
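The collection/granule split described above can be illustrated with a minimal Python sketch. It is not the implementation used in CyberConnector: the class names, the record keys (`bbox`, `start`, `end`), and the grouping rule (records with identical content metadata form one collection) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical layout: (west, south, east, north) in degrees.
BBox = Tuple[float, float, float, float]

# Assumed names of the spatiotemporal keys in a flat metadata record.
EXTENT_KEYS = ("bbox", "start", "end")


@dataclass
class Granule:
    """Holds only the spatiotemporal extent of one data file."""
    bbox: BBox
    start: datetime
    end: datetime


@dataclass
class Collection:
    """Holds the content metadata shared by all of its granules."""
    content: Dict[str, str]
    granules: List[Granule] = field(default_factory=list)


def factor(records: List[dict]) -> List[Collection]:
    """Group flat records that share identical content metadata.

    The shared fields are stored once per collection; each record then
    contributes only a small granule with its own extent.
    """
    collections: Dict[tuple, Collection] = {}
    for rec in records:
        content = {k: v for k, v in rec.items() if k not in EXTENT_KEYS}
        key = tuple(sorted(content.items()))
        coll = collections.setdefault(key, Collection(content=content))
        coll.granules.append(Granule(rec["bbox"], rec["start"], rec["end"]))
    return list(collections.values())
```

Under these assumptions, a thousand records that differ only in their time window collapse into one collection with a thousand small granules, which is the kind of reduction the model aims for.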
Other datasets, like remote sensing datasets and airborne datasets, are expected to be supported in the near future. The novelty of this research is that it turns the legacy data center repositories into lightweight, flexible catalog services, which are more manageable because they provide search capabilities for petabytes of datasets. The work provides a useful reference for operators of big climate data centers and suggests further improvements to those operational data centers to better serve the climate science community. This paper is organized as follows. Section 2 describes the background knowledge and history. Section 3 introduces related work. Section 4 introduces the proposed model. Section 5 shows the implementation of the model and the required cyberinfrastructure. Section 6 demonstrates the experiment results. Section 7 discusses the results of our approach. Section 8 concludes the paper. The study described in this paper is an attempt to contribute to the global scientific endeavor on understanding and predicting the impacts of climate change. Understanding climate change and its impacts requires understanding Earth as a complex system of systems with behaviors that emerge from the interaction and feedback loops that occur on a range of temporal and spatial scales. However, new advances in these studies are obstructed by the challenges of interdisciplinary collaboration and the difficulty of data and information collaboration [21][22][23][24][25][26][27]. The difficulties of information collaboration can be understood in terms of long-standing big data problems of variety (complexity) and volume/velocity.

Background
Metadata is a powerful tool for dealing with big data challenges. We discuss the background work on metadata and interoperability of metadata catalogs as critical components of advanced cyberinfrastructure that we envision.

Metadata
The topic of metadata has been approached by two distinct scholarly traditions. Understanding them helps us clarify our approach to metadata in cyberinfrastructure. Library information scientists have described the metadata bibliographic control approach. Bibliographic principles allow information users to describe, locate, and retrieve information-bearing entities. The basic metadata unit is the "information surrogate" that derives its usefulness from being locatable (by author, title, and subject), accurately describing the information object (the data of the metadata) and identifying how to locate the object. The second (complementary) view of metadata originates in the computer science discipline and is called the data management approach. Complex and heterogeneous data (textual, graphical, relational, etc.) is not separated into information units, but is instead described by data models and architectures that represent "additional information that is necessary for data to be useful" [27].
The key difference is that the bibliographic approach works with distinct information entities of limited types, while the data management approach works with models of data/information structures and their relationships.
This distinction between bibliographic and data management approaches is important in the context of ongoing efforts of metadata standardization [28][29][30][31]. The second approach is not conducive to standardization because the data management models are as complex and heterogeneous as the structures of the data being modeled. Consequently, in accordance with existing standards, the currently available metadata for large climate datasets follows the first approach, which provides bibliographic information and does not describe the data structures in a way that may permit new capacities of advanced cyberinfrastructure. Our paper describes the work to supplement and transform the existing bibliographic metadata with a custom metadata management model, resulting in new applications for the existing data. Metadata standardization is a prerequisite for interoperability, which is in turn a prerequisite for building distributed information systems capable of handling complex Earth system data [32].

Interoperability, Data Catalogs, Geoinformation Systems, and THREDDS
Data and information collaboration across disciplines is critical for advanced Earth science. Unfortunately, there is no strongly unified practice for data recording, storage, transmission, and processing that the entire scientific community follows [33][34][35][36][37]. Disparate fields and traditions have their own preferred data formats, software tools, and procedures for data management. However, Earth system studies generally work with data that follow a geospatial-temporal format [38][39][40][41][42]. All of the data can be meaningfully stored on a four-dimensional grid (three spatial dimensions and one temporal dimension). This basic commonality has inspired standardization efforts with the goal of enabling wider interoperability and collaboration.
Following organic outgrowth from the community, the standardization efforts are now headed by the Open Geospatial Consortium (OGC) and the International Organization for Standardization Technical Committee 211 (ISO TC 211) and have yielded successful standards in two areas relevant to us [43][44][45][46][47]. The first is the definition of NetCDF as one of the standard data formats for storing geospatial data. The second is metadata standardization. Those efforts are extremely relevant to our research and are further discussed in the Related Work section. For background, it is important to mention that the standard geospatial metadata models developed by OGC are still evolving capabilities for describing the heterogeneous, high-volume, or high-velocity big data we are studying. The commonly used OGC/ISO 19* series metadata standards have relatively limited relational features (aggregation only) and, in the repository we studied, each XML-encoded metadata record contains mostly redundant information (for example, two metadata objects that represent two images from a single sensor mostly contain duplicated information that describes sensor characteristics). However, ISO TC 211 is pursuing multiple lines of work that address these issues and suggest a trend toward expanding the applicability of standardized metadata models and the integration of a greater variety of information.
The standard geographical metadata model was developed in conjunction with a standard distributed catalog registration model titled Catalog Services for the Web (CSW) [48,49]. The CSW standard is widely known and many Earth system data providers offer some information about their data holdings via the CSW interface. However, the CSW standard is also poorly suited to support big data collaborative studies for the Earth system. CSW follows the basic OGC metadata model in a way that makes it challenging to capture the valuable structure and semantics of existing data holdings without storing extremely redundant information, which exhausts computing resources without taking advantage of the true value of large and complex Earth big data. Nevertheless, OGC metadata stored in CSW is the existing standard that governs not only data distribution practices, but also how researchers think about data collaboration.
The next item this study works with is the UCAR Unidata THREDDS Data Server (TDS). The University Corporation for Atmospheric Research (UCAR) Unidata is a geoscience data collaboration community of diverse research and educational institutions. It provides the real-time heterogeneous Earth system data that this study targets. THREDDS is Unidata's Thematic Real-Time Environmental Distributed Data Services. TDS is a web server that provides metadata and data access for scientific datasets to climate researchers. TDS provides its own rudimentary hierarchical catalog service that is not searchable and does not support the CSW standard. It does support the OGC geospatial metadata standard, although not consistently or comprehensively. In order to make data hosted by TDS searchable, the TDS metadata must be copied to another server and a searchable catalog must be created for the metadata. This task is performed by a customized web crawler developed in this study.
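To make the hierarchical catalog concrete, the following Python sketch parses one THREDDS catalog document and separates links to sub-catalogs from links to data files. It is an illustration only, not the crawler built in this study: the sample document and its paths are invented, and only the `catalogRef` and `dataset` elements of the THREDDS catalog XML are considered.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by THREDDS catalog documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Invented sample catalog; the names and paths are placeholders.
SAMPLE_CATALOG = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink" name="Example">
  <dataset name="Forecast Model Data">
    <catalogRef xlink:href="grib/example/catalog.xml"
                xlink:title="Example model" name=""/>
    <dataset name="example_20200101_0000.grib2"
             urlPath="grib/example_20200101_0000.grib2"/>
  </dataset>
</catalog>"""


def parse_catalog(xml_text):
    """Return (sub_catalog_links, dataset_paths) found in one catalog."""
    root = ET.fromstring(xml_text)
    # catalogRef elements point one level down the catalog hierarchy.
    sub_catalogs = [
        ref.get(f"{{{XLINK_NS}}}href")
        for ref in root.iter(f"{{{THREDDS_NS}}}catalogRef")
    ]
    # dataset elements with a urlPath attribute describe accessible files;
    # dataset elements without one act only as containers.
    datasets = [
        ds.get("urlPath")
        for ds in root.iter(f"{{{THREDDS_NS}}}dataset")
        if ds.get("urlPath") is not None
    ]
    return sub_catalogs, datasets
```

A crawler would fetch each returned sub-catalog link in turn and repeat the parse, which is how a browse-only hierarchy can be flattened into records that a searchable catalog can index.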
This study attempts to build upon the existing infrastructure, with its available resources and limitations, to provide new capabilities. The limitations of the existing systems are two-fold. The first is the limits of the CSW metadata registration model (it does not naturally support registering information about the metadata lifecycle or sufficiently detailed aggregation information), and the second is the incompleteness of information within the metadata provided by THREDDS. This study attempts to overcome these limits by first interpolating information to improve the quality of the existing metadata model and then by extending the model to provide advanced capabilities. It demonstrates how to integrate TDS metadata with CSW software and proposes several practical solutions that work around the limitations of the CSW metadata registration model. We do this to show that improvements in metadata and catalog capabilities can also reduce the challenges of big data in variety, volume, and velocity.

Related Work
This paper brings together several existing lines of work to confront the problems of integrating and searching vast and diverse climate science datasets. Existing research in areas of metadata modeling, geospatial information interoperability, geospatial cataloging, web information crawling, and search indexing provides the building blocks for our work to demonstrate and evaluate advanced climate data cyberinfrastructure capabilities.

Metadata Models
There are many studies exploring the fundamental relationship between metadata models and information capabilities. Diverse work in other areas deals with the same basic issues and demonstrates that the creation of novel metadata models can be used as a method for solving information challenges. For example, Spéry et al. [50] have developed a metadata model for describing the lineage of changes of geographical objects over time. They used a directed acyclic graph and a set of elementary operations to construct their model. The model supports a new application of querying historical cadastral data and minimizes the size of geographical metadata information. Spatiotemporal metadata modeling can be generalized as a description of objects in space and time, and relationships between objects conceived as flows of information, energy, and material to model the interdependent evolution of objects in a system [51]. Provenance ("derivation history of data product starting from its original sources" [52]) modeling is an important part of metadata study. Existing metadata models and information systems have been experimentally extended with provenance modeling capabilities to enable visualization of data history and analysis of workflows that derive data products used by scientists [53,54]. An experiment to re-conceptualize metadata as a practice of "knowledge management" yielded a metadata model that can support the needs of spatial decision-making by identifying issues of entity relationships, integrity, and presentation [55]. The proposed metadata model allows communicating more complex information about spatial data. This metadata model makes it possible to build an original geographic information application, named Florida Marine Resource Identification System, that extends the use of the existing environment and civil data to empower users with higher-level knowledge for analysis and planning.
Looking outside the geospatial domains, we still observe that the introduction of specialized metadata approaches and models permits the development of new capabilities.

Geospatial Metadata Standardization, Interoperability, and Cataloging
The diversity of metadata models and formats developed by research has enabled new powerful geoinformation systems, but has also introduced a new set of problems of data reuse and interoperability. Public and private research, administrative, and business organizations have accumulated growing stores of geoinformation and data, but this data has not become easier to discover and access for users outside limited organization jurisdictions. This has led to significant resource wastage and duplication of effort for data producers and consumers. Cataloging has grown increasingly challenging because of this heterogeneity. In response, new spatial data infrastructures have been developed. They have attempted to integrate and standardize multiple metadata models and develop shared semantic vocabulary models to enable discovery by employing the "digital library" models of metadata. In this process, syntactic and semantic interoperability challenges have been identified. Syntactic interoperability refers to information portability, that is, the ability of systems to exchange information. Semantic interoperability refers to domain knowledge that permits information services to understand how to meaningfully use the data from other systems [56].
Various techniques for achieving metadata interoperability have been explored [57]. Two related families of techniques can be identified. One approach attempts to create standard and universal models, the other creates mappings between several metadata representations of the same data. Transformation between several metadata models requires that syntactic, structural, and semantic heterogeneities can be reconciled. The reconciliation is accomplished with techniques called metadata crosswalks. A crosswalk is "a mapping of the elements, semantics, and syntax from one metadata schema to another". Once mappings are developed, they can be used to apply multiple metadata schemas to existing data [58].
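The element-mapping part of a crosswalk can be sketched in a few lines of Python. The source field names below are hypothetical, the targets are Dublin Core element names chosen only as a familiar example, and the sketch covers renaming alone; reconciling structural and semantic differences, which the text identifies as the harder part, is out of scope here.

```python
# Hypothetical crosswalk: source field name -> target element name.
CROSSWALK = {
    "title": "dc:title",
    "authorship": "dc:creator",
    "description": "dc:description",
}


def apply_crosswalk(record, crosswalk):
    """Translate one metadata record into the target schema.

    Returns (mapped, unmapped). Fields without a mapping are returned
    separately instead of being dropped silently, so that gaps in the
    crosswalk remain visible.
    """
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            mapped[crosswalk[key]] = value
        else:
            unmapped[key] = value
    return mapped, unmapped
```

Returning the unmapped fields is a deliberate choice in this sketch: a field that has no counterpart in the target schema is exactly where information loss between schemas would otherwise go unnoticed.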
The possibilities for interoperability have been advanced by the efforts led by the International Organization for Standardization Technical Committee 211 (ISO TC 211) to standardize metadata representation. It introduced the ISO 19* series of geospatial metadata standards for describing geographic information by means of metadata [54,59,60,61]. The standards define mandatory and optional metadata elements and associations among elements. For example, spatiotemporal extent, authorship, and general description of datasets are required and recommended by the standard. Other kinds of information like sequencing of datasets in a collection, aggregation, and other relational data are optional in the standard. The ISO 19* series of standards also provides an XML schema for the representation of the metadata in XML [62].
Looking at existing metadata interoperability work, we see a recurrence of similar problems, such as the diversity of metadata representations and the complexity of mapping between them. Several authors discuss the practical challenges of developing software and systems for translation [27,59,63]. Many studies advance the goals of interoperability by identifying its key challenges and demonstrating systems, services, and models that address them. Our work attempts to preserve existing interoperability advances while exploring the possibilities of expanding existing metadata models to support new uses of existing data.
Standardized metadata is often stored and made available using catalog services. Catalogs allow users to find metadata using queries that describe the desired spatial, temporal, textual, and other information characteristics of the searched data [64]. The OGC Catalog Service for the Web (CSW) is one of the widely used catalog models in the geoscience domain to describe geographic information holdings [6,65].
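The kind of query a catalog answers can be sketched as a filter over metadata records. This is a plain in-memory illustration in Python, not the CSW interface: the record keys (`title`, `bbox`, `start`, `end`) are assumed, and the bounding-box test ignores boxes that cross the antimeridian.

```python
from datetime import datetime


def bbox_intersects(a, b):
    """True if two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def time_overlaps(a_start, a_end, b_start, b_end):
    """True if two closed time intervals share at least one instant."""
    return a_start <= b_end and b_start <= a_end


def search(records, text=None, bbox=None, start=None, end=None):
    """Return the records matching every constraint that was supplied."""
    hits = []
    for rec in records:
        if text and text.lower() not in rec["title"].lower():
            continue
        if bbox and not bbox_intersects(rec["bbox"], bbox):
            continue
        if start and end and not time_overlaps(
            rec["start"], rec["end"], start, end
        ):
            continue
        hits.append(rec)
    return hits
```

A production catalog would answer the same question from spatial and temporal indexes rather than a linear scan; the scan is used here only to keep the matching rules readable.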

Web Harvesting and Crawling
One critical capacity of metadata cyberinfrastructure is the ability to integrate metadata from remote web repositories. The process of finding and importing web-linked data into a metadata repository is called "crawling" and is accomplished using a software system called a "metadata web crawler". A web crawler is a computer program that browses the web in a "methodical, automatic manner or in an orderly fashion" [66]. A crawler is an internet bot: a program that autonomously and systematically retrieves data from the World Wide Web. It automatically discovers and collects different resources from the internet according to a set of built-in rules. Patil and Patil [66] summarize this general architecture of web crawlers and also define several types of web crawlers. A focused crawler is designed to eliminate unnecessary downloading of web data by incorporating an algorithm for selecting which links to follow. An incremental crawler first checks for changes and updates to pages before downloading their full data. It necessarily involves an index table of page update dates and times. We follow these two strategies in the design of our crawler. The authors also outline common strategies for developing distributed and parallelized crawlers. Our crawler runs on a single machine, but we use a multithreaded process model with a shared queue mechanism, a common parallelization strategy identified by the authors [67].
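The three ideas named above (focused link selection, incremental update checks against an index table, and worker threads sharing one queue) can be combined in a short Python sketch. It is not the crawler developed in this study. The `SITE` dictionary is an invented stand-in for HTTP requests, the relevance rule is an assumption, and the choice to cache each page's links in the index so that unchanged pages need no download is one possible design among several.

```python
import queue
import threading

# Invented in-memory "site": url -> (last_modified, links on that page).
SITE = {
    "/catalog.xml": ("2020-01-02",
                     ["/gfs/catalog.xml", "/radar/catalog.xml", "/docs/help.html"]),
    "/gfs/catalog.xml": ("2020-01-02", []),
    "/radar/catalog.xml": ("2020-01-01", []),
    "/docs/help.html": ("2020-01-02", []),
}


def is_relevant(url):
    """Focused-crawl rule (assumed): follow only catalog documents."""
    return url.endswith("catalog.xml")


def crawl(site, seed, index, workers=4):
    """Crawl from seed and return the sorted URLs that were downloaded.

    index maps url -> (last_modified, links) and persists between runs;
    it plays the role of the incremental crawler's index table.
    """
    todo = queue.Queue()
    todo.put(seed)
    seen = {seed}
    downloaded = []
    lock = threading.Lock()

    def worker():
        while True:
            url = todo.get()
            try:
                if url is None:  # sentinel: no more work
                    return
                modified = site[url][0]  # cheap check, like an HTTP HEAD
                with lock:
                    cached = index.get(url)
                if cached is not None and cached[0] == modified:
                    links = cached[1]  # unchanged: reuse cached links
                else:
                    links = site[url][1]  # changed: full download
                    with lock:
                        index[url] = (modified, links)
                        downloaded.append(url)
                for link in links:
                    if not is_relevant(link):
                        continue
                    with lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    todo.put(link)  # enqueue before this task is marked done
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    todo.join()  # returns once every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return sorted(downloaded)
```

Running `crawl(SITE, "/catalog.xml", index)` twice with the same `index` dictionary downloads the three catalog pages on the first pass and nothing on the second, because every timestamp matches the index table.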
A fairly recent review by Desai et al. [68] shows that web crawler research is an active area of work; however, most of this work is focused on the needs of general web search engine index construction. There is an area of research called "vertical crawling", which contends with the problems of crawling non-traditional web data: news items, online shopping lists, images, audio, and video. There do not appear to be any publications regarding efficient crawling of heterogeneous Earth system metadata.
There is substantial previous work showing the feasibility of crawling this metadata. One recent paper summarizes the state of the art. Li et al. [69] present a heterogeneous Earth system metadata crawling and search system named PolarHub, a web crawling tool capable of conducting large-scale search and crawling of distributed geospatial data. It uses existing textual web search engines (Google) to discover OGC standards-compliant geospatial data services. It presents an interactive interface that allows users to find a large variety and diversity of catalogs and related data services. It has a sophisticated distributed multi-threaded software system architecture. PolarHub shows that it is possible to present data from many sources in a single place. However, it does not present datasets, only endpoints that users must further explore on their own. It does not download, summarize, or harmonize the metadata stored on the remote catalogs. It shows the feasibility of cyberinfrastructure that integrates a variety of data based on interoperable standards but does not discuss the data volume and velocity challenges that arise when deeper and fuller crawling is done. PolarHub users can find a large number of catalogs and services that contain, for instance, "surface water temperature" data, but they cannot use a metadata crawler following this catalog hub strategy to discover datasets that hold "surface water temperature inside X spatial and temporal extent with Y spatial and temporal resolution".
A complementary strategy is discussed by Pallickara et al. [70], who present a metadata crawling system named GLEAN, which provides a new web catalog for atmospheric data based on the extraction of fine-grained metadata from existing large-scale atmospheric data collections. It solves the data volume problem by introducing a new metadata scheme based on custom synthetic datasets that represent collections (or subsets or intersections) of multiple existing datasets. This reduces metadata overhead greatly and permits high-performance, precise discovery of and access to specific datasets inside vast atmospheric data holdings. Unlike PolarHub, GLEAN avoids the data variety challenge by limiting its processing to one type of data format used in atmospheric science. The authors also do not contend with the interrelated velocity and near real-time access problems: in GLEAN crawling, the discovery of updated datasets is initiated by manual user request. They do not use the OGC catalog or metadata standards to support interoperability.
The BCube project (part of the EarthCube initiative) attacks similar problems with another approach [71]. EarthCube is a National Science Foundation initiative to create open community-based cyberinfrastructure for all researchers and educators across the geosciences.
EarthCube cyberinfrastructure must integrate heterogeneous data resources to allow forecasting the behavior of the complex Earth system. EarthCube is composed of many building blocks. Our work is part of the EarthCube Cyberway building block. BCube (The Brokering Building Block) offers a different approach for heterogeneous geodata interoperability. BCube adopts a brokering framework to enhance cross-disciplinary data discovery and access. A broker is a third party online data service that contains a suite of components, called accessors. Each accessor is designed to interface with a different type of geodata repository. A broker allows users to access multiple repositories with a single interface without requiring data providers to implement interoperability measures. BCube supports metadata brokering. It can search, access, and translate heterogeneous metadata from multiple sources. It demonstrates deeper interoperability than other approaches discussed here, but does not attempt to solve data volume or velocity problems [72]. The BCube approach is very relevant to us; however, BCube has very few documents available and the system is inaccessible. We were unable to compare some of the details of our different approaches.
Song and Di [73] studied the same problem with the same example repository: Unidata TDS. The authors determined the volume and velocity characteristics of the target repository metadata. Like our study, they propose modeling it with the concepts of collection and granule. They implemented a crawler that is able to crawl some of the TDS archive. Their work is an earlier stage of the same project as ours and is highly relevant to this study. However, their approach did not perform well on real-world TDS data, which led us to take it in a different direction. We rebuilt their work to demonstrate real-time search and the possibility of processing all of TDS by using a more sophisticated metadata model and a more advanced integrated search client and indexing service that permits true real-time search.
Reviewing existing work reveals tremendous advances toward solving the challenges of creating interoperable Earth system cyberinfrastructures that can practically process a large volume and variety of observation and model data that are generated in high-velocity data production processes. Lines of work in metadata modeling, standardization, interoperability, repository crawling, and processing provide the basis for the materials for our study. Our contribution is to synthesize these approaches to explore how interoperability and performance could be achieved simultaneously.

Materials and Methods
To enable searching of big climate data, we propose a new big data cataloging solution, which includes the following steps. (1) Analyze the target geodata repository that provides a good example of data challenges for cross-disciplinary Earth system scientific collaboration. (2) Analyze the qualities and characteristics of the data in the selected repository. (3) Construct a model of the repository. (4) Use the repository model to construct an efficient metadata resource model. (5) Develop a crawler system that uses repository and metadata resource models to optimize its crawling algorithm and metadata representation. (6) Demonstrate advanced interoperable big geodata search and access capabilities that our approach permits. The completed cyberinfrastructure model and system architecture (derived from our metadata model) is shown in Figure 1.

Metadata Repository Selection
We took the Unidata THREDDS Data Server (TDS) as our example target geodata repository platform. TDS was chosen because it is widely used in atmospheric science and other related Earth science fields. It supports a good variety of open metadata and data standards, and many data centers use TDS. It supports basic catalog features but lacks advanced search capabilities. It gives users and administrators wide latitude in how the data is organized and updated inside the TDS catalog. The geodata stored across many TDS installations meets our broad criteria for real-world data variety, volume, and velocity.
A single TDS instance was selected as the target for our experiment. The UCAR Unidata TDS (thredds.ucar.edu) repository was determined to be a suitable target system and a good example of diverse uses of TDS. Unidata TDS contains a requisite variety of data. It has near real-time data that demonstrate the data velocity challenge. It contains a variety of data granularity and a good range in the size and complexity of the datasets available. The volume of data and the volume of metadata are sufficiently challenging. The catalog structure is heterogeneous: different types of data are organized on different principles. On initial inspection, Unidata TDS was determined to be a great example of the challenges we wanted to explore.
Using manual inspection and basic statistical analysis via custom Python scripts, we started mapping out the characteristics of the Unidata TDS information system. We tried to answer the following questions: (a) What is the hierarchical structure of data organization in this repository?; (b) how frequently are new records are added and removed?; (c) which parts of the catalog exhibit regular patterns in the information structure that can be generalized and which parts contain unique information?; (d) what are the size and content of the metadata resources stored in the catalog?; (e)

Metadata Repository Selection
We took Unidata THREDDS Data Server (TDS) as our example target geodata repository platform. TDS was chosen because it is widely used in atmospheric science and related Earth science fields. It supports a good variety of open metadata and data standards, and many data centers use it. It supports basic catalog features but lacks advanced search capabilities. It gives users and administrators wide latitude in how the data is organized and updated inside the TDS catalog. The geodata stored across many TDS installations meets our broad criteria for real-world data variety, volume, and velocity.
A single TDS instance was selected as the target for our experiment. The UCAR Unidata TDS (thredds.ucar.edu) repository was determined to be a suitable target system and a good example of diverse uses of TDS. Unidata TDS contains a requisite variety of data. It has near real-time data that demonstrate the data velocity challenge. It contains a variety of data granularity and a good range in the size and complexity of datasets available. The volume of data and the volume of metadata are sufficiently challenging. The catalog structure is heterogeneous: different types of data are organized on different principles. On initial inspection, Unidata TDS was determined to be a great example of the challenges we wanted to explore.
Using manual inspection and basic statistical analysis via custom Python scripts, we started mapping out the characteristics of the Unidata TDS information system. We tried to answer the following questions: (a) what is the hierarchical structure of data organization in this repository?; (b) how frequently are new records added and removed?; (c) which parts of the catalog exhibit regular patterns in the information structure that can be generalized, and which parts contain unique information?; (d) what are the size and content of the metadata resources stored in the catalog?; (e) how is the information in a metadata resource related to that resource's location within the hierarchy of the catalog structure?; and (f) what are the data transmission qualities of the Unidata TDS network system, that is, what portion of the TDS information can be transferred and copied to our system?

Repository Analysis
The following figures show some of the surface structure of the Unidata TDS catalog retrieved using a web browser from http://thredds.ucar.edu/thredds/catalog.html. Figure 2 shows the top level of the catalog hierarchy. Each listed item is a folder (a catalog). Most catalogs contain several levels of nested catalogs (Figure 3) in a tree-like hierarchy similar to a file system. At the bottom (leaf) tree level (Figure 4), the catalogs contain a list of data resources. Catalogs are presented in two formats. The first is the HTML format, suitable for manual web browsing. The second is the XML format, which contains additional metadata about the catalogs and the data resources. The XML representation follows the THREDDS Client Catalog Specification. The specification extends the basic filesystem-like structure with temporal, spatial, and data variable description metadata annotations [74].

The TDS catalog provides a powerful general catalog hierarchy model. However, the practical use of this model by scientists who produce geodata is what determines the possibility of data collaboration and harmonization, as well as the specific shapes and possible solutions for big data problems. Email correspondence with Unidata explained that the data placed in different sub-catalogs is produced and organized by different teams of scientists. Although Unidata TDS acts as a unified repository for diverse Earth data, there are no mandatory overarching organizing principles to enable data harmonization [75].
That being the case, the next step was to understand and describe different sub-structures organically adopted by different teams. After manual inspection and basic statistical analysis performed with custom Python scripts, the following information was compiled to broadly describe the different patterns of sub-catalog utilization (Table 1). Four general types of data are simultaneously held in the Unidata TDS repository: (1) forecast model output, (2) observations (time series from in-situ instruments), (3) satellite imagery, and (4) radar imagery from a stationary radar network (NEXRAD, Next Generation Weather Radar) [76]. Each type contains much additional variety in its own hierarchy of sub-catalogs, but at this level, there are some clear and useful broad differences in data qualities that can guide our experiment.
In Table 1, the estimated catalog size is the total size of the metadata held in the catalog. Most of this metadata is completely redundant, but without knowing the deeper structure of this data, we would have to mirror all of it in order to enable the search and discovery capabilities that THREDDS does not support. We calculated a maximum data transfer throughput of 4 MB/s, or about 5 min to load 1 GB of catalog data. At that rate, it appears possible to mirror the entire Unidata TDS metadata catalog in several hours, but the throughputs we observed were not consistent, often slowing down by an order of magnitude. Furthermore, processing the data (indexing and registering it with a standard-compliant OGC CSW catalog) also consumes considerable time, compute, and storage resources. We do not have the capabilities to register and search millions of records that mostly contain redundant information. In addition, Unidata TDS data is added in near real-time according to specific patterns and structure in the sub-catalogs. If we attempted to copy and register all of that metadata, we would not be able to provide near real-time capabilities.
The last two columns in Table 1 show two critical qualities that determine what approach we needed to take to integrate that metadata into our systems.
If final datasets have "coarse" granularity, each dataset is a very large file and the size of the metadata is small in relation to the data size; for "coarse" datasets, we can copy, harvest, and index the metadata into our search system. "Fine" datasets stretch our technical capabilities to transfer and process metadata. "Very fine" records are too numerous (the data files too small) for us to be able to effectively synchronize or process their metadata.
If datasets are produced in a regular way (predictable spatiotemporal attributes), then we can harvest minimal information and model the entire catalog. However, for NEXRAD radar metadata, there is no regular pattern to metadata production. A new record could be added every 5 min or every 15 min, and the regularity also varies over time and between radar sites (different sub-catalogs). This fine-grained irregular data is the most challenging, because it can neither be harvested wholesale nor modeled accurately. It requires a targeted combination approach. Additional considerations arise when tracking which datasets have expired and been removed; ideally, this should be accomplished without performing an expensive full scan of the TDS repository.
Further examination of the sub-catalog structure for irregular (and regular) highly granular data revealed additional useful structural information. Some catalogs are "dynamic" (or "live" or "streaming"): they are updated with new data resources at regular (or irregular) intervals. Other catalogs are archival: they can be assumed never to change (until they expire and are deleted entirely). Three distinct types of sub-catalogs can be identified:

• Pure archival directories: These folders only contain old collections and granules and will never be updated or deleted.
• Mixed archival directories: Some of the sub-folders contain archive material, some contain live, streaming near real-time data granules and collections.
• Daily archival directories: Folders that contain streaming data for a given day; when the day passes, this directory becomes an archive folder and does not need to be mirrored again. When daily archives expire, all the data resources for that day are deleted together.
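As an illustration of how the three directory types could drive a crawler's decisions, the following minimal Python sketch labels a sub-catalog path from the dates found in it. It assumes that daily directories carry a YYYYMMDD path segment (as in the NEXRAD paths); the function names and the labels "live", "archival", "mixed", and "unknown" are hypothetical and are not taken from the thredds-crawler source.

```python
import re
from datetime import date, datetime

# Assumption: a daily directory is a path segment made of exactly 8 digits (YYYYMMDD).
DAY_SEGMENT = re.compile(r"\d{8}")


def day_of(path):
    """Return the date encoded in the last YYYYMMDD segment of a path, or None."""
    for segment in reversed(path.strip("/").split("/")):
        if DAY_SEGMENT.fullmatch(segment):
            try:
                return datetime.strptime(segment, "%Y%m%d").date()
            except ValueError:
                continue  # 8 digits that are not a valid calendar date
    return None


def classify(path, child_paths, today):
    """Label a sub-catalog so a crawler can decide whether to revisit it."""
    day = day_of(path)
    if day is not None:
        # A daily directory streams data only during its own day.
        return "live" if day >= today else "archival"
    child_days = [d for d in (day_of(c) for c in child_paths) if d is not None]
    if not child_days:
        return "unknown"  # no date evidence: the crawler has to look inside
    if any(d >= today for d in child_days):
        return "mixed"  # archived children plus at least one streaming child
    return "archival"
```

Under these assumptions, only paths labelled "live", "mixed", or "unknown" need to be revisited during an update pass; "archival" paths are copied once.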

Crawling
Big data catalogs normally need to complete many crawling tasks to collect metadata files, and must rescan regularly to capture the metadata of newly observed datasets. Crawling is the fundamental source of metadata, and crawling intelligently is one of the largest challenges in big data searching because of the repeated computational burden and the complexity of the content. When designing our crawling strategy, we considered the observation update frequency, time window, and observatory network organization, and made the crawler touch only the folders of updated sensors at the collection (sensor) level. Although a sensor may have millions of metadata records, we only crawl the metadata at the sensor level. In other words, only one metadata record is crawled for each sensor (or instrument). Using this strategy, we save many hours of crawling and of transferring metadata over the network, especially when the network is unstable. After applying a parallel worker mechanism, dozens of crawlers can work on scanning and capturing new or updated metadata for petabytes of climate datasets.
Our crawler is different from most existing crawlers in the literature, because it is not a general-purpose search engine crawler. Typical crawlers download the entire web page, find links to follow, and add those links to the work queue. We cannot do the same, because the web content we are crawling (the TDS catalog) contains vastly redundant information that cannot be downloaded and processed in its entirety without overloading the available computing and network resources. There are various sensors in the climate monitoring networks, and the set of sensors changes dynamically as new sensors are added and old sensors are removed. We had to crawl the THREDDS Data Server to make sure all the observations were fully synchronized in our catalog. Our crawler design must therefore incorporate knowledge of the metadata and the metadata structure into its processing and queueing algorithms in order to download only essential information.

Indexing
The third step is indexing, which extracts the spatiotemporal information from the crawled metadata and creates indexes for the data granules of each instrument's time series. CSW provides the basic metadata registration and query model. However, the large granularity of metadata objects (and the lack of aggregation/relational capabilities) makes CSW inefficient for storing and querying large numbers of datasets that have only small variations in their metadata. A more efficient model is needed. This is a long-explored and essentially solved problem in computer science and informatics. Theodoridis et al. [77] summarize the basic approach. For a time-evolving spatiotemporal object, a snapshot of its evolution can be represented by a triplet {o_id, s_i, t_i}: object id, space-stamp, and time-stamp. This information allowed us to create a "repository production model" (Figure 5). We identified patterns in the catalog hierarchical structure that allowed us to identify which paths in the catalog folder hierarchy are "live" and which ones are "archival". In our crawler implementation (discussed in the next section), we used the structure path patterns to drive the crawler algorithm in two stages: a "full sync" stage, which copies the archival data, and an "update" stage, which monitors and refreshes the listing from "live" catalog paths.
The repository production model allows targeted crawling; however, the number of metadata resources remains too large to harvest, process, and index in its entirety, even when done in two stages to avoid redundant harvesting. We needed a second model that encompasses the metadata information structure (Figure 6). There are two issues we needed to solve: first, most of the metadata in the catalog is completely redundant; second, the metadata information scope is not consistent in the catalog. The two issues have the same source: catalogs, sub-catalogs, and data granules can all have metadata attached.
In these examples from Unidata TDS, we see that metadata is attached to the hierarchical catalog structure in various ways. In the first example, a catalog contains some content metadata (for example, authorship), the sub-catalog contains additional content metadata (for example, variable names) and spatial metadata, while each granule contains temporal metadata. In the next two examples, the distribution of metadata between catalogs and granules is different. The last example is a case where each catalog only contains a single data record (granule). In some cases, the metadata is simply duplicated between several catalog levels, while in others, one specific layer contains all metadata. Another important detail is that the catalog hierarchy itself (the names of the parent catalogs) is also metadata for the data resources.
When combined, these two perspectives (information change model and information structure model) produce a model of the Unidata TDS repository that can be used to develop efficient (non-redundant) harvesting and representation of all contained metadata. By applying the production model to our crawler design, we were able to harvest only the information we know had changed. Knowing the structure of data changes also allowed us to perform targeted incremental harvesting for near real-time discovery capability. We defined two types of objects: collections and granules. A collection contains content metadata (title, description, authorship, variable/band information, etc.). Each collection contains one or more granules. Each granule contains only the spatiotemporal extent metadata. The OGC CSW catalog standard does not support the composition of collections and granules, so we used CSW to represent collections only, while granules had to be stored externally. We used the popular PyCSW software to hold collection metadata. We extended PyCSW with a PostgreSQL relational database to store the relations between collections and granules as well as the granule metadata (Figure 7).
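The collection and granule model described in this section could be laid out relationally along the following lines. This is only a sketch under stated assumptions: Python's built-in sqlite3 module stands in for PostgreSQL so that the example is self-contained, and every table, column, and function name, as well as the sample collection name, is hypothetical rather than taken from the actual schema in Figure 7.

```python
import sqlite3

# sqlite3 is a stand-in for PostgreSQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collections (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,   -- collection id; content metadata stays in the CSW record
    catalog_url TEXT NOT NULL    -- where to re-crawl for newer granules
);
CREATE TABLE granules (
    collection_id INTEGER NOT NULL REFERENCES collections(id),
    name TEXT NOT NULL,
    time_start TEXT NOT NULL,    -- ISO 8601 strings sort chronologically
    time_end TEXT NOT NULL,
    bbox TEXT                    -- space-stamp, e.g. "west,south,east,north"
);
CREATE INDEX granules_by_time ON granules (collection_id, time_start);
""")


def add_collection(name, catalog_url):
    """Register a collection and return its row id."""
    cur = conn.execute(
        "INSERT INTO collections (name, catalog_url) VALUES (?, ?)",
        (name, catalog_url),
    )
    return cur.lastrowid


def add_granule(collection_id, name, time_start, time_end, bbox=None):
    """Store only the spatiotemporal extent of one granule."""
    conn.execute(
        "INSERT INTO granules VALUES (?, ?, ?, ?, ?)",
        (collection_id, name, time_start, time_end, bbox),
    )


def granules_between(collection_name, start, end):
    """Return names of granules whose extent overlaps [start, end]."""
    rows = conn.execute(
        "SELECT g.name FROM granules g "
        "JOIN collections c ON c.id = g.collection_id "
        "WHERE c.name = ? AND g.time_end >= ? AND g.time_start <= ? "
        "ORDER BY g.time_start",
        (collection_name, start, end),
    ).fetchall()
    return [row[0] for row in rows]
```

Each granule row is a compact instance of the {o_id, s_i, t_i} triplet, so millions of granules cost a few short columns each instead of a full metadata document.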

Two-Step Search Process
When the metadata is harvested into PyCSW and the temporal granule index is saved in PostgreSQL, the search clients can use these two data sources to retrieve final results for access. The search process takes place in two steps. Initially, the client searches the PyCSW store using standard search methods and queries. This returns a list of collection-level results. To get a list of granules, the search client sends a second query to the crawler service. The crawler service queries the granule index, refreshes the index with the latest granules if needed, and returns a list of granules for the requested collection. The search client can then use the collection-level CSW record and combine it with selected granule information to produce granule-level CSW information. Figures 1 and 8-10 show these interactions from systems architecture and event sequence perspectives.
Figure 8. Implementation architecture for searching big data served via Unidata THREDDS Data Server (TDS). Although in our study only UCAR TDS is used, the system is designed to support any TDS repository as a data source.
So far, we have analyzed the Unidata TDS repository structure, built a model of the repository that can inform an effective crawling strategy, and defined the model for the product output for the crawler. We have also described how a search client should function. To complete our experiment, we built a crawler that follows our metadata model and demonstrates a web search capability for the entire contents of Unidata TDS.
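The client side of the two-step search can be sketched in a few lines of Python. The sketch assumes that collection records are plain dictionaries and that the granule listing is supplied as a callable; both are simplifications for illustration, and the function names and record fields are hypothetical rather than part of the CSW standard or of CyberConnector.

```python
def search_collections(csw_records, keyword):
    """Step 1: return collection-level records whose title mentions the keyword."""
    return [r for r in csw_records if keyword.lower() in r["title"].lower()]


def granule_records(collection, list_granules, start, end):
    """Step 2: list granules for one collection and merge in its content metadata."""
    records = []
    for granule in list_granules(collection["id"], start, end):
        record = dict(collection)         # inherit title, variables, and so on
        record["id"] = granule["name"]    # granule-level identifier
        record["time"] = granule["time"]  # granule-specific temporal extent
        records.append(record)
    return records
```

The second step runs only for the collection a user selects, so the cost of listing granules is paid on demand rather than for every search hit.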

Implementation
We implemented a module system within EarthCube CyberConnector [17,78,79] to realize the proposed model (Figure 8). The implementation includes the search server system and the client system. The following subsections introduce the searching capabilities enabled by these systems.

Crawler Service Implementation
We built a web crawler that traverses Unidata TDS and extracts and stores essential metadata without using unnecessary resources. It is named 'thredds-crawler' and the source code is available via a public GitHub repository: https://github.com/CSISS/thredds-crawler.
To support our big-data experiment requirements, the crawler is tightly integrated with the PyCSW catalog software [https://pycsw.org/] and a PostgreSQL [https://www.postgresql.org/] database. The crawler, PyCSW, and the database each run in a separate Docker [https://www.docker.com/] container. For the sake of this demonstration, all three services run on the same machine and communicate over the local network. The Docker Compose tool is used to connect and orchestrate the three containers. This architecture allows simple scaling out to multiple machines using containers, which could substantially improve system performance.
The crawler Docker container runs as a web service hosted by Gunicorn, a Python HTTP server widely used for hosting web applications. It serves three HTTP API endpoints that perform the following functions: harvest, create index, and read index.
The harvest function loads the Unidata catalog XML from the specified catalog_url using the Siphon library. The catalog contains a list of datasets. TDS has a feature to translate its dataset metadata into the ISO/OGC-compatible XML format. For each dataset being harvested, the harvester constructs a query to TDS to retrieve ISO/OGC metadata for the dataset. The ISO/OGC metadata returned by TDS is, however, often incomplete, inaccurate, or inconsistent in some way. The crawler harvesting process therefore applies a chain of XML filters to the ISO/OGC metadata to rectify it with information from the native TDS dataset metadata. Once the metadata is downloaded and processed, it is saved directly in the PyCSW database by using a PyCSW compatibility library.
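The idea of a chain of XML filters can be illustrated with the Python standard library alone. The sketch below is a simplification under several assumptions: the XML is un-namespaced (real ISO 19139 documents use namespaces such as gmd), the element names title and identifier are placeholders, and the two filters are hypothetical examples rather than the filters implemented in thredds-crawler. Fetching the catalog and the ISO document over the network is omitted.

```python
import xml.etree.ElementTree as ET


def fill_missing_title(root, native):
    """Use the native TDS dataset name when the ISO record has no title text."""
    title = root.find("title")
    if title is None:
        title = ET.SubElement(root, "title")
    if not (title.text or "").strip():
        title.text = native["name"]
    return root


def set_identifier(root, native):
    """Overwrite the identifier with the native TDS dataset id."""
    ident = root.find("identifier")
    if ident is None:
        ident = ET.SubElement(root, "identifier")
    ident.text = native["id"]
    return root


# Filters run in order; each receives the tree and the native dataset metadata.
FILTERS = [fill_missing_title, set_identifier]


def rectify(iso_xml, native):
    """Apply every filter in the chain and return the corrected XML string."""
    root = ET.fromstring(iso_xml)
    for apply_filter in FILTERS:
        root = apply_filter(root, native)
    return ET.tostring(root, encoding="unicode")
```

Keeping each correction in its own small function means a new inconsistency found in TDS output can be handled by appending one filter to the list.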
Indexing is similar to harvesting, but involves a strategy for targeting datasets to be harvested and additional processing steps. During index creation, TDS catalogs and datasets are turned into collections and granules in our model. For each TDS dataset encountered, we determine the collection name. Crucially, the collection name is not the name of the catalog containing the dataset. We found that catalog names are inconsistent, but that TDS dataset ids contain consistent identification information. In the TDS, dataset ids are kept unique by including timestamps in the dataset id. For example, in a dataset with id "NWS/NEXRAD3/PTA/YUX/20190830/Level3_YUX_PTA_20190830_1713.nids", the portion "20190830_1713" is a timestamp. To turn TDS catalogs with datasets into collections with granules, we remove the temporal information to construct the collection id. Then, we download the dataset in the ISO/OGC XML format and transform its XML content with a filter function called "collection builder". This function updates the dataset metadata to turn it into a more general form that describes the collection. It changes identifiers stored in the metadata. It also adds standard-compliant additional fields that identify the metadata for describing "series" ("series" in ISO/OGC model, "collection" in our model). This process needs to be done only for the first dataset encountered for each collection. When processing additional datasets, the existing collection is reused. In TDS, the dataset spatiotemporal extent information is part of the catalog metadata, which means that we only need to download a single dataset metadata to build the collection metadata and we can index the remainder of granules from catalog metadata. This solves the redundancy issue that previously prevented TDS from being searchable. We also correct TDS identifiers to ensure that the namespace authority portion of the identifier is correctly set.
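Removing the temporal portion of a dataset id can be sketched with a regular expression. The pattern below assumes that timestamps appear as YYYYMMDD, optionally followed by _HHMM, as in the NEXRAD example id; other data types may need other patterns, and the resulting string is only an illustration, not necessarily the collection name the crawler produces.

```python
import re
from datetime import datetime

# Assumption: a timestamp is 8 digits (YYYYMMDD), optionally followed by _HHMM,
# and may be preceded by a "/" or "_" separator that is removed with it.
TIMESTAMP = re.compile(r"[_/]?(?<!\d)(?P<date>\d{8})(?:_(?P<time>\d{4}))?(?!\d)")


def split_dataset_id(dataset_id):
    """Return (collection_id, granule_time) for a TDS dataset id."""
    stamps = [
        m.group("date") + (m.group("time") or "0000")
        for m in TIMESTAMP.finditer(dataset_id)
    ]
    collection_id = TIMESTAMP.sub("", dataset_id)
    # The most specific stamp (date plus time) sorts last among equal dates.
    granule_time = datetime.strptime(max(stamps), "%Y%m%d%H%M") if stamps else None
    return collection_id, granule_time


collection_id, granule_time = split_dataset_id(
    "NWS/NEXRAD3/PTA/YUX/20190830/Level3_YUX_PTA_20190830_1713.nids"
)
print(collection_id)  # NWS/NEXRAD3/PTA/YUX/Level3_YUX_PTA.nids
print(granule_time)   # 2019-08-30 17:13:00
```

Every granule from the same radar site and product maps to the same collection id, while the extracted timestamp becomes the granule's temporal index entry.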
The following tables help illustrate the process of extracting collections from a granule identifier for multiple types of data. Table 2 shows the catalog paths of TDS datasets. Table 3 shows how the collection identifier is generated, and Table 4 shows the final result collection name.
Table 2. Catalog paths of TDS datasets for three types of data. The catalog path hierarchy is marked in green. The dataset filename is marked in red.
Table 3. TDS dataset identifiers for three types of data. The portion of the identifiers that contain temporal information is highlighted.
When the index harvesting is complete, the collection information (OGC/ISO 19139 XML metadata format) is stored in PyCSW. The granule information is stored in a compact SQL index store (Figure 7). Once the index is created, it can be retrieved from the crawler web service using the HTTP API (GET /index). These requests take a collection name and temporal extent as parameters. Although our data model includes granule spatial and temporal extent, at the time of publication, only temporal index queries were implemented. The index service checks the index data store to see if the latest available granules are newer than the requested time extent. If more recent granules are not required, the crawler returns a list of granules in compact JSON format (Figure 9). However, if the index does not contain recent enough granules, then the index service performs a partial "refresh" indexing of the TDS repository. It uses the TDS catalog link stored in our PyCSW collection and re-runs the index process described here (Figure 10). However, as we discussed in the Experiment section, the TDS catalog is organized with some sub-catalogs storing archival information, while others contain near real-time "live data". The crawler index refresh process takes advantage of that structure. It ignores the old sub-catalogs and only indexes those that contain more recent and unknown data. This makes near real-time index retrieval fast and efficient.
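The refresh-on-read behavior of the index service can be sketched as follows. This is a minimal in-memory illustration, assuming the partial crawl of live sub-catalogs is available as a callable; the class name, method names, and the crawl_live signature are hypothetical and do not describe the actual thredds-crawler API.

```python
from datetime import datetime


class GranuleIndex:
    """In-memory stand-in for the SQL granule index (hypothetical interface)."""

    def __init__(self, crawl_live):
        self._granules = {}            # collection name -> sorted [(time, granule name)]
        self._crawl_live = crawl_live  # callable(collection, since) -> [(time, name)]

    def latest(self, collection):
        """Return the time of the newest indexed granule, or None."""
        entries = self._granules.get(collection, [])
        return entries[-1][0] if entries else None

    def read(self, collection, start, end):
        """Return granule names in [start, end], refreshing the index if it is stale."""
        latest = self.latest(collection)
        if latest is None or latest < end:
            # The request reaches past the newest known granule, so crawl
            # only the live sub-catalogs for anything newer than `latest`.
            fresh = self._crawl_live(collection, latest)
            merged = set(self._granules.get(collection, [])) | set(fresh)
            self._granules[collection] = sorted(merged)
        return [
            name
            for time, name in self._granules.get(collection, [])
            if start <= time <= end
        ]
```

A request whose time extent is already covered by the index is answered without touching TDS, which corresponds to the fast path in Figure 9; the crawl in the stale branch corresponds to Figure 10.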

Both harvesting and index creation use the same multi-threaded queue strategy to achieve higher performance. Normally, most of the time is spent waiting for data to be transmitted over the network. By using many threads, we can increase the saturation of both the network and local computer and memory resources, which allows the metadata to become available much faster.
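A multi-threaded work queue of this kind can be sketched with Python's standard queue and threading modules. The sketch assumes a caller-supplied list_children function that returns the sub-catalog URLs of a catalog (in practice a network request); the function names and the default of eight workers are illustrative choices, not values from the thredds-crawler source.

```python
import queue
import threading


def crawl(root, list_children, num_workers=8):
    """Visit a catalog tree with a shared work queue; return (visited, errors)."""
    tasks = queue.Queue()
    lock = threading.Lock()
    seen = {root}
    visited, errors = [], {}
    tasks.put(root)

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work
                tasks.task_done()
                return
            try:
                children = list_children(url)  # network-bound in practice
                with lock:
                    visited.append(url)
                    new = [c for c in children if c not in seen]
                    seen.update(new)
                for child in new:
                    tasks.put(child)  # queued before this task is marked done
            except Exception as exc:
                with lock:
                    errors[url] = exc
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    tasks.join()  # returns once every queued catalog has been processed
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return visited, errors
```

Because workers spend most of their time waiting on the network, several requests are in flight at once, and a failed request is recorded without stopping the other workers.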

Search System Implementation
The search system is implemented based on the previously developed EarthCube CyberConnector infrastructure building block [17]. CyberConnector is a Java-based web application that supports discovery and visualization of data from CSW catalogs [17]. We extended CyberConnector to support accessing metadata harvested and indexed by the thredds-crawler described in the previous section. We modified the CyberConnector Search Client to perform a two-stage search. The web application user selects the "Search" function (Figure 11) and chooses a time range, which is used by the thredds-crawler index service to determine whether a granule refresh is needed. The web browser sends an AJAX request with the search parameters to the CyberConnector web application. CyberConnector queries the thredds-crawler PyCSW service for collections that match the query parameters and returns a list of collections. To see the granules available in a collection, the user clicks the "List Granules" button (Figure 12). This issues another request to CyberConnector for a list of granules in the specified temporal extent. The CyberConnector web application proxies this request to the thredds-crawler indexing service, which either returns a list of granules directly (Figure 9) or first harvests TDS to update the index and then returns the list (Figure 10). The client receives a list of granules, which can then be downloaded or visualized (Figure 13).

Experiment and Results
Based on the implemented catalog system, we conducted several experiments to validate the feasibility of the proposed approach. Datasets for climate science are generally very large because of their long time spans and high temporal resolution. We took the UCAR NEXRAD dataset [80] and the RDA ASR dataset (53.09 terabytes) [81] as our demonstration examples. Search capabilities for the two datasets were established in the EarthCube CyberConnector. We ran a complete set of tests on the searcher, and the results are presented below.

Searching the NEXRAD Dataset
NEXRAD is a very important dataset for climate science research. It currently comprises 160 sites throughout the United States and selected overseas locations (as shown in Figure 14). The basic original datasets, which include three meteorological base data quantities (reflectivity, mean radial velocity, and spectrum width), are called Level II. The derived products, which include numerous meteorological analysis products, are called Level III. All NEXRAD Level-II data are available via NCEI, as well as through the NOAA big data plan cloud providers, Amazon Web Services (http://thredds-aws.unidata.ucar.edu/thredds/catalog.html) and Google Cloud (https://cloud.google.com/storage/docs/public-datasets/nexrad). UCAR provides the near real-time observed data via its THREDDS data server (http://thredds.ucar.edu). Unfortunately, all these data repositories are still non-searchable at present, because it is a huge challenge for any catalog to index and search such a large number of metadata files for radar data records that are updated so frequently (every 6 min). We used this dataset to prove that the proposed cataloging approach can work well on frequently updated big datasets.

The completed system consists of the harvester/indexer service and the search client, which is available to the user as a web application. As a result, users are able to search diverse heterogeneous Earth system observation and modeling datasets simultaneously. Once the metadata is found, users can use the CyberConnector visualization system to simultaneously visualize near real-time NEXRAD radar, satellite observation, and forecast simulation model product data. The system performance characteristics of this approach are significantly improved over the existing naive method of harvesting all of the datasets' metadata.

Searching UCAR RDA (Research Data Archive) TDS Repository
The NSF-funded NCAR CISL (Computational & Information System Lab) maintains the Research Data Archive (RDA), which stores over 11,000 terabytes of climate datasets in its high-performance data storage system. RDA hosts many climate datasets at present, and the Arctic System Reanalysis (ASR) is one of them. ASR is a demonstration regional reanalysis for the greater Arctic developed by Ohio State University. The ASR version 2 dataset (the latest version) is served via RDA with a total volume of 53.04 terabytes. The horizontal resolution is 15 km and the temporal coverage is from 2000 to 2016. It has 34 pressure levels (71 model levels); 31 surface (including 3 soil) and 11 upper-air analysis variables; and 71 surface (including 3 soil) and 17 upper-air forecast variables.
RDA provides TDS for most of its archived datasets. We harvested the metadata of ASR from its TDS and made it publicly available in CyberConnector. As shown in Figure 15, scientists can search the ASR dataset by providing keywords, spatial extent, or temporal range. The ASR data is in NetCDF format, which is displayable in COVALI. We demonstrated searching the ASR dataset in COVALI and visualized the temperature at 2 m above the surface within 12 h. COVALI and RDA were deployed in two remotely distributed facilities. The interactions between COVALI and the RDA big data storage were conducted via the standard service interface and over the network. The experiment proves that the proposed solution works well for enabling search on remote big data.
The search currently supports filters on keywords, data format, and spatiotemporal extents. All of them are fixed filters with little uncertainty. Therefore, the returned results stay the same as long as no records are added to or deleted from the metadata base. Result completeness is 100%, because all records that match the filter conditions are returned. Users can narrow down the spatiotemporal extent based on their interest and provide one or more keywords that could match the data field names. The first-page relevance of the search results depends on the relationships between the entered keywords and the metadata field values. Based on our experience with climate scientists, we find that they normally do not enter any keywords and use only spatiotemporal filters to check what can be searched in a catalog. Once they have a region of interest or a time window, they get an impression of what data may be available. They come to the catalog just to find the access URL to download or visualize the data files. Normally, results from our search client are numerous because of the loose filters the scientists give. The results on the first page are usually closely related to the scientists' needs. A more intelligent search, such as a semantics-based search, which could find more accurate first-page results with higher relevance, will be studied in the next stage of work.

Performance Evaluation
The traditional approach to cataloging climate datasets is to fully harvest the metadata files of every single data record. We previously implemented the searcher using the traditional method, but it was very slow, and sustained operation was not possible in a practical big data cataloging scenario. After we applied the new cataloging strategy, we tested it by crawling several hundreds and thousands of records from the UCAR THREDDS Data Server. We tested with different numbers of parallel workers (40, 20, 10, 5, and a single worker) to measure the improvements from parallel crawling. Figure 16 displays the time cost of the test, comparing the performance of the traditional approach and the proposed approach. The results demonstrate that the proposed approach is at least ten times faster than the traditional approach in overall time cost (from ~10 s to ~1 s) and shows significant improvements in harvesting speed, storage use, and search speed, depending on the number of datasets being processed. Search time cost has two components: the time to search for collections in the catalog and the time to retrieve the granules list from the granule index. Figure 17 shows that search result retrieval is extremely fast in our system.

Discussion
Our solution to the big data volume, variety, and velocity challenges discussed in the paper consists of a novel metadata model, and cyberinfrastructure architecture and implementation that is derived from the model. The metadata model combines the description of metadata content (the "information model") with the description of metadata repository structure and behavior. The cyberinfrastructure consists of a crawler service that takes advantage of the metadata model to optimize THREDDS crawling strategy to eliminate the transfer and processing of redundant metadata information. Additionally, the metadata repository model permits the crawler service to perform incremental metadata transfer, which enables real-time search capability. The demonstrated cyberinfrastructure also includes an interoperable catalog service that uses the metadata model to minimize the storage of redundant information. Finally, a search client that uses the catalog and the crawler services is implemented.

Can the Proposed Solution Address the Volume Challenge?
Metadata volume is ~25 GB for the UCAR RADAR dataset. The traditional method for harvesting metadata (as discussed in Section 6.3) is able to process approximately one record (with an approximate size of 100 KB) per second. At that record size, ~25 GB corresponds to about 250,000 records, so completely ingesting all of the THREDDS RADAR metadata at the observed harvesting rate would take 250,000 s, or ~70 h. By using the proposed metadata model and cataloging system, we observe harvesting rates that are at least 10 times faster. This permits daily synchronization of all Unidata TDS metadata.
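The ingestion-time estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and treats the "at least 10 times faster" rate as exactly a tenfold speed-up, so the second figure is an upper bound under that assumption.

```python
METADATA_BYTES = 25 * 10**9   # ~25 GB of RADAR metadata (decimal units assumed)
RECORD_BYTES = 100 * 10**3    # ~100 KB per metadata record
TRADITIONAL_RATE = 1.0        # records per second with the traditional method
SPEEDUP = 10                  # "at least 10 times faster" taken as exactly 10x

records = METADATA_BYTES // RECORD_BYTES
traditional_hours = records / TRADITIONAL_RATE / 3600
proposed_hours = traditional_hours / SPEEDUP

print(f"{records:,} records")                           # 250,000 records
print(f"traditional: {traditional_hours:.1f} h")        # 69.4 h
print(f"proposed:    {proposed_hours:.1f} h or less")   # 6.9 h or less
```

Under these assumptions a full pass over the RADAR metadata drops from roughly three days to well under one day.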

Can the Proposed Solution Address the Velocity Challenge?
We determined that new (live) RADAR metadata is generated at 330 records per minute. With the traditional method, our maximum harvest capacity (constrained by Unidata THREDDS network capacity) is 60 records per minute, so we cannot keep up with the data velocity. Using the indexing harvester approach, we can process up to 1400 records per minute, which exceeds the velocity of THREDDS data production. Additionally, by using incremental index updates during the client search request exchange, we can target the indexing harvest process at the exact sub-catalog containing the updated information and thus provide real-time search capability for this high-velocity data.
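To make the comparison concrete, the following sketch computes how quickly a backlog would grow at each harvest capacity and how much of the indexing capacity the live stream consumes. The three rates are the figures quoted above; the helper names and the assumption of constant rates over a full day are illustrative only.

```python
ARRIVAL = 330       # new RADAR metadata records per minute
TRADITIONAL = 60    # records per minute, traditional harvesting
INDEXED = 1400      # records per minute, indexing harvester

MINUTES_PER_DAY = 60 * 24


def backlog_per_day(arrival, capacity):
    """Records left unprocessed after one day at constant rates."""
    return max(0, arrival - capacity) * MINUTES_PER_DAY


def utilization(arrival, capacity):
    """Fraction of harvest capacity consumed by the incoming stream."""
    return arrival / capacity


print(backlog_per_day(ARRIVAL, TRADITIONAL))        # 388800
print(backlog_per_day(ARRIVAL, INDEXED))            # 0
print(f"{utilization(ARRIVAL, INDEXED):.0%}")       # 24%
```

At constant rates, the traditional method would fall behind by several hundred thousand records per day, whereas the indexing harvester would use about a quarter of its capacity on live data.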

Can the Proposed Solution Reduce Metadata Crawling Redundancy?
The solution demonstrated here is able to reduce redundancy in crawling and in storage resource consumption. For example, using the traditional method with the Forecast Models catalog, ~7000 records are downloaded, and the total storage used is 1.85 GB. The same metadata can be processed using our approach by downloading only 45 sample metadata records (2.2 MB) that represent collection-level information. This represents a reduction of more than 99% in data transmission and storage costs.

What Are the Benefits and Drawbacks of the Proposed Solution Compared to Other Big Data Searching Strategies?
The solution demonstrates the expected benefits described at the beginning of this study. The main drawback of this solution is the model and software system complexity. Custom software has to be developed to intelligently process catalogs as they are being harvested. To get complete and accurate results, the ingested metadata must be cleaned and transformed to fill in missing pieces of information and to make it conform to our model. Although our approach is general enough to work with multiple TDS repositories, in practice, inconsistencies and additional varieties from each repository must be reconciled using custom code. Our work demonstrates that it is possible to build a unified and highly efficient searchable catalog system for large and heterogeneous Earth system data repositories that supports real-time queries; however, every solution has its limitations and costs. In this case, the costs are complexity in software and systems architecture, which means increased software development and maintenance costs.

Conclusions
This paper proposed and demonstrated a novel cyberinfrastructure-based cataloging solution that enables an efficient two-step search on big climatic datasets by leveraging existing data centers and state-of-the-art web service technologies. To validate its feasibility, we used the huge datasets served by the UCAR THREDDS Data Server (TDS), which serves Petabyte-level ESOM data and updates hundreds of terabytes of data every day, as our study dataset. We analyzed the metadata structure in TDS and created an index for data parameters. A metadata registration model, which defines constant information, delimits variable information, and exploits spatial and temporal coherence in metadata, was constructed. The model derives a sampling strategy for a high-performance concurrent web crawler bot, which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standard-compliant catalog service form an incremental search cyberinfrastructure, allowing scientists to search big climatic datasets in near real time. We experimented with the approach on both UCAR TDS and NCAR RDA TDS, and the results prove that the proposed approach achieves its design goal, which is a significant breakthrough for climate data servers that are currently mostly non-searchable. The solution identified redundant information and determined the sampling frequencies needed to keep unpredictable parts of the source catalog synchronized with our downstream mirror catalog. An automated hierarchical crawler-indexer and a complementary search system using the pre-existing EarthCube CyberConnector were implemented. Metadata crawling and access performance validate our integrated approach as an effective method for dealing with the big data challenges posed by heterogeneous, real-time Earth System Observation and Model data.
However, although the proposed approach outperforms the traditional searching solution for big data, it is still time-consuming in both the crawling and searching processes, and it may not keep pace with real-time streaming data. In the future, we will study how to further reduce the time spent crawling redundant metadata and how to achieve rapid, intelligent, high-performance search.