A Methodology for Heterogeneous Sensor Data Organization and Near Real-Time Data Sharing by Adopting OGC SWE Standards

: Finding a solution to collect, analyze, and share, in near real-time, data acquired by heterogeneous sensors, such as trafﬁc, air pollution, soil moisture, or weather data, represents a great challenge. This paper describes the solution developed at Eurac Research to automatically upload data, in near real-time, by adopting Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) standards to guarantee interoperability. We set up a methodology capable of ingesting heterogeneous datasets to automatize observation uploading and sensor registration, with minimum interaction required of the user. This solution has been successfully tested and applied in the Long Term (Socio-)Ecological Research (LT(S)ER) Matsch-Mazia initiative, and the code is accessible under the CC BY 4.0 license.


Introduction
The possibility to collect and ingest data in near real-time for the monitoring of events belonging to different fields of our life is a big challenge.One issue is being able to collect the data, while another is being able to access the data quickly enough and with sufficient ease to allow visualization and analysis.
In recent years, the market for sensors has grown quickly and it is possible to find more and more efficient and intelligent devices, with a subsequent notable decrease in their costs.Hydrology [1], oceanography [2], risk monitoring [3], weather forecasting or citizen-and urban-sensing [4] and the monitor of the essential climate variables (ECV) [5] represent some of the fields of interest.In fact, the possibility of defining a real-time solution for data acquisition could improve the understanding of physical processes and their evolution.
Due to the large number of sensor manufacturers and the related diversity of their output formats, the first problem to face is finding a uniform way to organize data for analysis and, as a further step, to share them on the web.In this context, the idea of the Sensor Web represents a solution to standardize this process.
The definition of the "Sensor Web" was introduced by Delin et al. [6] and Delin [7], as an idea to create sensor nodes for collecting data by means of a set of shared and robust protocols.The Open Geospatial Consortium (OGC) defines interoperability interfaces and metadata encodings that enable the real-time integration of heterogeneous sensor webs and the Internet of Things into the information infrastructure by means of the Sensor Web Enablement (SWE) initiative [8][9][10].
OGC is an international, non-profit, and voluntary consensus standards organization.It consists of more than 380 companies, government agencies, and universities.OGC introduced the following SWE standards:
These standards allow sensor interoperation and service synergy, based on the possibility of discovering and exchanging data from sensor observations, facilitating process automation.Furthermore, the availability of the respective metadata guarantees enough information provided to allow sensors to be selected properly and to give an initial overview of their specific histories and characteristics.
This paper describes a solution to automatically upload data, in near real-time, in any standard SOS implementation, to simplify data sharing and interoperability and to speed up the output of research.

Related Work
The diffusion of the OGC SWE standards led to the creation of sensor observation management systems, such as istSOS (http://istsos.org/)[14] and the 52 • North SOS project (http://52north.org/about/52north).These implementations form a basis for organizing datasets and sharing them on the internet in a real-time system.
Of these two solutions, we decided to adopt the latter because it is robust, largely adopted in research institutes, and is well-supported by the developers.Addditionally, the GET-IT (http: //www.get-it.it/)system integrates the 52 • North SOS project for distributing observation data and metadata of sensors, according to the INSPIRE directive.It offers a user-friendly web interface which can effectievly simplify the use of the SWE standards, especially for non-expert users [15,16].As mentioned before, the heterogeneous and dynamic market of sensors does not fit with the concept of standardization.This complicates the harmonization of datasets and increases the effort before uploading into the SOS service.This is why it is also necessary to create an intermediate phase, to make sensor data ready for the SWE initiative.Martínez et al. [17] considered a valuable solution in this direction, by means of a plug-and-play sensor integrated into research data infrastructures.To facilitate the integration of multiple data providers, it is also possible to use a Lightweight SOS Profile [18].The basic idea is to comprise those basic SOS operations (GetCapabilities, DescribeSensor, GetObservation, and GetFeatureOfInterest), so that they must be implemented by every standard compliant SOS server to support typical functionalities, such as sensor data access and visualization.
These solutions show that data standardization represents a great opportunity to take into account, as it allows the movement from local to distributed and cloud computing, in which information and computing resources are provided as web services [19].
This paper focuses on the development of a flexible procedure which, with a minimum interaction of the user, automatically uploads data, in near real-time, in any standard SOS implementation.In particular, the main idea is to adopt the 52 • North SOS implementation and to integrate a tool able to automatize observation uploading and sensor registration (InsertSensor operation), with the least effort possible required by the user.This automatization is, in our opinion, still missing in the available opensource solutions and represents a big problem for researchers.In our work, we try to fill this gap by means of a Python-based tool that is efficient and easy to use.Furthermore, this solution offers the possibility of simplifying the creation of sensor metadata in a standard format (i.e., OGC, INSPIRE, or ISO) or, whenever necessary, in a project-specific format.In fact, considering the increasing number of sensors, it is extremely important to adopt a common standard for metadata, in order to obtain an organized and interoperable federated system.
The paper contains the following sections: Section 3 introduces the rationale behind the implementation design.Section 4 describes the concept of the data life cycle, which is fundamental in describing all of the phases, from acquisition through to the sharing of data.In particular, the procedure we set up to have the data running in a SOS service is described.Section 5 presents an application of this procedure used in the LT(S)ER initiative.Finally, the conclusion section summarizes our results and defines the future steps.
In the initial stage of the implementation phase, we realized that the user may not even know the OGC standards or have competence in writing code; either of which could increase errors and production time.Therefore, the previous HCI factors were directly discussed with the user because it is important to know, already in the test phase, his/her perception of the methodology and, above all, if it meets the requirements.As a consequence, the possibility to automatize, as much as possible, the entire data flow, together with a minimization of user interaction, aims to guarantee an overall level of satisfaction with respect to the final goal represented by the possibility to have data in near real-time.This is the general concept of the rationale of our approach for the data life cycle.
In particular, ground stations have internet connections.In this way, the observed data are locally stored initially and then, after establishing a remote control and connection, they are directly accessible and downloadable.The harmonization phase guarantees quality-checked data that can be organized to be ready for the automatic upload in a SOS system.The SOS environment offers the possibility to access sensor metadata and their observations through an Applied Programming Interface (API) by means of GET requests.This system has the advantage that it is accessible with any programming languages able to deal with simple HTTP requests.Each data point at a given time stamp can be identified, downloaded, and included in further scientific analyses, resulting in a simplified human-data interaction.The data sharing is meant to connect users to the data stored in the database in a direct way.Depending on the user's experience with programming languages and the equivalent complex systems of data retrieval and analysis, it is useful to create systems able to expose the data in a very intuitive way.Such approaches could be web-based applications directly interacting with a back-end client and giving the possibility to access, wrangle, and visualize the data on the fly by a potential data user.

Data Life Cycle
The concept of the data life cycle consists of a sequence of phases that data goes through, from its initial generation or capture through to its eventual archival and/or deletion at the end of its useful life.It provides a high-level overview about the management of data, subdivided into different stages according to specific tasks, with the aim of the successful organization and preservation of data for use and sharing.Multiple versions of a data life cycle exist [20][21][22], where the main differences are related to the variation in practices across fields of interests or communities.
The phases of the data life cycle we adopted are represented in Figure 1.Data are collected (acquisition phase) by ground stations running proprietary software tools.To fulfill the remaining phases (harmonization, organization, analysis, and sharing), we combined open-source solutions with Python and R scripts to simplify the entire procedure.

Implementation Design
In the early phase of the development of the methodology, we tried to combine the concept of usability together with the factors of Human-Computer Interaction (HCI) to create a system capable of supporting the research in a functional way.Usability refers to the quality of the interaction, in terms of parameters such as time taken to perform tasks, number of errors made, and the time to become a competent user [23].The word "usability" also refers to methods for improving ease of use during the design process [24].Among the different factors involved in HCI, we focused on the following [25]:

•
The User-motivation, enjoyment, satisfaction, personality, and experience level; • Task Factors-easy, complex, novel, task allocation, repetitive, monitoring, skills, and components; • Productivity factors-increase output, increase quality, decrease cost, decrease errors, decrease labor requirements, decrease production time, and increase creative and innovative ideas leading to new products.

Acquisition
The Institute for Alpine Environment, with the technical collaboration of the Center for Sensing Solution group, at Eurac Research in Bolzano, Italy, installed and maintained ground stations for collecting both images (RGB and Infrared) and 75 other physical parameters, such as Normalized Difference Vegetation Index (NDVI), Surface Temperature, or Photosynthetically Active Radiation.In particular, digital cameras taking repeated images (Digital Repeat Photography, DRP) of the landscape at high frequencies (several images per day) helped to overcome the limitations of field observations by individuals, such as continuity and logistics [26].The geographical locations of these ground stations are shown in Figure 2a,b.The main goal of creating such a network is the possibility of having long-term observations with automated systems at high temporal (and, in the case of DRP, also high spatial) resolutions.
The solution we introduce in this work uses the data collected by this network of ground stations.

Harmonization
Harmonization consists of three steps: Checking, validating, and cleaning the acquired data.It is not unusual for data to have discrepancies in consistency or format and, as a consequence, it is important to create a cohesive dataset that is ready to be organized.In Figure 3, the details of the evaluated quality dimensions are listed.In particular [27,28]: Data completeness refers to whether there are any gaps in the data, between what was expected to be collected and what was actually collected.

•
Data consistency refers to whether the types of data align with the expected versions of the data that should be coming in.

•
Data uniqueness ensures that each record should be unique.

•
Validity of data is determined by whether the data measures that which it was intended to measure.The tests for these different dimensions are performed using scripts, mainly R-codes, written by the researchers of the Institute for Alpine Environment.

Organization
Collected data are stored in a relational database management system (RDBMS) using the database model implementation (https://wiki.52north.org/SensorWeb/SensorObservationServiceDatabaseModel) of the 52 • North SOS project (version 4.1), based on the OGC standard PostgreSQL [29].This database has all the necessary tables to provide the SOS service, without the need to develop a new one unless specific project customizations are required.
The 52 • North SOS provides a RESTful API for inserting data in the database using the POST method and the operations defined in the SOS standard:

•
InsertSensor: Register sensors in the DataBase (DB) following the SensorML standard.

•
InsertObservation: Upload measurements into the DB following the O&M standard.
The InsertSensor operation is done once and contains the following information: • PhysicalSystem: Identifier for the physical instrument in one of the platforms (stations).The InsertObservation operation allows the client to insert new observations for a sensor system formerly registered with the SOS.By means of this last operation, the records are inserted in the database.
Considering the big number of records to be collected, it is necessary to have a solution able to speed up not only the InsertSensor, but also the InsertObservation operations.

Methodology
The availability of an automation process gives the chance to upload new observations into the database as soon as they are collected.The innovation of the methodology is to facilitate the insertion of the observations into the database with a semi-automatic procedure.In fact, only a minimal interaction of the user and/or data provider is required, see Figure 4. Furthermore, the possibility to have a methodology based on OGC standards can improve the comparability of data collected by any other institutions, offering the possibility of their reuse and, thus, avoiding duplication.The first step to be performed by the user is the standardization of the input file.The user has to define a univocal structure for the input file, as every station has its own data-logger using a proprietary standard.
The structure we propose is the following, see Figure 5:  A fixed structure for the input data allows us to know, in advance, how to correctly retrieve all the data to be inserted into the database.A well-established naming convention is necessary for correct use of a predefined folder structure.In fact, the code uses the naming convention, extracting the FOI string, to automatically create paths to locate and access files for the entire process.This will avoid modifying the script for each new data insertion.
The second step is the creation of the configuration file.
The configuration file is an XML file, which has to be filled by the user because it contains all the information for describing the sensor metadata and observed properties list; see Figure 6.The idea of creating a configuration file is essentially twofold: Firstly, it will contain all the information useful for automatically creating both the Insert Sensor and Insert Observation XML files, by means of a XSL Transformation (XSLT); secondly, it can be seen as a raw version of the metadata for the considered FOI.The XSLT is strictly connected to the configuration file.In fact, all the necessary information to be inserted into the Insert Sensor and Insert Observation XML files will be retrieved from it.As a consequence, it is not possible to use a standard SensorML document because it will not be useful for the automation of the aforementioned SOS operations.Furthermore, the definition of a custom XML document offers the possibility to include information not foreseen by the standard model.
After the creation of the configuration files, the procedure can execute the following steps: 1. Run the environment_generator.shscript.This code creates the data folder structure and all the auxiliary files and folders for the insertion process.

2.
Save the the .csvinput data files in the Data folder.

3.
Run the SOS_Insert_Sensor.pyscript once to register the sensors in the SOS service.4.
Run the SOS_Insert_Observation.py script to upload the observation into the SOS service, every time new data are available.
The implemented Python script uses the Insert Observation XML template file to read a sensor-specific .csvfile.Afterwards, it extracts the information to store in databases (time and real values registered) and posts them to the database server.During the process of data insertion, the correct folder structure is highly relevant.The Data folder is organized in the following tree structure: Station category folder → sensors folders → folder containing the .csvdata file of observation.The folder structure, created with the environment_generator.shscript, already foresees this organization to guide the user in the preliminary process of input data-file organization.Two versions of the SOS_Insert_Observation.py script have been implemented: 1.
Serial approach: The process reads every folder and .csvfile of the sensor consequently, one after the other; and 2.
Parallel approach: The process analyzes multiple folders simultaneously and inserts multiple observations at the same time, assigning to each of the available computer's cores the first available station category folder.
The different performances of these two versions were compared, in a benchmark test, to test the efficiency and the speed in uploading data.We used 14 input data files, describing one entire year of acquisition of the corresponding 14 station categories.In particular, we tested both versions with respect to an increasing number of input files, checking the time required to complete the process.As shown in Figure 7, it transpires that the choice between the two versions is almost time-independent, when considering less than four station categories.When the number of station categories starts to increase, the parallel approach can be about 80 min faster than the serial.
As a consequence, we suggest using the serial version when uploading data belonging to a station category number less than ten and if it is necessary to debug the code.If the number of the sensors is greater than 10, the parallel approach will save time.

Metadata
Metadata is structured information that describes data.With well-defined metadata, users should be able to get basic information about data without the need to have knowledge about its entire content.
The availability of metadata in the sensor web can be very useful for discovering the right sensor, together with its spatial and temporal information.When the sensor is described in a standard way, a more expressive representation, advanced access, and formal analysis of its resources can be granted.
Sensor metadata provides descriptions of the properties of the sensors and the related infrastructure; it includes a description of the sensor, platform, time of acquisition, location, and other relevant information to characterize the measurement.After inserting the sensor into the database, the DescribeSensor operation allows the client to retrieve metadata about the sensors, encapsulated by a SOS instance.It is a mandatory element of the SOS specification.The request of the DescribeSensor operation is very simple, as the only parameter consists of the identifier of the sensor for which a metadata document is requested.The response is a SensorML document [30].In addition to this, we created metadata in our GeoNetwork (http://sdi.eurac.edu)catalogue, which implements the Catalogue Service for the Web (CSW) standards.An extension of the SensorML document can be the creation of a metadata file that can be ingested by a Catalogue Service for the Web (CSW), such as GeoNetwork.As Geonetwork is not able to harvest metadata from an SOS service, we use an XSL transformation, starting from the configuration file and a XSLT-style sheet defining the transformation rules.The use of an XSLT is fundamental.XSL stands for Extensible Stylesheet Language, and it is a stylesheet language for XML documents [31].The main advantage of using an XSLT consists in having the possibility of adapting all the information dealing with the sensor in the corresponding standard to be used or directly customized to the requirements of a specific project; see Figure 8.In this way, a unique configuration file allows us to have both the XML file for the InsertSensor operation in the SOS service, and an XML file for a CSW catalogue, compliant with the OGC standards as well as INSPIRE and ISO-19139.

Analysis
The RESTful API, provided by the SOS service, allows access to a complete dataset by means of various clients.We use R because it easily allows the handling of large data tables and the creation of a visualization client.We created the R package MonalisR (https://gitlab.inf.unibz.it/earth_observation_public/MonalisR), which is able to communicate with the SOS API to download data in JSON format, giving a complete overview of the database content; see Figure 9.
It consists of four main functions: Setup, Exploration, Download, and Visualization.To start using this package, it is necessary to set the URL path to interface-API, in order to explore the database.Once connected, it is possible to extract data, filtering by time span, FOI, property, or procedure-the MonalisR package creates the URL path to query the SOS service.The visualization functions allow us to plot the spatial location of the measurement stations or to visualize the downloaded time-series by simple lines or box plots.In addition, basic statistics are computed and NaN values are automatically excluded, if present.Users can also directly interact with the RESTful API by creating individual queries and adding their own R scripts for other data analysis.This package can also query any other APIs which provide data in JSON format.

Share
As a further development, we have created a basic web application which has all the functionalities of the MonalisR package, as listed above.We used Shiny, an R package, to develop this interactive web application.The main advantage of using Shiny is the possibility of creating a customizable Graphical User Interface (GUI), through which the user can perform queries using the available drop-down menus to select the input criteria, as well as an interactive map to geographically locate the ground stations.The web application communicates with the SOS service in real time and allows interactive visualizations of the extracted dataset by selecting different upper tabs, as shown in Figure 10.As this web application is installed on a server, all of the users have access to it by means of a web browser, facilitating popularization of the data.At its most basic level, the web application provides the means to make data available to the researcher in a format that can be more easily handled.While repositories physically hold datasets, a web portal offers a faster way to search and visualize data by creating ad-hoc queries using GUI, without writing codes.

Study Area
The Long-Term Ecosystem Research (LTER) Europe (http://www.lter-europe.net/) is an initiative which aims to study how ecosystems react, in the long term, when influenced by different factors (belonging to different fields) at different scales (local, regional, continental, and global).Different countries are involved in this project, providing research analysis dealing with their typical environmental typologies.In the case of Italy, freshwater and marine ecosystems have been analyzed in the LTER italian network (http://www.lteritalia.it/).
The Institute for Alpine Environment, with the technical collaboration of the Institute for Earth Observation at Eurac Research in Bolzano, Italy is involved in this network, within the framework of the initiative LT(S)ER Matsch-Mazia ("S" stands for Social) (http://www.lteritalia.it/macrositi/it25).At present, this network has collected about 80 million observations from 30 ground stations, corresponding to 500 sensors spread over the Val Mazia-Matsch area in South Tyrol, as shown in Figure 2b.This valley is characterized by a low anthropic impact, relative dryness, and its hydrographic basin does not extend very far.In general, Val Venosta is the reference arid valley, but it is too large.The position of the ground stations reflects the need to cover a range of altitudes (between 1000-2700 m) where typical mountain agriculture is developed and the collected data will drive the studies to correlate how the environmental conditions affect functional trait variability within communities and, thus, shape ecosystem properties ( [32,33]).

Sensors
The LTER Stations in Val Mazia-Matsch are individually equipped with diverse sensors, detecting various key environmental parameters.Environmental parameters measured and transferred automatically by the LTER stations are the surface and soil temperatures, wind fluctuations and speed, relative humidity, soil water content, and potential of precipitation measurements.Furthermore, sensors for the analysis of the spectral characteristics near the stations have been installed.These detect variables such as the total solar radiation and the photo-synthetically active radiance (PAR), or vegetation indexes such as the Normalized Difference Vegetation Index (NDVI) and the Photochemical Reflectance Index (PRI).In Table 1 list of the sensors used in the LT(S)ER initiative is shown.For the sake of brevity, this list is not fully representative.For complete information, please refer to the LT(S)ER website (http://www.lteritalia.it/macrositi/it25).

Results
In this case study, SOS represents the common platform for sharing the collected data using SWE standards.As described before, it is fundamental to create the necessary configuration files for describing the entire network of ground stations.For this project, 30 configuration files are necessary.Once they are ready, the process to insert the corresponding sensors into the SOS can start and the remaining steps are completely automatic.The system is able to verify if new input files arrive in the dedicated data folder.If new data are available, the corresponding Insert Observation file is created and automatically inserted in the database.Considering the frequency of ground measurements, we set this automatic and customizable routine to start every fifteen minutes, allowing a near real-time database update, as well as analysis-ready data for researchers.
Due to the data policy of the LT(S)ER initiative, we developed two different web applications: Public and internal.Both are based on the same SOS, 52 • North, and regularly updated with the incoming observations.The public one contains only a restricted subset of the data that can be visualized on the dedicated LT(S)ER website (http://sdi.eurac.edu/lter_sos/static/client/helgoland/index.html);see Figure 11.Here, the user can have a first look at the data and compare the different trends in the latest collected data.If further data are needed, please refer to the contact page in the website.The internal one contains the full dataset, and is accessible only by authorized researchers involved in the project.
The creation of such an infrastructure can be helpful for research purposes.In fact, once the database is filled with the observations it is possible to query for specific values, such as NDVI, and to extract time series belonging to different sensors which are ready to be analyzed [34].

Conclusions
This describes our solution to automatically upload data, in near real-time, in any standard SOS implementation.In particular, we jointly used Python scripts, R codes, and the SOS service 52 • North, which implements the SWE OGC standards.The advantage of this solution consists in having a lightweight system requiring only minor actions from the user.These interactions comprise of the creation of a configuration file describing all the necessary information about the sensors, and to format the input data files following a few requirements.Consequently, once the sensors have been registered in the SOS service, the system is ready to automatically check for new data and to insert them using Python scripts.
Another important result is the possibility of simplifying the sensor metadata creation, compliant with the OGC standards (as well as the INSPIRE and ISO-19139 standards).The development of an R client to dynamically interact with the database, as well as a Shiny web interface, represents another further improvement important for data visualization and analysis.
This solution has been applied in the framework of the LT(S)ER Matsch-Mazia initiative, producing significant time savings in managing about 80 million observations.This, in turn, facilitates the sharing of collected time-series data.Some of the most significant advantages that we have achieved are:

•
High performance, thanks to the availability of a parallelized process which reduces the time for inserting new measurements in a SOS implementation when the number of station categories is greater than ten.

•
An OGC standard-compliant methodology, allowing the reuse of data outside the research context and, thus, avoiding duplication.

•
A flexible methodology, which can be applied to upload data into any SOS implementation.

•
Open source and custom software: Our python scripts can be improved, with respect to a more dynamic way to handle the parallelization processes.

•
Collected data are available in near real-time and accessible through a web application, offering a faster way to search through and visualize data by creating ad-hoc queries using GUI without writing codes.

•
Source code available in a dedicated GIT repository.
Future work will focus on the optimization of the creation of the configuration file by developing a GUI.We plan to develop a driver to connect the OpenEO API (http://openeo.org/)to the SOS API, in order to access different services from different clients (such as R, Python, and Javascript) for big earth observation cloud back-ends, in a simple and unified way.

Figure 1 .
Figure1.The adopted data life cycle for implementing our methodology.In the double-framed box, our contribution is summarized.
(a) Ground station network used for data collection.(b) Ground station network overview in Val Mazia, Matsch area.

Figure 2 .
South Tyrol network of ground stations.

Figure 3 .
Figure 3. Flowchart describing the data quality-check performed to harmonize the collected data.

•
Data accuracy refers to whether the collected data is correct, and accurately represents what it should.

Figure 4 .
Figure 4. Schema for the insertion of the sensors and observations.From the data provider/user point of view, only two actions are requested (in orange): Create a configuration file and standardize the input data file.
header describing the data column; • an established date and time format, (i.e., YYYY-MM-DDThh:mm); • the time-stamp is always stored in the first column; and • a predetermined naming convention (i.e., XXX_FOI_XXX.csv).

Figure 5 .
Figure 5. Standardization of the input file: Example of univocal structure for the input data file EuracID00216A_M2paf2200_201701010000201712312345.csv.

Figure 6 .
Figure 6.Creation of the configuration file: The XML fragment shows an example of some tags to be filled with the required parameters.

Figure 7 .
Figure 7. Performance analysis of the serial and parallel versions of the SOS_Insert_Observation.py script.

Figure 8 .
Figure 8. Metadata creation process, by means of the configuration file and an XSL transformation.

Figure 9 .
Figure 9. Schematic representation of the workflow to visualize SOS data using R.

Figure 10 .
Figure 10.Shiny web application to access data using a Graphical User Interface (GUI).

Figure 11 .
Figure 11.Public LT(S)ER web application, view of the Map section.
• Offering: A group of parameters measured by one sensor.It is worth noting that one procedure (PhysicalSystem) can have multiple offerings, but the same offering cannot be applied to multiple procedures.
• Observable property: The output parameters of the sensors.• Feature Of Interest (FOI): The coordinates and altitude of the station or platform where the sensor is installed.

Table 1 .
List of the main sensors used in the LT(S)ER initiative.