Assessment and Benchmarking of Spatially Enabled RDF Stores for the Next Generation of Spatial Data Infrastructure

Abstract: Geospatial information is indispensable for various real-world applications and is thus a prominent part of today's data science landscape. Geospatial data is primarily maintained and disseminated through spatial data infrastructures (SDIs). However, current SDIs are facing challenges in terms of data integration and semantic heterogeneity because of their partially siloed data organization. In this context, linked data provides a promising means to unravel these challenges, and it is seen as one of the key factors moving SDIs toward the next generation. In this study, we investigate the technical environment of the support for geospatial linked data by assessing and benchmarking some popular and well-known spatially enabled RDF stores (RDF4J, GeoSPARQL-Jena, Virtuoso, Stardog, and GraphDB), with a focus on GeoSPARQL compliance and query performance. The tests were performed in two different scenarios. In the first scenario, geospatial data forms a part of a large-scale data infrastructure and is integrated with other types of data. In this scenario, we used the metadata of the ICOS Carbon Portal, a real-world Earth Science linked data infrastructure. In the second scenario, we benchmarked the RDF stores in a dedicated SDI environment that contains purely geospatial data, and we used geospatial datasets with both crowd-sourced and authoritative data (the same test data used in a previous benchmark study, the Geographica benchmark). The assessment and benchmarking results demonstrate that the GeoSPARQL compliance of the RDF stores has encouragingly advanced in the last several years. The query performances are generally acceptable, and spatial indexing is imperative when handling a large number of geospatial objects. Nevertheless, query correctness remains a challenge for cross-database interoperability. In conclusion, the results indicate that the spatial capacity of the RDF stores has become increasingly mature, which could benefit the development of future SDIs.


Introduction
Geospatial information is indispensable for spatially informed decision-making and analyses and is thereby a prominent part of today's data science landscape. Significant progress in geospatial data availability and sharing has been achieved as a result of the development of spatial data infrastructures (SDIs) that aim to make geospatial data available for the benefit of the economy and society [1]. In Europe, the INSPIRE directive, a legal framework and standardization body for SDI development, sets the data specifications and mandates that its member states provide data mainly through Open Geospatial Consortium (OGC) web services [2].
Despite the significant progress, SDIs still face a number of limitations, especially in terms of discovery, reuse, and integration of the data. SDIs have partially succeeded in dissolving the silos that hold environmental and geospatial data, but the data is still largely isolated from other information domains [3]. For example, the OGC Web Feature Service (WFS) can make geospatial data available through its data query protocol, yet such data cannot be discovered by search engines or, more importantly, linked to by other data resources. This leaves the data in the so-called deep web [4].
ISPRS Int. J. Geo-Inf. 2019, 8, 310
Today's geospatial data is available and used not only in dedicated SDIs but also in various general data infrastructures and projects that are not dedicated to geospatial data. One open data example is the general-purpose knowledge graph DBpedia (https://wiki.dbpedia.org/), which has a large number of geospatial objects. In other words, geospatial data has become a part of today's big data landscape; thus, siloed data management and delivery should be revisited [5]. This is also in line with the development and vision of open SDIs, which highlight the integration and harmonization with other data [6].
Another significant issue in SDIs is semantic heterogeneity, which is an impediment to integrating multi-source geospatial data and fusing geospatial data with other types of data, as the semantics of metadata, schemas, and data content are not usually harmonized for multi-source geospatial data or with other types of data [7].
Semantic Web technologies, particularly the parts relevant to linked data, provide a promising way to resolve the aforementioned limitations. Linked data is built around a set of data publishing best practices and facilitates data access, interlinking, and integration on the web. A survey conducted in 2018 by EuroSDR demonstrated that linked data is seen as one of the most important research issues and key factors moving SDIs toward the next generation [8]. Linked data was also voted one of the most important SDI research topics during the AGILE 2018 workshop 'SDI research and strategies towards 2030' [9]. An increasing amount of geospatial data has been delivered as linked data on the web and has become part of the linked open data (LOD) cloud (https://lod-cloud.net/).
Linked data is organized according to the Resource Description Framework (RDF) [10], a generic graph-based data model that describes entities and relations. Linked data is also built upon formally defined ontologies, which provide the means to define the concepts and relations in the data, make explicit any underlying assumptions regarding the data, and make the data easier to understand and reuse. In practice, linked data needs to be managed, stored, and delivered by RDF stores (also known as triplestores), which are databases for storing and retrieving RDF data (linked data) through semantic queries (SPARQL queries [11]). The OGC extended SPARQL to develop GeoSPARQL, a query language for geospatial linked data that comprises a lightweight vocabulary to represent and query geospatial data [12]. The number of spatially enabled RDF stores (RDF stores that handle geospatial queries) is currently growing, and their compliance with GeoSPARQL has progressed. Therefore, there is a need to survey the status of spatially enabled RDF stores in terms of both geospatial query performance and GeoSPARQL compliance.
The aim of this study is to assess and benchmark several well-known and popular spatially enabled RDF stores for potential use in future SDIs and the geospatial linked data community at large (see supplementary files). In this context, we performed benchmarking in two different scenarios for future SDIs. The first scenario is one in which geospatial data plays an important role in and constitutes a part of a large data infrastructure; here, the focus is on the integration of geospatial data with other data. Two issues must be resolved here: the ontology of the geospatial components of the data should conform to the GeoSPARQL standard, and the RDF stores should be able to efficiently perform geospatial queries on a large volume of data that is a mixture of geospatial and other data. To evaluate the first scenario, we used data from the Integrated Carbon Observation System (ICOS) Carbon Portal (ICOS CP) [13], a large-scale Earth Science data infrastructure. The second scenario illustrates a dedicated SDI with purely spatial data; for this case, we used test datasets from Geographica, a previous geospatial benchmark for RDF stores [14]. These datasets include crowd-sourced (e.g., GeoNames, DBpedia, and LinkedGeoData) and authoritative geospatial data.
Following this introduction, the background and related work are presented in Section 2. The data used in this study is described in Section 3, including the ICOS CP's ontology design. Section 4 describes the assessment and benchmarking methodology, and the results are presented in Section 5 (for qualitative evaluation) and Section 6 (for quantitative evaluation). The paper ends with a discussion (Section 7) and conclusions (Section 8).

Geospatial Semantic Web and Linked Data
The Semantic Web is a common framework that allows data to be shared and reused across application, enterprise, and community boundaries [15]. In order to make the Semantic Web a reality, it is important to make a huge amount of data on the web available with recommended best practices for exposing, sharing, and connecting pieces of data, information, and knowledge. These best practices, as well as the delivered data, are also referred to as linked data. At the core of the linked data principles are the ideas of globally unique identifiers, i.e., Uniform Resource Identifiers (URIs) for data elements, and a universal graph data model, the Resource Description Framework (RDF). By reusing the addressing system used for web pages, one can uniquely identify and link to data elements and datasets anywhere on the web [16]. The appreciation of Semantic Web technologies and linked data has increased considerably in the geospatial domain in the last decade, and they have fostered a promising approach to connecting SDIs with mainstream IT to augment the application of geospatial data [3]. Semantic Web technologies, especially linked data, provide a promising means to address some long-standing challenges in the geospatial domain, e.g., data integration (e.g., [3]) and knowledge formalization (e.g., [17]).
Pilot studies have been performed releasing INSPIRE-compliant data as linked data, and draft guidelines and vocabularies have been developed [18]. The development of INSPIRE linked data's URIs leveraged previous work on the standardization of unique identifiers for geospatial objects [19]. In the meantime, an increasing amount of geospatial data has been delivered as linked data, mainly by governmental agencies and large-scale data infrastructures [20]. The UK is a pioneer to this end; Ordnance Survey, Great Britain's national mapping agency (NMA), released several geospatial datasets as linked data nearly a decade ago [21]. However, the data relied on unstandardized methods to represent data semantics and thus lacked usability. In the Netherlands, Kadaster delivered several key geospatial datasets, e.g., building data and address data, as linked data on the web, together with other governmental open data, e.g., statistical data [22]. In Finland, the National Land Survey piloted the delivery of geographic name data, authoritative data, and building data as linked data [23]. In Norway, Kartverket also released some geospatial datasets as linked data [24]. A recent report summarized and reflected on the development of geospatial linked data in the Netherlands, Finland, Norway, and Spain. The fact that different projects use different RDF stores also renders the aim of this study necessary [25]. In the US, several geospatial linked data projects have been conducted: a pilot of design and development of linked data from The National Map was performed [26]; the Geographic Names Information System was served as linked data, and its geospatial visualization was enabled [20]; the GeoLink knowledge graph was published following linked data principles and served through a SPARQL endpoint, including Earth Science information captured by oceanographic cruises, physical sample metadata, etc. [27]. Along with these linked data development endeavors from authorities, crowd-sourcing projects have also produced several geospatial linked datasets, and some of them are serving as central hubs of the LOD cloud, e.g., GeoNames (https://www.geonames.org/) and LinkedGeoData (a linked data distribution of OpenStreetMap [28]). Moreover, van den Brink et al. [29] proposed best practices for delivering geospatial linked data, and they bridged the OGC web services and the Semantic Web. In the Earth Science domain, there have also been several discussions about how to utilize linked data for data integration and discovery (e.g., [30]).
Semantic Web technologies and linked data have also been utilized in a number of studies in the geospatial domain. The studies on this subject span several research areas, e.g., geoprocessing, information retrieval, and visualization. For example, Hofer et al. [31] developed a knowledge base to support the composition of geoprocessing workflows with ontologies and the Semantic Web Rule Language (SWRL). Keßler et al. [32] leveraged linked data, ontologies, and SWRL rules for geospatial information retrieval with context awareness. Wiemann and Bernard [33] used linked data for data integration in the environment of SDIs. Huang et al. [34] leveraged linked data and ontologies to realize the relative positioning of geospatial data, thus enabling geometrically self-adapting web maps. Huang and Harrie [17] used linked data, ontologies, and semantic rules to realize knowledge-based visualization of geospatial data, thereby formalizing some visualization knowledge on the aspects of cartographic scale, data portrayal, and geometry source. To realize the potential revealed by the above studies (e.g., the use of ontological reasoning, rule-based reasoning, and spatial operations), we need RDF stores with capabilities such as semantic query, semantic reasoning, and geospatial query. Therefore, we used these capabilities in this study as part of the RDF store selection criteria (cf. Section 4.1).

Assessment and Benchmarking of Spatially Enabled RDF Stores
As the Semantic Web has evolved into the mainstream of the web and been adopted in many scientific domains (e.g., life sciences, geosciences), assessments and benchmarks of RDF stores have been abundant, mainly on synthetic and artificial test datasets. Popular benchmarks include, in chronological order, the Lehigh University Benchmark (LUBM) [35], the SPARQL performance benchmark (SP2Bench) [36], and the Berlin SPARQL Benchmark (BSBM) [37]. The DBpedia SPARQL benchmark (DBSB) [38] is a popular benchmark used for real-world linked data and queries (the queries are extracted from actual server logs). However, these benchmarks are mainly for common-use data and data from other domains, not geospatial data and queries. In addition, benchmarks based on synthetic data have been criticized because they have very little in common with the needs of real application domains [39].
For the assessment of spatially enabled RDF stores, in which an even higher level of complexity arises [40,41], Kolas [42] proposed and performed a benchmark for the geospatial query capacity of RDF stores; however, since it was proposed before the standardization of GeoSPARQL, not much from that work can be applied to today's developments. Battle and Kolas [43] demonstrated the geospatial capacity of Parliament and successfully ran a number of GeoSPARQL-compliant queries. Garbis et al. [14] presented the benchmark Geographica to assess several spatially enabled RDF stores in which spatial queries were written in both GeoSPARQL and stSPARQL (the spatiotemporal query language in the RDF store Strabon). In that benchmark, three RDF stores were evaluated, i.e., Strabon, uSeekM, and Parliament, in a micro-benchmark and a macro-benchmark. The micro-benchmark aims to test the efficiency of primitive spatial functions in spatially enabled RDF stores; the macro-benchmark aims to test the performance of the stores in certain application scenarios, e.g., reverse geocoding and map search. This benchmark's datasets and queries have been published online (http://geographica.di.uoa.gr/), and the benchmark was based on both real-world geospatial data (e.g., LinkedGeoData) and synthetic data. The GeoKnow project, which dealt with the geospatial Semantic Web and linked data, released a thorough survey and evaluation of spatially enabled RDF stores, with a partial focus on GeoSPARQL compliance [44]. The stores evaluated in GeoKnow include Virtuoso, Parliament, OWLIM, uSeekM, and Strabon, as well as spatially enabled relational databases, i.e., Oracle Spatial and PostgreSQL with the PostGIS extension. Bellini and Nesi [45] assessed several well-known RDF stores, including Virtuoso, GraphDB, Oracle, and Stardog, for semantically enabled smart city services. The geospatial capacity of these RDF stores was one of the focuses of that study, as smart city services also need capabilities such as temporal data query. The benchmark was based on the Florence Smart City model; the datasets and tools used are available online. These benchmarks clearly demonstrated the sparse support for spatial operations in RDF stores, and the RDF stores supporting GeoSPARQL were very few. Specifically, many RDF stores, e.g., Virtuoso, used their own syntaxes for geospatial queries rather than GeoSPARQL, and most RDF stores supporting GeoSPARQL queries were developed in academic environments, e.g., Parliament. Furthermore, the query performance was generally unsatisfactory, which also undermined the usability of these very few spatially enabled RDF stores.
The abovementioned previous works provide useful grounds for this study to evaluate the geospatial query capacity of RDF stores for future SDIs and for the geospatial linked data community at large. However, these previous studies have some limitations. First, the results are now mostly outdated, as the status of the tested RDF stores has changed considerably: some of them have developed more advanced support for geospatial queries and increased GeoSPARQL compliance, and some of them have become obsolete and are rarely used. Second, the assessments and benchmarks targeting geospatial query (i.e., the Geographica and GeoKnow benchmarks) depended on either synthetic data or purely geospatial data (in which nearly all the data objects have geometric information and are involved in spatial indexing/search). Our first test scenario, which uses data from ICOS CP, is, however, an Earth Science data infrastructure with a portion of geospatial data, which is more in line with the current role of geospatial data in large data infrastructures (open SDI). In addition, we provide a reproducible benchmark with deliverables that others can use to assess the RDF stores on their own datasets. Additionally, one shortcoming of previous benchmarking works on spatially enabled RDF stores is that they focused entirely on evaluating the query performance (response time) and did not assess the correctness of the returned results. In this paper, we assess query correctness in the first scenario.

ICOS Carbon Portal Metadata
In the first scenario, we used data from ICOS CP (see supplementary files). ICOS is a Pan-European research infrastructure that currently has 12 member countries and the legal status of a European Research Infrastructure Consortium (ERIC) (https://ec.europa.eu/info/research-and-innovation/strategy/european-research-infrastructures/eric_en). It is a European measurement system for high-quality, high-precision greenhouse gas observations and environmental monitoring. Currently, there are 135 measurement stations (including co-located ones), with 33 atmosphere stations, 81 ecosystem stations, and 21 ocean stations (Figure 1 shows the geographic locations of the stations).
ICOS CP is the data portal that provides free and open access to all ICOS datasets. ICOS data products include quality-controlled observational data, elaborated (model) products, and synthesis reports, which are material for policymakers. The users of ICOS CP span various domains, e.g., (Earth Science) researchers, education users, policymakers, and stakeholders in the negotiation of carbon reduction policies. ICOS produces around 25-30 TB of sensor data per year, together with about 1 GB of processed data products and 5-20 TB of elaborated data products. Additionally, as ICOS CP has become a well-recognized data sharing and distribution platform, some other data initiatives and producers, e.g., SOCAT (https://www.socat.info/), have also contributed by publishing their data through ICOS CP. The observation data at ICOS CP is linked to georeferenced locations. The atmospheric and ecosystem observations are connected to the coordinates of the measurement stations. For the ocean data, ship trajectories are stored as lists of XY coordinate pairs. The huge amount of data delivered and the complex organizational structure and responsibility raise the importance of data cataloging and discovery.
ICOS CP is an active practitioner of the FAIR principles, which aim to make data Findable, Accessible, Interoperable, and Reusable [46,47]. In this context, ICOS CP has adopted linked data for delivering and publishing all its metadata (including metadata for ICOS data and other data harvested by ICOS CP, e.g., SOCAT data) to make such data more discoverable. The metadata is available through, among others, a SPARQL endpoint (https://meta.icos-cp.eu/sparqlclient). Geospatial data forms a part of the ICOS CP metadata. As the size of ICOS CP metadata is constantly growing because observational data is continually ingested, query performance will become a notable issue. To accelerate the spatial search of ocean data, each trajectory is simplified into a line string or a polygon (concave hull of the trajectory) containing a maximum of 20 coordinate pairs (by an in-house developed streaming algorithm that extends the algorithm from [48]). These simplified geometries are stored in the ICOS CP metadata.
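The idea of capping each trajectory at 20 coordinate pairs can be illustrated with a minimal Python sketch. It is not the in-house streaming algorithm, which is not reproduced here; as a stand-in it uses a simple Visvalingam-Whyatt-style reduction that repeatedly drops the interior point spanning the smallest triangle with its two neighbours. The function names and the planar area measure are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Planar area of the triangle a-b-c (coordinates treated as plain XY)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_to_max_points(points: List[Point], max_points: int = 20) -> List[Point]:
    """Reduce a trajectory to at most max_points vertices.

    Illustrative stand-in, NOT the ICOS CP algorithm: the interior point
    that contributes the smallest triangle area is removed until the limit
    is met. The first and last points are always kept.
    """
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    pts = list(points)
    while len(pts) > max_points:
        # Area spanned by each interior point and its two current neighbours.
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        # Drop the least significant interior point (offset by 1 for the start point).
        del pts[areas.index(min(areas)) + 1]
    return pts
```

This naive version recomputes all areas after each removal, which is quadratic in the number of points; a streaming implementation would avoid that, but the sketch is enough to show how a fixed vertex budget preserves the most prominent shape features.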

The linked data implementation is built upon a set of ontologies for different scopes of the data portal responsibility. Among them, the most important ontology is the ICOS CP metadata ontology (with the prefix cpmeta (https://meta.icos-cp.eu/ontologies/cpmeta/)). The ICOS CP metadata ontology relies on and has strong interoperability with some W3C standard ontologies, e.g., the W3C PROV ontology [49] and the W3C organization ontology [50]. For the details of the ICOS CP ontologies, please refer to its GitHub repository (https://github.com/ICOS-Carbon-Portal/meta/tree/master/src/main/resources/owl) or the online description (http://static.icos-cp.eu/share/slides/dataServiceWorkshop/#/).
In the ICOS CP metadata ontology, the instances of the class DataObject can be associated with the instances of the class SpatialCoverage, and the instances of SpatialCoverage can be associated with the serialization of the corresponding geometries (Figure 2 shows the part of the ICOS metadata ontology that is relevant to spatial information). Currently, the ICOS metadata ontology is not GeoSPARQL-compliant (the GeoSPARQL classes are not introduced into the ICOS metadata ontology, and the geometries are serialized in GeoJSON, which is not supported by GeoSPARQL). To support geospatial (GeoSPARQL) queries, we redesigned the ontology to accomplish GeoSPARQL compliance, as illustrated in Figure 2 (we use geo for the prefix of GeoSPARQL). That is, we built an inheritance relation in which SpatialCoverage is a subclass of geo:Geometry, and the instances can thereby be associated with the geometries in Well-Known Text (WKT) to enable GeoSPARQL-compliant geospatial queries. Afterward, we transformed all the geometries from GeoJSON to WKT using several SPARQL CONSTRUCT queries (the queries are available online at https://github.com/RightBank/Benchmarking-spatially-enabled-RDF-stores/tree/master/TransformationSPARQLQueries).
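The GeoJSON-to-WKT transformation mentioned above was carried out in the study with SPARQL CONSTRUCT queries. The following Python sketch only illustrates the mapping itself for points, line strings, and polygons; it is not the authors' method. The helper names are hypothetical, and coordinates are assumed to be in longitude-latitude order, as in GeoJSON.

```python
from typing import Sequence


def _position(coord: Sequence[float]) -> str:
    """Format one GeoJSON position as 'x y' (any third coordinate is ignored)."""
    return f"{coord[0]} {coord[1]}"


def _positions(coords: Sequence[Sequence[float]]) -> str:
    """Format a list of positions as a comma-separated WKT coordinate list."""
    return ", ".join(_position(c) for c in coords)


def geojson_to_wkt(geometry: dict) -> str:
    """Convert a GeoJSON Point, LineString, or Polygon geometry to WKT."""
    gtype = geometry["type"]
    coords = geometry["coordinates"]
    if gtype == "Point":
        return f"POINT ({_position(coords)})"
    if gtype == "LineString":
        return f"LINESTRING ({_positions(coords)})"
    if gtype == "Polygon":
        # A polygon is a list of rings: the exterior ring first, then any holes.
        rings = ", ".join(f"({_positions(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"Unsupported geometry type: {gtype}")


def to_wkt_literal(wkt: str) -> str:
    """Wrap WKT as a typed literal using the GeoSPARQL geo:wktLiteral datatype."""
    return f'"{wkt}"^^geo:wktLiteral'
```

For example, `to_wkt_literal(geojson_to_wkt({"type": "Point", "coordinates": [13.2, 55.7]}))` produces a literal that could be attached to a geometry instance through geo:asWKT.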

The test data for RDF store assessment and benchmarking is the entire set of metadata of ICOS CP, which has 2,194,299 RDF statements as of 18 March 2019. The dataset has been published online [51]. Among the data, there are 1068 spatial objects (88 polygons, 853 polylines, and 127 points). We believe that this situation mirrors the current development of geospatial data, in which it forms a part of a large-scale information infrastructure. Therefore, the results of this study can also be used as a reference for other linked data implementations in similar situations. Technically, extracting and querying relevant geospatial data from mass data that includes both relevant and irrelevant data is costlier for query planners in the RDF stores than merely operating without query-irrelevant data.
The most important geospatial query requirement for ICOS CP is to enable users to directly spatially select different types of data objects (e.g., measurement trajectories) in user-defined geometric ranges, which could be a simple rectangle or an arbitrarily complex polygon drawn by the users. In this context, the topological relations within, intersects, and overlaps are useful, but we would also like to support other geospatial functions available in GeoSPARQL, such as buffer, disjoint, and crosses, for specific user needs and requirements. Therefore, we tested the available spatial functions in the RDF stores, not only the functions for spatial selections (cf. Section 4.2).

Geographica Benchmarking Datasets
For the second scenario, in which the benchmarking is performed on a large amount of purely geospatial data, we used real-world datasets from the Geographica benchmark. Six real-world geospatial datasets in RDF were used: DBpedia, GeoNames, road networks and rivers from Greece, the Greek Administrative Geography dataset, the CORINE Land Use/Land Cover dataset, and wildfire hotspots from the National Observatory of Athens. All six datasets cover Greece. The six datasets contain more than 30,000 points, 12,000 polylines, and 104,000 polygons. Details of the datasets are provided in [14] and its online repository (http://geographica.di.uoa.gr/).

Evaluation Methodology
The evaluation of spatially enabled RDF stores was carried out in two stages. In the first stage, we selected the RDF stores using a set of criteria and analyzed in depth the geospatial features provided by the selected stores (e.g., GeoSPARQL compliance, licensing, and spatial indexing). The second stage applied a benchmark to the RDF stores in the two scenarios discussed above. It is based on a set of SPARQL queries that are capable of testing the geospatial query performance of the stores.

RDF Store Selection and Analysis
The selection of the tested RDF stores is based on the needs both of large-scale information infrastructures (ICOS CP in this case) and dedicated SDIs. First, general selection criteria were applied:

• The RDF store should be popular, well-known, and actively supported by a community or backed by a commercial vendor.

• The RDF store should support W3C standards, e.g., SPARQL 1.1.

• The RDF store should support semantic reasoning, performed either by triple materialization at load time or at query time (query rewriting), and the widely used reasoning types should be supported (e.g., RDFS, OWL, OWL2, and OWL2-DL). Additionally, rule-based reasoning should be supported.

• The RDF store should have geospatial query capacity, preferably with GeoSPARQL support and compliance.
On the basis of these criteria, a pre-selection was made. The final selection was then based on a qualitative analysis of the pre-selected RDF stores, carried out by reading their documentation (we contacted the vendor of Stardog, as we could not find information about its spatial index technique in its documentation). The key aspects of this analysis include the following: The popularity of the RDF stores was partly assessed using the DB-Engines ranking (https://db-engines.com/en/ranking/rdf+store).
Through the qualitative analysis, not only can we choose the evaluated RDF stores in our work, but we can also obtain an up-to-date view of the popular RDF stores, especially to gain insight concerning the recent development of spatially enabled RDF stores and their GeoSPARQL compliance.

Performance Benchmark of Geospatial Query in RDF Stores
In this study, we reused and tailored the micro-benchmark from the Geographica benchmark [14] to evaluate the RDF stores. The micro-benchmark from Geographica aims to test the efficiency of primitive spatial functions in spatially enabled RDF stores. Simple SPARQL queries that consist of one or two triple patterns and a spatial function were used as benchmark queries. This benchmark includes non-topological geometric construction, simple spatial selections, and more complex operations (e.g., spatial join). In the first scenario, we tailored the benchmark queries for ICOS CP metadata; a brief description of the tailored queries can be found in Table 1. For the second scenario, we adopted the original query set from Geographica [14]. In addition, in both scenarios, Q6 (area calculation), Q28 (extension construction), and Q29 (union construction) were removed because these functions are not supported by GeoSPARQL and are seldom supported by RDF stores. Q14 (spatial within function applied to buffers constructed at query time) was also removed, as this query is semantically equivalent to Q15 but more computationally expensive [14], and this type of nested spatial function is not always supported by RDF stores.
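To make the shape of such micro-benchmark queries concrete, the sketch below builds two GeoSPARQL query templates in Python: a spatial selection against a fixed WKT geometry and a spatial join between two sets of geometries. These are illustrative templates, not the actual queries of Table 1 or of Geographica; the function names and variable names are assumptions, while the geo: and geof: namespaces and the simple features functions (e.g., sfWithin, sfIntersects) come from the GeoSPARQL standard.

```python
PREFIXES = (
    "PREFIX geo: <http://www.opengis.net/ont/geosparql#>\n"
    "PREFIX geof: <http://www.opengis.net/def/function/geosparql/>\n"
)


def spatial_selection_query(function: str, wkt: str) -> str:
    """Select features whose geometry satisfies `function` against a fixed WKT geometry."""
    return (
        PREFIXES
        + "SELECT ?feature WHERE {\n"
        + "  ?feature geo:hasGeometry ?geom .\n"
        + "  ?geom geo:asWKT ?wkt .\n"
        + f'  FILTER(geof:{function}(?wkt, "{wkt}"^^geo:wktLiteral))\n'
        + "}"
    )


def spatial_join_query(function: str) -> str:
    """Pair up geometries from the dataset that satisfy `function` with each other."""
    return (
        PREFIXES
        + "SELECT ?a ?b WHERE {\n"
        + "  ?a geo:asWKT ?wktA .\n"
        + "  ?b geo:asWKT ?wktB .\n"
        + f"  FILTER(geof:{function}(?wktA, ?wktB))\n"
        + "}"
    )


if __name__ == "__main__":
    # A user-drawn rectangle used as the selection range (hypothetical coordinates).
    box = "POLYGON ((10 50, 20 50, 20 60, 10 60, 10 50))"
    print(spatial_selection_query("sfWithin", box))
    print(spatial_join_query("sfIntersects"))
```

Keeping the graph pattern this small means that the measured time is dominated by the spatial function and the index lookup rather than by ordinary triple-pattern matching.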
In our benchmark, we first warmed up the RDF stores with warm-up SPARQL queries in order to bring the benchmark systems into normal working conditions, as query performance in a cold state is often unstable and unpredictably low at the beginning because of factors such as the initial interpretation and compilation of code. The warm-up queries are disjoint from the actual benchmark queries (cf. Table 1), and they are taken from the predefined queries at ICOS CP's SPARQL endpoint.
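The warm-up and measurement procedure can be sketched as a small timing loop. The study's benchmark was implemented in Java; the Python version below is only an illustration of the control flow, and `execute` is a hypothetical callable standing in for whatever client sends a query to a given RDF store and yields its results.

```python
import time
from typing import Callable, Dict, Iterable, List, Sequence


def run_benchmark(
    execute: Callable[[str], Iterable],
    warmup_queries: Sequence[str],
    benchmark_queries: Dict[str, str],
    warmup_iterations: int,
    benchmark_iterations: int,
) -> Dict[str, List[float]]:
    """Warm up the store, then time every benchmark query in every iteration.

    `execute` is an assumed interface: it takes a SPARQL string and returns
    an iterable of results. Timings are returned in seconds per query name.
    """
    # Warm-up phase: results are consumed but not timed.
    for _ in range(warmup_iterations):
        for query in warmup_queries:
            for _row in execute(query):
                pass

    timings: Dict[str, List[float]] = {name: [] for name in benchmark_queries}
    for _ in range(benchmark_iterations):
        for name, query in benchmark_queries.items():
            start = time.perf_counter()
            # Consume all rows so the measurement covers complete result retrieval.
            for _row in execute(query):
                pass
            timings[name].append(time.perf_counter() - start)
    return timings
```

Using a set of warm-up queries that is disjoint from the benchmark queries, as in the study, helps avoid measuring results that a store has simply cached from the warm-up phase.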

Implementation: Reusable Benchmark Deliverables
The benchmarking of the RDF stores was implemented in Java. We encapsulated the SPARQL queries and the code that interoperates with the underlying RDF stores in executable Jar (Java archive) packages that can be run directly with the Java Runtime Environment (JRE). The delivered Jar packages request the location of the data source, the number of warm-up query iterations, and the number of benchmark query iterations. The deliverable programs and source code (including the benchmark queries) are available online at https://github.com/RightBank/Benchmarking-spatially-enabled-RDF-stores.
After benchmarking, text files were generated with comprehensive information regarding the data loading time, the execution time of each query in each iteration, and the query results (including the number of returned objects and the returned features, mainly their geometries). The query execution time refers to the time elapsed between the point a query is sent to the RDF store and the point the query results are completely returned to the benchmark systems. The benchmark systems use the RDF stores in an embedded mode whenever possible.
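The measurement procedure described above (untimed warm-up iterations, then timed benchmark iterations that end only once all results have been returned) can be illustrated with a minimal sketch. It is not the code from the published repository: the class `QueryTimer` and its methods are hypothetical, and a query is modeled simply as a supplier of a result iterator so that the example stays independent of any particular RDF store API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical timing harness; names are illustrative, not taken from the benchmark code.
class QueryTimer {

    // Runs the warm-up query `warmUps` times without timing, then runs the
    // benchmark query `runs` times and records the elapsed nanoseconds of each run.
    // A timed run spans issuing the query and consuming every result.
    static List<Long> timeQuery(Supplier<Iterator<String>> warmUpQuery,
                                Supplier<Iterator<String>> benchmarkQuery,
                                int warmUps, int runs) {
        for (int i = 0; i < warmUps; i++) {
            drain(warmUpQuery.get());
        }
        List<Long> elapsedNanos = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            drain(benchmarkQuery.get());
            elapsedNanos.add(System.nanoTime() - start);
        }
        return elapsedNanos;
    }

    // Consumes the whole result iterator and returns the number of results.
    static int drain(Iterator<String> results) {
        int count = 0;
        while (results.hasNext()) {
            results.next();
            count++;
        }
        return count;
    }

    // Average of the recorded times, converted from nanoseconds to milliseconds.
    static double averageMillis(List<Long> elapsedNanos) {
        return elapsedNanos.stream().mapToLong(Long::longValue).average().orElse(0.0) / 1_000_000.0;
    }
}
```

Draining the iterator inside the timed region matters because many stores return results lazily; stopping the clock when the first result arrives would understate the execution time as defined above.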

Results of RDF Store Selection and Analysis
Using the selection criteria for testing RDF stores for this work, we thoroughly investigated a number of RDF stores, and we ultimately selected the following RDF stores for evaluation. The rationale for not selecting the formerly assessed and benchmarked spatially enabled RDF stores Parliament, Strabon, and uSeekM is that they are currently not actively supported by the community, and some of them have limited capacity for reasoning, particularly rule-based reasoning. That is, we only evaluated fully fledged and popular RDF stores with spatial query support.
The qualitative analysis of the selected stores resulted in a cross-store qualitative comparison. Table 2 compiles the results of the qualitative analysis with a focus on spatial query capacity and GeoSPARQL compliance. The storage solutions adopted by the RDF stores are mainly divisible into two types: native (designed from scratch) and RDBMS-based (based on an existing relational database management system). Four of the five tested stores utilize native solutions for storage; only Virtuoso relies on an underlying RDBMS. All tested RDF stores support spatial operations for geometries serialized in WKT; only GraphDB and GeoSPARQL-Jena support GML as well. RDF4J, GeoSPARQL-Jena, Virtuoso, and GraphDB currently provide full support for GeoSPARQL functions (the queries with spatial relations in the simple features relation family), including non-topological construct functions (Q1-Q5 in Table 1), spatial selection functions (Q7-Q17 in Table 1), and spatial join functions (Q18-Q27 in Table 1). Stardog only supports the functions that find the relations within, nearby, intersect, contains, disjoint, and equal, and it uses its own spatial query syntax. With regard to the spatial index technique, Lucene Spatial is commonly used because of its fast development and active support from the community. GeoSPARQL-Jena indexes and caches intermediate spatial query results to accelerate subsequent queries with similar graph patterns, and it supports constructing a dataset-specific spatial index, which cannot be migrated to other datasets. Virtuoso uses R-tree for spatial indexing. In Virtuoso and Stardog, the spatial index cannot be switched off for spatial queries, while the other stores support switching off spatial indexing. GeoSPARQL-Jena has been developed very recently, and it supports transformation between different spatial reference systems (SRSs), whereas the other stores only support WGS84. For those stores, this usually entails an SRS transformation before the data are imported.
Results of the Spatially Enabled RDF Store Benchmark

Experimental Setup
We ran the benchmark on a machine with an Intel Core i7-6700 processor (8M Cache, up to 4.00 GHz), 24 GB of RAM, and the operating system Ubuntu 18.04.1 LTS.
In the first scenario, the ICOS CP metadata was exported from its current RDF4J-based store into an RDF dump file with 2.2 M triples. In the second scenario, the Geographica data was downloaded from its online repository as dump files. The benchmark programs first loaded the dump files into each store and recorded the loading time (including the spatial index construction time).
Each query in the benchmark (Table 1) was run three times after a number of warm-up queries were finished. In order to test the difference between using and not using a spatial index, we tested GraphDB in both modes (the queries Q1-Q5 and Q15 do not differ between the two modes, as spatial indexing cannot be used in these queries in GraphDB). To determine the influence of the means of communication with the stores, we tested different communication interfaces with Virtuoso and Stardog. We tested Virtuoso's native interface Java Database Connectivity (JDBC) and RDF4J for operation and communication (as RDF4J is also commonly used as a library to manipulate other stores). We also tested Stardog's native interface SNARL and RDF4J for communication. We set a 1-h timeout for all queries.
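One common way to enforce a per-query timeout such as the 1-h limit above is to run the query on a worker thread and bound the wait on its result. The following is a minimal sketch under that assumption; the class `TimedQuery` and the sentinel value `TIMED_OUT` are hypothetical and do not describe how the published benchmark programs implement the limit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of the benchmark deliverables.
class TimedQuery {

    // Sentinel returned when the query does not finish in time.
    static final long TIMED_OUT = -1L;

    // Runs `query` (which returns its result count) on a worker thread and waits
    // at most `timeout` for it; on expiry the worker is interrupted.
    static long runWithTimeout(Callable<Long> query, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Long> future = executor.submit(query);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true);
            return TIMED_OUT;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as `TimedQuery.runWithTimeout(query, 1, TimeUnit.HOURS)` would correspond to the setting used here. Note that interruption only stops a query if the underlying store client responds to thread interrupts, which is an assumption that would need to be checked per store.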

Query Performance
Table 3 summarizes the loading time for the ICOS CP metadata of each store. All the stores import the 2.2 M triple dataset, and possibly construct the spatial index for it, in a reasonable time. Notice that the loading time is for the entire ICOS CP dataset, which contains around 1000 spatial objects and many other object types.
Table 4 summarizes the results for the average query execution time regarding RDF4J, GeoSPARQL-Jena, Virtuoso (connected through JDBC and RDF4J), Stardog (connected through SNARL and RDF4J), and GraphDB (with and without using a spatial index). For non-topological functions (Q1-Q5), GraphDB generally outperforms the other stores. The performance of RDF4J is comparable to that of GraphDB. Compared with the other stores, GeoSPARQL-Jena and Virtuoso take much more time to calculate buffers of polylines and polygons, which might be the result of their more complex custom implementations. Stardog does not support any of the non-topological functions.
For spatial selection queries (Q7-Q17), RDF4J provides generally good performance in terms of query response time. GraphDB has comparable performance, and it is much faster than the other stores for Q7 (equal polyline finding). Virtuoso has the best performance for Q13 (i.e., find all points in a given polygon, which is a very useful query for ICOS CP and many other linked data-based projects). Stardog has a reasonable performance but is much slower for Q7 using its native SNARL interface.
For spatial join queries (Q18-Q27), RDF4J provides the best performance for four queries (Q20, Q21, Q22, Q27), and it is generally fast at intersection queries. GeoSPARQL-Jena is fastest at Q23, Q24, and Q25 and is generally superior at within functions. GraphDB is the best at Q18 (without using a spatial index), Q19 (with a spatial index), and Q26 (with an index), and it generally provides reasonable performance for all queries. Virtuoso and Stardog are relatively slow for Q19, Q20, Q23, and Q24, which are mainly within and intersection queries; for these queries, the query performance differs by nearly three orders of magnitude, which indicates that some stores (Stardog and Virtuoso) may not be suited to spatial join queries.
We also observe that the performance with Virtuoso's native JDBC interface is similar to that with the RDF4J interface. With Stardog, using RDF4J as the interface generally leads to better performance than using its native interface SNARL, as RDF4J caches some intermediate query results. From the results, we observe that GeoSPARQL-Jena and RDF4J demonstrate a significant caching effect, i.e., the query time of the second and third runs drops substantially compared with that of the first run. This is in line with their means of implementation: they cache many intermediate query results. Other stores do not show a clear caching effect.

Query Correctness
Evaluating query correctness for spatial queries is complex, particularly when the queries deal with a large amount of data. However, query correctness is an important aspect in the assessment of the selected stores, especially because it is common for different stores to implement the spatial query functions differently. In this paper, we partially evaluate and discuss the query correctness by observing the results from the above-described benchmarking.
For topological queries, GeoSPARQL follows the definitions of topological relations in the dimensionally extended nine-intersection model DE-9IM [52]. A well-known and reliable implementation of DE-9IM is the Java library JTS Topology Suite, JTS (https://github.com/locationtech/jts). In this study, we performed all the benchmark queries using the JTS library, and we treat the returned results as reference results for the evaluation of the RDF stores. Queries whose number of returned results from the RDF stores differs from the number returned from JTS are shaded in Table 4. One exception is Q15, which is not supported by JTS (as JTS does not support distance calculation in geographic SRSs). Thus, we calculated the reference results for Q15 in ArcGIS 10.3.1.
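To make the role of DE-9IM concrete, the following sketch shows how a topological relation is decided once the nine-character intersection matrix of two geometries is known. It assumes that the matrix string has already been computed by a geometry library such as JTS; the class `De9imPattern` is hypothetical, and the pattern strings are those commonly given for the simple features relation family (the overlap pattern shown is the one for two areas or two points).

```java
// Hypothetical helper illustrating DE-9IM pattern matching; not part of JTS or any RDF store.
class De9imPattern {

    static final String WITHIN = "T*F**F***";
    static final String CONTAINS = "T*****FF*";
    static final String DISJOINT = "FF*FF****";
    static final String OVERLAPS_AREA = "T*T***T**";

    // Checks a matrix such as "2FF1FF212" against a pattern:
    // '*' accepts anything, 'T' any non-empty intersection (0, 1, or 2),
    // 'F' an empty intersection, and a digit requires that exact dimension.
    static boolean matches(String matrix, String pattern) {
        if (matrix.length() != 9 || pattern.length() != 9) {
            throw new IllegalArgumentException("matrix and pattern must have 9 characters");
        }
        for (int i = 0; i < 9; i++) {
            char m = matrix.charAt(i);
            char p = pattern.charAt(i);
            boolean ok;
            switch (p) {
                case '*': ok = true; break;
                case 'T': ok = m == '0' || m == '1' || m == '2'; break;
                case 'F': ok = m == 'F'; break;
                default:  ok = m == p;
            }
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Intersects is defined as the negation of disjoint.
    static boolean intersects(String matrix) {
        return !matches(matrix, DISJOINT);
    }
}
```

For a polygon lying strictly inside another, the matrix is "2FF1FF212": it matches the within pattern and counts as intersecting, but it does not match the overlap pattern. Intersect and overlap are therefore different relations under DE-9IM, and a store that treats them as equivalent will return different result sets.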
For Q1-Q9, all the evaluated stores provide the same number of returned results as JTS. For Q10, we find that Stardog handles the spatial relation intersect (for polygons) in a manner that differs from the other stores; it returns the same results as the other stores return for Q11, which queries all the polygons that overlap a given polygon. That is, the intersect function for polygons in Stardog is actually equivalent to the overlap function in other stores, and Stardog does not have the function overlap. For Q15, only GraphDB provides the same results as ArcGIS (10 results); RDF4J, Virtuoso, and Stardog return 11 results (probably linked to precision settings); and GeoSPARQL-Jena fails to give any result in spite of the relatively long time it takes on this query. For Q18, RDF4J fails to return any result, and this problem is potentially linked to the precision setting in RDF4J when finding equal points. For Q21 and Q25, RDF4J, Stardog (only for Q21), and GraphDB (using spatial indexing) return 563 results; Virtuoso returns 567 results; and GeoSPARQL-Jena and GraphDB (without using spatial indexing) return 565 results. This divergence may be linked to Lucene Spatial filtering out some results because of factors such as precision settings in different stores. JTS returns 565 results for these queries.
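The count-based check used here can be written down in a few lines. The sketch below uses the Q21 result counts reported above and flags every store whose count differs from the JTS reference; the class `ResultCountCheck` and its methods are hypothetical and only illustrate the comparison.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for comparing result counts against a reference implementation.
class ResultCountCheck {

    // Returns the stores whose result count differs from the reference count,
    // in the iteration order of the input map.
    static List<String> divergentStores(Map<String, Integer> countsByStore, int referenceCount) {
        List<String> divergent = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : countsByStore.entrySet()) {
            if (entry.getValue() != referenceCount) {
                divergent.add(entry.getKey());
            }
        }
        return divergent;
    }

    // Result counts for Q21 as reported in the text above.
    static Map<String, Integer> q21Counts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("RDF4J", 563);
        counts.put("Stardog", 563);
        counts.put("GraphDB (indexed)", 563);
        counts.put("Virtuoso", 567);
        counts.put("GeoSPARQL-Jena", 565);
        counts.put("GraphDB (no index)", 565);
        return counts;
    }
}
```

With the JTS reference of 565 results, `divergentStores(q21Counts(), 565)` flags RDF4J, Stardog, indexed GraphDB, and Virtuoso. Equal counts do not prove that two result sets are identical, which is one reason the evaluation here is described as partial.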

Benchmark Results with Geographica Datasets
In the second scenario, we tested the selected RDF stores with large geospatial datasets. This scenario is more in line with conventional SDIs, in which geospatial data dominates. Therefore, benchmarking the RDF stores with such large datasets to test their scalability will potentially benefit the SDI and geospatial linked data communities, as it is common for a project (especially a dedicated SDI) to have a vast number of geospatial objects.
The loading time of the six datasets in the five selected stores is presented in Table 5, and the query performance is presented in Table 6.
From Table 5, we can observe that a large number of geospatial objects does not lengthen the loading time for RDF4J, GeoSPARQL-Jena, and GraphDB. For RDF4J and GeoSPARQL-Jena, this is because they do not build a spatial index during data loading; for GraphDB, the spatial index construction is completed in a short time. Virtuoso takes longer (more than 10 min) to load and construct a spatial index for the data. For Stardog, the spatial indexing process is slow, as the whole loading and index construction process takes nearly five hours. The query performance of GraphDB is generally better than that of the others for the non-topological construct queries Q1-Q5, and RDF4J, GeoSPARQL-Jena, and Virtuoso have comparable performances. For the spatial selection queries Q7-Q17, all the RDF stores respond in a reasonable time, and GeoSPARQL-Jena performs better than the others in most of the queries. The spatial join query Q19 is the most computationally expensive query in the benchmark: RDF4J, Stardog, and GraphDB without spatial indexing all time out for this query, while GeoSPARQL-Jena provides the shortest time for this query (less than 10 min). For the other spatial join queries, Q20-Q27, GeoSPARQL-Jena generally performs better than the others, and all stores have reasonable response times. It is observed that different query interfaces do not have much effect on the query response time. For GraphDB, the indexed mode generally returns the results much more quickly than the non-indexed mode. The exceptions are Q16 and Q17, for which GraphDB has a very similar performance to that of RDF4J with quick responses; this might be the result of the simplistic implementation of the disjoint function in RDF4J (GraphDB is dependent on RDF4J in the mode that does not use spatial indexing).

Discussion
In this paper, we comprehensively assess and benchmark five popular and well-known spatially enabled RDF stores, i.e., RDF4J, GeoSPARQL-Jena, Virtuoso, Stardog, and GraphDB. It is encouraging to see the increasing maturity of the technical environment for the support of geospatial linked data, as well as the increasing compliance with GeoSPARQL compared with previous benchmarks. That is, progressively more mainstream and well-known RDF stores are (partially) supporting GeoSPARQL. Another positive observation is that the syntaxes used for geospatial queries with GeoSPARQL are the same in RDF4J, GeoSPARQL-Jena, Virtuoso, and GraphDB in this benchmark, which implies that the geospatial queries are cross-database interoperable in terms of query syntax (Stardog does not have the same geospatial query syntax as the others). Listing 1 is an example query of Q23 in the first scenario in RDF4J, GeoSPARQL-Jena, Virtuoso, and GraphDB (without using spatial indexing, as the filter should be replaced with a triple relation in the query when using spatial indexing in GraphDB, i.e., ?geom1 geo:sfWithin ?geom2.). Listing 2 is the corresponding query used in Stardog. The query performance is generally acceptable, and it is much better than previous benchmarking results because RDF stores have developed and computer hardware has advanced. GeoSPARQL was supported in all the stores except for Stardog after 2018, which also makes this paper timely in its contribution to the comprehensive understanding of this subject. We believe the increasingly mature technical environment will benefit the development of the next generation of SDIs, in which linked data is expected to play an important role.
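The difference between the filter form and the triple-relation form of a within query can be sketched as two query strings assembled in Java. This is an illustration only and does not reproduce Listing 1: the feature and geometry patterns are generic GeoSPARQL placeholders rather than terms from the ICOS metadata ontology, the class `WithinQuery` is hypothetical, and the Stardog-specific syntax is not shown.

```java
// Hypothetical query builder; the graph patterns are generic GeoSPARQL placeholders.
class WithinQuery {

    static final String PREFIXES =
            "PREFIX geo: <http://www.opengis.net/ont/geosparql#>\n"
          + "PREFIX geof: <http://www.opengis.net/def/function/geosparql/>\n";

    // Filter form: the topological test is a function applied to two WKT literals.
    static String filterForm() {
        return PREFIXES
             + "SELECT ?f1 ?f2 WHERE {\n"
             + "  ?f1 geo:hasGeometry ?geom1 . ?geom1 geo:asWKT ?wkt1 .\n"
             + "  ?f2 geo:hasGeometry ?geom2 . ?geom2 geo:asWKT ?wkt2 .\n"
             + "  FILTER(geof:sfWithin(?wkt1, ?wkt2))\n"
             + "}";
    }

    // Triple-relation form: the topological test is a property between the two geometries.
    static String triplePatternForm() {
        return PREFIXES
             + "SELECT ?f1 ?f2 WHERE {\n"
             + "  ?f1 geo:hasGeometry ?geom1 .\n"
             + "  ?f2 geo:hasGeometry ?geom2 .\n"
             + "  ?geom1 geo:sfWithin ?geom2 .\n"
             + "}";
    }
}
```

Because only the last pattern differs, switching GraphDB between its non-indexed and indexed modes amounts to exchanging one line of the query, while the rest of the query stays portable across the GeoSPARQL-compliant stores.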
From the query performance of the evaluated stores in the two scenarios, we observe that GraphDB is generally better than the others at non-topological queries, which are useful in many real-world spatial analyses: e.g., buffering is important for location selection analysis. GeoSPARQL-Jena and RDF4J are generally better than the other RDF stores at spatial selection queries, which are useful for many real-world use cases: e.g., for ICOS CP, the overlap and within functions are the most useful queries for enabling a user-defined spatial search. GeoSPARQL-Jena is superior at spatial join queries, which are used for tasks such as establishing relations between cadaster registries (points) and building objects (polygons).
A prerequisite of (partially) achieving cross-database interoperability is that the GeoSPARQL standard should be used when possible. The lightweight nature of the GeoSPARQL vocabulary means that accomplishing interoperability with GeoSPARQL for other spatial-relevant ontologies does not entail much work since, in most cases, it can be accomplished with subclass/subproperty inheritance. Nevertheless, we believe that GeoSPARQL should support more serializations to realize its wider adoption. It is especially desirable to have support for GeoJSON, which is widely accepted by the web development community.
One lesson learned from the experimental results is that, for a moderate amount of geospatial data (scenario 1 with about 1000 spatial objects), spatial indexing could be an overhead both for data loading and querying, whereas spatial indexing is certainly necessary when querying a large number of geospatial objects (scenario 2 with about 150,000 spatial objects). Most selected RDF stores provide reasonable data loading and spatial index construction times, except for Stardog, which takes nearly five hours to load and index the Geographica datasets. That is, we believe that enabling spatial indexing for querying large geospatial datasets is imperative, and constant change and injection of data are also feasible as long as the data loading and indexing times are reasonable. In this context, further assessment of the RDF stores with an even larger amount of data is desirable, which is interesting for large-scale geospatial linked data deployment.
From this assessment, we observe that most of the selected RDF stores with spatial indexing use Lucene Spatial for its easy deployment and wide support from the community. We argue that no spatial indexing technique can best fit all applications. In fact, it would be better to also enable developers and geospatial experts to configure specific and optimized spatial indexes tailored for certain datasets. This functionality is already provided by some RDF stores, e.g., RDF4J and GeoSPARQL-Jena.
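As background for the prefix-tree style of spatial indexing (the store descriptions mention a geohash-prefix-tree for GraphDB's Lucene Spatial index), the following sketch implements standard geohash encoding. It illustrates only the general idea that nearby locations share string prefixes; it is not the Lucene Spatial implementation, and the class `GeohashSketch` is hypothetical.

```java
// Hypothetical illustration of standard geohash encoding; not Lucene Spatial code.
class GeohashSketch {

    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a WGS84 latitude/longitude pair as a geohash of `precision` characters.
    // Bits alternate between longitude and latitude, starting with longitude, and each
    // bit records which half of the current interval contains the coordinate.
    static String encode(double lat, double lon, int precision) {
        double latLo = -90.0, latHi = 90.0, lonLo = -180.0, lonHi = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean longitudeBit = true;
        int bitCount = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (longitudeBit) {
                double mid = (lonLo + lonHi) / 2.0;
                if (lon >= mid) { value = (value << 1) | 1; lonLo = mid; }
                else            { value = value << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2.0;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            longitudeBit = !longitudeBit;
            if (++bitCount == 5) {
                // Five bits form one base-32 character.
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

A shorter geohash is always a prefix of a longer one for the same location, so a prefix tree over these strings can answer coarse spatial lookups by prefix matching. The cell size fixed by the chosen precision is one example of a setting that could plausibly contribute to the precision-related differences between stores discussed in this paper.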
Despite the promising results and advancements, there are still some challenges. One of the most significant challenges is query correctness. Although the queries are interoperable in terms of query syntax across most of the selected RDF stores, the returned results are sometimes not the same because of different implementations and interpretations of, for example, spatial topological relations. This issue renders the cross-database interoperability problematic for geospatial queries, which is rarely the case for other types of queries following the W3C recommendations. We think further development of the RDF stores might mitigate this issue, but to overcome this problem, we may need a community-backed and commonly used compliance testing suite regarding the OGC Implementation Standard for Geographic Information [52] for the implementation and interpretation of spatial functions. For the query correctness issue, we propose that a major cause is the different strategies for handling precision in the stores. Furthermore, as four of the five stores support only the SRS of WGS84, conducting spatial operations in a geographic SRS and converting data from other SRSs to WGS84 can lead to precision loss and thus incorrect or inaccurate results. Therefore, further investigation of the effect of precision settings in RDF stores is deemed necessary.
Another important topic that deserves investigation is the performance comparison between spatially enabled RDF stores and state-of-the-art OGC services (e.g., WFS). We speculate that current OGC services are superior to RDF stores at spatial queries. This raises the question of how much faster OGC services are than RDF stores. The answer to that question will potentially help to answer two other questions: (1) Should we (partly) leave the spatial operations to RDBMS-backed OGC services or other GIS tools, especially since spatial join queries do not perform favorably in the evaluated RDF stores, until their spatial capacities are significantly advanced? (2) Should data publishers or third parties pre-compute important and relevant spatial relations and publish them along with the data, which will greatly diminish the need for real-time spatial operations at the cost of pre-computation and an increase in data volume? Our initial opinion is that it will be beneficial to pre-compute some important spatial relations and release the relations together with geospatial linked data.
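The idea of pre-computing spatial relations and publishing them with the data can be sketched as follows: a point-in-polygon test is run once, offline, and every positive result is written out as a geo:sfWithin statement in N-Triples syntax. The sketch makes several simplifying assumptions that are not taken from this study: coordinates are treated as planar, polygons are single rings without holes, points exactly on a boundary are not handled specially, and the class `PrecomputedRelations` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-computation step; a simplification, not a full DE-9IM implementation.
class PrecomputedRelations {

    static final String SF_WITHIN = "http://www.opengis.net/ont/geosparql#sfWithin";

    // Ray casting (even-odd rule): counts how many ring edges a horizontal ray
    // starting at (x, y) crosses; an odd count means the point is inside.
    static boolean contains(double[][] ring, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            boolean straddles = (yi > y) != (yj > y);
            if (straddles && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    // Emits one N-Triples line for every (point, polygon) pair where the point lies inside.
    static List<String> withinTriples(Map<String, double[]> pointsByIri,
                                      Map<String, double[][]> polygonsByIri) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, double[]> point : pointsByIri.entrySet()) {
            for (Map.Entry<String, double[][]> polygon : polygonsByIri.entrySet()) {
                double[] xy = point.getValue();
                if (contains(polygon.getValue(), xy[0], xy[1])) {
                    triples.add("<" + point.getKey() + "> <" + SF_WITHIN + "> <" + polygon.getKey() + "> .");
                }
            }
        }
        return triples;
    }
}
```

Once such statements are loaded, a within query becomes an ordinary triple pattern lookup that needs no spatial computation at query time. The trade-off is the one named above: the nested loop is quadratic in the number of objects, and each materialized relation adds to the data volume.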

Conclusions
Linked data is a promising means to resolve the limitations concerning data integration and semantic heterogeneity of the current SDI solutions; thus, linked data has been seen as one of the key factors moving SDIs toward the next generation. The technical environment and support are important for deploying geospatial linked data. In this paper, we present an assessment and benchmarking concerning the spatial query capacities of five RDF stores, i.e., RDF4J, GeoSPARQL-Jena, Virtuoso, Stardog, and GraphDB. We tested the selected stores in two scenarios. One scenario involves benchmarking the RDF stores with ICOS CP metadata, a large-scale Earth Science data infrastructure in which geospatial data is integrated with other types of data. The other scenario is in a dedicated SDI environment with a large amount of purely geospatial data, which is a mixture of crowd-sourced and authoritative geospatial data. The queries used in this study are mainly from the Geographica benchmark. The results demonstrate that GeoSPARQL compliance has advanced dramatically in the last several years for the RDF stores, and query performances are generally acceptable. Furthermore, spatial indexing is important when querying a large number of geospatial objects. However, query correctness remains a challenge for cross-database interoperability.

Figure 1. Geographic locations of Integrated Carbon Observation System (ICOS) measurement stations.

Figure 2. Geospatial part of the ICOS metadata ontology. The concepts and relations without prefix annotation are from the ICOS metadata ontology.

Table 1. Benchmark queries for spatially enabled Resource Description Framework (RDF) stores in the first scenario with Integrated Carbon Observation System carbon portal (ICOS CP) metadata. Q1-Q5 are non-topological construct functions, Q7-Q17 (excluding Q14) are spatial selection queries, and Q18-Q27 are spatial join queries.

1. RDF4J 2.4.2: an open-source Java RDF framework under the license of Eclipse Distribution License, v1.0, formerly known as Sesame. It supports parsing, storing, inferencing, and querying RDF data. It supports SPARQL 1.1 and both ontological and rule-based reasoning. Inferred statements are materialized. It supports geospatial query in GeoSPARQL, and its spatial queries can be performed without spatial indexing or with Lucene Spatial (currently, Lucene Spatial in RDF4J results in errors). RDF4J can be used as an RDF store or a library that communicates and operates with many third-party storage solutions (RDF stores).
2. Jena 3.9.0 + GeoSPARQL-Jena 1.0.3: an open-source Java framework for building Semantic Web and linked data applications. It supports SPARQL 1.1 and both ontological and rule-based reasoning. It provides both RDF API, which manipulates RDF data, and TDB, an RDF store solution. Jena is one of the most widely adopted RDF frameworks in various research and production projects. Jena itself has very limited spatial query capacity and does not support GeoSPARQL. The recently developed open-source plugin GeoSPARQL-Jena (https://github.com/galbiston/geosparql-jena) provides fully GeoSPARQL-compliant spatial query capacity with a custom spatial indexing technique. Both Jena and GeoSPARQL-Jena are under Apache License 2.0.
3. Virtuoso Enterprise 8.2: one of the most well-known RDF stores because of its adoption by DBpedia. It supports SPARQL 1.1 and ontological and rule-based reasoning. The reasoning is performed by query rewriting, so inferred statements are not materialized. It has had geospatial query support for a few years, and it started to support GeoSPARQL in its commercial version in 2018 (it also claimed to support GeoSPARQL in its open-source edition, but, to date, no release has appeared, so we chose to use the commercial version). It uses R-tree as its spatial indexing technique. A proprietary license for the commercial edition and a GPL 2 license for the open-source version are used.
4. Stardog 6.0.1: a commercial knowledge graph product that supports parsing, storing, inferencing, and querying RDF data. It supports SPARQL 1.1 and both ontological and rule-based reasoning with a query rewriting strategy. It supports a few GeoSPARQL query functions with Lucene Spatial for spatial indexing. It is actively supported by a commercial company and uses proprietary licenses.
5. GraphDB 8.8.0: a linked data platform built upon RDF4J. It is a commercial solution that provides support for SPARQL 1.1 and ontological and rule-based reasoning. It supports GeoSPARQL with spatial indexing of Lucene Spatial (specifically, quad-prefix-tree and geohash-prefix-tree). It utilizes different strategies for handling queries with and without using a spatial index. GraphDB is under proprietary licenses.

Table 2. Qualitative analysis results of geospatial query support of the selected RDF stores.

Table 3. Loading time of each store for ICOS CP metadata.

Table 4. Average query response time of selected stores for benchmark queries with ICOS CP metadata (shortest response times in bold). The time unit is milliseconds. The results that differ from the results produced by JTS (ArcGIS for Q15) are shaded (see Section 6.2.2).

Table 5. Loading time of each store for Geographica datasets.

Table 6. Average query response time of selected stores for benchmark queries with Geographica datasets (shortest response time in bold). The time unit is seconds unless specified as hours.

Listing 1. Query syntax of Q23 in the first scenario in RDF4J, GeoSPARQL-Jena, Virtuoso, and GraphDB (without indexing).

Listing 2. Query syntax of Q23 in the first scenario in Stardog.