Mapping Spatiotemporal Data to RDF: A SPARQL Endpoint for Brussels

Abstract: This paper describes how a platform for publishing and querying linked open data for the Brussels Capital Region in Belgium is built. Data are provided as relational tables or XML documents and are mapped into the RDF data model using R2RML, a standard language for defining customized mappings from relational databases to RDF datasets. In this work, data are spatiotemporal in nature; therefore, R2RML must be adapted to produce spatiotemporal Linked Open Data. Data generated in this way are used to populate a SPARQL endpoint, where queries are submitted and the results can be displayed on a map. This endpoint is implemented using Strabon, a spatiotemporal RDF triple store built by extending the RDF store Sesame. The first part of the paper describes how R2RML is adapted to produce spatial RDF data and to support XML data sources. These techniques are then used to map data about cultural events and public transport in Brussels into RDF. Spatial data are stored in the form of stRDF triples, the format required by Strabon. In addition, the endpoint is enriched with external data obtained from the Linked Open Data Cloud, from sites like DBpedia, Geonames, and LinkedGeoData, to provide context for analysis. The second part of the paper shows how the endpoint can be exploited, through a comprehensive set of queries written in stSPARQL, the spatial extension to SPARQL.


Introduction
The Semantic Web (a term coined by Tim Berners-Lee) aims at providing a common framework that allows data to be shared and reused across applications and to be consumed by machines rather than human beings. The idea behind this is to transform the current "Web of Documents" (the "Syntactic Web") into a "Web of Data" (in the traditional database sense). This requires a technology stack, denoted the "Semantic Web Stack" [1], that enables people to create data stores on the web, build vocabularies, and write rules for handling data. To this end, the World Wide Web Consortium (W3C) (http://www.w3.org/) develops and defines standards with the vision of building a Web of Linked Data. Linked Data (http://www.linkeddata.org) refers to a set of best practices for publishing and interlinking structured data (denoted as resources) on the web, where resources are represented and described using the Resource Description Framework (RDF) [2] (see below). These best practices are known as the Linked Data principles and consist of: (a) using Internationalized Resource Identifiers (IRIs) as names for things; (b) using HTTP IRIs, to make it easy to look up the names in (a); (c) providing useful information when an IRI is looked up, using standards (e.g., RDF for modeling, SPARQL for querying); and (d) including links to other IRIs, to discover further resources.
Linked Data are empowered by technologies such as RDF and SPARQL, among others. RDF, the data model for the Semantic Web, expresses assertions over resources identified by an IRI. These assertions have the form of subject-predicate-object triples, where the subject is an IRI-identified resource or a blank node, the predicate is an IRI-identified resource, and the object can be a resource or a data value. A blank node is a special kind of node representing an anonymous resource, typically with a structural function. Data values in RDF are called literals. Many formats for RDF serialization exist. This paper adopts Turtle [3] and assumes that the reader is familiar with this notation. SPARQL 1.1 [4] is the W3C standard language for querying RDF data. SPARQL stands for SPARQL Protocol and RDF Query Language. The query evaluation mechanism of SPARQL is based on subgraph matching: RDF triples are interpreted as nodes and edges of directed graphs, and the query graph is matched to the data graph, instantiating the variables in the query graph definition. The selection criteria are expressed as a graph pattern in the WHERE clause, consisting basically of a set of triple patterns connected by the "." operator. Further, SPARQL 1.1 supports aggregate functions and the GROUP BY clause, essential for analytical queries.
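The subgraph-matching evaluation described above can be illustrated with a small sketch. This is a toy illustration and not a SPARQL engine: triples are plain Python tuples, variables are strings starting with "?", and the `ex:` identifiers and sample data are hypothetical names introduced only for this example.

```python
# Toy illustration of basic graph pattern matching (not a SPARQL engine).
# All "ex:" names and the sample data below are hypothetical.

def is_var(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Returns the extended binding, or None if the match fails.
    """
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new

def evaluate(patterns, data):
    """Join the triple patterns one by one (the '.' operator in WHERE)."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in data
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return solutions

DATA = [
    ("ex:event1", "ex:takesPlaceAt", "ex:bozar"),
    ("ex:event2", "ex:takesPlaceAt", "ex:flagey"),
    ("ex:bozar", "ex:name", "Bozar"),
]
# Roughly: SELECT * WHERE { ?e ex:takesPlaceAt ?p . ?p ex:name ?n }
QUERY = [("?e", "ex:takesPlaceAt", "?p"), ("?p", "ex:name", "?n")]
RESULT = evaluate(QUERY, DATA)
```

Only the first event yields a solution, because the second pattern requires the place to have a name; this is the sense in which the shared variable ?p joins the two triple patterns.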
In addition to the above, the Open Knowledge Foundation (https://okfn.org/) defines "Open Data" as data that can be freely used, shared, and built on by anyone, anywhere, for any purpose. Putting together the concepts of Linked Data and Open Data yields the notion of "Linked Open Data" (LOD). Publicly-available linked datasets are depicted in the Linked Open Data Cloud (https://lod-cloud.net/), which is updated and maintained by the Insight Centre for Data Analytics (https://www.insight-centre.org/). Crucial issues that arise in an LOD scenario are data acquisition, integration, and exploitation. Most data on the web are obtained from relational databases, although there are other kinds of data sources from which data could be acquired. Several mechanisms exist to produce RDF data from a wide variety of data sources. These mechanisms are studied in this paper. Data on the web can be made available in many different formats and can be accessed in different ways. For example, data represented as XML or JSON documents can be obtained through specialized RESTful APIs, while data represented and stored using the RDF data model can also be published and accessed through SPARQL endpoints (in a nutshell, services that accept SPARQL queries and return results). RDF data could also be extracted from RDF-embedded HTML pages (called RDFa) (http://www.w3.org/TR/xhtml-rdfa-primer/). Further, the Semantic Web is increasingly being populated with geospatial data [5][6][7][8], particularly in the fields of Earth and Environmental science. Probably the main example in this sense is LinkedGeoData [9], which exposes data from OpenStreetMap (https://www.openstreetmap.org) in RDF format, to be queried using SPARQL. Furthermore, in the United Kingdom, the Ordnance Survey (https://www.ordnancesurvey.co.uk/) (the British mapping agency) is publishing geospatial datasets. The research community has also produced relevant work in the field [10][11][12][13].
In light of the above, this paper addresses the problem of capturing (possibly spatial) data from different sources (e.g., relational databases, XML documents, web APIs, and SPARQL endpoints), producing (spatial) RDF data, and integrating and exposing such data in a spatially-enabled SPARQL endpoint. Many challenges appear in this setting and are studied throughout the paper. For this, a real-world case study is used, which refers to an LOD platform for the region of Brussels, in Belgium. The goal of this platform is two-fold. On the one hand, data providers are encouraged to publish their data, since the semi-automatic mapping tools described in this paper allow doing this with minimal effort and cost. On the other hand, users can query the data over the web, and applications can consume these data, allowing developers to provide added-value services. This case study is part of a larger project funded by the Brussels Capital Region, which aims at developing a prototype for integrating data about cultural events in Brussels, provided by two partners, Agenda.be (https://agenda.brussels/en) and Bozar (https://www.bozar.be/), with the public transport schedule provided by the main public transport company in the city, the STIB (http://www.stib.be/). The integrated data are offered to the public over a SPARQL endpoint and can be consumed for direct analysis or by applications. One use case in this project (whose details are outside the scope of this paper) consists of developing an application that can, for instance, take a picture of a place and, after generating the appropriate SPARQL query, obtain the events taking place there within the next two hours and explain how to reach them from the user's current location using public transport. The problem addressed here tackles only the data infrastructure to support applications of this kind. The paper provides an in-depth discussion and description of the deployment and querying of the spatial SPARQL endpoint, backed by the Strabon triple store, a geospatial database management system (DBMS) that supports the stRDF representation language (a spatiotemporal extension to RDF) and comes equipped with a spatial extension to SPARQL, called stSPARQL.
The case study introduced above involves several challenges and allows interesting querying possibilities that are explored and discussed in this paper. First, the cultural and public transport data used in the project come in the form of relational tables and XML documents. However, only data from STIB include spatial features like coordinates of places. Therefore, spatial data from the other providers must be produced from alphanumeric data (e.g., addresses). Second, to be openly accessed and linked, data need to be mapped into RDF triples, which, in the solution proposed here, is performed using the R2RML mapping language (http://www.w3.org/TR/r2rml/). Since R2RML directly supports neither XML data sources nor spatial data, the R2RML mapping language must be extended in order to tackle both issues, and this is discussed in the paper. Further, to provide context for analysis, data are captured from external sources (DBpedia, Geonames, and LinkedGeoData) and put into stRDF format. Finally, a large part of the paper is devoted to discussing a comprehensive set of stSPARQL queries, which show how the endpoint can be exploited for analysis. This set of queries is classified into subsets with different characteristics, for example, queries requiring aggregation, spatial buffering, and so on.
The remainder of the paper is organized as follows: Section 2 introduces Strabon and discusses related work on geospatial RDF data stores. Section 3 describes the R2RML mapping language and explains how it is extended to support spatial data present in XML and relational data sources. It also shows how this technique is applied to data provided by the partners in the project. In addition, related work on mapping data to RDF is briefly discussed. Section 4 shows how data from external sources (DBpedia, Geonames, and LinkedGeoData) were added to the triple store to provide context for querying and for dataset enrichment. Section 5 provides extensive example queries over the endpoint. Section 6 concludes the paper.
Note: A preliminary version of this paper has been published as two extended abstracts [14,15], briefly describing the project's features. The present paper substantially expands and updates such work, providing a full, detailed explanation of each step of the methodology and a comprehensive set of queries that shows the functionality and usefulness of the case study.

Geospatial Triple Stores
The availability of geospatial data on the Semantic Web has triggered the interest of the research community in developing spatial (and even spatiotemporal) extensions to SPARQL, which resulted in the GeoSPARQL standard. Other proposals based on this standard exist. One of them is Strabon (and its accompanying language, denoted stSPARQL), developed almost in parallel with GeoSPARQL. Strabon is also the system used for the work presented in this paper. This section discusses the reasons for this decision and briefly comments on the characteristics of other representative geospatial triple stores.

GeoSPARQL
GeoSPARQL (https://www.opengeospatial.org/standards/geosparql) is the Open Geospatial Consortium's (OGC) (http://www.opengeospatial.org/) standard extension to SPARQL. GeoSPARQL supports the representation and querying of geospatial data on the Semantic Web, defining a vocabulary for representing geospatial data in RDF and an extension to SPARQL. GeoSPARQL and Strabon (discussed below) were developed independently at around the same time, resulting in very similar representational and querying constructs. GeoSPARQL represents geometries as literals of a certain data type. These literals may be encoded in various formats like GML, well-known text (WKT), and so on. Furthermore, like Strabon, GeoSPARQL maps spatial predicates and functions that support spatial analysis to SPARQL extension functions, although GeoSPARQL also allows binary topological relations to be used as RDF properties. Note, however, that GeoSPARQL does not provide aggregate functions, a crucial feature for data analysis, which is the main goal of the work in this paper, as can be seen in Section 5. On the other hand, Strabon was developed based on SPARQL 1.1, which, unlike previous SPARQL versions, includes SQL-like aggregate functions.

The Strabon Triple Store
Strabon (http://www.strabon.di.uoa.gr/) [16][17][18][19] is an RDF data store with spatiotemporal support. That is, Strabon can be used to store and query linked geospatial data that change across time. Queries in Strabon can be expressed using two well-known SPARQL extensions: stSPARQL and a subset of GeoSPARQL. The former can be used to query data represented in an extension of RDF called stRDF. Both stRDF and stSPARQL are designed to represent and query geospatial temporal data (e.g., the reduction of forests in a region over the years due to uncontrolled exploitation). On the other hand, Strabon also supports querying geospatial RDF data using a subset of GeoSPARQL. At the time of writing this paper, this includes the core, geometry extension, and geometry topology extension. However, it is remarked again that stSPARQL provides geospatial aggregate functions, while GeoSPARQL does not. This feature is crucial for data analysis and one of the main reasons for choosing stSPARQL for this project. Further, the temporal dimension is also not addressed by GeoSPARQL (https://event.cwi.nl/eswc2015-geo/03-stsparql-geosparql.pdf). Strabon supports spatial data types, allowing geometric objects to be represented by means of the OGC standards WKT [20] and GML [21]. This is achieved through the definition in stRDF of the strdf:WKT and strdf:GML data types. WKT stands for well-known text, a text markup language for representing vector geometry objects on a map. GML is the XML grammar defined by the OGC to express geographical features.
Strabon provides spatial and temporal selections and joins and a collection of spatial functions like the ones included in geospatial relational database systems, also supporting many different coordinate reference systems. Further, since Strabon originally extended the RDF store Sesame (currently known as RDF4J) (http://rdf4j.org/), it can handle alphanumerical and spatial RDF data stored in a PostGIS [22] backend.
The stSPARQL query language extends SPARQL with spatial functions. These functions can be used in three parts of a SPARQL query: the SELECT, FILTER, and HAVING clauses. Arguments of these functions are spatial terms, which can be of four kinds: (a) a spatial type literal with data type strdf:geometry or its subtypes; (b) a query variable that can be bound to a spatial literal; (c) the result of a set operation on spatial literals (e.g., intersection, union, etc.); (d) the result of a geometric operation on spatial terms (e.g., buffer). Furthermore, a Boolean SPARQL extension function is supported for each topological relation defined in the OGC-SFA (topological relations for simple features) [23]. An example of a spatial join in a FILTER clause using two variables as arguments is the expression: strdf:contains(?geoA, ?geoB). An example of a spatial function used in the SELECT clause of a query can read: SELECT (strdf:buffer(?river, 0.04) AS ?buffer).
The expression above computes a buffer around the geometry bound to the variable ?river, which represents a river.
Finally, aggregate functions that deal with geospatial data are also supported as follows:
• strdf:union(set of strdf:geometry a) returns a geometry that is the union of the set of input geometries.
• strdf:intersection(set of strdf:geometry a) returns a geometry that is the intersection of the set of input geometries.
• strdf:extent(set of strdf:geometry a) returns a geometry that is the minimum bounding box of the set of input geometries.
In addition to the above, insertion, deletion, and update of stRDF triples are supported by stSPARQL. This is however outside the scope of this paper.
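To make the semantics of the extent aggregate concrete, the following minimal sketch computes the minimum bounding box of a set of points and serializes it as a WKT polygon. It only illustrates the geometric idea for point inputs; it is not Strabon's implementation, and the coordinates are hypothetical values chosen for the example.

```python
# Conceptual sketch of what an "extent" aggregate computes for point inputs.
# Not Strabon code; the sample coordinates are hypothetical.

def extent(points):
    """Return (min_x, min_y, max_x, max_y) for a non-empty list of (x, y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_to_wkt(bbox):
    """Serialize a bounding box as a closed WKT polygon ring."""
    min_x, min_y, max_x, max_y = bbox
    corners = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # a WKT ring must end where it starts
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"

# Hypothetical (longitude, latitude) pairs.
STOPS = [(4.35, 50.85), (4.40, 50.83), (4.37, 50.88)]
BBOX = extent(STOPS)
BBOX_WKT = bbox_to_wkt(BBOX)
```

In an analytical query, such an aggregate would typically be combined with GROUP BY, so that one bounding box is returned per group rather than one for the whole dataset.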

Other Geospatial RDF Stores
In addition to Strabon, there are other geospatial RDF data stores. The oldest ones are Parliament [24] and uSeekM [25]. The former supports a large portion of GeoSPARQL. Like Strabon, it supports the WKT and GML serializations. OpenSahara's uSeekM is based on the RDF store Sesame, storing and querying spatial data using PostGIS. Although most GeoSPARQL functionalities are supported, it does not use IRIs for CRS (coordinate reference systems); therefore, it does not satisfy the project's requirements.
On the NoSQL-Big Data side, RDF4J (https://projects.eclipse.org/projects/technology.rdf4j) is a Java-based RDF framework for Linked Data. As of Version 2.4.3, released in late 2018, RDF4J provides geospatial functionality and GeoSPARQL support. However, besides not being mature enough, this functionality was not available at the time of starting the project described here, and thus, RDF4J was not considered as a candidate for the geospatial RDF data backend for the project. The same occurred with Ontotext's GraphDB (http://graphdb.ontotext.com/documentation/free/) v8.6.1 (previously OWLIM), a NoSQL semantic graph database with geospatial support (although it just supports WKT geometry serialization and the WGS84 CRS).
Finally, there are other semantically-enabled systems with limited geospatial capabilities, which, for this reason, were not considered as candidates for the project. Examples of these systems are OpenLink Virtuoso (https://virtuoso.openlinksw.com//) and Allegro Graph (https://franz.com/agraph/allegrograph/).
From the analysis of the systems mentioned above, and taking into account functionality, standard support, and its open source condition, Strabon was adopted as the spatial RDF triple store for the work described in the sequel.

Extending R2RML Mapping for Spatial Data and XML Support
Since most Semantic Web data come from relational databases, several proposals exist, aimed at producing RDF data from relational data stored in different repositories. Two problems arise here. On the one hand, data have to be translated (mapped) from the relational format to the RDF data model. On the other hand, one can choose between storing RDF data in a data store or producing RDF triples on-the-fly. Several proposals and standards have been developed for this. The most relevant of them are discussed below. Section 3.1 studies the R2RML mapping language, the standard language chosen for this project, while Section 3.2 comments on other proposals and compares them against R2RML. Sections 3.3 and 3.4 present the XML and spatial extensions to R2RML developed for the project.

The R2RML Mapping Language
The W3C standard for mapping relational to RDF data, denoted R2RML [26], allows defining a customized mapping whose result is a collection of RDF triples that represents all or a portion of a relational database. The mapping itself is written in Turtle syntax. The main object of an R2RML mapping is the "triples map". Each triples map yields a collection of triples, and it is composed of a "logical table", a "subject map", and zero or more "predicate object maps". A logical table can be either a table name (defined using predicate rr:tableName) or an SQL query (defined using predicate rr:sqlQuery). A predicate object map is composed of a predicate map and an object map. Subject maps, predicate maps, and object maps can be constants (using predicate rr:constant), column-based maps (using predicate rr:column), or template-based maps (using predicate rr:template).
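The way a triples map turns rows of a logical table into triples can be sketched as follows. This is a simplified illustration under stated assumptions, not an R2RML processor: the mapping is held in a plain dictionary rather than parsed from Turtle, the `http://example.org/` IRIs and the column names are hypothetical, and percent-encoding with `urllib.parse.quote` only approximates the IRI-safe encoding that R2RML prescribes for template values.

```python
# Simplified illustration of applying a triples map to rows.
# Not an R2RML processor; IRIs and column names are hypothetical.
import re
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def expand_template(template, row):
    """Replace each {column} placeholder with the percent-encoded row value."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: quote(str(row[m.group(1)]), safe=""),
        template,
    )

# One subject map (template-based) and two predicate object maps:
# a column-based object map and a constant object map.
TRIPLES_MAP = {
    "subject_template": "http://example.org/event/{id}",
    "predicate_object_maps": [
        ("http://example.org/ontology#name", {"column": "title"}),
        (RDF_TYPE, {"constant": "http://example.org/ontology#Event"}),
    ],
}

def apply_triples_map(tmap, rows):
    """Generate (subject, predicate, object) tuples for every row."""
    triples = []
    for row in rows:
        subject = expand_template(tmap["subject_template"], row)
        for predicate, object_map in tmap["predicate_object_maps"]:
            if "constant" in object_map:
                obj = object_map["constant"]
            else:
                obj = row[object_map["column"]]
            triples.append((subject, predicate, obj))
    return triples

ROWS = [{"id": 234217, "title": "Guillemette Laurens"}]
TRIPLES = apply_triples_map(TRIPLES_MAP, ROWS)
```

Each row produces one subject and as many triples as there are predicate object maps, which is why a triples map with zero predicate object maps yields no triples in this sketch.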

Listing 1: Mapping a
In summary, given a relational database, to translate it into RDF triples, a mapping file containing the R2RML specification is created. This mapping is then applied to a given instance of the relational database.
In this paper, the mappings are generated using a common vocabulary and an ontology for the cultural domain, defined using the GOSPL tool [27]. GOSPL is a collaborative ontology evolution methodology (with an associated tool), which supports stakeholders in interpreting and modeling their common ontologies in their own terminology and context, also feeding back results to the owning community. In this case, the ontology was created in a collaborative way by all the project's participants, in an evolving manner. To illustrate the need for this ontology, consider for example that a key element in the cultural domain, namely an event, is denoted Event in the Agenda.be data and Activity in the Bozar database. Further, the cultural ontology is extended to support elements not particularly belonging to the cultural domain (defined mainly by the partners involved in these issues). This way, for instance, references to places and locations of an event must be included in the ontology. The same occurs with dates, event categories, and others. Table 1 shows some examples.
Finally, a mapping engine takes the database instance and the mapping file to produce the RDF document. Given that the standard specification of R2RML does not support spatial data, in order to satisfy the requirements of the problem at hand, the scheme described above was extended as explained in Section 3.4. For completeness and to put this decision in context, other mapping options are discussed in the next section.

Other Relational to RDF Mapping Tools and Languages
R2RML is not the only language for mapping relational to RDF data. Therefore, other options are discussed next, to understand the rationale for using R2RML in this project.

RML
The RDF Mapping Language (RML) [28,29] has been proposed to overcome R2RML's limitation on the kind of supported data sources. For this, RML is designed as a superset of R2RML whose main feature consists of a vocabulary that defines a generic data source instead of the three kinds allowed in R2RML. RML extends the notion of the logical table, defined in R2RML, and introduces the notion of the logical source, with the property rml:logicalSource, which can be any data input. It also defines the properties rml:source, rml:reference, and rml:referenceFormulation, instead of R2RML's rr:tableName, rr:column, and rr:sqlQuery, respectively. It also defines an iterator, denoted rml:iterator. However, unlike R2RML, RML is not a standard recommendation; thus, for this work, R2RML was chosen.

Direct Mapping
Direct mapping (DM) [30] was the first W3C standard for mapping relational to RDF data sources. It is the simplest mapping method for such purposes. A direct mapping takes as input a relational schema and an instance of such a relational schema and produces an RDF graph as the output. Each table defines a class, using the rdf:type predicate. Each row in a table produces a set of triples with a common subject, which is an IRI formed from the concatenation of the base IRI, the table name, the primary key column name, and a primary key value. To indicate a foreign key, a reference triple is produced with a predicate formed by the concatenation of the table name, a "ref" string, and the referencing column name. As an example, the triples in Listing 2 were produced by mapping two tables, namely Laboratories and Researchers.
DM is very easy to learn and use and produces as the output an RDF graph that reflects exactly the same structure as the one of the source database. However, compared to R2RML, the main drawback of DM is that it does not provide a mapping vocabulary; therefore, it does not allow using existing vocabularies to describe the data to be translated, since the names of all the predicates are the column names of the relational tables. Therefore, it would not be possible to take advantage of existing ontologies, which rules out DM as a candidate to be used in the work presented in this paper.
Listing 2: An example of a Direct Mapping
@base <http://ulb.ac.be/db/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
...
<Laboratories/LID-1> rdf:type <Laboratories> ;
    <Laboratories#LID> 1 ;
    <Laboratories#Lname> "Web and Information Technologies" ;
    <Laboratories#Llocation> "Building A.5" ;
    <Laboratories#Head_ID> 5 ;
    <Laboratories#ref-Head_ID> <Researchers/RID-5> .
## ...
<Researchers/RID-5> rdf:type <Researchers> ;
    <Researchers#RID> 5 ;
    <Researchers#Rname> "Jack" ;
    <Researchers#Lab_ID> 1 ;
    <Researchers#ref-Lab_ID> <Laboratories/LID-1> .
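The row-to-triples rule described for direct mapping can be sketched in a few lines. The sketch follows the IRI naming shown in Listing 2 (table name, primary key column, and key value joined as Table/PK-value), which is an assumption taken from that listing rather than from the W3C specification text; the function name and the foreign key dictionary are hypothetical helpers introduced for illustration.

```python
# Sketch of a direct-mapping style translation for one table row.
# IRI shapes follow Listing 2; helper names are hypothetical.

BASE = "http://ulb.ac.be/db/"

def direct_map_row(table, pk, row, foreign_keys=None):
    """Return the triples for one row.

    foreign_keys maps a referencing column to (referenced table, its PK column).
    """
    foreign_keys = foreign_keys or {}
    subject = f"{BASE}{table}/{pk}-{row[pk]}"
    # Each table defines a class; each row is an instance of it.
    triples = [(subject, "rdf:type", f"{BASE}{table}")]
    for column, value in row.items():
        # One literal triple per column, with the column name as predicate.
        triples.append((subject, f"{BASE}{table}#{column}", value))
        if column in foreign_keys:
            ref_table, ref_pk = foreign_keys[column]
            target = f"{BASE}{ref_table}/{ref_pk}-{value}"
            # One reference triple per foreign key column.
            triples.append((subject, f"{BASE}{table}#ref-{column}", target))
    return triples

LAB = {
    "LID": 1,
    "Lname": "Web and Information Technologies",
    "Llocation": "Building A.5",
    "Head_ID": 5,
}
LAB_TRIPLES = direct_map_row(
    "Laboratories", "LID", LAB, {"Head_ID": ("Researchers", "RID")}
)
```

Note that every predicate is derived from a table or column name, which is precisely the limitation discussed above: there is no place in this procedure where an external vocabulary could be plugged in.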

The D2RQ Platform
The D2RQ Platform (http://d2rq.org/) is a system for accessing relational databases as virtual, read-only RDF graphs. It offers RDF-based access to the content of relational databases without having to replicate it into an RDF store. D2RQ allows querying a non-RDF database using SPARQL, accessing the content of the database as Linked Data over the Web, and creating custom RDF dumps. The platform consists of three parts: (a) the D2RQ Mapping Language, with similar goals as R2RML; (b) the D2RQ Engine, which uses the mappings to rewrite Jena API calls as SQL queries against the database, passing the query results up to the outer layers of the framework; (c) the D2R Server, which provides, among others, the SPARQL endpoint. Requests from the web are translated on-the-fly into SQL queries using the D2RQ mapping. Therefore, it is possible to query the sources in SPARQL without using an RDF triple store. Furthermore, as mentioned above, an RDF dump could be produced. From the above, it follows that R2RML and D2RQ could work together in both ways: on-the-fly and as an RDF dump of the triples produced by the mapping. Regarding the project described in the present work, it was considered that extending the D2R Server to support spatiotemporal data would be, at the very least, a very hard task. Further, the solution adopted here makes it easier to support XML data sources.

Extending R2RML Sources for XML Support
As mentioned above, R2RML was originally designed to receive relational data as input, through three mechanisms: a table name, an SQL query, and an R2RML view. Therefore, a different mechanism should be used to cope with XML documents. The trivial solution would be to transform the XML document into a table. This solution may work for simple XML documents. However, when this is not the case (like in the Agenda.be data used in this project), this approach would not suffice. Therefore, the approach discussed here enriches the R2RML specification with special terms that account for XML documents. Although this solution actually modifies the standard in some sense, the modification proposed here is just a vocabulary extension, which does not prevent any standard mapping from being defined. The idea is to allow the user to provide as input a list of XPath expressions, such that the first element in this list gives the context node for all the expressions that follow. To extend R2RML, a new source rr:xpathQuery was defined in the R2RML namespace, used in the rr:logicalTable blank node and taking a string literal as the associated object. This string contains a set of valid XPath expressions separated by the "~" character (chosen since it is not allowed in XPath expressions). It is worth remarking that, although the new predicate belongs to the R2RML namespace, a local namespace could have been used as well (e.g., http://ulb.ac.be/ontology/), to remain independent of possible future R2RML updates. The new predicate looks as follows: rr:xpathQuery """//document~./professor~./name~./worksIn""";
As an example, consider the data provided for this project by Agenda.be, a service that informs about cultural and artistic events in Belgium's capital city and the institutions where such events take place. Data were provided by means of two XML documents: events.xml, containing the events and their description, and institutions.xml, containing the institutions mentioned above. Both documents were related to each other in such a way that the place and organizer appearing in events.xml correspond to an institution appearing in institutions.xml. For example, consider the event with ID 234217 in events.xml, corresponding to the presentation by mezzo-soprano Guillemette Laurens, indicated in Listing 3. The event is organized by an institution with ID 233287. The mapping produces the RDF triples in Listing 5, which state that the event is organized by an institution with ID 233287, which is called Opus 3, and takes place at the Conservatoire Royal de Bruxelles, which has ID = 471. The semantics used in the translation is given by the ontology created using the GOSPL tool.
Listing 5: Triples produced by the mapping in Listing 4.
Data in the XML files also contain spatial information. The mapping of spatial data in XML documents is addressed in the next section.
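The XPath-based logical table of Section 3.3 can be pictured with the following sketch, which splits a "~"-separated query string, uses the first expression as the context, and evaluates the remaining expressions relative to each context node. It is only an illustration under stated assumptions: it relies on the limited XPath subset of Python's `xml.etree.ElementTree`, it is not the mapping engine used in the project, and the element names (`event`, `id`, `title`, `organizer`) and the sample document are hypothetical rather than the actual Agenda.be schema.

```python
# Sketch: evaluating a "~"-separated list of XPath expressions as a logical table.
# Uses ElementTree's limited XPath subset; element names are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE_XML = """<events>
  <event>
    <id>234217</id>
    <title>Recital</title>
    <organizer>233287</organizer>
  </event>
</events>"""

def parse_xpath_query(query):
    """Split the query into the context expression and the column expressions."""
    parts = [p.strip() for p in query.split("~") if p.strip()]
    return parts[0], parts[1:]

def logical_rows(xml_text, query):
    """Return one dict per context node, keyed by the column expressions."""
    context, columns = parse_xpath_query(query)
    root = ET.fromstring(xml_text)
    # ElementTree cannot evaluate absolute paths on an element, so a leading
    # "/" is made relative to the root (an approximation of "//x").
    if context.startswith("/"):
        context = "." + context
    return [
        {column: node.findtext(column) for column in columns}
        for node in root.findall(context)
    ]

QUERY = "//event~./id~./title~./organizer"
ROWS = logical_rows(SAMPLE_XML, QUERY)
```

Each resulting dictionary plays the role of a row of the logical table, so the rest of the mapping (subject and predicate object maps) can refer to the XPath expressions in the same way it would refer to column names.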

Mapping Spatial Data
This section describes the mapping of spatial data into RDF aimed at providing LOD for Brussels. Although there are proposals to map geospatial data to RDF automatically [12,13], based on R2RML and RML, in the remainder, it will become clear that the problem is too complex to be solved just using automated tools. Further, as explained above, RML is not a standard, and a design decision was to work with standard tools.
Three datasets were used for this purpose, available from three Belgian organizations: Agenda.be, Bozar, and STIB. Different places of interest (PoIs) were included in these datasets. In order to geolocate these PoIs, longitude and latitude were needed. Therefore, a conversion from full addresses of the PoIs (containing street, number, city, and postal code) into geographic coordinates was needed, a process known as geocoding. Many open-source geocoding tools are available, for example MapQuest Open (https://developer.mapquest.com/documentation/open/nominatimsearch/), OpenCage Geocoder (https://opencagedata.com/), LocationIQ (https://locationiq.com/), and OpenStreetMap's Nominatim (https://nominatim.openstreetmap.org/). Section 3.3 showed how XML data in the events.xml file provided by Agenda.be were mapped into RDF. However, the other XML document, namely institutions.xml, also contained spatial data (the institutions' addresses). Mapping this spatial information coming from an XML document requires a previous step, explained next.
Institutions and their addresses in the source XML file are represented as shown in Listing 6.
Listing 6: An institution in the institutions.xml file.
The institutions' addresses must be transformed into spatial points. Since the geocoding API must receive a string representing the full address, the information in Listing 6 must be concatenated to produce the input to the geocoding process, for example, to get something like: "r.Fumal 28 5000 Namur". To avoid transforming XML data into relational tables, XPath (http://www.w3.org/TR/xpath/) was used instead of SQL to define the logical tables in the mapping document, as explained in Section 3.3. However, XPath does not allow creating a new node containing the full address, which is needed in order to obtain the corresponding spatial coordinates. Thus, XSLT (http://www.w3.org/TR/xslt) was used to produce a temporary file with such full addresses. The XSLT code is shown in Listing 7. The code browses the XML file, and each node is simply copied into a new file. When a node like "Institution_Street_FR" is reached, the zip code and city terms are concatenated, and a new node is created, with the name "Institution_Full_Address" (although data come in Dutch and French, the latter was used to find the coordinates). The temporary file is then used as the original file on which the mapping is performed. After this process, a spatial mapping file can be produced and applied over the new file. The mapping file is shown in Listing 9. Note that only the part relevant to the spatial mapping is shown. An XPath query was used as the logical table. The spatial mapping produced triples in stRDF format, where the predicate is <http://www.w3.org/2003/01/geo/wgs84_pos#geometry>, and the object is a geometry in the WKT stRDF format. The following URL: http://www.opengis.net/def/crs/EPSG/0/4326 represents the spatial reference system used.
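The preprocessing chain described above (build the full address, geocode it, emit a WKT literal) can be sketched end to end as follows. The paper performs the first step with XSLT; this sketch mirrors the same logic in Python purely for illustration. Several names are assumptions: the element names `Institution_Zip` and `Institution_City_FR` are hypothetical (only `Institution_Street_FR` and `Institution_Full_Address` appear in the text), the coordinates in the canned response are invented, the query-string form of the Nominatim search request (`q`, `format`, `addressdetails`, `limit`) is used instead of the path form, and the literal layout (WKT, then ";", then the CRS IRI, typed as strdf:WKT, longitude first) is assumed from common stRDF usage. No network request is made.

```python
# Sketch of the address-to-geometry preprocessing; no network access.
# Hypothetical element names and coordinates are marked as such below.
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

CRS = "http://www.opengis.net/def/crs/EPSG/0/4326"

# Institution_Zip and Institution_City_FR are hypothetical element names.
SAMPLE_XML = """<Institution>
  <Institution_Street_FR>r.Fumal 28</Institution_Street_FR>
  <Institution_Zip>5000</Institution_Zip>
  <Institution_City_FR>Namur</Institution_City_FR>
</Institution>"""

ADDRESS_TAGS = ("Institution_Street_FR", "Institution_Zip", "Institution_City_FR")

def add_full_address(institution):
    """Concatenate the address parts and store them in a new child node."""
    parts = [(institution.findtext(tag) or "").strip() for tag in ADDRESS_TAGS]
    full = " ".join(p for p in parts if p)
    ET.SubElement(institution, "Institution_Full_Address").text = full
    return full

def geocoding_url(address):
    """Build a Nominatim search URL for a free-form address."""
    params = {"q": address, "format": "json", "addressdetails": 1, "limit": 1}
    return "https://nominatim.openstreetmap.org/search?" + urlencode(params)

def first_point(response_text):
    """Extract (lon, lat) from a JSON result list, or None if it is empty."""
    results = json.loads(response_text)
    if not results:
        return None
    return float(results[0]["lon"]), float(results[0]["lat"])

def to_strdf_wkt(lon, lat):
    """Format a point as a WKT literal with an explicit CRS (assumed layout)."""
    return f'"POINT({lon} {lat});<{CRS}>"^^strdf:WKT'

institution = ET.fromstring(SAMPLE_XML)
FULL_ADDRESS = add_full_address(institution)
URL = geocoding_url(FULL_ADDRESS)
# Canned response with invented coordinates, standing in for the API reply.
FAKE_RESPONSE = '[{"lat": "50.47", "lon": "4.86"}]'
LITERAL = to_strdf_wkt(*first_point(FAKE_RESPONSE))
```

Keeping the geocoding step outside the mapping itself means the mapping file only ever sees an ordinary node (the full address or the resulting coordinates), which is what allows the vocabulary extension to stay small.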
Listing 10 depicts the mapping of the full address for the institution with ID 24, obtained by applying the process explained above. The workflow of the whole process is given in Figure 1. The coordinates were obtained by applying geocoding to the element Institution_Full_Address (note that the "a" in Turtle means type, for example, institution is of type address). For example, the address above can be obtained using OpenStreetMap's search API as follows: https://nominatim.openstreetmap.org/search/r.Fumal%2028%20Namur?format=json&addressdetails=1&limit=1&point_geojson.

The second dataset to be mapped was the one belonging to Bozar. Bozar is Brussels' Center of Fine Arts, hosting cultural events, concerts, and exhibitions all year long. The logical table in the Bozar mapping file was an aggregation of several tables dealing with activities. Addresses were stored in a table denoted location_lng, with schema (id, lng, field, content). The field attribute contained the types of the components of an address (e.g., zip code, city, etc.). Table 2 shows an example of the location_lng table. The five tuples in Table 2 represent a single address. In order to perform the mapping, these five tuples must be concatenated. This concatenation is performed by an SQL query and yields the tuple "Mussé d Ixelles 71 JeanVanVolsem 1050 Ixelles Belgique", over the schema (fullAddress, id), as shown in Table 3. Over this result, the mapping in Listing 11 is applied, and the result is shown in Listing 12.

The third dataset to be mapped corresponded to STIB, the main public transport company in Brussels. Data from STIB were stored in an SQL database that consisted of four tables: Block, Stop, Trip, and Tripstop. A Block represents the time elapsed between the moments when a vehicle leaves the warehouse and returns to it. The Stop table contains the name (in French and Dutch) and location of a bus, tram, or metro stop. A Trip is a part of a block, defined as the path between the starting point and the ending point on a particular route. Finally, Tripstop tells the time during a trip when the vehicle reaches a stop. Figure 2 shows the interaction between components and describes the process in more detail. Tables 4, 5, and 6 show instances of the dataset. Table 7 shows how the concepts in the ontology map to the columns in the tables of the STIB database. The spatial mapping in the STIB case was straightforward, because the longitude and latitude of each stop were given; therefore, each point could be computed through an SQL query. Listing 13 shows a portion of the mapping file for a stop.
Listing 13: STIB mapping for a stop.

It can be seen that the POINT geometry was represented in WKT. Furthermore, the latitude and longitude coordinates were produced using the mapping.
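The two SQL-side concatenations described above (assembling a Bozar address from its component rows and building a WKT point from a stop's coordinates) can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the values of the `field` column, the `stop` column names, and all sample rows are hypothetical, and the queries actually used in the mapping files may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Schema (id, lng, field, content) is taken from the text;
    -- the field values and the rows are hypothetical.
    CREATE TABLE location_lng (id INTEGER, lng TEXT, field TEXT, content TEXT);
    INSERT INTO location_lng VALUES
        (1, 'fr', 'name',   'Example Hall'),
        (1, 'fr', 'number', '10'),
        (1, 'fr', 'street', 'Example Street'),
        (1, 'fr', 'zip',    '1000'),
        (1, 'fr', 'city',   'Brussels');

    -- Hypothetical column names for a stop with known coordinates.
    CREATE TABLE stop (stop_id INTEGER, longitude REAL, latitude REAL);
    INSERT INTO stop VALUES (1, 4.3525, 50.8467);
    """
)

# Conditional aggregation pivots the component rows of one address into a
# single string, giving a result over the schema (fullAddress, id).
ADDRESS_SQL = """
SELECT MAX(CASE WHEN field = 'name'   THEN content END) || ' ' ||
       MAX(CASE WHEN field = 'number' THEN content END) || ' ' ||
       MAX(CASE WHEN field = 'street' THEN content END) || ' ' ||
       MAX(CASE WHEN field = 'zip'    THEN content END) || ' ' ||
       MAX(CASE WHEN field = 'city'   THEN content END) AS fullAddress,
       id
FROM location_lng
WHERE lng = 'fr'
GROUP BY id
"""

# String concatenation turns longitude and latitude into a WKT point
# followed by the spatial reference system URL quoted in the text.
POINT_SQL = """
SELECT stop_id,
       'POINT (' || longitude || ' ' || latitude || '); ' ||
       'http://www.opengis.net/def/crs/EPSG/0/4326' AS wkt
FROM stop
"""

addresses = conn.execute(ADDRESS_SQL).fetchall()
points = conn.execute(POINT_SQL).fetchall()
print(addresses)
print(points)
```

Conditional aggregation is used here instead of a plain group concatenation because it fixes the order of the address components regardless of the order in which the rows are stored.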
To avoid redundancy, the mapping files for the trip stops and for the trips are not shown. However, so that the reader can fully understand what is finally available at the endpoint, the triples produced for the trip stops and for the trips are shown in Listings 16 and 17, respectively.

Listing 16: RDF triples for a trip stop.

GeoNames (http://geonames.org) is a geographical database of the world that contains over 25 million geographical names and more than 11 million features. GeoNames provides a Java library that gives access to its web services, as in the code in Listing 20.

LinkedGeoData (http://linkedgeodata.org/) is a spatial database derived from OpenStreetMap, a project that aims at building a free editable world map. LinkedGeoData can be accessed and queried via a SPARQL endpoint. As in the case of DBpedia, a service was queried, and RDF triples were constructed by transforming the results. Since LinkedGeoData does not contain information about the city, the user must check which places are actually in Brussels. This was done through a SPARQL query that included a FILTER clause selecting places within a radius of 8 km around Brussels' Grand Place, which is considered the city center (see Figure 3); this radius was chosen arbitrarily. Over 2191 new places were obtained in this way, by running the query in Listing 22. This query retrieved all places with their labels, their type (e.g., restaurant, cafe, school), and their coordinates (longitude and latitude). The FILTER clause at the end verified that the place was in Brussels, as mentioned above. Examples of the resulting RDF triples are presented in Listing 23. In total, 2355 places were collected from the three external sources. The sizes and number of triples of the data in the triple store are given in Table 8.
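The effect of the 8 km filter can be illustrated outside the triple store. The sketch below is a stand-in for the distance test of the FILTER clause, not the query of Listing 22: it uses the haversine great-circle formula, the Grand Place coordinates quoted in the queries of this paper, and two candidate places whose coordinates are approximate values chosen for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
GRAND_PLACE = (4.3525, 50.8467)  # (longitude, latitude), as used in the queries


def haversine_km(p, q):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_brussels(place, radius_km=8.0):
    """Approximate the FILTER: keep places within radius_km of the Grand Place."""
    return haversine_km(GRAND_PLACE, place) <= radius_km


# Illustrative candidates with approximate coordinates (assumptions).
candidates = {
    "Atomium": (4.3415, 50.8949),
    "Namur centre": (4.8675, 50.4669),
}
kept = [name for name, point in candidates.items() if within_brussels(point)]
print(kept)
```

A circular filter of this kind includes any place inside the radius whether or not it lies within the administrative limits of Brussels, which is the consequence of choosing the radius arbitrarily.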

Querying the SPARQL Endpoint
With the data obtained as explained in the previous sections, a Strabon triple store was built and deployed as an stSPARQL endpoint. This section shows how this endpoint can be exploited using the stSPARQL language. Figure 4 shows the user interface. Next, examples of queries that can be run over the endpoint (http://eao4.ulb.ac.be:8080/strabonendpoint/) are given. For clarity of presentation, these queries are classified according to their type; this classification is not intended as a query taxonomy. Nevertheless, the queries cover the characteristics of the queries included in the benchmarks for geospatial RDF stores proposed by Garbis et al. [19] and, more recently, Ioannidis et al. [31]. The prefixes in Listing 24 were used in the queries and are not repeated, for the sake of space.

The endpoint prototype was installed on an OpenVZ container on a Debian server, with one core from a Xeon X3350, 1 GB of dedicated RAM, and RAID10 storage. The heap size is currently 5 GB. Most of the queries executed on the sub-second timescale, as readers can check for themselves over the endpoint. However, experimental results are not reported here, since this paper's focus is on usability and query expressiveness.

Queries that Explore the Datasets
The first group of queries were posed over just one dataset and allowed exploring the dataset's data and metadata. Therefore, the stSPARQL queries contained only one FROM clause. Although these queries were quite simple, they were aimed at introducing the language.

Query 1 "List the institutions in the Agenda.be dataset that are located less than 200 m from Brussels' Grand Place"

The query reads in stSPARQL:

    SELECT ?name ?geo
    FROM <http://agenda.be>
    WHERE {
      ?node a gospl:Address .
      ?node gospl:Address_of_Institution ?ins .
      ?node geo:geometry ?geo .
      FILTER(strdf:distance(?geo, "POINT (4.3525 50.8467); http://www.opengis.net/def/crs/EPSG/0/4326", <http://www.opengis.net/def/uom/OGC/1.0/meter>) < 200)
    }

In this query, variables ?name and ?geo returned the name of an institution and the point coordinates of its location, which allowed displaying the result on a map. Note that the FILTER clause was used to keep just the institutions (represented by ?geo) located less than 200 m from the Grand Place (it was assumed that the coordinates were known by the user or given by an application). The function strdf:distance computed the distance between the Grand Place and the corresponding institution. The result is shown in Figure 5.

The query returned a link to the triple corresponding to the stop's location. The stop's location was returned in variable ?stop. The geometry was returned in variable ?geo (that is, the point coordinates, which allowed displaying the result on a map). The name of the stop was returned in variable ?dsc. This is shown in the portion of the result displayed below as a JSON document. The result in graphic form is shown in Figure 6.
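As an illustration of how such a result can be prepared for display on a map, the sketch below parses a document in the standard SPARQL JSON results format and extracts one marker per binding. The variable names ?stop, ?geo, and ?dsc come from the text; the sample document, the stop IRI, and the stop name are hypothetical, and the helper is not part of the endpoint.

```python
import json
import re

# Hypothetical result in the SPARQL JSON results format.
SAMPLE_RESULT = """
{
  "head": {"vars": ["stop", "geo", "dsc"]},
  "results": {
    "bindings": [
      {
        "stop": {"type": "uri", "value": "http://example.org/stop/1"},
        "geo": {
          "type": "literal",
          "value": "POINT (4.3525 50.8467); http://www.opengis.net/def/crs/EPSG/0/4326"
        },
        "dsc": {"type": "literal", "value": "EXAMPLE STOP"}
      }
    ]
  }
}
"""

# Matches the "POINT (lon lat)" part of a WKT literal.
WKT_POINT = re.compile(r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def bindings_to_markers(document):
    """Turn each binding with a point geometry into a map marker."""
    markers = []
    for row in document["results"]["bindings"]:
        match = WKT_POINT.search(row["geo"]["value"])
        if match is None:
            continue  # skip geometries that are not points
        markers.append(
            {
                "label": row.get("dsc", {}).get("value", ""),
                "lon": float(match.group(1)),
                "lat": float(match.group(2)),
            }
        )
    return markers


markers = bindings_to_markers(json.loads(SAMPLE_RESULT))
print(markers)
```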

Queries that Include Aggregation
This kind of query contained a GROUP BY clause. The first query of this kind computed the twenty stops closest to the Brussels Central Square. In the SELECT clause, GROUP_CONCAT concatenated all the lines stopping at the same location (e.g., the ULB stop was used by Lines 25, 71, 72, and 94). The distance between a stop (?geo) and the Grand Place (POINT (4.3525 50.8467)) was computed and stored in variable ?dist. The results were then sorted by distance in ascending order, and finally, the closest 20 stops were kept, using the LIMIT keyword. The GROUP BY clause was used to ensure that a stop was displayed only once, since it could have several descriptions (in French and Dutch) for the same location. Figure 7 displays the result.

The aggregate function count() in this query counted the number of triples for each instantiation of the ?res variable. The routes are displayed in ascending order, by using the ORDER BY clause. However, since the name of a route is a string, it had to be cast into an integer, using BIND. Then, it was converted back into a string in the SELECT clause, in order to avoid the data type "xsd:integer". With this little trick, the lines can be displayed in ascending order like "1", "2", "3", ..., "98", instead of "1", "11", ..., "2", "21", ..., "98", which would be the case if they were compared as strings. Table 9 shows the result.

This query displayed institutions from Agenda.be and Bozar (in variable ?geo), together with their names (?institname), address (?street, zipcode and ?cityname) and the events taking place at them (through the concatenation of events in ?events). There were two FROM clauses mentioning both data graphs. Given that the mapping of the Bozar and Agenda.be data was performed using the vocabulary of the GOSPL-produced ontology, it was not possible to distinguish the source of an institution. In the query, results were first grouped by institution. Then, the total number of events by institution (count(*)) was computed, and finally, results were filtered through the HAVING clause. Note also that the first filter used the strdf:distance function to include only places that were within Brussels (as mentioned above, a place was considered to be within Brussels if it was within a radius of 8 km around the center).
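The ordering issue behind the cast can be reproduced in a few lines. The sketch below is an analogy in Python, not the stSPARQL query: it counts stops per route over hypothetical (route, stop) pairs and contrasts string ordering with numeric ordering of the route names.

```python
from collections import Counter

# Hypothetical (route name, stop) pairs; route names are strings,
# as in the source data.
route_stops = [
    ("2", "A"), ("11", "B"), ("2", "C"),
    ("1", "D"), ("11", "E"), ("2", "F"),
]

# Analogue of GROUP BY route with count(): number of stops per route.
stops_per_route = Counter(route for route, _ in route_stops)

# Ordering the names as strings compares them character by character.
as_strings = sorted(stops_per_route)

# Casting to an integer only for ordering keeps the names as strings
# in the output while sorting them numerically.
as_numbers = sorted(stops_per_route, key=int)

table = [(route, stops_per_route[route]) for route in as_numbers]
print(as_strings)
print(table)
```

The key function plays the role of the cast in the query: the integer is used only to order the rows, and the value shown remains the original string.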

Queries Drawing Buffers or Lines on a Map
Many decision queries require drawing a buffer around a geometric object (e.g., a city, a river) or a line linking two points (e.g., two PoIs). The queries in this class, explained below, are of this kind. The query above used the function buffer to draw a circle centered at ?geo1 with a radius of 200 m. All places within a radius of 200 m from a stop of Line 94 were displayed (using variable ?geo) together with their names (in variable ?label). The result is shown in Figure 8.
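To make the notion of a buffer concrete, the sketch below approximates a 200 m circular buffer around a point as a WKT polygon. It is a planar approximation written for illustration, with an assumed constant for the length of one degree of latitude; in the endpoint, the buffer is computed by the triple store itself, and its implementation may differ.

```python
from math import cos, pi, radians, sin

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def buffer_ring(lon, lat, radius_m, segments=32):
    """Approximate a circular buffer around (lon, lat) as a closed ring."""
    dlat = radius_m / METERS_PER_DEGREE
    # One degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_m / (METERS_PER_DEGREE * cos(radians(lat)))
    ring = [
        (lon + dlon * cos(2 * pi * i / segments), lat + dlat * sin(2 * pi * i / segments))
        for i in range(segments)
    ]
    ring.append(ring[0])  # WKT polygons repeat the first vertex at the end
    return ring


def ring_to_wkt(ring):
    """Serialize a closed ring as a WKT polygon."""
    coordinates = ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring)
    return f"POLYGON (({coordinates}))"


ring = buffer_ring(4.3525, 50.8467, 200)
print(ring_to_wkt(ring))
```

With 32 segments the polygon is visually indistinguishable from a circle at street scale; a larger value of `segments` gives a smoother outline at the cost of a longer literal.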

Spatial Queries Including Temporal Conditions
Queries in this class, besides including one or more temporal condition(s), may also include the features in the classes already studied.For example, the next query includes aggregation, and it takes data from more than one graph.

} } } GROUP BY ?geo2 ?description
This query included a subquery that finds all the routes passing within a radius of 100 m from the "BUYL" stop. This tolerance was required because several lines in Brussels intersect there. Then, all stops of each route were searched and displayed. The GROUP BY clause grouped the results so that the routes could be concatenated in the SELECT clause. Figure 11 shows the lines intersecting at the stop: Tram Lines 7, 94, and 25, and Bus 71.

    ?route stib:Route_with_Route_Name ?name .
    ?stop stib:Stop_with_Description "BASCULE"@fr-BE .
    ?stop stib:Stop_with_Direction "HEYSEL"@fr-BE .

Conclusions
The paper studied the problem of capturing spatiotemporal data from different data sources, integrating these data, storing them in a geospatial RDF data store, and exposing the integrated data in a spatially-enabled SPARQL endpoint.The source data used for the research reported in this paper were obtained from relational databases, XML documents, and different external Semantic Web repositories (accessed through APIs and SPARQL queries).The case study used in this paper is part of a larger project aimed at developing a prototype for integrating data from cultural events in Brussels with the public city transport schedule.Therefore, the data mentioned above were provided by the project partners, namely Agenda.be and Bozar (for the cultural data), and STIB (for the transportation data).These (relational and XML) data were enriched with external data coming from the Semantic Web, specifically from LinkedGeoData, GeoNames, and DBpedia.
The first research problem addressed consisted of mapping the spatial component of the source data into spatial RDF. This was done by extending the standard R2RML mapping language in order to produce data represented in the stRDF data model supported by Strabon, the spatial data store used in this project. Since R2RML does not directly support XML data sources, it was also extended with such a capability. The second problem consisted of showing how these integrated data could be exploited over the deployed endpoint. For this, a comprehensive set of stSPARQL queries was devised, divided into five query classes: dataset exploration queries, aggregate queries, queries involving several datasets, spatial analytical queries (e.g., queries that draw buffers), and queries involving temporal conditions. This variety illustrates the power of the solution. Last but not least, the paper also discussed the design alternatives considered to solve the problem at hand and the rationale for the decisions that were finally made.
Although only spatial geometries of type point were needed and addressed in this project, the proposed solutions and the endpoint can be extended to include other kinds of spatial data, for instance polygons. Moreover, since Strabon has support for column-store databases, the endpoint can be moved to platforms of this kind, to support larger databases efficiently. Finally, the work presented here can be used as a reference for new case studies in many other domains.

Listing 9: (Spatial) mapping file for an institution.

Figure 8. Places of interest less than 200 m away from a stop of Tram 94 (zoomed-in).

Figure 10. Swimming pools and restaurants less than 1 km away from each other.

Table 1. Mapping of elements in the Bozar database to concepts in the cultural ontology.

Table 2. Sample of the location_lng table of the Bozar database.

Table 4. Sample of the Stop table.

Table 5. Sample of the Trip table.

Table 6. Sample of the TripStop table.

Table 7. Mapping STIB database columns to concepts in the ontology.

Listing 14 depicts the R2RML mapping document for STIB's stop locations. Note, in the logical table, the concatenation in the SQL query mentioned above. Finally, Listing 15 shows the resulting triple for the location of Stop 6306 in Table 4.

Listing 15: An RDF triple for stop #6306.

Table 8. Size of the datasets in the endpoint.

Query 3 "List the twenty STIB stops closest to Brussels' Grand Place"

Table 9. Number of stops per route.