RDF 1.1: Knowledge Representation and Data Integration Language for the Web

Resource Description Framework (RDF) can be seen as a solution in today's landscape of knowledge representation research. The RDF language has symmetrical features because subjects and objects in triples can be used interchangeably. Moreover, the regularity and symmetry of the RDF language allow knowledge representation that is easily processed by machines, and because its structure is similar to that of natural languages, it is reasonably readable for people. RDF provides several useful features for generalized knowledge representation. Its distributed nature, due to the grounding of its identifiers in IRIs, naturally scales to the size of the Web. However, its use is often hidden from view and it is, therefore, one of the less well-known knowledge representation frameworks. We therefore summarize RDF v1.0 and v1.1 to broaden its audience within the knowledge representation community. This article reviews current approaches, tools, and applications for mapping from relational databases to RDF and from XML to RDF. We discuss RDF serializations, including formats with support for multiple graphs, and we analyze RDF compression proposals. Finally, we present a summarized formal definition of RDF 1.1 that provides additional insights into the modeling of reification, blank nodes, and entailments.


Introduction
The Resource Description Framework version 1.1 is a modern and complete knowledge representation framework that is seemingly underrepresented within the traditional knowledge representation research community. We seek to clarify some differences between the way RDF 1.1 was defined in World Wide Web Consortium (W3C) specifications and the ways in which it is reinterpreted during implementation. Firstly, we need to discuss how RDF relates to the broader field of knowledge representation.
Knowledge representation can be seen as the way in which knowledge is presented in a language. More precisely, it was clarified by Sowa [138], who presents five characteristics of knowledge representation:
1. It is most fundamentally a surrogate.
2. It is a collection of ontological commitments.
3. It is a fragmentary theory of intelligent reasoning.
4. It is a medium for pragmatically efficient computation.
5. It is a medium of human expression.
Natural language can be defined as one of the methods of knowledge representation. The fundamental unit of knowledge in such languages is often a sentence that consists of a set of words arranged according to grammatical rules. In spite of the existence of grammatical rules that encode expectations of word order, irregularities and exceptions to the rules make it difficult for machines to process natural languages.
The RDF data model was a response to this problem for knowledge representation on the World Wide Web. This language and the notions from which it originates have enabled free data exchange, formalization, and unification of stored knowledge. RDF was developed iteratively over nearly two decades to address knowledge representation problems at Apple Computer, Netscape Communications Corporation, and the Semantic Web and Linked Data projects at the World Wide Web Consortium.
A basic assumption in RDF [123] is to define resources by means of statements consisting of three elements (the so-called RDF triple): subject, predicate, and object. RDF borrows strongly from natural languages. An RDF triple may then be seen as an expression whose subject corresponds to the subject of a sentence, whose predicate corresponds to its verb, and whose object corresponds to its object [130]. So the RDF language may be categorized according to the same syntactic criteria as natural languages. According to these premises, RDF belongs to the group of Subject-Verb-Object (SVO) languages [44]. The consistency and symmetry of the RDF language allow knowledge representation that is easily processed by machines, and because its structure is similar to that of natural languages, it is reasonably readable for people.
On the other hand, following Lenzerini [100], data integration is the problem of combining data stored at disparate sources and providing the user with a unified view of these data. Much of the data on the Web is stored in relational databases. A similar amount of data exists in hierarchical files such as XML documents. Integration of all of this data would provide great benefits to the organizations, enterprises, and governments that own the data.
Interoperability is the capability of two or more (different) software systems or their components to exchange information and to use the information that has been shared [36]. In the context of the Web, interoperability is concerned with supporting applications that exchange and share information across the boundaries of existing data sources. The RDF world offers a satisfying method for achieving such interoperability.

Review Organization
The remainder of this article is organized as follows: Section 2 presents related work and formalized concepts for RDF. In Section 3 we discuss RDF blank nodes and their complexity. Section 4 analyzes the semantics of RDF and outlines a set of different entailment regimes. Section 5 overviews and compares various proposals for RDF data integration. Section 6 briefly introduces and compares various serialization formats for RDF 1.1. In Section 7 we overview and compare various proposals for RDF compression. Finally, Section 8 gives some concluding remarks.

Literature Review
In this section, we present related works in chronological order and show a formalized syntax and concepts for RDF.
A preliminary version of RDF was published in 1999 [97]. In this document RDF did not have many features known from current versions, e.g. there were no explicit blank nodes. One of the first papers, published in 2000 [54], presents RDF and RDFS. In 2001 Champin [37] focused on the RDF model and the XML syntax of RDF. In [31] Carroll provides a formal analysis of comparing RDF graphs. The author proves that RDF graph isomorphism can be reduced to known graph isomorphism problems. In [114], the authors focus on delineating RDFS(FA), a semantics for RDF Schema that can interoperate with common first-order languages. Grau [75] continues the RDFS(FA) approach and proposes a possible simplification of the Semantic Web architecture. Yang et al. [163] propose a semantics for anonymous resources and statements in F-logic [93]. Further overviews are presented in [12] by Berners-Lee and in [105] by Marin. The RDF 1.0 recommendation [104] has been the subject of several analyses. In 2004 Gutierrez et al. [78] formalized RDF and investigated computational aspects of testing entailment and redundancy. In 2005, in [50,69], the authors propose a logical reconstruction of the RDF family of languages. Feigenbaum [61] briefly describes RDF with emphasis on the Semantic Web. In [107,108], the authors discuss fragments of RDF and systematize them. These papers also outline complexity bounds for ground entailment in the proposed fragments. Yet another approach is provided in [118], which presents domain-restricted RDF (dRDF). In 2011 and 2012 further descriptions appeared in [85] and [4], which present Semantic Web technologies. In 2014 Curé et al. [45] briefly introduce RDF 1.0 and RDFS 1.0. There are also many papers that extend RDF with annotations [26,155,165], e.g. for fuzzy metrics [144], temporal metrics [77], spatial metrics [96], and trust metrics [153]. A separate group comprises publications on data integration in the context of RDF [81,134,139,147].
The term relational database to RDF mapping was used in [139]. In [134,147] the authors propose direct mappings, and in [81] indirect mappings are presented.
In order for machines to exchange machine-readable data, they need to agree upon a universal data model under which to structure, represent, and store content. This data model should be general enough to represent arbitrary data content regardless of its structure. The data model should also enable the processing of this content. The core data model selected for use on the Semantic Web and Web of Data digital ecosystems is RDF.
RDF constitutes a common method for the conceptual description and information modeling of Web resources. It provides the crucial foundation and framework to support the description and management of data. In particular, RDF is a general data model for resources and for describing the relationships between them.
The RDF data model rests on the concept of creating web-resource statements in the form of subject-predicate-object expressions which, in RDF terminology, are referred to as triples.
An RDF triple consists of a subject, a predicate, and an object. In [47], the meaning of subject, predicate, and object is clarified. The subject denotes a resource, the predicate refers to features or aspects of the resource and expresses a subject-object relationship, and the object fills the value of the relation. The predicate denotes a binary relation, also known as a property.
Following [47], we provide definitions of RDF triples below. The primitive constituents of the RDF data model are terms that can be used in reference to resources: anything with identity. The set of terms is divided into three disjoint subsets:
• IRIs,
• blank nodes,
• literals.
Definition 2.2 (IRIs). IRIs are a set of Unicode names in registered namespaces and addresses referring to registered protocols or namespaces used to identify a web resource. For example, <http://dbpedia.org/resource/House> is used to identify the house in DBpedia [7].
Note that in RDF 1.0 identifiers were RDF URI (Uniform Resource Identifier) References. Identifiers in RDF 1.1 are now IRIs, a generalization of URIs that allows a wider range of Unicode characters. Note that every absolute URL (and URI) is an Internationalized Resource Identifier, but not every IRI is a URI. When an IRI is used in operations that are defined exclusively for URIs, it should first be converted.
IRIs can be shortened. RDF syntaxes use two similar mechanisms: CURIEs (compact URI expressions) [15] and QNames (qualified names) [21]. Both are comprised of two components: an optional prefix and a reference. The prefix is separated from the reference by a colon. The syntax of QNames is restrictive and does not allow all possible IRIs to be represented, e.g. issn:15700844 is a valid CURIE but an invalid QName. Syntactically, QNames are a subset of CURIEs.
Definition 2.3 (Literals). Literals are a set of lexical forms and datatype IRIs. A lexical form is a Unicode string, and a datatype IRI identifies a datatype that determines how the lexical form is mapped to a value. RDF borrows many of the datatypes defined in XML Schema 1.1 [140]. For example, in "1"^^<http://www.w3.org/2001/XMLSchema#unsignedInt>, 1 is a lexical form that should be treated as an unsigned integer number.
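As an illustration, expanding a CURIE amounts to concatenating the namespace bound to its prefix with its reference. A minimal sketch in Python; the prefix table and function name are ours, not part of any specification:

```python
# Illustrative prefix table (these two namespace bindings are conventional).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_curie(curie: str, prefixes: dict) -> str:
    """Expand a CURIE such as 'foaf:name' into a full IRI."""
    prefix, _, reference = curie.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + reference

print(expand_curie("foaf:name", PREFIXES))
# http://xmlns.com/foaf/0.1/name
```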
Note that in RDF 1.0 literals with a language tag did not have a datatype URI. In RDF 1.1 literals with language tags have the datatype IRI rdf:langString. In the current version of RDF all literals have datatypes. Implementations might choose to support syntax for literals that have a lexical form only, but these should be treated as synonyms for xsd:string literals. Moreover, RDF 1.1 implementations may support the new datatype rdf:HTML. Both rdf:HTML and rdf:XMLLiteral depend on DOM4 (Document Object Model level 4).
Definition 2.4 (Blank nodes). Blank nodes are defined as elements of an infinite set disjoint from IRIs and Literals.
In RDF 1.1 blank node identifiers are local identifiers used in particular RDF serializations or implementations of an RDF store.
A set of RDF triples represents a labeled directed multigraph. The nodes are the subjects and objects of the triples. RDF is also referred to as graph-structured data, where each ⟨s, p, o⟩ triple can be interpreted as an edge s -p-> o.
Example 2.2. The example in Fig. 1 presents an RDF graph of a FOAF [23] profile. This graph includes four RDF triples:
<#js> rdf:type foaf:Person .
<#js> foaf:name "John Smith" .
<#js> foaf:workplaceHomepage <http://univ.com/> .
<http://univ.com/> rdfs:label "University" .
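The triples of Example 2.2 can be viewed as edges of such a multigraph. A small illustrative sketch, assuming triples are modeled as Python tuples of strings (this encoding is ours, purely for demonstration):

```python
# An RDF graph as a set of (subject, predicate, object) tuples.
graph = {
    ("#js", "rdf:type", "foaf:Person"),
    ("#js", "foaf:name", '"John Smith"'),
    ("#js", "foaf:workplaceHomepage", "<http://univ.com/>"),
    ("<http://univ.com/>", "rdfs:label", '"University"'),
}

def edges_from(graph, node):
    """All labeled edges s -p-> o leaving a given node."""
    return {(p, o) for (s, p, o) in graph if s == node}

print(sorted(edges_from(graph, "#js")))
```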
The RDF syntax and semantics can be extended to named graphs [32]. The named graph data model is a variation of the RDF data model. The basic concept of the model is a graph naming mechanism.
Definition 2.6 (Named graph). A named graph NG is a pair ⟨u, G⟩, where u ∈ I ∪ B is a graph name and G is an RDF graph.
Example 2.3. The example in Fig. 2 presents a named graph of a FOAF profile. This graph has the name http://example.com/#people and includes three RDF triples:
<#people> {
  <#js> rdf:type foaf:Person .
  <#js> foaf:name "John Smith" .
  <#js> foaf:workplaceHomepage <http://univ.com/> .
}
Figure 2: A named graph identified by <#people> with three triples.
RDF 1.1 introduces the idea of RDF datasets: collections of a distinguished RDF graph and zero or more graphs with context. Whereas RDF graphs have a formal semantics that establishes what arrangements of the universe make an RDF graph true, no agreed model-theoretic semantics exists for RDF datasets. For more about these characteristics, we refer the interested reader to RDF 1.1: On Semantics of RDF Datasets [164], which specifies several semantics in terms of model theory.
Definition 2.7 (RDF dataset). An RDF dataset DS includes one nameless RDF graph, called the default graph, and zero or more named graphs, each identified by an IRI or blank node: DS = {G, ⟨u_1, G_1⟩, ⟨u_2, G_2⟩, . . . , ⟨u_i, G_i⟩}.
In addition, the RDF Schema Recommendation [22] provides a set of built-in vocabulary terms under a core RDF namespace that unifies popular RDF patterns, such as RDF collections, containers, and RDF reification.
RDF provides vocabulary for describing containers. Each container has a type, and its members can be enumerated with the use of a fixed set of container membership properties. In order to provide a way to distinguish the members from one another, the properties are indexed by integers; however, these indexes cannot be regarded as specifying an ordering of the RDF container itself. RDF containers are RDF graph entities that use the vocabulary to provide basic information about the entities and to describe the container members. Following [22], RDF provides vocabulary for specifying three container classes:
• rdf:Bag is an unordered container that allows duplicates,
• rdf:Seq is an ordered container,
• rdf:Alt represents a group of alternatives.
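To illustrate, an rdf:Seq can be encoded by generating the indexed membership properties rdf:_1, rdf:_2, and so on. A hypothetical sketch; the helper name and example IRIs are ours:

```python
def seq_triples(container, members):
    """Yield (container, rdf:_i, member) triples for an ordered container.

    The integer indexes distinguish members but, per the RDF Schema
    Recommendation, do not formally specify an ordering of the container.
    """
    triples = [(container, "rdf:type", "rdf:Seq")]
    for i, member in enumerate(members, start=1):
        triples.append((container, f"rdf:_{i}", member))
    return triples

for t in seq_triples("ex:authors", ["ex:js", "ex:ak"]):
    print(t)
```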
Another feature of RDF is its vocabulary for describing RDF collections. Since the RDF data model has no inherent ordering, collections can be used to define an ordered, linear group of items using a linked-list pattern. An RDF collection is a linked-list structure comprising elements with a member and a pointer to the next element. Moreover, collections, in contrast to containers, are closed lists, which allows the set of items in the group to be precisely determined by applications. However, cyclic or unterminated lists are possible in RDF.
Example 2.6. The example presents a collection representing a group of resources. In the graph, each member of the collection is the object of the rdf:first predicate whose subject is a blank node representing a list cell, linked by the rdf:rest predicate. The rdf:rest predicate, with the rdf:nil resource as its object, indicates the end of the list.
<http://example.com/p> ex:teachers _:x .
_:x rdf:first <http://example.com/p/js> .
_:x rdf:rest _:y .
_:y rdf:first <http://example.com/p/ak> .
_:y rdf:rest rdf:nil .
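The rdf:first/rdf:rest walk from Example 2.6 can be sketched as follows; triples are modeled as tuples, and the cycle guard reflects the caveat above that cyclic lists are possible:

```python
triples = {
    ("http://example.com/p", "ex:teachers", "_:x"),
    ("_:x", "rdf:first", "http://example.com/p/js"),
    ("_:x", "rdf:rest", "_:y"),
    ("_:y", "rdf:first", "http://example.com/p/ak"),
    ("_:y", "rdf:rest", "rdf:nil"),
}

def collection_members(triples, head):
    """Walk the linked list starting at `head` until rdf:nil."""
    index = {(s, p): o for (s, p, o) in triples}
    members, node, seen = [], head, set()
    while node != "rdf:nil":
        if node in seen:            # guard against cyclic lists
            raise ValueError("cyclic collection")
        seen.add(node)
        members.append(index[(node, "rdf:first")])
        node = index[(node, "rdf:rest")]
    return members

print(collection_members(triples, "_:x"))
# ['http://example.com/p/js', 'http://example.com/p/ak']
```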
Another RDF feature is reification (denoted sr in Table 1), which provides an approach to talk about individual RDF triples themselves within RDF. The method allows for constructing a new resource that refers to a triple, and then for adding supplementary information about that RDF statement.
An extension of the previous method is N-ary Relations [111] (denoted nr in Table 1). This approach is not strictly designed for reification, but focuses on additional arguments of the relation to provide extra information about the relation instance itself. There are other proposals [82,109] for RDF reification. The first proposal, called RDF*/RDR [82] (denoted rdr in Table 1), is an alternative approach to representing statement-level metadata. It is based on the idea of using a triple in the subject or object position of other triples that represent metadata about the embedded statement. For reified RDF data, an additional file format based on Turtle has been introduced. Unfortunately, in RDF*/RDR several reification statements about the same triple are translated into one standard reification part, so it is not possible to distinguish grouped annotations.
The second proposal is called Singleton Property [109] (denoted sp in Table 1). It represents statements about statements. It uses a unique predicate for every triple with metadata associated to the statement, which can be linked to the high-level predicate. The authors propose the special predicate singletonPropertyOf to link to the original predicate. Since the predicate resource uses the predicate singletonPropertyOf, it is possible to use RDFS entailment rules to infer the original statements.
foaf:name#1 rdf:singletonPropertyOf foaf:name .
<#js> foaf:name#1 "John Smith" .
foaf:name#1 ex2:certainty 0.5 .
It is also possible to use named graphs directly (denoted ng in Table 1). The named graph concept allows assigning an IRI to one or more triples as a graph name. In that scenario, the graph name is used as a subject that can store the metadata about the associated triples. In [76] the authors present a model of nanopublications along with a named graph notation.
Example 2.11. The example presents a named graph with metadata.
In Table 1 we present features of the above-mentioned approaches, namely: having a W3C Recommendation, having a special syntax that is an extension of RDF, and the number of extra RDF statements required to represent an RDF triple (i.e. O(n)).

Modeling Blank Nodes
The standard semantics for blank nodes interprets them as existential variables. We provide an alternative formulation for blank nodes and look at theoretical aspects of blank nodes.
Following [38], blank nodes give the capability to:
• encapsulate N-ary associations,
• describe reification,
• offer protection of the inner data,
• describe multi-component structures (e.g. RDF containers),
• represent complex attributes without having to explicitly name the auxiliary node.
The problem of deciding whether two RDF graphs with blank nodes are isomorphic is GI-complete, as noticed in [31]. In the total absence of blank nodes, graph isomorphism can be decided in PTIME [103].
There is a complication in the notion of RDF graphs, caused by blank nodes. Blank nodes are intended to be locally-scoped terms that are interpreted as existential variables. Blank nodes are shared between graphs only if the graphs are derived from documents or RDF datasets that provide for sharing blank nodes between different graphs. Downloading a document does not make the blank nodes in the resulting graph identical to the blank nodes obtained from other downloads of the same file. This gives rise to a notion of isomorphism between RDF graphs that are the same up to blank node relabeling: isomorphic RDF graphs can be considered as containing the same content.
Moreover, when merging two or more RDF graphs it is important to ensure that there are no conflicts in blank node labels. A merging operation performs the union after forcing all blank nodes that are shared between two or more graphs to be distinct in each RDF graph. The graph resulting from this operation is called the merge. The result of this operation can contain more nodes than the original graphs.
Definition 3.2 (RDF merge). Given two graphs, G_1 and G_2, an RDF merge of these two graphs, denoted G_1 ⊎ G_2, is defined as the set union G_1′ ∪ G_2′, where G_1′ and G_2′ are isomorphic copies of G_1 and G_2, respectively, such that the copies do not share any blank nodes.
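Definition 3.2 can be sketched by relabeling blank nodes apart before taking the union. An illustrative implementation, assuming blank nodes are strings prefixed with "_:" (the encoding and helper names are ours):

```python
import itertools

_counter = itertools.count()

def _rename(term, mapping):
    """Map a blank node to a fresh label; leave IRIs and literals untouched."""
    if term.startswith("_:"):
        if term not in mapping:
            mapping[term] = f"_:b{next(_counter)}"
        return mapping[term]
    return term

def merge(g1, g2):
    """RDF merge: isomorphic copies of g1 and g2 share no blank nodes."""
    result = set()
    for g in (g1, g2):
        mapping = {}                # fresh renaming per input graph
        for s, p, o in g:
            result.add((_rename(s, mapping), p, _rename(o, mapping)))
    return result

g1 = {("_:a", "foaf:name", '"John"')}
g2 = {("_:a", "foaf:name", '"Ann"')}
print(len(merge(g1, g2)))   # 2: the shared label _:a denotes distinct nodes
```

Note that a plain set union of g1 and g2 would yield two triples here as well, but would wrongly identify the two occurrences of _:a as one node; the merge keeps them apart.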
As with RDF graphs, we can compare RDF datasets.
Blank node identifiers are not part of the RDF abstract syntax. Giving a constant name to blank nodes can be achieved by the skolemization mechanism [86]. In situations where stronger identification is needed, some or all of the blank nodes can be replaced with IRIs. Systems that wish to do so ought to create a globally unique IRI (called a skolem IRI) for every blank node so replaced. This conversion does not significantly change the meaning of the graph. It permits other RDF graphs to subsequently refer to the skolem IRIs, which is impossible for blank nodes. Systems use a well-known IRI [110] with the registered name genid when they need skolem IRIs to be distinguishable outside of the system boundaries.
Definition 3.4 (Skolemization). Assume that G is a graph including blank nodes. Skolemization is an injective function ξ : B → I_skolem, where I_skolem is a set of skolem IRIs that are substituted for blank nodes and do not occur in any other RDF graph.
From the above definition it follows that I_skolem ∩ I_G = ∅, where I_G is the set of IRIs that are used in G.
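A possible skolemization sketch following this definition and the well-known genid scheme; the authority, function names, and use of UUIDs for uniqueness are illustrative assumptions, not mandated by the specification:

```python
import uuid

def skolemize(graph, authority="http://example.com"):
    """Replace each blank node with a fresh, globally unique skolem IRI."""
    xi = {}                         # the injective function ξ : B → I_skolem
    def skolem(term):
        if term.startswith("_:"):
            if term not in xi:
                xi[term] = f"{authority}/.well-known/genid/{uuid.uuid4()}"
            return xi[term]
        return term
    return {(skolem(s), p, skolem(o)) for (s, p, o) in graph}

g = {("_:x", "foaf:name", '"John Smith"')}
print(skolemize(g))
```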
On the other hand, in [103], the authors propose two skolemization schemes: centralized and decentralized. The first one is very similar to what a URL shortening service does. Formally, there is a distinguished subset of the URIs. Whenever the service receives a request, it returns a Skolem constant that has not been used before. The second proposal resembles the first but with no central service. As a result, each publisher generates its constants locally. Another proposal [87] focuses on a scheme to produce canonical labels for blank nodes, which maps them to globally canonical IRIs. It guarantees that two skolemized graphs will be equal if and only if the two RDF graphs are isomorphic.
The NP-completeness of simple entailment originates from cyclic blank nodes. In [88], the authors discuss a number of possible alternatives for blank nodes: 1. disallowing blank nodes, 2. a ground semantics, 3. well-behaved RDF.
The first alternative disallows the use of blank nodes in RDF. However, blank nodes are a useful convenience for publishers. The second alternative proposes to assign blank nodes a ground semantics, such that they are interpreted in a similar fashion to IRIs. The third alternative is also presented in [20]. The core motivation for this proposal is to allow implementers to develop tractable lightweight methods that support the semantics of blank nodes for an acyclical case. Following [20], a well-behaved RDF graph is a graph that conforms to the restrictions, which we present below.
Definition 3.5 (Well-behaved RDF graph). A well-behaved RDF graph is an RDF graph that conforms to the following restrictions:
1. it can be serialized as Turtle without the use of explicit blank node identifiers,
2. it uses no deprecated features of RDF.
Note that the first version of RDF published in 1999 did not have named blank nodes and thus was by definition well-behaved.
Another important concept is leanness, which concerns checking whether an RDF graph contains redundancy. A graph is lean if it is not equivalent to any of its proper subgraphs (graphs with fewer triples). The problem of verifying whether an RDF graph is lean is coNP-complete, as noticed in [79]. Alongside the notion of graphs being non-lean, we also intuitively refer to blank nodes as being non-lean. Non-lean blank nodes are the cause of redundant triples in non-lean graphs. A graph is non-lean if and only if it contains one or more non-lean blank nodes.
Example 3.1. The example shows that the top graph is lean, because there is no proper map into itself. The bottom graph is not lean.

Entailments
An interpretation in RDF is a function from literals and IRIs into a set, together with restrictions upon the mapping and the set. In this section we introduce different notions of interpretation from the RDF area, each corresponding to an entailment regime in a standard way.
Following [115], a simple interpretation I is a structure consisting of:
1. R_I, a (nonempty) set of resources, called the universe of I,
2. P_I, a set, called the set of properties of I,
3. EXT_I, an extension function used to associate properties with their property extensions, EXT_I : P_I → 2^(R_I × R_I),
4. INT_I, the interpretation function, which assigns a resource or a property to every element of V, such that INT_I is the identity for literals.
The interpretation is a map from expressions (i.e. triples, graphs, and names) to truth values and universe elements. Following terminology, we say that I satisfies an RDF graph H when I(H) = true, and that H is satisfiable if there is a simple interpretation that satisfies it. Moreover, an RDF graph G entails a graph H when every interpretation that satisfies G also satisfies H. In this case, we write G |= H. The graphs G and H are logically equivalent if each entails the other. Simple entailment can be directly stated in terms of graph homomorphism, as noticed in [115].
Ter Horst also proved the NP-completeness of simple entailment by reduction from the clique problem [146]. If the RDF graph H contains no blank nodes, the problem is in PTIME, because one only needs to check that each triple in H is also in G.
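A brute-force sketch of simple entailment: search for a mapping of H's blank nodes into G's terms such that the image of H is a subgraph of G. The exhaustive search reflects the NP-hardness, and the ground case reduces to the subset test mentioned above; triples are modeled as tuples of strings, an encoding of our own choosing:

```python
from itertools import product

def is_blank(term):
    return term.startswith("_:")

def simple_entails(g, h):
    """Decide G |= H by exhaustive search over blank node mappings."""
    h_blanks = sorted({t for tr in h for t in (tr[0], tr[2]) if is_blank(t)})
    if not h_blanks:                # ground case: subset test, PTIME
        return h <= g
    g_terms = sorted({t for tr in g for t in tr})
    for assignment in product(g_terms, repeat=len(h_blanks)):
        mu = dict(zip(h_blanks, assignment))
        mapped = {(mu.get(s, s), p, mu.get(o, o)) for (s, p, o) in h}
        if mapped <= g:
            return True
    return False

g = {("#js", "foaf:name", '"John Smith"')}
h = {("_:x", "foaf:name", '"John Smith"')}
print(simple_entails(g, h))   # True: _:x can map to #js
```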
Following [115], blank nodes are seen as simply indicating a thing's existence, without handling an IRI to identify any name of a resource. We need to define a version of the simple interpretation mapping that includes the set of blank nodes as part of its domain.
When two RDF graphs share a blank node, their meaning is not fully captured by treating them in isolation. RDF graphs can be viewed as conjunctions of simple atomic sentences in First-Order Logic (FOL) [49], where blank nodes are existentially quantified variables.
Further interpretations depend on which IRIs are recognized as datatypes. Hayes and Patel-Schneider [115] propose the use of a parameter D (the set of recognized datatypes) on simple interpretations. The next interpretation we consider is the D-interpretation, which satisfies the following conditions:
1. If rdf:langString ∈ D, then for every language-tagged string E with lexical form s_l and language tag t_l, L_I(E) = ⟨s_l, t_l′⟩, where t_l′ is t_l transformed to lower case,
2. For every other IRI d ∈ D, I(d) is the datatype identified by d, and for every literal "s_l"^^d, L_I("s_l"^^d) = L2V(I(d))(s_l), where L2V is a function from datatypes to their lexical-to-value mappings.
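The lexical-to-value mapping L2V can be illustrated as a table from recognized datatype IRIs to parsing functions; the selection of datatypes and the helper names below are a small, arbitrary sketch of our own:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# A tiny illustrative L2V table: datatype IRI -> lexical-to-value function.
L2V = {
    XSD + "unsignedInt": int,
    XSD + "boolean": lambda s: {"true": True, "false": False,
                                "1": True, "0": False}[s],
    XSD + "string": str,
}

def literal_value(lexical_form, datatype_iri):
    """Map a literal's lexical form to its value, if the datatype is recognized."""
    return L2V[datatype_iri](lexical_form)

print(literal_value("1", XSD + "unsignedInt"))   # 1
```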
Note that in the RDF 1.0 specification, datatype D-entailment was described as an RDFS-entailment semantic extension. In RDF 1.1 it is defined as a simple direct extension. Moreover, in RDF 1.1 datatype entailment formally refers to a set of recognized datatypes IRIs. RDF 1.0 used the concept of a datatype map: in the new semantic description, this is the mapping from recognized IRIs to the datatypes they identify.
A graph is satisfiable recognizing D (or simply D-satisfiable) if it is true in some D-interpretation, and G entails H recognizing D (or D-entails H) when every D-interpretation that satisfies G also satisfies H.
In [146] ter Horst proposes the D* semantics, which is a weaker variant of the RDFS 1.0 D semantics. This semantics generalizes the RDFS 1.0 semantics [83] by adding reasoning with datatypes.
An RDF interpretation imposes additional semantic conditions on part of the (infinite) set of IRIs with the namespace prefix rdf: and on the xsd:string datatype. In RDF there are three key terms:
• rdf:Property (P) - the class of RDF properties,
• rdf:type (a) - the subject is an instance of a class,
• rdf:langString (ls) - the class of language-tagged string literal values.
This RDF vocabulary is defined by the RDF Semantics [115] in terms of the RDF model theory. A selection of the inference rules is presented in Table 2. As before, G RDF-entails H recognizing D when every RDF interpretation recognizing D that satisfies G also satisfies H. When D is {xsd:string, rdf:langString}, we simply say that G RDF-entails H.
RDF Schema [22] extends RDF with additional vocabulary with more complex semantics. The RDFS semantics introduces the notion of a class: a resource that constitutes a set of things that all have the class as a value of their rdf:type property. Classes are things of type rdfs:Class. We introduce C_I, the set of all classes in an interpretation. Additionally, the semantic conditions are stated in terms of a function CEXT_I : C_I → 2^(R_I). In RDFS there are ten key terms:
• rdfs:Class (C) - the class of classes,
• rdfs:Literal (Lit) - the class of literal values,
• rdfs:Resource (Res) - the class of resources, everything,
• rdfs:Datatype (Dt) - the class of RDF datatypes,
• rdfs:subPropertyOf (spo) - the property that allows for stating that all things related by a given property x are also necessarily related by another property y,
• rdfs:subClassOf (sco) - the property that allows for stating that the extension of one class X is necessarily contained within the extension of another class Y,
• rdfs:domain (dom) - the property that allows for stating that the subject of a relation with a given property x is a member of a given class X,
• rdfs:range (rng) - the property that allows for stating that the object of a relation with a given property x is a member of a given class X,
• rdfs:ContainerMembershipProperty (Cmp) - the class of container membership properties, rdf:_i,
• rdfs:member (m) - a member of the subject resource.
For example, one of the semantic conditions states that ⟨x, y⟩ ∈ EXT_I(INT_I(dom)) ∧ ⟨u, v⟩ ∈ EXT_I(x) ⇒ u ∈ CEXT_I(y). This RDFS vocabulary is defined by the RDF Semantics [115] in terms of the RDF model theory. A selection of the RDFS inference rules is presented in Table 3.
As before, G RDFS-entails H recognizing D when every RDFS interpretation recognizing D that satisfies G also satisfies H.
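As a rough illustration of how such inference rules can be applied, the following sketch computes a closure under four core RDFS rules (subclass transitivity, type propagation through subclasses, domain, and range). It is a naive fixpoint computation of our own, not the full RDFS rule set from Table 3:

```python
def rdfs_closure(graph):
    """Forward-chain four core RDFS rules to a fixpoint."""
    g = set(graph)
    while True:
        new = set()
        for (s, p, o) in g:
            if p == "rdfs:subClassOf":
                # subclass transitivity (rdfs11-style)
                new |= {(s, p, o2) for (s2, p2, o2) in g
                        if p2 == "rdfs:subClassOf" and s2 == o}
                # instances of a subclass are instances of the superclass (rdfs9-style)
                new |= {(i, "rdf:type", o) for (i, p2, c) in g
                        if p2 == "rdf:type" and c == s}
            elif p == "rdfs:domain":
                # subjects of property s belong to class o (rdfs2-style)
                new |= {(x, "rdf:type", o) for (x, p2, y) in g if p2 == s}
            elif p == "rdfs:range":
                # objects of property s belong to class o (rdfs3-style)
                new |= {(y, "rdf:type", o) for (x, p2, y) in g if p2 == s}
        if new <= g:
            return g
        g |= new

g = {("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
     ("#js", "rdf:type", "foaf:Person")}
print(("#js", "rdf:type", "foaf:Agent") in rdfs_closure(g))   # True
```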
Example 4.2. The example shows that the top graph entails the bottom graph. When we replace the node <#js> in the top graph with a blank node, the bottom graph still has a node that represents a person, and the semantics is preserved. Similarly, when we delete the node "John Smith" from the top graph, the bottom graph still preserves the graph's semantics.

RDF Data Integration
The role of RDF as an integration layer for data from different sources is one of the crucial motivations for research efforts. It is important to provide integration methods that bridge the gap between RDF and other environments. Following [100], the definition of a data integration system is provided below.
Definition 5.1 (Data integration system). A data integration system is a tuple ⟨G, S, M⟩, where G is the global schema, S is the source schema, and M is the mapping between G and S, constituted by a set of assertions.
In this section the approaches for mapping relational databases to RDF are discussed (Subsection 5.1), as well as approaches for mapping XML to RDF (Subsection 5.2). However, there are also more general approaches [1,41,42,56,117,122,145].

Bringing Relational Databases into the RDF
This subsection contains an overview and comparison of the approaches for mapping from a relational database into RDF. Table 5 presents the key approaches from related work. It presents the features of the below-mentioned proposals, namely: mapping representation (SQL, RDF, XML, etc.), schema representation (RDF, OWL [89], and F-Logic [93]), and level of automation.
At the beginning, we focus on solutions [8,128,158] based on SQL as the mapping representation. Triplify [8] is based on the mapping of HTTP requests onto database queries expressed in SQL, which are used to match subsets of the store contents and map them to classes and properties. It converts the resulting relations into RDF triples and subsequently publishes them in various RDF syntaxes. That proposal includes an approach for publishing update logs in RDF which contain all RDF resources, in order to enable incremental crawling of data sources. An additional advantage is that it can be easily integrated and deployed with numerous, widely installed Web applications. The next approach is StdTrip [128], which proposes a structure-based framework using existing ontology alignment software. The approach finds ontology mappings between a simple vocabulary that is generated from a database. The results of the ontology alignment algorithm are presented as suggestions to the user, who chooses the most appropriate ontology mapping. RDOTE [158] also uses SQL for the specification of the data subset. In that proposal, the suitable SQL query is stored in a file. That approach transforms data residing in the database into an RDF graph dump using classes and properties.
The next group of approaches [18,119] uses D2RQ as the mapping representation. D2RQ [18] supports both automatic and manual operation modes. In the first mode, an RDFS vocabulary is created, in accordance with reverse engineering methodologies, for the translation of foreign keys to properties. In the second mode, the contents of the database are exported to RDF in accordance with mappings stored in RDF. D2RQ allows RDF applications to treat non-RDF stores as virtual RDF graphs. One of its disadvantages is that it provides only a read-only RDF view of the database. Another D2RQ-based proposal is AuReLi [119], which uses several string similarity measures to associate attribute names with existing vocabulary entities in order to fully automate the transformation of databases. It also tries to link database values with ontology individuals. RDF Views [58] has similar functionality to D2RQ. It supports both automatic and manual operation modes. That solution maps a table to an RDFS class and a column to a predicate, and takes into account cases such as whether a column is part of a unique key. The data is represented as virtual RDF graphs without the physical formation of RDF datasets.
Table 5: Relational database mapping. This table presents the mapping representation and schema representation of the below-mentioned proposals.

Approaches   Mapping Represent.   Schema Represent.
[5,6]        n/a                  RDFS, OWL, F-Logic
[8]          SQL                  RDFS
[34,35]      Constraint rules     RDFS, OWL
[18]         D2RQ                 RDFS
[46]         RDF, Rel.OWL         RDFS, OWL
[24]         n/a                  RDFS, OWL
[90]         FOL, Horn            RDFS, OWL
[25]         SQL                  RDFS, OWL
[101]        Logic rules          RDFS, OWL
[27]         n/a                  RDFS
[28]         XML                  RDFS, OWL
[58]         SPARQL               RDFS
[72]         R2O                  RDFS, OWL
[84]         RDF                  RDFS
[112]        RDF, Rel.OWL         RDFS, OWL
[119]        D2RQ                 RDFS, OWL
[127]        XPath (XSLT)         RDFS, OWL
[128]        SQL                  RDFS, OWL
[136]        n/a                  RDFS, OWL
[143]        n/a                  RDFS, F-Logic
[150]        FOL                  RDFS, OWL
[158]        SQL                  RDFS, OWL
[160]        RDF/XML              RDFS, OWL
[132]        RDF (Direct)         RDFS
[17,102]     XQuery               n/a

Another group of proposals [84,132] uses RDF. The first approach is OntoAccess [84], which provides vocabulary-based write access to data. That paper presents a relational-database-to-RDF mapping language called R3M, which consists of an RDF format and algorithms for translating queries to SQL. The next proposal is SquirrelRDF [132], which extracts data from a number of databases and integrates that data into a business process. That proposal supports RDF views and allows for the execution of queries against them. In that group we can also distinguish solutions [46,112] based on Relational.OWL [52]. ROSEX [46] uses Relational.OWL to represent the relational schema of a database as an OWL ontology. The created database schema annotation and documentation is mapped to a domain-specific vocabulary, which is achieved automatically by reverse-engineering the schema. An additional advantage is that it supports automatic query translation. DataMaster [112] also uses Relational.OWL for importing schema structure and data from relational databases.
R2RML [48] is the best-known RDF-based language for expressing mappings from relational databases to RDF datasets, because it is a W3C recommendation and has many implementations [59,120,133]. R2RML is a language for specifying mappings from relational to RDF data. A mapping takes as input a logical table (logicalTable predicate), i.e. a database table, an SQL query, or a database view. In the next step, a logical table is mapped to a triples map, which is a set of triples. A triples map has two main parts. The first part is a subject map (subjectMap predicate) that generates the subject of all RDF triples that will be generated from a logical table row. The second part is a predicate-object map (predicateObjectMap predicate) that specifies the target property and the generation of the object via an object map (objectMap predicate).
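As an illustrative sketch (with a hypothetical table EMP and columns EMPNO and ENAME, not taken from any of the surveyed papers), an R2RML triples map combining these parts might look as follows:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

<#EmployeeMap>
    # the logical table: a base table named EMP
    rr:logicalTable [ rr:tableName "EMP" ] ;
    # the subject map: one IRI per row, typed as ex:Employee
    rr:subjectMap [
        rr:template "http://example.com/employee/{EMPNO}" ;
        rr:class ex:Employee
    ] ;
    # a predicate-object map: the ENAME column supplies the ex:name value
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "ENAME" ]
    ] .
```

For a row with EMPNO 7369 and ENAME "SMITH", such a mapping would produce a typing triple for <http://example.com/employee/7369> and the triple stating its ex:name is "SMITH".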
Yet another group of approaches is [28,160]. DartGrid [160] describes a database integration architecture. It uses a visual mapping system to align the relational database to an existing vocabulary. Correspondences between components of the models are defined via a graphical user interface and stored in RDF/XML syntax (see Subsection 6.1). The next proposal is MASTRO [28], which is a framework that enables the definition of mappings between a relational database and a vocabulary.
Another group of proposals [90,150] uses First-Order Logic (FOL) [49]. Tirmizi et al. [150] present formal rules in FOL to transform columns to predicates and tables to classes. The authors present a system that is complete with respect to the space of possible primary key and foreign key combinations. The next proposal is MARSON [90], which uses mappings based on virtual documents (called vectors of coefficients). It removes incorrect mappings by validating mapping consistency. Moreover, a special type of semantic mappings (called contextual mappings) is presented in the paper.
Some approaches [34,72,101,102,127,150] adopt other mapping representations. RDBToOnto [34,35] is a tool that simplifies the implementation of methods for vocabulary acquisition from databases. It provides a visual interface for manual modification and adjustment of the learning parameters. RDBToOnto links the data analysis with heuristic rules and generates an ontology. DB2OWL [72] creates a local vocabulary from a relational database that is aligned to a reference vocabulary. The vocabulary generated in that proposal reflects the database semantics. The mappings are stored in an R2O [9] document. The next proposal is SOAM [101], which uses the column-to-predicate and table-to-class approach. It creates an initial schema, which is refined by referring to a dictionary. Constraints are mapped to constraints in the vocabulary schema. That approach tries to establish the quality of the constructed vocabulary. Another proposal is [127]. It is a domain semantics-driven mapping generation approach. Mappings are created using XSLT [92] and XPath [137]. Yet another approach is XSPARQL [102], which can be used both with relational databases and XML (see Subsection 5.2). There are also several approaches [5,24,27,136,143] that do not have a defined mapping representation.

Table 6: XML mapping. This table presents the schema representation of the below-mentioned transformation approaches.

Approaches   Schema Representation
[3]          RDFS, DAML+OIL
[10]         RDFS, OWL
[11]         RDFS, OWL
[13]         n/a
[14]         RDFS, OWL
[16,17]      n/a
[19]         RDFS, OWL
[40]         RDFS, OWL
[43]         RDFS, OWL
[55]         RDFS, OWL
[56]         RDFS, OWL
[57]         n/a
[60]         n/a
[62]         RDFS, OWL
[71]         RDFS, OWL
[73]         RDFS, OWL
[94]         RDFS
[95]         RDFS
[99]         RDFS, OWL
[113]        RDFS, OWL
[124]        RDFS, OWL
[126]        RDFS, OWL
[135]        RDFS, OWL
[142]        RDFS, OWL
[148,149]    RDFS, OWL
[154]        n/a
[161]        RDFS
[162]        RDFS, OWL
Astrova [5,6] discusses correlations among keys, data in key attributes shared between two relations, and non-key attributes. In these papers, the quality of the transformation is considered. Buccella et al. [24], Shen et al. [136] and Stojanovic et al. [143] examine heuristic rules. Byrne [27] proposes a domain-specific approach for the generic design of cultural heritage data and discusses the options for including published heritage thesauri.

Bringing XML into RDF
This subsection overviews and compares the approaches for mapping from XML into RDF. Table 6 presents key approaches from related work, listing the features of the below-mentioned proposals, namely: existing vocabulary, schema representation (RDFS, OWL and DAML+OIL [89]) and level of automation.
At the beginning, we focus on solutions [43,55,73,124,126] that use an existing vocabulary and/or ontology. This means that the XML data is transformed according to the mapped vocabularies. Cruz et al. [43] propose basic mapping rules to specify the transformation rules on properties, which are defined in the XML Schema. Deursen et al. [55] propose a method for the transformation of XML data into RDF instances in an ontology-dependent way. X2OWL [73] is a tool that builds an OWL ontology from an XML data source and a set of mapping bridges. That proposal is based on an XML Schema that can be modeled using different styles to create the vocabulary structure. The next proposal is WEESA [124], which applies Web engineering techniques to developing semantically tagged applications. Another tool is JXML2OWL [126]. It supports the transformation from syntactic data sources in XML format to a common shared global model defined by a vocabulary.
Another group of proposals [3,11,14,62,71,94,95,99,113,149,161,162] does not support mappings between XML Schemas and existing vocabularies. Amann et al. [3] discuss a data integration system, where XML is mapped into a vocabulary that supports roles and inheritance. That tool focuses on offering the appropriate high-level primitives and mechanisms for representing the semantics of XML data. Janus [11] is a framework that focuses on an advanced logical representation of XML Schema components and supports a set of patterns that enable the transformation from XML Schema into a vocabulary. SPARQL2XQuery [14] is a framework that transforms SPARQL [80] queries into XQuery [125] queries using a mapping from a vocabulary to an XML Schema. It allows querying XML databases. Ferdinand et al. [62] propose two independent mappings: from XML to RDF graphs and from XML Schema to OWL. That proposal allows items in XML documents to be mapped to different items in OWL. Garcia et al. [71] present a domain-specific approach that maps the MPEG-7 standard to RDF. Klein [94] proposes a procedure to transform the XML tree using the RDF primitives by annotating the XML with RDFS. This procedure can increase the availability of semantically annotated RDF data. SWIM [95] is an integration middleware for mediating high-level queries to XML sources using RDFS. Lehti et al. [99] show how ontologies can be used for mapping data sources to a global schema. In this work, the authors show how inference rules can be used to check the consistency of such mappings, and a query language based on XQuery is presented. O'Connor et al. [113] propose an OWL-based language that can transform XML documents into arbitrary ontologies. It extends the Manchester syntax with XPath to support references to XML fragments. Another framework is DTD2OWL [149], which translates XML into vocabularies.
It also allows transforming specific XML instances into OWL individuals. Xiao et al. [161] propose mappings between XML schemas and local RDFS vocabularies, as well as mappings between the local vocabularies and the global RDFS vocabulary. The authors discuss the problem of query containment and present a query rewriting algorithm for RDQL [51] and XQuery. Yahi et al. [162] propose an approach which covers both the schema level and the data level. In this proposal, XML Schema documents are generated for XML documents with no schema using the trang tool.
Solutions [13,19,57] using XSLT are separate from the ones mentioned above. Berrueta et al. [13] discuss an XSLT+SPARQL framework, which allows performing SPARQL queries from XSLT. It is a collection of functions for XSLT that allows transforming the XML results format. Yet another tool is XML2OWL [19], which uses XSLT for mapping from XML to an ontology. Droop et al. [57] propose another XSLT solution, which allows embedding XPath into SPARQL. Shapkin et al. [135] propose a transformation language that is not strictly based on XSLT but is inspired by it. That proposal focuses on matching the types of RDF resources.
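The tools above each define their own mapping conventions; as a generic, hypothetical sketch of the underlying idea (assuming an input document with person elements carrying id and name attributes, names not taken from any surveyed tool), an XSLT stylesheet can lift XML into RDF/XML as follows:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/ns#">
  <!-- wrap the whole output in an rdf:RDF root element -->
  <xsl:template match="/people">
    <rdf:RDF>
      <xsl:apply-templates select="person"/>
    </rdf:RDF>
  </xsl:template>
  <!-- map each person element to an RDF resource description -->
  <xsl:template match="person">
    <rdf:Description rdf:about="http://example.org/person/{@id}">
      <ex:name><xsl:value-of select="@name"/></ex:name>
    </rdf:Description>
  </xsl:template>
</xsl:stylesheet>
```

The attribute value template in rdf:about mints one IRI per source element, which is the step where most of the approaches surveyed here differ (fixed templates versus vocabulary alignment).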
Another subgroup of approaches [10,17,40,60,142,154] supports mutual transformation. XSPARQL [16,17] is a query language based on SPARQL and XQuery for transformations from RDF into XML and back. It is built on top of XQuery, both syntactically and semantically. Gloze [10] is another tool for bidirectional mapping between XML and RDF. It uses information available in the XML schema to describe how XML is mapped into RDF and back again. GRDDL [40] is a mechanism for obtaining RDF data from XML documents; the transformations are typically expressed in XSLT. SAWSDL [60] is a markup language that proposes a collection of new attributes for WSDL [39] and XML Schema. Yet other tools are SPARQL2XQuery [142] and XS2OWL [154].

RDF Serializations
Several RDF syntax formats exist for writing down graphs. RDF 1.1 introduces a number of serialization formats, such as: Turtle, N-Triples, TriG, N-Quads, JSON-LD, RDFa, and RDF/XML. Note that in RDF 1.1, RDF/XML is no longer the only recommended serialization format.
In Subsection 6.1, we present serializations that support single graphs. In Subsection 6.2, we briefly introduce serializations that support multiple graphs. Moreover, in this section we show the above-mentioned formats in examples.

Single Graph Support
RDFa [106] (denoted rdfa in Table 7) is an RDF syntax that embeds RDF triples in HTML and XML documents. The RDF data is mixed within the Document Object Model, which means that document content can be marked up with RDFa. It adds a set of attribute-level extensions to HTML and various XML-based document types for embedding rich metadata within Web documents. What is more, RDFa allows for free intermixing of terms from multiple vocabularies. It is also designed in such a way that the format can be processed without knowledge of the specific vocabulary being used. It is common in contexts where data publishers are able to change Web templates but have little additional control over the publishing infrastructure.
Following [106], we provide the most important attributes that can be used in RDFa, such as:
• about - an attribute that is an IRI or CURIE [15] specifying the resource the metadata is about (an RDF subject),
• rel and rev - attributes that express (reverse) relationships between two resources (an RDF predicate),
• property - an attribute that expresses a relationship between a subject and some literal value (an RDF predicate),
• resource - an attribute for expressing a relationship's partner resource that is not intended to be navigable (an RDF object),
• href - an attribute that expresses the partner resource of a relationship (an RDF resource object),
• src - an attribute that expresses a relationship's partner resource when the resource is embedded (an RDF object that is a resource),
• content - an attribute that overrides the content of the element when using the property attribute (an RDF object that is a literal),
• datatype - an attribute that specifies the datatype of a literal,
• typeof - an attribute that specifies the RDF types of the subject or the partner resource,
• inlist - an attribute that specifies that the object associated with property or rel attributes on the same element is to be pushed onto the list for that predicate,
• vocab - an attribute that specifies the mapping to be used when an RDF term is assigned as a value of an attribute.
Example 6.1. The example presents an RDFa 1.1 serialization that represents RDF triples of Example 2.2.
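As an additional illustrative sketch on hypothetical data (not the triples of Example 2.2), several of the attributes listed above can be combined in an HTML fragment as follows:

```html
<!-- hypothetical data; vocab sets the default vocabulary (FOAF) -->
<div vocab="http://xmlns.com/foaf/0.1/"
     about="http://example.org/alice" typeof="Person">
  <span property="name">Alice</span> knows
  <a property="knows" href="http://example.org/bob">Bob</a>.
</div>
```

An RDFa processor would extract three triples from this fragment: the rdf:type of the subject given by about, its foaf:name literal from the element content, and a foaf:knows link to the resource given by href.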
Following [70], we provide the most important elements and attributes that can be used in RDF/XML, such as:
• rdf:RDF - the root element of RDF/XML documents,
• rdf:Description - an element that contains elements describing the resource identified by the rdf:about attribute,
• rdf:Alt, rdf:Bag and rdf:Seq - container elements used to describe a group of things (see Section 2),
• rdf:parseType="Collection" - an attribute that describes groups that can only contain the specified members,
• rdf:parseType="Resource" - an attribute that is used to omit blank nodes,
• xml:lang - an attribute that is used to allow content language identification,
• rdf:datatype - an attribute that is used to define a typed literal,
• rdf:nodeID - an attribute that identifies a blank node,
• rdf:ID and xml:base - attributes that abbreviate IRIs.
Turtle [121] (denoted ttl in Table 7) offers a textual syntax that enables recording RDF graphs in a compact form, including abbreviations that use data patterns and datatypes. Following [121], we provide the most important rules for constructing a Turtle document:
• The simplest triple statement consists of a sequence of subject, predicate, and object, separated by space, tabulation or other whitespace and terminated by a dot after each triple.
• Often, the same subject will be referenced by several predicates. In this situation, a series of predicates and objects are separated by a semicolon.
• As with predicates, objects are often repeated with the same subject and predicate. In this case, a comma should be used as a separator.
• IRIs may be written as relative IRIs, absolute IRIs or prefixed names. Both absolute and relative IRIs are enclosed in angle brackets (the less-than and greater-than signs).
• Quoted literals have a lexical form followed by a datatype IRI, a language tag or neither. Literals are delimited by single or double quotes.
• Blank nodes are expressed as an underscore, a colon and a blank node label that is a series of name characters. Blank nodes can be nested, abbreviated, and delimited by square brackets.
• Collections are enclosed by parentheses.
N-Triples [30] (denoted nt in Table 7) is a line-based, plain text serialization format and a subset of the Turtle format minus features such as shorthands. This means that there is a lot of redundancy, and N-Triples files can be larger than their Turtle and RDF/XML counterparts. N-Triples was designed to be a simpler format than Turtle, and therefore easier for software to parse and generate. Following [30], we provide the most important rules for constructing an N-Triples document:
• The triple statement consists of a sequence of subject, predicate, and object, divided by whitespace, and terminated by a dot after each triple.
• IRIs should be represented as absolute IRIs and they are enclosed in angle brackets (the less-than and greater-than signs).
• The representation of the lexical form is a sequence of a double quote (an initial delimiter), a list of characters or escape sequence, and a double quote (a final delimiter).
• Blank nodes are expressed as underscore, colon and a blank node label that is a series of name characters.
There are some changes in RDF 1.1 N-Triples, e.g. the encoding is UTF-8 rather than US-ASCII, and blank node labels may begin with a digit.
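The contrast between the two syntaxes can be sketched on hypothetical data; the abbreviated Turtle statements below and the N-Triples line shown in the comment encode statements of the same shape:

```turtle
@prefix ex: <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# a semicolon repeats the subject; a comma repeats subject and predicate
ex:alice ex:knows ex:bob , ex:carol ;
         ex:age "30"^^xsd:integer .

# the first triple written as N-Triples (absolute IRIs, one triple per line):
# <http://example.org/ns#alice> <http://example.org/ns#knows> <http://example.org/ns#bob> .
```

Since N-Triples is a syntactic subset of Turtle, the commented line is itself also valid Turtle; the abbreviations only add compactness.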
Example 6.4. The example presents an N-Triples serialization that represents RDF triples of Example 2.2. Note that we changed some RDF terms for legibility.

Multiple Graphs Support
JSON-LD [141] (denoted jld in Table 7) is a JSON-based format to serialize structured data such as RDF. The syntax is designed to easily integrate into deployed systems that use JSON and provides a smooth upgrade path from JSON to JSON-LD. The use of RDF in JSON makes RDF data accessible to Web developers without the obligation to install additional parsers, software libraries or other programs for changing RDF data. Like JSON, JSON-LD uses human-readable text to transmit data objects consisting of key-value pairs. Keywords in JSON-LD start with an at sign. Following [141], we provide the most important keywords that can be used in JSON-LD, such as:
• @context - sets the short-hand names that are used throughout a document,
• @id - uniquely identifies things that are being described in the document with blank nodes or IRIs,
• @value - specifies the data that is associated with a particular property,
• @language - defines the language for a particular string value or the default language of a document,
• @type - sets the data type of an IRI, a blank node, a JSON-LD value or a list,
• @container - sets the default container type for a short-hand string that expands to an IRI or a blank node identifier,
• @list - defines an ordered set of data,
• @set - defines an unordered set of data (values are represented as arrays),
• @reverse - used for reverse relationship expression between two resources,
• @index - specifies that a container is used to index information,
• @base - defines the base IRI against which relative IRIs are resolved,
• @vocab - expands properties and values in @type with a common prefix IRI,
• @graph - expresses a graph.
Example 6.5. The example presents a JSON-LD serialization that represents RDF triples of Example 2.3.
Example 6.6. The example presents a JSON-LD serialization that represents an RDF triple: <#a> as a subject, blank node as a predicate and "Alice" as an object.
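As a further illustrative sketch on hypothetical data (unrelated to the preceding examples), @context and @id work together as follows:

```json
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": { "@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id" }
  },
  "@id": "http://example.org/alice",
  "name": "Alice",
  "knows": "http://example.org/bob"
}
```

The context maps the short keys to IRIs and declares that values of knows are IRIs rather than strings, so the document expands to two RDF triples about the subject http://example.org/alice.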
In TriG, graph statements are a pair of a blank node label or an IRI and a group of RDF triples surrounded by curly brackets. The blank node label or IRI of a graph statement may be used in another graph statement, which implies taking the union of the triples generated by each graph statement. A blank node label or IRI used as a graph label may also reoccur as part of any RDF triple.
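As an illustrative sketch on hypothetical data, a TriG document with a default graph and one named graph, together with the corresponding N-Quads encoding of the named triple, might look as follows:

```trig
@prefix ex: <http://example.org/ns#> .

# triples outside the braces belong to the default graph
ex:g1 ex:creator ex:alice .

# a graph statement: the label ex:g1 paired with a group of triples
ex:g1 {
    ex:alice ex:knows ex:bob .
}

# in N-Quads the graph label becomes a fourth element on each line:
# <http://example.org/ns#alice> <http://example.org/ns#knows>
#     <http://example.org/ns#bob> <http://example.org/ns#g1> .
```

Note that ex:g1 appears both as a graph label and as the subject of a triple in the default graph, illustrating the reoccurrence of graph labels described above.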
Example 6.8. The example presents an N-Quads serialization that represents RDF triples of Example 2.3. Note that we changed some RDF terms for legibility.
In Table 7 we present features of the above-mentioned standardized serializations, namely: having a W3C Recommendation, human-friendly syntax (partial support means that some fragments may be difficult to read), ease of processing, compact form (partial support means that there are several different forms and not all are normalized), similarity to Turtle syntax, XML-based syntax, and multigraph support. Furthermore, there are a few RDF serializations that are not standardized, such as TriX [33] and RDF/JSON [151,152].

RDF Compression
A recent work [64] points out that RDF datasets are highly compressible because of the RDF graph structure and RDF syntax verbosity. In that paper, different compression approaches are analyzed, including: 1. direct compression, 2. adjacency list compression, 3. an RDF split into the element dictionaries and the statements.
The conclusions of that paper suggest that RDF is highly compressible.
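This compressibility is easy to observe with a general-purpose compressor; the following sketch (hypothetical data, standard zlib) corresponds to the direct compression case above:

```python
import zlib

# Hypothetical, highly regular N-Triples data: long repeated IRI prefixes
# and a fixed triple structure are typical of real RDF dumps.
triples = "".join(
    f"<http://example.org/person/{i}> "
    f"<http://xmlns.com/foaf/0.1/knows> "
    f"<http://example.org/person/{i + 1}> .\n"
    for i in range(1000)
).encode("utf-8")

# direct compression of the serialized graph, no RDF-specific preprocessing
compressed = zlib.compress(triples, 9)
ratio = len(triples) / len(compressed)
print(f"{len(triples)} -> {len(compressed)} bytes (ratio {ratio:.1f}x)")
```

RDF-aware techniques such as HDT improve on this baseline by exploiting the graph structure itself rather than only the textual redundancy.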
Definition 7.1 (RDF compression processor). An RDF compression processor is used by application programs for encoding their RDF data into compressed RDF data and/or to decode compressed RDF data to make the data accessible.
Header-Dictionary-Triples [66] (denoted hdt in Table 8) is a binary format based on three main parts: 1. a header, which includes metadata describing the RDF dataset, 2. a dictionary, which organizes all the identifiers in the graph (it provides a list of the RDF terms such as literals, IRIs and blank nodes), 3. a triples component, which consists of the pure structure of the underlying RDF graph.
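The role of the dictionary component can be sketched in a few lines (a simplified illustration of the idea only, not the actual HDT encoding): each distinct term is stored once, and the triples component is reduced to integer IDs.

```python
# Hypothetical triples with repeated terms.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:knows", "ex:carol"),
]

dictionary = {}  # term -> integer ID, each term stored exactly once
encoded = []     # the pure structure: triples of integer IDs
for s, p, o in triples:
    encoded.append(tuple(dictionary.setdefault(t, len(dictionary) + 1)
                         for t in (s, p, o)))

print(dictionary)
print(encoded)  # [(1, 2, 3), (1, 2, 4), (3, 2, 4)]
```

In HDT, both components are additionally encoded with compact and succinct data structures, so the ID triples can be traversed and queried without decompression.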
HDT achieves high levels of compression and provides retrieval features over the compressed data. That approach works on the complete dataset, with non-negligible processing time. That idea is extended in [63]. In that thesis, the author proposes techniques to compress rich-functional RDF dictionaries and triple indexing. That thesis shows the use of a succinct data configuration to browse HDT-encoded datasets. HDT can be used as a backend of Triple Pattern Fragments [159], as it natively supports fast triple-pattern extraction. In [2,65] (denoted eri in Table 8), the authors exploit a feature of RDF data streams, which is the regularity of their data values and structure. They propose the compressed Efficient RDF Interchange format, which can reduce the amount of data transmitted when processing RDF streams. ERI considers an RDF stream as a continuous flow of blocks of triples. A standard compressor can be used in each channel to exploit its data regularities and produce better compression results.
Another approach is the RDF Differential Stream compressor based on Zlib [67] (denoted rdsz in Table 8), which is a proposal for RDF streaming compression. It applies the general-purpose stream compressor Zlib to RDF streams. It uses differential encoding to exploit structural similarities. The results of this process are compressed with Zlib to exploit additional redundancies. Furthermore, that approach achieves gains in compression at the cost of increased processing time.
Interest in RDF compression over streaming data has been indirectly covered by RDF stream processing systems such as Continuous Query Evaluation over Linked Streams Cloud [98] (denoted cqels in Table 8) and Ztreamy [68] (denoted ztr in Table 8). These papers emphasize the importance of compression for scalable transmission of RDF streams over the network. In [98], the authors suggest an approach that deals with this issue by dictionary encoding. In [68], the authors discuss a scalable middleware for stream publishing.
Several research areas have emerged around MapReduce and RDF compression, e.g. scalable compression of large RDF datasets [157] and large RDF data compression and decompression efficiency [156] (denoted mr in Table 8). The first paper presents an approach based on providing another dictionary-based compression on top of MapReduce [53]. In the second one, the authors expand [157] and achieve linear scalability with respect to the number of nodes and the input size. Another proposal is presented in [74], where HDT-MR is introduced. HDT-MR uses the MapReduce technique to process huge RDF datasets and build the HDT serialization.
It is worth noting that EXI (Efficient XML Interchange) [91] can be used for RDF compression, but it can only serialize XML [129] (i.e. RDF/XML or TriX) or JSON [116] (i.e. JSON-LD). In Table 8 we present features of the above-mentioned proposals, namely: having a W3C Recommendation, binary syntax, ability to stream, ability to scale, and support of a software library used for data compression (Zlib).

Table 8: RDF compression. This table compares the hdt, eri, rdsz, cqels, ztr and mr proposals with respect to standard status, binary format, streamability, scalability and Zlib support (HDT is a W3C Member Submission).

Conclusions
Standards are instrumental in achieving a significant level of interoperability. W3C recommendations provide people and institutions with a basis for mutual understanding. The recommendations that define RDF are used as tools that enable various providers to interact with one another. Despite the achievements of the current RDF recommendations, they are not sufficient for achieving full end-to-end interoperability. The standards leave several areas vulnerable to variations in interpretation. In this article, we outlined various RDF recommendations and scientific papers that extend and clarify them, and presented a summarised formal description that we hope will clarify some of the interpretative differences. We specifically provided insights on the interpretation of the handling of blank nodes and reification. We presented several interpretative differences, each corresponding to an entailment regime in a standard way. We surveyed various RDF serializations, RDF compression proposals, and RDF mapping approaches to highlight their differences. Finally, we presented a summarized formal definition of RDF 1.1 and emphasized the changes between RDF versions 1.0 and 1.1.
We argue that knowledge representation and data integration on the Web face some of the same challenges we were facing ten years ago, in spite of the significant work accomplished by both researchers and implementers. We hope that this review contributes to a better understanding of RDF 1.1 and provides the basis for a discussion of interpretative differences. We also hope that some of these gaps may be fixed in a future version of RDF, e.g. through the selection of a concise reification mechanism and a formal description of the data model that addresses practical experiences with reification and blank nodes.