Acquiring Ontology Axioms through Mappings to Data Sources

Abstract: Although current languages used in ontology-based data access (OBDA) systems allow for mapping source data to instances of concepts and relations in the ontology, several application domains need more flexible tools for inferring knowledge from data, tools that can dynamically acquire axioms about new concepts and relations directly from the data. In this paper we introduce the notion of mapping-based knowledge base (MKB) to formalize the situation where both the extensional and the intensional level of the ontology are determined by suitable mappings to a set of data sources. This makes the intensional level of the ontology as dynamic as the extensional level traditionally is. To do so, we resort to the meta-modeling capabilities of higher-order description logics, in particular the description logic Hi(DL-Lite_R), which allows concepts and relations to be seen as individuals, and vice versa. The challenge in this setting is to design efficient algorithms for answering queries posed to MKBs. Besides the definition of MKBs, our main contribution is to prove that answering instance queries posed to MKBs expressed in Hi(DL-Lite_R) can be done efficiently.


Introduction
Ontology-based data access (OBDA) [1,2] is a paradigm for accessing data using a conceptual representation of the domain of interest expressed as an ontology. An OBDA system relies on a three-level architecture, consisting of the data layer, the ontology, and the mapping between the two. In more detail:

• the data layer consists of the existing data sources that are relevant for the organization;
• the ontology is a declarative and explicit representation of the domain of interest for the organization, formulated in a description logic (DL) [3] so as to take advantage of various reasoning capabilities in accessing data;
• the mapping is a set of declarative assertions specifying how the sources in the data layer relate to the ontology.
Several OBDA projects have been carried out in recent years [4][5][6][7], and various OBDA management systems have been designed to support OBDA applications, e.g., [8][9][10].In current OBDA systems the ontology is expressed as a DL TBox, i.e., a set of assertions on the relevant concepts and roles (i.e., binary relationships between concepts) of the domain of interest, constituting the intensional level of the representation, and the mapping assertions are used to specify how the data at the sources correspond to the instances of the concepts and the relations, which form the extensional level of the representation.Thus, the mapping assertions, together with the source data, determine the so-called virtual ABox, in the sense that the instance assertions are not explicitly given but are specified through the relationships between the data and the elements of the TBox.
From the above observations it should be clear that current works on OBDA share the idea, which originated in data integration and data exchange [11][12][13][14], that mappings are used to (virtually) retrieve extensional information of the ontology from the sources, as shown in Figure 1, while the intensional level information, represented by the TBox, remains fixed a priori, once and for all. In this paper, we challenge this preconception and propose to adopt a virtual approach to the specification of the TBox, thus virtualizing both the extensional and the intensional level of the ontology. In other words, we propose a setting where not only the ABox but also the TBox is specified through mappings linking the data to the ontology, as illustrated in Figure 2. Our approach is based on addressing two issues.

The first issue is related to looking at the content of the data sources to identify concepts and roles that are relevant to the domain of interest but that have not been modeled in the TBox, because they were not known at design time. As an example, the database D in Figure 3 stores data about different models and different types of cars manufactured by motor companies (table T-CarTypes), as well as various cars of such types (table T-Cars). If we look carefully at the semantics of the data in D, we realize that this database not only stores information about the instances of the concepts in the domain of interest (e.g., the first row of table T-Cars collects data about an instance of the concept Car) but also contains pieces of data denoting new concepts of the domain. In particular, table T-CarTypes contains data denoting concepts such as 1973 FALCON XB GT, 1967 MUSTANG SHELBY, 1973 MUSTANG MACH 1, and so on. Considering the context where these concepts appear, it is not difficult to conclude that they are all mutually disjoint subconcepts of Car. Table T-Cars, on the other hand, provides information about the instances of the various concepts, as well as other properties of them (i.e., Color, ProdCountry). We observe that, in order to acquire the knowledge about the concepts mentioned in the data sources, we need a flexible mechanism that is able to map data at the sources to concepts in the ontology. Without such flexibility, the designer would be forced to manually inspect the data sources and enrich the ontology off-line.

The second issue is related to the need for meta-modeling constructs in the language used to specify the ontology [16][17][18][19]. Meta-modeling allows concepts and relations to be treated as first-class citizens and to be seen as individuals that are instances of other concepts, called meta-concepts. By exploiting meta-modeling we might introduce in the ontology the meta-concept Car-Type, with Coupe, Sedan, etc. as its specializations. Note that the instances of such specialized concepts include the subconcepts of cars listed in the rows stored in table T-CarTypes. With this mechanism, the designer can specify the means for dynamically acquiring TBox axioms, through simple queries asking for the instances of the meta-concept Car-Type. Indeed, without the possibility of using meta-concepts in the ontology, it would be impossible to deal with the first issue mentioned above, which is based exactly on the idea of using mappings to transfer the knowledge fragments residing in the data sources to the TBox of the ontology. Note that for the technical development related to meta-modeling, we base our work on the approach and the results reported in [18].
In this paper, we deal with both the issue of acquiring TBox axioms from data sources and the issue of meta-modeling. In particular, we focus on designing tractable algorithms for query answering in such a setting. Indeed, looking for query answering algorithms that are tractable in data complexity is a distinguishing characteristic of OBDA systems. We follow [20], and we work with the DL-Lite family of DLs, which enjoys first-order logic (FOL) rewritability. In a DL enjoying this property, answering (unions of) conjunctive queries can be done in two steps. The first one, called rewriting, uses the TBox axioms to transform the query q into a new FOL query q′. The second step evaluates q′ over the ABox seen as a database.
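The two-step scheme can be illustrated with a deliberately small sketch. It assumes a toy TBox containing only positive inclusions between atomic concepts and an atomic query q(x) ← C(x); the function names `rewrite` and `evaluate` and the sample data are hypothetical, and the sketch is far simpler than the rewriting procedures actually used for DL-Lite.

```python
from typing import Dict, List, Set, Tuple

# Toy TBox (assumed for illustration): positive inclusions between
# atomic concepts, stored as (subconcept, superconcept) pairs.
TBOX: List[Tuple[str, str]] = [
    ("Ford", "Car"),
    ("1973_MUSTANG_MACH_1", "Ford"),
]

# Toy ABox seen as a database: concept name -> set of asserted instances.
ABOX: Dict[str, Set[str]] = {
    "1973_MUSTANG_MACH_1": {"ELEANOR"},
    "Car": {"KITT"},
}


def rewrite(concept: str, tbox: List[Tuple[str, str]]) -> Set[str]:
    """Step 1: turn q(x) <- concept(x) into a union of atomic queries,
    one for each concept that the TBox makes a subconcept of `concept`."""
    union = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in tbox:
            if sup == current and sub not in union:
                union.add(sub)
                frontier.append(sub)
    return union


def evaluate(union: Set[str], abox: Dict[str, Set[str]]) -> Set[str]:
    """Step 2: evaluate the rewritten query over the ABox only."""
    answers: Set[str] = set()
    for concept in union:
        answers |= abox.get(concept, set())
    return answers


rewritten = rewrite("Car", TBOX)
answers = evaluate(rewritten, ABOX)
```

The TBox is consulted only in the first step; the second step is a plain database lookup, which is what makes delegation to a DBMS possible.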
The challenge we face in this paper is to design tractable query answering algorithms even in cases where the mapping assertions map the data at the sources to both the extensional and the intensional level of the ontology and both meta-concepts and meta-relations are used in the queries.In particular, we present the following contributions.

• We formalize the notion of mapping-based knowledge base (MKB), which captures the idea of acquiring the axioms of both the extensional and the intensional level of the ontology through mapping assertions linking the source data to the ontology. This mechanism allows the designer to achieve a level of flexibility that is not possible in current OBDA systems. Our formalization relies on the notion of higher-order DL, as introduced in [18]. Indeed, De Giacomo et al. [18] describe a methodology that, starting from a traditional DL L, allows one to define its higher-order version Hi(L). Here, we apply this idea and make use of the higher-order DL Hi(DL-Lite_R).

• We propose to query mapping-based knowledge bases by means of an extension of unions of conjunctive queries (UCQs), taking advantage of the higher-order features of Hi(DL-Lite_R). In particular we define a suitable class of such queries, the so-called instance higher-order UCQs (IHUCQs), enjoying nice computational properties. The basic characteristic of IHUCQs is to allow higher-order features (i.e., meta-concepts and meta-properties) in the query expression but to disregard subclass and subproperty assertions in the body of the query.

• We study the problem of answering IHUCQs posed to MKBs expressed in Hi(DL-Lite_R). We show that this problem is efficiently solvable by exhibiting an algorithm based on FOL rewriting. The algorithm works in AC^0 with respect to the extensional level of the data sources, i.e., the portion of the data sources that is not involved in the intensional level of the ontology. More precisely, our algorithm, given an IHUCQ q over an MKB, reformulates q into a FOL query that is evaluated taking into account only the portion of the MKB involving the extensional level of the ontology. As a consequence, query answering can be delegated to a database management system, exactly as in the traditional OBDA approach.
To sum up, the main achievement of this paper is to prove that we can extend the mapping language used in current OBDA systems so as to talk about concepts and roles as objects (requiring higher-order features), thus realizing the idea of virtualizing both the extensional and the intensional information in a way that is natural and also effective. Indeed, our AC^0 result can be interpreted as follows: the complexity of answering higher-order queries posed on knowledge bases built using such extended mappings remains the same as in traditional OBDA settings, when measured only with respect to the extensional level of the data sources.
We observe that initial ideas on generating the intensional level of a representation through mappings were explored in the data exchange setting. In particular, Papotti and Torlone [21] propose a setting where both data and meta-data stored in relational tables, as is typical of DBMSs, are exchanged. In the context of RDF, the standard language R2RML (https://www.w3.org/TR/r2rml/) for mapping relational databases to RDF datasets also allows mapping patterns in the relational data to assertions (e.g., subsetting) on RDF predicates. In this sense, it is in the same spirit as our proposal, which however is tailored to ontologies and not only to RDF data. In the context of ontologies, this issue is actually new. Indeed, to the best of our knowledge, the only reference dealing with generating TBoxes through mappings is [15], of which this paper is an extended version. More specifically, with respect to [15], the reader can find in the present paper additional examples and discussions, a detailed description of our technique for query answering, and complete proofs for all our results. Furthermore, we revised the analysis of combined complexity of query answering over Hi(DL-Lite_R) MKBs (Theorem 3).
The rest of this paper is structured as follows.In Section 2 we recall the definition of Hi(DL-Lite R ), the DL adopted in our work.In Section 3 we introduce and discuss the notion of mapping-based knowledge base.In Section 4 we illustrate the kinds of queries that we consider in this paper, and in Section 5 we present our algorithm for query answering.Section 6 concludes the paper.

Higher-Order DL-Lite R
We start by recalling some notions on DLs in general and DL-Lite R in particular.Then, we provide the syntax and the semantics of Hi(DL-Lite R ), i.e., the higher-order version of DL-Lite R .
DLs [3] are logics that represent the domain of interest in terms of concepts, denoting sets of objects, and roles, denoting binary relations between (instances of) concepts.Complex concept and role expressions are constructed starting from a set of atomic concepts and roles and by recursively applying suitable constructs.
A DL knowledge base (KB) K is constituted by two components, T and A, i.e., K = T ∪ A, where:

• T, called TBox, is the terminological component of K, which contains statements representing intensional knowledge, and
• A, called ABox, is the assertional component of K, which contains assertions representing extensional knowledge.
DL-Lite R is a member of the DL-Lite family of tractable DLs [22,23] and is the logical basis of OWL 2 QL, one of the profiles of OWL, the W3C standard for representing ontologies [24].
DL-Lite_R expressions are given by the following syntax:

• Concept expressions: $B \rightarrow A \mid \exists Q$ and $C \rightarrow B \mid \neg B$
• Role expressions: $Q \rightarrow P \mid P^{-}$ and $R \rightarrow Q \mid \neg Q$

where A is an atomic concept (i.e., a unary predicate from the signature of the knowledge base), B is a basic concept (i.e., an atomic concept A or an existential restriction on a role ∃Q, which denotes individuals occurring in the first component of the role Q, which is called the domain of Q), C is a general concept (i.e., a basic concept B or its negation ¬B), P is an atomic role (i.e., a binary predicate from the signature of the knowledge base), Q is a basic role (i.e., an atomic role P or its inverse P⁻), and R is a general role (i.e., a basic role Q or its negation ¬Q). The domain of P⁻ (i.e., ∃P⁻) is also called the range of P.

A TBox in DL-Lite_R is a finite set of assertions of the form $B \sqsubseteq C$ and $Q \sqsubseteq R$. When C and R above assume the form ¬B′ and ¬Q′, respectively, the above inclusions are called negative inclusions and allow us to specify disjointness between concepts or between roles, respectively. Otherwise, the above inclusions are called positive inclusions and allow us to specify is_a relations between concepts or roles, respectively.

An ABox in DL-Lite_R is a finite set of membership assertions (i.e., facts) of the form $A(a)$ and $P(a, b)$, where a and b are constants, that is, names for individuals (i.e., predicates of arity 0 from the signature of the knowledge base).
We are now ready to describe the higher-order DL Hi(DL-Lite_R). We start by observing that, as discussed in [18], every traditional DL L can be characterized by a set OP(L) of operators, used to form concept and role expressions, and a set MP(L) of meta-predicates, used to form assertions. Each operator and each meta-predicate has an associated arity. Given a symbol T, we write T/n to denote that T has arity n. For DL-Lite_R, we simply have: OP(DL-Lite_R) = {Inv/1, Exists/1} and MP(DL-Lite_R) = {Inst_C/2, Inst_R/3, Isa_C/2, Isa_R/2, Disj_C/2, Disj_R/2}.

In words, the knowledge base says that Ford is a concept that specializes the concept Car, and produced_in is a role typed on Car and ProdCountry, which is a concept denoting production countries. More precisely, the domain of produced_in can be instantiated only with instances of Car, as stated by the second inclusion, whereas the range of produced_in can be instantiated only with instances of ProdCountry, as stated by the third inclusion. The above knowledge base can be reformulated in the higher-order style syntax for DL-Lite_R as follows:

We finally note that in both syntaxes given above, the first three assertions constitute the TBox of the KB, whereas the last two assertions constitute the ABox of the KB.
Let us turn our attention to the definition of the syntax of Hi(DL-Lite R ).Hereinafter we assume the existence of two disjoint, countably infinite alphabets: S, the set of predicates, also called names and V, the set of variables.Intuitively, the names in S are the symbols denoting the atomic elements of a Hi(DL-Lite R ) knowledge base.The building blocks of such a knowledge base are assertions, which in turn are based on terms and atoms.
We inductively define the set of terms, denoted by τ_{DL-Lite_R}(S, V), over the alphabets S and V for Hi(DL-Lite_R) as follows:

• if e ∈ S ∪ V, then e ∈ τ_{DL-Lite_R}(S, V);
• if e ∈ τ_{DL-Lite_R}(S, V), and e is not of the form Inv(e′) (where e′ is any term), then Inv(e) ∈ τ_{DL-Lite_R}(S, V);
• if e ∈ τ_{DL-Lite_R}(S, V), then Exists(e) ∈ τ_{DL-Lite_R}(S, V).

Differently from [18], we avoid the construction of terms of the form Inv(Inv(e)) which, as roles, are equivalent to e. Under this assumption, we do not have safe-range issues when dealing with queries; thus, differently from [18], we consider here non-Boolean queries.

Intuitively, a term denotes either an atomic element, the inverse of an atomic element, or the projection of an atomic element on either the first or the second component.

Example 2.
If the names Car and produced_in belong to the alphabet S, then the following are Hi(DL-Lite_R) terms: Car, Inv(produced_in), Exists(Inv(produced_in)), which, intuitively, denote the concept representing the set of cars, the inverse of the role produced_in, and the concept representing those individuals (e.g., countries) where something is produced.
Ground terms, i.e., terms without variables, are called expressions, and the set of expressions is denoted by τ DL-Lite R (S).The terms in the example above are all expressions.
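The notions of term and expression can be made concrete with a small sketch. The class names `Name`, `Var`, `Inv`, and `Exists` and the helper `is_expression` are hypothetical choices made here for illustration; the only constraints taken from the text are the two operators and the ban on terms of the form Inv(Inv(e)).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Name:
    """A symbol from the alphabet S."""
    symbol: str


@dataclass(frozen=True)
class Var:
    """A symbol from the alphabet V."""
    symbol: str


@dataclass(frozen=True)
class Inv:
    """Inverse operator; nesting Inv directly inside Inv is rejected."""
    arg: "Term"

    def __post_init__(self) -> None:
        if isinstance(self.arg, Inv):
            raise ValueError("terms of the form Inv(Inv(e)) are not allowed")


@dataclass(frozen=True)
class Exists:
    """Projection operator."""
    arg: "Term"


Term = Union[Name, Var, Inv, Exists]


def is_expression(term: Term) -> bool:
    """A term is an expression when it is ground, i.e., contains no variable."""
    if isinstance(term, Name):
        return True
    if isinstance(term, Var):
        return False
    return is_expression(term.arg)


# The terms of Example 2, plus one non-ground term.
car = Name("Car")
range_of_produced_in = Exists(Inv(Name("produced_in")))
open_term = Exists(Inv(Var("x")))
```

Here `range_of_produced_in` is an expression, while `open_term` is a term but not an expression because it mentions the variable x.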
A Hi(DL-Lite_R)-atom, or simply atom, over the alphabets S and V for Hi(DL-Lite_R) is a statement of the form p_1(e_1, e_2) or p_2(e_1, e_2, e_3), where p_1/2 and p_2/3 belong to MP(DL-Lite_R), and e_1, e_2, e_3 ∈ τ_{DL-Lite_R}(S, V). Note that Hi(DL-Lite_R)-atoms (and even terms) are not required to respect the proviso we have introduced before to limit the usage of predicates from OP(DL-Lite_R) and MP(DL-Lite_R) so that through them it is possible to express (only) DL-Lite_R knowledge bases. For instance, Inst_C(C, D) and Isa_C(C, E) are both legal Hi(DL-Lite_R)-atoms, that is, differently from DL-Lite_R we can instantiate a concept with another concept. If X is a subset of V, a is an atom, and all variables appearing in a belong to X, then a is called an X-atom.
Ground Hi(DL-Lite R )-atoms, i.e., Hi(DL-Lite R )-atoms without variables, are called Hi(DL-Lite R )-assertions, or simply assertions.Thus, an assertion is simply an application of a meta-predicate to a set of expressions, which intuitively means that an assertion is an axiom that predicates over a set of individuals, concepts, or roles.
A Hi(DL-Lite_R) KB over S is a finite set of Hi(DL-Lite_R)-assertions over S. To agree with the usual terminology of DLs, we use the term TBox to denote a set of Isa_C, Isa_R, Disj_C, and Disj_R assertions, also called intensional assertions, and the term ABox to denote a set of Inst_C and Inst_R assertions, also called extensional assertions. Obviously, a DL-Lite_R KB is also (a special case of) a Hi(DL-Lite_R) KB.
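As a rough illustration of this terminology, a KB can be stored as a set of (meta-predicate, arguments) pairs and split by meta-predicate. The `Assertion` type, the `split_kb` helper, and the sample assertions (drawn from the car domain used in this paper) are assumptions of the sketch, and expressions are simplified to plain strings.

```python
from typing import NamedTuple, Set, Tuple

INTENSIONAL = {"Isa_C", "Isa_R", "Disj_C", "Disj_R"}
EXTENSIONAL = {"Inst_C", "Inst_R"}


class Assertion(NamedTuple):
    predicate: str          # a meta-predicate name
    args: Tuple[str, ...]   # expressions, simplified here to strings


def split_kb(kb: Set[Assertion]) -> Tuple[Set[Assertion], Set[Assertion]]:
    """Return (TBox, ABox) according to the meta-predicate of each assertion."""
    tbox = {a for a in kb if a.predicate in INTENSIONAL}
    abox = {a for a in kb if a.predicate in EXTENSIONAL}
    return tbox, abox


KB = {
    Assertion("Isa_C", ("Ford", "Car")),
    Assertion("Isa_C", ("1973_FALCON_XB_GT_COUPE", "Ford")),
    Assertion("Inst_C", ("1973_FALCON_XB_GT_COUPE", "Coupe")),
    Assertion("Inst_C", ("INTERCEPTOR", "1973_FALCON_XB_GT_COUPE")),
}

tbox, abox = split_kb(KB)

# Names used as a concept in the TBox and as an instance in the ABox.
concepts_in_tbox = {arg for a in tbox for arg in a.args}
instances_in_abox = {a.args[0] for a in abox if a.predicate == "Inst_C"}
meta_modeled = concepts_in_tbox & instances_in_abox
```

The name 1973_FALCON_XB_GT_COUPE ends up in `meta_modeled`: it is a subconcept of Ford in the TBox and an instance of Coupe in the ABox, which is exactly what a plain DL-Lite_R KB cannot express.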

Example 3.
In Figure 4 we provide a graphical representation of a Hi(DL-Lite_R) KB modeling (a portion of) the domain of cars that we use throughout the paper in our examples. The ontology is drawn in Graphol, a diagrammatic language that allows OWL ontologies to be graphically specified [25,26]. In Graphol, as in ER diagrams, concepts and roles are denoted by rectangles and diamonds, respectively, and solid directed arrows represent inclusions. The full hexagon denotes at the same time the union of the concepts connected to it through a dashed line and the disjointness of such concepts. Finally, each octagon represents an individual, and each arrow labeled instanceOf from an element A to an element B specifies that A is an instance of B. In words, the ontology is saying that Coupe and Sedan are specific types of cars (i.e., they are subconcepts of CarType), and that they are disjoint. In addition, 1973_FALCON_XB_GT_COUPE is an instance of Coupe (in this respect it can also be seen as an individual) and at the same time it is a concept that specializes Ford (i.e., it is a subconcept of Ford) and has INTERCEPTOR as an instance. Finally, Ford is a subconcept of Car, and s_1 is an instance of Sedan.
Below we provide the above ontology through formulas, using the higher-order notation given in this section:

Notice how the second assertion exploits the meta-modeling capabilities of Hi(DL-Lite_R), i.e., it treats the concept 1973_FALCON_XB_GT_COUPE as an individual instance of the concept Coupe, rather than as a concept (which is instead the way the penultimate assertion refers to it).

The semantics of Hi(DL-Lite_R) is based on the notion of interpretation structure. An interpretation structure is a triple Σ = ⟨∆, I_c, I_r⟩ where (see Figure 5):

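• ∆ is a non-empty, possibly countably infinite, set;
• I_c is a function that maps each d ∈ ∆ into a subset of ∆;
• I_r is a function that maps each d ∈ ∆ into a subset of ∆ × ∆.

In other words, Σ treats every element of ∆ simultaneously as an individual, as a unary relation (i.e., a concept, through I_c), and as a binary relation (i.e., a role, through I_r).

We now turn our attention to the interpretation of terms in Hi(DL-Lite_R). To interpret non-ground terms, we need assignments over interpretations, where an assignment µ over ⟨Σ, I_o⟩ is a function µ : V → ∆. Given an interpretation I = ⟨Σ, I_o⟩ and an assignment µ over I, the interpretation of terms is specified by the function $(\cdot)^{I_o,\mu} : \tau_{DL\text{-}Lite_R}(S, V) \rightarrow \Delta$ defined as follows:

• if e ∈ S then $e^{I_o,\mu} = e^{I_o}$;
• if e ∈ V then $e^{I_o,\mu} = \mu(e)$;
• $op(e)^{I_o,\mu} = op^{I_o}(e^{I_o,\mu})$.

Finally, we define the semantics of atoms, by defining the notion of satisfaction of an atom with respect to an interpretation I and an assignment µ over I as follows:

• $I, \mu \models \mathit{Inst}_C(e_1, e_2)$ if $e_1^{I_o,\mu} \in (e_2^{I_o,\mu})^{I_c}$;
• $I, \mu \models \mathit{Inst}_R(e_1, e_2, e_3)$ if $\langle e_1^{I_o,\mu}, e_2^{I_o,\mu} \rangle \in (e_3^{I_o,\mu})^{I_r}$;
• $I, \mu \models \mathit{Isa}_C(e_1, e_2)$ if $(e_1^{I_o,\mu})^{I_c} \subseteq (e_2^{I_o,\mu})^{I_c}$;
• $I, \mu \models \mathit{Isa}_R(e_1, e_2)$ if $(e_1^{I_o,\mu})^{I_r} \subseteq (e_2^{I_o,\mu})^{I_r}$;
• $I, \mu \models \mathit{Disj}_C(e_1, e_2)$ if $(e_1^{I_o,\mu})^{I_c} \cap (e_2^{I_o,\mu})^{I_c} = \emptyset$;
• $I, \mu \models \mathit{Disj}_R(e_1, e_2)$ if $(e_1^{I_o,\mu})^{I_r} \cap (e_2^{I_o,\mu})^{I_r} = \emptyset$.

A Hi(DL-Lite_R) KB H is satisfied by I if all the assertions in H are satisfied by I (we do not need to mention assignments here, since all assertions in H are ground). As usual, the interpretations I satisfying H are called the models of H. A Hi(DL-Lite_R) KB H is satisfiable if it has at least one model.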
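To make satisfaction of ground atoms tangible, here is a minimal sketch over a finite interpretation. It assumes the usual reading of the meta-predicates from [18] (membership for Inst, set inclusion for Isa, empty intersection for Disj), restricts arguments to names (no Inv or Exists), and uses hypothetical identifiers (`I_C`, `I_R`, `I_O`, `satisfies`) and a made-up domain.

```python
from typing import Dict, Set, Tuple

# I_c and I_r: every domain element has a concept extension and a role
# extension; elements not listed have empty extensions.
I_C: Dict[str, Set[str]] = {
    "d_car": {"d_interceptor"},
    "d_ford": {"d_interceptor"},
    "d_falcon": {"d_interceptor"},
    "d_coupe": {"d_falcon"},  # a domain element used as an instance
}
I_R: Dict[str, Set[Tuple[str, str]]] = {
    "d_produced_in": {("d_interceptor", "d_australia")},
}
# I_o restricted to names: each name denotes one domain element.
I_O: Dict[str, str] = {
    "Car": "d_car", "Ford": "d_ford", "Coupe": "d_coupe",
    "1973_FALCON_XB_GT_COUPE": "d_falcon",
    "INTERCEPTOR": "d_interceptor", "AUSTRALIA": "d_australia",
    "produced_in": "d_produced_in",
}


def ext_c(d: str) -> Set[str]:
    return I_C.get(d, set())


def ext_r(d: str) -> Set[Tuple[str, str]]:
    return I_R.get(d, set())


def satisfies(predicate: str, *names: str) -> bool:
    """Check a ground atom whose arguments are all names."""
    d = [I_O[n] for n in names]
    if predicate == "Inst_C":
        return d[0] in ext_c(d[1])
    if predicate == "Inst_R":
        return (d[0], d[1]) in ext_r(d[2])
    if predicate == "Isa_C":
        return ext_c(d[0]) <= ext_c(d[1])
    if predicate == "Isa_R":
        return ext_r(d[0]) <= ext_r(d[1])
    if predicate == "Disj_C":
        return not (ext_c(d[0]) & ext_c(d[1]))
    if predicate == "Disj_R":
        return not (ext_r(d[0]) & ext_r(d[1]))
    raise ValueError(f"unknown meta-predicate: {predicate}")
```

The element `d_falcon` appears both inside the concept extension of `d_coupe` and as the owner of its own concept extension, mirroring how the same object is an individual and a concept at once.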

Mapping-Based Knowledge Bases
In a traditional OBDA system, the axioms concerning the intensional level of the representation are stated once and for all at design time.This is a reasonable assumption in many cases, for example when the ontology is built from scratch for a specific application.However, there are application domains where, with the goal of achieving a higher level of flexibility, it is convenient to build on-the-fly the KB directly from a set of data sources, through suitable mappings that insist both on extensional information (as usual in OBDA) and intensional information.The result is that all the axioms (not only the ones relative to the extensional level, like in current OBDA systems) of the knowledge base are defined by mappings to such data sources.This is exactly the idea behind the notion of mapping-based knowledge base (MKB), that we define in detail in this section.
In what follows, we assume that the data sources are structured as a relational database. We observe that this does not hamper the generality of our work, since existing data federation tools support the designer in wrapping a set of heterogeneous data sources so as to present them as a single relational database. In addition, we assume that the relational data sources store directly the symbols in S and in particular the names of elements used in the MKB. In other words, here we ignore the problem of the impedance mismatch between sources that store "data values" and the MKB that contains ontology elements [20]. Note, however, that all the results presented in this paper can be extended to the case where the impedance mismatch is dealt with as in [20].

Definition 1. A Hi(DL-Lite_R) mapping-based knowledge base (MKB) is a pair K = ⟨DB, M⟩ such that:

• DB is a relational database;
• M is a mapping, i.e., a set of mapping assertions, each one of the form $\Phi(\vec{x}) \leadsto \psi$, where Φ is an arbitrary FOL query over DB of arity n ≥ 0 with free variables $\vec{x} = \langle x_1, \ldots, x_n \rangle$, and ψ is an X-atom in DL-Lite_R, with X = {x_1, . . ., x_n}.

Note that when the arity n of Φ is 0, then Φ is a Boolean query and ψ is a ground atom. In particular, when Φ is the query true, the mapping assertion $\Phi(\vec{x}) \leadsto \psi$ corresponds to the Hi(DL-Lite_R) assertion constituted by the ground atom ψ. In this way, we can easily specify through M also Hi(DL-Lite_R) assertions that are "static", i.e., that do not depend on the current source database instance, and thus we do not need to include a different component in the formalization of an MKB for such static assertions.
We now turn to the semantics of a Hi(DL-Lite_R) MKB K = ⟨DB, M⟩. We start by defining when an interpretation satisfies an assertion in M with respect to the source data DB. To this end, we make use of the notion of ground instance of an atom and the notion of answer to a query over DB. Let ψ be an X-atom with X = {x_1, . . ., x_n}, and let $\vec{v}$ be a tuple of arity n with values from DB. Then the ground instance $\psi[\vec{x}/\vec{v}]$ of ψ is the formula obtained by substituting every occurrence of x_i with v_i (for i ∈ {1, . . ., n}) in ψ. If DB is a relational database, and Φ is a query over DB, we write ans(Φ, DB) to denote the set of answers to Φ over DB. With this notion at hand, we are ready to present the following definition.

Definition 2. An interpretation I satisfies a mapping assertion $\Phi(\vec{x}) \leadsto \psi$ with respect to a database DB if, for every tuple of values $\vec{v}$ ∈ ans(Φ, DB), the ground atom $\psi[\vec{x}/\vec{v}]$ is satisfied by I. The interpretation I is called a model of K = ⟨DB, M⟩ if it satisfies every assertion in M with respect to DB.

We now present an example of MKB, with the goal of illustrating how such a notion can capture real-world situations by introducing a notable flexibility in modeling the domain of interest.

M1 asserts that for every tuple t of the T-CarTypes table, the value appearing in the second column of t denotes a subconcept of the concept denoted by the value in the third column of the same tuple. Thus, for example, considering the fifth tuple t_5 of T-CarTypes, M1 states that 1973 MUSTANG MACH 1 is a subconcept of the concept Ford. M2 asserts that every value appearing in the third column of T-CarTypes is a subconcept of Car. For example, by referring to t_5 again, such tuple states that Ford is a subconcept of Car. Analogously, M3 asserts that every value appearing in the fourth column of T-CarTypes is a subconcept of CarType. For example, t_5 states that Coupe is a subconcept of CarType. M4 (resp., M5) asserts that the values appearing in the fifth column (resp., second column) of T-CarTypes denote concepts that are pairwise disjoint. M6-M9 assert properties of the relation produced_in, between the concepts Car and ProdCountry. M10 populates the relation produced_in, and M11 does the same for the concept ProdCountry. Mapping M12 exploits the meta-modeling capabilities of Hi(DL-Lite_R) and relates the different car models to their specific type. For example, looking at tuple t_5 again, we can infer by M12 that 1973 FALCON XB GT COUPE is an instance of the concept Coupe. Note that M1 asserted that 1973 FALCON XB GT COUPE is a concept, and therefore, we are taking advantage of the possibility provided by Hi(DL-Lite_R) of defining a concept to be an instance of another concept (a meta-concept). Finally, M13 allows us to correctly assign the instances stored in the T-Cars table to the concepts corresponding to the different car models. For example, through this mapping we can infer that the "Mad Max" police car INTERCEPTOR is an instance of 1973 FALCON XB GT COUPE (see the second tuple of T-Cars), the famous car ELEANOR of the movie "Gone in 60 Seconds" is an instance of the concept 1973 MUSTANG MACH 1 (third tuple), and the "Supercar" KITT is an instance of 1982 PONTIAC FIREBIRD (fourth tuple).
We hope that the above example clarifies the potential of MKBs in acquiring ontology axioms in a flexible way.In particular, let us observe how the domain ontology is dynamically built through the mapping, even though no information about the different types of cars and the different models produced by the motor companies was available at design time.Indeed, the mappings in M retrieve at run-time both intensional and extensional knowledge from the current database instance.Suppose, for example, that a motor company, say GM, decides to produce cars of new model, say 1967 CADILLAC ELDORADO.Given the structure of the database and its intended usage, the natural thing to do for the organization is to add suitable tuples (e.g., Mod6, 1967 CADILLAC ELDORADO, GM, Coupe ) in the T-CarTypes table.In our approach, the new information is automatically detected at run-time by the mappings in M and correctly introduced in the ontology.So, instead of manually changing the ontology and "re-compiling it" at design time, the new concept is dynamically captured at run-time.
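The dynamic behavior described above can be imitated with a small sketch built on Python's standard `sqlite3` module. The table schemas, column names, SQL queries, and the `retrieve` helper are all assumptions made for illustration (they are not the paper's mapping assertions), and the sketch materializes the assertions for clarity, whereas the approach in this paper keeps them virtual.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T_CarTypes (Code TEXT, Model TEXT, Company TEXT, CarType TEXT);
CREATE TABLE T_Cars (CarName TEXT, Model TEXT);
INSERT INTO T_CarTypes VALUES
    ('Mod1', '1973_FALCON_XB_GT_COUPE', 'Ford', 'Coupe');
INSERT INTO T_Cars VALUES ('INTERCEPTOR', '1973_FALCON_XB_GT_COUPE');
""")

# Each mapping assertion: (source query, meta-predicate, argument template).
# In a template, an int picks a column of the answer tuple and a str is a
# constant name of the ontology.
MAPPING = [
    ("SELECT Model, Company FROM T_CarTypes", "Isa_C", (0, 1)),   # model is-a company
    ("SELECT Company FROM T_CarTypes", "Isa_C", (0, "Car")),      # company is-a Car
    ("SELECT Model, CarType FROM T_CarTypes", "Inst_C", (0, 1)),  # model instance of type
    ("SELECT CarName, Model FROM T_Cars", "Inst_C", (0, 1)),      # car instance of model
]


def retrieve(connection, mapping):
    """Ground the right-hand side of every mapping assertion with each
    answer tuple of its source query."""
    assertions = set()
    for sql, predicate, template in mapping:
        for row in connection.execute(sql):
            args = tuple(row[t] if isinstance(t, int) else t for t in template)
            assertions.add((predicate, args))
    return assertions


before = retrieve(conn, MAPPING)

# A new model is added to the source; no change to the mapping is needed.
conn.execute(
    "INSERT INTO T_CarTypes VALUES ('Mod6', '1967_CADILLAC_ELDORADO', 'GM', 'Coupe')"
)
after = retrieve(conn, MAPPING)
new_axioms = after - before
```

After the insertion, `new_axioms` contains both intensional assertions (two Isa_C axioms) and an extensional one (an Inst_C fact), all obtained from a single new tuple.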
In Figure 6, using the Graphol language, we provide the graphical representation of the ontology dynamically constructed from the database D used in the example. In addition to the Graphol constructs already used in Figure 4, in Figure 6 we also use a diamond to denote a role (namely produced_in), and a blank and a full box connected to such role through a dashed line in order to denote its domain and range, respectively. Moreover, the solid double-arrowed line denotes a mutual inclusion between concepts (i.e., an equivalence). By using it, we say, for instance, that Car is equivalent to the domain of produced_in, that is, every car is produced somewhere and only cars are produced (somewhere). The same holds for the range of produced_in (cf. mapping assertions M6-M9).

We also want to add that, although Example 4 does not show it, our framework allows variables to be used inside operators, in the right-hand side of mapping assertions. This is useful, in particular, to extract knowledge from the database catalog. For example, if FK is the database table storing information about foreign keys between binary tables, and such binary tables correspond to roles in the ontology, the mapping assertion FK(x, 2, y, 1) ⇝ Isa_C(Exists(Inv(x)), y) allows us to transfer the foreign key property to the level of the ontology, by correctly representing every foreign key as an inclusion between the corresponding roles.

Queries
In this section we describe the class of queries that we consider in this paper. We start by introducing "query atoms". Intuitively, a query atom is an atom constituted by a meta-predicate applied to a set of arguments, where each argument is either an expression or a variable. The precise definition relies on the notion of q-terms: a q-term is any element of the set τ_{DL-Lite_R}(S) ∪ V. Thus a q-term is either an expression in DL-Lite_R or a variable. In other words, we do not allow for non-ground terms in queries, except for variables themselves. We are now ready to define the notion of query atom. A query atom is an atom constituted by the application of a meta-predicate in MP(DL-Lite_R) to a set of q-terms. A query atom is called ground if no variable occurs in it and is called an instance-query atom if its meta-predicate is Inst_C or Inst_R. The definition of the syntax of the class of queries that we are interested in is as follows.

Definition 3. A higher-order conjunctive query (HCQ) is an expression of the form q(x_1, . . ., x_n) ← a_1, . . ., a_m, where q, called the query predicate, is a symbol not in S ∪ V, n is the arity of the query, every a_i is a (possibly non-ground) query atom, and all variables x_1, . . ., x_n belong to V and occur in some a_j. The variables x_1, . . ., x_n are called the free variables (or distinguished variables) of the query, while the other variables occurring in a_1, . . ., a_m are called existential variables. A higher-order union of conjunctive queries (HUCQ) is a set of HCQs of the same arity with the same query predicate.
Notice that an HCQ corresponds to an HUCQ formed by a single query. An HCQ is called Boolean if it has no free variables. The same holds for an HUCQ.
An HCQ (HUCQ) constituted by instance-query atoms only is called an instance HCQ or IHCQ (IHUCQ).
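A possible in-memory representation of these query classes is sketched below. The types `QueryAtom` and `HCQ`, the convention that variables are strings starting with `?`, and the two checks are assumptions made for illustration; q-terms are simplified to plain strings.

```python
from typing import NamedTuple, Tuple

INSTANCE_PREDICATES = {"Inst_C", "Inst_R"}


class QueryAtom(NamedTuple):
    predicate: str          # a meta-predicate name
    args: Tuple[str, ...]   # q-terms: names, or variables written "?x"


class HCQ(NamedTuple):
    head: Tuple[str, ...]          # free variables
    body: Tuple[QueryAtom, ...]    # conjunction of query atoms


def is_var(term: str) -> bool:
    return term.startswith("?")


def is_well_formed(query: HCQ) -> bool:
    """Every free variable must be a variable occurring in some body atom."""
    body_vars = {t for atom in query.body for t in atom.args if is_var(t)}
    return all(is_var(x) and x in body_vars for x in query.head)


def is_ihcq(query: HCQ) -> bool:
    """An instance HCQ uses only the meta-predicates Inst_C and Inst_R."""
    return all(atom.predicate in INSTANCE_PREDICATES for atom in query.body)


# Instances of Ford of type Coupe that were produced in Australia.
fords_from_australia = HCQ(
    head=("?x",),
    body=(
        QueryAtom("Inst_C", ("?x", "Ford")),
        QueryAtom("Inst_C", ("?x", "?y")),
        QueryAtom("Inst_C", ("?y", "Coupe")),
        QueryAtom("Inst_R", ("?x", "AUSTRALIA", "produced_in")),
    ),
)

# Asking for subconcepts of Car is a legal HCQ but not an IHCQ.
subconcepts_of_car = HCQ(head=("?x",), body=(QueryAtom("Isa_C", ("?x", "Car")),))
```

The first query uses the variable `?y` in both an instance position and a concept position, which is the higher-order feature that IHCQs retain.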
We now turn our attention to the semantics of queries. Let I be an interpretation and µ an assignment over I. A Boolean HCQ q of the form q ← a_1, . . ., a_n is satisfied in I, µ if every query atom a_i is satisfied in I, µ.
Given a Boolean HCQ q and a Hi(DL-Lite_R) KB (or MKB) K, we say that q is logically implied by K (denoted by K ⊨ q) if for each model I of K there exists an assignment µ such that q is satisfied by I, µ.
Given a non-Boolean HCQ q of the form q(e 1 , . . ., e n ) ← a 1 , . . ., a m , a grounding substitution of q is a substitution θ such that e 1 θ, . . ., e n θ are ground terms.We call e 1 θ, . . ., e n θ a grounding tuple.Definition 4. Given a Hi(DL-Lite R ) KB K and a non-Boolean HCQ q of the form q(e 1 , . . ., e n ) ← a 1 , . . ., a m , The set of certain answers to q in K, denoted by cert(q, K), is the set of grounding tuples e 1 θ, . . ., e n θ that make the Boolean query q θ ← a 1 θ, . . ., a m θ logically implied by K. (i) Compute the instances of Ford that were produced in Australia and are of type Coupe; note that an instance x of Ford of certain type T is an instance of a car model y such that y is an instance of T (this in fact holds for all instances of Car, not only for instances of Ford).It follows that the correct query expression is as follows: q(x) ← Inst C (x, Ford), Inst C (x, y), Inst C (y, Coupe), Inst R (x, AUSTRALIA, produced_in).
(ii) Compute the pairs of cars, one of type Coupe, and one of type Sedan that were produced in the same country:

(iii) Compute all the concepts in the ontology to which a given object (e.g., Eleanor) belongs:

(iv) Compute all the concepts in the ontology whose instances are the concepts to which Eleanor or Kitt belong:

We observe that all queries in the above example are actually IHUCQs. This is the class of queries that we deal with in the next section.

Query Answering
In this section we study how to answer IHUCQs over Hi(DL-Lite R ) MKBs. In the following we consider only consistent MKBs, i.e., MKBs that have at least one model. This is indeed not a limitation, considering that consistency of an MKB can be checked through query answering, by means of methods similar to those used for checking consistency of DL-Lite KBs [23].
Before delving into the details of our technique, we introduce some useful definitions. In what follows, we refer to an MKB K = ⟨DB, M⟩.

• We denote by M A the set of assertions contained in M having either Inst C or Inst R as predicate in their right-hand side.
• We denote by M T the set M \ M A , that is, the set of assertions contained in M having any of Isa C , Isa R , Disj C , Disj R as predicate in their right-hand side.
• M is called an instance-mapping if Inst C and Inst R are the only predicates that appear in the right-hand side of the mapping assertions in M.
• We say that e occurs as a concept argument in the atoms Inst C (e′, e), Isa C (e, e′), Isa C (e′, e), Disj C (e, e′), and Disj C (e′, e).
• We say that e occurs as a role argument in the atoms Inst R (e′, e″, e), Isa R (e, e′), Isa R (e′, e), Disj R (e, e′), and Disj R (e′, e).
• A DL atom is an atom of the form N(e) or N(e 1 , e 2 ), where N is a name and e, e 1 , e 2 are either variables or names.
• An extended CQ (ECQ) is an expression of the form q(x 1 , . . . , x n ) ← a 1 , . . . , a m such that x 1 , . . . , x n belong to V, a 1 , . . . , a m is a conjunction of atoms, each atom a j (with 1 ≤ j ≤ m) is either a DL atom or an instance-query atom (i.e., an atom whose meta-predicate is Inst C or Inst R ), and each x i (with 1 ≤ i ≤ n) occurs in at least one a j . An extended UCQ (EUCQ) is a union of ECQs.
• Given a TBox T (specified in higher-order style syntax, cf. Section 2), we define Concepts(T ) = {e, Exists(e′), Exists(Inv(e′)) | e occurs as a concept argument in T and e′ occurs as a role argument in T }, and Roles(T ) = {e, Inv(e) | e occurs as a role argument in T }.
• Given a mapping M and a database DB, Retrieve(M, DB) denotes the Hi(DL-Lite R ) KB H defined as follows: where t is a tuple of constants and x is a tuple of variables having the same arity.
• Given an instance-mapping M and an ABox A, we say that A is retrievable through M if there exists a database DB such that A = Retrieve(M, DB).
The query answering technique we are about to present is based on the following four main steps.

1. In the first step, all intensional assertions are gathered by accessing the sources and using the mapping, in particular the M T portion. This way, a DL-Lite R TBox T is available for the subsequent steps.
2. In the second step, the input query is rewritten on the basis of T , using the algorithm PerfectRef presented in [23]. In fact this is not just a trivial call to PerfectRef , since we need to transform the input IHUCQ, which cannot be given directly as input to PerfectRef , into an EUCQ. Similarly, we also need to translate the rewriting produced by PerfectRef into a form that is compatible with the syntax used in the mapping (which is required by the following steps of our algorithm).
3. In the third step, the query obtained by the second step is unfolded using the mapping, in particular the M A portion, so as to obtain a query expressed over the alphabet of the source schema.
4. In the fourth step, the query obtained by the third step is evaluated over the source data, so as to obtain the final result.
In Section 5.1 we deal with the second step of our technique. In particular, we study the problem of computing a perfect rewriting of an IHUCQ over a DL-Lite R TBox. We recall that given a query q, a perfect rewriting of q with respect to a TBox T is a query q′ such that, for every ABox A, cert(q, T ∪ A) = cert(q′, A) [23]. That is, to obtain the certain answers to q in T ∪ A, it is sufficient to evaluate q′ over A seen as a database (such an evaluation indeed returns the certain answers to q′ in A). As said, in our rewriting algorithm we first retrieve a TBox through the mapping (step 1). We thus adapt the above notion to the context of MKBs as follows. Given an MKB K = ⟨DB, M⟩, let T = Retrieve(M T , DB). A perfect rewriting of q with respect to T and the instance-mapping M A is a query q′ such that, for every ABox A retrievable through M A , cert(q, T ∪ A) = cert(q′, A).
After the detailed description of the second step above, in the subsequent Section 5.2, we present the complete query answering algorithm for MKBs based on our perfect rewriting technique.

Query Rewriting
The basic idea of our technique is to reduce the computation of the perfect rewriting of an IHUCQ over a DL-Lite R TBox to the computation of the perfect rewriting of a UCQ over a DL-Lite R TBox, which can then be done by using the algorithm PerfectRef described in [23].
To this aim, we first transform the IHUCQ into a standard UCQ, actually an EUCQ. This is realized through a first partial grounding of the query, using the function PMG, and then through the functions Normalize and τ. In particular, the function PMG eliminates the meta-variables, i.e., the variables occurring as a concept or as a role argument, from the query, Normalize replaces Inst C atoms whose second argument is of the form Exists(e) with Inst R atoms (i.e., the instance of the domain or the range of a role expression e is reformulated as an instance of e), and τ transforms Inst C and Inst R atoms into DL atoms.
Afterwards, the query resulting from the perfect rewriting of the EUCQ is transformed back into an IHUCQ, by the functions Denormalize and τ − .This is because the third step of the query answering algorithm assumes the query to be an IHUCQ.
We now describe in more detail the functions PMG, Normalize, Denormalize, τ and τ − . If q, q′ are two IHCQs, and T is a TBox, then q′ is a partial metagrounding of q with respect to T if q′ = σ(q), where σ is a partial substitution of the meta-variables of q with the expressions occurring in T such that, for each meta-variable x of q, either σ(x) = x or:
- if x occurs in a concept position in q, then σ(x) ∈ Concepts(T );
- if x occurs in a role position in q, then σ(x) ∈ Roles(T ).
Given an IHCQ q and a TBox T , the function PMG applied to q and T computes the set of all partial metagroundings of q with respect to T , i.e., it computes the IHUCQ Q = {q′ | q′ is a partial metagrounding of q w.r.t. T }. When applied to an IHUCQ Q and a TBox T , the function PMG computes the IHUCQ ⋃ q∈Q PMG(q, T ).
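The enumeration performed by PMG can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: queries are encoded as a head tuple plus a tuple of atoms, variables are strings starting with "?", and the sets Concepts(T ) and Roles(T ) are passed in as plain Python sets. When a meta-variable occurs in both a concept and a role position, the sketch requires its image to belong to both sets.

```python
from itertools import product

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def meta_variables(body):
    """Variables in concept position (InstC, 2nd arg) or role position (InstR, 3rd arg)."""
    in_concept, in_role = set(), set()
    for atom in body:
        if atom[0] == "InstC" and is_var(atom[2]):
            in_concept.add(atom[2])
        if atom[0] == "InstR" and is_var(atom[3]):
            in_role.add(atom[3])
    return in_concept, in_role

def pmg(head, body, concepts, roles):
    """All partial metagroundings of the query head <- body."""
    in_concept, in_role = meta_variables(body)
    metas = sorted(in_concept | in_role)
    choices = []
    for x in metas:
        candidates = None
        if x in in_concept:
            candidates = set(concepts)
        if x in in_role:
            candidates = set(roles) if candidates is None else candidates & set(roles)
        # sigma(x) = x is always allowed, hence "partial" metagrounding.
        choices.append([x] + sorted(candidates, key=repr))
    result = []
    for combo in product(*choices):
        sigma = dict(zip(metas, combo))
        # q-terms are variables or ground expressions, so substitution is a lookup.
        new_head = tuple(sigma.get(t, t) if is_var(t) else t for t in head)
        new_body = tuple(
            (atom[0],) + tuple(sigma.get(t, t) if is_var(t) else t for t in atom[1:])
            for atom in body
        )
        result.append((new_head, new_body))
    return result

# Query (i) of the running example; the concept set below is a toy assumption.
body_i = (
    ("InstC", "?x", "Ford"),
    ("InstC", "?x", "?y"),
    ("InstC", "?y", "Coupe"),
    ("InstR", "?x", "AUSTRALIA", "produced_in"),
)
groundings = pmg(("?x",), body_i, {"Ford", "FALCON", "Coupe"}, {"produced_in"})
```

With three candidate concepts, the single meta-variable y yields four partial metagroundings: the original query and one query per concept. The substitution is applied to every occurrence of y, including the one in individual position.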
Example 6. Consider the MKB K = ⟨D, M⟩ given in Example 4, the TBox T dynamically constructed from D through M and represented in Figure 6, and the query (i) given in Example 5.
Notice that the notion of partial metagrounding is crucial for our rewriting method. Indeed, even if in Hi(DL-Lite R ) the set of expressions that can be constructed from a finite set of names occurring in the TBox is infinite, it suffices to ground the meta-variables of the query over a finite set of expressions only, as stated by the following lemma.

Lemma 1. If Q is an IHUCQ, and T is a TBox, then for every ABox A, cert(Q, T ∪ A) = cert(PMG(Q, T ), T ∪ A).
Thus, there exists a tuple t such that t ∈ cert(Q, T ∪ A) − cert(PMG(Q, T ), T ∪ A). This implies that there exists a model I for T ∪ A such that I ⊨ Q(t) and I ⊭ PMG(Q, T )(t). Let ∆ be the domain of I. We define below an interpretation I ↓ over the same domain ∆. For every d ∈ ∆: It is easy to see that I ↓ is a model for the KB T ∪ A. However, it is also straightforward to verify that I ↓ ⊨ Q(t) if and only if I ↓ ⊨ PMG(Q, T )(t), and since by hypothesis I ⊭ PMG(Q, T )(t), then I ↓ ⊭ PMG(Q, T )(t) as well. As a consequence, I ↓ ⊭ Q(t), which contradicts the hypothesis that t ∈ cert(Q, T ∪ A), thus proving the thesis.
We now turn our attention to the functions Normalize and Denormalize.
Let α be an instance atom. Normalize(α) returns a new atom defined as follows:
• if α = Inst C (e 1 , e 2 ) and e 2 has the form Exists(e′), where e′ is an expression which is not of the form Inv(e″), then Normalize(α) = Inst R (e 1 , _, e′), where _ denotes an existentially quantified variable;
• if α = Inst C (e 1 , e 2 ) and e 2 has the form Exists(Inv(e′)), where e′ is any expression, then Normalize(α) = Inst R (_, e 1 , e′).
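The two cases of Normalize given above translate into a short function. The following Python sketch covers only these two cases and, as an assumption of the sketch, returns every other atom unchanged. The tuple encoding and the use of the string "_" for the anonymous existential variable are likewise illustrative choices.

```python
# Assumed encoding: ("InstC", e1, e2) and ("InstR", e1, e2, e3) for atoms,
# ("Exists", e) and ("Inv", e) for operator expressions,
# "_" for an anonymous existentially quantified variable.

def normalize(atom):
    """Rewrite InstC atoms over Exists(...) into InstR atoms."""
    if atom[0] == "InstC":
        _, e1, e2 = atom
        if isinstance(e2, tuple) and e2[0] == "Exists":
            inner = e2[1]
            if isinstance(inner, tuple) and inner[0] == "Inv":
                # e1 is in the range of the role: it fills the second position.
                return ("InstR", "_", e1, inner[1])
            # e1 is in the domain of the role: it fills the first position.
            return ("InstR", e1, "_", inner)
    # Assumption of this sketch: all other atoms are left as they are.
    return atom
```

For instance, membership of x in Exists(produced_in) becomes the requirement that x is related through produced_in to some unnamed object.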
Then, let q be an IHCQ and M be an instance-mapping. Denormalize(q, M) is the IHUCQ Q defined inductively as follows:
• q ∈ Q;
• if q′ ∈ Q and q′ contains an atom α of the form Inst R (e 1 , _, e 2 ), and either Exists(e 2 ) occurs in M or Exists(x) (where x is a variable) occurs in M, then the query obtained from q′ by replacing α with the atom Inst C (e 1 , Exists(e 2 )) belongs to Q;
• if q′ ∈ Q and q′ contains an atom α of the form Inst R (_, e 1 , e 2 ), and either Exists(Inv(e 2 )) occurs in M or Exists(Inv(x)) (where x is a variable) occurs in M, then the query obtained from q′ by replacing α with the atom Inst C (e 1 , Exists(Inv(e 2 ))) belongs to Q;
• if q′ ∈ Q and q′ contains an atom α of the form Inst R (e 1 , e 2 , e 3 ) and either Inv(e 3 ) occurs in M or Inv(x) (where x is a variable) occurs in M, then the query obtained from q′ by replacing α with the atom Inst R (e 2 , e 1 , Inv(e 3 )) belongs to Q.
Finally, let Q be an IHUCQ and let M be a mapping. We define Denormalize(Q, M) as ⋃ q∈Q Denormalize(q, M).
We formally introduce below the functions τ and τ − , which transform IHUCQs into EUCQs and vice versa. Let q be an IHCQ and let T be a TBox. Then τ(q, T ) is the ECQ obtained from q as follows:
• each atom of q of the form Inst C (e 1 , e 2 ), such that e 2 ∈ Concepts(T ), is replaced with the atom e 2 (e 1 );
• each atom of q of the form Inst R (e 1 , e 2 , e 3 ), such that e 3 ∈ Roles(T ), is replaced with the atom e 3 (e 1 , e 2 ).

Let q be an ECQ and T be a TBox. Then τ − (q, T ) is the IHCQ obtained from q as follows:
• each atom of q of the form e 2 (e 1 ) is replaced with the atom Inst C (e 1 , e 2 );
• each atom of q of the form e 3 (e 1 , e 2 ) is replaced with the atom Inst R (e 1 , e 2 , e 3 ).
Then, given an IHUCQ Q, we define τ − (Q, T ) = {τ − (q, T ) | q ∈ Q}. An example of application of τ − (Q, T ) can be obtained by simply reversing the transformation shown in Example 7.
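The two translations can be sketched in Python as a pair of functions over lists of atoms. The encoding is an assumption of this sketch: a DL atom N(e) is written ("DL", N, e) and N(e 1 , e 2 ) is written ("DL", N, e1, e2), while instance-query atoms keep the tuple form used for Inst C and Inst R . The function names are hypothetical.

```python
def tau(body, concepts, roles):
    """Replace InstC / InstR atoms over known concepts / roles with DL atoms."""
    out = []
    for atom in body:
        if atom[0] == "InstC" and atom[2] in concepts:
            out.append(("DL", atom[2], atom[1]))           # e2(e1)
        elif atom[0] == "InstR" and atom[3] in roles:
            out.append(("DL", atom[3], atom[1], atom[2]))  # e3(e1, e2)
        else:
            out.append(atom)  # remaining instance-query atoms stay as they are
    return out

def tau_inverse(body):
    """Replace every DL atom with the corresponding InstC / InstR atom."""
    out = []
    for atom in body:
        if atom[0] == "DL" and len(atom) == 3:
            out.append(("InstC", atom[2], atom[1]))
        elif atom[0] == "DL" and len(atom) == 4:
            out.append(("InstR", atom[2], atom[3], atom[1]))
        else:
            out.append(atom)
    return out

# A metagrounded version of query (i); the concept and role sets are toy assumptions.
grounded_body = [
    ("InstC", "?x", "Ford"),
    ("InstC", "?x", "FALCON"),
    ("InstC", "FALCON", "Coupe"),
    ("InstR", "?x", "AUSTRALIA", "produced_in"),
]
dl_body = tau(grounded_body, {"Ford", "FALCON", "Coupe"}, {"produced_in"})
```

Here dl_body corresponds to the atoms Ford(x), FALCON(x), Coupe(FALCON), produced_in(x, AUSTRALIA), and applying tau_inverse to it gives back the original instance-query atoms.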
We can now formally define our algorithm for query rewriting. The algorithm takes as input an IHUCQ, a TBox and an instance-mapping, and returns a new IHUCQ. Given an IHUCQ Q and a TBox T , we denote by PerfectRef (Q, T ) the EUCQ returned by the query rewriting algorithm for DL-Lite R presented in [23] (actually, we consider a slight generalization of that algorithm, allowing for the presence of a ternary relation (Inst R ) in the query).
To provide an example of a complete execution of the algorithm RewriteIHUCQ(Q, T , M), let us now continue the rewriting described in Example 6 and in Example 7. Let us focus on the PerfectRef function, and apply it to the query (2) (which is contained in the set Q 2 of CQs). PerfectRef transforms the atoms of the query using TBox inclusions as rewriting rules, from right to left. For example, according to the inclusion FALCON ⊑ Ford, it rewrites the atom Ford(x) in query (2) into the atom FALCON(x) (intuitively, PerfectRef encodes in the rewriting the knowledge expressed by the ontology saying that to obtain instances of Ford one has to look also for instances of FALCON). In this way PerfectRef produces the following query and adds it to the set Q 3 (notice that the atom FALCON(x) was already present in the query, thus the effect of the rewriting in this case is simply dropping the first atom of the query):

q(x) ← FALCON(x), Coupe(FALCON), produced_in(x, AUSTRALIA). (3)

A similar reformulation is performed for all atoms whose predicate occurs in the right-hand side of a positive inclusion (provided that its arguments are not bound, as described in [23]). Among the queries returned by PerfectRef we consider in the following only query (3), which, as we will see later, will allow us to obtain the answer to the original query (the other queries returned by RewriteIHUCQ(Q, T , M) do not really contribute to the final answer in our example). Finally, RewriteIHUCQ(Q, T , M) applies τ − and Denormalize (which in this case is immaterial) to query (3), thus returning the query

q(x) ← Inst C (x, FALCON), Inst C (FALCON, Coupe), Inst R (x, AUSTRALIA, produced_in). (4)
The IHUCQ returned by RewriteIHUCQ(Q, T , M) constitutes a perfect rewriting of the query Q with respect to the TBox T and the mapping M, as formally stated by the following theorem.

Theorem 1. Let T be a TBox, let M be an instance-mapping and let Q be an IHUCQ. Then, for every ABox A that is retrievable through M, cert(Q, T ∪ A) = cert(RewriteIHUCQ(Q, T , M), A).
Proof. The proof follows from Definition 4, from Lemma 1, from the correctness of the algorithm PerfectRef [23], and from the fact that the functions Normalize, τ, τ − and Denormalize just perform equivalent transformations of the query.

Query Answering
We now provide an algorithm for query answering over MKBs, which makes use of the query rewriting technique presented in the previous subsection. As already said, our idea is to first compute a DL-Lite R TBox by evaluating the mapping assertions involving the predicates Isa C , Isa R , Disj C , Disj R over the database of the MKB; then, such a TBox is used to compute the perfect rewriting of the input IHUCQ.
To complete query answering, we have to also consider the mapping of the predicates Inst C and Inst R and reformulate the query thus obtained by replacing the above predicates with the FOL queries occurring in the corresponding mapping assertions (step 3 of our technique). In this way we obtain an FOL query expressed over the database. This second rewriting step, usually called unfolding, can be performed by the algorithm UnfoldDB presented in [20] (here, we assume that the algorithm UnfoldDB takes as input an EUCQ and an instance-mapping; this corresponds to considering a straightforward extension of the algorithm presented in [20] in order to deal with the presence of the ternary predicate Inst R ). In the following, given a mapping M and a database DB, we denote by DB M T the database constituted by every relation R of DB such that R occurs in M T . Furthermore, we define DB M A as the database DB − DB M T (i.e., DB M A is the portion of DB which is not involved in the mapping M T ). We are now ready to present our query answering algorithm.
The algorithm starts by retrieving (function Retrieve(M T , DB M T )) the TBox T from DB M T through the mapping M T . Then, it computes (function RewriteIHUCQ(Q, T , M A )) the perfect rewriting Q′ of the query with respect to the retrieved TBox and next computes (function UnfoldDB(Q′, M A )) the unfolding Q″ of such a query with respect to the mapping M A . Finally, it evaluates (function IntEval(Q″, DB M A )) the query over the database and returns the result of the evaluation.
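The control flow just described is a straight composition of four functions, which the following Python sketch makes explicit. It is only a skeleton under assumptions: the four steps are passed in as callables, since Retrieve, RewriteIHUCQ, UnfoldDB and IntEval are not implemented here, and all parameter names are hypothetical.

```python
def answer(query, mapping_t, mapping_a, db_t, db_a, *,
           retrieve, rewrite, unfold, evaluate):
    """Compose the four steps of query answering over an MKB.

    The keyword arguments stand for Retrieve, RewriteIHUCQ, UnfoldDB and
    IntEval respectively; they are supplied by the caller.
    """
    tbox = retrieve(mapping_t, db_t)               # step 1: gather the TBox
    rewritten = rewrite(query, tbox, mapping_a)    # step 2: perfect rewriting
    unfolded = unfold(rewritten, mapping_a)        # step 3: unfolding via M_A
    return evaluate(unfolded, db_a)                # step 4: evaluation on the data
```

Only the last call touches the extensional data, which is the observation on which the data complexity argument for the algorithm rests.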

Example 9. Let us consider again the MKB K and the query of Example 6. We notice that M T and M A consist of the mapping assertions M1-M9 and M10-M13 given in Example 4, respectively, and DB M T = DB M A coincides with the database D described in Figure 3. Thus, the TBox T returned by Retrieve(M T , DB M T ) coincides with the TBox described in Figure 6 (not considering InstanceOf arrows). Q′ is as discussed in Example 8. As said, among all queries in Q′ we consider only query (4). By unfolding it through UnfoldDB(Q′, M A ), we obtain (among other queries) the FOL query

More in detail, we have unfolded the atom Inst C (x, FALCON) using mapping M13, the atom Inst C (FALCON, Coupe) using mapping M12, and the atom Inst R (x, AUSTRALIA, produced_in) using mapping M10. Notice also that other queries are produced by the function UnfoldDB applied to the query (4) and the mapping M A (e.g., by unfolding Inst C (FALCON, Coupe) through M12). However, the one we showed is sufficient to obtain the answers to the original query. It is indeed easy to see that by evaluating the above query over DB M A (see Figure 3), we obtain the answer INTERCEPTOR, and this is the only answer to the query (i) given in Example 5 evaluated over the MKB K.
To prove correctness of the above algorithm, we first state the following property, whose proof immediately follows from the definition of Retrieve(M, DB) and the definition of model of a Hi(DL-Lite R ) MKB.

Lemma 2. Let K = ⟨DB, M⟩ be a Hi(DL-Lite R ) MKB and let H = Retrieve(M, DB). The sets of models of K and H coincide.
We also need the following additional property.

Lemma 3. Let M be an instance-mapping, and let Q be an IHUCQ. Then, for every database DB, cert(Q, ⟨DB, M⟩) = IntEval(UnfoldDB(Q, M), DB).
Proof. The proof follows from Definition 4 and by a natural, slight extension (which we leave to the reader) of the proof of correctness of the algorithm UnfoldDB shown in [20].
We are now ready to show the correctness of the algorithm Answer.

Theorem 2. Let K = ⟨DB, M⟩ be a Hi(DL-Lite R ) MKB, let Q be an IHUCQ, and let U be the set of tuples returned by Answer(Q, K). Then, cert(Q, K) = U.
Proof. The proof immediately follows from Theorem 1 and Lemmas 2 and 3.
Finally, from the algorithm Answer we are able to derive the following complexity results for query answering over Hi(DL-Lite R ) MKBs.

Theorem 3. Let K = ⟨DB, M⟩ be a Hi(DL-Lite R ) MKB, let Q be an IHUCQ and let t be a tuple of expressions. Deciding whether K ⊨ Q(t) is in AC⁰ with respect to the size of DB M A , is in PSPACE with respect to the size of K, and is NP-complete with respect to the size of Q(t).
Proof. To decide whether K ⊨ Q(t) we can execute Answer(Q(t), K). Then, membership in AC⁰ with respect to the size of DB M A follows from the fact that the only step of the algorithm Answer that depends on DB M A is IntEval(Q″, DB M A ), where Q″ is an FOL query, and from the fact that evaluating an FOL query over a database is in AC⁰ in data complexity. Membership in PSPACE with respect to the size of K follows from the fact that evaluating an FOL query is in PSPACE in combined complexity (whereas the other steps of the algorithm Answer are in PTIME with respect to the size of K). Finally, NP-completeness with respect to the size of Q(t) follows from Lemma 2, from the fact that computing Retrieve(M, DB) can obviously be done in constant time with respect to the size of Q(t), and from the fact that evaluating an IHUCQ over a Hi(DL-Lite R ) KB is NP-complete in query complexity.

Conclusions
In this paper we have investigated the issue of generating both the TBox and the ABox of a DL ontology on the fly from data stored in data sources through asserted mappings. The two main ingredients for obtaining such a degree of flexibility are (i) enriching the mapping language so as to extract from the data sources the knowledge about possible concepts and roles that are relevant to the domain of interest and (ii) relying on higher-order description logics which blur the distinction between concepts/roles at the intensional level and individuals at the extensional level.
The approach presented here can be useful in a number of scenarios. Although in this paper we have discussed a single example, we point out that such an example can actually be seen as a prototypical instance of so-called product databases, where one needs to model information about types and models of products (akin to the table T-CarTypes in Figure 3), as well as specific data about single products (like table T-Car in Figure 3). In applications with product databases, for example in the context of e-commerce, when new products become available, they are typically acquired by accessing catalogues, where the above-mentioned information about types and models of products is reported. In our framework, such catalogues are simply modelled as data sources with suitable mappings to the ontology (similarly to the mappings from the table T-CarTypes in Example 4), and such mappings are used to dynamically extend the ontology with the new classes, and the relationships between them, that are present in the catalogue.
We believe that the approach described in this paper can be a starting point for several investigations going beyond what we have presented here. For example, we may allow for the coexistence of multiple TBoxes within the same data sources and allow the user to select which TBox to load when querying the system, possibly depending on the query, much in the spirit of [27]. In addition, the user can in principle even compose on the fly the TBox to use when answering a query. Obviously, notions such as authorization views and consistency acquire an intriguing flavor in this setting. As for the former notion, the mechanisms described in this paper can be used to hide intensional knowledge for privacy purposes, thus extending the approaches aiming at filtering only extensional knowledge. As for consistency, an interesting idea is to allow for contradicting TBox axioms coming from the data sources to coexist as long as they are not used together when performing query answering. Other promising directions to pursue include handling possible inconsistencies in MKBs, in the spirit of [28], considering more sophisticated forms of mappings, such as the ones in the context of peer-to-peer data integration [29], conceiving the assertions coming from the mapping as updates on the current knowledge base (in the spirit of [30]), and exploiting the ideas presented here as a basis for the problem of acquiring knowledge graphs (see [31]) from existing data sources.

Figure 4. Example of meta-modeling in the Cars ontology.

Figure 5. Interpretation structures for Hi(DL-Lite R ).

An interpretation for S (simply called an interpretation, when S is clear from the context) over the interpretation structure Σ is a pair I = ⟨Σ, I o ⟩, where
• Σ = ⟨∆, I c , I r ⟩ is an interpretation structure, and
• I o is a function that maps:
  - each element of S to a single object in ∆; and
  - each element op ∈ OP(DL-Lite R ) to a function $op^{I_o} : \Delta \to \Delta$ that satisfies the conditions characterizing the operator op. In particular, the conditions for the operators in OP(DL-Lite R ) are as follows:
    * for each d 1 , d 2 ∈ ∆ such that $d_2 = Inv^{I_o}(d_1)$, we have that $d_2^{I_r} = (d_1^{I_r})^{-1}$, where $(d_1^{I_r})^{-1}$ is the inverse of the relation $d_1^{I_r}$, and
    * for each d 1 , d 2 ∈ ∆ such that $d_2 = Exists^{I_o}(d_1)$, we have that $d_2^{I_c} = \{o \mid \text{there exists } o' \text{ such that } \langle o, o' \rangle \in d_1^{I_r}\}$.

Figure 6. Representation of the Cars ontology.

These notions extend immediately to HUCQs.

Example 5. We illustrate some examples of HCQs that can be posed to the MKB K of Example 4.
Disj C /2, Disj R /2}, provided that Inv and Exists take only a role as argument, Inst C takes an individual as first argument and a concept as second argument, Inst R takes an individual as first and second argument and a role as third argument, Isa C and Disj C take two concepts as arguments, and Isa R and Disj R take two roles as arguments.