A Semantic Model for Enhancing Data-Driven Open Banking Services

Abstract: In current Open Banking services, the European Payment Services Directive (PSD2) allows the secure collection of bank customer information, on their behalf and with their consent, to analyze their financial status and needs. The PSD2 directive has led to a massive number of daily transactions between Fintech entities, which require the automatic management of the data involved, generally coming from multiple and heterogeneous sources and formats. In this context, one of the main challenges lies in defining and implementing common data integration schemes to easily merge these data into knowledge-base repositories, hence allowing data reconciliation and sophisticated analysis. In this sense, Semantic Web technologies constitute a suitable framework for the semantic integration of data that makes linking with external sources possible and enhances systematic querying. With this motivation, an ontology approach is proposed in this work to operate as a semantic data mediator in real-world open banking operations. According to semantic reconciliation mechanisms, the underpinning knowledge graph is populated with data involved in PSD2 open banking transactions, which are aligned with information from invoices. A series of semantic rules is defined in this work to show how the financial solvency classification of client entities and transaction concept suggestions can be inferred from the proposed semantic model.


Introduction
Open Banking services nowadays are powerful facilities in the Fintech sector for digitally sharing financial information with third parties, enabling customers to approve their transactions through secure open APIs. This approach promotes the rapid digital transformation of payment services, traditionally managed by bank entities, and the generation of digital platforms and companies that offer their customers payment and enhanced financial services. In this context, the European Payment Services Directive (PSD2) allows the secure collection of bank customer information, on their behalf and with their consent, with the purpose of analyzing their financial status and needs [1]. This new paradigm leads to massive daily transactions between Fintech entities, which require the automatic management of the data involved, which generally come from multiple and heterogeneous sources and formats.
In this regard, current and past initiatives like the Helix project (AEI-010500-2020-34 NextGeneration EU) and the Estrella European Initiative (IST-2004-027655) [2] are focused on developing the fundamental bases of a future online platform that democratizes the access of SMEs to financing and risk management tools, which today are only available to large companies. The objective of this platform is to stimulate economic activity and improve security in B2B transactions. The main goal in such projects is to comprehensively manage commercial risk operations with customers through the online contracting of credit insurance by invoice and credit insurance by debtor, as well as direct access to financing.
The main contributions of this work are as follows:
• An ontology is defined using the Web Ontology Language (OWL 2) [13], for the first time, for the semantic annotation of bank statement and invoice management operations in the context of PSD2 transactions. The proposal, called Open Banking Ontology (OBO), is used as a semantic data integration scheme for different information sources that enables data ingestion, querying, and reasoning in a structured and homogeneous way.
• According to the OBO definition, a semantic model is also developed for knowledge graph generation, which is populated by means of specific mapping functions for each data source. The resulting graph, comprising all the data instances, is integrated into a repository using the Resource Description Framework (RDF) [14], which allows all the data to be queried with a single language (SPARQL). A series of reconciliation mechanisms were tested with data from more than 70,000 invoices and 33,000 open banking customers through multiple PSD2 operations.
• The proposal is validated through several real-world use cases, including the intelligent matching of bank statements with their corresponding invoices. Moreover, a set of SWRL reasoning rules is also specified for classifying the financial solvency of clients, including an automatic generation of concepts that would facilitate conciliation.
The remainder of this paper is organized as follows. Background concepts are explained in the next section, which also includes a review of related work. In Section 3, the ontology design is detailed, including the software methods implemented to build the knowledge graph. Semantic reconciliation mechanisms and reasoning use cases are explained in Section 4. Finally, Section 5 contains the main conclusions and lines of future research. In addition, annexes with complementary information are included.

Background Concepts and Related Work
Before defining the semantic approach, this section is devoted to explicating the basic concepts of semantic web technologies for the representation and structure of a knowledge graph. A review of related works in the literature is carried out in order to situate our proposal within the current state of the art.

Preliminary Concepts
This subsection describes a set of relevant concepts to make the paper more self-contained.
- Ontology [15]. This is a description framework that provides a formal representation of the real world, shared by many users, by defining concepts and the relationships between them. It defines a set of axioms and primitives to represent a specific domain of knowledge or discourse. In computer science, an ontology is an artifact designed to allow the formal modeling of knowledge about a domain. Ontologies are part of the standards developed by the W3C consortium for the semantic web, where they are used to standardize vocabularies that facilitate the exchange of information between systems. This standard provides services to solve queries and to populate and publish knowledge graphs [16] so that they can be reused and facilitate interoperability between heterogeneous systems and databases. Instances of ontology classes and properties annotate data or resources semantically. Furthermore, the reasoning capabilities of ontologies allow semantic web applications to infer implicit knowledge from explicitly expressed knowledge, enabling the creation of intelligent applications.
- RDF [14]. To define an ontology in the context of the semantic web, we used the standard language RDF, according to which resources are described in terms of properties and values. RDF statements are represented as triples, consisting of a subject, a predicate, and an object. RDFS (RDF Schema) [17] semantically extends RDF to define classes of resources and the properties that describe those classes. RDFS provides the mechanisms for describing the specific RDF vocabularies of each application. Although the expressiveness of RDFS is insufficient for some applications, it is enough for data exchange and allows semantic languages to be built on top of it. For many other applications, however, further characteristics are necessary, such as cardinality constraints, transitive and inverse properties, range restrictions that apply only when a property is used with a specific class, and the ability to express classes as unions or intersections of other classes. In addition, ontology languages should meet other requirements, such as having a well-defined syntax, being based on the formal semantics necessary for reasoning, and supporting efficient reasoning mechanisms while maintaining expressiveness.
- OWL [17]. Proposed by the W3C Ontology Working Group, this is a semantic markup language for publishing and sharing ontologies on the web. OWL is equivalent to a very expressive description logic, where an ontology corresponds to the TBox [18]. This equivalence allows the language to exploit the results of research on description logics. OWL extends RDF and RDF Schema, with the main goal of bringing the expressiveness and reasoning power of description logic to the semantic web. OWL sublanguages include OWL Lite, for simple applications, and OWL-DL, which represents the subset of the language equivalent to description logic and whose reasoning mechanisms are complex. A newer version of this language is OWL 2 [13], which includes new functionalities, e.g., keys, property chains, data ranges, and qualified cardinality constraints. These new features are oriented towards increasing expressiveness.
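To make the triple model described above concrete, the following is a minimal sketch in plain Python. It does not use RDFLib or any RDF store; the namespace `http://example.org/obo#` and the instance `item1` are hypothetical names chosen for illustration. It shows statements as (subject, predicate, object) triples and a simplified RDFS-style inference over `rdfs:subClassOf`, using the Opened/Item relation that appears later in the paper.

```python
from typing import NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace and instance names, for illustration only.
EX = "http://example.org/obo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

graph = {
    Triple(EX + "Opened", RDFS_SUBCLASS, EX + "Item"),   # schema-level statement
    Triple(EX + "item1", RDF_TYPE, EX + "Opened"),       # instance-level statement
    Triple(EX + "item1", EX + "itemAmount", "1039.88"),  # attribute value
}


def match(triples: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    )


def instances_of(triples: Set[Triple], cls: str) -> Set[str]:
    """Instances of cls, including instances of its (acyclic) subclasses."""
    found = {t.subject for t in match(triples, p=RDF_TYPE, o=cls)}
    for sub in match(triples, p=RDFS_SUBCLASS, o=cls):
        found |= instances_of(triples, sub.subject)
    return found


item_instances = instances_of(graph, EX + "Item")
```

Here `item1` is only declared as an `Opened` item, yet it is also returned as an `Item`, which is the kind of implicit knowledge a reasoner derives from the class hierarchy.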

Related Work
In the past and current literature, not many studies can be found that propose ontologies dealing with the economic and/or finance domains of knowledge. However, these constitute important contributions that are worth highlighting.
A first attempt was introduced by Castells et al. in [7], where an interesting ontology-based platform was developed to integrate finance-based contents and semantics in a knowledge base that provides a conceptual view of low-level contents, including navigation and semantic search facilities. However, it still constitutes a preliminary proposal for which the ontology developed is unavailable, and the semantic search is limited to values of specific properties only.
In 2008, as a result of the European Estrella project (ESTRELLA Project http://www.estrellaproject.org/ (accessed on 27 November 2022)), the Legal Knowledge Interchange Format (LKIF) [2] appeared as an XML Schema for representing theories and arguments (proofs). A theory in LKIF consists of a set of axioms and defeasible inference rules. The individuals and predicates can be imported from an ontology represented in the Web Ontology Language (OWL). Importing an ontology also imports the axioms of the ontology. Other LKIF files may also be imported, enabling complex theories to be modularized.
Also in the domain of financial products, but with a different orientation, an ontology was proposed in [9] (2011) for annotating financial products and customers. The ultimate aim of this work was to model a recommender system for linking products with clients' profiles, and so the proposal was restricted to this specific application.
As remarked in the introduction, an important past proposal is the Financial Industry Business Ontology (FIBO) (EDM Council FIBO https://spec.edmcouncil.org/fibo (accessed on 20 November 2022)) [8] (2013), which was conceived as an ontological model composed of modules or sub-ontologies, each one for a specific purpose in the regulatory domain of finance. This ontology is aligned with LKIF, as they both focus on the legal regulation of the different actors in financial transactions. Although they do not cover PSD2 bank statement directives, they could be easily aligned to OBO with regard to classes such as service provider, legal person or regulatory agency.
More recently, in 2017, the OntoREA [19] model was proposed to cover the REA accounting UML model that conceptualizes the economic logic of accounting in terms of resources that are exchanged in economic events between economic agents. It is a mapping translation using OntoUML [20], although no OWL specification is provided and it is oriented to top-level economic concepts.
Similarly, the COFRIS [21] ontology was proposed, also in 2017, to consider financial information systems that model accounting reports. It is also designed with the OntoUML language but is particularly oriented towards financial reporting according to regulations, hence covering scopes other than open banking statements and reconciliation methods.
In 2021, the authors of this paper [22] argued that there was a need for a semantized banking schema. They integrated existing COVID-19 ontologies in order to analyze how the COVID-19 pandemic affected the banking sector.
Therefore, although these ontologies describe many of the top-level concepts in the domain of economic regulation, they represent only initial steps towards the semantic specification of data-driven mechanisms that allow advanced analysis of financial statements. The OBO ontology proposed here aims to go one step further in this direction, with a special focus on open banking statements and transactions.

Semantic Model
This section describes the methodology followed for designing the proposed ontology and gives details of its implementation. In this context, a series of components and mechanisms are also introduced to show how the underpinning knowledge graph is built and populated with real-world open banking data.

Methodology
When defining a new ontology, a good practice is to follow a well-grounded methodology that considers all the required aspects for knowledge domain description. Concretely, the "Ontology 101 Development Process" [23] was used, which consists of the following seven steps:

1. Determine the domain and scope of the ontology. The ontology domain represents the different actors and the information generated in open banking services. More specifically, bank statements and invoice management systems are contextualised to allow for data reconciliation tasks.

2. Consider reusing existing ontologies. After reviewing the literature, several ontologies were considered to describe accounting entries [8]. Still, none of them represents bank statements oriented to the PSD2 format and other related aspects such as invoice contextual information. These ontologies usually contain a class Invoice, Receipt or other similar terms, but they do not describe an invoice ontologically, with its relationships and components. In the proposed approach, the invoice concept plays an essential role, since it allows all the meta-information involved to be considered and further used in reconciliation mechanisms for linking with third-party bank statements. For example, those products or services the invoice refers to are usually unknown to the legal person or company that a given client has invoiced.

3. Enumerate important terms in the ontology. As mentioned in the previous step, important terms in OBO correspond to data in invoices and bank statements involved in open banking service transactions. Examples of terms related to invoicing are: companyId, customerId, issueDate, maturityDate, itemCurrencyAmt, and documentType. Examples of terms related to bank statements are: Statement, Customer, Account number, Bank, Concept, and beneficiary.

4. Define the classes and the class hierarchy. From the relevant terms, we extract those that are classes of the ontology. Figure 1 shows the set of classes that have been defined to describe an open banking service. Examples of these classes are: Item, Statement, AccountObj, and Bank. The class Item comprises two subclasses, Opened and Cleared, for specifying whether Items are pending payment or not.

5. Define the properties of classes. Object properties and data properties were created to connect instances of different classes and define their attributes. Examples of object properties are: participating, which connects a customer to the bank account; statementHasOffice, which relates a bank movement to the office in which it was registered; and ItemHasCompany, which connects an item with the company. Examples of data properties are: statementConcept, beneficiary, accountNumber, itemIssueDate, and itemAmount. Tables 2-6 describe details of these properties in description logic.

6. Define the facets of the slots. In this step, cardinality and value constraints are defined. In the proposed ontology, examples of constraints are: the range of the property itemAmount has to be a decimal number; the range of the property itemIssueDate has to be a date; the concept of a statement has to be of type string, and so on.

7. Create instances. The instances (or individuals) in the current approach are generated from the JSON documents containing actual PSD2 bank statements and the CSV files with data of the items (invoices). These data are transformed to RDF through mapping functions according to the classes and properties defined in OBO, and an RDF knowledge graph is then created.
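Steps 6 and 7 above can be illustrated with a small sketch that reads one invoice row from CSV, enforces the value constraints (decimal amount, date of issue), and emits N-Triples lines. This is only an illustration under stated assumptions, not the mapping functions of the paper: the namespace `http://example.org/obo#`, the `itemId` column, and the sample row are hypothetical, and only the Python standard library is used instead of RDFLib.

```python
import csv
import io
from datetime import date
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OBO = "http://example.org/obo#"  # hypothetical namespace, not the published IRI

# Facets (step 6): data property -> (parser enforcing the range, XSD datatype).
FACETS = {
    "itemAmount": (Decimal, XSD + "decimal"),
    "itemIssueDate": (date.fromisoformat, XSD + "date"),
}


def row_to_ntriples(row):
    """Map one CSV row to N-Triples lines; parsing fails if a facet is violated."""
    subject = f"<{OBO}item{row['itemId']}>"  # 'itemId' is an assumed column name
    lines = [f"{subject} <{RDF_TYPE}> <{OBO}Item> ."]
    for prop, (parse, datatype) in FACETS.items():
        value = parse(row[prop])  # raises an exception on malformed input
        lines.append(f'{subject} <{OBO}{prop}> "{value}"^^<{datatype}> .')
    return lines


# Hypothetical one-row CSV standing in for an ERP export.
sample_csv = "itemId,itemAmount,itemIssueDate\n42,1039.88,2022-11-21\n"
triples = [
    line
    for row in csv.DictReader(io.StringIO(sample_csv))
    for line in row_to_ntriples(row)
]
```

Keeping the facets in a single table means that a new constrained property only requires one additional entry, and malformed values are rejected before they reach the repository.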

Ontology Implementation
The Open Banking Ontology (OBO) proposed here was developed with the Protégé editor (Protégé Web Site https://protege.stanford.edu/ (accessed on 25 November 2022)) after several rounds of design with experts in the domain of knowledge (in the context of the Helix project). As a result, the proposed ontology is made up of 14 classes, 10 object properties, 35 data properties, and 250 axioms. Figure 1 shows a general overview of the OBO proposal and, for the sake of simplicity, only a selection of the main classes and properties is detailed below. The properties are formalized using the OWL-DL description logic syntax. A summary of this syntax, which can be used to support the interpretation, is shown in Table 1. The remaining elements of OBO can be examined in the software repository, which is available at https://ontologies.khaos.uma.es/obo (accessed on 28 November 2022).

1. The class Company represents those companies that take part in a given PSD2 transaction. For this class, 5 data properties were defined: businessName, to indicate the company name; companyId, which models the company identifier; corporationId, to identify the corporation (when existing) to which the company belongs; currencyCode, which represents the currency the company works with; and fiscalId, to store the tax identification number. In addition, the Company class has an object property, hasStatementCustomer, which links the company with its PSD2 bank statements. Table 2 describes the properties of the Company class in description logic.

2. The Customer class was included to annotate customers that participate in transactions with companies. A set of 3 data properties was defined for this class: businessName, which represents the customer's name; customerId, which stores the customer identifier; and fiscalId, to set the client's tax identification number. The description logic definitions of these properties can be found in Table 3.

3. The Item class represents the invoices that customers should pay. Its main properties are defined in description logic in Table 4, comprising 7 data properties and 2 object properties. As for data properties, it is worth mentioning itemAmount, which stores the invoice amount in the currency associated with the corresponding company; itemCurrencyAmt, to annotate the amount of the invoice; itemIssueDate, which describes the invoice issue date; currencyCode, which indicates the currency associated with the invoice and in which it is expressed; and indicator, which represents a series of descriptors associated with the invoice. The two object properties are: itemCompany, which links each invoice with its company, and ItemCustomer, which connects the invoice to the customer.

4. The class Statement represents the bank statements carried out on a particular bank account. Its main data properties include: statementAmount, to annotate the transaction amount; accountNumber, which indicates the bank account in which the statement was made; statementCurrency, which stores the currency associated with the statement; statementBeneficiary, representing the beneficiary of the bank statement; and concept and statementValueDate, which indicate the concept and value date of the statement, respectively. The object property statementHasOffice was defined for this class to associate each transaction with the corresponding bank branch. Table 5 describes the properties of the Statement class.

5. The Bank class models the bank information. A total of 3 data properties were defined for this class: bankGroupId, which stores the identifier of the bank group to which the bank belongs; bankId, to represent the identifier of the bank; and bankName, which stores the name of the bank. This class is important for annotating not only traditional banks, but also open banking entities that take part in PSD2 transactions with customers.

6. The class StatementsCustomer represents the holders of the bank accounts in which the movements are made. For this class, a series of data properties were defined to store the customer's identity document, name, date of birth, address, email, and telephone number, as described in Table 6. The customer is related to his/her/their bank accounts through the object property participating, specifying the role in which he/she/they participate(s) (owner or authorized). This is modeled using the AccountObj class.

Table 1. Basic OWL-DL semantic syntax used to formally define the proposed ontology.

Building the Knowledge Graph
Once the OBO ontology has been defined, the data involved can be integrated and consolidated in an RDF repository by following the defined semantic model. The information is then stored in the form of a knowledge graph with a homogeneous format, regardless of the source, which avoids semantic inconsistency. To clarify the technology used to deploy the model, Table 7 summarizes the packages and languages involved in each step of the process. This process is illustrated in Figure 2, which shows that a series of mapping functions were implemented using RDFLib (https://rdflib.dev/ (accessed on 24 November 2022)) to automatically translate the customer's data from the issued invoices and bank statements into RDF format, hence populating the knowledge graph derived from the OBO scheme.

• Mapping invoice data into RDF. These data are obtained in CSV format from the ERP systems of the companies involved and comprise four different files: (1) open invoices, (2) closed invoices, (3) companies that issue invoices to customers, and (4) customer information, annotated by means of the Customer class in OBO. In addition, contents regarding classes Item, Company, and Customer are also stored in the repository.
Once the RDF triples corresponding to bank statements and invoices have been created, they are stored in the RDF repository, which is deployed by means of a Stardog (https://www.stardog.com/ (accessed on 22 November 2022)) instance service. Therefore, the integrated data can be efficiently queried through a SPARQL endpoint. In addition, the RDF repository is powered by an SWRL inference engine to exploit different sets of semantic rules designed to conduct classification and validation tasks on financial subjects, which, together with the semantic reconciliation mechanisms, are explained in the next section.

Reconciliation and Inference
This section aims to show the usability of the proposed semantic model by carrying out a series of real-world use cases oriented toward data reconciliation and semantic inference. The data involved, although previously anonymized, were captured in the context of the aforementioned Helix innovation initiative (The HELIX Innovation Project, https://www.ongranada.com/proyectos-2/ (accessed on 21 November 2022)), which is made up of a cluster of financial organizations that work to maximize the liquidity of small and medium-sized companies.

Semantic Reconciliation
One of the most common operations in current banking, which usually requires strenuous expert effort, consists of (manual) movement inspections to match partial payment statements concerning different concepts in invoices. A way to facilitate these processes is to filter those operations whose concepts and payments can be semantically matched in an automatic way. This approach can now be conducted by means of the semantic system implemented here, according to the OBO TBox (Terminological Box; following the Description Logic nomenclature, ontology classes and properties are called the "TBox", while ontology individuals are called the "ABox" (Assertional Box)) for concept disambiguation.
For validation, 74,000 real-world invoices and 150 bank statements were used, the contents of which were stored in the RDF repository with the mapping functions explained in the previous section. In this regard, as shown in Figure 2, a series of SPARQL queries is applied to build two word corpora: one for the Items, comprising the invoice data that companies usually store in their corresponding ERPs (business and fiscal name of the customer to whom the invoice was financed, etc.), and another for the Statements, comprising the concepts of bank transactions (beneficiary, bank, etc.). Table 8 shows the SPARQL queries defined to construct these corpora, with two examples as results. Concretely, the first set of queries (first row) retrieves the properties ItemsConcepts, ItemsBusinessName, and ItemsFiscalId to obtain the information in the Items for two invoices, while the second set of queries (second row) selects the properties PSD2Concepts and PSD2Beneficiary of two sample bank Statements.
Table 8 (excerpt). Corpus generated for invoices (anonymized sample):
{
"obo_item00035ed7-8a02-e3dd": ["1", "1-1", "1201****-I", "EXAMPLE SL", "B2756****"],
"obo_item0007874d-30a9-a4b0": ["1", "1902****-F", "H-0345****", "8771****", "EXAMPLE2 SL", "****5028G"]
}
Once the two word corpora are built, the fuzzy string-matching library RapidFuzz (https://maxbachmann.github.io/RapidFuzz/ (accessed on 26 November 2022)) [24] is applied to calculate the semantic matching between invoices and bank statements. The algorithm is configured with the Levenshtein distance metric and a confidence threshold of 80% similarity to filter out unnecessary matches. This threshold was set after a series of preliminary tuning tests until an adequate level of similarity was reached, taking into account expert domain knowledge. Table 9 shows a set of two (anonymized) highly scored results obtained from a total of 225 matches (out of more than 11 million possible combinations).
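The thresholded matching step described above can be sketched as follows. This is a standard-library stand-in and not the RapidFuzz configuration used in the paper: the normalization (edit distance divided by the longer string length), the upper-casing, the "best pair of terms" scoring, and the sample statement corpus are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 100]; the normalization is an assumption."""
    a, b = a.strip().upper(), b.strip().upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def match_corpora(items, statements, threshold=80.0):
    """Pair each item with each statement whose best term-to-term score passes."""
    matches = []
    for item_id, item_terms in items.items():
        for st_id, st_terms in statements.items():
            best = max((similarity(x, y) for x in item_terms for y in st_terms),
                       default=0.0)
            if best >= threshold:
                matches.append((item_id, st_id, best))
    return matches


# Hypothetical corpora: one invoice entry and two bank statements.
items = {"item1": ["EXAMPLE SL", "B2756****"]}
statements = {"st1": ["Example S.L.", "TRANSFER"], "st2": ["RENT", "OTHER CO"]}
matches = match_corpora(items, statements)
```

In this toy input only the pair (`item1`, `st1`) passes the 80% threshold, because "EXAMPLE SL" and "Example S.L." differ by two insertions over twelve characters. Comparing every item with every statement is quadratic, so for tens of thousands of invoices an optimized library is the practical choice.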
The first column of Table 9 indicates the semantic attributes, and the other columns show the semantic similarity values. In addition, these filtered matches take into account the currency, as well as the difference between the maturity date of the invoice and the date of the movement, which should be less than one year, since a client is not expected to take more than a year to pay an invoice. This is checked by means of the SPARQL query shown in Code Listing 1, for which a sample of results is shown in Table 10, comprising two new results that had not been obtained previously. Finally, further checking is addressed in this use case to find matches in the case of partial payments, e.g., in those situations where an invoice is paid with several movements and, conversely, when several invoices are paid with the same statement. In these cases, to calculate the possible sums of amounts, a Genetic Algorithm (GA)-based approach was used to find the optimal combinations of the sum of subsets. This algorithm was implemented by means of the JMetalPy (https://jmetalpy.readthedocs.io/ (accessed on 21 November 2022)) [25] framework, using an Elitist GA variant with a population size of 100, SPX crossover and bit-flip mutation operators, selection and replacement by binary tournament, and a stopping condition of 25,000 fitness function evaluations.
A clear sample result from this context is shown in Table 11, where the GA finds an invoice whose amount (1039.88 €) exactly matches the sum of the two corresponding bank statements (800 € and 239.88 €). In contrast, Table 12 contains a sample matching in which a single bank movement amount (21,913.10 €) is used to pay the corresponding two invoices (16,637.50 € and 5275.60 €). It is worth noting that, without the previous semantic reconciliation concerning item and statement descriptions, these numerical matchings would be highly complex because of the number of possible combinations in the search space.
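The two numerical checks of this subsection, the one-year date filter and the search for partial payments, can be sketched as below. This is deliberately not the Elitist GA of the paper: for a handful of pre-filtered candidates, an exhaustive subset search is used instead as a deterministic stand-in. The function names, the absolute-difference reading of "less than one year" (taken here as 365 days), the `max_size` bound, and the 150.00 distractor amount are assumptions; the other amounts are those reported for Tables 11 and 12.

```python
from datetime import date
from decimal import Decimal
from itertools import combinations


def within_one_year(maturity: date, movement: date) -> bool:
    """True if the two dates are less than 365 days apart (assumed reading)."""
    return abs((movement - maturity).days) < 365


def subsets_matching(target: Decimal, amounts, max_size: int = 4):
    """Index tuples of `amounts` whose exact sum equals `target`.

    Exhaustive search: the cost grows combinatorially, which is why a
    metaheuristic such as a GA is preferable for large candidate sets.
    """
    found = []
    for size in range(1, min(max_size, len(amounts)) + 1):
        for combo in combinations(range(len(amounts)), size):
            if sum(amounts[i] for i in combo) == target:
                found.append(combo)
    return found


# One invoice paid with several movements (amounts as in Table 11, plus a distractor).
movements = [Decimal("800"), Decimal("150.00"), Decimal("239.88")]
invoice_paid_by = subsets_matching(Decimal("1039.88"), movements)

# One movement paying several invoices (amounts as in Table 12).
invoices = [Decimal("16637.50"), Decimal("5275.60")]
movement_pays = subsets_matching(Decimal("21913.10"), invoices)
```

Amounts are held as `Decimal` values so that sums such as 800 + 239.88 compare exactly with the target, which binary floating point does not guarantee.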

Semantic Inference
This last use case is devoted to illustrating two inference tasks oriented towards allowing customer classification and concept suggestion by means of semantic reasoning.
First, the values of the obs:debtRatio data property of the Customer class are calculated by means of several SPARQL queries for each customer (obs:itemHasCustomer) and company (obs:itemHasCompany). The debt ratio is calculated with Equation (1), which aggregates the amounts (obs:itemCurrencyAmt) of all the items of the class Opened (these amounts are the outstanding items) and divides the result by the sum of the amounts of all the customer's items. Then, in accordance with the debt values, a set of SWRL semantic rules was defined to establish a classification of customers, as follows: the Low label is assigned to those customers with a debt ratio lower than 30% (the rule is shown in Code Listing 2), the Medium label is assigned to those with a debt ratio between 30% and 70% (the rule is shown in Code Listing 3), and finally, the High label is assigned to customers with a debt ratio higher than 70% (the rule is shown in Code Listing 4). Therefore, the value of the data property obs:debtType is inferred by the reasoner in the proposed semantic model, hence allowing an additional population of this element in the RDF repository. In this way, to obtain a given customer's classification according to his/her debt with a particular company, the SPARQL query shown in Code Listing 5 can be used (this refers to the specific case of customer ID 367 with company ID 226).
Second, taking advantage of the information stored in the knowledge graph, it is possible to infer suggested concepts for different bank movements. To this end, the SWRL rule shown in Code Listing 6 was defined to merge information from classes Opened and Customer, comprising the data properties that annotate the customer's fiscal identification number (fiscalId), the maturity date of the item (itemMaturityDate), and the amount of the closed item (itemAmount). The semantic reasoner is then used to concatenate these values and to form the suggestedConcept data property, thereby reconciling bank movements.
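The debt ratio and the three-way classification described above can be sketched in Python as follows. This mirrors the textual description only; it is not the SWRL rules of Code Listings 2-4. The function names, the sample amounts, and the handling of the exact boundaries (30% and 70% are treated here as Medium) are assumptions.

```python
from decimal import Decimal


def debt_ratio(opened_amounts, all_amounts) -> Decimal:
    """Sum of outstanding (Opened) item amounts over the sum of all item amounts."""
    total = sum(all_amounts, Decimal(0))
    if total == 0:
        raise ValueError("customer has no invoiced amount")
    return sum(opened_amounts, Decimal(0)) / total


def debt_type(ratio: Decimal) -> str:
    """Low below 30%, High above 70%, Medium otherwise (boundaries assumed)."""
    if ratio < Decimal("0.30"):
        return "Low"
    if ratio <= Decimal("0.70"):
        return "Medium"
    return "High"


# Hypothetical customer: two outstanding invoices out of four issued.
opened = [Decimal("200"), Decimal("100")]
everything = opened + [Decimal("300"), Decimal("400")]
ratio = debt_ratio(opened, everything)
label = debt_type(ratio)
```

With 300 outstanding out of 1000 invoiced, the ratio is 0.3 and the customer is labeled Medium under the boundary convention assumed here. In the semantic model, the equivalent label is inferred by the reasoner and stored as obs:debtType.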

Conclusions
In this work, the Open Banking Ontology (OBO) is proposed for the semantic annotation and consolidation of data involved in PSD2 transactions between customers, banks and third-party financial entities. A semantic model is also used for knowledge graph construction and data harmonization in the context of a real-world open banking project (Helix), enabling the definition of SPARQL queries and SWRL rules for reasoning. For validation purposes, a series of mapping mechanisms, queries, and intelligent algorithms were also developed with a special focus on data reconciliation, which seeks to enable automatic correspondence between bank statements of customer payments and the corresponding invoices delivered by companies. In addition, a set of SWRL inference rules was defined to obtain new knowledge from the integrated data, with the specific aims of automatically classifying customers according to their debt patterns and standardizing the suggested concepts for invoices.
In terms of practical implications, as already mentioned, payment processes such as PSD2 are stimulating the emergence of applications capable of greatly simplifying the day-to-day management and procedures of SMEs and users. This translates into the creation of systems capable of integrating all of a company's payments, as well as providing access to the different bank accounts held by the company, even if they are held at different banks. The key idea lies in integrating all corporate areas into the same interface, which allows for a much clearer perspective on the company's status. In this sense, the proposed ontology enables the constitution of a semantic model that can facilitate the integration of all these functionalities, as well as other external yet important factors (such as the impact of COVID-19 [22]), as it offers a common data framework that is also oriented towards being consistent with open banking standards.
In the same regard, it is worth noting that the OBO ontology constitutes the first attempt to model bank movements and activities with a special focus on the European PSD2 standard. Therefore, new extensions are expected to follow this work, including Linked Data federation with other related knowledge graphs in the domain of Fintech applications and standards.
In future work, we plan to integrate more data from different invoice management systems and update the OBO ontology to incorporate new relevant attributes from different perspectives, such as opinions on social networks, behavioral features of the company in its commercial relations with customers, etc. This new knowledge will allow further analyses to be carried out, taking into account further factors and actors.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
This appendix section includes anonymized fragments of the data returned by the open banking API. The snippets show the structure and fields that define the documents in JSON format. The data includes information identifying the customers and the movements associated with the banking products contracted by the customer with the bank.