Towards a Domain-Speciﬁc Modeling Language for Extracting Event Logs from ERP Systems

: Enterprise resource planning (ERP) systems are often seen as viable sources of data for process mining analysis. To perform most of the existing process mining techniques, it is necessary to obtain a valid event log that is fully compliant with the eXtensible Event Stream (XES) standard. In ERP systems, such event logs are not available as the concept of business activity is missing. Extracting event data from an ERP database is not a trivial task and requires in-depth knowledge of the business processes and underlying data structure. Therefore, domain experts require proper techniques and tools for extracting event data from ERP databases. In this paper, we present the full speciﬁcation of a domain-speciﬁc modeling language for facilitating the extraction of appropriate event data from transactional databases by domain experts. The modeling language has been developed to support complex ambiguous cases when using ERP systems. We demonstrate its applicability using a case study with real data and show that the language includes constructs that enable a domain expert to easily model data of interest in the log extraction step. The language provides sufﬁcient information to extract and transform data from transactional ERP databases to the XES format.


Introduction
At present, organizations use enterprise resource planning (ERP) systems as a core information and communications technologies (ICT) component. ERP systems encompass configurable modules supporting business processes that have received wide industrial adoption, and they contain extensive amounts of data about the behaviors of such processes. Recorded data can be used to analyze whether a predefined process specification conforms with real activities [1]. Process mining offers automated techniques for this type of analysis. Process mining has emerged in the past few years as a new analytical discipline that focuses on extracting insights about processes from the event data stored in information systems.
Process mining provides the techniques to automatically discover process models from data, find the mismatch between process models and process executions, and improve models based on information obtained from their past executions [2]. These techniques rely on the existence of an event log whose structure is suitable for mining. An event log assumes that events that refer to an activity occur at a particular time in precisely one case [3]. While event logs structured in this form are mostly available in process-aware information systems, such a structure is not explicitly generated by ERP systems and other domain-specific business systems [4][5][6]. These systems record events implicitly, separately, and without a common case identifier. Thus, such logs need to be generated from the data that typically exist in a relational database. To build such an event log is the key enabler for process mining and is a complex task [7,8].
To extract data from ERP systems in an event log format suitable for process mining, several challenges need to be addressed. First, there is not one possible case notion, but multiple case notions (called object types) which could be used to correlate the corresponding events in traces [9]. For example, a purchase order or purchase invoice document can be identified as a case notion for a purchasing process in an ERP system. Second, data are organized in documents related to each other through one-to-many and many-to-many relationships, and activities are often related to these documents. Due to the data featuring one-to-many and many-to-many relations, an activity can be performed multiple times on a single document or performed only once but on different documents; however, existing process mining techniques and tools require an XES event log as an input. The eXtensible Event Stream (XES) format [10] is the current Institute of Electrical and Electronics Engineers (IEEE) standard for storing events and requires a single case notion to correlate events and precisely one case per event. Thus, the resulting log may suffer from convergence (one event is related to multiple cases) and divergence (multiple events of the same type are related to one case) when extracting an event log from the ERP system, as described in [11]. This may lead to duplicated events and false data dependencies between events.
This paper emphasizes the challenge of building an event log from ERP systems supporting domain knowledge. Extracting events from an ERP database requires substantial domain knowledge of the business processes and underlying data model [2]. Data constraints in ERP systems are generally managed at the application level [12,13]. In this situation, when data relations are defined outside a database schema, domain knowledge is essential to select the correct data from database tables. The semantics of event data and where they are stored within an ERP system is known by domain experts. Jans and Soffer [14] specified a set of decisions that domain experts often need to make when creating event logs from databases. They have discussed how these decisions can affect log quality. One of these decisions is the selection of a meaningful case notion. This shows that the need to support domain experts during the log building process is a real issue, but there is an absence of techniques and tools to facilitate this [9,15].
To support domain experts, to automate log building process as much as possible, and to mitigate the problems caused by many-to-many relations, a graphical domain-specific modeling language (DSML) for extracting event data from transactional databases has been developed. It has been specifically designed to describe behaviors for complex data in ERP systems in terms of multiple interacting artifacts. To this end, we have followed artifactcentric approaches [11,16] which consider artifacts, i.e., the documents of ERP systems, as case identifiers. These approaches can deal with one-to-many and many-to-many relations between data.
A few researchers have also attempted to address the problems caused by manyto-many relations and have proposed novel metamodels for storing data for process mining [13,17] and novel modeling notations and process mining techniques for discovering and analyzing artifact-centric processes [18,19] to achieve this; however, our proposed metamodel is aimed at providing the basis for a data extraction model to create event logs in the XES format from transaction logs. Once event logs are obtained, i.e., XES files, it is possible to perform existing process mining techniques which users are already familiar with. The language supports domain experts to determine the process scope, select correct database tables, and include relevant events for conversion definition in an intuitive way.
The metamodel proposed in this paper, along with its corresponding constraints, is the result of several iterations of development and revision. We also introduce a graphical notation for the language concepts. Once the DSML model was designed, the next step of our approach was to transform the model into Structured Query Language (SQL) code. Then, automatic translation from the query results to the final XES event log could be performed. The main contribution of our research is the full specification of the aforementioned language. The benefit of designing a specific language is that it allows domain experts, even without any programming background, to focus on modeling the data of interest in the log extraction step. Another important contribution is a conceptual representation of ERP database content in terms of business artifacts. The new metamodel aligns behavioral and data perspectives in order to provide sufficient information about a process, data objects, and the relations between them. Based on this model information and the specified transformation rules, it will be possible to extract and transform data from transactional ERP databases to the XES target representation, as demonstrated in this paper.
The remainder of the paper is organized as follows. The next section discusses related works. Section 3 describes complex ambiguous cases from ERP systems. The specification of the language, its metamodel, and the corresponding constraints are discussed in Section 4. Besides, a graphical notation is presented. Then, our solution is evaluated using a case study with real data in Section 5. A discussion is presented in Sections 6 and 7 provides final conclusions and ideas for future work.

Related Work
Various techniques and tools in the literature have tried to deal with the extraction of event logs from enterprise systems. Tools such as XESame [20] and ProMimport [21] can be used to extract XES event logs from various data sources that hold explicit information about process instances and events. The XES log format has been accepted by the IEEE Task Force on Process Mining as a de facto standard for process mining algorithms. This format represents a structure of an event log, containing traces and events with the corresponding attributes. It comes with standardized extensions for capturing semantics to particular attributes per component. Briefly, a trace describes an order in which events occur during one execution of a process. Note that traces in the XES log describe the evolution of the main case object, i.e., one type of process instance. Moreover, each event refers to some activity. We refer to the official IEEE XES standard [10] for more information about XES metamodel and the OpenXES reference implementation.
Since ERP systems provide several identifiers for a process instance, the XES log format cannot be extracted directly. So far, most studies in this regard have dealt with extracting data from specific environments like SAP systems and have not provided a general approach for building event logs from databases. Ingvaldsen and Gulla [22] developed a tool to specifically construct process chains from SAP transactions while allowing users to define events. Piessens [23], in turn, initialized a repository with predefined event types for specific processes supported by SAP. A more generic solution is the onprom project presented by Calvanese et al. [5], who introduced two ontologies, namely, a domain ontology and an event ontology, to support the extraction of event logs from legacy data. Specifically, the approach provides an annotation mechanism to help users in the conceptual identification of event data. This approach addresses the data from a semantic perspective but does not deal with one-to-many and many-to-many relations, as encountered in this work. Another effort to obtain event logs considered the redo log files of relational databases as a source of event data [4,24]. Redo logs are used to record information about any modification performed on the data of the database; however, more files need to be explicitly configured to store redo logs. In many real cases, redo logs might not be available.
The OpenSlex metamodel [13] has been proposed to store object-centric event logs, covering different types of entities such as objects, object versions, relations, and events. Multiple XES event logs can then be created from this intermediary storage based on the chosen case notion. Unlike OpenSlex, our solution provides the basis for a data extraction model and does not require the manual specification of database queries. It abstracts users from the definition of complex transformations required to obtain an XES event log. Once a conceptual model has been designed by domain experts, it provides sufficient information to automate the generation of queries and translate the obtained query results to the final XES event log.
A more recent work [19] has proposed a new process mining technique to discover an object-centric behavioral constraint (OCBC) model from an eXtensible object-centric (XOC) log. This technique is based on a novel log format and a new type of model which is used to describe processes involving complex data dependencies. The XOC format does not assume a unique case notion to group events, thus, it can be extracted from object-centric data [17]. The new log format combines different types of entities, such as objects, events, and attributes, as presented in [13]. It provides additional information, such as database objects, in order to solve data convergence and data divergence problems; however, the major drawback of an object-centric approach is that it forces the representation of complex systems by means of a novel log format, which is not widely accepted in the industry when compared with the IEEE XES standard. This necessitates new process mining tools and existing process mining techniques cannot be used. Domain experts cannot easily interpret OCBC models discovered from XOC logs as they have constructs that are different than conventional process modeling notations such as the BPMN or Petri net notations [15]. Our aim is to enable the building of an event log that could be managed by the existing tools used in process mining.
Our work also relates to artifact-centric approaches [11,16,25]. These approaches identify data artifacts, i.e., documents like a purchase order, in order to discover the life cycles of objects with one-to-many and many-to-many relations between them. For each data artifact, its instances and all related events are extracted and transformed into a separate event log. As a result, multiple event logs are obtained. The benefit of applying an artifact-centric approach is that it addresses the data convergence and data divergence problems explicitly by defining an instance notion in each artifact; however, more support should be provided for domain experts to identify the essential artifacts for data extraction from ERP systems. Our approach attempts to address this issue. Unlike existing approaches, we propose a metamodel for describing the semantics of transactional data from an ERP system. Inspired by artifact-centric approaches, the data structure of the process is conceptually represented by multiple related documents and their transactions. As far as we know, this is the first work that uses domain-specific language for obtaining event logs from object-centric information systems such as ERP systems. The benefit of designing a specific language to support data extraction is that it allows the domain experts to localize the correct data for the process mining analysis in different tables and data structures of ERP system at the higher level of abstraction, thus avoiding implementation details. This consequently guides a domain expert towards converting data from a database to a suitable event log without the need for programming.

Motivation and Running Example
An ERP system is a business suite of integrated applications that supports the major business processes of an organization. Typical examples of such processes in ERP systems are purchase order fulfillment and manufacturing. The key characteristic of an ERP system is that the different types of documents are used to record transaction data that are often tied to the execution of process activities. For instance, material documents record goods that are either received into or issued from inventory. Document types are implemented as table pairs with parent-child relationships between tables. This can be represented as a header and lines pattern, where the header table holds general information of a document and the lines table captures specific line item information of a document. The header and lines design pattern is very common for business applications that deal with documents, such as ERP, customer relationship management (CRM), and supply chain management (SCM) systems.
In ERP systems, the process logic and business rules are both embedded in the application code managing documents in typically many-to-many relationships [13,26]. The presence of these complex data dependencies has a large influence on the definition of cases for process mining analysis. When a process has one-to-many and many-to-many relations between documents, the user has to choose one document as a case notion as multiple documents can be linked to a single process execution. Experience teaches us that the ability to obtain the event log, which accurately represents reality, depends on this decision.
Here, we introduce an ordering process as a running example to illustrate the complex data dependencies in an ERP system. In general, the process involves supplying goods or services, receiving supplies, and processing invoices. The process starts with creating a purchase order which is forwarded to the selected vendor. When the vendor delivers supplies, a purchase receipt document is created and an invoice for the payment is registered. Figure 1 gives a simplified example data set from the ordering process supported by the Microsoft Dynamics NAV ERP system. This example of transactional data describes the event type of creating a document. Note that multiple invoices can be received for a single order (IN1 and IN2 refer to PO1) and one invoice can be matched to multiple purchase orders (IN1 refers to PO1 and PO2). As already mentioned, a process instance represents the unique document or case object that is followed throughout the process. An invoice, a receipt, or purchase order can be selected as a process instance, i.e., case notion. If a purchase order is selected as the process instance, the trace of process instance PO1 reveals the link from the book invoice to the posting goods receipt activity in the discovered process model shown in Figure 2. There are two posting goods receipt events (R1 and R3) and two book invoice events (IN1 and IN2) for the process instance (PO1). Due to the multiple instances of the same activity referring to the single process instance, the link may indicate the false dependency between the events. For instance, we can infer that there are invoices registered before the related purchase order is received; however, receipt document R3 is created before invoice document IN2, but this is not visible in the discovered process model. This situation corresponds to the data divergence problem.
The data convergence problem is a result of a many-to-one relationship between the document that represents a process instance and its related document, and the problem is related to the number of times an activity is performed. For instance, invoice IN1 is matched to two different purchase orders (PO1 and PO2). Within the ERP system, invoice IN1 is registered only once but is extracted twice and the event is duplicated in the process model discovered using traditional process mining techniques. For the same activity, one event per process instance is observed in the discovered process model in Figure 3. The number in brackets represents how many times the activity is performed.

Domain-Specific Modeling Language for Extracting Event Logs
This section presents the definition of the abstract syntax as a metamodel, and a graphical concrete syntax is described. The language can be used in a pre-processing step for process mining projects to extract correct data from ERP systems for the analysis in a user-friendly way. It is aimed at structuring and integrating the elements of the process and data perspectives in a single structure in order to support an artifact-centric approach.

Abstract Syntax
The specification of the essential domain concepts and relationships between them is the central part of designing a DSML. It is composed of a metamodel, where the attributes and associations of the language concepts are defined, as well as the corresponding constraints. We chose to specify restrictions and semantic properties of the DSML elements more through the abstract syntax than by adding a larger number of model constraints. By defining fewer additional constraints, the comprehensibility and readability of a metamodel can be improved. Also, the correctness of a model can be checked at the class level [27].
To define the abstract syntax of a language, a metamodel has been created using the Unified Modeling Language (UML) class diagram [28]. A UML class diagram used as its metamodeling language is easier to understand by domain experts and the format is widely accepted for conceptual modeling in industrial fields. The UML metamodel is compliant with the Meta-Object Facility (MOF) standard [29]. Figure 4 shows the main metaclasses in our metamodel, along with their attributes and relationships. The metamodel refers to the existing structure of the XES standard. The XES standard enforces the general structure of an event log containing the log, trace, event, and attribute objects. All information is stored in attributes where the semantics are clearly defined using extensions. It is possible to enrich the format with new domain-specific attributes for standard entities; however, extensions cannot be used to include new entities and structures to store additional information. The data perspective is missing in the XES, thus, the new metamodel proposed in this work describes necessary elements related to data objects within a process flow that are interesting for process mining analysis. The main construct of the metamodel is the artifact, which refers to the key business objects, i.e., documents. A business document structure typically contains a header and a set of lines, such as a purchase order header and line item detail. A parent-child relationship is used to associate one document header with its many lines in a database. A document header table contains general information of a transaction, while a document line captures specific line item properties. Note that transactions refer to operations on a complete document or only to a line of a document. A purchasing transaction, for example, to approve an order is linked to a purchase order header. Hence, both the header and line are subsumed in the abstract artifact entity type, which can be associated with several transactions by means of the relationship called the "object of transaction" in the metamodel. Each transaction in turn belongs to exactly one artifact.
The artifacts are connected to each other via direct or indirect relations. A relation holds a link from a source artifact to a target artifact. A direct relation specifies the reference linking the two artifacts, while an indirect relation is constituted of two or more direct links. This is performed using an attribute called "join path".
We have only represented the main metaclasses of XES metamodel to facilitate the comprehension of relations between the language elements and the elements of XES. The elements are linked using dependency relationships. This relationship indicates the dependencies between an XES element and the element of the presented metamodel for transformation implementation. The XES log element will use the artifact type log metaclass for defining a case notion to correlate events. The artifact type log metaclass has a "main artifact ID" property referencing the artifact element whose instances can be related to XES trace elements. There may only be one main artifact, and it represents a case notion. Each event in a log can be related to only one occurrence of a transaction, which is indicated by the dependency relationship between transaction type and XES event element.
The constraints that impose certain restrictions on how to construct a model are described in the following: • Artifact type log: The artifact type log is the main metaclass that represents the root of every model. The main artifact of the log must be specified, as well as the timestamp from which to begin querying for event data. It is composed of at least one transaction type. • Artifact type: The artifact type represents a document header or a document line as data objects. Each type of specialization is a subclass of artifact type. The document header class instance can have zero or more document line class instances. For every artifact type, it is necessary to specify its name and name of the table, which provides one or more fields that uniquely identify each artifact instance. There may only be one main artifact and it represents the source artifact in each relation. • Transaction type: A transaction type only occurs for one artifact. Every transaction must have the type name and the field name of a database table that holds data about the execution time of a transaction. • Direct relation: A direct relation describes exactly one target artifact class and one source artifact class. Besides the type name, the reference field must be specified. It relates the target artifact class to the source artifact class. Thus, the table of target artifact class provides one or more fields for reference information. • Indirect relation: An indirect relation describes exactly one target artifact class and one source artifact class. Every indirect relation must have the name and the join path. The join path consists of two or more joins through several linked tables.
These metamodel constraints have been defined using the Object Constraint Language (OCL) [29]. This language is the de facto standard for specifying integrity constraints for both the model and metamodel levels within the UML. As an example, the OCL constraints for artifact type log instances are given in Listings 1 and 2. Furthermore, we believe it is interesting to look at the formal specification of the constraints for the direct relation element. Listing 3 specifies exactly one target artifact class and one source artifact class for each direct relation. Next, Listing 4 shows the OCL code to restrict that the main artifact represents the source artifact in each relation. These constraints are also defined for the indirect relation element. It is also important to enforce that the timestamp used to begin querying for event data is older than the timestamp used to stop querying. This constraint is given in Listing 5.

Concrete Syntax
A second design objective refers to offering the representation of the DSML, either textually or graphically, through a concrete syntax. We opted for a graphical notation since it can represent numerous relations more comprehensively and intuitively. The language elements are represented by the graphical symbols shown in Tables 1 and 2. Also, the properties and constraints for the elements are given. Each symbol has a set of properties that must be defined by the user once a symbol is added to the diagram.

Name Notation Description
Transaction type A transaction has only one package as a drop target. Table 2. Connector elements of the domain-specific modeling language.

Direct relation
A solid line arrow connecting two artifacts. The arrow is drawn at the point where the line meets a target artifact. The name of a direct relation can be shown in the middle of the line. Next, reference information can be shown on the end connected to a target artifact. Both labels are optional.

Indirect relation
An indirect relation is shown as a dashed arrow line between two artifacts. The arrow is drawn at the point where the line meets a target artifact. The name of an indirect relation can be shown on the end connected to a target artifact. This label is optional.

Association
An association connects an artifact and a transaction. Association end by an artifact is indicated graphically by a small field circle. It should be interpreted as showing that transaction occurs only on one artifact.
The example in Figure 5 illustrates the graphical symbols used in the modeling notation. There is a clear distinction between language elements and each symbol has a reference to one modeling element. Explicitly, the symbol used for the artifact type log element serves as a starting point for creating a new model. It represents a composition of a set of elements that allows for reducing diagram complexity. Figure 5 shows a model where the purchase order is the main artifact.

Results
In this section, we demonstrate the feasibility and usefulness of the proposed DSML in a real setting. Our main motivation for developing this solution is to enable domain experts to model data of interest at a conceptual level and then automatize the extraction of event logs from ERP systems. As such, the aim of this evaluation is to verify whether the DSML model can provide sufficient information to extract and transform data from transactional ERP databases to the XES target representation. In order to validate and test the model, we use real data from a Microsoft Dynamics NAV ERP database.

Case Study
A case study has been carried out for the procurement process supported by the Microsoft Dynamics NAV ERP system. The database shares most of its elements with the running example from Section 3 and can be seen as a broader form of the running example. The data was provided by a Serbian wholesale furniture company operating in the Western Balkan region. We used NAV 2017 because the company had implemented this version of the system.
The general process behavior is given as follows. First, a new purchase order with line items is created. The purchase order needs to be approved before it can be released to a vendor. After the approval, the goods are received, and the receipt document attached to the material is inserted into the ERP system. Also, the corresponding invoice is received and registered in the system. When the invoice is posted, the details are entered into a ledger and the invoice is released for payment. Before the invoice can be paid, it is matched to one or more purchase orders. Typically, the management expects that the corresponding invoice is not paid without the registered goods receipt document. In practice, different variations of this process exist which can be explained by the fact that ERP configuration settings are not set very strict to enable the flexibility necessary in the company. Note that the purchase orders can be linked to a purchase quote or to a contract document. Obviously, the process described above has one-to-many and many-to-many relations in the data. For instance, one invoice can be matched to multiple purchase orders and multiple invoices can be issued for a particular purchase order. Because of the existence of multiple data objects, it is difficult to choose a unique case or process instance notion for the whole process. The goal is to avoid the data convergence and data divergence problems (as explained in Section 3) by choosing proper multiple case notions or artifacts.

Step 1-Designing the Model
The first of the three main steps is designing the model for a given case. This step consists of two stages. The first stage is collecting all of the relevant inner workings and peculiarities of the system by seeking expert inputs. This leads to the second stage, which is incorporating expert opinions into a viable model.

Seeking Expert Inputs
The expert business opinion is used to populate elements of our metamodel. Consulting domain experts, we determine artifact types and correlate events with artifacts. Then domain knowledge is needed to obtain the relations from the data model. This is so because, for the most-used ERP systems, e.g., Microsoft Dynamics NAV ERP, the relationships between tables are not explicitly defined in the database management system (DBMS) software. Additionally, an excessive volume of data unnecessarily increases the scope of the project and consequentially complicates it. This is the reason why experts are needed. They understand the procedures and rules of the company in question, so they can outline which data are needed to optimally achieve the set goal. The experts should define the relationships between the tables, as well as main artifacts and important attributes of the database so that the event log can be properly mapped. For example, we identified two key documents in the case study, namely, the purchase order and purchase invoice. These are the primary objects in the process and an event log is obtained for each.

Incorporating the Expert Opinions into a Model
For the need of testing, we use the purchase order as the main artifact to validate our solution. The resulting DSML model is illustrated in Figure 5. The model includes four artifact types on different levels of granularity, namely, the purchase order, purchase receipt, purchase invoice, and change log. The change log represents the document line subclass which captures changes made to the fields on the specified document that could be translated into event types. Purchase order is defined as the main artifact of the log which is selected as a case object, i.e., a type of process instance. Different transactions can be associated with these documents, i.e., the create purchase order is linked to the purchase order artifact and the posting goods receipt is linked to the purchase receipt artifact, etc.
Only the relevant transaction types should be selected considering the goal of the analysis. Further, the model consists of two direct relations and one indirect relation. The reference field is specified for each direct relation, while the join path attribute is specified as a string for purchase debit indirect relation.

Step 2-Defining SQL Query Patterns
In order to extract data, we defined SQL query patterns that will be used to create specific SQL queries whose attributes and tables would depend on the elements and properties of the model.

Select Main Artifact Instances
The artifact metaclass will be directly transformed into a SELECT statement. The SELECT clause specifies the identification properties of the data object to be retrieved, and the FROM clause indicates the table to be used. As stated earlier, these properties are mandatory for each artifact type. The condition metaclass attribute represents the condition to be fulfilled to distinguish various artifacts having the same table, e.g., different purchasing documents can be stored in one table. It will be transformed into a search clause. An extra condition can be used to specify the boundary date values for getting all event data in a certain timeframe. The values are mapped from timestamp attributes of artifact type log metaclass. The main artifact instances can be retrieved using the following query (see Listing 6).

Obtaining Interactions between Artifacts
A list of type-level relations from the main artifact to other artifacts is obtained. Each relation class is used to generate an SQL query that joins the tables of source and target artifacts and selects the identification properties of each artifact. Specifically, direct relation instances are retrieved by joining the two tables based on one or more reference fields defined in a model. For indirect relations, it is possible to connect two tables using the join path attribute which involves multiple joins to be performed. The result is effectively a join table that holds two artifact IDs, i.e., the source and target, which allows us to identify interrelated artifacts. Because of space limitations in this paper, we illustrate one such query pattern in Listing 7. To illustrate with a concrete example in our case study, an SQL query is shown in Listing 8, which selects relation instances between purchase order and purchase receipt artifacts. Table 3 shows the excerpt of the query result. Notice that, in this case, two instances of direct relation exist between the purchase order PO1 and receipt documents R1 and R3. It should also be noted that the examples specific to the case study should not be written manually, but that they would be generated based on the SQL query patterns and the provided model.  Table 3. Excerpt of the purchase item direct relation instances.

Select Transaction Instances
Next, the transaction metaclass, which describes the event type and timestamp, is also transformed into a SELECT statement. Each transaction type has a reference to only one artifact type according to the association relationship of the metamodel. Thus, the SELECT query has answer variables, i.e., identifiers of a related artifact that denote occurrences of transaction instances, timestamp, and identifiers of the main artifact, which establish a relation between transaction instance and process instance, i.e., the case it belongs to. The FROM clause indicates which direct or indirect relation will be used in the join to select only transactions which belong to the main artifact instances. The type-level relation between two artifacts is set up in the DSML model. For example, the purchase debit indirect relation is specified and then used to query the data concerning book invoice transactions.
We use the placeholder "<< . . . >>" for particular parts of the pattern which will contain the results of a subquery. The placeholder "<<relation join>>" (line 4 in Listing 9) subsumes the results of the query pattern for a type-level relation such as the one shown in Listing 8. Then, the resulting table needs to be joined with the table of the artifact to which the transaction is related, to get the timestamp of events. Additional conditions can be expressed in the where clause to distinguish transaction types or to filter the results. For example, the user can define an aggregation function to obtain only the first or last recorded transaction for the analysis. The transaction instances can be retrieved using the following query. The number of attributes and tables used in the log extraction step depends on the elements and properties of the DSML conceptual model. To create a high-level overview of the procurement process, our domain experts suggest we consider the four relevant transaction types for process analysis (see Figure 5). Therefore, the four database tables were used in the log extraction step for this case study. The Purchase Header, Purch_Rcpt_Header, Purch_Inv_Header, and Change Log Entry database tables were used.
The user is able to determine the additional constraints on which instance to consider within the conceptual model. For our analysis, we considered data recorded between 15-06-2017 and 15-10-2017. The boundary date values for obtaining all event data in a certain timeframe are mapped from the timestamp attributes of the artifact type log metaclass. In total, 31 cases were extracted from the real data of ERP system for the main artifact Purchase Order. A fragment of the extracted event log is given in Table 4. Each line corresponds to an event.

Step 3-Mapping DSML Model to XES Format
We now briefly describe how to map the model provided by the user and the data provided by the queries to an XES document which can later be used for process mining. The mapping is described in Algorithm 1. Figure 6 demonstrates the excerpt of the XES document for our case study. Map the main artifact instances to event log case objects. The "concept:name" attribute will store the case identifier, which will allow the grouping of all events linked to the same instance into a single trace. The necessary data comes from the SQL query result set for the main artifact instances. It populates the trace element in an XES model (see Listing 10).

2.
Derive the identity of each event in one of two ways: a.
Derive the name from a descriptive name of the activity or b.
Derive the name from a list of two attributes: the name of the activity and the event identifier.
This will make the log come with two event classifiers. Only one classifier will be used to determine the event class at a time, as well as assign a label to each event in a log. Therefore, the answer variables of queries resulting from the query pattern in Listing 9 are used in the definition of an event. Each transaction instance will be mapped to an event of related trace, i.e., the main artifact instance, according to the dependency relationship of the metamodel.

3.
Define a list attribute, as a new domain-specific attribute associated with a trace, in order to specify the relationships between the main artifact and the target artifacts. One list attribute is defined for each relation type relevant to the main artifact. Use the relation name to map the list attribute key.

4.
Define child attributes for each list attribute. One child attribute should correspond to one instance of the target artifact type. The child attribute is made up of a key, which is a sequence number starting from 1, and a value, which is the identifier of the target artifact. The necessary data comes from the answer variables of queries resulting from the query pattern in Listing 7. The child attribute gets its value from the target artifact identifiers as seen in Listing 11.
Listing 10. XES declaration of the name attribute for the purchase order instance.

Process Discovery
The goal of this subsection is to demonstrate that the generated XES document is valid, which would mean that the mapping is successful. We can demonstrate this by generating a business process model from the XES document with the process mining tool Disco. The resulting model can be seen in Figure 7. This shows that the XES document can be used to successfully generate process models and that the process models are in line with the previously discussed business logic, making the XES document valid. We will not analyze the model further, as it is beyond the scope of this paper. The discovered process model represents a high-level overview of the procurement process. The language proposed in this work allows the user to choose a specific view on data for building the log for analysis. Different selections of the case notion and the relevant transactions in the conceptual design phase of our solution provide different views on data. This selection is mostly influenced by the goal of the analysis. The user can choose what parts of the event data can be interesting to look into. For our case study, one can create a different view by adding more transactions within the DSML conceptual model, so a bigger process model and different variants of the process can be discovered.

Discussion
This paper has attempted to address the issue of extracting suitable event logs from data stored in relational databases, supporting domain knowledge. We have particularly focused on the business processes supported by ERP systems, which are generally seen as a collection of interrelated documents. In fact, a single process instance may involve several documents, each with its own life cycle. Hence it is very difficult to describe the evolution of various documents using a single process definition. This has led to artifact-centric approaches which can deal with one-to-many and many-to-many relations, avoiding convergence and divergence problems. These approaches have proven more appropriate for complex business processes supported by an ERP system [9].
To deal with one-to-many and many-to-many relations, the metamodel proposed in this paper provides an artifact-centric view of databases. The DSML metamodel was initially described in our previous work [30]; however, this metamodel was revised and we acknowledged some limitations compared to the new metamodel proposed in this paper. Firstly, the metamodel was incomplete in terms of its semantics and also inconsistent that made it more difficult to understand by the end-user. The artifact type log element was not considered, thus lacking properties which provide valuable information, such as the timestamp from which to begin querying for event data. Also, the attributes of metaclasses and the number of metamodel restrictions were not described. Secondly, the different levels of abstraction were within the model which compromised an appropriate interpretation of the model. For example, the trace and link elements at type level (M1) were included.
Afterward, a full specification of the language was provided in this paper. This DSML is composed of the revised metamodel with its attributes and new constraints and a graphical representation of the metamodel concepts. In addition, we have determined how certain metaclasses of language could be transformed into SQL code. This is an automatic process that does not require any further user interaction.
The case study demonstrates that it is possible to apply the developed solution to a real example. The results show that the language includes constructs that enable a domain expert to easily model data of interest in a log extraction step. Once a conceptual model has been designed, this provides sufficient information to automate the generation of queries. We have successfully described the preferred artifacts and the relations between them. We argue that our solution works in real settings because it facilitates mapping from the conceptual model and queries variables to the target XES representation. The output consists of an event log in the XES format for each main artifact. This results in avoiding the data convergence and data divergence problems, since multiple event logs are obtained with stored relations among artifacts. For this purpose, we use the list attribute in the XES standard.
We have demonstrated that the generated XES document is valid by discovering a business process model. The resulting process model is in line with the business logic presented in the case study. Furthermore, an artifact-centric process model can be discovered from these multiple event logs using existing process discovery techniques. A discussion on that topic extends beyond the scope of this paper. We refer the readers to [11,25] for more information about process discovery techniques that can be used to discover artifact life cycle models and artifact interactions.
Furthermore, the authors are aware that the main limitation of the presented solution is the lack of tool support in particular phases. In view of the positive results of the study, active work is being carried out to develop a more natural and sophisticated tool to provide better guidance to users.

Conclusions
In this paper, we have proposed a graphical DSML that provides support to domain experts in obtaining event logs from ERP databases. A new metamodel version with corresponding constraints has been introduced, as well as a graphical representation of the language concepts. Specifically, domain experts will be able to indicate where cases and events are located within a database at a conceptual level. Afterward, these conceptual models will be automatically validated and transformed into SQL code. When the query results are obtained, the extracted data can be mapped to XES elements. We have briefly described each component and demonstrated the applicability of the approach using a case study with real data from a Microsoft Dynamics NAV ERP database. Although the demonstration shows its usefulness, more empirical experiments are still needed. As part of our work-in-progress, we are developing a specialized CASE tool that will include a set of applications to support the implementation of each component.
Besides, the proposed language makes the querying of process data easier and domain experts can focus on modeling data of interest conceptually without the need to learn a programming language. As future work, a further formal definition of mechanisms for mapping ERP transactional data to the XES event log will be provided according to the data extraction model and transformation rules.