1. Introduction
People often manage work structured into multiple tasks, intuitively organized and carefully orchestrated to achieve a specific goal. Whether it is baking bread, making a bed, or designing a house, these activities involve creating or modifying a tangible or conceptual entity. Each task, regardless of its simplicity or complexity, follows a structured process to achieve the desired outcome. A task within a process must be performed under specific conditions that determine its order and dependencies. Processes may also be referred to as procedures, while tasks represent individual units of work. A resource can be a person, a machine, or a group of individuals or machines responsible for carrying out specific tasks [
1].
Business processes are the backbone of any organization, defining how work is structured, executed, and optimized. Modeling these processes [
2] using workflows provides a structured representation of tasks, their sequences, dependencies, and decision points, ensuring that business activities are executed consistently and effectively. Data pipelines are a particular kind of workflow, describing the flow of data through multiple stages, including extraction, transformation, and loading (ETL) [
3], but can also extend to other operations such as data cleaning, validation, aggregation, analysis, or feeding data into machine learning models [
4]. A data pipeline is a specific implementation of a workflow for processing data, focused on moving data through various stages, with clear inputs, transformations, and outputs [
5]. These workflows play a critical role in modern data environments, facilitating the seamless integration of data from multiple sources, ensuring its transformation according to business logic, and ultimately making it available for analysis, reporting, or other downstream tasks [
3].
However, the design and implementation of these data pipelines often face challenges due to the complexity of managing heterogeneous data sources, varying infrastructure, and diverse requirements [
6]. Traditionally, ETL development has been approached in an ad hoc manner [
7], with developers focusing on the tool they plan to use, whether a commercial tool, custom-built solution, or script-based approach [
This approach typically leads to algorithms expressed in technical languages or specific programming architectures, making the design difficult to understand, communicate, and reuse. Such practices result in complex, error-prone systems that are hard to maintain and scale.
General software modeling languages, such as UML, have been applied in ETL contexts [
8]. However, general software modeling techniques often fail to capture the specificities of ETL processes and are viewed as an extra effort due to their lack of alignment with the inherent complexities of data workflows [
9]. The result is that many initial modeling efforts are disregarded, and the design process remains disconnected from the actual implementation, contributing to inefficiencies. In contrast, several specific modeling approaches have been developed for ETL representation [
2,
10,
11,
12], aiming to address the unique requirements of data integration processes. However, these approaches also come with their own set of challenges. For instance, adopting a new modeling language or framework often requires learning new semantics, which can be time-consuming and may pose a steep learning curve for teams unfamiliar with the specific notation or methodology. By contrast, leveraging an existing and widely adopted language, such as BPMN, reduces this entry barrier, as many professionals are already familiar with its syntax and modeling conventions, and extensive documentation, community support, and tooling are available. While ETL-specific notations aim to address domain concerns more directly, they often lack maturity, standardization, and accessible learning resources.
This paper proposes a methodology for the development of data workflows that begins with a high-level conceptual specification and introduces a blueprint-based mechanism to support its structured translation into logical primitives. These blueprints serve as an intermediate layer, bridging the gap between abstract models and concrete implementations. The methodology also outlines a preliminary approach to physical realization: execution-ready primitives are derived through a dedicated middleware component, integrated into the proposed framework, which interprets the logical blueprints and generates the corresponding implementation artifacts. By building on conceptual workflow representations and progressively enriching them into platform-specific components, the approach ensures consistency and scalability across different environments. Furthermore, it promotes a clear separation between conceptual design and technical implementation, allowing designers to validate workflows at an abstract level before addressing deployment-specific concerns. Additionally, by breaking workflows down into isolated components, we enable different teams to work concurrently on different parts of a project, improving collaboration and productivity. The methodology also promotes the reuse of components across projects, reducing development costs and improving the flexibility, reliability, and maintainability of data workflows. A tool that supports this progression from conceptual primitives to physical implementation is also presented.
The main contributions of this work are as follows:
A three-phase methodology (conceptual, logical, physical) for modeling data workflows based on BPMN.
A custom BPMN metamodel and tool extension using BPMN.io, enabling constraint-based modeling.
A blueprint specification schema using JSON to describe reusable ETL primitives, supporting validation and transformation.
A synthesis of previous approaches into a unified framework, with illustrative examples based on real-world data to demonstrate its applicability.
In the following sections, we begin by reviewing related work on data workflow modeling, covering both traditional and modern approaches (
Section 2). We then examine how Business Process Model and Notation (BPMN) has been applied to represent ETL pipelines and introduce our proposed three-level methodology, detailing the conceptual, logical, and physical layers of our approach, supported by a BPMN.io-based tool and formal blueprint specification (
Section 3).
Section 4 concludes the paper with a summary of findings and implications for workflow engineering.
2. Related Work
Visual representation of data workflows has been explored through various approaches as a way to improve communication and reuse, and to benefit from interactive visualizations that include annotations, metadata, and datasets. Graph-based techniques have been used to model ETL activities, allowing for multi-level transformations and updates [
13].
UML-based approaches provide mechanisms for the conceptual modeling of ETL processes, facilitating the specification of common operations and integration with data warehouse schemas [
14]. BPMN (Business Process Model and Notation) is also used as a standard representation for ETL processes, enabling the creation of business process-aware ETL tools that can adapt to changing requirements [
15]. These visual modeling techniques not only serve as blueprints for ETL workflow structure but also incorporate the internal semantics of each activity [
13,
16]. This allows for the development of rigorous quality measurement techniques for ETL workflows, building upon existing frameworks for software quality metrics. In contrast, some authors have recognized the limitations of general-purpose modeling languages such as UML and BPMN for the design of ETL/data workflow processes and have proposed domain-specific notations tailored specifically to ETL workflows [
2,
12,
17]. These specialized notations aim to provide a more intuitive and expressive way to define, analyze, and optimize ETL processes by incorporating ETL-specific constructs that are not natively supported by UML or BPMN.
In [
18], the authors framed the ETL development process within the conceptual, logical, and physical phases. Their methodology ensures that data workflows are systematically designed, validated, and implemented, aligning business requirements with technical execution. In the conceptual phase, the focus is on defining business requirements and high-level data flows without delving into technical details. This phase provides an initial abstraction that facilitates communication among stakeholders, allowing them to outline data movements and transformation goals without being constrained by implementation-specific considerations. By working at this level of abstraction, teams can identify potential bottlenecks and inefficiencies early in the design process, improving the overall clarity and direction of the workflow. The authors introduce a custom notation specifically designed for conceptual modeling. This phase begins by identifying and documenting the core components of the workflow, including the diverse data sources (e.g., databases, flat files, APIs), the target destinations (e.g., data warehouses, lakes), and the sequence of transformations required to harmonize and prepare the data for its final use. Attributes are first-class citizens and transformations are represented as logical operations—such as cleaning, filtering, joining, or aggregating data—while intermediate states of data, known as record sets, are mapped to illustrate how information evolves as it moves through the pipeline.
Once the conceptual phase establishes the broad structure of the workflow, the logical phase focuses on defining the precise steps and dependencies within the ETL or data pipeline. At this stage, the abstract model is refined into a structured design that specifies how different data integration and transformation tasks interact. The logical phase refines the conceptual ETL model into a structured, technology-agnostic blueprint, formalizing the workflow’s activities (extractors, transformers, loaders) and their dependencies as a directed graph. Transformations are defined with precise logic (e.g., filter conditions, join predicates), while persistent data stores (sources/targets) and transient data flows are distinguished to clarify data lineage. This phase bridges the abstract conceptual design with eventual physical implementation, serving as a verifiable intermediary that guarantees the workflow’s integrity before tool-specific optimizations. BPMN diagrams and UML sequence diagrams help define dependencies between tasks at the logical phase, while ETL-specific languages such as SQL-based transformation mappings provide a more formal representation of data processing logic. More advanced approaches, such as Petri Nets, can be applied to model concurrent ETL tasks and verify workflow correctness, ensuring that all processes execute efficiently and without deadlocks or inconsistencies.
Finally, the physical phase involves translating the logical model into a fully operational ETL pipeline. At this stage, workflow designs are implemented using specific tools and technologies or SQL-based ETL procedures [
19]. Automatically generating the physical implementation of an ETL pipeline from a logical model remains challenging due to the heterogeneous characteristics of ETL tools and platforms. Each tool (e.g., SSIS, Pentaho Data Integration, or Apache NiFi) or SQL-based framework operates with distinct syntax, execution paradigms, and optimization strategies, making it difficult to create a universal translation mechanism. To bridge this gap, researchers have proposed leveraging ETL patterns—reusable, abstract templates that generalize common operations (e.g., data cleansing, surrogate key assignment, slowly changing dimensions) [
16].
The development of data workflows differs significantly from traditional software engineering due to their dynamic nature, frequent changes, and performance-driven optimizations. Unlike conventional software development, which often follows well-established systematic methods, data workflows must constantly adapt to new data sources, evolving business rules, and shifting infrastructure requirements. As a result, formal methodologies and structured modeling approaches are not commonly applied in their development [
20]. One possible reason for this divergence may be the highly volatile and evolving nature of data workflows. Unlike software applications, where the architecture remains relatively stable over time, ETL processes and data pipelines must frequently adjust to changes in data schemas, integration requirements, and performance constraints. Data engineering teams often work with heterogeneous and unpredictable data sources, requiring constant adaptation rather than rigid adherence to predefined models. Moreover, performance optimization is a key concern, as data processing workloads can vary significantly depending on the volume and variety of incoming data. These practical constraints make it difficult to enforce a systematic methodology, leading teams to favor agile and iterative development approaches.
The scarcity of academic research on systematic methodologies for data workflow development further reflects this industry-driven reality. While data workflows and data engineering play a crucial role in modern analytics and decision making, research efforts in the field of data science have historically been more focused on machine learning, artificial intelligence, and advanced analytics, rather than the foundational engineering of data workflows. Additionally, the rapid evolution of data processing tools—from on-premises ETL systems to cloud-based data orchestration platforms [
21]—has made it difficult to establish a universal modeling standard for data workflows. Instead, best practices are often dictated by industry trends, tool-specific capabilities, and case-by-case performance considerations.
In previous works, we have adopted two approaches for ETL conceptual modeling using BPMN [
15,
22]. In [
15], we address the need for a well-defined methodology for using BPMN to represent ETL workflows. The work’s primary motivation is to establish a standardized methodology that ensures consistency, reduces misinterpretations, and allows for the reliable translation of BPMN models into executable ETL processes. A step-by-step approach to BPMN-based ETL conceptual modeling is presented; the methodology focuses on clarifying the ambiguities inherent in BPMN modeling and provides clear guidelines on how to represent various ETL tasks and processes. Three layers are presented: process, pattern, and task.
The process level provides an overview of the ETL system’s main processes, each directly related to one of the objects to be populated. It represents only the dependencies between ETL components. This process model can be used as a top-down mechanism to progressively develop the other layers using BPMN subprocesses. Process understandability can be tuned by hiding information that is irrelevant to a particular development stage or even to a specific user profile.
Figure 1 represents a conceptual model example, demonstrating the main populating processes associated with the load of the star schema. The process reveals the need to first populate the dimension tables (and the independence of tasks through parallel gateways) and then the fact table (preserving referential integrity).
Table 1 represents the conceptual modeling guidelines used for
Figure 1, emphasizing strategies to simplify and abstract the representation of ETL workflows. These guidelines collectively prioritize readability and abstraction, ensuring the model remains accessible while capturing essential workflow logic.
After the process level, the pattern level (
Figure 2) identifies specific modeling patterns and standardizes them, using BPMN elements to represent common ETL operations. The process details the “Load Customer Dimension” subprocess from
Figure 1. By following these patterns, designers can create BPMN models that are not only visually intuitive but also consistent in their interpretation across different users and tools. The authors presented a set of patterns that are especially relevant in data integration scenarios, e.g., change data capture, surrogate key generator, and slowly changing dimension.
This layer integrates BPMN data stores to visually represent data repositories, with directional arrows specifying input/output relationships. Each pattern (distinguished by a # in its name) is annotated with structured metadata that declaratively describes its behavior:
Pre-conditions: Validate task prerequisites.
Input: Source repositories (formatted as database.schema.table).
Description: Task purpose in plain text.
Output: Target repositories.
Post-conditions: Rules to verify task success (e.g., data quality checks).
This structured approach clarifies dependencies, execution criteria, and the outcome validation, balancing abstraction with actionable detail to guide implementation.
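For illustration, such an annotation for a hypothetical #SCD-T2# pattern could carry the five fields above as follows; the repository names and conditions are invented for this example and are not taken from the original models.

```python
# Hypothetical structured annotation for an #SCD-T2# pattern, following the five
# metadata fields described above; repository names and conditions are illustrative.
scd_t2_annotation = {
    "pre_conditions": "staging.dbo.customers refreshed in the current run",
    "input": "staging.dbo.customers",            # formatted as database.schema.table
    "description": "Type-2 slowly changing dimension load for customer records",
    "output": "dw.dbo.dim_customer",
    "post_conditions": "no duplicate active business keys; row counts reconciled",
}
```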
The third level (task level) represents the elementary level for ETL and is presented in
Figure 3. The process details the #SCD-T2#—“Load to Customer Dimension” subprocess from
Figure 2. The layer employs BPMN collaboration diagrams to represent interactions between distinct entities, such as source systems and staging areas, using pools and message flows to denote synchronized partnerships. Complex operations can be abstracted into subprocesses, enabling modular drill-downs without cluttering the high-level view. Data stores explicitly depict repositories such as lineage tables and staging areas, with directional arrows clarifying input–output relationships. Structured annotations accompany each task, detailing pre-conditions for execution, input sources, output targets, and post-conditions to validate success. Loop markers indicate record-level processing, which iterates over individual customer records. Pools are included only when critical to understanding partner interactions, such as distinguishing source systems from staging areas, while intermediate events like error handling are omitted to prioritize clarity. These guidelines collectively ensure the model remains accessible to stakeholders while encoding the technical rigor required for implementation, facilitating validation and reuse across ETL workflows. This work does not address physical primitives to enable the translation to ETL tools.
In [
22], the article proposes a methodology for traditional process modeling [
23] and addresses how to adapt it to an ETL context. The approach presented in the paper organizes ETL modeling into three progressive levels of abstraction (descriptive, analytical, and executable), going beyond the conceptual representation through the definition of executable primitives.
The descriptive level provides a straightforward and accessible framework for designing ETL workflows. Adopting core elements similar to flowcharts, such as tasks (individual steps like data cleansing), subprocesses (grouped operations for modularity), and gateways, it maps workflows with minimal complexity. Events mark critical milestones, such as the initiation of data extraction (start event) or the completion of a load phase (end event). The model allows for organizing workflows using swimlanes, where pools define broad boundaries (e.g., source systems vs. staging areas) and lanes categorize phases like extraction or transformation. The models produced according to these guidelines are similar to the model presented in
Figure 1, allowing for representing the same process using different layers using subprocesses.
The analytical level introduces greater detail, incorporating dynamic behaviors, error handling, and structured metadata to support decision making and optimization. This level enables adaptive processes through intermediate events (e.g., timers, messages, errors) and boundary events (e.g., error handlers attached to tasks). These elements allow workflows to react to interruptions, such as redirecting invalid records to quarantine tables during data validation or triggering time-based retries for stalled processes. For example, a timer event might pause an ETL job until a source system becomes available, while an error boundary event could reroute corrupted data to a repair subprocess without halting the entire pipeline. Data interactions are formalized using data objects (transient inputs/outputs) and data stores (persistent repositories like databases), linked via directed associations to clarify dependencies. The parameterized annotations are explicitly documented (similar to the ones used in
Figure 2). The analytical level employs complex gateways, artifacts, and text annotations to improve documentation, while intermediate events enable event-driven execution, restoring the core elements of level 1, but enriching them with analytical depth, focusing on error resilience, conditional logic, and data-based execution. This level is useful for scenarios requiring precision, such as reconciling conflicting data sources or managing real-time ingestion, balancing analytical rigor with clarity for both technical and business stakeholders.
Figure 4 presents an example from [
22], and maps the data flow from the source systems to the staging area, focusing on the change data capture (CDC) mechanisms. Database objects are explicitly represented as BPMN data stores, distinguishing between source systems, with directional associations clarifying input–output relationships. This aligns with BPMN’s emphasis on visual clarity, ensuring stakeholders can trace data lineage and dependencies at a glance. A BPMN intermediate compensation event is incorporated, modeled as a boundary event attached to the activity. This event triggers when specific error conditions are met, moving affected records to a quarantine table. This practice adheres to BPMN’s standards for exception handling, enabling workflows to maintain continuity while isolating faulty data for later resolution. The quarantine process itself is documented through BPMN text annotations using the same approach as in [
15]. As referred to before, this practice transforms BPMN from a purely conceptual tool into a bridge for technical execution, offering implementers precise directives while retaining the model’s readability for non-technical stakeholders.
Finally, the executable level focuses on enriching models into deployable workflows, though it acknowledges that actual execution depends on integration with specialized ETL tools rather than direct BPMN implementation. The authors propose an approach where a structured, intermediate representation—specifically in JSON—is used to enrich conceptual models and enable their transformation into executable ETL implementations. This method allows for the dynamic generation of executable code or scaffolds. The approach is exemplified through the use of BIML (Business Intelligence Markup Language), which translates structured JSON annotations into fully functional SSIS packages. The authors also acknowledge the trade-off between abstraction and usability: while highly structured annotations provide precision, they can become complex and resemble full-fledged configuration languages. In this context, the integration of AI agents could improve this approach: high-level annotations could be used as prompts for AI systems to validate, clarify, and generate executable models, or to reverse-engineer physical implementations back into conceptual representations. This bi-directional mapping supports model-driven development by keeping business logic and technical execution aligned throughout the ETL lifecycle.
Both approaches share several key similarities in their objectives and structure, despite some differences in terminology and specific focus areas (
Table 2 summarizes the main stages of both approaches). Both works emphasize the need for standardized methodologies in conceptualizing ETL processes, which are often developed in an ad hoc manner, leading to inefficiencies and miscommunications across stakeholders. While both existing methodologies emphasize the importance of layered modeling and structured transformation, they adopt distinct strategies in their representation of abstraction levels, patterns, and executable translation. The proposal presented in this paper aims to consolidate their strengths into a single cohesive framework that supports clarity, reusability, traceability, and intelligent automation across the entire ETL lifecycle.
3. Proposed Methodology
This section presents the methodology proposed for modeling and implementing data workflows using a multi-layered BPMN-based approach. The methodology builds upon the foundational concepts introduced in our previous works [
15,
22] while introducing several novel contributions.
In prior studies, we explored the conceptual modeling of ETL processes using BPMN and discussed how reusable workflow patterns can support both analysis and design. These earlier works established a layered modeling approach, with each level serving a distinct purpose: from high-level process representation [
15] to analytical refinement and task-level specification [
22].
In this paper, we extend those foundations with the following new contributions:
A refined three-layer methodology—conceptual, logical, and physical—explicitly aligned with the lifecycle of data workflows.
The introduction of blueprints as logical-level workflow specifications, defined through a lightweight JSON-based domain-specific language (DSL), enabling tool-supported validation and transformation.
A specification of BPMN modeling conventions and element specializations tailored to data workflow design, facilitating the conceptual-to-logical mapping.
A prototype extension of the BPMN.io editor that supports real-time validation and guides users in aligning models with the methodology’s constraints.
An initial approach to operationalize the physical layer through the transformation of blueprints into execution-ready artifacts.
By clearly distinguishing between prior foundations and these new additions, we aim to present a cohesive and extensible methodology that supports the design, specification, and operationalization of data workflows in a structured and tool-assisted manner.
3.1. Conceptual Layer
The conceptual phase represents the first level of abstraction in the BPMN-based methodology for modeling data workflows. Its primary goal is to provide a high-level, intuitive representation of the process, capturing the main steps and components involved in the data workflow without delving into technical or implementation-specific details. At this stage, the model should serve as a shared communication artifact, bridging the gap between technical teams, business stakeholders, and data consumers. In this phase, BPMN elements are used selectively to ensure clarity and simplicity. The conceptual model aims to reduce visual and cognitive complexity while preserving the expressiveness necessary to represent the logical structure of the data pipeline. This selective usage forms the basis of a custom BPMN metamodel, specifically tailored for data-centric workflows.
To formalize this approach, we define a subset of the BPMN 2.0 metamodel composed of a restricted set of element types and attributes. These include task types with extended semantics for data processes, minimal event usage for denoting boundaries, optional gateways for decision and parallel logic, and the use of lanes to separate responsibilities or processing domains. We implement this custom metamodel through a model extension mechanism based on bpmn.io, particularly using the BPMN-JS modeling framework. The extension provides a semantic layer over standard BPMN elements by introducing domain-specific types and constraints on model composition. The BPMN-JS modeler is extended to reflect these conceptual constraints, allowing designers to select from a controlled palette of elements and annotate them using custom properties. These definitions are validated via JSON-based descriptors compatible with the BPMN-JS modeling stack, ensuring that conceptual models follow the prescribed rules.
The following subsections describe the modeling rules and the metamodel structure adopted in the conceptual phase. We present the allowed BPMN element categories (tasks, events, gateways, lanes), their semantics within a data workflow, and how they are extended using the BPMN model extension mechanism. This controlled vocabulary supports model validation and future transformation into logical and physical layers of the ETL pipeline lifecycle.
3.1.1. Activity Elements
In the conceptual model, BPMN activity elements represent atomic operations or macro-activities that are part of the data workflow. The purpose is to describe what is being performed, rather than how it is executed. The selection and extension of task elements in this phase aim to provide a clear and high-level view of the data process while preserving key information that can support further refinement in the logical model.
To this end, the following types of task-related elements are allowed in the conceptual metamodel:
Call activity: Used to reference external or reusable conceptual components. It is especially useful for modeling shared processes or imported data routines.
Sub-process (collapsed): Used to represent macro-activities that can be further detailed in subsequent phases or within the same model.
Sub-process (expanded): Allows inline specification of the steps contained in the macro-activity, useful when additional context is needed.
In addition to these task forms, the conceptual model allows the use of loop and multi-instance markers to represent recurring or distributed behaviors:
Loop marker: Indicates that a task or subprocess is repeated under certain conditions. This may be used conceptually to represent iterative operations such as data cleansing passes or retraining of a model.
Sequential multi-instance marker: Represents repetition, where the instances are executed one after the other; useful in sequential data transformations or time-series forecasting scenarios.
Parallel multi-instance marker: Represents concurrent execution of multiple instances of the task or subprocess. This is useful when tasks can be applied in parallel, such as processing multiple datasets or training parallel models.
The inclusion of these markers at the conceptual level is optional and should be controlled. They should only be used when the iteration or distribution behavior is essential to the meaning of the activity and contributes to understanding the data workflow at a high level. Care must be taken to avoid unnecessary visual complexity, especially in large pipelines. For that reason, it is recommended to apply these markers primarily to subprocesses, which encapsulate the repeating logic, rather than individual tasks.
In conceptual modeling of data workflows, different types of tasks, such as data extraction, transformation, and loading, interact with data objects and data stores in specific ways. The rules governing these interactions ensure clarity in the process and allow for a well-defined flow of data throughout the workflow. The DataSource task is used for extracting data from an external source, such as a file or a database. At least one data object (representing the input file) or data store (representing the database or storage system) must be used as an input. This ensures that the source of the data is clearly defined. The DataSource task may also have an output, representing the extracted data that will be passed to a transformation or data loading task.
The DataSink task represents the final step in the ETL process, where transformed or processed data is written to a target location such as a database or file. For this task, it must have at least one output data object or data store, as the data needs to be persisted or stored. The DataSink task can have multiple inputs, depending on the number of data sources or transformations being merged or loaded. The DataTransformation task represents the process of transforming data (e.g., filtering, aggregating, or mapping). This task does not necessarily have explicit inputs or outputs defined in all cases, as transformations are often applied to the data that flows through the process. However, for clarity, it is useful to model the transformation with one or more inputs (representing the data being transformed) and a single output (representing the result of the transformation). Every elementary task (DataSource, DataSink, DataTransformation) in the conceptual model follows a clear pattern regarding its data input and output interactions. Each task can have multiple inputs, but must have only one output. This ensures that the flow of data remains clear and well defined, avoiding confusion and ambiguity. The DataSource task and DataSink task must have clear connections to external data objects or data stores to indicate where data is coming from and where it is being sent. The DataTransformation task, though more flexible, should still define the relationship between input data and output data to maintain clarity.
These rules are particularly useful for conceptual modeling because they maintain simplicity and clarity while ensuring consistency across the entire process. By limiting tasks to having one output, the model avoids unnecessary complexity and focuses on representing the flow of data at a high level. The use of data objects and data stores also ensures that the model remains abstract, without delving into technical details such as exact data formats or types. As the tasks become more detailed in the logical and physical phases, the basic structure established in the conceptual phase provides a foundation for further expansion.
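As a rough illustration of how these interaction rules could be checked automatically at design time, the sketch below adopts one possible reading of the constraints stated above (a DataSource requires at least one input artifact, a DataSink at least one output artifact, and every elementary task has at most one output); the class and function names are illustrative, not part of the tool.

```python
# A sketch of design-time checks for the conceptual data-interaction rules above.
# Each task is described by its type and the data artifacts (data objects or data
# stores) associated as inputs/outputs; names and structure are illustrative.
from dataclasses import dataclass, field

@dataclass
class ConceptualTask:
    name: str
    kind: str                      # "DataSource" | "DataTransformation" | "DataSink"
    inputs: list = field(default_factory=list)   # associated data artifacts
    outputs: list = field(default_factory=list)

def check_task(task: ConceptualTask) -> list:
    """Return a list of rule violations (an empty list means the task is well formed)."""
    issues = []
    if len(task.outputs) > 1:                    # single-output rule
        issues.append(f"{task.name}: more than one output artifact")
    if task.kind == "DataSource" and not task.inputs:
        issues.append(f"{task.name}: DataSource needs an input data object/store")
    if task.kind == "DataSink" and not task.outputs:
        issues.append(f"{task.name}: DataSink needs an output data object/store")
    return issues

# Example: a source reading a raw file and producing one extracted dataset.
extract = ConceptualTask("Extract customers", "DataSource",
                         inputs=["customers.csv"], outputs=["raw_customers"])
print(check_task(extract))   # -> []
```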
Figure 5 represents an extension to the BPMN metamodel (using Unified Modeling Language) that specializes the Activity class into various types of tasks relevant for conceptual ETL modeling. These include DataSource task, DataSink task, and DataTransformation task, all of which inherit from the BPMN task element. Each task type defines specific constraints on the number of associated DataArtifact elements, which unify the representation of data objects and data stores at the conceptual level. Cardinalities are used to express these constraints: a DataSource task must have exactly one input and one output; a DataSink task requires one input and may produce multiple outputs; and a DataTransformation task can have multiple inputs but is limited to a single output. These associations model the data flow logic commonly found in ETL processes, while maintaining clarity and abstraction suitable for the conceptual modeling phase. The diagram also preserves BPMN structural constructs such as SubProcess and CallActivity, and includes an optional MarkerType to represent iteration or parallelism semantics when appropriate at this level of abstraction.
3.1.2. Events
Events are used to represent the beginning and end of the process, as well as specific triggers that initiate data-related actions. Events are crucial for defining when a process starts, under which conditions it is activated, and when it concludes. However, given the abstract nature of this phase, the range of event types is restricted to maintain simplicity and clarity while preserving the expressiveness needed to model a wide variety of data workflow scenarios.
The following BPMN event types are allowed in the conceptual model:
Start event: Represents a generic entry point for a process. It is used when no specific trigger needs to be modeled or when the process starts as part of a larger orchestration.
End event: Marks the termination point of a data workflow. It helps to delimit the scope of the conceptual process and clarify where the pipeline logically concludes.
Message start event: Indicates that the process begins upon the reception of a message. This is useful to model external invocations, such as the arrival of a data file, an API call, or the output of another system.
Timer start event: Used when the initiation of the data workflow depends on a time-based condition, such as a scheduled batch job or periodic ETL process.
Conditional start event: Applied when the process begins based on the evaluation of a business rule or system condition, for instance, when a certain data volume threshold is reached or a flag is set in a database.
Signal start event: Represents a process that is triggered by a broadcast signal, which could come from a parallel pipeline or a control process. This is particularly useful for reactive workflows that need to respond to events within a data ecosystem.
These event types ensure that different initiation patterns in data pipelines can be represented while maintaining the simplicity expected at the conceptual level. From a metamodel perspective, the conceptual layer introduces a restriction on the BPMN StartEvent class by limiting its trigger property to the allowed types. These triggers are typically mapped to high-level concepts such as schedules, messages, and conditions rather than to specific system-level events. The EndEvent class remains simple, as no additional semantics are required in the conceptual model.
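A minimal sketch of this trigger restriction, assuming the allowed set listed above and hypothetical identifiers, is shown below.

```python
# Illustrative whitelist for start-event triggers allowed at the conceptual level;
# identifiers are hypothetical and do not reproduce the tool's internal configuration.
ALLOWED_START_TRIGGERS = {"none", "message", "timer", "conditional", "signal"}

def validate_start_event(trigger: str) -> None:
    """Reject triggers outside the conceptual subset (e.g., escalation, compensation)."""
    if trigger not in ALLOWED_START_TRIGGERS:
        raise ValueError(f"Start trigger '{trigger}' is not allowed in conceptual models")

validate_start_event("timer")   # scheduled batch load: accepted
```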
The class diagram from
Figure 6 provides a formal representation of the event elements allowed in the conceptual BPMN metamodel for ETL processes. At the core, all event elements inherit from the abstract event class, which in turn extends the base BPMNElement, representing any BPMN modeling component. To improve model expressiveness and clarity at the conceptual level, triggers are each represented as specific subclasses—namely, MessageStartEvent, TimerStartEvent, ConditionalStartEvent, and SignalStartEvent. Each of these can be used to indicate different initiation mechanisms of data workflows, such as scheduled jobs (timer), external system notifications (message), conditional dependencies (conditional), or asynchronous signals (signal).
The EndEvent class, which also inherits from the Event class, marks the termination point of a workflow. In the conceptual phase, it serves to clearly define the boundary of the ETL process without requiring further specification of output behavior or termination logic.
3.1.3. Gateways and Sequence Flows
In the conceptual phase of modeling ETL workflows using BPMN, gateways and sequence flows play a crucial role in defining the flow of control between different tasks and events. While sequence flows in BPMN represent the paths that link tasks, events, and gateways, gateways control the branching, merging, and synchronization of process flows. The use of gateways in this phase is essential to represent the decision making and parallelism that are integral to data workflows. This section explores the gateways allowed in the conceptual model and their relevance to ETL process representation.
Sequence flows in BPMN are used to connect different flow elements (tasks, events, and gateways), indicating the direction of the workflow. In the conceptual phase, sequence flows maintain the same semantics as in standard BPMN models. They establish the order of execution between elements, ensuring that the flow of control between the tasks and events is properly represented. Sequence flows are fundamental to creating a logical progression within the workflow, whether it is linear or branched.
Gateways are used to control the flow of the process by splitting or merging paths based on conditions or events. In the context of ETL workflows, gateways can be utilized to model decisions, parallel execution, or event-driven processes, all of which are key components in data integration. The gateways allowed in the conceptual BPMN model include the following types:
Parallel gateway (AND gateway): The parallel gateway is used to represent the execution of tasks that can run in parallel, indicating that multiple paths of execution should proceed independently and concurrently. This is particularly useful in ETL workflows, where different transformations or data extraction tasks can be executed in parallel to optimize performance. For example, different data sources might be processed simultaneously before being merged into the final dataset. In the conceptual model, this gateway helps to represent workflows where certain tasks are independent of each other.
Inclusive gateway (OR gateway): The inclusive gateway allows for one or more paths to be taken, depending on the conditions defined on each outgoing sequence flow. This type of gateway is suitable for modeling scenarios where some conditions may or may not be met, allowing for a flexible path selection. In ETL workflows, inclusive gateways can represent conditional processing where different data transformation steps are executed based on certain data characteristics or metadata. This gateway is particularly useful in the conceptual phase to express optional tasks that could be triggered based on specific conditions or input data.
Complex gateway: The complex gateway is used when the flow of control depends on a complex set of conditions. It allows for more advanced logic than the inclusive gateway by supporting multiple conditions and diverse scenarios. In ETL processes, the complex gateway could be useful for modeling sophisticated business rules or data transformation logic that cannot be captured by simpler gateways. Although it might seem like an advanced element, the complex gateway can be used at the conceptual level to represent workflows with intricate decision-making processes or complex dependencies between tasks.
Event-based gateway: The event-based gateway is employed when the flow of control depends on the occurrence of a specific event, such as a message arrival or a timer expiration. In ETL workflows, this could be useful for modeling event-driven processes where certain actions are taken based on external triggers or specific events in the data pipeline. For example, a workflow might proceed based on an event such as the arrival of new data or a scheduled task. This type of gateway is particularly relevant in conceptual modeling when the sequence of actions in the workflow is contingent upon external triggers, making it essential for workflows that interact with external systems or that depend on real-time events.
While gateways are typically used in more detailed phases of BPMN modeling, their inclusion in the conceptual phase can be highly beneficial for visualizing key decision points, parallelism, and event-based interactions within the data workflow. However, the use of gateways in the conceptual phase should be carefully considered to avoid overcomplicating the model.
The use of complex gateways and multiple branching paths should be limited, as they may introduce unnecessary complexity that could hinder understanding. However, using gateways such as the parallel gateway for parallel execution and the inclusive gateway for conditional tasks makes sense, as they represent essential characteristics of data workflows that need to be communicated to both technical and non-technical stakeholders.
3.1.4. Data Objects
In BPMN-based conceptual modeling for ETL workflows, data objects and data stores are essential for representing the data that flows through the system. Data objects are used to represent files or specific pieces of data used and manipulated during the ETL process, while data stores represent databases or large storage systems where data is persisted or retrieved from during the process. Data objects are used to show how data is passed between tasks in the ETL workflow, such as from an extraction task to a transformation task. For example, a data object might represent an input file containing raw data to be extracted or a transformed file generated during the data transformation step. Data stores, on the other hand, are typically used to model data sources or data sinks that are part of a larger data warehouse or database system, providing a clear view of where data resides—either as an input or output for different tasks.
3.2. Metamodel Implementation
This conceptual model identifies the essential blocks that will be further refined in the logical (next) phase and eventually translated into executable operations. We selected the BPMN.io project for implementing the BPMN modeler and built model extensions for it, defining the custom elements and properties. The palette of elements is restricted to those previously identified, and for each task a property panel identifies the specific task type used. When a DataSource is selected, a data artifact with an input association is automatically created to identify the source input. Inputs and outputs are defined and validated in the DataTransformation and DataSink task types as described in
Figure 5.
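As an illustration, the fragment below mirrors the kind of JSON descriptor used by bpmn-js moddle extensions, written here as a Python dictionary for readability; the tdw prefix, namespace URI, type names, and properties are assumptions and do not reproduce the actual descriptors of the prototype.

```python
# Sketch of a moddle-style extension descriptor for the custom conceptual task types.
# In practice this structure is serialized to JSON and registered with bpmn-js; the
# "tdw" prefix, URI, and property names below are hypothetical.
custom_descriptor = {
    "name": "DataWorkflowConceptual",
    "prefix": "tdw",
    "uri": "http://example.org/schema/tdw",
    "types": [
        {"name": "DataSourceTask", "superClass": ["bpmn:Task"],
         "properties": [{"name": "sourceKind", "isAttr": True, "type": "String"}]},
        {"name": "DataTransformationTask", "superClass": ["bpmn:Task"],
         "properties": [{"name": "blueprintRef", "isAttr": True, "type": "String"}]},
        {"name": "DataSinkTask", "superClass": ["bpmn:Task"],
         "properties": [{"name": "targetKind", "isAttr": True, "type": "String"}]},
    ],
}
```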
The proposal presented in this paper enhances the practices of previous works [
15,
22]. It distinguishes itself from previous approaches by introducing a custom metamodel and constraint-driven tooling that actively enforces modeling consistency. This formal backbone ensures that conceptual models are not only readable but also structurally sound and logically consistent, without embedding implementation complexity. Compared to [
15], the approach presented in this paper introduces a significantly higher degree of formalization. While ref. [
15] emphasizes simplicity tailored to the ETL domain and promotes a top-down understanding through high-level subprocesses, it lacks a defined metamodel and enforcement mechanisms to guarantee modeling consistency. In contrast, the presented approach adopts a custom BPMN metamodel and integrates tool-supported validation (via BPMN-JS extensions), enforcing constraints on task types, data flow, and process structure.
The approach in [
22] is grounded in general BPMN modeling principles, which supports abstraction but does not incorporate domain-specific modeling constraints. Moreover, our earlier work in [
15] deliberately maintained a high level of abstraction at the conceptual layer, assuming that recurring patterns would govern downstream modeling. However, this reliance on implicit pattern-based refinement may limit adaptability across different domains and use cases.
The BPMN model shown in
Figure 7 demonstrates conceptual alignment with the analytical representation in
Figure 4, while adhering to the methodology’s abstraction requirements. Three key design decisions enforce conceptual compliance: (1) Task names were generalized to domain-agnostic formulations, decoupling them from implementation-specific terminology; (2) granular data annotations were intentionally omitted to avoid technical specificity incompatible with conceptual modeling; (3) activities were systematically classified using the custom BPMN task taxonomy defined in
Figure 5.
This representation maintains what we identify as the minimum viable specificity—preserving essential process logic while eliminating implementation details. Although more abstract renderings could theoretically satisfy phase requirements (as demonstrated in
Figure 1), this version strategically exposes critical control flow relationships needed for stakeholder validation. The detail level is similar to the analytical phase described in [
22]. The model remains extensible through BPMN’s native constructs, where process complexity can be accommodated through gateway patterns, event handling, and subprocess decomposition without compromising conceptual integrity.
3.3. Logical Layer and Blueprints
In the context of data integration and ETL processes, a logical model serves as an intermediate specification that bridges the gap between high-level conceptual modeling and low-level physical implementation. While the conceptual model focuses on abstract representations, logical modeling shifts the focus toward how these processes operate and how specific requirements affect their sequencing. Logical models define the control and data flow between ETL activities, describe transformation logic, and often incorporate dependency management, ordering constraints, and data lineage [
24]. These models are not tied to specific ETL tools but are precise enough to support automation, validation, and optimization of workflows. The ETL pipeline logical modeling can capture metadata such as source–target mappings, task parameters, and join and filter conditions. Logical modeling supports the reuse of transformation patterns and the validation of workflow semantics before committing to any platform-specific implementation, contributing to the overall robustness and maintainability of ETL systems [
25].
Refs. [
15,
22] approach the relevance of an intermediate logical layer to bridge the conceptual clarity with the executable implementation of ETL processes. However, they diverge in their goals and granularity. The logical model in [
15] is structured around the pattern layer, which acts as an intermediate between the high-level process layer (conceptual) and the low-level task layer (physical/executable). Each subprocess in the conceptual model is progressively refined into reusable ETL patterns, such as surrogate key pipelining, slowly changing dimensions, or change data capture. These patterns serve as templates that constrain and guide the creation of specific task implementations. The logical model in this approach is domain-specific and rule-driven, but lacks a formally defined metamodel. Despite its value, the pattern layer exhibits domain specificity: patterns suitable for traditional ETL contexts may not transfer directly to other domains, such as machine learning pipelines, which naturally require different abstractions. Furthermore, our experience is that developers encounter a steep learning curve when attempting to identify or interpret these patterns, particularly when lacking prior exposure to design pattern thinking. This indicates a need for better onboarding or supportive tooling to ease adoption.
Ref. [
22] enriches the first layer, turning this logical layer into a more general-purpose abstraction, one that remains detached from the technical constraints of specific ETL platforms. The modeling is process-driven and draws from classical BPMN best practices (e.g., gateways, events, swimlanes), but adapts them to ETL semantics using structured annotations (like in [
15]) and using data artifacts to describe the data flow in each task. The pattern concept is also used, but more informally, since common ETL procedures are used to identify common tasks. It offers more flexibility and acts as an incremental approach to the conceptual model.
The approach presented in this paper supports the development of conceptual models using a simplified BPMN palette. For each activity, whether a subprocess or an elementary task (DataSource, DataSink, DataTransformation), the logical layer can be specified. It introduces the notion of blueprints, which function as reusable templates for recurring operations. These blueprints define expected behavior, inputs, outputs, and constraints, but defer implementation details. Each blueprint is described in a lightweight DSL (in this case JSON), allowing consistent interpretation and validation while remaining readable and extensible.
Designers can bind these blueprints to BPMN tasks or subprocesses when greater specificity is needed, or define new blueprints to fit domain-specific patterns. For example, in the context of machine learning pipelines, new task types and data flow semantics can be modeled without altering the core metamodel. This provides flexibility across domains while maintaining structural and semantic integrity. However, blueprint usage is not mandatory—if the user only needs to describe the conceptual flow, the modeling process remains simple.
Figure 8 presents a class diagram modeling the core components. At the conceptual level, BPMNElement serves as the abstract base for the task and subprocess classes, with three specialized task types shown: DataSource (entry points), DataSink (termination points), and DataTransformation (processing units). These correspond to the fundamental building blocks of data pipelines while remaining platform-neutral.
The logical layer centers on the blueprint interface, which enforces structural contracts (inputs, outputs, validation) for reusable workflow components. Slowly changing dimension (an ETL pattern) and hyper-parameter tuning (an ML optimization process) are just two examples of domain-specific blueprints. Specialized tasks bind to exactly one blueprint, and subprocesses may compose multiple blueprints.
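A minimal rendering of this blueprint contract, assuming attribute and method names derived from the description above rather than the framework’s actual API, might look as follows.

```python
# Sketch of the logical-layer blueprint contract: typed ports plus a validation hook.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class DataPort:
    name: str
    port_type: str      # "data" or "control"
    contract: str       # reference to an abstract schema, e.g., "SchemaReference/..."

class Blueprint(ABC):
    """Structural contract every reusable workflow component must satisfy."""
    id: str
    semantic_role: str  # "source" | "sink" | "transformation"
    inputs: list[DataPort]
    outputs: list[DataPort]

    @abstractmethod
    def validate(self) -> list[str]:
        """Return violations of the blueprint's cardinality rules and invariants."""
```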
We formalize blueprint specification through a constrained grammar that enforces topological invariants across all elementary tasks. This grammar is instantiated in JSON schemas that serve both human designers and automation tools. The blueprint syntax follows the rules described in
Listing 1. The grammar defines a blueprint as a composition of metadata (an id following a Namespace.Concept.Version pattern and a semanticRole, classifying it as a source, sink, or transformation), a LogicalInterface (declaring typed input/output DataPorts, where each port has a name, a type—either data or control—and a contract referencing an abstract schema), and SemanticConstraints (enforcing connection rules via cardinality bounds like minIn/maxOut and logical invariants expressed as predicates). This structure ensures implementation-agnostic specification of data operations:
Sources (semanticRole: source) have 0 inputs and 1+ outputs;
Sinks (sink) have 1+ inputs and 0 outputs;
Transformations define N→M mappings with explicit derivation logic, all while maintaining tool-independent contracts (e.g., “contract”: “SchemaReference”).
The grammar enforces logical correctness (e.g., single-output via maxOut: 1) without prescribing physical protocols. For the DataSource atomic task, which is itself an (embedded) blueprint, the JSON structure defines a logical DataSource semantics that abstractly represents a data origin point in a pipeline. The blueprint declares itself as a generic source (“logical.source.generic.v1”) with the semantic role of “source”. Critically, it specifies no inputs (an empty array) and exactly one output, named “data_out” of type “data”, enforcing the single-output principle through cardinality constraints (minOut:1, maxOut:1). The output’s data characteristics are defined abstractly via a schema reference (“SchemaReference/ExternalDataSet”).
The “contract”: “SchemaReference/ExternalDataSet” is a logical data contract that abstractly defines the expected structure and semantics of data (via a reference to an external schema, e.g., SchemaReference/PatientRecords) without specifying physical storage details (like file formats or database protocols). It enforces what the data should be (e.g., fields, types, constraints) rather than where/how it is stored, enabling implementation-agnostic pipeline design. For example, a DataSource blueprint with this contract could later bind to a PostgreSQL table, CSV file, or API endpoint, as long as the actual data conforms to the referenced schema.
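Assembling the elements just described, the generic DataSource blueprint could be written roughly as follows; the key names follow the grammar elements mentioned in the text, and the exact spelling in Listing 1 may differ.

```python
import json

# Approximate JSON form of the generic DataSource blueprint described above.
# Key names follow the grammar elements mentioned in the text (id, semanticRole,
# ports with name/type/contract, cardinality constraints); exact spelling may differ.
datasource_blueprint = {
    "id": "logical.source.generic.v1",
    "semanticRole": "source",
    "interface": {
        "inputs": [],                                  # a source declares no inputs
        "outputs": [{
            "name": "data_out",
            "type": "data",
            "contract": "SchemaReference/ExternalDataSet",
        }],
    },
    "constraints": {"minIn": 0, "maxIn": 0, "minOut": 1, "maxOut": 1},
}
print(json.dumps(datasource_blueprint, indent=2))
```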
For example,
Listing 2 presents the surrogate key generation blueprint.
The input contract representing the input fields metadata is exemplified in
Listing 3.
Following the same approach, the output metadata and specific constraints/post-conditions are represented in
Listing 4.
Each contract is described using a simple JSON schema format, listing fields, types, and field-level constraints (e.g., uniqueness, positivity). The logical invariants—such as ensuring a one-to-one mapping or that surrogate keys are positive integers—are defined independently, referencing fields from the input and output contracts by name. This separation improves modularity and reusability while enabling validation at design time. A dedicated validation routine can then parse both contracts and the invariant specification, check field presence and type compatibility, and simulate invariant rules using standard Python (3.x version) logic. This structured form allows tools to automatically validate assumptions, detect modeling inconsistencies early, and support assisted generation and correction through AI-based agents or rule engines.
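A condensed sketch of such a validation routine is given below; the contract layout (a list of fields with names and types) and the invariant encoding are assumptions consistent with the description above, not the exact schema used by the tool.

```python
# Sketch of a design-time validator: checks that fields referenced by invariants exist
# in the input/output contracts and that declared types are compatible.
# The contract layout and invariant encoding below are illustrative assumptions.

def field_index(contract: dict) -> dict:
    return {f["name"]: f for f in contract.get("fields", [])}

def validate_blueprint(input_contract: dict, output_contract: dict,
                       invariants: list) -> list:
    """Return human-readable problems; an empty list means the blueprint checks out."""
    problems = []
    ins, outs = field_index(input_contract), field_index(output_contract)
    for inv in invariants:
        for ref in inv.get("input_fields", []):
            if ref not in ins:
                problems.append(f"invariant '{inv['name']}': unknown input field '{ref}'")
        for ref in inv.get("output_fields", []):
            if ref not in outs:
                problems.append(f"invariant '{inv['name']}': unknown output field '{ref}'")
    # Example type rule: a surrogate key must be declared as an integer to be a positive int.
    sk = outs.get("customer_sk")
    if sk and sk.get("type") != "integer":
        problems.append("customer_sk must be declared as integer")
    return problems

invariants = [{"name": "one_to_one_mapping",
               "input_fields": ["customer_id"], "output_fields": ["customer_sk"]}]
in_c = {"fields": [{"name": "customer_id", "type": "string"}]}
out_c = {"fields": [{"name": "customer_sk", "type": "integer",
                     "constraints": ["unique", "positive"]}]}
print(validate_blueprint(in_c, out_c, invariants))   # -> []
```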
3.4. Physical Implementation
The process of transforming a logical blueprint into an executable ETL pipeline involves mapping abstract data transformation patterns into concrete tool-specific workflows while maintaining fidelity to the original design semantics. For example, the implementation of a surrogate key pipeline—defined at the logical level through a blueprint—can be realized in technologies such as SQL Server Integration Services (SSIS) or Azure Data Factory (ADF).
Considering SSIS, the blueprint specifies the data inputs (e.g., business keys such as customer_id), the expected outputs (e.g., a schema containing customer_sk and effective_date), the transformation rules (e.g., a one-to-one mapping with a surrogate key generated by a SEQUENCE strategy), and the associated parameters. In SSIS, an on-premises ETL tool, this logical design is operationalized using components in a data flow pipeline. A source component extracts data using queries (e.g., SELECT * FROM stage.customers) that conform to the input contract. Transformations, such as surrogate key generation, are then implemented via native tasks (e.g., RowNumber, ScriptComponent, or DerivedColumn). The processed data is loaded into the destination schema (e.g., dim_customer), ensuring that the mappings are consistent with the output contract defined in the blueprint.
This mapping can be automated using templating languages like Business Intelligence Markup Language (BIML), which allows for the generation of SSIS packages based directly on blueprint metadata. Importantly, the approach is flexible: the designer retains control over the specific implementation logic and can choose among multiple alternatives (e.g., using ROW_NUMBER() in SQL vs. a script transformation), allowing adaptation to project standards or performance constraints. Furthermore, the process can be supported by AI agents that assist in selecting the appropriate mapping strategy, auto-generating ETL templates, validating schema alignment, and even adapting implementations to different execution environments—all while ensuring compliance with the original blueprint semantics.
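As a simple illustration of this designer-controlled choice, the sketch below shows how a translator might select among alternative surrogate key implementations based on the blueprint's strategy parameter; the strategy names and the emitted component/SQL fragments are hypothetical examples, not part of the blueprint grammar.
```python
# Hypothetical sketch: selecting a surrogate key implementation from the blueprint's
# "strategy" parameter. Strategy names and emitted fragments are illustrative only.
def surrogate_key_implementation(strategy: str, key_column: str = "customer_sk") -> dict:
    """Describe the SSIS component (or SQL fragment) a translator could emit for a strategy."""
    if strategy == "SEQUENCE":
        return {"component": "ScriptComponent",
                "logic": f"assign {key_column} from a database SEQUENCE object"}
    if strategy == "ROW_NUMBER":
        return {"component": "OleDbSource",
                "logic": f"SELECT customer_id, ROW_NUMBER() OVER (ORDER BY customer_id) "
                         f"AS {key_column} FROM stage.customers"}
    if strategy == "IDENTITY":
        return {"component": "OleDbDestination",
                "logic": f"let the target table's IDENTITY column populate {key_column}"}
    raise ValueError(f"Unsupported surrogate key strategy: {strategy}")

print(surrogate_key_implementation("ROW_NUMBER"))
```
Treating this choice as data rather than hard-coded logic is what allows project standards or performance constraints to drive the generated implementation.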
By beginning with a well-defined and technology-neutral blueprint, teams can automate the generation of ETL pipelines in SSIS, ADF, or other environments. This approach not only ensures consistency and reduces manual work but also maintains the flexibility to evolve with changing business or technological requirements. Whether the deployment is on-premises or in the cloud, the logical blueprint serves as the single source of truth that bridges the gap between design and execution.
An excerpt of the translation process for SSIS is presented in Listing 5, representing an SSIS data flow that implements a surrogate key generation pipeline based on a formal blueprint specification. It begins by extracting customer data from a staging table using an OleDbSource, retrieving only the customer_id field as defined in the input schema. A script component is introduced to programmatically assign surrogate keys using a custom sequence or identity logic, aligning with the strategy parameter defined in the blueprint. A DerivedColumns transformation adds metadata, such as the current date as the effective_date, further enriching the data. Finally, the transformed data is sent to a dimension table via an OleDbDestination, with explicit column mappings ensuring consistency with the output schema. A high-level pseudo-code for translating JSON blueprints to BIML is presented in Listing 6.
The algorithm outlines a structured, multi-step process for converting a technology-neutral ETL blueprint—defined in JSON format—into an executable pipeline in a target environment such as SSIS (via BIML). It begins by parsing and validating the blueprint to ensure semantic correctness and completeness. It then resolves all referenced input and output schema contracts, capturing metadata such as field types and constraints. A scaffold of the ETL pipeline is initialized, and components such as sources, transformations, and destinations are mapped directly from the blueprint’s logical structure. Parameters defined in the blueprint are applied to fine-tune the behavior of the components, while formal invariants are translated into validation logic to ensure data integrity. Finally, the pipeline is rendered into the syntax and structure of the chosen ETL platform.
In this process, the high-level workflow logic is supplied by a .bpmn file that defines the orchestration of tasks using the standardized BPMN 2.0 vocabulary. While visual elements (e.g., node positions, diagram layout) are disregarded, the semantic content—comprising task, sequenceFlow, exclusiveGateway, parallelGateway, startEvent, endEvent, and related elements—is parsed and interpreted to construct the execution flow of the pipeline. This allows the algorithm to preserve control-flow constructs such as branching, merging, loops, and parallel execution paths. The integration of BPMN semantics into the blueprint translation ensures that the resulting ETL pipeline not only performs the correct data transformations but also adheres to the intended process orchestration, enabling faithful round-trip engineering between conceptual models and deployable implementations. This entire translation is managed by a dedicated middleware component, which is an integral part of the proposed framework. This component interprets the high-level orchestration logic from the conceptual BPMN models and the detailed operational specifications from the logical blueprints to generate the final executable artifacts.
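To make the BPMN handling step concrete, the following minimal sketch shows how the semantic elements of a .bpmn file could be parsed while diagram layout information is ignored. The sample process and task names are illustrative and are not taken from the case study.
```python
# Minimal sketch: extract only the semantic BPMN content (tasks, gateways, events,
# sequence flows) from a .bpmn file; diagram layout (bpmndi:*) is simply not selected.
import xml.etree.ElementTree as ET

BPMN_MODEL_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
SEMANTIC_TAGS = {"task", "sequenceFlow", "exclusiveGateway", "parallelGateway",
                 "startEvent", "endEvent"}

def extract_execution_flow(bpmn_xml: str) -> list:
    """Return (element type, attributes) pairs for the semantic elements of a .bpmn file."""
    root = ET.fromstring(bpmn_xml)
    flow = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]   # strip the XML namespace
        if local_name in SEMANTIC_TAGS:
            flow.append((local_name, dict(element.attrib)))
    return flow

# Illustrative sample process (not the case-study model).
sample = f"""
<definitions xmlns="{BPMN_MODEL_NS}">
  <process id="surrogate_key_pipeline">
    <startEvent id="start"/>
    <task id="load_customers" name="DataSource"/>
    <task id="assign_sk" name="DataTransformation"/>
    <task id="write_dim" name="DataSink"/>
    <endEvent id="end"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="load_customers"/>
    <sequenceFlow id="f2" sourceRef="load_customers" targetRef="assign_sk"/>
    <sequenceFlow id="f3" sourceRef="assign_sk" targetRef="write_dim"/>
    <sequenceFlow id="f4" sourceRef="write_dim" targetRef="end"/>
  </process>
</definitions>
"""

for kind, attrs in extract_execution_flow(sample):
    print(kind, attrs.get("id"), attrs.get("name", ""))
```
In the actual middleware, the extracted elements would then drive the ordering, branching, and parallelism of the generated pipeline components.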
4. Conclusions and Future Work
Data workflows have become the backbone of modern information systems, supporting applications from traditional batch ETL pipelines that populate data warehouses to real-time feature engineering in machine learning platforms and generalized data integration across microservices, APIs, and event streams. Their proper design and governance are essential to ensure data consistency, traceability, and compliance. Unlike other areas of software engineering, data workflow development has lacked a unified, discipline-driven methodology. In this paper, we have identified this gap and proposed a structured framework to bring consistency, reuse, and automation to the design, implementation, and evolution of data workflows.
BPMN has been used to document data flows and ETL processes. A key benefit of the proposed methodology is its foundation on BPMN, a well-established and broadly adopted modeling standard. This choice minimizes the cognitive and operational overhead associated with introducing completely new modeling languages. Instead of requiring teams to learn unfamiliar semantics and specialized tooling from scratch, our approach allows practitioners to capitalize on existing knowledge, tools, and best practices from the BPMN ecosystem. By specializing and constraining BPMN for the data workflow domain—rather than inventing a new language—we balance expressiveness with usability, making the approach more accessible, sustainable, and easier to adopt in real-world environments. However, BPMN alone does not resolve every difficulty: business analysts, data engineers, and operations specialists may each view the same diagram through a different lens, resulting in misalignment, undocumented handoffs, and fragile pipelines. Moreover, BPMN’s native constructs were not designed specifically for data transformation semantics, making it hard to represent key ETL concepts—such as data lineage, schema contracts, and deterministic transformations—in a consistent, tool-agnostic way. To overcome these challenges, our methodology emphasizes a conceptual modeling phase that leverages a deliberately minimal subset of BPMN: three elementary task types (DataSource, DataTransformation, DataSink), clear input/output flows, and explicit labeling of data objects. This layer focuses purely on what needs to happen, without prescribing how or where. Its goal is to capture the essential business intent, dependencies, and high-level triggers (e.g., daily, event-driven) in a lightweight diagram that serves as the “single source of truth” for all activities.
The second, logical layer introduces the concept of blueprints—formal, reusable templates that behave like interfaces in object-oriented programming. Each blueprint defines a pattern’s input and output schema contracts, transformation parameters (e.g., surrogate key strategies, filter expressions), and invariants (e.g., 1:1 mappings, value constraints). Blueprints are specified in a structured DSL (JSON or YAML) and managed in a registry so that teams can extend them or define new domain-specific patterns (e.g., streaming CDC, machine learning feature extractors) without altering the core metamodel. This layer preserves conceptual simplicity while adding semantic rigor.
The third, physical layer operationalizes blueprints by generating executable artifacts; as an example, we chose to generate BIML packages for SSIS. The transformation logic is mapped to concrete components (source connectors, script tasks, derived columns, destinations), and column mappings enforce the schema contracts defined upstream. Crucially, this layer remains decoupled so that physical pipelines can be reverse-engineered back into logical blueprints and conceptual BPMN diagrams, enabling continuous synchronization between design and execution.
The methodology is centered on flexibility. The blueprints themselves are highly customizable: if designers require more detail than the simple conceptual models provide, they may define their own blueprints to capture domain- or project-specific requirements. Looking ahead, we envision a shared blueprint repository where multiple designers contribute and curate pattern libraries—extending the ecosystem of reusable building blocks and accelerating automation across teams. The same applies to the translation logic that generates physical artifacts, here exemplified by BIML translators, which can be modularized and shared. Designers often have distinct performance considerations, organizational standards, or platform constraints that lead them to prefer one implementation strategy over another. By treating translators as pluggable components, we enable each team to adopt or contribute its own code generators without imposing a one-size-fits-all solution. In this way, both blueprints and translators remain open, extensible, and community-driven—avoiding the rigidity of a closed system and instead fostering continuous innovation, reuse, and adaptation.
Looking forward, several avenues promise to enrich and extend this methodology:
AI and LLM integration for blueprint discovery: Use language models to infer likely blueprint patterns from natural-language process descriptions or existing pipeline metadata.
Parameter auto-completion: Auto-populate blueprint parameters (e.g., default filter expressions, common connection strings) based on historical usage.
Invariant validation: Leverage symbolic reasoning or probabilistic inference to suggest missing invariants or detect potential data anomalies before execution.
Interactive design agents: Chat-based or voice-driven assistants that allow non-technical stakeholders to sketch workflows in plain language, with the system proposing corresponding BPMN diagrams and blueprint bindings.
Multi-backend support: Expand the physical layer to cover additional execution engines and facilitate seamless migration or hybrid deployments across on-premises and cloud environments.
Open-source blueprint repository: Foster a community-driven ecosystem of domain-specific blueprint libraries—covering marketing analytics, IoT ingestion, and machine learning pipelines—encouraging sharing, best practices, and continuous improvement.