Data Usage and Access Control in Industrial Data Spaces: Implementation Using FIWARE

Abstract: In recent years, a new business paradigm has emerged which revolves around effectively extracting value from data. In this scope, providing a secure ecosystem for data sharing that ensures data governance and traceability is of paramount importance as it holds the potential to create new applications and services. Protecting data goes beyond restricting who can access what resource (covered by identity and Access Control): it becomes necessary to control how data are treated once accessed, which is known as data Usage Control. Data Usage Control provides a common and trusted security framework to guarantee compliance with data governance rules and responsible use of organizations’ data by third-party entities, easing and ensuring secure data sharing in ecosystems such as Smart Cities and Industry 4.0. In this article, we present an implementation of a previously published architecture for enabling access and Usage Control in data-sharing ecosystems among multiple organizations using the FIWARE European open source platform. Additionally, we validate this implementation through a real use case in the food industry. We conclude that the proposed model, implemented using FIWARE components, provides a flexible and powerful architecture to manage Usage Control in data-sharing ecosystems.


Introduction
Extracting value from data is one of the key aspects leading the development of applications and services, especially for Industry 4.0 [1]. Consequently, ensuring data governance and traceability becomes imperative to promote the exchange of data in this new business paradigm. Another crucial aspect of the data economy is interoperability [2], since sharing data between different stakeholders brings many new opportunities for all parties involved. In this scope, the use of trusted and secure platforms for sharing and processing personal and industrial data is essential for the creation of a data market and a data economy.
There is no doubt that big data analysis has played an important role in the fourth industrial revolution, as it helps companies determine low-cost strategies to become more competitive, increase their revenue, and optimize their processes [3]. Furthermore, IoT has become one of the core elements of Industry 4.0 [4] as well, since it helps factories speed up product development, achieve more flexible production, and set up more complex environments. Currently, smart factories are growing in number, and they are capable of manufacturing intelligent and customized products in a short period of time while considering customers' preferences in real time [5]. In this regard, the streaming data gathered by IoT devices previously described in [15]. This article extends said work by providing an implementation of the proposed architecture and a validation with an original case study in the food industry.
The proposed architecture fulfills the main requirements [16] of Internet of Things (IoT) applications developed in shared data scenarios. It guarantees reliability and interoperability through the trusted and standardized environment established thanks to the IDS framework. Likewise, IoT applications are usually very demanding in terms of scalability. Thus, the data Usage Control architecture needs to meet this requirement. The extended XACML-based architecture separates each component based on its role, so each of these components can be scaled as needed. In terms of dynamism, the proposed architecture can easily apply predefined access and usage policies to the new dynamically added nodes, as it stores and manages those policies in a centralized way.
In this scenario, the FIWARE platform (FIWARE: The open source platform for our smart digital future, https://www.fiware.org/) seems particularly suitable for implementing a Usage Control architecture. FIWARE is an open initiative whose mission is to ease the development of new Smart Applications in multiple sectors by providing a set of components, known as Generic Enablers (GEs), that enable the connection among IoT devices and Context Information Management and other services such as security or big data analysis. In a previous work [17], we proposed an implementation of the IDS architecture using FIWARE components focused on data brokering, identity and trust management, and the development of IDS Connectors. In the present work, we extend that implementation by adding the components that are needed to achieve Usage Control by data providers based on the architecture we propose. The newly added components introduced in this work are the Policy Translation Point (PTP), the Usage Policy Decision Point (uPDP) and the Policy Execution Point (PXP). A thorough explanation of each of these components is presented in Section 4.
The article is structured as follows. Section 2 reviews the most relevant related work on data Usage Control. Section 3 presents a brief description of the architecture that we took as a reference for this implementation including some additional considerations for this proposal. In Section 4, an implementation of the proposed solution using FIWARE components is described. Section 5 presents a validation of said implementation by means of a use case in the food industry, which includes the results obtained from measuring the enforcement time when applying Usage Control policies. Lastly, Section 6 finishes with the conclusions of the article and an outlook on future work.

Related Work
According to [18][19][20], the adoption of big data and industrial IoT (IIoT) is growing rapidly in Industry 4.0. Previous works have provided industrial solutions that make use of these technologies, such as recommendation and prediction systems, process optimization techniques, intelligent manufacturing, and AI applications [21,22]. In this regard, many open source big data tools have been developed to enable companies to process the vast amount of data generated by the IIoT. Several proposals present a general view of the big data architecture needed to manage industrial scenarios in which multiple sources of data are present [23,24]. The authors of [25] present a five-layer architecture for big data processing and analytics (BDPA): collection, storage, processing, analytics, and application. Since each layer involves different tasks, many open source tools have been developed to help overcome some of them. Some examples include:
As stated in [6], big data processing in industry differs in several aspects from other scenarios. One of the major concerns in this sort of environment is the way in which data are collected, considering that data can derive from multiple sources. This issue has become more crucial in data ecosystems shared among multiple organizations, since each organization can process and store data in different ways and has different access and usage rights over data. Thus, we identify the need for a common standard, compatible with the existing big data tools and IoT devices, that provides a standardized way to gather, extract, process and store data in industrial contexts. In this regard, NGSI-LD provides a simple and powerful open API, published as an ETSI specification, for the management of context information [26]. This specification promotes the adoption of a standard way to manage data across the whole industrial data processing pipeline.
However, the use of a common standard makes it imperative to use big data tools that can work with data in this format. To address this issue, we include the FIWARE GEs in the industry's BDPA pipeline since they comprise a set of libraries, connectors, and protocols on top of the most widely used big data frameworks provided by the Apache Community, providing these components with full support for the NGSI-LD standard. With the adoption of this approach, any organization can not only perform big data operations more easily using a common standard, but also mitigate additional needs such as enforcing data Usage Control and securing data exchange.
As far as data Usage Control is concerned, most of the proposals in the literature take the UCON model [8] as a starting point for the development of their solutions. However, depending on the field of application, these works take different approaches. For instance, Russello and Dulay presented xDUCON [27], a cross-domain Usage Control proposal for coordinating and enforcing Usage Control policies across different collaborating organizations. In said framework, a cross-domain data space instance is shared among the organizations to be used as a local enforcement point of the control policies. As a result, the coordination of the enforcement policies is easier to specify since it is not necessary to include details of the receiving organization structure.
A later paper was presented by the same authors [28]. As a complement to the xDUCON framework, they defined cross-domain policies, which are capable of dealing with the mutability issues of the UCON model and providing a fine-grained decision mechanism that can be captured by the defined policies. The xDUCON framework provides a general perspective on policy enforcement and specification capabilities. Furthermore, Di Cerbo et al. [29] present a solution for avoiding security risks that allows data owners to keep the data under their control while enabling secure data sharing across cloud and mobile engines. This is achieved by basing the enforcement mechanism and rule definition on Policy Specification Languages (PSLs) such as XACML and an extended version of PPL (PrimeLife Policy Language) [30]. However, these works fall short of providing an architecture that not only covers the enforcement and definition mechanisms for access and Usage Control but also provides a full description of the whole process that data Usage Control involves.
Likewise, Lazouski et al. use the principles presented in [31] to provide a Usage Control solution mainly focused on cloud system applications. The conducted research is based on the UCON model and the OASIS XACML standard to regulate the usage of cloud resources [32]. This proposal was validated by implementing the authorization system integrated with OpenNebula (OpenNebula: https://opennebula.org). More comprehensive research was presented by Wu et al. [33], in which data Usage Control is enforced in industrial Wireless Sensor Networks (WSN). Not only do they provide cross-domain fine-grained Access Control, but they also use fuzzy clustering to analyze industrial sensing data. This work uses a set of simulations to verify the suitability of the overhead time and the effectiveness of the proposed model. A comparable proposal was presented by Marra et al. [34]. They used the core concepts of the UCON model and the XACML reference architecture to implement a Java application that provides Usage Control operations over IoT devices. They developed a case study in which they evaluated the performance of the IoT devices and determined the feasibility of the system by implementing their proposal on real devices.
Recent works provide some interesting solutions for data Usage Control based on the XACML reference architecture. For instance, Barsocchi et al. use GLIMPSE [35], a flexible monitoring infrastructure, to perform Usage Control on operations over sensors in a smart home. They demonstrate the feasibility of carrying out Usage Control in this type of environment. In addition, in said study, the authors provide a low-cost, easy-to-install, user-friendly, dynamic and flexible infrastructure capable of performing run-time resource management using control rules [36]. Similarly, Gkioulos et al. present a model that can integrate access and Usage Control mechanisms to deal with the distributed nature and heterogeneity of systems like IoT and online banking. At the same time, they bring several improvements regarding resilience against active attacks, policy writing simplification, run-time efficiency and scalability [37]. Another proposal by Martinelli et al. presents a framework for applying QoS in a network in a Smart Building environment. They combine UCON and SDN (Software Defined Networks) in order to enforce a set of management, security, and safety policies aimed at ensuring the appropriate QoS for the provided services according to both the tenants' Service Level Agreements (SLAs) and the current context [38]. Lastly, an approach presented by Milan Petković et al. in [39] uses the business model of an organization to detect privacy infringements and to verify that data have been processed only for the intended purpose. A strength of this work is its formal specification using the Calculus of Orchestration of Web Services (COWS). These proposals show the growing interest of the scientific and industrial community in exploiting all the capabilities of Usage Control, and also demonstrate its application in specific scenarios (IoT, Cloud, etc.).
Nevertheless, these proposals do not cover topics such as cross-domain data exchange, data governance and trust environments, which must be dealt with before this type of architecture can be used in industrial data-sharing ecosystems. Moreover, in Industry 4.0, data sharing between multiple organizations is a key factor to be considered. Thus, guaranteeing compliance with data governance rules and the responsible use of organizations' data by third-party entities is one of the requirements that need to be addressed. We therefore identify the need for a more flexible framework that is capable of adapting to these mixed data ecosystems.
Moreover, the Data Privacy Directive 95/46/EC [40], since replaced by the GDPR, played an important role in data protection. In this regard, many Access Control solutions have been presented for protecting personal data. For instance, Bartolini et al. proposed a systematic approach for authoring Access Control policies that are aligned with GDPR provisions. They present a methodology for generating templates from the GDPR text and identifying whether a GDPR article can be defined as an Access Control policy. This is achieved by matching actual attributes gathered from the legal use cases and translating the resulting policies into a given formalism or language [41] in order to comply with GDPR's principle of "data protection by design and by default" [42]. Along the same lines, Calabró et al. conducted a preliminary study for integrating Access Control and business processes for GDPR compliance. The main goal of said study was to extend the currently adopted Access Control mechanisms to enforce GDPR compliance during business activities of data management and analysis [43]. Nevertheless, although these works provide a first step towards a formal definition of an Access Control solution based on GDPR, they do not cover the Usage Control of the data once access is granted, leaving open an issue that needs to be addressed. Another interesting proposal was presented by Arfelt et al., who identify formalizable GDPR articles and, by using Metric First-order Temporal Logic (MFOTL), formalize and monitor the articles in which controllers, processors, or data subjects are required to take specific and observable actions [44]. Moreover, the policies generated with the previous process are deployed over MONPOLY, a monitoring tool for compliance checking [45]. Although previous works solve the problems of GDPR formalization and monitoring, these proposals do not consider other factors, such as data governance and trust, that affect data sharing in cross-domain industries.
Prior works have sought Access Control solutions based on Blockchain for distributed environments [46][47][48]. In particular, the authors of [46] describe how contracts can be deployed in the ledger to perform Usage Control following GDPR provisions and avoiding central entities in the authorization and authentication processes. Overall, the cited studies outline that relying on Blockchain provides transparency and trust solutions for Access Control but it may impose scalability limitations for real-time scenarios.
As can be seen, much of the research up to now has been descriptive in nature. Table 1 presents a comparison of the Usage Control architectures found in the literature. Most solutions rely only on ABAC (Attribute Based Access Control) for Access Control and just a few proposals supplement it with RBAC (Role Based Access Control) or IBAC (Identity Based Access Control), which play a crucial role in data-sharing scenarios in which it is necessary to provide particular access and usage permissions to different stakeholders. In addition, the existing Usage Control solutions focus on policy infringement detection. Only about half of the works studied provide remediation capabilities for policy violation, rather than just detection. Remediation actions are essential to automate the enforcement of consequences for policy noncompliance rather than issuing warnings or notifications and leaving it up to the wrongdoer to redress the situation. In order to unambiguously define obligations, prohibitions and permissions, and the consequences for noncompliance, it is convenient to use a Policy Specification Language (PSL). Most works found in the literature rely on XACML or U-XACML as their PSL, although some of them do not use a specific language or define one of their own. Another important aspect is the support for multi-actor architectures. About half of the works studied provide support for several actors involved in the data-sharing process, whereas the rest of them focus on one-on-one exchanges. The former is a key requirement for scenarios in which multiple stakeholders share data with one another. Furthermore, it is worth mentioning that most of the studies analyzed in this section fail to provide a data Usage Control solution that is independent of the context in which data are generated (WSN, IoT, SDNs, etc.), with only a couple of them being domain-agnostic.
Lastly, far too little attention has been paid to the topic of trust environments as a pillar for providing capabilities for secure and trusted data exchange and sharing between multiple organizations. In light of this information, it becomes apparent that previous works have failed to provide a standard solution for achieving data Usage Control in data-sharing ecosystems. None of the works reviewed provides advanced Access Control and Usage Control capabilities in architectures with several agents involved, while supporting an expressive policy definition language that allows the definition of obligations, prohibitions and permissions and providing remediation functionalities in the event of policy infringement. This research contributes to filling the gap in the existing literature by providing a generic multi-actor architecture to achieve advanced data access and usage control capabilities in real-time data-sharing scenarios. Our proposal incorporates the core concepts from the UCON model, the key aspects of the IDS Reference Architecture Model and the extended XACML Reference Architecture, and relies on ODRL as its PSL. In addition, we provide an implementation using one of the core CEF (Connecting Europe Facility) building blocks and FIWARE GEs to provide a reference framework fully adapted to the requirements of Industry 4.0.

Proposed Solution
This section presents an overview of the design principles that we have considered for designing this proposal. We also summarize the resulting architecture as well as a workflow that illustrates the interaction between the described components. Details about both the principles and the architecture can be found in our previous work [15].

Design Principles
In recent years, industries have experienced an increased need to exchange data with one another. Thus, data protection has become a priority. In view of this necessity, the IDS, which is in close contact with the industry, has identified the main requirements that need to be fulfilled in order to address data sharing when multiple organizations are involved. We have taken the IDS guidelines included in [10] as a starting point and added further considerations from the literature [17] to derive the set of design principles on which we have based our proposed architecture for providing access and Usage Control in industrial contexts:

•
Trust. IDS Connectors provide a trusted environment that enables the achievement of data Usage Control [10].

•
Interoperability. Standardization of protocols is crucial to ensure the understanding between all the components involved in the architecture, for managing both Usage Control and identity.

•
Governance. Emerging data-centered businesses need to define data governance programs to exploit data in a cost-effective manner [49]. Data sharing should comply with the data governance rules defined by each of the organizations involved. In this scope, providing ways to respect and protect the data of all the parties involved is one of the main requirements that data Usage Control must fulfill. Thus, data providers must have access to monitoring and configuration tools that allow them to control what becomes of their data. Nevertheless, as pointed out by [50], in collaborative systems, resources can be administered by multiple data owners. Due to this fact, the aspects of data governance model, policy composition and conflict resolution need to be addressed. In this context, the concept of "data governance model" defines the authority that entities have over a resource; "policy composition" describes how the authorization requirements authored by multiple entities are combined or reconciled to regulate the access to a resource, and "conflict resolution" indicates the method used to resolve policy conflicts in order to obtain a conclusive decision [50]. In this regard, the preliminary version of this proposed architecture takes the work presented by Mahmudlu et al. as a reference, in which they define multiple ownership, authoritative and predefined mechanisms for addressing the main aspects of governance model, policy combination and conflict resolution respectively [51].

•
Performance. Compliance with data Usage Control policies can only be ensured if the reaction to policy violations is quick and efficient.

•
Flexibility. As many data-sharing scenarios and use cases are contemplated, the solution must be adaptable to the specific requirements of such scenarios.

Agents Involved
According to the International Data Spaces Association and the IDS Reference Architecture presented by Fraunhofer [10], four agents have to be provided in every system that considers data sharing: Data Owner (DO), Data Provider (DP), Data Consumer (DC) and Data User (DU). However, it is very common to assume a two-actor model in which the DP and the DO play the same role, as do the DC and the DU. Under this assumption, the DP is the organization or user who owns the data and decides which data are available for sharing. Additionally, the DP defines the usage and Access Control policies applicable to the data that the DC can consume. The DC, in turn, represents every entity that has the legal right to use the data provided by a DP according to the previously defined Usage Control policies.
Besides these two actors, the GDPR defines an additional component named Data Controller (DCr) [52]. According to the DCr definition, a third actor needs to be included to cover both the IDS reference architecture and the one proposed by the GDPR. In this case, the DCr and the DP are responsible for guaranteeing the protection of data owners' rights and for providing access to data, respectively. Nevertheless, we consider that a GDPR-compliant model needs to contemplate some other factors, as stated in [43], that are out of the scope of this paper. For instance, one of the additional factors that should be taken into account is the ability to write GDPR articles formally as algebraic expressions to transform legal concepts into rules. Also, it is necessary to provide a formal extension of a PSL to explicitly manage the GDPR principles of consent and purpose limitation. Lastly, it would be useful to include tools for authoring and enforcing GDPR-based policies. Thus, we concentrate our efforts on presenting a preliminary version of data Usage Control with a two-actor model (DP-DC), mainly focused on data-sharing ecosystems in Industry 4.0. However, below we present an alternative architecture that includes the three-actor model.
Finally, we also consider the Identity Provider (IdP) as an actor presented in the IDS Reference Architecture: The IdP verifies the authenticity of all the actors involved in the architecture and also provides all the characteristics related to identity management. These include actor registration, authentication, password management and the option of grouping actors in organizations to manage them under identical conditions. In scenarios in which a single DP shares data with one or more DCs, the DP can integrate its own IdP, since the identities of other DPs do not need to be validated. Said configuration would lead to an even further simplification of the two-actor model.

Architecture and Workflow
Building on the design principles established and the simplified agent model identified for data processing scenarios (DC, DP and IdP), we have presented an architecture for providing data usage and Access Control in shared ecosystems. Figure 1 shows the proposed two-actor architecture, whereas Figure 2 presents the aforementioned alternative architecture including the DCr as a separate agent.
Figures 3 and 4 show the workflows of the two main data-sharing scenarios: (1) the DP defines the policies that apply to the shared data and (2) the DC performs operations on the received data.
The DP uses the Policy Administration Point (PAP) to define access and Usage Control policies. Usage policies are defined using an Open Digital Rights Language (ODRL) extension materialized by the W3C [53] and translated by the Policy Translation Point (PTP) to a program that runs on the usage Policy Decision Point (uPDP) and that is updated every time policies are modified by the DP. Once both access and usage policies are defined, the data in the Data Infrastructure (both real-time and stored) can be made available in the Shared Data Space (SDS).
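As a rough illustration of the policy definition and translation step described above, the following Python sketch shows what an ODRL-style usage policy could look like and how a translation step might flatten it into rules that a decision point can evaluate. It is a minimal sketch under stated assumptions: the ODRL extension used in this work is not reproduced here, and the identifiers (`urn:example:...`), the `remedy` action, the `UsageRule` class and the `translate` function are hypothetical names introduced only for this example. The actual PTP emits a program for the uPDP rather than Python objects.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ODRL-style usage policy. All identifiers and the remedy
# action are illustrative assumptions, not taken from the implementation.
USAGE_POLICY = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "urn:example:policy:1",
    "prohibition": [
        {
            "target": "urn:example:dataset:tickets",
            "assignee": "urn:example:consumer:dc-1",
            "action": "aggregate",
            # The action is prohibited once it has been executed more than 3 times.
            "constraint": [
                {"leftOperand": "count", "operator": "gt", "rightOperand": 3}
            ],
            "remedy": [{"action": "urn:example:signal:unsubscribe"}],
        }
    ],
}


@dataclass(frozen=True)
class UsageRule:
    """Flattened form of one prohibition constraint (hypothetical)."""
    assignee: str
    target: str
    action: str
    operator: str
    threshold: int
    control_signal: str


def translate(policy: dict) -> List[UsageRule]:
    """Turn each count-based prohibition constraint into one UsageRule."""
    rules = []
    for prohibition in policy.get("prohibition", []):
        signal = prohibition["remedy"][0]["action"]
        for constraint in prohibition.get("constraint", []):
            if constraint["leftOperand"] != "count":
                # This sketch only understands count-based constraints.
                raise ValueError("unsupported left operand: "
                                 + str(constraint["leftOperand"]))
            rules.append(UsageRule(
                assignee=prohibition["assignee"],
                target=prohibition["target"],
                action=prohibition["action"],
                operator=constraint["operator"],
                threshold=int(constraint["rightOperand"]),
                control_signal=signal,
            ))
    return rules


rules = translate(USAGE_POLICY)
```

Re-running `translate` whenever the DP edits the policy mirrors the idea that the program on the uPDP is regenerated every time policies are modified.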
When the DC requests access to data available in the SDS (to save them in its Storage System or to process them using a Processing Engine), the Policy Enforcement Point (PEP) checks with the access Policy Decision Point (aPDP) whether the DC has the necessary access permissions to make the subscription effective. If the result is positive, the DC starts receiving data from the SDS and processing them by performing the desired operations. As a result of such operations, traces are generated and sent to the PEP, which, after validating an authentication token previously generated by the IdP, forwards them to the uPDP. The uPDP checks the usage policies and, in case of noncompliance, delegates the responsibility of enforcing the established action for policy noncompliance to the Policy Execution Point (PXP) by sending the corresponding control signal.
One concern identified in this proposal is that the logs generated when performing operations on data could be easily manipulated, because log generation takes place outside the scope of the DP. Nevertheless, the reference architecture used in this work relies on the guidelines of the IDS Connectors, which determine that all the connectors involved in a data exchange must run a trusted (certified) software stack. In other words, IDS Connectors require a certification from the IDS Certification Body to establish trust among all participants.
Moreover, the IDS guidelines also establish that any communication between connectors from different organizations should be encrypted and integrity protected. Thus, by including the DP and DC inside IDS connectors, each DP is capable of ensuring that their data are handled by the Connector of the DC according to the usage policies specified, or else the data will not be sent [10].
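The workflow described in this section can be summarized with a small, self-contained Python sketch. It is only an in-memory approximation under stated assumptions: the class names, the token table that stands in for the IdP, the per-operation limits and the `unsubscribe` signal are all hypothetical and chosen for illustration, whereas the real components are separate services that communicate over the network.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class AccessPDP:
    """aPDP stand-in: (consumer, resource) pairs authorized by the DP."""
    permissions: Set[Tuple[str, str]]

    def is_allowed(self, consumer: str, resource: str) -> bool:
        return (consumer, resource) in self.permissions


@dataclass
class PXP:
    """PXP stand-in: records the control signals it would execute."""
    signals: List[str] = field(default_factory=list)

    def execute(self, consumer: str, signal: str) -> None:
        self.signals.append(signal + ":" + consumer)


@dataclass
class UsagePDP:
    """uPDP stand-in: counts operations per consumer against a limit."""
    max_ops: Dict[str, int]  # operation -> maximum allowed executions
    pxp: PXP
    counts: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def on_trace(self, consumer: str, operation: str) -> bool:
        key = (consumer, operation)
        self.counts[key] = self.counts.get(key, 0) + 1
        limit = self.max_ops.get(operation)
        if limit is not None and self.counts[key] > limit:
            # Noncompliance: delegate enforcement to the PXP.
            self.pxp.execute(consumer, "unsubscribe")
            return False
        return True


@dataclass
class PEP:
    """PEP stand-in: validates tokens, then consults the aPDP or uPDP."""
    valid_tokens: Dict[str, str]  # token -> consumer (replaces the IdP)
    apdp: AccessPDP
    updp: UsagePDP

    def subscribe(self, token: str, resource: str) -> bool:
        consumer = self.valid_tokens.get(token)
        return consumer is not None and self.apdp.is_allowed(consumer, resource)

    def forward_trace(self, token: str, operation: str) -> bool:
        consumer = self.valid_tokens.get(token)
        if consumer is None:
            return False  # traces without a valid token are dropped
        return self.updp.on_trace(consumer, operation)


pxp = PXP()
pep = PEP({"tok-1": "dc-1"},
          AccessPDP({("dc-1", "tickets")}),
          UsagePDP({"aggregate": 1}, pxp))
granted = pep.subscribe("tok-1", "tickets")        # access decision (aPDP)
first = pep.forward_trace("tok-1", "aggregate")    # within the usage limit
second = pep.forward_trace("tok-1", "aggregate")   # exceeds it, PXP is notified
```

Keeping the access decision (`subscribe`) and the usage decision (`forward_trace`) behind the same enforcement point reflects the separation of roles in the architecture: each stand-in could be replaced or scaled independently without changing the others.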

Implementation Using FIWARE
This section presents an implementation of the proposed architecture using the generic enablers (GEs) provided by FIWARE and other open source tools. Specifically, the GEs used in this implementation of the data usage architecture are the following:

•
Keyrock: The Keyrock GE (FIWARE Keyrock: https://fiware-idm.readthedocs.io) is responsible for Identity Management. Keyrock adds OAuth 2.0-based authentication and authorization security to services and applications, as described in [13,14]. In the context of this implementation, Keyrock plays the role of IdP, manages authorization policies (PAP) and decides which DCs can access which resources in the data infrastructure (aPDP). Therefore, DPs and DCs perform the authentication process relying on Keyrock. Guaranteeing the unequivocal identification of all the agents involved in the data usage architecture is mandatory to ensure a secure way of providing or consuming data. By using Keyrock, DPs can create authorization policies to constrain DCs' access to the data infrastructure.

•
Wilma: The Wilma GE implements PEP functions within an XACML-based Access Control schema [12]. In the scope of this implementation, two Wilma instances are needed. One Wilma instance is in charge of enforcing access policies over requests sent to the data infrastructure [17]. When a DC is authenticated through Keyrock, an OAuth 2.0 token is generated, which must be included in every request sent to the DP's data infrastructure. Wilma intercepts requests and asks Keyrock to validate the token, verifying the DC's identity. Since Keyrock also acts as the aPDP, it checks the DC's access authorization policies. If the DC's request complies with the established policies, Wilma grants access to the requested resource. With regard to data Usage Control, a second Wilma instance is needed as a PEP proxy to authenticate the traces sent from the DC's processing engine to the uPDP.

•
AuthZForce: The AuthZForce GE (FIWARE AuthZForce: https://authzforce-ce-fiware.readthedocs.io) brings additional support to aPDP/PAP functions within an Access Control schema based on the XACML standard.
It has not been included in the present implementation, but it could be used to create more advanced fine-grained authorization policies and to make decisions over requests received from PEPs.

•
Orion Context Broker: The Orion Context Broker GE uses the NGSI [54] APIs and associated information model (entity, attribute, metadata) as the main interface for sharing data among stakeholders. In addition to being the centerpiece of any platform "powered by FIWARE", the Context Broker has been recognized as a CEF Building Block, which is one step forward on its path towards becoming a global standard for large scale contextual information management [55]. In the context of this implementation, it constitutes the DP's data infrastructure and SDS, which enables the sharing of data between the DC and the DP in a secure way. In other words, the DP makes use of the NGSI API provided by the Orion Context Broker in order to publish or expose the data they have to offer, whereas DCs retrieve or subscribe to said data.

The aforementioned FIWARE GEs provide all the features needed to implement the components on the DC's side (processing engine and data storage), the Access Control components (PAP, PEP, and aPDP), the SDS, and the IdP. As the FIWARE catalogue lacks any GEs that aid in the implementation of Usage Control capabilities, we have developed several ad-hoc components for this purpose. The Technical Steering Committee of FIWARE has shown interest in this proposal from its conceptualization to its materialization, since it covers the key integration aspects of cross-industry data exchange. As some of the authors of this work are members of that committee, the committee is directly connected with the design of our solution. The PTP, uPDP and PXP components are planned to be included as a new FIWARE GE in the near future:

•
The PTP is a piece of software written in Python in charge of translating the ODRL usage policies defined through Keyrock into a Complex Event Processing (CEP) program using the Flink CEP Scala API. Every time the usage policies that apply to a certain DC are modified by the DP, a new program is generated by the PTP containing a CEP rule for each policy. This program is then compiled, packaged, and sent to the uPDP.

• The uPDP is an Apache Flink computation cluster that runs all the CEP programs generated by the PTP: one for each DC. These programs take advantage of the CEP capabilities of Apache Flink to verify whether the DC complies with the policies defined by the DP or not. This is done by analyzing the traces generated by the processing engine on the DC's side (Apache Flink in this case) which are sent to the uPDP. In the event of noncompliance, the PXP is notified.

• The PXP is a piece of software that is notified each time the uPDP detects policy noncompliance. It is written in Scala and attached to each program that runs on the uPDP. The PXP enforces the control signal established by the DP for the unfulfilled policy. For instance, in order to stop a DC from receiving data as a punishment for policy noncompliance, the control signal sent by the PXP is an unsubscription request to Orion. If the DC is, in turn, processing data in an incorrect manner, one way to punish this policy violation would be to send a control signal that kills the processing job on the DC's side. These are the control signals that have been implemented so far, but the goal is to extend the capabilities of the PXP to support custom control signals written by the DP.

Figure 5 shows the proposed data usage architecture using the aforementioned FIWARE GEs and the ad-hoc components developed, as well as the workflow mechanism presented in Section 3.3. Instead of deploying the IdP as an external actor (as proposed in Figure 1), in Figure 5 we include the IdP as a part of the DP, since the IdP is provided by the Keyrock GE, which also includes the PAP and aPDP. However, a three-actor configuration like the one proposed in Figure 2 would also be feasible by deploying Keyrock separately, since Keyrock supports a multi-tenant configuration in which each DP would be mapped to a specific application. In such a case, the Access Control permissions and usage policies that apply to a specific DP would be defined and validated in the scope of the corresponding Keyrock application. Details about the application-scoped Access Control management of Keyrock can be found in [13].

The operation flow during one usage decision process is defined as follows:
• The DC sends a subscription request to the Orion Context Broker to retrieve data from the DP.

• The subscription request is intercepted by the access PEP Proxy and validated by the IdP and the aPDP, which check whether the token containing the user information is valid and whether the user has the right to access the requested resource.

• Once the subscription is done, the DC starts receiving data from the Orion Context Broker at the processing engine. The traces generated by the program, which contain all the operations performed on data, are sent to the uPDP once the usage PEP Proxy has verified the DC's identity. Moreover, this instance of the PEP Proxy is in charge of redirecting the traces to each specific uPDP CEP program. When translating the ODRL policies defined for a DC, the PTP generates a new CEP program and maps the port where it runs to the DC. Thus, when receiving the traces and after verifying the DC's identity, the PEP Proxy knows the port on which the corresponding CEP program is running and can redirect the traces to it. The uPDP then verifies that the DC complies with all the policies defined through the PAP. Otherwise, the uPDP notifies the PXP, which sends the corresponding control signal.
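The routing step of the usage PEP Proxy described above can be illustrated with a minimal Python sketch. It is not the Wilma implementation: the class `UsagePepProxy`, its token table (which stands in for the validation call to the IdP) and the port numbers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UsagePepProxy:
    """Toy model of the usage PEP Proxy: token check, then per-DC routing."""

    # token -> DC identifier; stands in for the validation call to the IdP
    valid_tokens: dict = field(default_factory=dict)
    # DC identifier -> port of the CEP program generated for that DC
    cep_ports: dict = field(default_factory=dict)

    def register_program(self, dc_id: str, port: int) -> None:
        """Record the port on which the CEP program of a DC is running."""
        self.cep_ports[dc_id] = port

    def route(self, token: str, trace: dict) -> tuple:
        """Return (port, trace) for an authenticated DC, otherwise raise."""
        dc_id = self.valid_tokens.get(token)
        if dc_id is None:
            raise PermissionError("token rejected by the IdP")
        if dc_id not in self.cep_ports:
            raise LookupError(f"no CEP program registered for {dc_id}")
        return self.cep_ports[dc_id], trace
```

In this sketch the mapping between a DC and its port is filled in by `register_program`, mirroring the moment in which the PTP deploys a new CEP program for that DC.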
To ensure integrity, confidentiality and authenticity in the exchange of traces between DC and DP, we take advantage of the facilities that IDS Trusted Connectors provide to avoid eavesdropping, manipulation and impersonation [17]. The IDS defines two layers of security with regard to communication between Connectors: point-to-point encryption using an encrypted tunnel, and end-to-end authenticity and authorization. The DC sends the traces generated by the processing engine to the DP using HTTP requests over the Internet or through a Virtual Private Network (VPN), depending on the specific scenario. In both cases, HTTPS (the secure version of the HTTP protocol) is used. HTTPS [56] adds an SSL/TLS encryption layer to protect the HTTP traffic, so the point-to-point encryption is taken care of by this protocol. End-to-end authenticity and authorization, on the other hand, are covered using the OAuth 2.0 protocol. As explained above, the DC includes an Authorization header with the OAuth 2.0 token previously created by the IdP in the HTTPS requests containing the traces. As the PEP Proxy intercepts such requests before sending them to the uPDP, the identity and permissions of the DC are validated with the IdP and the aPDP to ensure authenticity and authorization, respectively.
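As an illustration of the request described above, the following Python sketch builds (without sending) an HTTPS request that carries a trace together with the OAuth 2.0 token. The function name `build_trace_request`, the endpoint URL and the use of the `Bearer` scheme in the Authorization header are assumptions made for this example; the header format actually expected by the PEP Proxy should be checked against its documentation.

```python
import json
from urllib.request import Request


def build_trace_request(url: str, token: str, trace: dict) -> Request:
    """Build an HTTPS POST carrying one trace and the DC's OAuth 2.0 token."""
    if not url.startswith("https://"):
        # Point-to-point encryption relies on HTTPS, so plain HTTP is refused.
        raise ValueError("traces must be sent over HTTPS")
    body = json.dumps(trace).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumed header format (Bearer token usage)
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

The resulting object could then be passed to `urllib.request.urlopen`, which performs the TLS handshake for `https://` URLs.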
Lastly, verifying the correct timestamping of the traces received is also crucial to avoid replay attacks [57]. A replay attack consists of resending, possibly repeatedly, a request (trace) that has already been sent. In our proposal, timestamp verification is also delegated to OAuth 2.0 by including timestamps and nonces (numbers used once) in each of the traces generated by the DC. With this mechanism, even if an attacker tries to replay a trace, the request will be denied by the PEP Proxy: neither the timestamp nor the nonce can be changed, since both values are covered by the signature and changing them would invalidate it.
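The principle behind this protection can be sketched in a few lines of Python. The sketch below does not reproduce the OAuth-based mechanism used in our implementation; it swaps in a plain HMAC-SHA256 signature over the payload, the timestamp and the nonce, and the names `sign_trace` and `ReplayGuard`, the shared key and the 30 s validity window are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def _mac(key: bytes, payload: dict, timestamp: float, nonce: str) -> str:
    """HMAC-SHA256 over the payload together with its timestamp and nonce."""
    message = json.dumps([payload, timestamp, nonce], sort_keys=True)
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_trace(key: bytes, payload: dict, timestamp: Optional[float] = None) -> dict:
    """Attach a timestamp, a fresh nonce and a signature to a trace."""
    timestamp = time.time() if timestamp is None else timestamp
    nonce = secrets.token_hex(16)
    return {
        "payload": payload,
        "timestamp": timestamp,
        "nonce": nonce,
        "signature": _mac(key, payload, timestamp, nonce),
    }


class ReplayGuard:
    """Rejects traces that are tampered with, too old, or already seen."""

    def __init__(self, key: bytes, max_age_s: float = 30.0):
        self.key = key
        self.max_age_s = max_age_s
        self.seen = set()  # nonces accepted so far (unbounded in this sketch)

    def accept(self, msg: dict, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expected = _mac(self.key, msg["payload"], msg["timestamp"], msg["nonce"])
        if not hmac.compare_digest(expected, msg["signature"]):
            return False  # timestamp, nonce or payload was modified
        if abs(now - msg["timestamp"]) > self.max_age_s:
            return False  # stale trace
        if msg["nonce"] in self.seen:
            return False  # replayed trace
        self.seen.add(msg["nonce"])
        return True
```

A production version would also need to expire old nonces; here the set simply grows, which keeps the example short.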
Besides the secure interchange of traces between DC and DP, privacy, non-repudiation and integrity must be ensured in the whole data-sharing process, including the subscription and publication requests between the Orion Context Broker (on the DP's side) and the DC components. These requests are also protected thanks to the security features of IDS Trusted Connectors explained above. Thus, using encrypted requests, attackers cannot access shared data by brute force and, since all the requests are signed, the identity of the actors cannot be impersonated.

Validation: A Case Study in the Food Industry
To validate the proposed architecture and implementation using FIWARE, a case study has been developed in the food industry. The components presented in the implementation section have been deployed to perform the policy definition and enforcement in a shared data ecosystem. The main goal of this case study is to perform access and Usage Control over industrial data.

Scenario Overview
The scenario is composed of two actors: a food company (DP) and a marketing company (DC). The former generates a large amount of data every day, one record each time a client makes a purchase at one of its grocery stores, and later uses these data for internal big data analysis. Each record consists of the client id, the payment method, and the list of products purchased, including the product name, price and quantity. The board of the company realizes that sharing these data would allow other businesses to find new ways of extracting value from them, thus creating another source of revenue for the company. A marketing company is interested in the food company's real-time data to identify trends and launch instantaneous special offers that take these into account. For the marketing company to be able to carry out this analysis, the food company must provide a real-time channel that makes its data available. However, the food company wants to keep the marketing company from making an incorrect use of the data that would jeopardize customers' privacy. For the sake of data protection, a set of Usage Control policies is defined to enforce some constraints over the shared data. In addition, since the communication channel between the DP and the DC operates in real time, it is impossible to know a priori the number of data events that are going to be generated.
In this scenario, the DC deploys a Flink cluster for performing all the data processing operations. On the other hand, the DP deploys all the components shown on the right side of Figure 5 (i.e., the Orion Context Broker, Keyrock, Wilma, and the proposed Usage Control components), including the data generated and published by the cash registers on the Context Broker as part of the data infrastructure stakeholder.
To simulate the grocery store data, we have extracted data from real purchases from an open dataset released by a very popular French grocery store chain. We converted these data into a stream of real-time notifications by triggering purchases periodically (in periods ranging from 25 ms to 5 s). Each notification represents a single purchase and contains a timestamp, the payment method used, and a comprehensive list of the items purchased, including, for each item, the quantity and the price. As can be seen, very fine-grained information related to consumer habits can be extracted from these data. Also, it becomes apparent that the more notifications the marketing company receives, the more accurate their offers will be for the stores' customers. It would be interesting to be capable of limiting the throughput of data events to implement different monetization strategies.

Policy Specification
In the proposed scenario, the DP defines two main policies regarding data usage that apply to any external DC. The natural language definition for these policies is:
• Policy A: The DC shall NOT save the data without aggregating them every 15 min first or else the processing job will be terminated
• Policy B: The DC shall NOT receive more than 200 notifications from the Context Broker in 1 min or else the subscription to the entity will be deleted

Policy A tackles one of the main concerns in data-sharing scenarios, which is anonymization. For instance, individual purchases could be cross-referenced with credit-card statements to infer the identity of the client and his/her consumer habits. By requiring the DC to aggregate data, the individual attribute values in each notification are combined into a single value at least every 15 min (e.g., by computing the mean, the maximum, the sum, etc.), thus guaranteeing that individual records are not saved. If, for instance, the DC tries to print the data or send them somewhere else upon receipt, this policy will be violated since the entirety of the data would be transferred away from the scope of data Usage Control without anonymizing them first. In the event of policy violation, the job will be immediately terminated by sending a signal to the DC's Flink cluster manager.

Regarding Policy B, in scenarios involving large amounts of data, it is often useful to establish a limit on the throughput of data that is shared (i.e., the number of notifications in a given time). As mentioned, one possible application would be establishing different monetization strategies based on the maximum throughput of notifications allowed. In this case, the limit is set to 200 notifications per minute. In the event of noncompliance, the subscription to the entity for which the limit was surpassed will be removed.
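The throughput limit of Policy B can be illustrated with a small Python sketch. In our implementation this check is performed by a Flink CEP program on the uPDP; the class `NotificationLimit` below is a hypothetical stand-in, and the choice of a sliding (rather than fixed) one-minute window is an assumption of the example.

```python
from collections import deque


class NotificationLimit:
    """Counts notifications in a sliding time window (Policy B style check)."""

    def __init__(self, max_events: int = 200, window_s: float = 60.0):
        self.max_events = max_events
        self.window_s = window_s
        self._times = deque()  # timestamps of notifications inside the window

    def record(self, t: float) -> bool:
        """Register a notification at time t (seconds).

        Returns True when the number of notifications inside the window
        exceeds the limit, i.e., when the policy is violated.
        """
        self._times.append(t)
        # Drop notifications that have fallen out of the window.
        while self._times and self._times[0] <= t - self.window_s:
            self._times.popleft()
        return len(self._times) > self.max_events
```

With `max_events=200` and `window_s=60.0`, the 201st notification received within one minute makes `record` return `True`, which is the point at which the subscription would be removed.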
As mentioned, in order to take full advantage of all the capabilities of data Usage Control, usage policies must be defined by using a policy specification language and, although ODRL provides a powerful interface to define these [58], in the future, it will be necessary to develop new vocabularies and ontologies for tackling some currently uncovered cases. However, in this case study we include a first approximation of the use of ODRL to declare the two policies that have been presented in natural language. In Listing 1, we present the ODRL definition of policies A and B. ODRL defines three ways to declare policy rules: "permissions", "obligations" and "prohibitions", providing different options to express a policy. We use "obligations" combined with "constraints" and "consequences" for defining our two policies. In each obligation, there is a "target", which refers to a resource that is subject to a rule, and an "action", which is the operation that must be performed on the target as part of the obligation. Actions can be limited by "constraints", which can be temporal, spatial, amount-based, etc. In addition, "consequences" define what happens in case of noncompliance.
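To make the structure of an obligation more concrete, the following Python sketch represents one as a dictionary and checks that the parts named above are present. It is only an illustration and does not reproduce Listing 1: the target identifier, the "timeInterval" left operand, the "lteq" operator and the "PT15M" duration are assumed values, and `check_obligation` is a hypothetical helper.

```python
REQUIRED_KEYS = ("target", "action", "constraint", "consequence")
CONSTRAINT_PARTS = ("leftOperand", "operator", "rightOperand")

# Illustrative obligation in the spirit of Policy A (all values are assumptions).
policy_a = {
    "target": "urn:example:ngsi:notification",  # hypothetical identifier
    "action": "aggregate",
    "constraint": [
        {"leftOperand": "timeInterval", "operator": "lteq", "rightOperand": "PT15M"}
    ],
    "consequence": [{"action": "killJob"}],
}


def check_obligation(obligation: dict) -> list:
    """Return a list of structural problems; an empty list means none found."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in obligation]
    for constraint in obligation.get("constraint", []):
        for part in CONSTRAINT_PARTS:
            if part not in constraint:
                problems.append(f"constraint missing {part}")
    return problems
```

A structural check of this kind is the sort of validation a translator could run before generating any code from a policy.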
The first obligation represents Policy A. In this case, the "target" is the NGSI notification received from the Context Broker. The action that the DC is required to perform in this policy is "aggregate" (combine individual data values into one). In addition, we define constraints applied to said action: the terms "leftOperand", "operator", and "rightOperand" allow us to define the logical constraints to be applied. The values presented in this fragment of code mean that the DC is obliged to aggregate the notifications received at least every 15 min before generating an output. Finally, as a consequence, we establish that a kill signal for stopping the running program associated with this rule will be sent ("killJob" action). As can be seen, a similar approach is followed in the second obligation, which represents Policy B. The aim of using ODRL is to provide dynamic capabilities so as to enable the PTP to generate, on the basis of the policies, an extended automaton that will run on the uPDP.
Listing 1: ODRL Specification of Policy A and Policy B.
Listing 2: uPDP code generated from the ODRL Specification by the PTP.

pattern(entityStream, countPattern)
  .select(events => Signals.createAlert(Policy.COUNT_POLICY, events, Punishment.UNSUBSCRIBE))

// Second pattern: Source -> Sink. Aggregation TimeWindow
val aggregatePattern = Pattern.begin[ExecutionGraph]("start", AfterMatchSkipStrategy.skipPastLastEvent())
  .where(Policies.executionGraphChecker(_, "source"))
  .notFollowedBy("middle")
  .where(Policies.executionGraphChecker(_, "aggregation", Policies.aggregateTime))
  .followedBy("end")
  .where(Policies.executionGraphChecker(_, "sink"))
  .timesOrMore(1)

CEP.pattern(operationStream, aggregatePattern)
  .select(events => Signals.createAlert(Policy.AGGREGATION_POLICY, events, Punishment.KILL_JOB))
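The kind of translation performed by the PTP can be hinted at with a short Python sketch. It is not the PTP code: `render_select` and `duration_to_seconds` are hypothetical helpers, the "unsubscribe" action name is assumed, and the time constraint is assumed to be encoded as an ISO 8601 duration such as PT15M. Only the identifiers `Signals.createAlert`, `Policy` and `Punishment` are taken from Listing 2.

```python
import re

# Consequence action -> Punishment constant used in Listing 2.
# "unsubscribe" is an assumed action name; "killJob" appears in the policy text.
PUNISHMENTS = {"killJob": "KILL_JOB", "unsubscribe": "UNSUBSCRIBE"}


def duration_to_seconds(value: str) -> int:
    """Convert a simple ISO 8601 time duration (e.g., 'PT15M') to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def render_select(policy_const: str, consequence_action: str) -> str:
    """Render the '.select(...)' line that raises an alert for a policy."""
    punishment = PUNISHMENTS[consequence_action]
    return (
        ".select(events => Signals.createAlert("
        f"Policy.{policy_const}, events, Punishment.{punishment}))"
    )
```

For example, `render_select("AGGREGATION_POLICY", "killJob")` yields the last line of Listing 2, while `duration_to_seconds("PT15M")` gives the 900 s window that an aggregation check would need.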

Data Processing and Policy Enforcement
Once the policies have been defined, the DC may start to deploy processing jobs on their own infrastructure with the aim of extracting value from the supermarket data received. In order to validate the two policies defined by the DP, we have created two sample jobs that operate on the DP's data:

Job I: Direct sinking of ticket data
The first job reads the data received from the DP and sends them somewhere else, outside of the scope of the data usage architecture, where operations on data are not monitored. Since this use allows the DC to process each piece of data individually, without prior aggregation, it is a clear violation of Policy A. When the DC deploys this job in the Flink cluster, an Execution Graph is calculated from the program code. The Execution Graph is the central data structure that coordinates the distributed execution of a data flow. It contains a representation of each parallel task, each intermediate stream, and the communication among them. Figure 6 shows the Execution Graph generated for Job I, in which all the operations performed on data are included. The first item in the Execution Graph is a Custom Source. It indicates that the program uses a custom connector as an input for receiving data streams. In this case, the custom source is the one provided by the FIWARE Cosmos GE. Since no additional operations are performed on data, the Source is immediately followed by a Data Sink, which consumes Data Streams and forwards them to files, sockets, external systems, etc., or prints them on the standard output. The log containing the chain of operations (as shown in Listing 3) generated by the DC's Flink processing engine is sent to the uPDP, which will detect that the Execution Graph contains no aggregation of data, thus failing to comply with Policy A. The uPDP will inform the PXP of this violation, and the PXP will send the corresponding control signal described in the policy; in this case, terminating the job as a punishment for noncompliance.
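The check that the uPDP performs on the chain of operations can be sketched in Python as follows. The actual verification is the Flink CEP pattern generated by the PTP; the function `complies_with_policy_a` and the plain labels "source", "aggregation" and "sink" are simplifications assumed for this example, and the time-window condition of the policy is deliberately left out.

```python
def complies_with_policy_a(operations: list) -> bool:
    """Check that no sink is reached without a prior aggregation.

    `operations` is the ordered chain of operation labels extracted from an
    Execution Graph log, e.g. ["source", "aggregation", "sink"].
    """
    aggregated = False
    for kind in operations:
        if kind == "source":
            aggregated = False  # fresh, non-aggregated data enters the job
        elif kind == "aggregation":
            aggregated = True
        elif kind == "sink" and not aggregated:
            return False  # raw records would leave the monitored scope
    return True
```

Under these assumptions, the chain of Job I (`["source", "sink"]`) is flagged as noncompliant, whereas a chain that aggregates before sinking is accepted.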
One major concern about using the Execution Graph for policy enforcement is ensuring that it has been correctly generated. The processing engine itself is in charge of detecting all the operations performed within a program and generating the Execution Graph. The integrity of the processing engine relies on the use of trusted environments, achieved through IDS connectors, in which no alteration of the run-time environment is allowed. Thus, all the operations that the processing engine detects it must perform within a certain program are reflected on the Execution Graph and no operation is overlooked or disregarded. The Execution Graph is reflected on a log which is then sent to the uPDP. The integrity, confidentiality and authenticity of the in-transit logs are ensured by means of the mechanisms explained at the end of Section 4.

Job II: Calculating average ticket price
The second job calculates the average ticket price for all the purchases of each store every hour. This operation is an aggregation of data so, when the Execution Graph is checked by the uPDP, it will be verified that it complies with Policy A. Figure 7 represents the Execution Graph generated for this use case. The logs that are sent to the uPDP containing this Execution Graph are shown in Listing 4. Besides sending the Execution Graph logs to the uPDP, each time the DC receives information from one ticket, this event is logged and sent to the uPDP as well (as shown in Listing 5). If the uPDP detects that the DC has received more tickets than the amount specified by Policy B (200), the PXP will be notified and will enforce the corresponding punishment (i.e., removing the subscription to the tickets' data).
charset=utf-8","entities":[{"id":"ticket","type":"ticket","attrs":{"_id":{"type":"String","value":75,"metadata":{}},"items":{"type":"object","value":[{"net_am":4.99,"n_unit":1,"desc":"BREAD"},{"net_am":5.5,"n_unit":2,"desc":"PIZZA HAM/CHEESE"},{"net_am":2.39,"n_unit":1,"desc":"FRANKFURT SAUSAGES"},{"net_am":0.05,"n_unit":1,"desc":"SHOPPING BAG"}],"metadata":{}},"mall":{"type":"String","value":1,"metadata":{}},"date":{"type":"date","value":"01/14/2016","metadata":{}},"client":{"type":"int","value":77053280208,"metadata":{}}}}],"subscriptionId":"5d308d139d5b4d3e64685da0"}

Overall, the case study presented in this section, including the deployment of both DC's jobs, shows that the data usage architecture provides a way of verifying that the DC complies with a set of predefined policies by the DP and of executing punishments in case of noncompliance.

Results
This subsection presents a series of measurements taken in the case study with the aim of calculating the enforcement time of the policies defined. To this end, the two jobs presented were deployed and noncompliance conditions were triggered for both of them.
In accordance with the scheme presented in the implementation section, we define the deployment of all the components as follows: every building block of the DC and DP was deployed using Docker containers (Docker: https://www.docker.com). However, in order to test the Usage Control policies and obtain more accurate metrics, we deployed the DC and the DP on different platforms. On the one hand, the DP's containers were placed inside a VM (Virtual Machine) in an Edge Computing Infrastructure using OpenStack (OpenStack: https://www.openstack.org). This VM has the following features: 2 vCPUs, 4 GB RAM, 40 GB Disk. On the other hand, the DC's containers were placed in a local server located in the same network as the DP's VM with the following specifications: 4.0 GHz Intel Core i7 CPU with 8 GB RAM and 256 SSD Disk. This deployment allowed testing different policies and measuring the overhead time. The workflow of submitting a job on the DC's side, detecting the policy noncompliance, and enforcing the due punishment was repeated N times (N = 100) for each policy. Through the system logs generated by the DC's and DP's containers, three different metrics were calculated: T_d, T_x and the total time T_t.

Table 2 summarizes the results obtained for each policy and interval, including the mean time (M) and the standard deviation (SD). The results for each iteration can be seen in Figures 8 and 9.

As is apparent from the results shown in Table 2, the times registered for T_d are significantly larger than those recorded for T_x. The main reason for this difference is the fact that T_d involves generating the logs on the DC's side, writing them on the processing engine's log file and sending them to the uPDP, which will receive them after they are first verified by the PEP. 
By contrast, T_x is very low since the PXP is embedded in the same program that the uPDP is running, which means that the delay introduced stems from the time it takes to receive an acknowledgement from the control signal sent to the system in charge of performing the actual punishment (to the Context Broker in order to remove a subscription, or to the DC's processing engine to cancel the job).
Furthermore, the slight difference in measurements for T_x between policies A and B draws from the difference in response times between the Context Broker and the processing engine on the DC's side. On the other hand, the difference in T_d between policies A and B is due to the fact that in the former, the noncompliance is detected by inspecting the Execution Graph, which is received by the uPDP as a single log, whereas in the latter, the uPDP needs to inspect the message history to confirm that the notification limit established within the policy has been exceeded, which is a more costly operation.
As far as the total time (T_t) is concerned, T_x can be neglected for its calculation if the punishment does not involve interacting with stakeholders outside a shared network (the Context Broker, for instance). Otherwise, the network latency needs to be taken into account. This holds true for T_d as well, since typically the DC and the DP are in physically separate networks. The results obtained for T_t fall within a reasonable range of values for most use cases, in which new data are generated every second or every few seconds. However, in scenarios in which new data are published within milliseconds, there could be a period between the infringement of the rules and the enforcement of the punishments in which new data are unduly received by the DC. In order to verify that this was not the case, we tested our solution under different stress situations. We deployed the use case scenario using different frequencies of generation of new data. The main goal was to corroborate that no additional data events arrived between the moment an infringement was committed and the moment the due punishment was executed. The use case scenario was tested for data generation periods ranging from 5 s to 25 ms (5 s, 1 s, 500 ms, 250 ms, 100 ms, 50 ms, 25 ms), repeating each simulation 100 times. We found that all the situations of data infringement were detected, and the appropriate punishment was enforced in due time in 100% of the cases. Thus, no new information was unduly received after failing to comply with a certain policy. Nevertheless, it is worth pointing out that for periods under 25 ms, we have to consider the throughput limit of the SDS (in our case the Context Broker), since this component is the one that determines the actual speed at which data will be sent to the endpoints that are subscribed to new data.
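The way in which the M and SD columns of Table 2 can be derived from the per-iteration measurements is sketched below in Python. The helper names `summarize` and `total_times` are hypothetical, and two assumptions are made: that the total time of an iteration is the sum of its two partial times, and that the sample (rather than population) standard deviation is used. No measured values are reproduced here.

```python
from statistics import mean, stdev


def total_times(t_d: list, t_x: list) -> list:
    """Per-iteration total time, assuming T_t = T_d + T_x."""
    if len(t_d) != len(t_x):
        raise ValueError("both series must have one value per iteration")
    return [d + x for d, x in zip(t_d, t_x)]


def summarize(samples: list) -> dict:
    """Mean (M) and sample standard deviation (SD) of a series of timings."""
    return {"M": mean(samples), "SD": stdev(samples)}
```

Applying `summarize` to each of the three series of N = 100 timings for a given policy would yield one row of a table with the same layout as Table 2.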
Prior works [33,34] have also collected a series of metrics to validate their models. For example, Marra et al. [34] determine whether their Usage Control system performs better when it is applied to local attributes or to remote ones. Furthermore, instead of measuring enforcement time, Wu et al. [33] focus on performance at the Access Control level. Although an exact comparison between said models and the one presented in this work cannot be performed, since none of the works found in the existing literature provide measurements of the decision and enforcement times, it can be seen that the measurements for delay and response times are in the same order of magnitude.

Conclusions and Future Work
The implementation of the architecture presented in this paper provides a comprehensive and affordable solution for providing access and Usage Control in industrial data ecosystems. One of the advantages of this proposal stems from the fact that it is a technology-agnostic solution and can therefore be implemented in a wide range of scenarios. This characteristic, along with its fine-grained Access and Usage Control capabilities and its multi-actor architecture, contributes to filling the gap in the existing literature.
Moreover, this piece of research also presents an implementation of the referenced architecture using FIWARE Generic Enablers that completes the previously proposed implementation of IDS architecture [17]. The implementation presented has been validated with a use case in the food industry, presenting a series of metrics of the response time of policy compliance verification and punishment enforcement. The data Usage Control components developed in the scope of this work (uPDP, PXP and PTP) have been proposed and accepted as a new Generic Enabler in the FIWARE catalogue.
As a conclusion to this work, we identify the need for defining a policy specification language capable of handling the fine-grained policy definitions to provide data Usage Control capabilities in Industry 4.0. Furthermore, we consider that this architecture could be extended to become GDPR-compliant by introducing the GDPR regulation rules inside the definition of the policies and deploying them inside this architecture. The work presented in [59] provides a preliminary version of the definition of ODRL policy specification oriented to GDPR. This is still an ongoing research topic. We consider that providing an ODRL vocabulary for GDPR and some policy examples can lead to the inclusion of this vocabulary as an ODRL extension in the W3C working groups and further be supported by the community interested in this topic. We will focus our future efforts on the Policy Specification Language definition, conceived as an extension of ODRL, as well as the definition of a common vocabulary that allows standardization and identification of the events and traces of the system for the different processing engines. Additionally, we intend to develop new and more complex tests that allow us to extract additional metrics in the scope of data protection.
Another possible area of future research would be to investigate the integration of Blockchain technologies within the proposed Usage Control architecture. Data Providers could store the Usage Control policies that are applied to specific data and to specific Data Consumers in the distributed ledger and check them whenever the latter perform an operation. Blockchain scales best with lightweight information types, so the data itself should remain out of the ledger and only the metadata representing the data to which Usage Control policies are applied should be stored in the ledger. This approach would enhance trust and transparency over data accountability and traceability between consumers and providers. However, further research needs to be performed on how to integrate Blockchain in dynamic environments with a high density of requests, such as those in IoT scenarios.