Many organisations are complex entities that perform heterogeneous processing on diverse personal data, often organised using multiple organisational units or outsourced processing partners and sometimes under the jurisdiction of multiple Data Protection Authorities (DPAs). Under the EU’s General Data Protection Regulation (GDPR), organisations that act as a ‘Data Controller’ are obliged to create and maintain a “Register of Processing Activities (ROPA)” as a comprehensive record of personal data processing activities carried out under their responsibility (GDPR Art.30). The ROPA, as described in the GDPR, is a temporal snapshot of the organisation’s practices and is the point of initiating communication or investigation regarding compliance such as with a DPA. It is thus an important part of the organisation’s processes related to ensuring and documenting its compliance.
In practice, organisations struggle to keep accurate and up-to-date ROPAs [1]. They often fail to integrate the maintenance and management of the ROPA into their day-to-day operations [1]. This can result in a breakdown of the GDPR accountability principle (GDPR Article 5.2), as there is a lack of clarity as to who updates the ROPA, how, and when. To assist organisations with their ROPA-related duties, DPAs have provided guidance and templates intended to ease the task of understanding requirements and to harmonise documentation through commonly used formats and environments such as spreadsheets [2]. In providing these templates, DPAs indicate what can be considered ‘good practice’ regarding what information should be documented within a ROPA. However, despite being based on a common legal obligation (GDPR Art.30), the templates provided by DPAs vary, and some encourage documenting additional fields that are not in the GDPR [2]. An organisation operating in multiple jurisdictions is thus tasked with consolidating differing requirements from each DPA as either a distinct set of ROPA documents or a single combined one.
Furthermore, gathering the information necessary to create a ROPA is not a one-off activity [4], as there may be several data sources both internally (e.g., departments) [5] and externally (e.g., data processors) [5]. Therefore, ROPA creation requires communication between these distinct units to collate information pooled from ‘heterogeneous sources’ into a single location. This necessitates some form of information management process for the tasks associated with documents, such as reading or viewing them, writing all or parts of them, exchanging them between relevant stakeholders, and ensuring their correctness and availability (e.g., backups or version control).
To address such requirements, market vendors offer dedicated solutions for ROPA management, often as part of a larger suite of GDPR compliance tools [7]. This follows the increasing trend of organisations adopting regulatory technology (RegTech) [8] to assist with legal compliance and requirements. Support for the ROPA is poised to be a key feature of such tools given its importance in GDPR compliance processes.
However, these RegTech solutions are primarily centralised and proprietary, and they emphasise custom processes that cannot be utilised outside vendor-defined use cases. In particular, the information being exchanged between internal and external stakeholders has been poorly addressed in academic research and commercial offerings (see Section 2) despite the need for shared business and regulatory taxonomies for facilitating semantic interoperability [10] between stakeholders to identify feasible and compliant software solutions for data protection and privacy regulations [11].
There is a lack of ROPA-related explorations in academic research, with existing efforts limited to early-stage work involving enterprise architecture models [13] or data [14]. For larger projects that have focused on GDPR compliance with explicit requirements regarding non-proprietary technologies and focusing on interoperability (e.g., semantic web), there is a distinct absence of research addressing ROPA-related tasks despite overlapping with the same information requirements. In terms of ongoing work, the ONTOROPA project [12] proposes building a semantics-based ROPA with blockchain-based trust guarantees.
We propose an approach to solving these challenges, whereby we identify what data is required to complete the ROPA, who the ROPA stakeholders are, how they utilise the ROPA, and what information flows requiring interoperability and machine-readability of the ROPA are required. To address the identified challenges and their solutions, we present our work based on the following research objectives:
Identify information and information flows relevant for a ROPA in terms of stakeholders based on the GDPR and EU DPA guidelines and templates;
Develop a machine-readable specification for representing and exchanging ROPA relevant information in an interoperable manner;
Specify a mechanism for using developed machine-readable formats for aggregation, querying, validation, and exporting of information based on identified ROPA-related information flows.
Our previous work on this topic consisted of creating a semantic model of a ROPA [5]. In this, we evaluated the GDPR and six DPA templates and guidelines to identify a set of concepts required for the representation of ROPA-related information and proposed its formulation as a ‘common semantic model’ for representing commonality across the EU. We utilised the data privacy vocabulary (DPV) [15], developed by the W3C Data Privacy Vocabularies and Controls Community Group (DPVCG), as a vocabulary for representing identified concepts (Note: H. J. Pandit chairs DPVCG and is the editor of DPV https://www.w3.org/community/dpvcg/ as of 5 May 2022). We found and reported missing concepts to DPVCG, which subsequently extended the DPV with our contribution. We further developed our common semantic model into a proposal for establishing a ‘Data Processing Catalogue’ (DPCat) [16] that utilises the Data Catalog Vocabulary (DCAT) [17] and its extension, the DCAT Application profile for data portals in Europe (DCAT-AP) [18], to represent the ROPA-related information in the form of ‘datasets’ and ‘catalogues’ that could be maintained, used, and shared consistently.
This article expands on our prior work to provide a more complete and feasible solution for establishing a common machine-readable and interoperable mechanism for a common representation of a ROPA. We extended the common semantic model to incorporate ROPA templates from all EU DPAs (17 of 31 DPAs have published templates) and updated the DPCat specification and the DPV to support representing this information. To demonstrate its practical application and usefulness, we applied the DPCat specification to ROPA documents published by the European Data Protection Supervisor (EDPS) for each identified use case (see Section 6). Finally, we go beyond the state of the art by demonstrating the potential of our solution in realising the EU’s ‘Data Spaces’ vision [19] by creating ‘compliance-related specifications’ that support representation (RDF), querying (SPARQL), validation (SHACL), and exchange (DCAT + DPV) of information.
The principal contributions of this paper are summarised as follows:
Use cases exploring ROPA data governance and stakeholders (RO1);
A Common Semantic Model for ROPA (CSM-ROPA) representing information requirements from EU DPAs (RO2);
Data Processing Catalogue (DPCat) specifications for representing and exchanging ROPA-related information and provenance (RO2);
Demonstration of representation, querying, validation, and exchange of ROPA-related information using DPCat and semantic web technologies (RO3);
Discussion on the practicality and application of DPCat as a ‘common mechanism’ for exchanging compliance information.
All associated data in documents, analysis, code, and executable artefacts are available under an open and permissive licence at https://w3id.org/dpcat/repo
The remainder of the paper is structured as follows: Section 2 discusses the state of the art and related work, and Section 3 describes the development of the Common Semantic Model for ROPA (CSM-ROPA). In Section 4, we discuss ROPA information flows and data governance requirements. Section 5 describes the DPCat data processing catalogue, which enables ROPA information sharing, aggregation, and querying for ROPA stakeholder interoperability. Section 6 provides an application use case to demonstrate the practicality and feasibility of DPCat. The remainder of the paper discusses the impact of our approach on real-world use cases, based on enabling better automation and tooling for regulatory compliance and, critically, on easing the investigative burden on authorities towards effective enforcement. It closes with our conclusions and recommendations for future work.
3. A Common Semantic Model for ROPAs (CSM-ROPA)
Despite a ROPA being based only on requirements established by the GDPR Art. 30, our prior work found variance amongst ROPA templates provided by six DPAs in terms of what information needed to be documented. The additional fields were related to what the DPAs considered best practices to assist organisations in collecting and representing information from their various business processes. We harmonised the requirements from different templates to construct a ‘common semantic model’ for a ROPA (CSM-ROPA) to enable the representation of all DPA-specified ROPA information [2]. We then represented these information requirements through concepts from the data privacy vocabulary (DPV) [15] to provide an interoperable machine-readable vocabulary that can act as a mediation mechanism between stakeholders and tools operating on a ROPA and associated compliance processes. In this section, we present results from our extended work where we analysed and incorporated ROPA templates from all EU DPAs to create a single (and truly) ‘common semantic model’ for a ROPA and represented it using DPV to provide a consistent and interoperable specification for representing a ROPA and its relevant information.
3.1. Analysis of DPA ROPA Templates
The GDPR has 31 DPAs representing nations and member states from the EU and the EFTA EEA (Note: based on EDPB membership this consists of 31 DPAs from 27 EU member states, the EDPS, and 3 additional members comprising the EFTA EEA states; the German regional DPAs were considered part of the national German DPA). Each DPA provides guidance on the ROPA grounded in the GDPR Art.30, and some DPAs also provide templates to assist organisations with maintaining their ROPA documents. In our prior work analysing six DPA templates [2], we found that the DPA ROPA templates go beyond the GDPR Art.30 requirements, are not consistent with other DPA templates, and represent a challenge in producing a ‘collective understanding’ of what information is required for maintaining a ROPA.
In this work, we expanded the analysis to all 31 DPAs and found that 17 DPAs provided ROPA templates varying in language and content (Note: Of the 17 DPA templates, 5 used English. We used Google Translate to convert the rest to English and manually ensured consistency in translation between templates regarding terms used). On these 17 templates, we performed term extraction, semantic analysis, term frequency enumeration, de-duplication, and antonym/homonym identification. Templates with minimal information restricted their contents to what is required to conform with the GDPR Art.30. Some templates, such as those provided by the Belgian and Greek DPAs, were extensive and included fields beyond what the GDPR or other DPAs suggested.
The exercise, carried out over 2020–2022, yielded 47 unique concepts representing information to be recorded in a ROPA. Of these, 18 concepts were related to the requirements defined in the GDPR Art.30, and the rest (29 concepts) were either supplementary to these or added by DPAs. An overview of the exercise is presented in Appendix A, which shows the identified concepts and their relevance to each DPA template analysed. (Note: We could not discern a source or basis in law (EU or national) for concepts added by DPAs).
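As an illustration only, the term frequency enumeration and de-duplication steps can be pictured with the following minimal sketch. The template names, field labels, and synonym table are hypothetical placeholders and do not reproduce the data in Appendix A; the semantic analysis itself was a manual exercise and is not modelled here.

```python
from collections import Counter

# Hypothetical field labels from three made-up templates (not the real DPA data).
TEMPLATES = {
    "dpa_a": ["Purpose of processing", "Categories of data subjects", "Retention period"],
    "dpa_b": ["Purposes", "Data subject categories", "Storage period", "Data source"],
    "dpa_c": ["purpose of processing", "Retention Period"],
}

# Hypothetical synonym table mapping label variants to one canonical concept.
CANONICAL = {
    "purposes": "purpose of processing",
    "data subject categories": "categories of data subjects",
    "storage period": "retention period",
}


def normalise(label: str) -> str:
    """Lower-case, collapse whitespace, and map known variants to one concept."""
    key = " ".join(label.lower().split())
    return CANONICAL.get(key, key)


def concept_frequency(templates):
    """Count in how many templates each de-duplicated concept appears."""
    counts = Counter()
    for labels in templates.values():
        # A set ensures a concept is counted at most once per template.
        counts.update({normalise(label) for label in labels})
    return counts


frequency = concept_frequency(TEMPLATES)
```

In this toy input, the seven distinct labels collapse to four concepts, which mirrors in miniature how label variants across templates reduce to a smaller set of unique concepts.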
3.2. Developing a Semantic Model for ROPA Using DPV
In our previous work [2], we utilised DPV to represent terms identified from ROPA templates as machine-readable and interoperable concepts for use in information management and compliance-based approaches. Through this, we proposed a ‘Common Semantic Model for ROPA’ (CSM-ROPA). In this section, we describe our work in expanding the CSM-ROPA to cover additional requirements and concepts identified from the analysis of DPA ROPA templates and incorporate the updates made to DPV.
The DPV provides a semantic vocabulary consisting of hierarchical taxonomies of concepts relevant to the GDPR such as personal data, purposes, processing operations, technical and organisational measures, legal bases, and entities. We chose DPV as it provides the most comprehensive vocabulary for our purposes, is open and accessible, has ongoing development and mechanisms to submit contributions, and is familiar to the authors.
The process of representing identified concepts using DPV used the methodology [39], where for each term, we constructed a competency question to identify relevant DPV concepts. For example, the term ‘purposes’ was phrased as the question: ‘What are the Purposes of processing?’ We then identified whether the DPV contained the (semantically) exact concept, which we call an ‘exact match’. Failing that, we looked for the closest relevant term(s) that could be used as a substitute, called a ‘partial match’. If no existing term could represent the term, we considered it a ‘new term’ to be proposed to the DPVCG for inclusion in the DPV. Of the 47 unique concepts found through the analysis of ROPA templates, we found 44 exact matches, one partial match, and two new terms, which were proposed and added to the DPV. Appendix A and Appendix B provide an overview of this outcome.
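The three possible outcomes of this matching can be summarised in a short sketch. The matching reported here was performed manually through competency questions, so the code below only illustrates the decision order; the vocabulary subset and the substitute table are hypothetical stand-ins rather than the actual DPV contents.

```python
# Hypothetical, tiny stand-in for the DPV concept list (not the real vocabulary).
VOCABULARY = {"Purpose", "LegalBasis", "Recipient", "StorageDuration"}

# Hypothetical hand-curated substitutes for terms with no exact equivalent.
CLOSEST = {"RetentionSchedule": "StorageDuration"}


def classify(term: str):
    """Return (outcome, matched concept) following the exact/partial/new order."""
    if term in VOCABULARY:
        return ("exact match", term)
    if term in CLOSEST and CLOSEST[term] in VOCABULARY:
        return ("partial match", CLOSEST[term])
    # Nothing suitable exists, so the term becomes a candidate for proposal.
    return ("new term", None)
```

For example, under these assumptions `classify("RetentionSchedule")` falls through the exact check and is reported as a partial match on `StorageDuration`.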
The output of this was the CSM-ROPA consisting of 47 concepts covering information requirements from the GDPR and DPA templates for representing a ROPA. CSM-ROPA, through the use of DPV concepts, provides the ability to express a ROPA as a machine-readable and interoperable ‘graph’ that can be utilised in technological solutions for automating processes associated with ROPA and GDPR compliance. The CSM-ROPA data and analysis are available online at https://w3id.org/dpcat/csm-ropa
4. Information and Data Governance for ROPA
The CSM-ROPA, described in the previous section, enables the representation of a ROPA in a machine-readable and interoperable manner and covers information requirements from the GDPR and DPA ROPA templates. However, a ROPA is not a single document in practice but is a related set of evolving information that must be periodically collected and maintained. The information required for maintaining a ROPA thus may have one or more internal sources such as a department, unit, or assigned person where such ‘organisational units’ provide data about their respective processes and activities. A ROPA may also have one or more external sources—such as processors, contractors, vendors—where such ‘external entities’ provide the information required for establishing records of agreed activities and assurance of compliance obligations.
The ROPA provides the DPO with an important overview of the organisation’s practices [22] and is part of the DPO’s obligations regarding compliance (GDPR Art.39) [40]. This requires communication between internal stakeholders such as units or departments and external stakeholders such as DPAs, auditors, and certification bodies to collate necessary information for ROPA governance.
We present five use cases that explore the key stakeholders and their roles regarding the ‘heterogeneous sources’ in ROPA-related data governance. This follows the methodology from prior work [6] regarding identifying stakeholders and information flows related to GDPR compliance and establishing the utility of developing machine-readability and semantic interoperability mechanisms based on it. (Note: In this, we relied on P. Ryan’s experience as an active DPO for over 30 legal entities).
In our analysis, we considered the DPO as the nominated entity with responsibility within an organisation to oversee the ROPA-related processes as per the obligations from the GDPR (Art. 39). From this perspective, we explore possible combinations based on the existence or involvement of specific stakeholders and their effect on the DPO’s duties to collect and maintain ROPA-related information. We also considered a data controller as the primary type of organisation, even though a data processor is also required to maintain a ROPA and involve a DPO as a stakeholder. The data controller’s use cases are more complex than a data processor’s, and a solution satisfying a controller’s ROPA requirements can be trivially modified for use by a processor.
This exercise concludes with an argument for the expression of ROPA-related information in a machine-readable and interoperable format. Section 5 then presents DPCat as our solution to communicate or exchange ROPA-relevant information between stakeholders and assist in the automation of compliance processes.
4.1. Use Case U1: Data Controller
This use case, illustrated in Figure 1, represents a single data controller that maintains a ROPA (GDPR Art.30) for which it identifies and documents relevant processing activities conducted under its responsibility. In addition, as best practice, the controller must assess guidelines and templates provided by relevant DPA(s) and adapt its documentation processes accordingly to meet any additional suggestions or requirements. A ROPA produced by a controller is utilised by its DPO as part of the responsibility to oversee compliance. The ROPA may also be accessed by a DPA or an auditor (e.g., a certification body) as part of their correspondence with the controller or an investigation or auditing process.
The information flows between these stakeholders can involve: (i) a ROPA that conforms to the GDPR Art.30 requirements; (ii) a ROPA that conforms to a DPA’s guidelines and templates; (iii) provenance, e.g., ROPA issuer, timestamps, contact details; and (iv) a selective part of ROPA, e.g., temporal period, specific processing activities.
4.2. Use Case U2: Data Controller with Internal Organisational Units
The second use case, presented in Figure 2, expands U1 with internal information flows through four ‘organisational units’ or departments: marketing, human resources, IT services, and web services, where relevant data for generating a ROPA must be collated into a common location [4]. U2 also involves potential follow-ups with each unit regarding maintenance of records per department, and establishing ‘points of contact’ and ‘responsible entity’. The external information flows, i.e., DPAs and auditors, stay the same as internal units are not separate legal entities subject to direct investigation.
The key information flows between these stakeholders in addition to those in U1 may involve: (i) complete or partial ROPA information for each internal organisational unit; (ii) provenance, e.g., department as issuer, point of contact or responsible entity, timestamps, contact details; (iii) collation of department information into a common ROPA for external stakeholders; and (iv) a selective part of ROPA, e.g., specific department.
4.3. Use Case U3: Data Controller with Data Processors
The third use case, illustrated in Figure 3, has additional information flows where the controller and its DPO collect relevant information from appointed processors. In cases where a processor is common to all departments or is managed at the organisational level, U3 is an extension of U1. Where organisational units utilise specific external vendors (i.e., data processors), U3 is an extension of U2. In this, we consider the practical situations where data governance is often managed by internal units despite the GDPR associating data processors directly with a data controller.
The key information flows between these stakeholders, in addition to U1 and U2, involve: (i) ROPA information from appointed processors; (ii) provenance, e.g., sources, timestamps, contact details; (iii) collation of information from heterogeneous sources into a common ROPA; and (iv) a selective part of ROPA, e.g., specific processor.
4.4. Use Case U4: Data Controller in a Joint Controllers Relationship
The fourth use case, illustrated in Figure 4, expands U3 with the Data Controller being in a joint controllers relationship with two or more controllers sharing the responsibility of processing as per GDPR Art.26. Similar to the possibility of associating processors with organisational units in U3, joint controllers can likewise be associated with units for situations where the processing is limited to a unit’s activities. In U4, the controller and its DPO have additional information flows regarding collecting relevant information from other (joint) controllers and any potential follow-ups.
The key information flows for these stakeholders, in addition to U3, involve: (i) ROPA information from joint controllers; (ii) provenance, e.g., sources, timestamps, contact details; (iii) collation of information from heterogeneous sources into a common ROPA; and (iv) a selective part of ROPA, e.g., specific (external) controller.
4.5. Use Case U5: DPO Overseeing Multiple Data Controllers
Use cases U1–U4 considered the perspective of a data controller that employs a DPO in terms of managing their ROPA information. In U5, illustrated in Figure 5, we consider the scenario of a DPO being an external organisation or individual providing ‘DPO-as-a-service’. We call this entity ‘External DPO’, and consider their duties as involving overseeing multiple organisations. The external DPO has to address U1–U4 for several organisations which translates to additional information flows. This is distinct from information flows associated with other external entities, i.e., DPAs or auditors, in that the external DPO requires information including internal organisational units and data governance processes for an accurate understanding and potential follow-up tasks.
The key information flows for these stakeholders, in addition to U1–U4, involve: (i) collection of ROPA information from multiple organisations; (ii) production of a ROPA for a specific controller; (iii) provenance, e.g., sources, timestamps, contact details; (iv) separation of ROPA-related information reflecting organisational units, e.g., departments; and (v) a selective part of ROPA, e.g., specific department for a specific controller.
4.6. Requirements Analysis
Since the GDPR does not dictate how a ROPA is generated or maintained as long as it meets the legal requirements, the organisation has the freedom to determine practices that suit its compliance approach and style. For example, an organisation may choose to maintain ROPAs centrally overseen by the DPO, where information from all sources is fetched externally and collated into a common document (e.g., a spreadsheet) and added to the information management system. Alternatively, an organisation may opt to maintain separate ROPA documents for each of its departments.
In either case, upon being asked by a DPA or an auditor, an organisation has to produce a ROPA for the specified criteria such as reflecting specific processing activities or for a certain time period. An organisation must first identify the relevant information from its ROPA documents and extract the required information. This task can involve manual efforts by the DPO or responsible entity unless the organisation utilises technological solutions that support such use cases and provide an easier workflow based on automation.
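To make the extraction task concrete, the following minimal sketch selects the entries that match a requested time period and, optionally, a specific processing activity. The `Record` structure, its field names, and the sample entries are hypothetical and serve only to illustrate the kind of selection a DPA or auditor request implies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    activity: str
    source: str        # e.g. the department or processor supplying the entry
    valid_from: date
    valid_to: date


def select(records, start, end, activity=None):
    """Return records whose validity overlaps [start, end], optionally for one activity."""
    return [
        r for r in records
        if r.valid_from <= end and r.valid_to >= start
        and (activity is None or r.activity == activity)
    ]


# Hypothetical entries; real ROPA content would come from the organisation's records.
RECORDS = [
    Record("payroll", "human resources", date(2021, 1, 1), date(2021, 12, 31)),
    Record("newsletter", "marketing", date(2021, 6, 1), date(2022, 5, 31)),
    Record("web analytics", "web services", date(2022, 1, 1), date(2022, 12, 31)),
]

first_quarter_2022 = select(RECORDS, date(2022, 1, 1), date(2022, 3, 31))
```

Performing the same selection by hand across several spreadsheets is the manual effort referred to above; a consistent record structure is what makes it automatable.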
Identifying these distinct use cases enables us to collect and harmonise the requirements regarding the expression of information and interpret them for use of the CSM-ROPA for data governance. We indicate that a solution must:
Indicate the source of information, e.g., department, processor;
Present collation of ROPA from discrete, possibly partial information artefacts, e.g., purpose from the department, technical measures from processor;
Record provenance, e.g., timestamps;
Record organisational details, e.g., point of contact, responsible entity;
Maintain distinct records, e.g., department, processors, or temporal periods;
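A minimal sketch of these requirements, assuming a simple dictionary-based representation, is given below. It collates partial artefacts into one entry while recording the source and timestamp of every field. The fragment layout, field names, and sample values are hypothetical.

```python
def collate(fragments):
    """Merge partial fragments into one entry, keeping the source of every field."""
    entry, provenance = {}, {}
    for fragment in fragments:
        for field, value in fragment["fields"].items():
            # Two sources disagreeing on a field needs a human follow-up.
            if field in entry and entry[field] != value:
                raise ValueError(f"conflicting values for {field!r}")
            entry[field] = value
            provenance.setdefault(field, []).append(
                {"source": fragment["source"], "recorded": fragment["recorded"]}
            )
    return entry, provenance


# Hypothetical partial artefacts: purpose from a department, measures from a processor.
FRAGMENTS = [
    {"source": "marketing department", "recorded": "2022-03-01",
     "fields": {"purpose": "newsletter delivery"}},
    {"source": "mail processor", "recorded": "2022-03-04",
     "fields": {"technical_measures": "encryption in transit"}},
]

entry, provenance = collate(FRAGMENTS)
```

Keeping provenance per field rather than per entry is one possible design choice; it allows a follow-up to be directed at the unit that supplied a particular value.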
We present additional requirements for enabling systems to use this information:
‘Packaging’: Sharing ROPA record(s) with internal or external stakeholders, e.g., department to DPO or processor to controller;
Querying: Retrieving partial information from ROPA, e.g., specific period or process;
‘Exporting’: Generating ROPA documentation as per requirements, e.g., GDPR Art.30;
‘Customisation’: Customising information storage, retrieval, and exporting based on a variance in requirements, e.g., additional information for specific DPA templates;
‘Assuring’: Providing data integrity and other quality guarantees for records;
We present further additional requirements that motivate operational details:
Machine-readability: for using automation and tooling for information management;
Interoperability: for consistent interpretation across stakeholders;
‘Openness’: for enabling adoption without lock-ins across technologies or providers;
‘Extendability’: to enable customisation of a solution for a use case or contextual requirements, e.g., additional terms, new information requirements;
‘Verifiable’: to support information management through validation of information in terms of correctness and completeness, e.g., all necessary fields are declared with valid information types, as well as to support compliance processes in ensuring validity and accountability, e.g., ensuring every processing has a purpose.
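The ‘Verifiable’ requirement can be illustrated with a plain Python check, used here as a stand-in for a constraint language such as SHACL. The set of required fields and the field names are assumptions made for the example and are not taken from the DPCat specification.

```python
# Hypothetical minimum set of fields; a real profile would define its own.
REQUIRED = ("controller", "purposes", "legal_basis")


def validate(record):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    # Completeness: every required field must be declared.
    for field in REQUIRED:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Correctness and accountability: purposes must be a non-empty list.
    purposes = record.get("purposes")
    if purposes is not None:
        if not isinstance(purposes, list):
            problems.append("purposes must be a list")
        elif not purposes:
            problems.append("processing has no purpose")
    return problems
```

For instance, `validate({"controller": "Acme", "purposes": []})` reports both a missing legal basis and a processing activity without a purpose.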
From this, we conclude that CSM-ROPA, while sufficient to represent information required to generate a ROPA, is insufficient to meet requirements for exchanging or using information amongst relevant stakeholders. In the next section, we present our use of this as a motivating factor for developing a ‘catalogue’ that can encapsulate the ROPA-related information and satisfy requirements for its maintenance and exchange with stakeholders.
5. DPCat: A Data Processing Catalogue
In the earlier sections, we presented a ‘Common Semantic Model for ROPA’ (CSM-ROPA) representing a consolidated set of information requirements based on an analysis of DPA ROPA templates that can be used as a machine-readable and interoperable vocabulary through the use of DPV (see Section 3). We then explored the sources and use of ROPA in terms of data flows between stakeholders and identified five use cases that provided further requirements regarding using CSM-ROPA in practical settings. In this section, we present the data processing catalogue (DPCat) specification, also published online (https://w3id.org/dpcat), that addresses identified requirements and facilitates governance of information from intra- and inter-organisational heterogeneous sources to enable representation of a ROPA in a machine-readable and interoperable manner.
DPCat extends the Data Catalog Vocabulary (DCAT), a W3C standard for facilitating interoperability between data catalogues [17], with concepts identified in CSM-ROPA using DPV to enable representation of a ROPA and associated information as ‘catalogues’ and ‘datasets’, respectively, that can be recorded and exchanged between stakeholders. DPCat also aims to maintain compatibility with the ‘DCAT Application profile for data portals in Europe’ (DCAT-AP) [18], which represents the EU’s efforts at standardising catalogue metadata in data portals. The compatibility with DCAT-AP is based on our vision for DPCat to be usable in all DCAT-AP based catalogue information management tools and data portals, to present a mechanism for sharing ROPA-related information using an EU-advocated standard, and to promote the possibility of reusing existing data portal infrastructures for compliance-related purposes, such as requirements for ROPA between controllers, processors, and DPAs.
Our prior work [2] regarding DPCat was based on the CSM-ROPA developed from six DPA ROPA templates that addressed two (U2: organisational units, U3: processors) of the five use cases. This work incorporates updated CSM-ROPA for 17 DPA ROPA templates, updates made to the DPV (from v0.2 to v0.5), and integration of DCAT and DCAT-AP requirements (e.g., cardinality) for compatibility.
5.1. DPCat Overview
DPCat, illustrated in Figure 6, distinguishes between a ROPA (as a document or an artefact) and ‘entries’ within a ROPA, where each entry represents a specific context such as a business process or data processing purpose. To represent these, we semantically extend the DCAT-AP concepts ‘catalog’ and ‘dataset’ as ‘ROPA’ and ‘ROPARecord’, respectively. We also extend ‘catalog’ as ‘ROPACatalog’ to represent a collection of ROPA catalogues (i.e., a catalogue of catalogues) for when an organisation has multiple ROPA documents, e.g., representing different temporal periods, activities, or organisational units (e.g., departments). Appendix C provides an overview of these concepts.
ROPARecord is a dcat:Dataset that catalogues information to be documented in a ROPA, is akin to a ‘single row’ in a ROPA spreadsheet and represents a single record of processing. It is used as an instance of dpv:PersonalDataHandling to associate concepts such as purposes of processing or legal bases using the relevant DPV concepts identified from the CSM-ROPA analysis. To ensure compatibility with DCAT and DCAT-AP requirements and recommendations, such as a publisher being a foaf:Agent, DPCat declares the relevant DPV concepts as a subclass of DCAT-AP specified concepts. In a ROPARecord instance, the concepts are coherent, i.e., all purposes apply to all personal data and are shared with all recipients and so on. To indicate separation, separate instances should be created.
A (dpcat:)ROPA represents a dcat:Catalog consisting of one or more ROPARecord datasets and reflects the conventional perspective of ‘ROPA as a single document’ with each entry being a ROPARecord within the catalogue. In both ROPA and ROPARecord, the DCAT properties provide an association with relevant information such as the publisher indicating who produced or provided that record, temporal annotations such as when the record was produced or the time period it represents, and annotations such as titles and descriptions. A ROPACatalog is the same as a ROPA in terms of being extended from dcat:Catalog and is used to bundle related ROPA catalogues together using the dcat:catalog relation.
For common ROPA-related communication between stakeholders, such as associating a ‘point of contact’ (e.g., department or manager) for that information, DPCat uses the DCAT relation dcat:contactPoint. Additionally, to adhere to the GDPR terminology, it uses the DPV properties to indicate controller (dpv:hasDataController), DPO (dpv:hasDataProtectionOfficer), and ‘responsible entity’ (dpv:hasResponsibleEntity). In this, the overlap between DCAT and DPV terms, such as the controller being the publisher or the DPO being the point of contact, may not always occur, for example when representing activities limited to a department where the point of contact is a member of that department who liaises with the DPO.
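The structure described in this section can be sketched as a handful of triples serialised to Turtle. This is an illustrative sketch, not an excerpt from the DPCat specification: the `dpcat:` namespace IRI is assumed, the `ex:` instances are hypothetical, and the serialiser is a simple string builder rather than an RDF library.

```python
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "dpv": "https://w3id.org/dpv#",
    "dpcat": "https://w3id.org/dpcat#",  # assumed namespace IRI
    "ex": "https://example.org/",        # hypothetical instance namespace
}

# A catalogue of catalogues -> one ROPA -> one ROPARecord.
TRIPLES = [
    ("ex:catalogue", "a", "dpcat:ROPACatalog"),
    ("ex:catalogue", "dcat:catalog", "ex:ropa-2022"),
    ("ex:ropa-2022", "a", "dpcat:ROPA"),
    ("ex:ropa-2022", "dct:publisher", "ex:acme"),
    ("ex:ropa-2022", "dcat:dataset", "ex:record-newsletter"),
    ("ex:record-newsletter", "a", "dpcat:ROPARecord"),
    ("ex:record-newsletter", "dpv:hasDataController", "ex:acme"),
    ("ex:record-newsletter", "dpv:hasPurpose", "dpv:Marketing"),
    ("ex:record-newsletter", "dcat:contactPoint", "ex:marketing-dept"),
]


def to_turtle(triples, prefixes):
    """Serialise prefixed-name triples as one Turtle statement per line."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in prefixes.items()]
    lines.append("")
    lines.extend(f"{s} {p} {o} ." for s, p, o in triples)
    return "\n".join(lines)


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


turtle = to_turtle(TRIPLES, PREFIXES)
```

Note that the record's point of contact (`ex:marketing-dept`) differs from the controller (`ex:acme`), which reflects the case above where the DCAT and DPV roles do not coincide.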
5.2. Using DPCat for ROPA Information Management
As we elaborated on in Section 4, the information and data governance requirements within the use cases show a need for each entity to organise, maintain, and exchange relevant information to carry out ROPA-related processes. DPCat, as a specification, supports automation through integration into tools used for information storage and retrieval (e.g., databases) and information management practices (e.g., documents and data catalogues). It can represent all ROPA-related information or only catalogue metadata with links to the actual information stored externally (e.g., spreadsheets) as datasets. In either case, DPCat provides a consistent information structure that enables technological solutions such as querying, validation, and exporting (see next sections) to assist the relevant stakeholders in their tasks.
DPCat facilitates data governance for ROPA by incorporating the organisation’s structural and managerial requirements. For U1, where a ROPA has to be maintained at the organisational level, the ROPA and ROPARecord data can be maintained centrally. For U2–U4, where there are heterogeneous sources of information, and it is desirable to record them in the same manner for provenance and follow-ups, the DCAT relations enable provenance of publishers and points of contact. In addition, ROPACatalog enables collections of related information issued by, e.g., a department or a processor.
The semantics of DCAT provides flexibility in determining how ROPA information could be organised and stored without determining a single method or structure. For example, in addition to the structuring based on organisational units and external entities, it may be desirable to keep records based on contextual information—such as specific business processes related to a product or service. This can be achieved by creating additional ROPACatalog entries representing the other collection and linking them to relevant ROPA entries. Through this, organisations can achieve multifaceted approaches in using the ROPA information without data duplication.
The use of technological solutions advocated by DPCat faces a hurdle in that the sources of information in ROPA-related workflows may not necessarily have the technical knowledge to produce consistent and valid metadata. For example, a DPO with the necessary legal knowledge does not necessarily have knowledge of, or concern for, the underlying technicalities of information storage and retrieval beyond what is necessary to perform their duties. In such cases, existing information storage and management mechanisms such as databases and spreadsheets can continue to be used, with DPCat integrated into them rather than acting as a replacement. For example, in a SQL database, the tables would use DPCat as a schema, with the input provided through existing means, e.g., input forms or importing spreadsheets using controlled structures. Alternatively, using an RDF-based solution such as a triple-store, the forms or spreadsheets could be converted to DPCat by utilising mappings.
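To illustrate the mapping-based conversion of spreadsheets, the sketch below reads a CSV export and emits one ROPARecord per row. The DPV properties used (dpv:hasPurpose, dpv:hasPersonalData, dpv:hasRecipient) are existing DPV terms, while the column headings, the `ex:record-` identifier scheme, and the use of ‘;’ to separate multiple values in a cell are assumptions about a hypothetical spreadsheet layout rather than any DPA template.

```python
import csv
import io
from typing import List, Tuple

# Assumed mapping from (hypothetical) spreadsheet columns to DPV properties.
COLUMN_TO_PROPERTY = {
    "Purpose": "dpv:hasPurpose",
    "Data categories": "dpv:hasPersonalData",
    "Recipients": "dpv:hasRecipient",
}


def rows_to_triples(csv_text: str, base: str = "ex:record-") -> List[Tuple[str, str, str]]:
    """Convert each spreadsheet row into triples describing one ROPARecord."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"{base}{index}"
        triples.append((subject, "rdf:type", "dpcat:ROPARecord"))
        for column, prop in COLUMN_TO_PROPERTY.items():
            cell = row.get(column) or ""
            # A cell may hold several values separated by ';' (assumption).
            for value in cell.split(";"):
                value = value.strip()
                if value:
                    triples.append((subject, prop, value))
    return triples
```

Because the mapping is held in a single dictionary, the person filling in the spreadsheet does not need to know the vocabulary; only the maintainer of the mapping does.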
We envision DPCat to be integrated into a typical workflow (i.e., U2–U4) for recording ROPA as follows. The source (e.g., department representative) generates a ROPARecord containing relevant information with provenance as the department. They use mechanisms available to them, e.g., a series of forms or a script that converts spreadsheets. This information is collated into a ROPA collection representing contextual grouping as determined by the organisational structure (e.g., maintained per department). For sources external to the organisation (e.g., a processor), the provided information is similarly stored in dedicated ROPA and ROPARecord entries and optionally integrated directly into relevant datasets (e.g., controller listing processor’s technical measures in its ROPA). This can use technological solutions such as a database or a portal. To facilitate the structuring of ROPA records in an organised manner, ROPACatalog entries are used to collect and group ROPA entries according to some criteria, e.g., temporal period, legal counsel, or responsible managers.
5.3. Using DPCat for Querying and Validation
DPCat supports a wide assortment of queries and validation approaches that utilise its metadata-based structure to perform information retrieval and verification tasks. Through these, DPCat can be a vital tool in technological solutions used for compliance-related processes. This section presents a few examples of queries and validation tasks that motivate the use of DPCat in an organisation’s ROPA-related processes.
A common query associated with a ROPA is retrieving the GDPR Art.30 information for a specific context such as data transfers or covering some time period. DPCat supports such queries through DCAT and DPV metadata, e.g., indicating transfer locations as dpv:DataTransfer and dpv:hasLocation and DCAT dcat:temporalPeriod to perform time-based filtering. An example of this expressed as a SPARQL query is provided in Listing 1.
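Setting SPARQL syntax aside, the selection logic of such a query (records that involve a data transfer and whose time period overlaps a requested window) can be sketched in plain Python. The record fields `transfer_locations`, `start`, and `end` are hypothetical stand-ins for the transfer location and temporal metadata, and the sample records are invented; this is not the query of Listing 1.

```python
from datetime import date
from typing import Dict, List


def periods_overlap(start_a: date, end_a: date, start_b: date, end_b: date) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return start_a <= end_b and start_b <= end_a


def records_with_transfers(records: List[Dict], start: date, end: date) -> List[str]:
    """Return ids of records that list a transfer and fall in the time window."""
    matches = []
    for record in records:
        if not record["transfer_locations"]:
            continue  # no data transfer recorded
        if periods_overlap(record["start"], record["end"], start, end):
            matches.append(record["id"])
    return matches


# Invented sample records
records = [
    {"id": "ex:rec-1", "transfer_locations": ["US"],
     "start": date(2021, 1, 1), "end": date(2021, 12, 31)},
    {"id": "ex:rec-2", "transfer_locations": [],
     "start": date(2021, 1, 1), "end": date(2021, 12, 31)},
    {"id": "ex:rec-3", "transfer_locations": ["IN"],
     "start": date(2019, 1, 1), "end": date(2019, 6, 30)},
]
selected = records_with_transfers(records, date(2021, 6, 1), date(2022, 1, 1))
```

In a triple store the same two conditions would become a graph pattern on the transfer and a FILTER on the period bounds.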
Similar to querying, DPCat also supports verification and validation of information, typically ensuring or assessing compliance with the GDPR. Validation refers to whether the required information is available, is in the correct form and format, and is sufficient according to some requirements. Verification refers to the evaluation of the information based on some norms such as specific obligations of the GDPR.
Constraints based on mandatory fields as prescribed by DCAT and DCAT-AP specifications also apply to DPCat. Therefore, data represented using DPCat can utilise existing validation and verification mechanisms for conformance to these standards. In addition, DPCat promotes the expression of the GDPR-specific constraints that are typically expressed as guidelines by DPAs and have been addressed by academic research and commercial offerings. However, DPCat has an advantage over these existing solutions in that it also promotes interoperability between such verification mechanisms by virtue of being an interoperable specification for information to be verified.
|Listing 1: SPARQL query retrieving ROPA records involving data transfers in a time period.|
As an example of information validation typically involved for the GDPR, Listing 2 presents a SHACL constraint that ensures every ROPARecord instance has an associated purpose. In addition to ensuring that information is present and in the correct form, SHACL constraints are also useful for GDPR compliance, such as for ensuring that processing has an appropriate legal basis as follows: (i) It must have a corresponding legal basis from the GDPR Art.6 (Note: Though the GDPR Art.30 does not require a legal basis in a ROPA, DPA guidelines strongly recommend it.); (ii) If processing involves special categories of personal data, it must additionally have a corresponding legal basis from the GDPR Art.9; (iii) If processing involves data transfers to non-EU locations, it must additionally have a corresponding legal basis from the GDPR Art.45, 46, or 49. We plan to provide such SHACL shapes for both information validation and the GDPR-based requirements verification in the future.
|Listing 2: SHACL constraint to ensure every ROPARecord has an associated purpose.|
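The purpose check and the legal basis rules (i)–(iii) above can be sketched procedurally as follows. This is an illustrative Python function, not a SHACL shape and not the planned DPCat shapes. The record layout, the identifier style for legal bases (e.g., ‘A6-1-b’ for Art.6(1)(b)), and the caller-supplied set of EU locations are all assumptions.

```python
from typing import Dict, List, Set


def article_of(legal_basis: str) -> str:
    """Return the article part of an identifier, e.g. 'A6-1-b' -> 'A6'."""
    return legal_basis.split("-")[0]


def check_record(record: Dict, eu_locations: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # Validation: every record must state at least one purpose.
    if not record.get("purposes"):
        problems.append("missing purpose")
    articles = {article_of(basis) for basis in record.get("legal_bases", [])}
    # (i) A legal basis from Art.6 is expected for every record.
    if "A6" not in articles:
        problems.append("missing Art.6 legal basis")
    # (ii) Special categories additionally need an Art.9 legal basis.
    if record.get("special_categories") and "A9" not in articles:
        problems.append("missing Art.9 legal basis for special categories")
    # (iii) Non-EU transfers additionally need Art.45, 46, or 49.
    non_eu = [loc for loc in record.get("transfer_locations", [])
              if loc not in eu_locations]
    if non_eu and not articles & {"A45", "A46", "A49"}:
        problems.append("missing Art.45/46/49 legal basis for non-EU transfer")
    return problems
```

The first check corresponds to validation (is the information present), while checks (i)–(iii) correspond to verification against GDPR-derived norms.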
5.4. DPCat for Interoperable Information Exchange
DPCat provides a machine-readable and interoperable representation of information that an organisation can use to automate its ROPA management and associated tasks. In cases where information has heterogeneous sources, especially when involving external stakeholders such as processors and other controllers, DPCat can be utilised as a ‘standardised information representation’ for convenience in information flows. In this section, we explore the potential for such developments.
When a data controller hires data processors, its obligations include maintaining information about the processing activities outsourced to the processor and some specifics regarding how they are carried out. For example, controllers may ask processors to provide the technical and organisational measures they implement to ensure sufficient safety and security in processing. Similarly, controllers may require information on data storage locations for cross-border data transfers. In cases where a processor contracts another (sub-)processor to carry out the processing, it has to maintain similar records of the sub-processor’s operations and also provide them to the controller as requested. In all these cases, information has to be periodical, i.e., maintained independently by the entity itself and communicated to other entities as contextually necessary, and the other entities must also maintain this information independently. Such information flows and requirements also apply to joint controller relationships between the involved controllers.
If two entities communicating information for ROPA-related tasks use DPCat for their internal information representation, they can directly exchange ROPA information using DPCat-specified records. This is an ideal scenario. However, even if one or both entities do not use DPCat internally, DPCat can be utilised as a common specification for exchanging ROPA information between entities. In this case, the sender entity converts whatever internal representation it has into DPCat and sends it to the receiver entity to ensure that it can understand and interpret the information. The receiver converts DPCat-based information to whatever internal representation it utilises. Thus, DPCat offers advantages for ROPA information exchanges even if organisations do not wish to adopt it completely for internal processes. DPCat is also useful for DPAs and auditors in the same manner, where they can utilise it as an interoperable format for requesting information from organisations. The consistency and machine-readability of DPCat provide investigators with the potential for using automation and tools to reduce workload and repetitions.
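The ‘common specification’ scenario, where neither party uses DPCat internally, can be sketched as two small mappings around a shared message. Here a JSON object keyed by property names stands in for a real RDF serialisation. The internal field names of the sender and the column names of the receiver are invented, while dct:title, dpv:hasPurpose, and dpv:hasDataProcessor are existing vocabulary terms.

```python
import json

# Hypothetical internal field names of the sender, mapped to shared terms.
SENDER_TO_DPCAT = {
    "activity_name": "dct:title",
    "why": "dpv:hasPurpose",
    "processor": "dpv:hasDataProcessor",
}

# Hypothetical column names of the receiver, keyed by the same shared terms.
DPCAT_TO_RECEIVER = {
    "dct:title": "Title",
    "dpv:hasPurpose": "Purpose",
    "dpv:hasDataProcessor": "Processor",
}


def export_record(internal: dict) -> str:
    """Sender side: translate internal fields to shared terms and serialise.

    Fields without a mapping are not exchanged.
    """
    shared = {SENDER_TO_DPCAT[key]: value
              for key, value in internal.items() if key in SENDER_TO_DPCAT}
    return json.dumps(shared, sort_keys=True)


def import_record(message: str) -> dict:
    """Receiver side: parse the message and translate to local columns."""
    shared = json.loads(message)
    return {DPCAT_TO_RECEIVER[key]: value
            for key, value in shared.items() if key in DPCAT_TO_RECEIVER}
```

Each party maintains only one mapping to the shared terms, so adding a further exchange partner does not require a new pairwise conversion.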
6. Demonstration of DPCat in a Real-World Use Case
In this section, we demonstrate the application of DPCat by representing real-world ROPA documents published by the European Data Protection Supervisor (EDPS) [41], validating them using SHACL, retrieving relevant information using SPARQL queries, and exporting it as RDF graphs as well as spreadsheets adhering to DPA templates. We provide evidence for the practicality and feasibility of DPCat and its benefits in ROPA information management processes. The data, code, and outputs are available online (https://w3id.org/dpcat/demo/edps-ropa).
6.1. Representing Information Using DPCat
EDPS is the DPA responsible for overseeing compliance by EU institutions, which comprise many employees across the various EU bodies and their associated personal data processing activities. The EDPS has published detailed ROPA documents based on the GDPR Art.30 requirements that provide transparency and accountability. As of March 2022, the EDPS has made available 58 ROPA document collections, each consisting of one or more PDF documents providing information in English regarding the processing operations. Collections are structured based on ‘topics’, which can be a department (e.g., administrative and human resources or IT), processes (e.g., communication or public events), or specific measures (e.g., access to documents or physical security).
We analysed EDPS ROPA documents and selected four (ids: 01, 05, 13, 55) that covered the U1–U4 use cases for departments, processors, joint controllers, and data transfers. We did not include the other documents despite their relevance due to the large labour and analysis efforts required and because the selected documents sufficed in demonstrating DPCat’s application. The documents were PDFs intended for human comprehension and lacked consistent semantics, e.g., the purpose field also contained the legal basis.
We interpreted these documents and their structure as follows: each document (i.e., PDF) represented a single ROPA instance, and the information contained within it was structured using ROPARecord instances. We applied the criterion that each ROPARecord would correspond to a single ‘contextual entry’, determined qualitatively from the complexity of information and separation of concerns. For example, document X specified two processors, which we interpreted as separate ROPARecord instances for each processor to indicate the separation of concern in the controller’s communication and data governance. The entire collection of documents and RDF graphs was then expressed as part of a single ROPACatalog instance reflecting the published set of records on the EDPS website.
The manually created RDF graphs were enhanced using the Apache Jena RDFS reasoner [42] to create a ‘complete graph’ for simplifying querying and validation. The limited RDFS reasoning was sufficient here to obtain the expansion of subclasses and subproperties within the graph rather than generating inferences using an OWL reasoner. For storing the information and offering a querying interface, we utilised the GraphDB Free Edition triple store [43], as it is a freely available triple-store compliant with relevant standards (e.g., SPARQL) and has several features for convenience, e.g., friendly interface, integrated reasoners, SHACL validation. A concise example of the ROPA and ROPARecord instances used to represent the EDPS ROPA documents and information is presented in Listing 3. An overview of data workflow is provided in Figure 7.
|Listing 3: Summarised overview of ROPA and ROPARecord instances based on EDPS ROPA documents.|
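The effect of the subclass and subproperty expansion mentioned above can be illustrated with a small fixed-point computation in Python. This is a simplified stand-in for the two relevant RDFS entailment rules (type propagation along rdfs:subClassOf and statement propagation along rdfs:subPropertyOf). It is not how Apache Jena implements its reasoner, and the sample triples, including the assumed subclass axiom for dpcat:ROPA, are for illustration only.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def rdfs_expand(triples: Iterable[Triple]) -> Set[Triple]:
    """Add inferred rdf:type and superproperty triples until nothing changes."""
    graph = set(triples)
    while True:
        sub_class = {(s, o) for s, p, o in graph if p == "rdfs:subClassOf"}
        sub_prop = {(s, o) for s, p, o in graph if p == "rdfs:subPropertyOf"}
        inferred = set()
        for s, p, o in graph:
            if p == "rdf:type":
                # An instance of a class is also an instance of its superclasses.
                inferred |= {(s, "rdf:type", sup)
                             for sub, sup in sub_class if sub == o}
            # A statement with a property also holds for its superproperties.
            inferred |= {(s, sup, o) for sub, sup in sub_prop if sub == p}
        if inferred <= graph:
            return graph  # fixed point reached
        graph |= inferred


schema = [
    ("dpcat:ROPA", "rdfs:subClassOf", "dcat:Catalog"),   # assumed axiom
    ("dcat:Catalog", "rdfs:subClassOf", "dcat:Dataset"),
]
data = [("ex:ropa-01", "rdf:type", "dpcat:ROPA")]
expanded = rdfs_expand(schema + data)
```

After expansion, a query for all dcat:Catalog instances also returns `ex:ropa-01` without the query having to enumerate subclasses, which is the simplification the ‘complete graph’ provides.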
6.2. SHACL Shapes for DPV
For verification and validation of the generated RDF graphs, we first utilised the SHACL constraints provided with DCAT-AP specifications to ensure data correctness according to DCAT and DCAT-AP defined requirements, e.g., publishers being of type foaf:Agent. We then developed and utilised SHACL shapes representing the cardinality and type constraints to ensure correctness for DPCat’s requirements. For executing the constraints, we utilised the open-source and freely available TopBraid SHACL tool [44].
In performing validation of the information, the shape constraints are based on DPCat, which utilises DPV and DCAT concepts to represent relevant information. However, neither DPV nor DPCat indicates the ‘shape’ in which information must be represented. Consequently, there may be more than one ‘shape’ for a given scenario, often at arbitrary levels of complexity, which prevents a single set of common SHACL shapes from being developed and provided alongside the DPCat specification. For example, a SHACL constraint for ensuring that a data transfer is specified along with its appropriate location can be modelled in terms of dpv:hasLocation of dpv:DataTransfer. However, the DataTransfer instances could be used at any arbitrary node within the graph, making it difficult to define follow-up constraints such as the recipient of that transfer and its location.
A simple solution would be to associate all the relevant fields with a ROPA or ROPARecord instance. A challenge in this is that the DCAT-based structure may not be capable of incorporating all fields, or that doing so would make DPCat too complex. An alternate approach would be identifying use cases for each concept’s use and defining specific SHACL shapes for how that information should be expressed using DPV. Given that this requires significant analysis and effort, for the purposes of this article, we limited our defined SHACL shapes to those representing information from the EDPS documents. However, we argue for further research and development of such shapes so that they can be used to ensure data is consistently represented across use cases and implementations.
6.3. Querying ROPA Information
To simulate typical tasks performed by a DPO or a DPA, we utilised SPARQL queries for two use cases: (i) retrieval of information required by the GDPR Art.30; and (ii) overview of practices within an organisation in terms of various organisational units, purposes, legal bases, recipients, data transfers, etc. Here, query (i) relates to common compliance documentation procedures, and query (ii) shows the potential for DPCat to help create internal reports or dashboards based on ROPA information, e.g., for a DPO.
The first query, shown in Listing 4 with an output snippet in Table 1, retrieves ROPA information as per the GDPR’s Art.30.
The second query, shown in Listing 5 with an output snippet in Table 2, provides an overview of the organisation’s processing activities and relationships with external entities by retrieving relevant information from ROPARecord instances.
|Listing 4: Obtaining a GDPR Art.30 based overview using SPARQL.|
|Listing 5: SPARQL query for overview based on GDPR Art.30 using DPCat.|
6.4. Exporting a ROPA
To demonstrate how DPCat can facilitate information exchange and data governance within and between stakeholders, we provide two examples of information being exported. The first example exports information as DPCat-defined catalogues by using SPARQL CONSTRUCT queries to retrieve related information as an RDF graph. The SPARQL query and the resulting graph can be viewed online. Such exports help store information in the form of backups, copies, or graphs. They are also helpful in exchanging ROPA information between stakeholders that all support the use of DPCat, such as in data governance between data controllers and data processors.
The second example simulates automation of a DPO manually managing information in a spreadsheet. For this, we utilised a Python script that executed SPARQL queries and exported results into an MS-Excel (.xlsx) document based on DPA ROPA templates. While the output of a SPARQL query itself could also be exported as a CSV document, the use of Python in this case was to replicate the structure and contents of the DPA template and to operate over the more complex XLSX format that supports tabs within spreadsheets.
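A recurring step in such an export is collapsing query results, which contain one row per combination of values, into one spreadsheet row per record. The sketch below shows this with the standard-library csv module, writing CSV rather than XLSX to stay dependency-free. The binding names, the template column headings, and the ‘; ’ separator are assumptions and do not reproduce any specific DPA template or the script used in the demonstration.

```python
import csv
import io
from typing import Dict, List

# Hypothetical template columns; a real DPA template defines its own headings.
TEMPLATE_COLUMNS = ["Record", "Purposes", "Recipients"]


def export_template(bindings: List[Dict[str, str]]) -> str:
    """Group result rows by record and render one CSV row per record."""
    grouped: Dict[str, Dict[str, List[str]]] = {}
    for row in bindings:
        entry = grouped.setdefault(row["record"],
                                   {"purposes": [], "recipients": []})
        for key, target in (("purpose", "purposes"), ("recipient", "recipients")):
            value = row.get(key)
            if value and value not in entry[target]:
                entry[target].append(value)  # keep first-seen order, no repeats
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(TEMPLATE_COLUMNS)
    for record, entry in grouped.items():
        writer.writerow([record,
                         "; ".join(entry["purposes"]),
                         "; ".join(entry["recipients"])])
    return buffer.getvalue()
```

Producing the multi-tab XLSX layout would follow the same grouping step, with a spreadsheet library writing each group to its own sheet.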
6.5. Analysis of Implementation and Lessons Learned
The application of DPCat to real-world ROPAs exposed inherent difficulties in constructing semantic representations because the inputs were unstructured or only loosely structured, as opposed to the strict structure that machine-based tools require. We considered addressing this issue with a proposed solution in which the organisation creates a separate registry of controlled vocabularies for the use case, first registering its concepts, such as the specific purposes used or data categories processed, and then ensuring that the ROPA documents use only these concepts. However, we found this solution to deviate significantly from existing organisational processes, which lack such structured data collection methods. We consider this an open problem with the hope of better tooling being able to resolve it.
In representing the ROPA information using DPV, we faced hurdles in that the DPV as a vocabulary can support a wide range of data modelling styles. This presents barriers to the use of DPCat as a common information representation mechanism, as two different organisations can model their data differently. While the common conceptual structure of DPV can assist in aligning the two models, a consistent information structure is preferable for the development of tools. For this, we propose the creation of ‘DPV Shapes’ that provide suggested data modelling practices for modular use cases. Such shapes, expressed using SHACL, will foster commonality in how the DPV is used and will act as a common model to which other modelling approaches can be reduced or aligned. In this, it is important to state that one of the strengths of the DPV is its lack of rigid adoption requirements, which provides an adopter the flexibility to use it within their use cases. The provision of shapes enables continued flexibility of the DPV as a vocabulary while providing guidelines for how it can be consistently used or made interoperable across different applications.
Finally, we faced challenges in determining a suitable mechanism for validation of DPCat-specified information. While we utilised SHACL shapes to demonstrate the potential for such validations based on information and GDPR compliance requirements, this area merits further exploration. In particular, SHACL constraints can be used for two categories of evaluation. The first checks whether the necessary information is present and has expected values, similar to DCAT-AP SHACL shapes. The second is based on requirements drawn from the GDPR, such as ensuring the correct legal basis is used. The first is an inherent evaluation of conformance to a specification, as seen from the cardinality constraints in DCAT and DCAT-AP, while the second directly addresses GDPR compliance verification. This follows earlier research explorations demonstrating the use of SHACL constraints in ensuring information correctness and conformance for GDPR compliance [45].
The heterogeneity of data sources representing the organisation’s data processing activities presents significant challenges when completing a ROPA. Our research sought to establish the extent to which the DPCat specification for an interoperable and machine-readable data processing catalogue based on DCAT-AP and DPV could overcome the heterogeneity of sources to facilitate the preparation of a ROPA.
We have shown that the DPCat specification enables more automation for realistic distributed ROPA maintenance use cases, leading to stronger regulatory compliance. DPAs, who already struggle with lack of funding and resources [52], can benefit from the use of DPCat that could ease the investigative burden required for effective enforcement. In pursuit of our first research objective (RO1) to identify the information necessary to represent ROPAs, we reviewed 17 ROPA templates across 31 DPAs. Our analysis identified 47 unique GDPR concepts, with templates requiring a minimum of 18 concepts up to a maximum of 32 concepts. Over the past 3 years, the DPV has been enhanced to express these concepts, and currently, 44 of the 47 concepts can be expressed exactly, and 1 can be partially expressed. The 2 remaining concepts are with the DPVCG for consideration.
For the second research objective (RO2), we presented the Data Processing Catalogue (DPCat) specification that facilitates governance and maintenance of data from intra- and inter-organisational heterogeneous sources to enable representation of information related to ROPA. Its application to EDPS ROPA demonstrated how DPCat could be utilised as a machine-readable solution to overcome conventional limitations that arise when data is maintained in documents or in proprietary systems that lack machine-readability and interoperability.
The EDPS application also showed how DPCat enabled a data controller/processor to describe processing activities using a standardised model and vocabulary that facilitated aggregation, querying, validation, and exporting from heterogeneous sources (RO3). We used SHACL to ensure correctness, and SPARQL to query and export information for the GDPR articles and DPA templates. Through this, we established the data quality governance process for ROPA by harmonising inputs from heterogeneous sources and producing dynamic documentation that accommodates differences in regulatory approaches across DPAs.
In addition to formulating a research problem, we also explored the potential impact in real-world situations through the use case, application, discussions, and identification of concrete future directions to ensure practical benefits from implementing our work. Furthermore, as DPCat is an interoperable machine-readable record of the personal data processing activities of organisations, it offers avenues of future research, such as the generation of privacy notices, DPIAs, automatic supplier due diligence checking, and international transfer compliance assessments from a common information model. The approach taken by DPCat, though based on the GDPR, also has benefits for documentation requirements and compliance obligations from other laws, such as the California Consumer Privacy Act (CCPA) [24].