Do the European Data Portal Datasets in the Categories Government and Public Sector, Transport and Education, Culture and Sport Meet the Data on the Web Best Practices?

The European Data Portal is one of the worldwide initiatives that aggregate open data and make them available. This is a case study with a qualitative approach that aims to determine to what extent the datasets from the Government and Public Sector, Transport, and Education, Culture and Sport categories published on the portal meet the Data on the Web Best Practices (W3C). With the datasets sorted by last modified and filtered by the ratings Excellent and Good+, we analyzed 50 different datasets from each category. The analysis revealed that the Government and Transport categories have the best-rated datasets and Education the least. The most observed BPs were: BP1, BP2, BP4, BP5, BP10, BP11, BP12, BP13C, BP16, BP17, BP19, BP29, and BP34, while the least observed were: BP3, BP7H, BP7C, BP13H, BP14, BP15, BP21, BP32, and BP35. These results fill a gap in the literature on the quality of the data made available by this portal and provide insights for European data managers on which best practices are most observed and which ones need more attention. Dataset: https://doi.org/10.34622/datarepositorium/N2P0NK. Dataset License: https://creativecommons.org/publicdomain/zero/1.0/.


Summary
The definition of data can vary remarkably between researchers and, even more so, in different knowledge domains. This diversity around the concept of data is because data are generated for various purposes, by multiple communities and processes. Data can be understood as a "( . . . ) unit of content necessarily related to a certain context and composed by the triad entity, attribute and value, in such a way that, even if the details about the context of the content are not explicit, it should be implicitly available to the user, thus allowing its full interpretation" [1] (p. 2005). A dataset is a "collection of data, published or curated by a single agent, and available for access or download in one or more serializations or formats" [2], usually presented as a table [1]. Regardless of the kinds of data, they should be related to metadata, adding value to data mainly in terms of description, management, legal requirements, technical functionality, use, and preservation [3,4]. Metadata are data about data or structured data about data which, in the context of computer science and information science, are attributes that represent the data, such as authorship, classification, description, policy, distribution terms, and copyright [1,5]. Good quality metadata help people discover and reuse datasets [6].
Currently, public sector aggregators collect large amounts of data that will later be published and made available in a single portal as open data. Open data means all "( . . . ) information collected, produced or paid for by public bodies and can be freely used, modified and shared by anyone for any purpose" [7].
Open data are seen as an "( . . . ) essential resource for economic growth, job creation, and societal progress" [8]. Open data bring numerous benefits, such as providing insight that aids in decision making, whether in a visualized form or by reference, and they help to realize the importance of reusing data. The sector that benefits the most from open data is the public sector, indicating that the public sector is the first reuser of its data [8].
"Data portals are web-based interfaces designed to make it easier to find reusable information. ( . . . ), they contain metadata records of datasets published for reuse, mostly relating to information in the form of raw, numerical data" ( . . . ) [9].
As far as open data portals are concerned, they increasingly enable finding datasets, making possible the interaction between data publishers and reusers through forums and feedback from data and classification systems [10]. Simperl and Walker [6] (p. 16) present ten ways for open data portals to evolve to achieve sustainability and added value: "be discoverable, be measurable, promote use, organize for use, be accessible, promote standards, publish metadata, provide linkage data, co-locate documentation, and provide co-location tools". Examples of data portals are the Portuguese Open Data Portal, dados.gov.pt, and the European Data Portal (EDP), data.europa.eu.
One of the global initiatives that aggregate and give access to open data is the European Data Portal (EDP). The first version of the EDP was made available in 2016. The EDP harvests metadata available on public data and geospatial portals across European countries, which include EU member countries, EFTA countries, and countries involved in EU neighborhood policy. The datasets include, for example, land records, state maps, and the location of post offices. Access to the portal is possible through a machine-readable API and human-readable websites [11,12]. In addition to this, the portal also provides thirteen data categories defined in accordance with Eurovoc domains. This thesaurus enables users to conduct multilingual searches by data categories and subject [11,13]. The EDP also aims to promote the accessibility and value of open data.
As with other initiatives, there is great concern on the part of EDP regarding data quality. In this sense, the EDP evaluates the quality of the datasets harvested concerning the FAIR principles. The FAIR principles, an acronym adopted for findability, accessibility, interoperability, and reusability, were introduced in 2014, to "guide data producers and publishers ( . . . ) helping to maximize the added-value gained by contemporary, formal scholarly digital publishing" [14] (p. 1). The authors point out that the FAIR principles apply "( . . . ) not only to 'data' in the conventional sense, but also to the algorithms, tools, and workflows that led to that data" and that they "put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals" [14] (p. 1). The adoption of the FAIR principles enhances interoperability between different data environments [14]. Although the portal adopts a comprehensive evaluation based on the FAIR principles, some aspects are not contemplated.
On 31 January 2017, the W3C released a recommendation with 35 best practices (BPs) for publishing data on the web, named Data on the Web Best Practices (DWBP) [13]. This set of BPs addresses several challenges encountered in publishing and reusing data. The DWBP specification assigns each BP one or more benefits, out of the following eight: comprehension, processability, discoverability, reuse, trust, linkability, access, and interoperability. The following briefly presents the 35 BPs [13], as well as the benefits they can provide:
• Best Practice 1: Provide metadata, for both human users and computer applications. Benefits: reuse, comprehension, discoverability, and processability.
• Best Practice 2: Provide descriptive metadata, that is, the general characteristics of datasets and their distributions, facilitating their discovery on the web, as well as the nature of the datasets. Benefits: reuse, comprehension, and discoverability.
• Best Practice 3: Provide structural metadata, that is, the schema and internal structure of a distribution (e.g., description of a CSV file, an API, or an RSS feed). Benefits: reuse, comprehension, and processability.
• Best Practice 4: Provide data license information, using a link or copy of the data license agreement. Benefits: reuse and trust.
• Best Practice 5: Provide data provenance information, that is, the origins of the data and all the changes they have already undergone. Benefits: reuse, comprehension, and trust.
• Best Practice 6: Provide data quality information: "provide information about data quality and fitness for particular purposes".
• Best Practice 20: Provide real-time access, for immediate access that encourages the development of real-time applications. "Applications will be able to access time-critical data in real-time or near real-time, where real-time means a range from milliseconds to a few seconds after the data creation". Benefits: reuse and access.
• Best Practice 21: Provide data that is up to date, and make the frequency of updating explicit. Benefits: reuse and access.
• Best Practice 22: Provide an explanation for data that is not available: "provide an explanation of how the data can be accessed and who can access it", to provide full context for potential data consumers. Benefits: reuse and trust.
• Best Practice 23: Make data available through an API, to offer the greatest flexibility and processability for the data consumers. Benefits: reuse, processability, interoperability, and access.
• Best Practice 24: Use web standards as the foundation of APIs, so that they are more usable and leverage the strengths of the web (e.g., REST). Benefits: reuse, processability, access, discoverability, and linkability.
• Best Practice 25: Provide complete documentation for your API, in a way that developers perceive its quality and usefulness. "Update documentation as you add features or make changes". Benefits: reuse and trust.
In this study, we try to determine to what extent the datasets from the Government and Public Sector, Transport, and Education, Culture and Sport categories published on the portal meet the Data on the Web Best Practices (W3C).

Data Description
This section presents the data resulting from the study, whose methodology is described in Section 3 below.
A total of 150 datasets were analyzed in light of 25 BPs. Because four of these BPs were assessed separately for human-readable and computer-readable information, each dataset was checked against 29 criteria, giving a total of 4350 analyses (150 × 29).
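The total of 4350 can be reproduced from counts stated elsewhere in this paper: 35 BPs in the DWBP, 10 left out of the analysis, 4 divided into human- and computer-readable variants, and 50 datasets in each of 3 categories. The short Python check below only illustrates that arithmetic; the variable names are ours.

```python
# Counts taken from the text of this paper; the variable names are illustrative.
TOTAL_BPS = 35             # best practices in the W3C DWBP recommendation
EXCLUDED_BPS = 10          # BP18, BP20, BP23-BP28, BP31, BP33
SPLIT_BPS = 4              # BP7, BP8, BP13, BP22 each have an H and a C row
DATASETS_PER_CATEGORY = 50
CATEGORIES = 3

analyzed_bps = TOTAL_BPS - EXCLUDED_BPS          # BPs actually analyzed
criteria_rows = analyzed_bps + SPLIT_BPS         # one extra row per split BP
datasets = DATASETS_PER_CATEGORY * CATEGORIES    # datasets in the final study
analyses = datasets * criteria_rows              # one code per dataset and row

print(analyzed_bps, criteria_rows, datasets, analyses)
```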
The number of datasets observing or not observing each BP in the Government and Public Sector category is presented in Table 1 and Figure 1. The results of the Transport category are displayed in Table 2 and Figure 2. The results of the Education, Culture and Sport category are displayed in Table 3 and Figure 3.

Methods
Three categories were randomly selected from the European Data Portal: Government and Public Sector, Transport, and Education, Culture and Sport. The research was conducted in two stages: an exploratory study and a final study. We prepared a spreadsheet with the BPs in rows and the datasets' identifiers in columns for both studies. The final analysis was performed and recorded by putting in each cell one of the following codes: "Yes" (Y), "No" (N). In addition, rows were added in final study sheets to 4 BPs to indicate whether those BPs correspond to machine- or human-readable data. Some BPs were not analyzed as they were out of scope for this study. In these cases, the respective cells have the value "Does Not Apply" (NA) (we provide the datasets as Supplementary Materials, Tables S1-S4, at DataRepositoriUM, https://doi.org/10.34622/datarepositorium/N2P0NK (accessed on 6 August 2021)).
In the final study, in addition to the best practices field, an observation row was inserted for each BP to include some notes as needed (we provide the datasets as Supplementary Materials, Tables S2-S4, at DataRepositoriUM, https://doi.org/10.34622/datarepositorium/N2P0NK (accessed on 6 August 2021)).
The procedures for each study are described below.

Exploratory Study
To analyze the data quality of the European Data Portal in the categories of Government and Public Sector, Transport, and Education, Culture and Sport, an exploratory study was conducted, where the first 20 datasets from each category were analyzed.
Since it was not possible to analyze all datasets manually in a timely manner, a sample had to be defined. Initially, a systematic sampling of the datasets classified as Excellent and Good+ was carried out. For this purpose, Algorithm 1 was used. The exploratory study only focused on the first 20 datasets of each category, as this study aimed to verify the suitability of the algorithm for constituting the sample, obtain the first results, and identify potential implementation problems. The first 20 datasets served as a test for the algorithm, and if it proved to be effective and did not limit the analysis, it would be adopted in the final study and the sample would be extended to 50 datasets of each category (we provide the dataset as Supplementary Materials, Table S1, at DataRepositoriUM, https://doi.org/10.34622/datarepositorium/N2P0NK (accessed on 6 August 2021)).
With the preliminary study, we verified that the sampling algorithm was not optimal, since many datasets belonging to the same country were selected, many of them similar, only differing in the modification date and the data themselves. To overcome this problem and increase the variety of datasets, some changes were made to the sampling procedure as shown in the next section.

Final Study
In the final study, Algorithm 2 was used. For each category, the selection of 50 datasets was carried out based on this algorithm, in which the datasets were filtered by the classification of Excellent or Good+ and the datasets of the exploratory study were removed. Thus, no datasets classified as Excellent were left in any category. The list was scrolled through dataset by dataset to constitute the sample with 50 datasets, according to the requirement of not including similar datasets.
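The selection steps described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 2: the Record fields, the most-recent-first ordering, and in particular the similarity_key criterion (same country and same title once digits are ignored) are hypothetical, since the text only states that similar datasets were not to be included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical catalog entry; the field names are assumptions."""
    identifier: str
    title: str
    country: str
    rating: str
    modified: str  # ISO 8601 date, so string order equals date order


def similarity_key(record):
    """Assumed similarity rule: same country and same title ignoring digits."""
    stem = "".join(ch for ch in record.title.lower() if not ch.isdigit())
    return record.country, " ".join(stem.split())


def select_sample(records, already_used, size=50):
    """Filter by rating, drop exploratory datasets, skip similar entries."""
    accepted_ratings = {"Excellent", "Good+"}
    candidates = sorted(
        (r for r in records
         if r.rating in accepted_ratings and r.identifier not in already_used),
        key=lambda r: r.modified,
        reverse=True,  # assumption: most recently modified first
    )
    sample, seen = [], set()
    for record in candidates:  # walk the list dataset by dataset
        key = similarity_key(record)
        if key in seen:
            continue  # a similar dataset is already in the sample
        seen.add(key)
        sample.append(record)
        if len(sample) == size:
            break
    return sample
```

Here already_used would hold the identifiers of the datasets analyzed in the exploratory study. A different similarity rule can be substituted by replacing similarity_key alone.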
To perform the analysis of each dataset, the human- and machine-readable (Turtle) EDP catalog information of each dataset was used. For some BPs, this study was carried out in two or more rounds because it was necessary to reverify or fine-tune information. Additionally, the following BPs were left out of the analysis as they were not applicable in the context of the EDP or in the context of this study: BP18, BP20, BP23, BP24, BP25, BP26, BP27, BP28, BP31, and BP33.
Although the W3C specification on DWBP is clear on how to identify compliance or non-compliance with each BP, it was necessary to further elaborate on these criteria to eliminate subjectivities and deviations in analysis over time. To facilitate the analysis and the presentation of the data, we divided some BPs into human-readable and computer-readable versions. The criteria applied in the analysis of the observation of each BP were the following:
Best Practice 1-Provide metadata. Provide metadata for human users and computer applications, so that humans can analyze the metadata and computer applications can process it. Always add the code "yes" since the European Data Portal always requires the provision of metadata.
Best practice 2-Provide descriptive metadata. If the dataset catalog provides information about the date, keywords, title, and publisher, among others, then put "yes", otherwise put "no". Add the unavailable metadata elements, considered essential, into the respective remarks field.
Best practice 3-Provide structural metadata. They are needed to open the dataset. If they have information about the meaning and acceptable values for each field, put "yes", otherwise put "no".
Best Practice 4-Provide data license information. If the dataset has a license and the license type is clear [15], then put "yes", but add relevant information in the remarks field. If it does not have a license, then put "no".
Best practice 5-Provide data provenance information. There is provenance information if there is information about the publisher ID, dates of creation, and modification of the dataset [16,17]. They were analyzed from two perspectives: (a) besides the dct:issued property being present, one of the following properties or all three should be present: dct:creator, dct:publisher; dct; publisher and (b) if none of the previous properties is present, prov:actedOnBehalfOf should be present. If one of the two possibilities or both are present, put "yes", otherwise put "no" and add information in an appropriate remarks field.
Best Practice 6-Provide data quality information. If the dataset has the property dqv:hasQualityMeasurement, put "yes"; otherwise, put "no".
Best Practice 7-Provide a version indicator. Divided by us into BP7H, human-readable information, and BP7C, computer-readable information. In BP7H, if the dataset has version information, put "yes". In BP7C, "yes" is only added if it has some property such as pav:version or owl:versionInfo. These properties can be identified in the Turtle syntax. In this case, appropriate information is added to the remarks field.
Best Practice 8-Provide version history. Divided by us into BP8H, human-readable information, and BP8C, computer-readable information. In BP8C put "yes" only if it has the metadata elements dct:isVersionOf, dct:hasVersion, owl:versionInfo, pav:version, or an equivalent associated with rdfs:comment. In this case, appropriate information is added to the remarks field. For BP8H, if a summary of the differences between versions is provided, put "yes"; otherwise, put "no".
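Checks such as those for BP6, BP7C, and BP8C amount to looking for certain properties in the Turtle description of a dataset. The Python sketch below shows one rough way to do this with a text scan. It assumes that the catalog uses the conventional prefixes (dqv, pav, owl, dct), and it would also match a property name that merely appears inside a comment or string literal, so it is a screening aid under those assumptions rather than the procedure followed in this study; a full RDF parser would be more robust. The "equivalent associated with rdfs:comment" case of BP8C is not covered.

```python
import re

# Property lists follow the criteria in the text; the dictionary layout and
# function names are ours and purely illustrative.
CRITERIA = {
    "BP6": ["dqv:hasQualityMeasurement"],
    "BP7C": ["pav:version", "owl:versionInfo"],
    "BP8C": ["dct:isVersionOf", "dct:hasVersion", "owl:versionInfo", "pav:version"],
}


def uses_property(turtle_text, prefixed_name):
    """True if the prefixed name occurs as a whole token in the Turtle text."""
    pattern = r"(?<![\w:-])" + re.escape(prefixed_name) + r"(?![\w-])"
    return re.search(pattern, turtle_text) is not None


def code_for(turtle_text, bp):
    """Return "Y" if any property listed for the BP is present, else "N"."""
    found = any(uses_property(turtle_text, name) for name in CRITERIA[bp])
    return "Y" if found else "N"
```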
Best Practice 10-Use persistent URIs as identifiers within datasets. Check if properties such as dct:creator, dct:publisher, dct:location, dct:spatial, dct:subject, dct:license, or dct:contributor are referenced by a persistent URI, e.g., DOI or Handle for documents, ORCID for the author, or URI for Creative Commons license. If yes, put "yes"; otherwise put "no".
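To illustrate the kind of check involved in BP10, the following Python sketch classifies a URI by the host of a few well-known persistent identifier services. The host list is an assumption made for this example: DOI, Handle, ORCID, and Creative Commons come from the criterion above, while purl.org and w3id.org are added by us. It is not the list used in the study, and a URI outside the list is not necessarily non-persistent.

```python
from urllib.parse import urlparse

# Illustrative host list (an assumption, see the paragraph above).
PERSISTENT_HOSTS = {
    "doi.org": "DOI",
    "dx.doi.org": "DOI",
    "hdl.handle.net": "Handle",
    "orcid.org": "ORCID",
    "creativecommons.org": "Creative Commons",
    "purl.org": "PURL",
    "w3id.org": "w3id",
}


def persistent_scheme(uri):
    """Return the name of the identifier service, or None if not recognized."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https"):
        return None
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.orcid.org like orcid.org
    return PERSISTENT_HOSTS.get(host)
```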
Best practice 11-Assign URIs to dataset versions and series. If a URI is assigned to each version, put "yes"; otherwise, put "no".
Best Practice 12-Use machine-readable standardized data formats. If the dataset has standardized machine-readable distributions, such as XML, JSON, Turtle, and/or CSV, put "yes"; otherwise, put "no".
Best Practice 13-Use locale-neutral data representations. Divided by us into BP13H, human-readable information, and BP13C, computer-readable information. In BP13H, if there is information on how to interpret the respective values in the columns (dates, times, currencies, and numbers), then put "yes"; otherwise, put "no". For BP13C, it was necessary to search for the properties dct:conformsTo, dct:language, dct:location, and/or dct:spatial. If identified, put "yes"; otherwise, put "no".
Best practice 14-Provide data in multiple formats. If the dataset has distributions in several formats, put "yes"; otherwise, put "no".
Best Practice 15-Reuse vocabularies, preferably standardized ones. The EDP uses DCAT and the data theme authority table, adopted for dataset classification (dcat:theme), by default for all datasets. Therefore, our analysis focused only on the use of value vocabularies such as Eurovoc. Thus, if the dataset description has the property dct:subject with values from standard vocabularies (Eurovoc), put "yes"; otherwise, put "no".
Best Practice 16-Choose the right formalization level. For this best practice, if the dataset is described with an appropriate vocabulary, such as Dublin Core and Schema.org, put "yes"; if it uses a vocabulary that is over- or underspecified, put "no".
Best practice 17-Provide bulk download. If the dataset can be downloaded all at once, put "yes"; otherwise, put "no".
Best Practice 18-Provide subsets for large datasets. This best practice only applies to large datasets. In the EDP, these are already divided, so it does not apply.
Best Practice 19-Use content negotiation for serving data available in multiple formats. Check the available representations of the resource and try to get them by specifying the accepted content in the HTTP request header. If the requested representation is returned, put "yes"; otherwise put "no".
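A content negotiation check of this kind can be sketched with the Python standard library as follows. The function names are ours and the example is only an illustration: it sends the desired media type in the Accept header and compares it with the Content-Type of the response. Treating an HTTP error status (for instance 406 Not Acceptable) as "N" is an assumption of this sketch.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(url, media_type):
    """Create a GET request that asks for one specific representation."""
    return Request(url, headers={"Accept": media_type})


def served_as_requested(media_type, content_type_header):
    """Compare the requested type with a Content-Type value, ignoring parameters."""
    if not content_type_header:
        return False
    returned = content_type_header.split(";", 1)[0].strip().lower()
    return returned == media_type.lower()


def check_negotiation(url, media_type, timeout=10):
    """Network call: "Y" if the server answers with the requested media type."""
    try:
        with urlopen(build_request(url, media_type), timeout=timeout) as response:
            header = response.headers.get("Content-Type")
    except HTTPError:
        return "N"  # assumption: an error status counts as failed negotiation
    return "Y" if served_as_requested(media_type, header) else "N"
```

A call such as check_negotiation(dataset_uri, "text/turtle") would then be repeated for each representation listed for the resource.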
Best Practice 20-Provide real-time access. The EDP encourages data providers to make data available in real time. However, this BP cannot be verified by analysis of its catalog and was therefore not included in the analysis.
Best Practice 21-Provide data up to date. If there is the property dct:accrualPeriodicity or similar, put "yes", otherwise, put "no".
Best Practice 22-Provide an explanation for data that are not available. Divided by us into BP22H, human-readable information, and BP22C, computer-readable information. For BP22H, if datasets are accompanied by an HTML document with information about data referred to in the dataset but not available for some reason, put "yes"; otherwise put "no". For BP22C, if appropriate HTTP status codes are used, such as 303 (see other), 410 (permanently removed), or 503 (service providing data unavailable), put "yes"; otherwise, put "no".
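For the computer-readable part of this criterion, the status codes named above can be recognized with Python's standard http module. The sketch below only tells whether a status code is one of the three explanatory codes mentioned in the text; how responses for data that are available were coded is not covered, and the function names are ours.

```python
from http import HTTPStatus

# The three status codes named in the criterion for BP22C.
EXPLANATORY_STATUSES = {
    HTTPStatus.SEE_OTHER,            # 303
    HTTPStatus.GONE,                 # 410
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503
}


def explains_unavailability(status_code):
    """True if the status code is one of the explanatory codes listed above."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:  # not a registered HTTP status code
        return False
    return status in EXPLANATORY_STATUSES


def reason_phrase(status_code):
    """Standard reason phrase for a registered status code."""
    return HTTPStatus(status_code).phrase
```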
Best practice 23-Make data available through an API. The EDP enables the distribution of datasets by API, but as compliance with this BP does not depend on the datasets, it was not analyzed.
Best practice 24-Use web standards as the foundation of APIs. The compliance with this BP does not depend on the datasets, so it was not analyzed.
Best Practice 25-Provide complete documentation for your API. The EDP provides complete documentation for its APIs. The compliance with this BP does not depend on the datasets, so it was not analyzed.
Best Practice 26-Avoid breaking changes to your API. The compliance with this BP does not depend on the datasets, so it was not analyzed.
Best practice 27-Preserve identifiers. This BP is also not applicable to the scope of this study since we did not look at removed datasets.
Best Practice 28-Assess dataset coverage. The analysis of the compliance of this BP is also out of the scope of this study since it is related to preservation information for archival purposes.
Best practice 29-Gather feedback from data consumers. Data consumers will be able to provide feedback and evaluations on the datasets and their distributions. If there is a feedback mechanism for data consumers, such as email or another communication channel, put "yes"; otherwise, put "no".
Best Practice 30-Make feedback available. The feedback may be made available to data consumers. Check whether the property rdfs:comment or a similar one exists; if so, put "yes"; otherwise, put "no".
Best Practice 31-Enrich data by generating new data. As we found no information either within the datasets or in the metadata that would allow us to say that the data were enriched, this best practice was not included in the analysis.
Best practice 32-Provide complementary presentations. If there are complementary presentations of the dataset such as a graph, put "yes"; otherwise, put "no".
Best Practice 33-Provide feedback to the original publisher. Compliance with this BP is out of scope of this study as we had no access to the communication between the EDP and its data providers.
Best Practice 34-Follow the licensing terms. Although the EDP collects data with the same type of license provided at the source (https://data.europa.eu/pt/faq (accessed on 28 May 2021)), it was checked whether the dataset follows the license of the data according to the presented term. If it does, put "yes"; otherwise put "no".
Best practice 35-Cite the original publication. If the citation of the original source of any dataset was available by a text or a link (e.g., data source, available from) put "yes"; otherwise put "no".
These results highlight the importance of quality-driven data publishing. Data publishing provides benefits for both managers and users. Data publication can be very useful for various sectors and users, as in the case of transport, where it can support a more efficient response in emergencies [18] or inform decision making. However, it does not make sense to publish data without due attention to quality, which is necessary to ensure reliability in access and reuse. Like the FAIR principles, the observance of the best practices recommended by W3C enhances the quality of open data, with DWBP being more comprehensive.
The result of this study offers insights to data managers, notably in the context of government, on which best practices are most observed and which need more attention. In addition, it fills a gap in the literature on the quality of data provided by the EDP from the DWBP perspective.
A limitation of the study is that we did not analyze BP18, BP20, BP23, BP24, BP25, BP26, BP27, BP28, BP31, and BP33 because they fell outside the scope of this study.
Despite the extra care with the sampling technique, many datasets are still similar, so new studies will need to start by refining the sample constitution algorithm.

User Notes
Our datasets are made available as CSV files, an open format. On the first page of each CSV, there is structural information about the data. Legends for the abbreviations are at the bottom of the CSV.
The CSV sheets are structured as follows: rows-BPs; columns-identifiers for each dataset. In the final study sheets, we included a row for remarks on each BP.
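As a hedged illustration of how the sheets might be processed, the Python sketch below counts the codes per BP for a matrix with BPs in rows and dataset identifiers in columns. It assumes a single header row and the cell codes Y, N, and NA; the published files also contain structural information, remarks rows, and legends, so some preprocessing may be needed, and rows whose cells are not codes are simply skipped here.

```python
import csv
import io
from collections import Counter

CODES = {"Y", "N", "NA"}  # assumed cell codes: Yes, No, Does Not Apply


def tally(csv_text):
    """Return {row label: Counter of codes} for a BP-by-dataset matrix."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header row: label column followed by dataset identifiers
    counts = {}
    for row in reader:
        if not row or not row[0].strip():
            continue
        label = row[0].strip()
        filled = [cell.strip().upper() for cell in row[1:] if cell.strip()]
        # Remarks rows and legends hold free text rather than codes: skip them.
        if not filled or any(cell not in CODES for cell in filled):
            continue
        counts[label] = Counter(filled)
    return counts
```

Applied to one category sheet, tally would give, for each BP row, the number of datasets observing (Y) and not observing (N) that BP.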