How to Inspect and Measure Data Quality about Scientiﬁc Publications: Use Case of Wikipedia and CRIS Databases

: The quality assurance of publication data in collaborative knowledge bases and in current research information systems (CRIS) becomes more and more relevant by the use of freely available spatial information in different application scenarios. When integrating this data into CRIS, it is necessary to be able to recognize and assess their quality. Only then is it possible to compile a result from the available data that fulﬁlls its purpose for the user, namely to deliver reliable data and information. This paper discussed the quality problems of source metadata in Wikipedia and CRIS. Based on real data from over 40 million Wikipedia articles in various languages, we performed preliminary quality analysis of the metadata of scientiﬁc publications using a data quality tool. So far, no data quality measurements have been programmed with Python to assess the quality of metadata from scientiﬁc publications in Wikipedia and CRIS. With this in mind, we programmed the methods and algorithms as code, but presented it in the form of pseudocode in this paper to measure the quality related to objective data quality dimensions such as completeness, correctness, consistency, and timeliness. This was prepared as a macro service so that the users can use the measurement results with the program code to make a statement about their scientiﬁc publications metadata so that the management can rely on high-quality data when making decisions.


Introduction
Information and communication technologies are not the only ones that play an important role in all aspects of modern society [1]. But also the processing of electronic research data by the institutions. Research data are an essential part of the operational processes of scientific organizations. They also form the basis for decisions. In recent years, research data at the institution level has become accessible to researchers in many countries [2]. There is also an explosion of large data-various forms of establishment-level information that are typically created for business purposes [2]. In most facilities, incorrect research data only come to light in the current research information systems (CRIS) (The nomenclature for research information systems is more or less not standardized, including RIMS (Research Information Management System), RIS (Research Information System), RNS (Research Network System), RPS (Research Profiling System) or FAR (Faculty Activity Reporting). In this paper, the preferred term is CRIS (Current Research Information System) because it is widespread in European countries. A CRIS is a database for the collection, management and provision of research information

Related Work
There are various studies related to analysis of the Wikipedia references. One of the studies showed that a large number of academic and peer-reviewed publications used as sources in this collaborative knowledge base [17]. Other works showed that Wikipedia preferred referenced with higher academic status and accessibility of journal [9,18]. There are also studies in a similar direction that concluded that Wikipedia can help assess the impact of scientific publications [19,20]. Moreover based on Wikipedia references it is possible to identify the most influential journals [21]. There is also research that showed how special identifiers can be used to find and unify similar sources with different metadata in Wikipedia [8]. For instance, it can be useful when Wikipedia users provide some mistakes in metadata, but also when they translate and provide titles of the publications on their local languages. It is also important to note that in Wikipedia there are special tools for adding and editing references in articles with limited metadata [22]. There are also publications, which take into account the number of references to automatically assess the quality of information in Wikipedia articles [23], including approaches that used machine learning algorithms [24,25] or synthetic measure [26].
All previously mentioned studies did not analyze the quality of metadata of the Wikipedia references. One of the most relevant projects-WikiCite, which is an initiative to create a bibliographic database based on Wikidata [24]. The goals of this project include the improvement of citations in Wikipedia and other Wikimedia projects. However, not all citations from Wikipedia articles are extracted to Wikidata.
Assessment of the quality of the bibliographic metadata is a known problem. However, there are few existing studies that are focused on quality assessment of metadata in other resources. Based on literature research, the same environment is examined and the focus is on ensuring the data quality of metadata in institutional information systems such as CRIS (e.g., Pure, Converis, Symplectic Elements, etc.).
In research institutions, bad data quality can affect different areas. An improvement in data quality is therefore desirable and often meets with little resistance. However, some have found that improving data quality can be complex and time-consuming, but the opposite has to be analyzed and known only about the source of the data quality issues. This can be resolved using data profiling and data cleansing. Because only if the cause is remedied, a lasting improvement of the data quality can be achieved. For more details in this particular area, see the works from [5,[27][28][29][30].
The collection of publication data in CRIS can lead to quality errors. If errors are already made when collecting the data, this can have many negative effects [30]. Spelling errors and typos, incorrect values or missing and inconsistent values are just a few examples of errors in collecting research information into CRIS [31]. To ensure data quality, continuous analysis of research information during its integration into CRIS is required [29]. For CRIS, high data quality is one of the main criteria that determine whether the project is successful and the resulting information is complete, correct and consistent.

Data Quality Challenges in References Data
If research information is incorrect, complete or consistent, it may result in significant organizational consequences. With increasing data volume and the number of source systems, it becomes increasingly difficult to meet the requirements for data acquisition and transformation processes. With both manual research information and automated data collection processes, increasing the amount of data can lead to more errors. The type and number of users can also have an impact on data quality.
In this section, we want to highlight the data quality issues from the reference data in Wikipedia and the publication data from Web of Science. For this purpose, methods are presented to make these problems of data quality recognizable and subsequently evaluable. Recently, the occurring data quality problems of the publication data from the Web of Science for the year 2018 were analyzed on behalf of the German Centre for Higher Education Research and Science Studies (DZHW) and the possible errors were categorized. Figure 1 illustrates the observational publication data from the Web of Science.
The following data quality issues categories were identified in the publication data from Web of Science: 1. Name change after marriage: If Vivian Braun publishes after her marriage as Vivian Mathis, it is not clear that it is the same person. 5. Names of authors are written differently in different countries, e.g., in the Russian space "Дмитрий Менделеев" and in the European area "Dmitri Mendeleev". 6. Faulty multiple registration of institutions: An institution is recorded with different name forms. 7. Incorrect, incomplete and inconsistent collection and order of institutional information of the authors: The registration of institutional information is not done in the correct order. That is, if the order of the information does not correspond to the hierarchical structure of the associated organization. Example: "Institute for Database Systems and Information Systems-Technische Universität Berlin". 8. Erroneous separation: When reading in, various institutions are incorrectly separated (e.g., "and" not recognized and therefore two detected as one). 9. Duplicate detection of Digital Object Identifiers (DOIs): A DOI is assigned in different articles or a publication has different DOIs. In such collaborative knowledge bases as Wikipedia metadata about sources can be provided in different ways depending on the skills and experience of the users. Additionally, there is no central editorial there and not all of the provided data are checked regularly. Therefore, we can expect that various problems related to the quality of the source metadata may appear in this open encyclopedia. As was mentioned before, there are different possibilities to place information about sources in Wikipedia articles. One of the shortest way to add reference is to put some basic information about the publication or URL address of the page, where we can see more information about this source. However, in the case of URL address, Wikipedia readers must follow this link to see more information about this source. At the same time, basic information about the source without URL forces the reader to search for the source of its real existence. An additional problem with the description of the source can be connected with different formats of the citations (such as the American Psychological Association (APA), Modern Language Association (MLA), Harvard), when it presented as an unstructured text. So, in this case, it is difficult to automatically extract source metadata about author, title, publisher and others for analysis of correctness or completeness.
In Wikipedia, sources can be described using special citation templates different parameters depending on the source type and language version. Such parameters include authors, title, publisher, publication date, URL, access date, DOI, ISBN and others. Table 1 shows the number of references with citations templates that contains special identifiers in the top 55 most developed languages versions of Wikipedia. The most developed languages were selected based on article count and depth as proposed in [26]. Based on extracted data it is possible to get the information that was not directly provided in the citation templates. For example, based on special identifiers we can obtain the data about the publishers of each reference using other open bibliographic databases such as Crossref [32]. Thus, after extraction of metadata for over 1 million unique scientific publications with DOI numbers from over 40 million Wikipedia articles in considered languages, we found the top 20 most common publishers of the Wikipedia scientific references: Citation templates can have different names depending on the source which we want to mention as a reference in Wikipedia article: journal, book, conference etc. For example, to give a reference to a book, in English Wikipedia we can use "Cite book" template [33]. However in German Wikipedia to cite the same source we need to use another template name-"Literatur" [34]. Figure 2 shows the different names of the Wikipedia citation templates related to scientific publications in various languages. Additionally, each of the citation templates in each language version of Wikipedia can have its own set of permitted parameter names that can be used to describe the reference. Even the first and last name of each author of the source can be provided separately. Figure 3 showed the most popular parameters in the Wikipedia citations templates related to scientific publications in English Wikipedia (related figures for other language versions can be found in the supplementary web page [35]).  Citation templates are more convenient for machine processing compared to unstructured text with source metadata. However, there may be problems with the quality of data in these templates, such as: 1. Incompleteness of the metadata. Not all parameters are filled by users in some cases. 2. Mismatches between parameters. For example, users can provide the DOI number related to other publications then it was provided in the URL parameter of the template. 3. Wrong value of the parameters such as title, authors and other parameters comparing to publisher database. 4. Invalid data format. Some parameters must have a special structure to be shown correctly in the Wikipedia article. For example, the "year" parameter cannot contain letters.
5. Uncertainty in the assignment of surname and first name. Sometimes parameter "first" consists of the last name of the authors, and conversely "last" parameter consists of the first name. 6. Redundancy of parameters values. For example in the parameter about the journal-title sometimes contains additional information, which must be placed in another parameter: journal = Scientific Data volume 3, article number: 150075. 7. Non-existent URL. This may be related to an incorrect value entry or removal of the destination web page after some time. Incorrect order of the publication authors.

Quality Analysis of References Data in Wikipedia and CRIS
In organizational database systems, there are usually very many data sources available. As an example, we will consider the reference data in Wikipedia as a data source. From the importance of data quality in organizational database systems, arises the question of how the quality of the reference data can be analyzed before it integrates into the information system and leads to a good decision. For that, it is necessary to examine the notion of data quality and introduce methods using the DataCleaner (https://datacleaner.org/) to analyze the data quality of reference data in Wikipedia.
We understand the concept of data quality both from the point of view of the provider of a data source and from the point of view of the user of the data source. The definition of data quality states that reference data in Wikipedia should be suitable for the purpose for which it was collected and generated.
Data quality is multi-dimensional and context-dependent [36]. It can not be defined by a single criterion, but from four different quality criteria (such as correctness, completeness, consistency and timeliness) together [3]. Only if these are considered together can the quality of the reference data be meaningfully described. According to [27] the quality of data must often be defined as the suitability of the data to be used for certain required usage goals, which must be error-free, complete, correct, up-to-date and inconsistent so that users can get better results.
High data quality can simplify integration on the one hand. On the other hand, quality issues become apparent in the course of data integration when comparing multiple data sources [37]. Data quality problems (such as duplicates, incomplete information, incorrect information, null values, etc.) during integration can be controlled by the process of data analysis and this is referred to as data profiling [5]. Data profiling is responsible for collecting as much information as possible about the data, making it easier to identify potential sources of error. In addition, data profiling analyzes the attributes at the instance level and captures as many metadata as possible. the attribute name, data type, value ranges, unique key, patterns, and domains [5].
Based on our practical example of reference data in Wikipedia with a subset of randomly selected 12,334 records, we performed data profiling using the DataCleaner tool to analyze and improve the quality problems of the reference data. With a tool like DataCleaner improves the data quality of existing data in a sustainable way and reduces the duplicates and redundancies within individual databases but also across an entire database. Furthermore, DataCleaner reduces the effort and cost of editing the data. The process of DataCleaner consists of three steps: 1. check data and identify errors 2. validate data 3. correct mistakes Our example includes three columns citation_id, parameter_name and parameter_value and for that they have analyzed their data structure and data contents and used this analyzer as follows (see Figure 4):

•
Completeness analyzer: Reference data can be checked with completeness analyzer, if all required fields are completely filled out.

•
Unique Key: Unique key analysis is the finding of null values or duplicates of reference data.
• Character set distribution: This analysis checks and maps the text characterization of reference data to the corresponding affinity, e.g., Arabic, Latin, etc. • Pattner finder: Possible patterns of reference data can be detected and resolved using pattern analysis and this could e.g., date format, e-mail addresses, etc. [5].

•
Value Distribution: The value distribution can be used to identify all the values of a specific column and to examine which rows belong to specific values. The results of our quality analysis for the reference data in Wikipedia with this analyzer are clearly shown in Figures 5 and 6. The findings are analyzed in tabular or graphical form.
Completeness Analyzer returns the number and percentage of NULL values for each column of the data set. Only explicit NULL values are taken into account, i.e., the absence of a value in a column. In our example, there are 119 incomplete records and this is a signal for non-meaningful data. Next, we used Unique Key Analysis, which determined that the 7336 attributes have a Unique Key among 12,333 rows. It is, therefore, necessary to create a rule that does not duplicate all entered values nor contains NULL values.
Character set distribution is useful to gain insight into the international aspects of our data. Here the question is answered: Can all our data be read and understood? Pattern finder discovers and identifies patterns or representations in our example by analyzing the attributes (see Figure 5). The values are searched for possible patterns and identified and correlated with the filtered-out patterns [5].
Based on Figure 6, we used our example to find out which values represent most of the value distribution profiles and which rows belong to specific values. In summary, DataCleaner is an open-source application for data analysis, profiling and cleaning. With its help, the activities can monitor and manage their data quality. Data analysis by DataCleaner helps to evaluate and detect errors and deficiencies in the reference data, even before they are processed with the data, they are adopted in a database system. Only when the analyzed reference data are clean and free of defects can it be used to make high-quality decisions that add real value to the organization.

Quality Dimensions of the Source Metadata in Wikipedia and CRIS
In order to make the quality of the reference data measurable, certain quality dimensions must be assigned to the data. Current information technology has a wide range of properties that subdivide data quality into measurable areas. Figure 7 below shows the most frequently mentioned data quality dimensions in the literature.
The quality dimensions differ according to subjective, objective and request-specific dimensions, which from the point of view of the user should reflect a structure of the requirements for the data.

•
Subjective quality dimensions can only be assessed by the user. These include e.g., interpretability, relevance, reputation, simplicity of presentation, comprehensibility of presentation.

•
Objective quality dimensions can, for the most part, be collected automatically and obtained from a data source, such as is possible when measuring the number of emergency zero values. This requires querying the data source in order to have values for processing. Objective quality criteria count e.g., completeness, timeliness, objectivity, correctness, security, consistency.

•
The request-specific quality dimensions change from request to request, such as the latency, which depends, among other things, on the time of day and the complexity of the request.
At the core of the measurement of data quality, all quality dimensions (see Figure 7) point in the same direction and we have not compared them with each other because they want to achieve high data quality through constant adjustments to the system. In order for the quality of data to be and remain good, it is necessary to measure and monitor this quality. As part of the survey carried out in the papers [30,31], the relevant dimensions for checking and measuring the data quality in CRIS were examined and thereby determined what is of particular importance for institutions. For the respondents, the correctness, completeness, consistency and timeliness of the research information are very important. For this reason, this paper focuses only on the four objective data quality dimensions in the context of reference data in Wikipedia and CRIS. Therefore, data quality dimensions are selected that must be measurable on the one hand, and on the other hand, identified by users as particularly important in practice [31]. These are described in the following subsections with their simple metrics as follows. For each metric, we can achieve a degree from 0% to 100%.

Completeness
Data is complete if it is not missing and available at the specified times in the respective process steps. It is essential to determine against which amount the completeness is tested. Metadata of the reference completeness shows how many parameters have appropriate value among all defined or the most important parameters in the citation template.

Correctness
Data is correct and error-free if it matches reality. In this case, the value of the parameters in the citation template must be compared to the primary source of the metadata (e.g., from the publisher website).

Consistency
Data is presented consistently if it is consistently mapped in the same way. Metadata of the references can be provided manually by different users in various language versions of Wikipedia, therefore often used citations can have a lower value of this measure compared to other sources.

Timeliness
Data is up-to-date if it reflects the actual property of the described object in a timely manner. References in the Wikipedia articles that were provided a relatively long time ago can have a lower value of this measure compared to recent ones.

Measures Modeling
The measurement of four described before objective data quality criteria will be explained in the following subsections and showed how data quality of the reference data in Wikipedia and CRIS can be measured. The aim was to present four metrics for the data quality dimensions, which enable an objective, targeted and largely automated measurement on different levels of aggregation (e.g., attribute values, tuples, etc.). Finally, there is an example of this measurement as written Python source code in pseudocode to measure the completeness, correctness, consistency and timeliness of the reference data before it is integrated into the CRIS. The CRIS employee can import his internal or external data source as a file at Python, copy the code provided and execute it as a script. This code then calculates the degree of completeness, correctness, consistency and timeliness for the user in order to have the most objective judgment possible. The program code including package will be available on the following website (https://github.com/OtmaneAzeroualDZHW/Forschungsinformationssysteme) and can be downloaded.

Measurement of Completeness
In order to measure completeness, we must analyze check if each parameter of the citation has some value. So, in the algorithm input we have records with certain number parameters such as author(s), title, DOI number, publication type, publication year. If some of the parameter values are empty ("NULL") we decrease completeness of this record by 0.25. When the algorithm iterates each record, it also counts how many values of the specific parameter in the column are empty, so in the end, we can also get information about completeness on each parameter. For instance, we can find how complete are citations in the entire dataset for DOI numbers.
In Algorithm 1 records means all records to be checked, while records_num is number of records. Each record contains fields that are related to different metadata: name, publication title, DOI number, date of publication etc. The f ield_emvalues variable counts the number of the empty values within the selected field, while empty_values means the number of empty fields of the current record. Fields that must be checked marked as fields, while the number of the fields as f ields_num.

Measurement of Consistency
Consistency can be measured by taking into account at least two datasets with citation metadata. Here we do not have "golden standard", so the algorithm compares records between selected datasets and detects any discrepancies between parameter values. Similar to completeness and correctness, the Algorithm 3 measure consistency at the level of each record and at the level of specific parameter (column). In Algorithm 4 standard is the golden standard database with metadata about all existing publications. Due to the fact, that not all fields can be changed during the time, the algorithm only check selected fields that are presented as f ields_to_check_num in pseudocode. In case, when the record is not in the golden standard database, it is not possible to measure timeliness. f ield_values is a dictionary with a specific value of the filed as a key, and the value as a date of first a. The decay rate of the data value presented as decline and it is a static value that depends on f ield_name. Age of the f ield_value market as age. Assessment for a particular field in each record takes tim f ield variable.

Discussion
To continuously analyze and measure all data quality problems of the reference data in Wikipedia and publication data in CRIS, in addition to this correction of accidentally conspicuous data errors [27,28]: 1. The identification and elimination of the respective sources of error, 2. Constant monitoring of the data and its quality, 3. Measures (such as pro-active measures) to prevent another mistake and 4. The regular control of new data errors.
It is not enough to clean up the research information only when it is integrated into CRIS, but it is necessary to communicate errors to the appropriate places so that they are already fixed in data sources.
This not only prevents the occurrence of the same data quality problems the next time the research information is loaded from the respective system into the the CRIS, but sensitizes the responsible IT person, e.g., during manual data entry.
In order to check the frequency of change of the research information from common values for publication data and reference data in Wikipedia, two aspects should be considered and this is explained as follows: 1. Trustworthiness of the data sources: Whether research information in the CRIS can achieve high quality depends on the quality of the respective data sources. The question to answer is whether a data source is trustworthy at all and whether it has a high-quality database from the outset that is easily overlooked. However, if data quality is serious, the data sources should also rate their trustworthiness [37]. 2. Selection of research information: In order to be able to analyze the research information, it must first be sensibly selected with regard to the intended use. The right choice of research information is the first step towards high data quality. Which research information should be selected for integration depends on the individual needs of the scientific organization.
In order to assess the timeliness of the source the metadata we need to know, which of the values of the available parameters in the Wikipedia citation templates can be changed over time. For instance, there is no parameter related to contact details of the author (e.g., email address), at the same time there is an author name, publisher, URL address and other fields that can be compared with the related historical values. Additionally, each of the historical values must have an assigned date that shows when this value was updated according to the real state.

Conclusions
Our paper illustrated how data quality can be analyzed and measured in the context of scientific references in multilingual Wikipedia and CRIS. Systematically building the data quality problems were first explained in practice and performed the analysis using DataCleaner. Furthermore, the data quality was measured by the four objective metrics and we show how they can be measured using pseudocode. We also adjust the code for Python programming language.
So far, the results of our proposed solution could not be compared with the results of the other existing related solutions, as this does not exist in the literature. The solution presented here offers a novelty in research since the measurement of the objective criteria has never been programmed with Python to diagnose and evaluate the quality of metadata from the scientific publications in Wikipedia and CRIS. It is advisable to carry out these measurements continuously because quality measurement of data is not a one-time action, but rather has to be viewed as a permanent task. For this purpose, employees have to be made aware of data quality and motivated to produce it. This is the only way to ensure long-term and sustainable better data quality.
The results of our paper suggest that data analysis uses data profiling to examine, analyze, and summarize reference data in Wikipedia and CRIS. This provides a clear overview that allows organizations to better understand problems, risks, and general data quality trends. Data profiling enables organizations to gain and exploit important data-based insights, e.g., predictive decision-making [5].
For an optimal assessment and measurement of the data quality of reference data in Wikipedia and CRIS, the corresponding four data quality dimensions (correctness, completeness, consistency and timeliness) are required. As Azeroual [3] says, according to its measurement results, with these four objective dimensions in the area of CRIS "the four selected dimensions enable an objective, effective and largely automated measurement within the CRIS. Their metrics have proven to be relatively easy to measure. In addition, these represent a particularly representative presentation of the reporting for the CRIS users and lead to an improved basis for decision-making. In this respect, the review of data quality must always be done with special regard to their context".
In each language version of Wikipedia, we can meet with different names of the templates that describe the scientific references. Each of them can have its own set of permitted parameters and we found which of them are commonly used -title, first name, last name, special identifiers (DOI, ISBN, ISSN, JSTOR) and others. In this work, we showed how often special identifiers are used in over 40 million Wikipedia articles in different languages. Based on special identifiers we can obtain additional data about scientific publications from other databases. Thus we showed the top 20 most popular publishers in Wikipedia scientific references. Moreover, we can compare completeness, correctness, consistency and timeliness of metadata when we know how to clearly identify the same publications in different databases.
Finally, it can be said that the data quality requirements depend on the organizations and in particular on the users of the data. With the help of the procedure described in this paper, quality problems are recognized and corrected automatically at an early stage. In this way, the data quality is continuously monitored and improved. Deviations or errors are quickly recorded across systems, localized in a targeted manner and thus remedied more cost-effectively. In order to avoid the mostly cost-intensive reactive measures, a holistic data quality management process is required. This makes it possible to introduce and permanently guarantee quality in facilities as an overall target for data.