2.1. The Data Basis and its Pre-Processing
In order to describe the complete network infrastructure, comprehensive data coverage is crucial. In our experience, every kind of data is valuable, as is storing timeline-related information. The data include not only the metadata provided by the particular repository but also the data generated when establishing and processing a connection to a repository endpoint. Such observations over time periods are very useful for evaluation purposes.
In fact, the data are spread across different resources, and creativity is needed to bring them together in a universally applicable common data schema.
Currently, our statistical information is composed of:
Repository Metadata incl. common resource description;
Harvested Metadata (for publications and related objects);
Log Files of the harvesting processes;
the BASE index.
The basic data listed above are available in different forms (related to formats, scope and time dimension).
BASE has been collecting repository metadata since 2007, when a database solution was implemented in order to use these data for displaying content provider information in search results and in the list of indexed resources. Initially, these data, based on daily database dumps, were stored annually; since 2015 they have been stored at monthly intervals. Included are country of origin, federal state (only for Germany, Switzerland and Austria), continent, technical platform, repository type (based on a vocabulary developed and used in BASE), date of first indexing in BASE, Open Access policy (on the repository level), number of documents, number of OA documents (after normalization), and indexing status (besides the default value ‘indexed’, various special states such as ‘removed’, ‘to be checked’, etc.).
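The repository attributes listed above suggest a simple record structure for the monthly snapshots. The following sketch shows one possible shape as a Python dataclass; the field names are illustrative, not BASE's actual database schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RepositorySnapshot:
    """One monthly observation of a repository (illustrative field names,
    derived from the attributes described in the text)."""
    repo_id: str
    name: str
    country: str              # country of origin, e.g. ISO 3166-1 alpha-2
    continent: str
    platform: str             # technical platform, e.g. "DSpace"
    repo_type: str            # BASE-internal repository-type vocabulary
    first_indexed: str        # date of first indexing in BASE, "YYYY-MM"
    oa_policy: Optional[str]  # Open Access policy on the repository level
    num_documents: int
    num_oa_documents: int     # after normalization
    indexing_status: str      # "indexed", "removed", "to be checked", ...
    snapshot_month: str       # "YYYY-MM"
```

Storing one such record per repository and month makes the timeline-related evaluations described above straightforward.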
The analysis of the harvested metadata has been carried out daily since 2007 using an automatic tool for monitoring the harvesting environment. As a log, the tool records the number of documents (complete and cleaned, with corrections and deletions) and the content of some OAI-PMH response information: repository name, supported metadata formats and the deletion strategy per repository. The evaluation process must be flexible; new developments such as the emergence of ORCID information in metadata can then be monitored and recorded in the same way as, for example, license types. The ORCID identifier is a good and up-to-date example of a development that can be observed through such a monitoring process: the stored metadata show from which resources ORCID information originates and in which geographical and technical contexts, and timeline-based figures provide valuable information on how to support further dissemination. Similar past examples of newly documented developments are DOI and license information and their increasing use in repositories.
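Monitoring the emergence of a feature such as ORCID iDs in harvested metadata can be reduced to a simple pattern count per harvest period. The sketch below is a minimal stand-in for such a monitor; the record format (month, raw metadata text) is an assumption for illustration:

```python
import re
from collections import Counter

# ORCID iDs follow the pattern 0000-0000-0000-000X (last character may be 'X').
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")

def orcid_counts_by_month(records):
    """Count harvested records containing at least one ORCID iD, grouped by
    harvest month. `records` is an iterable of (month, metadata_text) pairs."""
    counts = Counter()
    for month, text in records:
        if ORCID_RE.search(text):
            counts[month] += 1
    return dict(counts)
```

The same counting scheme applies to other observable developments, such as DOIs or license URIs, by swapping the regular expression.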
As a globally operating search engine, BASE includes a Lucene index, which has offered API access since 2007. This search index can be used as a database for analyzing the current state of the repository landscape. The API syntax allows any combination of available search aspects, in any nesting supported by the index, so that highly complex cascaded queries and evaluations of the current status can be executed, going far beyond the information stored in the database. The essential point is to find the most appropriate strategy for extracting comprehensive descriptive information from the current search index. Unfortunately, it is not possible to save the complete index as a snapshot and use it later for complex analyses of earlier points in time. The simple reason is that the index is so large (currently more than 1.5 TB of storage space is required) that storing it, even at reasonable intervals, would far exceed the storage capacity of our technical system. A further serious problem is that BASE has changed the index structure several times and has frequently updated the software of the core system in parallel, which would make the query behavior incompatible across snapshots. To compensate for this deficit, BASE stores relevant facets of query responses (for single repositories and countries). Available attributes include language, type, publication date, Open Access status, license information and DDC class codes (the latter can be used to assign topics). Vocabularies for repository and publication types were derived for the BASE normalization process and for the clear use of search facets: the terms of certain fields in the available metadata were sorted by frequency, and the content-matching values were then defined.
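Since full index snapshots are impractical, what is stored is the facet part of query responses. The sketch below reproduces that idea in miniature over an in-memory document list; the field and filter names are illustrative, not the actual BASE index fields:

```python
from collections import Counter

def facet_counts(documents, field, filters=None):
    """Compute facet counts for one field over all documents matching the
    given filters -- a stand-in for storing the facet section of a
    search-API response as a snapshot."""
    filters = filters or {}
    counts = Counter()
    for doc in documents:
        if all(doc.get(k) == v for k, v in filters.items()):
            value = doc.get(field)
            if value is not None:
                counts[value] += 1
    return dict(counts)
```

Running such a function for each repository and country, and persisting the returned dictionaries at regular intervals, yields exactly the kind of timeline of facet figures described in the text.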
It was therefore logical that this expertise should support the subsequent definition activities at national (DINI) and international level (OpenAIRE, COAR).
The index facets at country and repository level for the normalized contents of BASE index fields have been stored as snapshots for evaluation purposes at a monthly rate since 2016. This setup needs to be reconsidered in order to cover further useful aspects and to extend the data extraction with more complex requests.
A log file of the harvesting processes is available for the period since 2009. The log file data contain status and error messages of the harvesting processes and can provide additional information on the technical behavior of the interfaces (availability, stability, error situations, etc.).
In addition, properties describing the technical behavior of the OAI-PMH interface (performance measurements, comprehensiveness and quality of metadata, batch size) can be extracted from the harvesting response files. It is planned to integrate this information and store it at periodic (monthly) intervals.
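Extracting such interface properties amounts to parsing the harvest logs and aggregating per repository. The following sketch assumes a hypothetical one-line log format; the real BASE log layout may differ:

```python
import re

# Hypothetical log line format, e.g.:
# "2020-02-01 repo=example.org/oai batch=100 time_ms=850 status=OK"
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) repo=(?P<repo>\S+) "
    r"batch=(?P<batch>\d+) time_ms=(?P<ms>\d+) status=(?P<status>\w+)"
)

def interface_metrics(lines):
    """Aggregate per-repository request counts, error counts and total
    response time from harvest log lines; unparseable lines are skipped."""
    stats = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        s = stats.setdefault(m.group("repo"),
                             {"requests": 0, "errors": 0, "total_ms": 0})
        s["requests"] += 1
        s["total_ms"] += int(m.group("ms"))
        if m.group("status") != "OK":
            s["errors"] += 1
    return stats
```

From these aggregates, availability and stability indicators (error rate, mean response time) follow directly.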
In order to connect the evaluation figures for the repository network with a broader understanding of the academic infrastructure, especially at the country level, it makes sense to relate them to figures describing the scientific infrastructure. Such indicators have to be discovered, prepared and then related to the basic repository-related metadata. Key figures for the academic landscape include the general population size, the number of scientists and students, and research expenditure per country. These data are available from different stakeholders, especially from international organizations such as the OECD. Normalizing these figures is not a simple task, since the evaluation periods differ and the definition of the data collection varies from country to country. This starting point makes a careful and transparent normalization process necessary in order to keep the comparison fair and well-founded.
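A minimal example of such a normalization is relating document counts to population size. The sketch below is illustrative only; it deliberately skips countries missing in either source rather than guessing values, in the spirit of the transparent normalization described above:

```python
def documents_per_million(doc_counts, populations):
    """Relate repository document counts to population size per country.
    Both inputs are dicts keyed by country code; countries missing in
    either source (or with zero population) are skipped, not guessed."""
    result = {}
    for country, docs in doc_counts.items():
        pop = populations.get(country)
        if pop:
            result[country] = docs / pop * 1_000_000
    return result
```

The same pattern extends to other denominators such as the number of scientists or research expenditure, provided the reference periods are reconciled first.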
At this point, it must be mentioned that differences in the comprehensiveness of metadata, in the normalization processes and in the quality of metadata can introduce a certain degree of vagueness in general.
2.3. Visualization Tools
The task of extracting and normalizing data is one side of the data science oriented activities described above. The other is the task of visualizing the available data infrastructure. Over time, a portfolio of solutions for different purposes has been developed, resulting in certain tools that have been further refined and for which extensive expertise has been built up. It proved advantageous that such approaches in the field of knowledge management were also available in other areas of Bielefeld University Library. This led to a synergetic exchange with other projects (especially in the area of Digital Humanities), which resulted in numerous further implementations.
In the first phase, the focus was on country-based evaluations. Because of its ease of use, Google Charts [19] was used, especially for the output of maps. The results (Figure 3 and Figure 4) proved to be quite usable, so the possible analyses were successively extended and provided with a form-based application for internal purposes. This allows setting graphical parameters such as the color scale and value range, as well as defining content-based filter settings such as OA status, repository type or percentage representation. The experience gained during the implementation has been fed directly back into the system design and its optimization.
This work has been increasingly supplemented by efforts to create graphical analyses for different countries and for individual repositories. The aim is to create convincing visual implementations using appropriate techniques; the open source software D3js [20] is used for this purpose. The next world map (Figure 4) represents a much more demanding implementation, since it covers the temporal change, i.e., the decrease or increase of document numbers per country, from January 2019 to January 2020. In contrast to the previous map, this map illustrates growth rates per country in percentages.
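The underlying growth figures can be derived from two monthly snapshots. A minimal sketch, assuming document counts per country are available as dictionaries for the two dates:

```python
def growth_rates(old_counts, new_counts):
    """Percentage change in document counts per country between two
    snapshots (e.g. January 2019 vs. January 2020). Countries without a
    value in the newer snapshot, or with a zero baseline, are skipped."""
    rates = {}
    for country, old in old_counts.items():
        new = new_counts.get(country)
        if new is not None and old > 0:
            rates[country] = (new - old) / old * 100.0
    return rates
```

These percentages are exactly the values a choropleth map of growth rates would color by.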
Based on metadata describing repositories, aggregation at the country level can give valuable insights. The following diagrams (Figure 5, Figure 6 and Figure 7) use D3js as a visualization framework for descriptive statistics. Figure 5 shows a country profile of the United Kingdom repository landscape, displaying the number of repositories and documents. Additionally, the percentage of OA documents and the distribution of repository types are visualized.
Time series data open up particularly valuable evaluation aspects, as they capture developments and support strategic conclusions. Figure 6 demonstrates the use of a timeline-based visualization for Indonesia. The diagram on the left shows the number of newly indexed repositories per year since 2008, and the one on the right shows the development of the absolute number of repositories and the number of documents. These diagrams rely on aggregated repository figures and give a taste of the significance of timeline-related information. Such diagrams can be computed and displayed for all countries with visible and indexed repositories.
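The "newly indexed repositories per year" series follows directly from the stored first-indexing dates. A minimal sketch, assuming those dates are available as "YYYY-MM" strings:

```python
from collections import Counter

def new_repositories_per_year(first_indexed_dates):
    """Count newly indexed repositories per year from their first-indexing
    dates, given as 'YYYY-MM' strings."""
    return dict(Counter(date[:4] for date in first_indexed_dates))
```

Filtering the input by country before counting yields the per-country timelines shown in the figure.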
The database is fed with specific metadata fields drawn from the BASE index by storing the facets of specific, automatically processed search requests. Figure 7 shows the current distribution of the index fields language and publication type for the repositories of New Zealand (as of 1 February 2020). Both fields were normalized in the pre-processing step before indexing: for language, the ISO 639-3 standard [22] is used, and for publication types, a specific internal vocabulary derived from the metadata content. The pie diagrams are designed with the D3js [20] framework and also integrate tooltips with detailed figures per value.
A similar profile can be computed at the repository level for each resource. Figure 8 shows the basic information for the Bielefeld University repository PUB, demonstrating the distribution of OA and restricted documents and the timeline development of the absolute number of documents and the OA percentage. Obviously, such information, together with content evaluation results at the repository level, can help repository managers gain insights to optimize their systems.