1. Introduction
There is a growing demand to make research freely available to everyone, and this has resulted in several developments over the last few years. First, governments, funders, and institutions are increasingly mandating open access (OA) to research publications. The Norwegian Government introduced a mandate on OA to scholarly articles in August 2017 [1] with the aim of making all Norwegian publicly funded research OA by 2024. Second, the Max Planck Digital Library initiative, OA2020 [2], which seeks to transform the publishing system by replacing the subscription model with an OA model, has been followed by a rise in OA big deals with publishers. Third, social sharing networks have increasingly made scholarly publications freely available, often in breach of copyright regulations [3]. Finally, the website Sci-Hub illegally hosts more than 70 million research articles, providing access to the majority of recently published articles worldwide [4]. Together, these initiatives have resulted in a rapid increase in research publications that are free to read on the Internet.
This is confirmed by large-scale studies on the state of openness (none of these studies include Sci-Hub in their analysis). A study by ScienceMetrix that analyzed articles published between 1996 and 2013 found a global open availability level of over 45% [5]. That study was conducted using a custom automated system that would later become the proprietary database 1findr. Martín-Martín et al. [6] used Google Scholar to analyze levels of openness of articles published between 2009 and 2014 across all countries and fields of research and found that 54.6% were freely available. A similar study by Mikki [7] using Google Scholar, but limited to scholarly articles from Norway between 2011 and 2015, found that 68% were freely available. Piwowar et al. [8] used the Unpaywall database (formerly known as oaDOI) to analyze OA levels. The Unpaywall database only matches articles with DOIs and adheres to a stricter definition of openness by only including articles that are available on journals’ websites or in repositories. Based on three different samples of 100,000 articles, the study by Piwowar et al. [8] estimated that at least 28% of the scholarly literature is OA, with levels up to 45% for 2015. By using the Unpaywall data through the Web of Science (WoS) interface, Bosman and Kramer [9] analyzed articles published between 2010 and 2017. They found increasing levels of OA of up to almost 30% in 2016. The lower levels compared with those reported by Piwowar et al. [8] are due to a policy decision by WoS to exclude non-peer-reviewed article versions (typically in pre-print archives), which are provided by Unpaywall.
The most influential definition of OA comes from the 2002 Budapest Open Access Initiative (BOAI), which defines OA (to peer-reviewed literature) as “free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers” [10]. In a later recommendation, the BOAI distinguishes between gratis and libre OA, where gratis OA removes price barriers, while libre OA also removes permission barriers [11]. The understanding of OA as “free to read” (gratis OA) is commonly held and is also valid for the freely available articles in this study. However, there are questions concerning legality and permanency. Many articles that are currently available might be removed due to copyright violations, and others that publishers have made freely available might be put behind paywalls. Based on these concerns, the term OA will only be used for subsets of the findings in this study. In general, we will use the terms freely available or openly available articles.
The citation impact of open and closed articles has been discussed repeatedly, and a significant advantage has been found for open articles [5,7,12,13]. Previous methodological issues in study design, such as citation normalization, OA time lags, and possible selection bias (the argument that high-quality papers are preferably published openly [14]), have since been taken into consideration and more carefully handled. Thanks to Unpaywall, we are now also able to distinguish between types of openness and to determine which types result in different degrees of impact. So far, gold articles seem to be the least cited (17% below average), followed by closed articles (10% below average) [8], and open journals still struggle with their reputations.
In this study, we investigated three influential services (Google Scholar, oaDOI, and 1findr) and their ability to provide the community with full texts. We based our investigation on the total scholarly article output of all higher education institutions in Norway. The data from oaDOI and 1findr were collected during the winter of 2017/2018, and we are aware that recent changes and improvements have been made to these services, which might lead to different results.
We aimed to answer the following questions:
How large is the share of open availability according to oaDOI?
What type of open availability is provided?
How does the citation impact vary by type of open availability?
How does this share compare to Google Scholar’s free share?
How does this share compare to 1findr’s free share?
How open are the different subject fields?
2. Applied Services, Data, and Method
We assumed that most of the scholarly peer-reviewed publication output of Norwegian higher education institutions is registered in Cristin, the Current Research Information System in Norway. The institutions and researchers have a strong incentive to register their publications in Cristin because governmental funding of higher education institutions in Norway is partly based on publication output. The funding is distributed among institutions based on the number and type of publications registered and reported through Cristin.
To be reported, publications must fulfil the requirements listed in the reporting instructions [15]. In general, this limits the reported publications to original research, including articles, review articles, academic monographs, and anthologies. The institutions have the responsibility to ensure that all relevant publications are registered in Cristin and that only publications that fulfil the above-mentioned requirements are reported. Because national OA requirements apply to articles only, we limited our study to these.
Based on metadata provided by Cristin, we queried Google Scholar, oaDOI, and 1findr. Preferably, we used the article DOI as a unique identifier for querying. For articles with missing or erroneous DOIs, we queried CrossRef to enrich our data set.
We soon learned that our CrossRef query, based on ISSN matches, was somewhat too restrictive. A more flexible combination of metadata, or the use of other APIs, such as those CrossRef offers for matching free-form citations to DOIs, would have been a better approach. Even though we missed some DOIs, we do not expect the results to have been notably affected.
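To illustrate the difference between an ISSN-restricted and a more flexible lookup, the following minimal sketch builds a query URL for the public CrossRef REST API. It is not the query used in this study. The function name `build_crossref_query` and its arguments are hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(title, first_author=None, year=None, issn=None, rows=5):
    """Build a CrossRef /works URL for a free-form bibliographic lookup.

    Hypothetical helper for illustration; not the script used in the study.
    """
    # Combine whatever metadata is available into one free-form citation string.
    parts = [p for p in (first_author, title, str(year) if year else None) if p]
    params = {"query.bibliographic": " ".join(parts), "rows": rows}
    if issn:
        # Restricting to one journal is the stricter variant; leaving it out
        # gives the more flexible query discussed in the text.
        params["filter"] = f"issn:{issn}"
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


flexible = build_crossref_query("Open access in Norway", "Mikki", 2017)
restricted = build_crossref_query("Open access in Norway", issn="1234-5678")
```

The candidates returned by such a query would still need to be checked against the input metadata before a DOI is accepted.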
We scored the results using the open source library SimMetrics, which compares strings by different similarity measures (Cosine and Jaccard) and returns values between 0 (no match) and 1 (complete match). We then weighted the returned similarities by field, giving the greatest weight to the title: Publishing Year (0.5), First Author Name (0.7), and Title (1 each for Cosine and Jaccard). With this weighting, the sum of all similarities was 3.2 for exact matches. After thorough testing and manual checking, we decided that records with values larger than 2.2 should be considered matches and that the available metadata should be added to our set. When multiple results were returned, we selected the result with the highest score. Applying our similarity algorithm, we managed to enrich our dataset with 2663 DOIs (4%).
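The weighting scheme can be sketched as follows. This is a minimal illustration in plain Python rather than SimMetrics, and it rests on assumptions not stated in the text: titles and author names are compared on lower-cased word tokens, the year similarity is 1 for identical years and 0 otherwise, and the author name is compared with Jaccard. The weights and the threshold are the values given above; all function and field names are hypothetical.

```python
import math
import re
from collections import Counter

# Weights and threshold as reported in the text.
WEIGHTS = {"year": 0.5, "author": 0.7, "title_cosine": 1.0, "title_jaccard": 1.0}
THRESHOLD = 2.2


def tokens(text):
    """Lower-cased word tokens (an assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity of token-count vectors, in [0, 1]."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of token sets, in [0, 1]."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_score(record, candidate):
    """Weighted sum of similarities; 3.2 for an exact match."""
    year_sim = 1.0 if record["year"] == candidate["year"] else 0.0  # assumption
    author_sim = jaccard(record["first_author"], candidate["first_author"])  # assumption
    return (
        WEIGHTS["year"] * year_sim
        + WEIGHTS["author"] * author_sim
        + WEIGHTS["title_cosine"] * cosine(record["title"], candidate["title"])
        + WEIGHTS["title_jaccard"] * jaccard(record["title"], candidate["title"])
    )


def best_match(record, candidates):
    """Return the highest-scoring candidate above the threshold, else None."""
    if not candidates:
        return None
    top = max(candidates, key=lambda c: match_score(record, c))
    return top if match_score(record, top) > THRESHOLD else None
```

Under these assumptions, a candidate with an identical title but a different year and author scores 2.0 and is rejected, so a title match alone is not sufficient.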
As shown in Table 1, the share of all documents with a DOI was 82.3%. Of 87,439 articles, 71,953 had or could be allocated a DOI. We also found that the proportion of articles with a DOI increased continuously from 78.8% in 2011 to 89.2% in 2016.
2.1. Applied Services
Of the investigated services, Google Scholar is by far the largest container of scholarly bibliographic records [16]. According to our earlier study, it covers the entire article output of Norway [7]. For Google Scholar, no API is available for querying and extracting data. Instead, an in-house script [17] was developed for automatic web scraping, based either on the article DOI or the article title.
For articles published between 2011 and 2015, we used the data sampled earlier [7]. For articles published in 2016, we repeated the developed routines, adjusted the script according to changes made by Google Scholar, and manually solved the appearing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). Queries were carried out off campus in order to avoid full-text links to library subscriptions. Finally, we carefully compared input and output titles to eliminate false matches.
We only tracked Google Scholar’s best full-text location and disregarded additional versions. This was because we reused data from our earlier study where information on article versions was not collected. In the present study, type of open availability was therefore only investigated through oaDOI and 1findr.
For the provided full-text links, we did not check copyright compliance, the usage licenses involved, or the content actually provided. We are aware that this might have introduced errors, but for practical and time-saving reasons we had to skip this type of manual control. Still, Google Scholar is an important driver for open availability and serves here as a valuable comparative measure.
The focus of this study was on oaDOI and its outstanding ability to distinguish between types of openness. Since our investigation, the service has grown and now harvests over 50,000 publishers and includes about 20 million open documents [18]. The service is run by Impactstory, a nonprofit organization, using open source code, thus making reuse and further development possible. At the time of our study, we queried the older and less enhanced version, oaDOI, using the available API through the OpenRefine software. As the name oaDOI indicates, the service builds solely on DOIs, and publications missing DOIs are not included. Considering only articles with assigned DOIs, the coverage in oaDOI for our dataset was almost complete (99%).
The third investigated service was 1findr by 1science. According to its homepage, 1findr consists of 90 million scholarly articles, of which 27 million are freely available [19]. At the time of the data collection for this study, 1findr was not free for academic use. However, for the purpose of this study the service kindly provided an API for querying. As with Google Scholar, we primarily used the DOI in our queries and otherwise the title. For the title queries, we noticed that the 1findr API had a fuzzy operator (~ as a suffix), and we used this to increase our number of potential matches. We then compared the results from 1findr with the input data using SimMetrics to identify matching articles.
2.2. Categories of Open Availability
In the following discussion on open availability, we stick to the different classifications given by the investigated services. The evidence for openness given by oaDOI is listed in Table 2.
For the purpose of this study, the evidence was grouped into the following categories:
Open Journal: Typically indexed by the DOAJ, the Directory of Open Access Journals.
Open Repository: Open full-text in approved OA repositories.
Open Toll: Free via publisher sites, with or without an open license in a toll-access journal.
Closed: All remaining articles.
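The grouping above can be sketched as a simple mapping from an evidence string to one of the four categories. This is a minimal illustration, not the procedure used in the study: the function name `classify_evidence` is hypothetical, and the prefixes tested ("oa journal", "oa repository", "open", "hybrid") are assumptions about how the evidence strings are worded, apart from "open (via free pdf)", which is mentioned in the text.

```python
from collections import Counter


def classify_evidence(evidence):
    """Map an oaDOI-style evidence string to one of the four study categories.

    The prefixes below are assumed wordings, not a confirmed list.
    """
    if not evidence:
        return "Closed"  # no evidence of openness
    text = evidence.strip().lower()
    if text.startswith("oa journal"):
        return "Open Journal"
    if text.startswith("oa repository"):
        return "Open Repository"
    if text.startswith(("open", "hybrid")):
        # Free via the publisher site in a toll-access journal.
        return "Open Toll"
    return "Closed"


def category_shares(evidence_list):
    """Return the share of each category in a list of evidence strings."""
    counts = Counter(classify_evidence(e) for e in evidence_list)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

Applied to the evidence column of a dataset, `category_shares` gives the distribution across the four categories; any string that matches none of the assumed prefixes falls back to Closed.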
The oaDOI service specifies a particular category for articles that are open (via free pdf) without being licensed as such. Interestingly, this is the largest “open” category, typically consisting of delayed open content, hidden gold access, and last but not least content opened by the publishers themselves (probably to retain their market position [20]). The unclarified status of these articles makes this category questionable concerning the permanence of openness and reuse rights.
The categories of availability as defined in 1findr are shown in Table 3.
The wide categories of GR and UN are problematic because versions of manuscripts might be uploaded in infringement of copyright, and academic social networks (ASNs) might require registration and login before giving access. Nevertheless, we found 1findr useful to determine both coverage and availability.
2.3. Citations and Subject Fields
For the citation analysis, we used article data from 2011 to 2015. Citation counts were extracted from Google Scholar [7], and all articles were allocated a subject field (in total 85 fields) according to the journal they are published in. These subject fields are defined and approved by the national publishing committee [21]. In addition to publishing year, citation data are normalized by these subject fields.
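One common way to normalize citations by year and subject field is to divide each article's citation count by the mean count of all articles in the same field and year. The sketch below assumes this mean-based approach, which the text does not specify, and uses hypothetical field names and data.

```python
from collections import defaultdict


def normalize_citations(articles):
    """Return each article's citations relative to its field-year mean.

    A value of 1.0 means the article is cited exactly at the average of its
    subject field and publishing year (assumed mean-based normalization).
    """
    # (field, year) -> [sum of citations, number of articles]
    totals = defaultdict(lambda: [0, 0])
    for art in articles:
        key = (art["field"], art["year"])
        totals[key][0] += art["citations"]
        totals[key][1] += 1

    normalized = []
    for art in articles:
        cite_sum, count = totals[(art["field"], art["year"])]
        mean = cite_sum / count
        # Groups with no citations at all get 0.0 to avoid dividing by zero.
        normalized.append(art["citations"] / mean if mean else 0.0)
    return normalized


sample = [
    {"field": "Biology", "year": 2011, "citations": 10},
    {"field": "Biology", "year": 2011, "citations": 30},
    {"field": "History", "year": 2011, "citations": 5},
]
scores = normalize_citations(sample)
```

Averaging these normalized values within each type of open availability then allows the types to be compared without bias from differing citation practices across fields and years.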
4. Conclusions
In our study, we compared open availability by service and found that different services returned considerably different results. Google Scholar was by far the largest full-text provider. It linked 70% of the Norwegian article output to a free version, and corresponding shares for 1findr and oaDOI were 52% and 31%, respectively.
The different shares of openness are primarily caused by the services’ different missions and what they choose to include in their results. For example, Google Scholar and 1findr include ASNs, which might contain material that is in violation of copyright, that has weak or unclear OA policies, or that does not guarantee persistent access. The dispute between publishers and ResearchGate illustrates the instability of these platforms as sources of full texts [25].
At the time of our investigation, the open share according to oaDOI was relatively low at 31%. Since then, oaDOI has changed its harvesting routines from mainly harvesting aggregated services to harvesting the original repositories directly, and oaDOI has reorganized its service under the brand Unpaywall. Our first tests of the reorganized service seem to return significantly higher recall than reported in this study, which makes the service more relevant for further use and for services that build upon it. We are also pleased to see that the service has added metadata on available document versions (submitted, accepted, or published). All of these measures are important and help the scholarly community to progress in building a true open-access infrastructure.
We only had a brief look at 1findr. Apart from our concerns about including ASNs as full-text providers, 1findr returns reasonable open shares (52%). It is praiseworthy that the service recently opened its holdings for institutional use.
In future studies, it will be useful to consider OA policies implemented by funders because these mandates are likely to heavily influence the direction of OA. The European Commission has mandated OA to scientific articles in Horizon 2020 [26]. They list two ways to meet the requirements: (1) Self-archiving a peer-reviewed version of an article in a repository (with a maximum of a 6–12 month embargo) or (2) publishing an article as OA and then depositing the published version in a repository. These requirements are in line with the Norwegian Government’s mandate and with mandates in other countries.
The requirement that all articles are to be made available through repositories ensures that articles are permanently OA. This will also make it easier to collect data on OA publishing and to monitor ongoing developments.
Although there is a rise in OA publishing and more publications are being made openly available, we have found that the available content is mainly free to read and not necessarily free to reuse and redistribute. Restriction on reuse hinders an effective research flow and harms the idea of open science. In the digital age and with the described services as important drivers towards OA, the time seems overdue to reshape the publishing scene.