Data Descriptor

A Comprehensive Dataset of the Spanish Research Output and Its Associated Social Media and Altmetric Mentions (2016–2020)

Department of Information and Communication, Faculty of Communication and Documentation, University of Granada, 18071 Granada, Spain
Author to whom correspondence should be addressed.
Received: 12 March 2022 / Revised: 4 May 2022 / Accepted: 5 May 2022 / Published: 7 May 2022


This paper presents data on research publications authored by scientists affiliated with Spanish institutions between 2016 and 2020, along with their associated social media and altmetric mentions, and on researchers affiliated with Spanish institutions whose work is highly mentioned on social media and non-academic outlets. The first dataset contains 219,988 records and 24 attributes. Each observation represents a scientific publication (article, review or letter) extracted from the Web of Science database. For each record, we provide bibliographic metadata, its subject area and a battery of altmetric indicators extracted from Altmetric.com. The second dataset includes 4209 records and four attributes. Each record corresponds to a researcher. For each record, we include their full name, an author identifier (ORCID), their affiliation and their list of publications, which links to the first dataset.
Dataset License: CC-BY.

1. Summary

Altmetrics, alternative metrics based on mentions of scientific publications in social media [1], have been proposed for research evaluation [2]. They still have a long way to go, as there are several limitations attributed to their use [3,4]. However, they offer a different perspective to that offered by citations and can potentially inform on scientific literature consumption beyond academia. The research project “InfluScience—Scientists with social influence: a model to measure knowledge transfer in the digital society” (, accessed on 11 March 2022) was launched with the aim to explore the potential of altmetrics, and to study the social influence of Spanish researchers.
As a result of this project, two datasets have been generated in tab-separated values (tsv) format. The first one includes scientific publications authored by scientists affiliated with Spanish institutions between 2016 and 2020. These were retrieved from Web of Science and InCites, thematically classified, and matched with their altmetric mentions retrieved from Altmetric.com. The second one includes the most influential Spanish researchers.
Using statistical methods, these data allow Spanish scientific activity and the attention it receives in social media to be studied in order to identify patterns, trends and distributions across the different metrics, differentiating by scientific area. The data are of interest to researchers working on scientometrics, altmetrics, science of science or science communication who are interested in analyzing bibliometric and altmetric production at the macro level.

2. Methods

Data were collected from Web of Science, InCites and Altmetric.com on 3 March 2021. We first downloaded from Web of Science, using the Address search field, the publications (articles, editorial material, letters, and proceedings papers) published between 2016 and 2020 that list at least one author with a Spanish affiliation. This query was limited to the main citation indexes in the Web of Science Core Collection: Science Citation Index Expanded (SCI-Expanded), Social Sciences Citation Index (SSCI), Arts & Humanities Citation Index (A&HCI), and Emerging Sources Citation Index (ESCI). A total of 434,827 records were downloaded and exported to InCites in order to reclassify records categorized as multidisciplinary in Web of Science. Not all publications were reclassified by InCites, and 1171 publications had to be assigned manually to a specific Web of Science subject category.
After multidisciplinary publications had been reassigned to specific categories, all publications were classified into the 22 research fields included in the Essential Science Indicators (ESI). We created an equivalence scheme in which each of the 254 subject categories from Web of Science was matched with one ESI field [5]. This classification followed the schema proposed by Tan [6]. Subject categories included in the A&HCI are not integrated into the ESI classification, so we included arts and humanities as an extra research field.
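The equivalence scheme described above can be pictured as a lookup table from subject category to research field. The following is a minimal sketch only: the category-to-field pairs, the set of A&HCI categories and the names `WOS_TO_ESI`, `AHCI_CATEGORIES` and `classify` are illustrative assumptions, not the actual 254-category scheme used in the project.

```python
# Illustrative subset only: these pairs are assumptions for demonstration,
# not the full equivalence scheme built for the project.
WOS_TO_ESI = {
    "Oncology": "Clinical Medicine",
    "Astronomy & Astrophysics": "Space Science",
    "Economics": "Economics & Business",
}

# Assumed examples of categories covered by the A&HCI. They have no ESI
# field, so they are sent to the extra arts and humanities field.
AHCI_CATEGORIES = {"Philosophy", "Literature", "Art"}
EXTRA_FIELD = "Arts & Humanities"


def classify(wos_category: str) -> str:
    """Return the research field assigned to a Web of Science category."""
    if wos_category in AHCI_CATEGORIES:
        return EXTRA_FIELD
    if wos_category not in WOS_TO_ESI:
        # Unmapped categories are reported rather than silently guessed.
        raise ValueError(f"No field mapped for category: {wos_category!r}")
    return WOS_TO_ESI[wos_category]


print(classify("Oncology"))    # Clinical Medicine
print(classify("Philosophy"))  # Arts & Humanities
```

Raising an error for unmapped categories mirrors the need, noted above, to assign some publications manually instead of leaving them unclassified.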
To retrieve altmetric mentions, we used the Digital Object Identifier (DOI) assigned to each publication to query Altmetric.com and obtain all tracked publications and their mentions. Of the 406,621 records that included a DOI (93.51% of the 434,827 records downloaded), 238,508 were indexed in Altmetric.com (54.85% of the total).
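The coverage percentages can be checked directly from the record counts reported in this section. The short sketch below recomputes them; the variable and function names are ours, and the last figure (the share of DOI-bearing records that are indexed) is derived here and not reported in the text.

```python
total_records = 434_827  # records downloaded from Web of Science
with_doi = 406_621       # records that include a DOI
indexed = 238_508        # records with a DOI that are indexed in the altmetric source


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


doi_share = share(with_doi, total_records)     # reported as 93.51%
indexed_share = share(indexed, total_records)  # reported as 54.85%
indexed_share_of_doi = share(indexed, with_doi)

print(doi_share, indexed_share, indexed_share_of_doi)
```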

3. Data Description

3.1. Publications Dataset

The first dataset contains scientific publications from the 2016–2020 period that have at least one author with Spanish affiliation and were mentioned at least once in the Altmetric.com database. It includes 219,988 records, each being a publication, and 24 variables, including bibliographic information and mentions received. Each publication is assigned to a subject area based on the ESI schema (Essential Science Indicators) provided by Clarivate. Additionally, we include the Arts and Humanities subject area. In total, 18 altmetric variables are provided: an aggregated score (Influscore) and 17 indicators corresponding to different social and non-academic sources (e.g., Twitter mentions, Facebook mentions, news media). The Influscore is the Altmetric Attention Score (AAS) [7] provided by Altmetric.com on 3 March 2021. These variables are detailed in Table 1. Figure 1 shows the volume of publications retrieved with and without altmetric mentions, differentiated by ESI field, to reflect the coverage of the dataset provided.
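As a rough illustration of how the publications file could be read, the sketch below parses tab-separated rows with Python's standard `csv` module and splits the semicolon-separated `esi` field. It uses a small in-memory sample with a subset of the Table 1 column names; the sample values and the helper name `read_publications` are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample with a subset of the Table 1 columns.
# The values are invented for illustration only.
SAMPLE_TSV = (
    "id\ttitle\tyear\tesi\tinfluscore\tnews\ttwitter\n"
    "1\tExample paper A\t2018\tClinical Medicine;Immunology\t12\t1\t9\n"
    "2\tExample paper B\t2020\tPhysics\t3\t0\t4\n"
)

INTEGER_COLUMNS = ("id", "year", "influscore", "news", "twitter")


def read_publications(handle):
    """Parse tab-separated publication rows into typed dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in INTEGER_COLUMNS:
            row[column] = int(row[column])
        # A publication may belong to several fields, separated by semicolons.
        row["esi"] = [f.strip() for f in row["esi"].split(";") if f.strip()]
        rows.append(row)
    return rows


publications = read_publications(io.StringIO(SAMPLE_TSV))

# Count publications per field; multi-field papers count once in each field.
field_counts = Counter(f for pub in publications for f in pub["esi"])
print(field_counts)
```

With the real file, the `io.StringIO` object would be replaced by an opened file handle, and `INTEGER_COLUMNS` would be extended to the remaining indicator columns.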

3.2. Authors Dataset

The second dataset includes the top 250 most influential authors at the general level and for each of the ESI fields based on their Influscore. Their information has been reviewed and normalized to produce the Influscience ranking (, accessed on 11 March 2022). For the disambiguation of authors and institutions, we used the algorithm proposed by Caron and van Eck [8]. The dataset is composed of a total of 4209 observations, each being a researcher affiliated to a Spanish institution, and four variables including bio data and linking to the publication dataset. The variables of the second dataset are detailed in Table 2.
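The link between the two datasets runs through the semicolon-separated list of publication identifiers held for each author. The sketch below shows one way to follow that link. The sample rows, the placeholder ORCID values and the `author_scores` helper are hypothetical, and summing the Influscore of each author's publications is an assumption made for illustration, not necessarily the aggregation used to build the ranking.

```python
import csv
import io

# Hypothetical sample rows; names, ORCID values and ids are placeholders.
AUTHORS_TSV = (
    "name\torcid\torganization\tpublications\n"
    "Jane Doe\t0000-0000-0000-0000\tExample University\t1;2\n"
    "John Roe\t0000-0000-0000-0001\tExample Institute\t2\n"
)

# Stand-in for the publications dataset: publication id -> influscore.
INFLUSCORE_BY_ID = {1: 12, 2: 3}


def author_scores(handle, influscore_by_id):
    """Sum the influscore of the publications listed for each author."""
    scores = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        pub_ids = [int(p) for p in row["publications"].split(";") if p.strip()]
        # Identifiers missing from the lookup contribute nothing.
        scores[row["name"]] = sum(influscore_by_id.get(p, 0) for p in pub_ids)
    return scores


scores = author_scores(io.StringIO(AUTHORS_TSV), INFLUSCORE_BY_ID)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```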

Conceptualization, D.T.-S.; methodology, D.T.-S.; validation, N.R.-G.; data curation, W.A.-M.; writing—original draft preparation, W.A.-M.; writing—review and editing, D.T.-S. and N.R.-G.; visualization, W.A.-M.; supervision, N.R.-G. All authors have read and agreed to the published version of the manuscript.


This work was funded by the Spanish Ministry of Science and Innovation, grant numbers PID2019-109127RB-I00/SRA/10.13039/501100011033 and PID2020-117007RA-I00, and by the Regional Government of Andalusia (Junta de Andalucía), grant number A-SEJ-638-UGR20. Wenceslao Arroyo-Machado has an FPU Grant (FPU18/05835) from the Spanish Ministry of Universities. Daniel Torres-Salinas is supported by the Reincorporation Programme for Young Researchers from the University of Granada. Nicolas Robinson-Garcia is funded by a Ramón y Cajal grant from the Spanish Ministry of Science and Innovation (REF: RYC2019-027886-I).

Data Availability Statement

The dataset is openly available in figshare. It does not include any data directly extracted from any Clarivate platform (e.g., Web of Science), as these have only been used as intermediary sources to identify publications authored by Spanish researchers and then to recover their mentions from Altmetric.com. In this sense, the authors' names of the publications have been normalized and clustered, and a new paper subject classification has been implemented. With regard to Altmetric.com, a licensing agreement allows us to share this dataset.

Figure 1. Distribution of publications of Spanish researchers (2016–2020) with and without altmetric mentions by ESI field.
Table 1. Description of the variables of the publication dataset.
Variable | Type | Description
id | Integer | Unique publication identifier
title | Character | Full title of the publication
year | Integer | Year of publication
type | Character | Document type
journal | Character | Name of the journal
esi¹ | Character | ESI category of the publication
influscore | Integer | AAS value on 3 March 2021
news | Integer | Number of mentions in news media
blogs | Integer | Number of mentions in blogs
policy | Integer | Number of mentions in policy reports
patent | Integer | Number of mentions in patents
twitter | Integer | Number of mentions in Twitter
post_peer | Integer | Number of mentions in PubPeer and Publons
weibo² | Integer | Number of mentions in Weibo
facebook | Integer | Number of mentions in Facebook
wikipedia | Integer | Number of mentions in Wikipedia
google² | Integer | Number of mentions in Google+
linkedin² | Integer | Number of mentions in LinkedIn
reddit | Integer | Number of mentions in Reddit
pinterest² | Integer | Number of mentions in Pinterest
f1000 | Integer | Number of mentions in F1000
stack_overflow | Integer | Number of mentions in Stack Overflow
youtube | Integer | Number of mentions in YouTube
syllabus | Integer | Number of mentions in Open Syllabus Project
¹ When a publication belongs to more than one category, values are separated by semicolons. ² Only historical data.
Table 2. Description of the variables of the authors dataset.
Variable | Type | Description
name | Character | Full name of the researcher
orcid | Character | ORCID record
organization | Character | Name of the institution of affiliation
publications¹ | Character | List of publication identifiers (id) connecting with the publication dataset
¹ When there is more than one identifier, values are separated by semicolons.
