Big Data in Education. A Bibliometric Review

The handling of large amounts of data to analyze certain behaviors gained great popularity in the decade 2010–2020. This phenomenon has been called Big Data. In the field of education, the analysis of these data, generated largely by students, has begun to be introduced in order to improve the teaching–learning process. The objective of this paper was to analyze the scientific production on Big Data in education in the databases Web of Science (WOS), Scopus, ERIC, and PsycINFO. A bibliometric study was carried out on a sample of 1491 scientific documents. Among the results, the increase in publications in 2017 and the emergence of certain journals, countries, and authors as references in the subject matter stand out. Finally, potential explanations for the study findings and suggestions for future research are discussed.


Introduction
Big Data is a concept that is currently in fashion and has been present in the specialized literature for more than a decade. It alludes to the large amount of data generated at every moment as a result of technological evolution and the interactions of people in digital spaces (Waller and Fawcett 2013). However, it is only recently that it has reached its peak as an object of research, as a result of technological advances and the development of platforms for interaction among users and between users and content, which have produced an enormous amount of data (Ghani et al. forthcoming). Specifically, Big Data refers to the large volume of data generated by the development of technology and the continuous actions and interactions of users in digital environments (Hussain and Cambria 2018). Other concepts related to Big Data are data learning mining and learning analytics. Data learning mining comprises the techniques and procedures used to extract useful and relevant information from the large amount of data reported by educational platforms (Menon et al. 2017). Learning analytics, in turn, is a construct derived from data mining that alludes to the management, processing, and analysis of students' educational data, which are studied with the purpose of improving and optimizing the learning process (Liang et al. 2016).
That is why society today is in what experts call the Big Data era, which brings new challenges and benefits through the analysis of all the data generated in highly quantified environments (Pugna et al. 2019).
Since the arrival of the new millennium, services such as the Internet and the Web have recorded data on users, their movements, and their interactions, creating a large bank of useful and relevant information whose analysis offers great potential for studying the needs and demands of people (H. Chen et al. 2012; Khan et al. 2018).
Technological development and the emergence of popular social networks have led people to become active agents in digital media, exponentially multiplying the amount of data generated (Ni et al. 2016).
All this has led to great interest among researchers in studying the enormous presence of data in all aspects of people's lives (Williamson 2015; Williams et al. 2017). Thus, the European Commission stated that the Horizon 2020 report would be a major step towards the study of Big Data, with the aim of developing strategies to conduct research and innovation in this field of knowledge (Jin et al. 2015).
The purpose of Big Data analysis is to collect a set of data from various electronic sources to be transformed into relevant information in order to improve the services which the user habitually accesses (Jagadish 2016).
Big Data is nourished by an era marked by the connectivity of people (Veltri 2017), where creating content, sharing it, and interacting with other users in the community are the order of the day (Hussain and Cambria 2018). This provides a great opportunity to know, in addition to people's needs, their psychological state and their behaviour in virtual spaces (Eichstaedt et al. 2015).
Given the peculiarities of the society in which we live, data are growing at great speed (Nuaimi et al. 2015). So much so that volume, velocity, variety, veracity, and value are already spoken of as fundamental characteristics inherent to Big Data. These data have a disorganized structure and come in various formats such as text, image, voice, and video (Injadat et al. 2016).
In order to analyse all the data in the digital environment, the concept of data science arises with the intention of managing and interpreting the data by means of specialised programmes with high processing capacity (Hicks and Irizarry 2018). These developments have led to the evolution of predictive analytics (Waller and Fawcett 2013), which adapts services to the current trends demanded by users (Saiki et al. 2018). Therefore, the data are used to predict and make decisions about the future (Ghani et al. forthcoming), based on a strategic design that analyzes the requirements of the audience (Perlado-Lamo-de-Espinosa et al. 2019).
According to Moreno-Carriles (2018), the literature reveals that the treatment of Big Data has expanded into different fields of action, such as security, customer service, public services, preservation of the environment, the economy, and finance, in addition to education, which is the field of interest in this study.
Big Data, which has mainly been exploited in the business world, is now also being widely used in education (García-Aretio 2017), placing us in a new phase of teaching and learning based on the study of data generated by students (Gibson 2017).
All the data derived from the different educational agents (teachers and learners) are currently being processed in order to improve the quality and experience of learning processes in digital environments (Liang et al. 2016).
Likewise, the data produced by educational content management platforms are being used to develop tools and services adapted to the singularities of contemporary education, which is highly conditioned by the development of educational technology (Merceron et al. 2015). The immersion of students in distance and ubiquitous education has generated a great flow of data about their activity (Seufert et al. 2019).
However, experts such as Menon et al. (2017) consider that data mining techniques in the field of education are, to this day, not completely successful, so not all meaningful and valuable information is extracted. This is because the handling and treatment of Big Data requires teachers to collaborate with specialists in order to obtain the relevant information from the data reported by the use of digital tools and resources of an educational nature (Huda et al. 2017). These tools allow learners to perform all kinds of actions in virtual spaces, and the data generated are used to obtain knowledge about their activity, performance, and satisfaction (Elia et al. 2019).
An effective analysis of Big Data contributes to the promotion of new and better educational experiences (Reidenberg and Schaub 2018). It also improves teachers' didactic programming tasks, with the help of scientists specializing in data analysis, and supports an efficient selection of strategies and decisions for approaching the formative process, suited to the demands of a learning group that is increasingly familiar with technology and seeks innovative learning as a result of the study of data (Huda et al. 2018). All of this is based on a predictive analysis of the data collected (Daniel 2015; Daniel 2017).
Therefore, Big Data and the analytics of the interactions of educational agents in virtual environments are positioned as new ways to solve the shortcomings of the educational system (Picciano 2012), improving productivity, innovation (Sanchez et al. 2015), and the personalisation of learning (Dishon 2017). Accordingly, the objective of this study was to analyze the scientific output, understood as the published articles, on Big Data in education in the Web of Science (WOS), Scopus, ERIC, and PsycINFO databases. The following research questions were identified:
RQ1. What is the state of scientific production over time?
RQ2. Which journals and countries concentrate the greatest scientific production on Big Data in education?
RQ3. Which are the articles of greatest impact in the area of Big Data in education?
RQ4. What are the main lines of research in this field that are derived from the keywords of scientific articles?

Method
This study follows a bibliometric analysis methodology (Glänzel and Schoepflin 1999). Following the guidelines and criteria of bibliometrics (Ardanuy 2012), the combination of keywords "Big Data" AND education was first established. This combination was entered into the search engine of each database. The scientific production in article format from 2010 to 2018 was thus collected. The search took place during the second quarter of 2019, so all literature indexed up to and including 2018 is covered.

Sample
The unit of analysis was composed of the scientific articles indexed in WOS, Scopus, ERIC, and PsycINFO that included the terms Big Data and education in the title, abstract, or keywords. The final sample consisted of journal articles that met the inclusion and exclusion criteria. The inclusion criteria were: (i) scientific articles published in peer-reviewed journals; (ii) year of publication from the first appearance of the term in the literature until 2018; (iii) search descriptors appear in the title, abstract, or keywords; (iv) published in English. The exclusion criteria were: (i) documents not subject to exhaustive peer review (reviews, theses, books, book chapters, conference proceedings, or technical reports); (ii) articles that did not fall within the time period; (iii) descriptors not included in the title, abstract, or keywords; (iv) language of publication other than English.
After applying these criteria, the sample comprised 1491 documents: 491 in WOS; 706 in Scopus; 174 in ERIC; and 120 in PsycINFO.
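The composition of the sample can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the per-database counts reported above; the variable names are illustrative, and the percentage shares are derived here, not reported by the study. The reported total equals the simple sum of the four counts.

```python
# Per-database article counts as reported in the Sample section.
counts = {"WOS": 491, "Scopus": 706, "ERIC": 174, "PsycINFO": 120}

# Sample size obtained by adding the four database counts.
total = sum(counts.values())

# Share of the sample contributed by each database, in percent (derived here).
shares = {db: round(100 * n / total, 2) for db, n in counts.items()}

for db, n in counts.items():
    print(f"{db}: {n} articles ({shares[db]}% of the sample)")
print(f"Total: {total}")
```

Under this reading, Scopus contributes the largest part of the sample and PsycINFO the smallest.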

Data Analysis
Data analysis was performed on information extracted from the four databases. Excel and VOSviewer version 1.6.7 (Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands) were used to support the analysis and graphical representation of the data.
The analysis variables were established from a review of previous bibliometric studies in the area of social sciences with a topic similar to the object of study (Batanero et al.): (i) publications by year; (ii) journals and countries with the highest number of articles; (iii) articles with the greatest impact; and (iv) keywords.
In addition, the bibliometric laws of Price and Bradford were applied to verify diachronic productivity, i.e., productivity over the years (Price 1986) and to establish the nucleus formed by the journals with the largest number of articles (Urbizagástegui 2016).
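The zoning step behind Bradford's law can be sketched as follows. This is a minimal illustration in Python, not the procedure the study itself used: the function name `bradford_zones` and the toy journal counts are hypothetical, and the only values taken from the paper are the per-database article counts from the Sample section, which are divided by five zones to obtain the target number of articles per zone.

```python
from typing import Dict, List


def bradford_zones(articles_per_journal: List[int], n_zones: int = 5) -> List[List[int]]:
    """Split journals, ranked by productivity, into zones of roughly equal article totals.

    Returns one list of per-journal article counts per zone; zone 0 is the core.
    """
    ranked = sorted(articles_per_journal, reverse=True)
    target = sum(ranked) / n_zones  # mean number of articles per zone
    zones: List[List[int]] = [[] for _ in range(n_zones)]
    zone, running = 0, 0
    for count in ranked:
        # Advance to the next zone once the cumulative total reaches the zone boundary.
        while zone < n_zones - 1 and running >= target * (zone + 1):
            zone += 1
        zones[zone].append(count)
        running += count
    return zones


# Target articles per zone, from the article counts reported in the Sample section.
reported: Dict[str, int] = {"WOS": 491, "Scopus": 706, "ERIC": 174, "PsycINFO": 120}
zone_means = {db: n / 5 for db, n in reported.items()}
print(zone_means)

# Hypothetical toy data (not from the study): articles published by seven journals.
toy_zones = bradford_zones([3, 12, 3, 6, 3, 6, 3], n_zones=3)
print([len(z) for z in toy_zones])  # number of journals in each zone
```

In the toy example each zone holds the same number of articles, while the number of journals needed to reach that total grows from zone to zone, which is the pattern Bradford's law describes.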

Results
Publications are mostly concentrated between 2016 and 2018, a period covering 75.92% of the articles published on Big Data in education. The topic first appears in the literature in 2010, although a steady flow of publications begins in 2012 (Table 1). Price's law establishes that the scientific literature tends to double after 10 years, and it distinguishes three stages in the development and consolidation of a literature (Price 1986). In this respect, the premise of the doubling of the literature is confirmed, since the literature in 2018 is far larger than in 2010 (Figure 1). Looking at the graph, Price's development stages are set between 2010 and 2012 (precursor stage) and from 2012 to the present day (exponential growth stage).
Bradford's law, in turn, indicates that when articles are distributed equally across zones, a small cluster of journals (the core) collects a quantity of articles equivalent to that of each of the remaining zones (Urbizagástegui 2016). This is confirmed in the literature published on Big Data in education, where a small group of journals collects the most articles on this topic (Figure 2). Specifically, WOS contains 346 journals and 491 articles, distributed in five zones with approximately the same number of articles (M = 98.2); the core, formed by 16 journals, contains a similar number of documents to each of the other zones. In Scopus, 419 journals and 706 articles are distributed in five zones (M = 141.2), and the core consists of six journals. In ERIC, there are 112 journals and 174 articles, grouped into five zones (M = 34.8), with a core of four journals. Finally, PsycINFO groups 83 journals and 120 articles into five zones (M = 24), with a core of five journals. The journals that make up the core have a much higher than average number of articles.
Among them, some journals coincide in the core of more than one database: Agro Food Industry and International Journal of Emerging Technologies in Learning (iJET) in WOS and Scopus; Theory and Research in Education in WOS and ERIC; and Behaviour & Information Technology (BIT) in WOS and PsycINFO (Table 2).
In relation to the countries with the highest scientific output, the top 10 were collected in each database. The United States stands out above all others, being the country with the largest number of documents (30.65% of total publications on Big Data in education). China has the second-largest collection of articles (21.66%) and the United Kingdom is in third position (10.26%). These are followed, in descending order, by Australia (4.35%), Canada (3.62%), India (2.74%), South Korea (2.41%), Germany (2.34%), Italy (1.81%), Sweden (1.40%), Japan (1.34%), Saudi Arabia (1.20%), and Brazil (0.60%) (Table 3).
As for the articles with the greatest impact, measured by the number of citations, the criterion was that they had more than 100 citations. Six articles met this criterion (Table 4). The first, with 2921 citations in WOS and Scopus, is "Business intelligence and analytics: From Big data to big impact", in which the authors analyse how Big Data influences society and, specifically, business, and identify the challenges and opportunities associated with business research and education (H. Chen et al. 2012). It is followed by "Data Science, Predictive Analytics, and Big Data: A Revolution That Will Transform Supply Chain Design and Management", with 562 citations, a study of how Big Data helps improve supply chain management, which shows that these terms are relevant to supply chain research and education (Waller and Fawcett 2013).
"Psychological Language on Twitter Predicts County-Level Heart Disease Mortality", with 332 citations, analyses Twitter data to predict cardiovascular disease; education is significant in relation to the Big Data analysis used to predict this pathology (Eichstaedt et al. 2015). "Applications of big data to smart cities", with 230 citations, applies Big Data to the improvement of urban services, highlighting improvements in the public education service (Al Nuaimi et al. 2015). "Big Data and analytics in higher education: Opportunities and challenges", with 179 citations, analyses the advantages and challenges of applying Big Data in university education (Daniel 2015). Finally, "The evolution of big data and learning analytics in American higher education", with 125 citations, gathers the advances in data analytics technology in higher education in America (Picciano 2012).

Finally, the network map of keywords reflects the relations generated between them (Figure 3). The size of each word indicates its frequency of appearance and the number of its connections with other descriptors. There are three distinct clusters, each with a different colour (red, blue, and green). The red cluster is led by the concept "approach", with the descriptors "analytic", "science", "article", and "technique" prevailing. The blue cluster is headed by "platform", highlighting the keywords "innovation", "service", "person", "factor", "health", and "industry". Finally, in the green cluster, "student" stands out and includes descriptors linked mainly to education: "teaching", "university", "problem", "content", "attention", "training", and "algorithm".
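The raw quantities behind a keyword map of this kind can be illustrated with a short sketch. The Python below counts how often each keyword occurs and how often each pair of keywords appears in the same article. The keyword lists are hypothetical examples built from descriptors named above, not data from the study, and the sketch does not reproduce VOSviewer's normalization, layout, or clustering.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per article (illustrative only, not study data).
articles = [
    ["student", "teaching", "algorithm"],
    ["student", "university", "teaching"],
    ["platform", "innovation", "service"],
    ["platform", "service", "health"],
]

occurrences = Counter()    # articles mentioning each keyword (drives word size)
cooccurrences = Counter()  # articles mentioning both keywords of a pair (drives links)

for keywords in articles:
    unique = sorted(set(keywords))  # drop duplicates and fix the order inside each pair
    occurrences.update(unique)
    cooccurrences.update(combinations(unique, 2))

print(occurrences.most_common(3))
print(cooccurrences.most_common(3))
```

Keywords that co-occur often, such as "student" and "teaching" in this toy input, are the ones a clustering step would tend to place in the same group.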

Discussion and Conclusions
Coinciding with the technological revolution we have been witnessing in recent decades and, in particular, with the rise of information and communication technologies, a scenario of constant change has emerged in which the generation of data and the tools responsible for its treatment and management are increasingly important. Education cannot remain alien to this reality. After some years of reflection and analysis, education professionals and scholars are beginning to realise that these data will make it possible to obtain substantial, valuable, and detailed information about the way in which the agents involved (students, teachers, and families) are developing the teaching-learning processes. This makes it possible to determine how these processes are being implemented in each of their phases and levels, and to articulate the corrective measures and mechanisms needed to achieve high levels of quality and efficiency. It also opens the possibility of individualizing the process and adapting it to the characteristics, needs, and interests of each student (Asur and Huberman 2010; M. Chen et al. 2014; Provost and Fawcett 2013).
In spite of the great potential of Big Data, it seems clear that, at present, the field of education is not obtaining all the benefit that would be desirable in terms of data collection, individualization, and improvement of the quality and efficiency of teaching-learning processes. As this is such a young and technological stream of thought, it requires the mastery and application of a wide repertoire of computer and technological skills and competencies. Unfortunately, most teachers do not have these skills, which often leads to poor and inappropriate use of Big Data, with the consequent damage to the efficiency and significance of student learning (Genevieve et al. 2015; Shum and Ferguson 2012).
At this point, it seems appropriate to insist on the need for specific training and qualification plans oriented towards the knowledge of the main technological skills, abilities, and competencies (Gorospe et al. 2015; Correa 2015; Dussel 2012).
In answer to the first of the questions posed in this study (What is the state of scientific production over time?), it should be noted that this is a young phenomenon that has only recently come into being. This is demonstrated by the fact that the first scientific publications on the subject did not appear until 2010. It is no less true, however, that from 2012 to the present they have increased exponentially, as a result of the boom that this phenomenon has experienced in the fields of business, social networks, and education (Bennett 2015).
Most of the research related to the study of Big Data (the second question of this study) is concentrated, as far as the publication and dissemination of scientific results are concerned, in very specific journals located in countries or environments with a clear English-speaking or Anglo-Saxon tradition. This circumstance relates to the fact that these are some of the major environments in which these new currents of thought are most widely developed, rooted, and consolidated, to the point that, in recent years, they have become an outstanding channel of dissemination to the rest of the developed countries of these new approaches to the treatment and management of data, as elements of clear individualization, improvement, and efficiency of the teaching-learning processes (Sierra-Caballero 2013).
By country, and in close agreement with the ideas outlined in the preceding paragraph, the United States is the country with the highest production of scientific publications related to Big Data, followed by China, the United Kingdom, Australia, and Canada. Once again, there is evidence of the progressive development that the trend of technological thought related to Big Data has been experiencing in Anglo-Saxon countries, to the point that they have become the main cultural window for the development of these tools, especially in the field of education and, more specifically, in the teaching-learning processes (DatAnalysis 15M 2013).
The only exception among the hegemonic countries in the use and dissemination of Big Data is China, which appears in second place. Although this departs somewhat from the basic pattern, since China is neither a country of Anglo-Saxon culture nor English-speaking, it is not surprising. It is well known that China is a leading and technologically advanced nation that has even become a pioneer in the development and implementation of many of the most important technological advances that end up reaching the main developed countries, including the United States itself (Medici 2009).
Although Germany does not occupy a predominant role with respect to the use, handling, and expansion of the technological tools associated with Big Data, it acts as the great gateway to Europe for the main technological advances related to the treatment of complex chains of data and information. As in the case of China, this result is not surprising because, with regard to the European continent, Germany represents the flagship of economic and technological prosperity. It stands well above most of the countries that make up the European Union, and therefore becomes the main introducer and engine of all kinds of advances, as well as a clear model to imitate (Hernández 2012).
With regard to the articles with the greatest impact linked to Big Data (the third question of this study), those that deal with topics clearly related to the field of business and production processes stand out, followed by those linked to the health field, in particular to the improvement of people's health and quality of life in elderly individuals. However, recent studies that analyze the benefits of Big Data as a tool for collecting data on the development, design, and implementation of teaching-learning processes in their different phases and levels are gaining much prominence. Their aim is to give these processes greater quality, efficiency, and significance, and to articulate the means and strategies that make it possible to individualize them and adapt them to the characteristics, needs, and interests of the students. In this way, each student receives what they need during their development and can carry out the whole teaching-learning process with high levels of efficiency and quality, which contributes to the significance of the learning (Área 2011; Salazar 2016).
Ultimately, the main lines of research linked to the phenomenon of Big Data (the fourth question of this study) coincide almost completely with the most important topics addressed by the scientific articles of greatest impact. This is evidenced by the fact that some of the main lines of research on Big Data appear as central topics in the clusters. However, a promising and very current line of research also appears, which focuses on the figure of the student and places special emphasis on the knowledge of didactic methodologies and strategies. This line insists on the convenience and the need to evaluate the way in which the teaching-learning processes are being developed, in order to articulate, design, and implement individualized intervention procedures adapted to the needs of the student (Dussel 2014; Martín-Barbero 2012).