Article

Emergence and Evolution of ‘Big Data’ Research: A 30-Year Scientometric Analysis of the Knowledge Field

Urban ‘Big Data’ Centre, University of Glasgow, Glasgow G12 8RZ, UK
*
Author to whom correspondence should be addressed.
Metrics 2025, 2(3), 15; https://doi.org/10.3390/metrics2030015
Submission received: 13 March 2025 / Revised: 22 July 2025 / Accepted: 31 July 2025 / Published: 13 August 2025

Abstract

In the ongoing ‘data revolution’, the ubiquity of digital data in society underlines a transformative era. This is mirrored in the sciences, where ‘big data’ has emerged as a major research field. This article significantly extends previous scientometric analyses by tracing the field’s conceptual emergence and evolution across a 30-year period (1993–2022). Bibliometric analysis is based on 17 data categories that co-constitute the conceptual network of ‘big data’ research. Using Scopus, the search query resulted in 70,163 articles and 315,235 author keywords. These are analysed aggregately regarding co-occurrences of the 17 data categories and co-occurrences of data categories with author keywords, and regarding their disciplinary distributions and interdisciplinary reach. Temporal analysis reveals two major development phases: 1993–2012 and 2013–2022. The study demonstrates: (1) the rapid expansion of the research field concentrated on seven main data categories; (2) the consolidation of keyword (co-)occurrences on ‘machine learning’, ‘deep learning’, ‘artificial intelligence’ and ‘cloud computing’; and (3) significant interdisciplinarity across four main subject areas. Scholars can use the findings to combine data categories and author keywords in ways that align scholarly work with specific thematic and disciplinary interests. The findings could also inform research funding, especially concerning opportunities for cross-disciplinary research.

1. Introduction

As we steer through the ‘data revolution’ [1,2], the ubiquity of digital data in society underlines a transformative era. From the 1990s, when a major shift occurred from a mainly analogue to a predominantly digital age [3,4], ‘big data’ started to play an increasingly central role owing to technological innovations that changed how we collect, store, process and interpret data: computers became much faster at processing data, data storage increased exponentially, and cloud computing decentralized the use and management of data [5]. The advent of the World Wide Web in the early 1990s, the introduction of distributed computing frameworks (e.g., Hadoop) in the mid-2000s, the surge of smartphones and Internet of Things (IoT) devices in the 2010s, and more recently, advances in machine learning and artificial intelligence; together, these developments have rendered data increasingly ubiquitous and pervasive across societies. Significantly, the transformation has been as much societal as it has been technological, evidenced by the widespread adoption of digital systems and practices in industry, commerce, work, public policy (health, education, housing, etc.), government, as well as social life.
Reflecting the data revolution taking place in the real world, ‘big data’ has also firmly established itself in the scientific world and become a major research field of its own. Consequently, in this study, we focus on how ‘big data’ has emerged and evolved as an academic field of inquiry. In line with common definitions, we understand ‘big data’ to refer to data that is too large or complex to be dealt with by traditional data-processing application software, and that can be described in terms of ‘3Vs’ (volume, velocity, variety) or ‘5Vs’ if adding veracity and value [6]. Drawing on conceptual network analysis used previously in scientometric studies [7,8], we systematically selected a set of 17 interrelated data categories to identify the conceptual origins and trace the evolution and growth of the research field. Apart from ‘big data’ itself, this includes ‘digital data’, ‘novel data’, and ‘intelligent data’ among other categories (see Section 3.1, below, for the full list). These data categories variously denote digitally sourced data and act as conceptual markers used by scientists to describe and define the collective research field.
Importantly, ‘big data’ has a dual role in academic research: it is both a capacious tool for and an object of inquiry. This study examines both aspects, since they jointly constitute ‘big data’ as a research field. As a tool, ‘big data’ involves using large datasets and novel data types (e.g., online-sourced data, user-generated data) to address diverse research questions and develop solutions and practices across disciplines, from computational biology to urban planning, and from diagnostic medicine to digital arts and humanities. To enable this, much research has focused on the development of new methodological approaches to collecting, analysing, and disseminating ‘big data’. Meanwhile, as an object of inquiry, ‘big data’ enables critical reflection on issues such as data ethics, privacy, and power dynamics, which are core concerns of critical data studies. This dual understanding (tool and object) highlights the importance of ‘big data’ in both shaping and being shaped by scientific, technological, and socio-political realities and, in turn, being central to contemporary knowledge production. It further highlights the importance of ‘big data’ as a methodological innovation process, as the knowledge field has grown essentially through methodological developments and applications. Accordingly, this study encompasses the full range of scientific works that make use of, and relate to, ‘big data’, from computational, engineering, and medical works that employ ‘big data’ to analyse given phenomena to socio-cultural studies that examine and reflect on the impacts of new data policies and practices.
The article builds on several recent bibliometric analyses that have sought to characterize and demarcate ‘big data’ as a research domain [9,10,11,12,13,14]. These studies quantitatively analyze the corpus of scientific literature on ‘big data’, for example by examining publication rates, citation networks and keyword compositions. This allows for the identification of foundational works, emerging trends, and pivotal shifts in the scientific discourse over time. Moreover, it can reveal interdisciplinary linkages and the diffusion of ‘big data’ concepts across different subject areas. Mapping publication trends, conceptual shifts and interdisciplinary linkages, such analyses help to identify and anticipate emerging research themes and underexplored areas, thus enabling researchers and policymakers to prioritize key topics. For instance, scientometric insights can inform funding decisions, shape institutional research strategies and support the development of targeted educational programs. In the context of ‘big data’, these insights are particularly valuable in navigating its rapid evolution and guiding its practical applications across disciplines like healthcare, urban planning, engineering, and social sciences.
Our study significantly extends these earlier scientometric works in two main ways: first, by enlarging the timeframe to a 30-year period, from 1993 to 2022, thus tracing the origins of the field; and second, by expanding the conceptual range by including 17 data categories, thus enabling a comprehensive analysis of ‘big data’ and interrelated digital data categories. We use the term ‘big data’ to refer to the overall research field because, as our findings demonstrate, it has become by far the most widely used data category and, as such, is commonly used as a collective term by the research community. At the same time, our research highlights the importance of other data categories as co-constituents of the ‘big data’ research ecosystem. By exploring a more extensive timeframe and a comprehensive range of data categories, we unpack the foundational concepts that paved the way for what we now recognize as the well-established ‘big data’ research domain. From a conceptual progression perspective, the temporal analysis highlights the thematic changes and the interdisciplinary trajectories of ‘big data’ as a growing research field.
The overarching research question guiding this study, therefore, is as follows: What is the overall size and the conceptual and disciplinary shape of the ‘big data’ research field, and how has it grown and changed over the last three decades (1993–2022)? The specific research questions are as follows:
RQ1. How has the ‘big data’ research field grown across time, as measured by cumulative publication rates and the related occurrences and co-occurrences of 17 data categories?
RQ2. How is the ‘big data’ research field conceptually characterized, as measured by occurrences of keywords and co-occurrence of data categories and keywords?
RQ3. How is the ‘big data’ research field characterized in (inter)disciplinary terms, as measured by the distribution of publications, data categories and keywords across four main subject areas?
The article is structured as follows: Section 2 provides a literature review divided into two parts, namely a brief overview of the conceptual emergence and evolution of ‘big data’ research followed by a detailed discussion of existing bibliometric works and the related research gap. Section 3 describes the data and methods, from selecting academic databases to data collection, preprocessing, and each stage of data analysis. Section 4 presents the findings in three subsections: (a) the emergence and co-evolution of 17 data categories during 1993–2022; (b) the conceptual demarcation of the field based on keyword co-occurrence analysis; and (c) disciplinary variations and interdisciplinary connections based on science journal classifications. Section 5 discusses the significance of these findings in relation to existing literature, highlighting major new insights. Section 6 concludes by reflecting on the implications of the study findings for the evolving field of ‘big data’ research and discussing the research limitations and related avenues for future research. In the interest of open research, the full curated dataset is made available as Supplementary Documents.

2. Literature Review

2.1. ‘Big Data’: From Concept to Broad Sociotechnical Application

The concept of ‘big data’, despite its widespread use, lacks a unified definition [15]. This ambiguity stems from varied disciplinary and epistemological approaches to ‘big data’ as both a tool and a phenomenon [1,2]. A foundational conceptualization is the ‘V’s’ framework, initially comprising volume, velocity and variety, and later expanded to include veracity and value [16]. This framework emphasizes both the technical characteristics of, and the importance of deriving actionable insights from, digital data.
The discourse surrounding ‘big data’ has broadened to encompass ethical and methodological challenges, reflecting its role as a cultural, technological, and scholarly phenomenon [17]. Kitchin’s [1,2] notion of a ‘data revolution’ underlines the integral role of ‘big data’ in shaping academic, policy and societal discourses towards a data-intensive paradigm. This rapid and transformative evolution has been facilitated by advancements in computational power and algorithmic efficiency, enhancing the scalability and processing of large datasets [18].
The incorporation of ‘big data’ analytics in various social and public domains, from healthcare to education and from agriculture to urban planning, has required a deeper interrogation of its societal impacts. This has led to a rich body of literature examining the ethical, privacy, and governance issues associated with ‘big data’ [19,20,21]. These considerations form the backdrop against which our bibliometric analysis of the research field is conducted.

2.2. Mapping the Research Field

Scientometric (or bibliometric) analysis has established itself as a powerful quantitative tool for mapping the evolution of research fields as diverse as food-agriculture [22], dentistry [23], neonatal medicine [24], as well as ‘big data’. (It is different from the systematic review method, which uses a carefully prescribed sampling method to review the scientific literature; see [25] for a recent systematic review of the ‘big data’ research literature based on a sample of 189 articles.) The scientometric method employs statistical analysis of published scientific literature to identify longitudinal trends, key themes, seminal works and patterns of collaboration within a given field [9,11,13,26]. In the context of ‘big data’ research, scientometric approaches have been particularly valuable in tracking the field’s rapid growth and disciplinary diversity, thus offering opportunities to chart the research field and analyse its key dimensions and characteristics [12].
To date, several bibliometric studies have confirmed a continuously strong growth in publications related to ‘big data’ since the late 2000s, a trend observed across various scientific databases including Scopus, PubMed, and Web of Science. These studies, while consistently reporting significant growth over recent years, vary theoretically and methodologically, for example, regarding the analysis of thematic clusters, citation trends, geographical influence, author collaboration networks and interdisciplinary engagement.
The ‘big data’ research landscape has transitioned from early conceptualizations of data use toward a broader range of increasingly diverse applications [14,27]. Key areas of focus now include the theoretical development of ‘big data’ analytics tools and algorithms, the expansion of infrastructure encompassing hardware, software, and network architecture, and the application of ‘big data’ in sectors such as healthcare, finance, energy, and transportation. Additionally, related technologies like cloud computing and the Internet of Things have emerged as critical components of this landscape [14]. These thematic developments demonstrate the extensive applicability of ‘big data’ technologies across various fields, highlighting their potential to tackle complex challenges and support sustainable development [28,29].
‘Big data’ research has, then, expanded from its original disciplinary anchoring (as measured by the primary research field of publications) mainly in computer science and engineering to a field marked since the 2010s by increasingly interdisciplinary engagement and a diversifying field of academic journals involved [11]. This is evidenced, for example, by the growing co-authorship of publications across fields such as statistics, data science, and computer science [30]. Thematically, the shift to greater interdisciplinarity is also evidenced by surging interests in cross-boundary topics such as data privacy and ethics, indicative of the convergence of ethical, technical, and regulatory knowledge [31]. Furthermore, there is evidence of academic institutions establishing interdisciplinary research centers, indicating a structural shift towards collaborative approaches in ‘big data’ [32]. This is matched by evidence of an increase in (public) research funding going towards projects that blend methodologies from different fields to address societal challenges [11].
Table 1 (below) provides a summary of the aforementioned scientometric studies. It demonstrates the growing interest in mapping the field’s development and expansion.
At the same time, Table 1 highlights several gaps in existing bibliometric studies. One such gap is the sparsity of long-term analyses: apart from a few studies, scientometric analyses have been limited to relatively short time-periods (post-2000s or even post-2010s). One exception is Tseng et al. [31], who undertook a 30-year analysis from 1983 to 2014 with a focus on ‘big data’ and ‘data mining’ as search terms. One key insight from their study is that ‘data mining’ preceded ‘big data’, the latter not making an appearance in the scholarly literature before 1993. For their part, Gupta and Rani [32] undertook a 17-year analysis from 2000 to 2017, with ‘big data’ being the sole search term. The present study fills this gap by providing a 30-year analysis from 1993 to 2022, thus also covering the more recent period. Another, related gap is the narrow focus on ‘big data’ (plus ‘data mining’ in the case of Tseng et al. [31]) as the search query. In response, this study considerably extends the inquiry by jointly analysing 17 self-similar data categories that variously capture digitally sourced data. While ‘big data’ turned out to be the dominant concept/category in use from 2012 onwards, other categories (such as ‘digital data’, ‘intelligent data’ and ‘novel data’) were influential in the earlier periods and have continued to shape the field (see results section). Moreover, this study concentrates on the conceptual evolution of the field through data category and author keyword co-occurrence analyses, thus filling another gap in the literature. Altogether, this study seeks to analyse the ‘big data’ research field comprehensively by encompassing a 30-year period and using a full range of closely related digital data categories that variously denote the ‘big data’ research domain.

2.3. Author Keywords and Interdisciplinarity

In scientometric studies, the analysis of author keywords, such as their frequency and co-occurrence, has gained growing interest, since author keywords provide important insights into conceptual, methodological and disciplinary demarcations of research. Schraven et al. [7] demonstrated how analysing co-occurrences of author keywords with specific terms can reveal conceptual underpinnings, associations and interdependencies between different aspects of sustainable urban development research. Sampagnaro [33] noted that “keywords represent one of the most essential items for filtering the vast amount of research available”, making them invaluable for identifying central concepts in academic publications. This attribute is particularly useful in tracing the evolution of ‘big data’ research: analysing keyword (co-)occurrences and related changes over time can reveal conceptual beginnings, maturations and convergences as well as divergences. Xu et al. [34] used author keyword analysis to understand the formation of interdisciplinary research and, relatedly, the continuity and discontinuity of disciplinary trajectories. The insights from this study could be used, for example, to examine whether and how ‘big data’ research emerged from the computing and engineering sciences and increasingly expanded to, and intersected with, social science, business studies, health science and other disciplines.
With a focus on the ‘big data’ research field, Choi et al. [35] showed that analysing keyword networks (through co-occurrence measurement) can be used to identify research trends as well as estimate the evolution of the knowledge structure in the scientific literature. The latter also helps to analyse the (inter)disciplinary nature of the research field. Keyword co-occurrences from different and not mutually exclusive domains, such as ‘neural networks’ (computer science) and ‘genomics’ (biology), would indicate the field’s interdisciplinary significance. Parlina et al. [14] used author keyword cluster analysis, as part of their bibliometric and text mining analysis of the literature (2009–2018), to reveal the thematic emergence and predominance of specific areas, including ‘big data’ analytics, security and privacy, and integration of social networks and IoT. For their part, Kalantari et al. [12] demonstrated that, alongside co-citation analysis, author keyword analysis is an effective way of tracing the development of ‘big data’ research. The analysis highlighted the increasing diversity of keywords and the emergence of new influential terms as well as the growing interdisciplinarity of the field beyond computer science and engineering as dominant disciplines. Altogether, these studies illustrate the usefulness of author keyword analysis to map and understand the evolution and expansion of an increasingly interdisciplinary research field.
Keyword occurrences can also help to quantify the thematic concentration or diversity of a knowledge field. Topics/themes can be classified and ranked based on keyword differentiation, since keyword occurrences reflect thematic variety within a specific topic/theme [33]. Applied to ‘big data’ research, and more specifically when using different data categories, high occurrences of keywords paired with particular labels may indicate established connections or thematic consolidation, while emerging keywords could indicate new trends or conceptual differentiation.
In summary, our research adds to the analysis of ‘big data’ as a research field by: (a) comprehensively analysing the extended scientific literature across a 30-year period based on 17 interrelated digital data categories; (b) examining how the overall field has evolved and expanded conceptually through the co-occurrence analyses of the 17 data categories and author keywords; and (c) identifying both the disciplinary and interdisciplinary composition of the field and related changes across time.

3. Materials and Methods

The scientometric analysis entailed four sequential steps (see Figure 1, below): (a) data collection, to assemble a corpus of publications; (b) data processing, to cleanse and standardize the assembled texts; (c) data analysis, including publication trends across 30 years, (co)occurrence analysis of 17 data categories, (co)occurrence analysis of collated keywords, and analysis of subject-specific disciplinary trends; and (d) interpretation of the findings. The various methodological steps are outlined below with a view to enabling research replicability. Also, the assembled dataset is made available as a Supplementary Document to allow further analysis.

3.1. Selection of 17 Data Categories

As noted, ‘big data’ has become the key conceptual term used to refer to digitally sourced data and has, consequently, lent the name to the research field overall. At the same time, researchers use other closely related data categories—‘digital data’, ‘novel data’, ‘social media data’ etc.—to denote and demarcate the domain. As such, the ‘big data’ research field can be conceptually analysed in terms of an integrated network of data categories. In taking this approach, this study drew on earlier scientometric analyses of the co-constitution of specific research fields based on conceptual categories and keywords [7,8]. In particular, it drew on Schraven et al.’s methodological protocol [7] (step 1, Box 1) to assemble a comprehensive set of data categories that are variously used in the scientific literature as conceptual carriers signifying digitally sourced data and that, therefore, together conceptually constitute the ‘big data’ research field. Accordingly, we began by listing well-known data categories (‘big data’, ‘digital data’, ‘novel data’, ‘social media data’). By verifying these in Scopus and consulting key publications, we identified additional, less prominent data categories used by authors to describe their research involving digitally sourced data. This iterative process (entailing several rounds of scoping analysis) resulted in a final set of 17 data categories. The strategy was to be comprehensive, given the aim to capture the conceptual research field as fully as possible yet based on clear empirical evidence in the literature. The approach combined relevance (terms that meaningfully indicate forms of digital data) and prevalence (terms with demonstrable presence in the indexed literature). Each pre-selected term was checked in Scopus for presence in the scientific corpus. 
Altogether, this abductive procedure is consistent with recent methodological recommendations in exploratory scientometric studies [35,36], where iterative sense-making is necessary to capture and define emerging or fragmented conceptual domains. Its comprehensiveness was borne out by the results: only three out of the 17 data categories generated fewer than 100 search results (see Table 2, below). The 17 categories are (in alphabetical order): ‘big data’, ‘clickstream data’, ‘digital data’, ‘digital exhaust data’, ‘digital footprint data’, ‘digital trails data’, ‘emergent data’, ‘intelligent data’, ‘internet footprint data’, ‘mobile phone data’, ‘novel data’, ‘online activity data’, ‘online traces data’, ‘RFID data’, ‘smart data’, ‘social media data’, ‘web browsing data’.
It is worth emphasizing that the data categories, as used in this study, act both as conceptual carriers of information (‘intelligent data’, ‘novel data’, ‘big data’ etc., each relating to digitally sourced data in a particular way) and as a methodological tool enabling the bibliometric search process.

3.2. Data Collection and Processing

Data collection was carried out using Scopus, a well-established and widely used scientific database for academic publications. It is one of the most extensive and varied sources of peer-reviewed academic research [14]. For the period 1993–2022, Scopus indexed 43,638,549 English-language articles. This figure is the result of the following advanced search query in Scopus: (LANGUAGE (english) AND (DOCTYPE (ar) OR DOCTYPE (re)) AND (PUBYEAR > 1992 AND PUBYEAR < 2023)). It reflects the vast scope of published research lodged in the database. The search was restricted to journal articles and reviews, as these represent the gold standard of peer-reviewed academic output. The search query incorporated all 17 data categories, as follows: (TITLE-ABS (“big data” OR “novel data” OR “social media data” OR “mobile phone data” OR “clickstream data” OR “emergent data” OR “web browsing data” OR “digital footprint data” OR “online activity data” OR “RFID data” OR “smart data” OR “digital data” OR “intelligent data” OR “online traces data” OR “internet footprint data” OR “digital exhaust data” OR “digital trails data”) OR AUTHKEY (“big data” OR “novel data” OR “social media data” OR “mobile phone data” OR “clickstream data” OR “emergent data” OR “web browsing data” OR “digital footprint data” OR “online activity data” OR “RFID data” OR “smart data” OR “digital data” OR “intelligent data” OR “online traces data” OR “internet footprint data” OR “digital exhaust data” OR “digital trails data”) AND DOCTYPE (ar OR re) AND PUBYEAR > 1992 AND PUBYEAR < 2023). Using the advanced search function in Scopus [36], the search was limited to article Title, Abstract and Author Keywords (thus excluding the main body of the article), in line with published scientometric protocols, e.g., [7]. Data collection took place in late 2023, with data cleaning and analysis conducted in 2024.
(It should be noted that in Scopus the full set of publications for any given year is not available until the end of the first quarter of the following year due to a time lag in indexing. Hence, in this research the last year of data collection was 2022.) The above research process resulted in a corpus of 70,163 articles harvested for the period 1993–2022.
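As an illustration of how the 17-category search string above could be assembled programmatically rather than typed by hand, the following minimal Python sketch builds a query of the same form. The function name `build_scopus_query` and its parameters are hypothetical; the field codes (TITLE-ABS, AUTHKEY, DOCTYPE, PUBYEAR) are simply those appearing in the reported query, and the sketch does not contact Scopus.

```python
# The 17 data categories listed in Section 3.1 (alphabetical order).
DATA_CATEGORIES = [
    "big data", "clickstream data", "digital data", "digital exhaust data",
    "digital footprint data", "digital trails data", "emergent data",
    "intelligent data", "internet footprint data", "mobile phone data",
    "novel data", "online activity data", "online traces data", "RFID data",
    "smart data", "social media data", "web browsing data",
]


def build_scopus_query(categories, first_year=1993, last_year=2022):
    """Assemble a search string of the same form as the query reported in the text.

    Hypothetical helper: it only builds the string and does not call Scopus.
    """
    # Wrap each category in double quotes, as in the reported query.
    phrases = " OR ".join(f'"{c}"' for c in categories)
    return (
        f"(TITLE-ABS ({phrases}) OR AUTHKEY ({phrases}) "
        f"AND DOCTYPE (ar OR re) "
        f"AND PUBYEAR > {first_year - 1} AND PUBYEAR < {last_year + 1})"
    )


query = build_scopus_query(DATA_CATEGORIES)
print(query)
```

Generating the string from a single list keeps the TITLE-ABS and AUTHKEY clauses identical, which is easy to get wrong when 17 phrases are typed out twice.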
Scopus was selected as the database due to its comprehensive coverage of peer-reviewed journals across disciplines, including physical sciences, social sciences, life sciences, and health sciences, which align with the interdisciplinary characteristic of ‘big data’ research. Scopus also offers robust bibliometric tools and metadata, facilitating large-scale scientometric analyses and making it particularly suitable for tracing conceptual and disciplinary trends over time [37]. While Web of Science and similar databases also offer strong bibliographic resources, this study utilized Scopus due to its broader journal coverage and superior indexing of author keywords. Recent research confirmed that Scopus includes 99.11% of journals indexed in Web of Science, thus being near-identical, while including additional indexed journals not covered by Web of Science [36]. See also previous scientometric studies that provided methodological justification for the use of Scopus as a representative database of the scientific literature [7,8]. Google Scholar could be yet another alternative if the goal is to include the non-peer-reviewed literature and the grey literature. However, it must be noted that Google Scholar is not a curated, indexed database and, as such, is not as robust and stable as Scopus and Web of Science.
Additional data was integrated from the SCImago journal classification index database, linking Scopus entries and SCImago records using ‘Source Title’ as a shared key for merging. The SCImago journal classification data allowed for the matching of the bibliometric data extracted from Scopus with their ‘Top-Level Subject Areas’ and ‘Specific Subject Areas’ ASJC codes, to enable disciplinary distinction across bibliometric records.
The collated dataset was then cleaned by resolving inconsistencies such as repeat entries. Text data was pre-processed, especially in critical columns (Title, Abstract, Author Keywords, Source Title), to ensure consistency and accuracy. The pre-processing involved converting all text to lowercase, removing special characters and extra spaces, and handling missing values. These steps prepared the dataset for effective text matching, which is essential for accurately counting data categories and author keyword occurrences. Additional pre-processing work included splitting and exploding multi-category fields, which was necessary to handle ‘Top-Level Subject Areas’ and ‘Specific Subject Areas’ columns that contain multiple categories within single cells. These fields were split into individual categories and then exploded into separate rows for a more detailed and accurate analysis of each subject area.
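The cleaning, merging and exploding steps described above can be sketched with pandas as follows. This is a minimal illustration on toy data, not the study's actual code: the two small tables, the `normalise` helper, the regular expression used to strip special characters, and the assumption that multiple subject areas are separated by semicolons are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the Scopus export and the SCImago journal list (assumed layout).
scopus = pd.DataFrame({
    "Title": ["  Big Data &  Cities!", "Smart data in health"],
    "Author Keywords": ["Big Data; Urban Planning", None],
    "Source Title": ["Journal  A", "journal b"],
})
scimago = pd.DataFrame({
    "Source Title": ["Journal A", "Journal B"],
    "Top-Level Subject Areas": ["Social Sciences; Physical Sciences", "Health Sciences"],
})


def normalise(series):
    """Fill missing values, lowercase, drop special characters, collapse spaces."""
    return (
        series.fillna("")
        .str.lower()
        .str.replace(r"[^a-z0-9;\s-]", " ", regex=True)  # ';' kept as list separator
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


for col in ["Title", "Author Keywords", "Source Title"]:
    scopus[col] = normalise(scopus[col])
scimago["Source Title"] = normalise(scimago["Source Title"])

# Remove repeat entries, then attach subject areas via the shared 'Source Title' key.
scopus = scopus.drop_duplicates()
merged = scopus.merge(scimago, on="Source Title", how="left")

# Split multi-category cells into lists and explode them into one row per area.
merged["Top-Level Subject Areas"] = merged["Top-Level Subject Areas"].str.split(";")
exploded = merged.explode("Top-Level Subject Areas").reset_index(drop=True)
exploded["Top-Level Subject Areas"] = exploded["Top-Level Subject Areas"].str.strip()

print(exploded[["Title", "Source Title", "Top-Level Subject Areas"]])
```

Normalising ‘Source Title’ on both sides before merging matters because the join is an exact string match; a left join keeps articles whose journal has no SCImago match, so they can be inspected rather than silently dropped.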

3.3. Data Analysis

The analysis of the corpus of 70,163 articles combined several quantitative techniques, using the open-source Python programming language and the free JupyterLab (version 3.6) software as the analytics platform.

3.3.1. (Co-)Occurrence of 17 Data Categories and Author Keywords

The first part of analysis focused on the 17 data categories, using the following steps:
  • Data category frequencies. Across the entire corpus, we calculated in how many articles each of the 17 data categories appeared at least once in the title, abstract or author keywords. Each article was counted only once per data category regardless of multiple mentions of a given data category in title, abstract and author keywords in the same article.
  • Temporal segmentation. Initially, 5-year intervals were applied to analyse the temporal appearances of the 17 data categories. After observing a major spike in overall publication numbers around 2012, with exponential growth thereafter largely due to the surge in articles mentioning ‘big data’, the temporal analysis was concentrated on two main periods: 1993–2012 (first period) and 2013–2022 (second period). This highlights which data categories gained traction in the initial phase and how they fared in the second phase once ‘big data’ established itself as the dominant category. The temporal segmentation into two phases also reflects observed thematic transitions in the co-occurrence of data categories and keywords, with ‘big data’ and related concepts emerging as central to the field after 2012.
  • Focus on seven main data categories. Following the initial analysis of the 17 data categories, we proceeded to concentrate further analysis on seven most significant data categories that showed a sustained presence across the 30-year period and/or a total frequency of at >1000. These are: ‘big data’, ‘novel data’, ‘digital data’, ‘social media data’, ‘intelligent data’, ‘mobile phone data’, ‘smart data’.
  • Network analysis. Using social network analysis (SNA) methodology, we analyzed the data categories to identify conceptual connections in the literature. We created network graphs based on the Fruchterman–Reingold “spring” layout algorithm [38], using Python’s NetworkX library in JupyterLab. This force-directed method models nodes as mutually repelling bodies and edges as springs pulling connected nodes together, and iteratively minimizes a global “energy” function so that (a) highly connected nodes cluster centrally, (b) less-connected nodes repel to the periphery, and (c) edge crossings are reduced. We thus visualized co-occurrence relationships. Data categories (and keywords, see below) represent nodes, and their co-occurrence in the same article represents an edge, creating a network that visualizes the interconnectedness of concepts [7,8]. A threshold of at least five articles was used to count co-occurrences, thereby balancing the need to capture meaningful relationships while filtering out weak or incidental links. Lower thresholds would increase network density, but risk overemphasizing noise. Higher thresholds could exclude significant yet less frequent associations. This approach, informed by a previous scientometric study that used author keyword co-occurrence analysis [8], ensures clarity and the identification of robust relationships between data categories.
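As a hedged illustration of the counting logic behind these steps, the sketch below counts each data category once per article and keeps only category pairs that co-occur in at least five articles. The function names, the four-category subset, the plain substring matching and the sample articles are illustrative assumptions, not the study's implementation.

```python
from collections import Counter
from itertools import combinations

# Illustrative subset of the 17 data categories (assumption).
CATEGORIES = ["big data", "novel data", "digital data", "social media data"]
MIN_ARTICLES = 5  # co-occurrence threshold stated in the text

def categories_in(article):
    """Return the set of categories mentioned in title, abstract or keywords.

    Text is assumed to be lowercased already; simple substring matching is
    used here and may over-match (e.g. 'big database').
    """
    text = " ".join(article.get(f, "") for f in ("title", "abstract", "keywords"))
    return {c for c in CATEGORIES if c in text}

def count_categories(articles):
    """Count article-level occurrences and thresholded pairwise co-occurrences."""
    freq, pairs = Counter(), Counter()
    for art in articles:
        found = categories_in(art)                     # a set: once per article
        freq.update(found)
        pairs.update(combinations(sorted(found), 2))   # unordered pairs
    edges = {p: n for p, n in pairs.items() if n >= MIN_ARTICLES}
    return freq, edges

# Hypothetical articles, for illustration only.
articles = (
    [{"title": "big data meets social media data", "abstract": "", "keywords": "big data"}] * 5
    + [{"title": "digital data archives", "abstract": "uses big data", "keywords": ""}]
)
freq, edges = count_categories(articles)
```

The `edges` dictionary could then be loaded into a NetworkX graph (for example with `Graph.add_edge(u, v, weight=n)`) and positioned with `networkx.spring_layout`, which implements the Fruchterman–Reingold algorithm.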
Next, we analysed the collated author keywords as follows:
  • Frequency and co-occurrence analysis. Across the entire corpus, a total of 315,235 Author Keywords were collected. The occurrence of each author keyword was measured across the entire dataset and across the specified time periods. The analysis also included author keyword pairs, that is, two keywords appearing together in the same article [39]. As a threshold, pairs were counted if they appeared in at least five articles.
  • Network analysis. As outlined above, the same social network analysis technique was used to analyse and visualize author keyword co-occurrences.
  • Keyword density mapping. To complement the keyword co-occurrence analysis, continuous 2-D density maps of author-keyword usage were generated for the two main phases (1993–2012; 2013–2022). This overlays a density surface on a two-dimensional embedding of the top author keywords [40]. Bubble size represents overall keyword frequency, and the background heat highlights regions where terms most densely co-occur, thus visualizing both core and peripheral themes at a glance. Procedurally, first, the 50 most frequent keywords in each phase were identified, followed by measuring how often each pair of keywords appeared in the same paper. The resulting pairwise counts were used to arrange the keywords on a two-dimensional map so that closely related terms sit near one another. Finally, a smooth ‘heat’ layer was overlaid to highlight regions where many keywords clump together, making it easy to see the field’s main thematic ‘hotspots’ [40].
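The density-mapping procedure can be sketched in part as follows. The code prepares a keyword dissimilarity matrix from pairwise co-occurrence counts and evaluates a plain Gaussian kernel density at a point. Converting counts to dissimilarities as one minus the count divided by the maximum count, the fixed bandwidth and all function names are assumptions. The two-dimensional embedding itself is left to a library routine, for example scikit-learn's `MDS` with `dissimilarity="precomputed"`.

```python
import math
from collections import Counter
from itertools import combinations

def dissimilarity_matrix(keyword_lists, top_n=50):
    """Return the top keywords and a symmetric dissimilarity matrix.

    Dissimilarity is 1 - count / max_count (an assumed conversion).
    """
    freq = Counter(k for kws in keyword_lists for k in set(kws))
    top = [k for k, _ in freq.most_common(top_n)]
    index = {k: i for i, k in enumerate(top)}
    counts = [[0] * len(top) for _ in top]
    for kws in keyword_lists:
        present = sorted(index[k] for k in set(kws) if k in index)
        for i, j in combinations(present, 2):   # each pair once per article
            counts[i][j] += 1
            counts[j][i] += 1
    peak = max((c for row in counts for c in row), default=0) or 1
    dist = [[0.0 if i == j else 1.0 - counts[i][j] / peak
             for j in range(len(top))] for i in range(len(top))]
    return top, dist

def density(x, y, points, bandwidth=1.0):
    """Evaluate an isotropic 2-D Gaussian kernel density estimate at (x, y)."""
    total = sum(math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points)
    return total / (len(points) * 2 * math.pi * bandwidth ** 2)

# Hypothetical keyword lists, one per article.
keyword_lists = [
    ["big data", "machine learning"],
    ["big data", "machine learning"],
    ["big data", "privacy"],
]
top, dist = dissimilarity_matrix(keyword_lists)
```

Evaluating `density` over a regular grid of the embedded keyword coordinates would give the background heat surface, with bubble sizes taken from the keyword frequencies.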

3.3.2. Co-Occurrence of Data Categories with Author Keywords

To further analyse the conceptual relationships across the ‘big data’ research field, a co-occurrence analysis of data categories and author keywords—that is, data categories and author keywords appearing together in the same article—was carried out. This was followed by calculating the top co-occurrences: for each data category, the top 20 co-occurring author keywords were identified.
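A minimal sketch of this step is given below, assuming each article has already been annotated with the data categories it mentions and with its author keywords. The record layout, the function name and the exclusion of a keyword identical to the category itself are assumptions for illustration.

```python
from collections import Counter, defaultdict

def top_keywords_by_category(articles, top_n=20):
    """For each data category, return its most frequent co-occurring keywords."""
    counts = defaultdict(Counter)
    for art in articles:
        keywords = set(art["keywords"])           # count each keyword once per article
        for cat in set(art["categories"]):
            # Assumption: a keyword identical to the category is not counted.
            counts[cat].update(k for k in keywords if k != cat)
    return {cat: c.most_common(top_n) for cat, c in counts.items()}

# Hypothetical annotated articles.
articles = [
    {"categories": ["big data"], "keywords": ["machine learning", "big data"]},
    {"categories": ["big data", "novel data"], "keywords": ["machine learning", "deep learning"]},
]
result = top_keywords_by_category(articles)
```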

3.3.3. Interdisciplinary Analysis Using ASJC Codes

Finally, to examine the (inter)disciplinary dimensions of ‘big data’ research, the All Science Journal Classification (ASJC) codes were incorporated into the analysis as follows:
  • Data segmentation. The consolidated dataset was filtered by each of the top seven data categories, creating individual datasets (data frames) for each label.
  • Disciplinary Categorization. The ASJC codes categorized articles into four main disciplinary segments: Physical Sciences, Life Sciences, Health Sciences, and Social Sciences.
  • Keyword co-occurrence by discipline and data category. For each of these seven data frames (corresponding to each data category), the data was (i) further segmented by the four ASJC top-level subject areas, and (ii) the top 20 author keywords were calculated in each of these disciplinary segments.
  • Thematic trajectory analysis: For each data category, the thematic trajectory was analysed across different disciplinary contexts. The examination of the top co-occurring keywords for each data category within each disciplinary segment provides insight into how ‘big data’ concepts are applied and understood in various disciplines as well as across the research field overall.
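The segmentation logic above can be sketched as follows, assuming each article record carries its data categories, its top-level subject areas and its author keywords. The field names and the function name are hypothetical, and the example works on the already-assigned subject-area labels rather than on raw ASJC codes.

```python
from collections import Counter

def top_keywords_by_segment(articles, categories, areas, top_n=20):
    """Return the top keywords for every (data category, subject area) segment."""
    segments = {(c, a): Counter() for c in categories for a in areas}
    for art in articles:
        keywords = set(art["keywords"])
        for c in set(art["categories"]):
            for a in set(art["areas"]):
                if (c, a) in segments:             # ignore categories/areas not requested
                    segments[(c, a)].update(keywords)
    return {seg: cnt.most_common(top_n) for seg, cnt in segments.items()}

# Hypothetical records, for illustration only.
articles = [
    {"categories": ["big data"], "areas": ["Physical Sciences", "Social Sciences"],
     "keywords": ["machine learning"]},
    {"categories": ["big data"], "areas": ["Physical Sciences"],
     "keywords": ["machine learning", "cloud computing"]},
]
segments = top_keywords_by_segment(
    articles, ["big data"], ["Physical Sciences", "Social Sciences"]
)
```

Note that an article in a journal with several subject designations contributes to every matching segment, so segment totals can exceed the number of articles.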

4. Results

This section presents the findings of our scientometric analysis of the ‘big data’ research field. First, in Section 4.1 and Section 4.2, we address RQ.1, namely how the ‘big data’ research field grew across three decades (1993–2022) as measured by cumulative publication rates and the related occurrences and co-occurrences of the 17 data categories.

4.1. Publication Output 1993–2022

In total, our search query of 17 data categories generated a corpus of 70,163 research and review articles. Figure 2 and Table 2 (below) show the number of publications for each of the thirty years measured. For the first two decades (1993–2012), the number of publications increased slowly but steadily in their lower hundreds year-on-year, with the last couple of years (2011, 2012) showing accelerated growth. 2013 saw the beginning of exponential growth that has endured throughout the third decade surveyed. Figure 2 illustrates the dramatic change in growth rate from 2013 onwards. Indeed, 64,332 (91.7%) out of the total of 70,163 outputs were published in the period of 2013–2022.
One can, therefore, divide the 30-year period into two distinct growth phases. Research activity in phase 1 (1993–2012) produced a total of 5831 publications, representing 8.3% of the total output for the entire period. In other words, less than one tenth of output was produced in the first two decades. Annual publications were in their hundreds, rather than thousands. In contrast, research activity in phase 2 (2013–2022) expanded significantly, each year producing a multiplier of the previous year’s publication output and culminating in over 10,000 publications in each of the last couple of years surveyed. Over nine tenths of output was produced in the last decade. The expansion of the research field is starkly illustrated by the numbers at both ends: in 1993, a total of 127 scientific articles were published worldwide; in 2022, the figure was 10,360 articles, representing an increase of over 8000%.
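The shares and the percentage increase quoted above follow directly from the reported counts, as the short calculation below shows. Only the variable names are introduced here; all figures are those given in the text.

```python
total, phase1, phase2 = 70_163, 5_831, 64_332   # publication counts from the text
first_year, last_year = 127, 10_360             # articles in 1993 and in 2022

share1 = 100 * phase1 / total                   # phase 1 share of total output
share2 = 100 * phase2 / total                   # phase 2 share of total output
increase = 100 * (last_year - first_year) / first_year  # percentage increase

print(f"{share1:.1f}% {share2:.1f}% {increase:.0f}%")   # prints "8.3% 91.7% 8057%"
```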
The division of the 30-year period into two phases was informed by both quantitative and qualitative factors. Quantitatively, the growth in publication numbers remained relatively modest during the first phase, followed by a surge starting around 2013. This rapid increase was driven by the widespread adoption of ‘big data’ as a central concept in the field. Qualitatively, the first phase is marked by foundational research focusing on ‘digital data’ and ‘novel data’, while the second phase reflects a thematic shift towards interdisciplinary applications, particularly with the rise of artificial intelligence and machine learning as pivotal topics.
To further analyse publication trends, we calculated the smoothed relative growth rates of publications relating to each of the 17 data categories (corresponding graphs can be viewed in the Supplementary Document). This sheds additional light on publication trajectories. For phase 1 (1993–2012), the analysis confirms the gradual and foundational expansion of the field. While publications mentioning ‘digital data’ and ‘novel data’ were dominant early in the phase, both experiencing steady relative growth, publications containing ‘big data’ only began to emerge with noticeable spikes toward the end of the period. The relative growth of publications containing ‘clickstream data’ and ‘smart data’ also suggests that foundational methods and applications were beginning to influence research. However, overall growth during Phase 1 reflects the incipient and exploratory stage of ‘big data’ research, with slower adoption across disciplines. For phase 2 (2013–2022), the dynamics shifted dramatically as publications related to ‘big data’ solidified their dominance, reflected in high early growth rates that gradually stabilized. Publications tied to new data sources, such as ‘mobile phone data’, ‘web browsing data’, and ‘online activity data’, experienced spikes, particularly toward the latter part of the phase. By the end of Phase 2, the convergence of growth rates across most categories reflects the field’s conceptual and methodological maturation. This suggests that ‘big data’ research has transitioned from an exploratory phase to one marked by steady interdisciplinary integration and a more balanced contribution from diverse data categories.
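The text does not define how the smoothed relative growth rates were computed, so the following is only one plausible reading: year-on-year relative growth, (n_t − n_{t−1}) / n_{t−1}, smoothed with a trailing moving average. The three-year window, the function names and the yearly counts are assumptions.

```python
def relative_growth(counts):
    """Year-on-year growth (n_t - n_{t-1}) / n_{t-1}; None where the base is zero."""
    return [None if prev == 0 else (cur - prev) / prev
            for prev, cur in zip(counts, counts[1:])]

def moving_average(values, window=3):
    """Trailing mean over the defined values in each window (assumed smoother)."""
    out = []
    for i in range(len(values)):
        chunk = [v for v in values[max(0, i - window + 1): i + 1] if v is not None]
        out.append(sum(chunk) / len(chunk) if chunk else None)
    return out

# Hypothetical yearly publication counts for one data category.
yearly_counts = [0, 4, 6, 12, 30, 90]
growth = relative_growth(yearly_counts)
smoothed = moving_average(growth)
```

Skipping undefined values, rather than treating them as zero, keeps the first appearance of a category from distorting the smoothed curve.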

(Co-)Occurrences of 17 Data Categories

There is one key reason for the exponential growth in phase 2: the sudden uptake of ‘big data’ as a main reference point by the scientific community. Before 2013, ‘big data’ was not a significant category used to refer to digitally sourced data. It only became popular in 2012 and then quickly began to dominate the scene. Figure 3 (below) traces the relative growth of the 17 data categories across the 30-year period, again illustrating the exponential growth taking place in the third decade (2013–2022). Table 2 provides the underlying data in tabulated format.
Looking more closely at phase 1, it becomes apparent that ‘digital data’ was the leading category used by the scholarly community: it was referenced in 2666 publications, representing 52.2% of the entire output in that period. This was followed by ‘novel data’, with 1920 publications, representing 37.6% of output. The next most frequent data category was ‘intelligent data’ (390 publications; 7.6%). ‘social media data’ and ‘mobile phone data’ only emerged towards the end of phase 1, coinciding with the introduction and growing availability of the Internet and mobile telephony. Given the noted shift from an analogue to a digital world in the 1990s (see introduction), it is not surprising that ‘digital data’ initially was the main category: it encapsulates the new kind of data being referred to in a clear, literal way. On its part, ‘novel data’ puts greater emphasis on the new opportunities and applications afforded by digital data. ‘intelligent data’ shifts the focus onto how large-scale digital data can be analysed and transformed into intelligent insights. In Section 4.2 and Section 4.3, the conceptual distinctiveness of these categories will be further elaborated.
Turning to phase 2, ‘big data’ rapidly assumed a dominant position, accounting for 49,713 occurrences or 76.4% of total publication count in this period. Its growth is remarkable, rising from 607 publications in 2012 to 8095 in 2022, an increase of about 1234% (a more than thirteen-fold rise). Other labels remained relevant but could not match the meteoric rise of ‘big data’; they were effectively relegated to second-tier status. ‘digital data’ continued its steady increase from phase 1, ending in over 500 publications p.a. at the end of phase 2. However, its share of the total decreased dramatically, from 52.2% in phase 1 to 5.5% in phase 2. ‘novel data’ accrued 6337 occurrences or 12.7% of total share in phase 2, a significant decrease from 37.6% in phase 1. At the same time, it overtook ‘digital data’ in the second phase. ‘intelligent data’ enjoyed further steady yet modest growth; its relative share similarly decreased from 7.6% (phase 1) to 1.15% (751 publications) in the latter phase. In contrast, not unexpectedly, ‘social media data’ was on the rise, albeit modestly, with 2461 occurrences or 3.8% of the total share in phase 2.
The preponderance of ‘big data’ clearly stands out: not only did it rapidly climb to be the most important data category used by the scientific community in phase 2, but it also ended up dominating the field across the entire 30 years: it accounts for 71.23% of publications collected for the period of 1993 to 2022.
Analyzing the position of data categories relative to one another can provide useful additional insight into the conceptual shape of the research field. This can be achieved by calculating co-occurrences of data categories; that is, how often two data categories are mentioned concurrently in the same article (title, abstract, author keywords). Figure 4a–c provides a visualization thereof in the form of three network graphs for phase 1 (1993–2012), phase 2 (2013–2022) and the overall period (1993–2022).
In phase 1 (Figure 4a), there is but a weak interrelationship among data categories with just a handful of co-occurrences. ‘novel data’ has three co-occurrences with ‘digital data’ (3 publications), ‘intelligent data’ (2 publications) and ‘RFID data’ (1 publication). ‘digital data’ has a further co-occurrence with ‘smart data’. The remaining eight data categories, including ‘big data’, show no co-occurrences. Altogether, it is remarkable that among the 5831 publications captured in phase 1, only very few include concurrent mention of two data categories. This suggests that at this developmental stage there was yet limited conceptual overlap between the 17 data categories.
In phase 2 (Figure 4b), ‘big data’ is prominently shown to be at the centre of the conceptual field both in terms of its size (occurrence) and central position (co-occurrence). In descending order, it is strongly networked with ‘social media data’ (368 co-occurrences), ‘digital data’ (317), ‘novel data’ (186), ‘mobile phone data’ (116), ‘intelligent data’ (98) and ‘smart data’ (94). This demonstrates that ‘big data’ has become a central, universalizing data category with several distinct associations with other prevalent data categories. Overall, phase 2 sees the emergence of a densely clustered network with ‘big data’ at its centre surrounded by other key categories. These include ‘digital data’, ‘novel data’ and ‘intelligent data’—the leading categories in phase 1—as well as ‘social media data’ and ‘mobile phone data’ that began their ascendency at the end of phase 1, plus the newly emergent ‘smart data’ category. Other categories, including ‘digital footprint data’, ‘web browsing data’ and ‘online activity data’, appear at the periphery of the network, indicating that they are in a minority and conceptually more niche.
The network graph for the entire 30-year period (Figure 4c) is strikingly similar to that for phase 2 (Figure 4b), chiefly because 91.7% of outputs were published in phase 2. It again underlines the centrality of ‘big data’ at the heart of a dense cluster with individual connections to more peripheral data categories.
Not visible in the three network graphs (Figure 4a–c) are four data categories that do not have any co-occurrences: ‘digital exhaust data’, ‘digital trails data’, ‘internet footprint data’ and ‘online traces data’. In fact, they also have very low occurrences: across the entire 30-year period, each is mentioned in fewer than 10 publications. They therefore have only a very marginal influence on the research field, although some (possibly ‘digital exhaust data’) might yet emerge as significant in future (see Section 5).

4.2. Conceptual Analysis Through Author Keywords

Having so far established the conceptual network of the ‘big data’ research field based upon the 17 data categories, the analysis can be further deepened by considering the frequency of author keywords for the research field overall and in relation to individual data categories. In doing so, this section addresses RQ.2.

4.2.1. Author Keyword Occurrences

Table 3 (below) lists the 20 most frequent keywords across the overall corpus of 315,235 keywords for both phases 1 and 2 and the entire 30-year period. The keywords listed under phase 1 (Table 3) mainly relate to foundational aspects of data collection and manipulation techniques, such as ‘data fusion’, ‘data compression’, ‘metadata’ and ‘watermarking’. ‘Data mining’ is the most frequent author keyword, mentioned in 120 publications, followed by ‘GIS’ in 64 publications (alongside ‘geographic information systems’ in 22 publications). With 57 publications, ‘big data’ comes in third position owing to its uptake in the last two years of the period. ‘RFID’ (radio frequency identification) appears in 52 publications and ‘remote sensing’ in 44 publications. While ‘data mining’ remains prominent in phase 2 (1566 occurrences), it is nevertheless relegated to sixth place. For their part, ‘GIS’ (and ‘geographic information systems’), ‘RFID’ and ‘remote sensing’ drop off the list of top 20 keywords in the second phase (and the overall period). Evidently, as established, commonplace techniques, they no longer merit high-level mention. Regarding phase 1, it is also interesting to note the position of ‘digital data’ as the sixth most frequent author keyword (41 occurrences); this reinforces the prominence of this data category in the initial development phase of the research field (see above section). At the same time, it subsequently disappears from the author keyword list in phase 2 (and the overall period), further confirming its decline relative to ‘big data’.
The top 20 author keywords for phase 2 (Table 3) and for the entire 30-year period (Table 3) are almost identical: there is only minimal variation in the order of keywords and only one different keyword in each list. ‘‘big data’ analysis’ is ranked 14th and ‘privacy’ 16th in 2013–2022, whereas in the 1993–2022 period these two terms swap ranks. In both cases, ‘big data’ is the most frequent author keyword, mentioned in over 17,000 publications. (In comparison, ‘big data’ was ranked in 3rd position, with 57 occurrences, in phase 1). Together with the centrality of ‘big data’ in the network of data categories (see Section 4.1), these results underline the overwhelming presence of ‘big data’. Equally significant, ‘machine learning’, ‘artificial intelligence’ and ‘deep learning’ are among the top five author keywords in phase 2, signaling their rapidly growing influence on, and transformation of, the research field. They are accompanied, further down the list, by author keywords, such as ‘cloud computing’, ‘Hadoop’ and ‘MapReduce’, that represent the technical infrastructure needed to enable these advanced data analytics capabilities.
‘Social media’, too, ranks in the top five author keywords in phase 2 (and the overall period): it serves as both an essential tool for data analytics and a site of analysis. Its most prominent exemplar is ‘Twitter’, ranked 13th, a popular data source for large-scale analysis of social sentiments and public discourse (see also ‘sentiment analysis’ ranked 18th). The appearance of ‘COVID’ in rank 12 is notable insofar as it was only present in the last three years (2020–2022) of the 30-year measurement. It underscores the importance of data analytics and the rapid publication of related studies in the worldwide efforts to tackle the pandemic.

4.2.2. Keyword Density Overview

Figure 5a,b presents density maps of the top 50 author keywords for phase 1 (1993–2012) and phase 2 (2013–2022), embedding keywords via Multidimensional Scaling (MDS) and smoothing via Kernel Density Estimation (KDE). Bubble sizes reflect overall frequency; background heat reflects local co-occurrence density.
The two density maps visually corroborate the rank-order results in Table 3. In phase 1 (Figure 5a), the hottest zone centres on ‘data mining,’ ‘machine learning,’ and ‘clustering,’ with peripheral peaks in GIS- and sensor-based terms (e.g., ‘remote sensing,’ ‘geographic information system’). In phase 2 (Figure 5b), the epicentre shifts to ‘big data’, surrounded by large sub-clusters in ‘deep learning’, ‘cloud computing’, ‘Internet of Things’, and emergent themes including ‘COVID-19’ and ‘social media.’

4.2.3. Co-Occurrences of Author Keywords with Data Categories

Table 4 (1993–2012; 2013–2022) below lists the top 20 author keywords for the seven most influential data categories; i.e., the labels with the highest occurrences that form the core cluster in the data category network (see Figure 4b,c). Indicated in brackets is the number of co-occurrences (for example, ‘digital data’ and ‘GIS’ appeared together in 54 articles; see Table 4). Analysing author keyword compositions in this way provides further insights into the conceptual similarities and differences among these data categories and related changes across time.
In phase 1 (Table 4), author keyword co-occurrences are generally low, reflecting the fact that the share of publications was 8.3% of the total output (see Section 4.1). As this phase represents the emergence of the ‘big data’ research field, the author keywords provide a useful clue as to how the different data categories were originally defined. ‘digital data’, the most prominent category in phase 1, is chiefly cast regarding spatial analysis and related geological and geographic studies: the top five keywords centre upon geographic information systems and remote sensing. Further down the list, the disciplinary anchoring becomes apparent with keywords such as ‘geology’ (rank 9), ‘groundwater’ (14), ‘limestone’ (15) and ‘natural hazards’ (17). Given that ‘digital data’ was the defining category in phase 1 featuring in 52.2% of publications, this indicates that the research field was initially occupied largely by the physical (Earth) sciences.
‘Novel data’, the second most frequent data category in phase 1 with a share of 37.6% of publications, is defined in two distinct, yet related ways: among the top ten author keywords, five refer to analytical techniques (‘data mining’, ‘data hiding’, ‘data fusion’, ‘classification’, ‘data structures’), while the other five relate to medical and life sciences (‘inflammation’, ‘gene expression’, ‘apoptosis’, ‘obesity’, ‘atherosclerosis’). The same duality is apparent further down the list. ‘Novel data’, therefore, seemed to be the favoured term used by biologists and medical researchers in the early days of data analytics.
‘Intelligent data’, with a smaller share of 7.6% of the corpus in phase 1, is also conceptually distinct: it is strongly associated with ‘intelligent data analysis’ (rank 2) and ‘intelligent systems’ (rank 16) in the form of ‘machine learning’ (rank 3). Closely related author keywords include ‘support vector machine’ (rank 11), ‘neural networks’ (rank 13) and ‘artificial neural networks’ (rank 17), all describing supervised models with learning algorithms. As such, this fledgling data category contained what has since become omnipresent in the ‘big data’ world: artificial intelligence (AI). The other four data categories, owing to their small number of occurrences, each have only a few keyword co-occurrences. Author keywords associated with ‘mobile phone data’ (including ‘reality mining’, ‘human activity’ and ‘human mobility’) and ‘social media data’ (including ‘social media’, ‘Twitter’ and ‘discovery analytics’) indicate the opportunity to generate new data analytical insights about various social and behavioural issues. ‘big data’ is loosely associated with technical processes (e.g., ‘data compression’, ‘data integration’) as well as the commercial sector (‘stock market prediction’, ‘online auctions’, ‘smart money’). On its part, ‘smart data’ is again loosely associated with, on one hand, electricity infrastructure (‘smart grid’, ‘grid computing’) and, on the other, service functions (‘service-oriented architecture’).
In phase 2 (Table 4), the number of author keyword co-occurrences is much higher, reflecting the exponentially increased publication volume. By far the highest co-occurrence (3273 articles) exists between ‘machine learning’ and ‘big data’. Indeed, ‘machine learning’ appears in the top five author keywords across all seven main data categories. Given the centrality of the ‘big data’ data category, its five top author keywords (‘machine learning’, ‘artificial intelligence’, ‘deep learning’, ‘cloud computing’, ‘‘big data’ analytics’), present across over 9600 articles, are instructive: they confirm that the field has become centrally defined by the rapid advances in AI. This is further confirmed regarding ‘novel data’ (the second most common data category in phase 2), which, in contrast to phase 1, where it had a strong association with biological and medical sciences, features ‘machine learning’ and ‘deep learning’ at the top of author keyword co-occurrences (biological and medical references slipped further down the ranking). Similarly, ‘digital data’ exhibits strong association with ‘machine learning’ (rank 2) and ‘artificial intelligence’ (rank 3). ‘intelligent data’ is exclusively associated with AI-related author keywords, which is not surprising given its genesis in phase 1. ‘social media data’ and ‘mobile phone data’ retain their strong characterization in terms of various social media and mobile phone data analytics, such as ‘sentiment analysis’, ‘natural language processing’ and ‘text mining’, albeit interwoven with AI techniques including ‘machine learning’ and ‘deep learning’. ‘smart data’, the smallest of the seven key data categories, again has a strong AI component, albeit linked to ‘smart cities’ and ‘Industry 4.0’.
As further illustration of these results, the co-occurrence analysis of Phase 2 (2013–2022) points to several significant trends and emerging themes in ‘big data’ research. For instance, key AI-related techniques such as ‘machine learning’ and ‘deep learning’ consistently co-occur with ‘big data’, ‘novel data’, and ‘digital data’, emphasizing the field’s reliance on advanced computational models. These techniques play a pivotal role in enabling large-scale natural language processing, as reflected in the emergence of ‘artificial intelligence’ and applications like ‘sentiment analysis’ and ‘social network analysis’. Policy-related themes also feature prominently. Keywords like ‘COVID’ co-occur with nearly every data category except ‘intelligent data’, emphasizing the critical role of ‘big data’ analytics in tracking and responding to the pandemic. Moreover, the hardware and infrastructure components of ‘big data’ research are evident in the strong co-occurrence of ‘Hadoop’, ‘cloud computing’, and ‘MapReduce,’ which underpin the distributed systems essential for processing vast datasets. Together, these examples highlight the diverse and interdisciplinary focus of ‘big data’ research in Phase 2.
Extending the analysis to the 100 most frequent keywords for each of the seven main data categories (30-year period) provides further insight into specific application fields associated with ‘big data’ research (the full set of keywords is available from the Supplementary Document). The media domain is most strongly represented featuring in all data categories except ‘smart data’, with a total of 2185 related keywords including ‘social media’, ‘digital media’, ‘social media mining’ and ‘geotagged social media’. Also prominent is the (public) health domain which likewise features in all data categories except ‘smart data’. A total of 774 keywords variously refer to ‘mental health’, ‘public health’, ‘healthcare’, ‘structural health monitoring’ and ‘electronic health records’, among others. The strong presence of additional health-related keywords, such as ‘epidemiology’, ‘cancer’, ‘diabetes’, ‘pregnancy’, ‘nutrition’ and ‘COVID-19’, further underlines the significance of health as an important application area of ‘big data’. The next most-referenced domain is industry under the ‘smart data’, ‘big data’ and ‘intelligent data’ categories. The related 563 keywords are all ‘Industry 4.0’ which refers to automation and data exchange in manufacturing technologies and processes (4.0 denoting the fourth industrial revolution). The business domain features under the ‘big data’ and ‘social media data’ categories, with a total of 218 keywords referencing ‘business intelligence’.
The energy domain is associated with the ‘smart data’, ‘intelligent data’, ‘big data’ and ‘novel data’ categories and includes keywords, such as ‘energy efficiency’, ‘energy consumption’ and ‘energy management’ (total of 214 keywords). Another application area is mobility which is associated with the ‘mobile phone data’ and ‘social media data’ categories and encompasses terms, such as ‘urban mobility’, ‘spatial mobility’, ‘human mobility’ and ‘mobility patterns’ (total of 124 keywords). Closely related is the transport/travel domain, which features in the ‘mobile phone data’ and ‘smart data’ categories and includes the keywords ‘urban transport’, ‘intelligent transport’, ‘travel surveys’, ‘travel behavior’ and ‘travel demands’ (21 keywords in total). Again, the association with ‘COVID-19’ underlines the significance of ‘big data’ research for transport planning and travel behavior management resulting from the pandemic. Education is not yet a significant application domain, with only one entry (total of 13 keywords, all ‘education’). Similarly, agriculture has minimal coverage, with only six keywords, all ‘precision agriculture’, in the ‘smart data’ and ‘intelligent data’ categories. Completely absent from the 7 × 100 most frequent keyword list are the domains of built environment (including buildings and housing), finance and banking, and culture.

4.3. (Inter)Disciplinary Boundaries

The final results address RQ.3 by examining the presence of different disciplines across the ‘big data’ research field and their influence on the field’s conceptual demarcations. Figure 6 (below) shows the distribution of the corpus of publications across the four top-level subjects used to categorize academic journals. The top-level subjects—Social Sciences, Health Sciences, Physical Sciences, and Life Sciences—aggregate a number of more specific subject areas based on the All Science Journal Classification (ASJC) system. For example, Social Sciences includes Arts and Humanities, Business, Management and Accounting, and Decision Sciences, among other disciplines; Health Sciences includes Medicine and Health Professions, while Physical Sciences includes Engineering, Chemical Engineering, and Computer Science, among others.
A significant majority of articles (45,808, or 65% of total output) are associated with the Physical Sciences. This subject, therefore, has exerted the greatest disciplinary influence on the ‘big data’ research field overall. This makes sense given the origins of data science and data analytics in mathematics, statistics, computing and engineering, the early adoption of data analytics in physical geography and geology (see Section 4.1), and more recently the growing prominence of AI and related techniques and computing infrastructure.
In second position is the Social Sciences subject, which encompasses 19,202 articles (27%). Its strong showing can be explained, on the one hand, by the fact that ‘big data’ analytics has been increasingly adopted as a research tool by social scientists and, on the other, by growing engagement with the implications of the ‘data revolution’ in the arts, humanities, education, business studies, political sciences, etc. Health Sciences (9257 articles; 13%) and Life Sciences (8786 articles; 12.5%) are also well represented, at similar levels to one another. Concerning the temporal development, both the Physical Sciences and Social Sciences subjects were the major drivers of the exponential growth of output in phase 2, more significantly so than the Health Sciences and Life Sciences (see Supplementary Document for underlying data).
Remarkably, 17,056 articles (24%) were published in journals that have two or more concurrent subject designations, thus indicating sizeable cross-disciplinary engagement. Among these, 9110 articles straddle the Physical Sciences and Social Sciences; 3116 the Physical Sciences and Life Sciences; 2091 the Life Sciences and Health Sciences; and 1094 the Physical Sciences and Health Sciences. Smaller numbers of articles appear in journals crossing three or, in the case of ten articles, even all four top-level subjects.
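The percentage shares quoted in this section can be re-derived from the reported article counts. The following is a minimal sketch rather than the authors' own analysis pipeline: the counts are taken from the text, while the variable and function names are illustrative assumptions. It also shows that the four subject shares add up to more than 100%, which is consistent with articles in multi-subject journals being counted once under each subject.

```python
# Counts as reported in Section 4.3; all names below are illustrative.
TOTAL_ARTICLES = 70_163
MULTI_SUBJECT_ARTICLES = 17_056  # journals with two or more subject designations

subject_counts = {
    "Physical Sciences": 45_808,
    "Social Sciences": 19_202,
    "Health Sciences": 9_257,
    "Life Sciences": 8_786,
}


def share(count, total=TOTAL_ARTICLES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Share of the corpus associated with each top-level subject.
shares = {subject: share(n) for subject, n in subject_counts.items()}
multi_share = share(MULTI_SUBJECT_ARTICLES)
# Exceeds 100 when articles are counted under more than one subject.
sum_of_shares = sum(shares.values())

for subject, pct in shares.items():
    print(f"{subject}: {pct:.1f}%")
print(f"Two or more subjects: {multi_share:.1f}%")
print(f"Sum of subject shares: {sum_of_shares:.1f}%")
```

Because the subject shares overlap, they are best read as the proportion of the corpus touched by each subject, not as a partition of the corpus.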
Figure 7 (below) shows, in the form of a heatmap, the distribution of the seven most prominent categories across the four top-level subjects. Broadly, the data categories are distributed in proportion to the publication totals of the top-level subjects, thus confirming that they are widely used across the main academic disciplines. Notably, ‘big data’ demonstrates strong representation across all main subjects, reflecting its preeminence as a key conceptual, methodological and analytical reference used by scientists from diverse disciplines. In contrast, two data categories exhibit a significant degree of disciplinary distinction. First, ‘novel data’ is proportionally more strongly represented in the Health Sciences and Life Sciences subjects. This further confirms the earlier observation that ‘novel data’ is a preferred term used by biological and medical researchers. Second, ‘social media data’ is relatively less prominent in the Physical Sciences, while having a strong showing in the Social Sciences and Health Sciences subjects. The latter indicates the use of social media as a platform and tool to investigate health-related topics. Strategically, in choosing to refer to one or the other data category, or indeed several categories, in article titles, abstracts and keywords, authors can communicate (inter)disciplinary orientations.
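One way to make "proportionally more strongly represented" concrete is a representation ratio: a category's share of a subject's entries divided by its share of all entries. The sketch below is an illustrative assumption and not the method used to produce Figure 7; the function name is our own and the counts are entirely hypothetical toy values, not the study's data.

```python
def representation_ratios(counts):
    """Compute over/under-representation of each category in each subject.

    counts[category][subject] is a number of articles. The returned
    ratio[category][subject] is the category's share within that subject
    divided by the category's share over all cells of the table.
    A value above 1 means the category is over-represented in the subject.
    """
    subjects = sorted({s for row in counts.values() for s in row})
    subject_totals = {
        s: sum(row.get(s, 0) for row in counts.values()) for s in subjects
    }
    grand_total = sum(subject_totals.values())
    if grand_total == 0:
        raise ValueError("counts must contain at least one article")

    ratios = {}
    for category, row in counts.items():
        category_total = sum(row.values())
        if category_total == 0:
            continue  # no articles in this category, ratio is undefined
        overall_share = category_total / grand_total
        ratios[category] = {
            s: (row.get(s, 0) / subject_totals[s]) / overall_share
            for s in subjects
            if subject_totals[s] > 0
        }
    return ratios


# HYPOTHETICAL toy counts, chosen only to illustrate the calculation.
toy_counts = {
    "big data": {"Physical": 600, "Social": 250, "Health": 100, "Life": 50},
    "novel data": {"Physical": 40, "Social": 20, "Health": 30, "Life": 10},
}

ratios = representation_ratios(toy_counts)
for category, row in ratios.items():
    print(category, {s: round(r, 2) for s, r in row.items()})
```

Note that the table total here is the sum of all cells, so an article carrying several categories or subject designations is counted more than once; this is a simplification of the sketch.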
Finally, Table 5 (below) lists the 20 top author keywords for each of the four main subjects. The results again underscore the preponderance of ‘big data’, which is by far the most frequent keyword across all four subjects. They also reconfirm the earlier observation that ‘machine learning’, ‘artificial intelligence’ and ‘deep learning’ have become key common denominators in each subject and the overall field. Similarly, ‘social media’ is a common denominator found across all four subjects. At the same time, the results reveal clear differentiation aligned with disciplinary interests: for example, the top keywords in the Health Sciences and Life Sciences subjects are a good indicator of how ‘big data’, social media and data analytics are used in support of biomedical and public health research. The top keywords in the Social Sciences indicate a preoccupation with ethical and privacy aspects of ‘big data’, especially pertaining to social media data and related techniques, such as sentiment analysis. Not unexpectedly, the top keywords in the Physical Sciences are predominantly related to advanced data techniques and infrastructure.
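A ranking of this kind can be sketched as a per-subject frequency count over author keywords. The example below is a minimal illustration under stated assumptions, not the authors' procedure: the record layout (`subjects`, `keywords`), the lower-casing and whitespace normalisation, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def top_keywords_by_subject(records, k=20):
    """Return the k most frequent author keywords for each subject.

    Each record is a dict with a list of top-level subjects and a list of
    author keywords. An article contributes each of its keywords once to
    every subject it is assigned to.
    """
    counters = defaultdict(Counter)
    for record in records:
        # Normalise case and whitespace; de-duplicate within one article.
        keywords = {kw.strip().lower() for kw in record["keywords"] if kw.strip()}
        for subject in record["subjects"]:
            counters[subject].update(keywords)
    return {subject: c.most_common(k) for subject, c in counters.items()}


# HYPOTHETICAL toy records, for illustration only.
toy_records = [
    {"subjects": ["Physical Sciences"],
     "keywords": ["Big Data", "machine learning", "Hadoop"]},
    {"subjects": ["Physical Sciences", "Social Sciences"],
     "keywords": ["big data", "privacy"]},
    {"subjects": ["Social Sciences"],
     "keywords": ["big data ", "sentiment analysis"]},
    {"subjects": ["Health Sciences"],
     "keywords": ["big data", "public health"]},
]

top = top_keywords_by_subject(toy_records, k=20)
for subject, ranking in top.items():
    print(subject, ranking)
```

How keyword variants are merged (for example singular and plural forms, or spelling variants) materially affects such rankings, so any real analysis would need an explicit normalisation rule.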
To summarize Section 4, the key findings of this study are as follows: (a) the ‘big data’ research field has evolved over the last three decades from a niche area into a major scientific domain which has grown exponentially since the early 2010s; (b) it is conceptually co-constituted through seven main data categories that form a close-knit network with ‘big data’ at its centre, plus several more peripheral data categories; (c) it is further characterized by the prevalence of a cluster of author keywords, shared among the network of data categories, that signal the growing focus on AI; (d) other author keywords are more specific to particular data categories, thus creating conceptual demarcations; (e) while the Physical Sciences have the greatest representation among the four top-level subjects, the distribution of data categories and author keywords is spread fairly evenly, suggesting a significant degree of interdisciplinarity while at the same time exhibiting disciplinary specificities.

5. Discussion

Kitchin [1,2] aptly characterized the rapid advancements in digital data analytics across society over the last decade as a ‘data revolution’. A similar revolution occurred concurrently in the scientific realm, where ‘big data’ research rapidly morphed into a major domain spanning all major disciplines, with ‘big data’ acting as both a tool for research and a topic of inquiry. Our study confirms and significantly extends recent scientometric analyses charting the meteoric rise of ‘big data’ in the scientific literature [11,25,32,41]. Importantly, while there is broad agreement that the ‘(big) data revolution’ began in the early 2010s (our study pinpoints the start of the exponential growth phase in the scientific literature in 2012–2013), the emergence of ‘big data’ research can be traced at least two decades further back. Tseng et al. [31] reported evidence of ‘big data’-labelled research as early as 1993, without, however, providing numeric information. Our study identified 1998 as the first publication year captured in Scopus (see Table 2). In short, ‘big data’ as a research niche emerged around the mid-1990s. It is, therefore, essential to recognize that the roots of ‘big data’ research go back a considerable length of time during which foundational scientific work was carried out under the radar of wider public attention.
In providing a long-term perspective, this study also reveals that ‘big data’, as a key concept and category in research, has co-existed in close relationship with other formative data categories. As such, it co-constitutes a network of data categories that scientists variably mobilize when describing and designating their research. We identified six data categories that closely intertwine with ‘big data’ as the main category. Crucially, four of these categories can be traced even further back than ‘big data’: ‘digital data’, ‘novel data’, ‘intelligent data’ and ‘smart data’ were all mentioned in publications as early as 1993. For their part, ‘mobile phone data’ and ‘social media data’ began to appear in the early to late 2000s. In other words, these early data categories laid the foundation of the emergent conceptual data network in which ‘big data’ became the centre point during the ensuing ‘data revolution’.
In applying author keyword analysis, which has gained growing recognition in scientometrics [7,8,33,35], to the ‘big data’ research field, this study provides new conceptual insights at aggregate level. As Choi et al. [35] noted, ‘the popularity of some keywords serve as an indicator of the importance of the research themes they represent’. On this basis, one can discern some conceptual distinctiveness, or heterogeneity, among the main data categories, especially in phase 1 (1993–2012). At the same time, the number of shared keywords is significant, thus further confirming a strong conceptual network at work. Significantly, as the research field began to grow exponentially and ‘big data’ became the dominant category (see Figure 5a,b), the top-level author keywords across the seven data categories developed greater homogeneity along three main strands: (a) the preponderance of AI (‘machine learning’, ‘deep learning’, etc.) as the focus and driver of ‘big data’ research; (b) the importance of software and hardware infrastructure (‘cloud computing’, ‘Hadoop’, ‘MapReduce’) as an enabler of large-scale distributed analysis of ‘big data’ sets; and (c) the influence of social media (‘Twitter’, etc.) as both source and object of ‘big data’ analysis. These developments have been instrumental in addressing scalability challenges inherent to ‘big data’ research, particularly in processing, storing, and analyzing increasingly vast datasets [42]. Additionally, emerging methods such as deep learning have enhanced scalability by processing unstructured and heterogeneous datasets while adapting to dynamic research demands, especially in IoT applications [43]. Such innovations reflect the integral role of scalability in supporting the conceptual and technological evolution of ‘big data’ as a research field. Looking ahead, the issue of scalability will remain critical as datasets continue to grow in size and complexity, particularly with the rapid adoption of AI-related large models (e.g., LLMs), edge computing, and real-time analytics.
The encompassing nature of the ‘big data’ research field is further on display in terms of the disciplinary distribution of publications. Earlier studies had identified broad disciplinary engagement, albeit for shorter timeframes (e.g., 2004–2015 in the case of [44,45]). Our study confirms for the 30-year period that the Physical Sciences have the greatest share, as mathematics, computing and engineering have from the start been the disciplinary sources of the development and advancement of ‘big data’ analytical capabilities. At the same time, the other top-level subjects enthusiastically embraced ‘big data’ as both a research tool and an object of analysis. The Social Sciences are a case in point, where digital data analytics is an increasingly common research tool and the political, social and cultural applications and implications of ‘big data’ are subject to critical examination. Likewise, the Health and Life Sciences routinely embed ‘big data’ analytics in their research practices. The degree of interdisciplinarity between these top-level subjects is significant: as ‘big data’ tools and techniques are increasingly used across disciplines, they take on the significance of a shared methodological language; and as ‘big data’ is examined from different angles as a socio-technical object, it forms a shared conceptual research interest.
‘Big data’ is a rapidly and continuously evolving knowledge field. It is, therefore, important to stay attuned to the latest developments. Consequently, future research should extend analysis to the most recent period (2023 onwards) once complete publication records become available in Scopus (or Web of Science) following indexing. Indicatively, looking at a range of outputs (including survey and review articles) published in the last couple of years, it becomes apparent that the latest advances in, among others, ‘big data’ predictive analytics, distributed deep learning, pre-trained large models and spatio-temporal ‘big data’ analytics further accelerate the application of ‘big data’ analytics across various subject areas, policy domains and industries [42,43,46,47]. For example, industrial ‘big data’ (IBD) is an emergent field with significant potential but also inherent challenges, including multi-source, heterogeneous data and a lack of mutual trust in data sharing given the competitive nature of industry [48]. In healthcare, transformer models (a type of deep neural network) originally developed for natural language processing tasks are now being used to process biological sequences, and federated learning (a machine learning application) is being developed to improve data sharing and intelligence across healthcare systems and organizations while at the same time safeguarding sensitive patient information [49,50]. However, the move to carrying out ‘big data’ analytics in the Cloud poses methodological and organizational challenges, namely how to ensure data security and privacy in terms of platform interoperability, access control, real-time processing, algorithmic efficiency, data storage, and secure and private data analytics [27,51]. Some of the aforementioned methods and techniques, particularly transformer and pre-trained large models, can reduce the need for massive datasets by leveraging transfer learning and fine-tuning. In turn, this may lead to a shift in emphasis from ‘big data’ to various deep learning models. ‘Big data’ might thus soon be replaced by ‘machine learning’ or ‘deep learning’ as the top keyword across the knowledge field.
Taken together, the bibliometric mapping presented in this study offers grounded insight into plausible future trajectories of ‘big data’ research. The temporal and category-based keyword analysis (Section 4.2) reveals a shift in emphasis from more generic data descriptors (e.g., ‘satellite imagery’, ‘GIS’, ‘digital data’) to platform-centric terms and advanced data-related techniques (‘social media’, ‘Twitter’, ‘machine learning’, ‘deep learning’). This suggests an increasing embedding of data in new everyday practices (e.g., assisted driving) and infrastructures (e.g., ‘smart grids’). Meanwhile, the concentration and confluence of author keywords across the seven most frequent data categories (Section 4.2.3) demonstrate both conceptual clustering and thematic convergence, as illustrated by the clusters centred upon ‘mobile phone data’ and ‘social media data’. The keyword distribution across top-level subject areas (Table 5) further reveals differentiated thematic preoccupations: privacy and sentiment analysis in the Social Sciences, infrastructure and AI techniques in the Physical Sciences, and public health applications in the Health Sciences. Importantly, our category-level analysis (Table 4) shows that categories like ‘mobile phone data’ and ‘social media data’ are increasingly populated by terms such as ‘mobility’, ‘COVID-19’, ‘sentiment analysis’, and ‘Twitter’, suggesting the growing relevance of data-intensive methods in urban, health, and social media research. Rather than predicting specific developments, this mapping enables a plausible reading of how the conceptual and disciplinary contours of ‘big data’ research are evolving. It offers a grounded framework for identifying emerging alignments, potential gaps, and the shifting boundaries of relevance within the field. It highlights plausible trajectories and blind spots that merit scholarly attention, such as the normative implications of cross-category data use or the conceptual fuzziness emerging from category convergence.

6. Conclusions

The aim of this study was to analyse the overall size and conceptual and disciplinary shape of the ‘big data’ research field, against the background of an ongoing ‘data revolution’ transforming wider society. Building on previous scientometric work, the study makes a novel contribution by (a) examining the evolution of the research field over an extended 30-year period (1993–2022) and (b) focusing on the field’s conceptual dimensions based upon 17 data categories and associated author keywords. As a result, in response to RQ.1, the findings highlight the unprecedented growth of the field in the third decade (2013–2022), to a total of 70,163 articles, following two decades of formative development (1993–2012). Seven data categories (‘big data’, ‘digital data’, ‘novel data’, ‘intelligent data’, ‘smart data’, ‘mobile phone data’, ‘social media data’) co-constitute the core conceptual network, with ‘big data’ firmly at its centre. Moreover, in response to RQ.2, the analysis of the 315,235 author keywords harvested demonstrates a growing influence of several thematic nodes, including artificial intelligence and machine learning, distributed hardware and software networks, and social media as a tool for and object of ‘big data’ research. Lastly, in response to RQ.3, the findings show that the research field encompasses all four top-level subjects, with the biggest contribution made by the Physical Sciences followed by the Social Sciences, the Health Sciences and the Life Sciences. A significant proportion of publications straddle two or more subjects, thus indicating considerable interdisciplinarity.
In sum, looking across the 30 years, the research field has transitioned from a focus on foundational data collection techniques to the application of advanced computational methods, the development of data-intensive systems, the increase in computational and storage capabilities, the integration of social media, the pursuit of topical applications of interest (social media, health, mobility, spatial analysis), and the growing recognition of ethical and governance considerations. In all of this, ‘big data’ has become by far the most dominant and central category used by scientists across various disciplines, thus justifying the name given to the field overall.
There are several implications of these findings for how the scientific community may engage with ‘big data’ as a research domain. First, scholars would be well advised to reference their work with ‘big data’ in order to try to reach a large, receptive scientific audience. Even if ‘big data’ may seem somewhat generic or no longer novel, the findings clearly demonstrate that it has become the main classifier across conceptual and disciplinary boundaries. Second, relatedly, scholars should be cognizant of the co-existence of a core group of seven data categories, each with subtly different conceptual connotations, and the related opportunity to combine data categories and author keywords in ways that align scholarly work with specific thematic and disciplinary interests. Third, scholars should also be aware of presently more peripheral data categories (e.g., digital footprint data, smart data), as they point to potential niches in the research landscape from which future innovations might emerge. Fourth, scholars should feel encouraged to engage in interdisciplinary research, given the evidence of significant cross-boundary work; again, they would be well advised to mobilize relevant data categories and author keywords to align their work with the desired interdisciplinary orientation. Our taxonomy of 17 data categories offers a multifaceted overview of the ‘big data’ research field, with various entry points to explore the rapidly evolving epistemic terrain. It can, thus, help researchers position their work in relation to diverse configurations of digital data types, themes, methods, and application domains.
Apart from the scholarly community, policymakers may find the findings useful in two respects. First, the findings confirm the continuous growth of the research field, which merits ongoing investment in data research and related infrastructure. Second, they confirm the presence of ‘big data’ research across all four main subject areas (physical, social, health, and life sciences) and the related value in furthering interdisciplinarity. Moreover, policymakers and research funders might be interested in the emergence of new data categories as an indicator of future trends.
Finally, it is important to note the following research limitations. As the research was conducted in 2023–2024, the last year of fully indexed publications was 2022 (given the time lag between publication and indexing). Therefore, the most recent period (2023 onwards) will need to be analysed in follow-up research. It is also important to note that we deliberately restricted the scientometric analysis to peer-reviewed journal and review articles as they represent the ‘gold standard’ of scientific output. Non-peer-reviewed research, such as conference proceedings, and the grey literature were, therefore, not included. For the latter, one would need to use, e.g., Google Scholar, although this is not a controlled database and as such is inherently less stable than Scopus and Web of Science. Furthermore, not all the data collected could be analysed owing to the limited space of this paper. However, the full dataset is made openly available (see Supplementary Document) to enable other researchers to (re)analyse the data. For example, the author keyword analysis could be further refined by, say, looking more closely at keyword distributions across the 27 disciplines within the four top-level subjects. Likewise, more recent, low-key data categories (e.g., ‘clickstream data’, ‘digital footprint data’, ‘digital exhaust data’) are worth further analysis in terms of their disciplinary origins and future potential. This is particularly relevant in light of ongoing shifts in data architecture, ethical frameworks, and analytic techniques (e.g., large models, AI agents), which can be expected to reshape and expand the contours of what qualifies as ‘big data’ research. Finally, as this study focused on the conceptual and disciplinary shape of the research field, additional bibliometric inquiries are worth pursuing. For example, it could be worth analysing more closely authorship patterns (e.g., major reference works, key authors, international collaborations) as well as geographic distributions (e.g., publication outputs across global regions). Again, the Supplementary Document provides rich source material for other researchers to pursue these analytical avenues.

Supplementary Materials

The full, curated dataset can be accessed at: http://bit.ly/4kFxcdx. It contains the curated data on the 70,163 articles and the 315,235 author keywords collected as part of this study. The data is presented in Excel format, suitable for further analysis.

Author Contributions

Conceptualization, S.J. and I.P.K.; methodology, I.P.K. and S.J.; software, I.P.K.; validation, I.P.K. and S.J.; formal analysis, I.P.K. and S.J.; investigation, I.P.K. and S.J.; resources, S.J.; data curation, I.P.K.; writing—original draft preparation, S.J. and I.P.K.; writing—review and editing, S.J. and I.P.K.; visualization, I.P.K.; supervision, S.J.; project administration, S.J.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the UK Economic & Social Research Council, grant number ES/S007105/1.

Institutional Review Board Statement

Not applicable (no human participants and no personal or social media data involved).

Informed Consent Statement

Not applicable.

Data Availability Statement

The full, curated dataset underlying this study is made publicly available for use by other researchers/analysts at: http://bit.ly/4kFxcdx. See ‘Supplementary Materials’ (above) for further information.

Acknowledgments

We are grateful for the support of the Urban Big Data Centre, University of Glasgow, where this study was conducted.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kitchin, R. The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences; Sage: Hong Kong, China, 2014; ISBN 1-4739-0826-4.
  2. Kitchin, R. The Data Revolution: A Critical Analysis of Big Data, Open Data and Data Infrastructures; Sage: Hong Kong, China, 2021; ISBN 1-5297-6677-X.
  3. Castells, M. The Rise of the Network Society. In The Information Age/Manuel Castells, 2nd ed.; with a new preface, [reprint]; Wiley-Blackwell: Malden, MA, USA, 2011; ISBN 978-1-4051-9686-4.
  4. Negroponte, N. Being Digital, 1st ed.; Hodder & Stoughton: London, UK, 1995; ISBN 978-0-340-64525-3.
  5. Fernández, A.; del Río, S.; López, V.; Bawakid, A.; del Jesus, M.J.; Benítez, J.M.; Herrera, F. Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. WIREs Data Min. Knowl. Discov. 2014, 4, 380–409.
  6. Kitchin, R.; McArdle, G. What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets. Big Data Soc. 2016, 3, 2053951716631130.
  7. Schraven, D.; Joss, S.; de Jong, M. Past, Present, Future: Engagement with Sustainable Urban Development through 35 City Labels in the Scientific Literature 1990–2019. J. Clean. Prod. 2021, 292, 125924.
  8. De Jong, M.; Joss, S.; Schraven, D.; Zhan, C.; Weijnen, M. Sustainable–Smart–Resilient–Low Carbon–Eco–Knowledge Cities; Making Sense of a Multitude of Concepts Promoting Sustainable Urbanization. J. Clean. Prod. 2015, 109, 25–38.
  9. Abdian, S.; Shahri, M.H.; Khadivar, A. A Bibliometric Analysis of Research on Big Data and Its Potential to Value Creation and Capture. Interdiscip. J. Manag. Stud. 2023, 16, 1–24.
  10. Ahmad, I.; Ahmed, G.; Shah, S.A.A.; Ahmed, E. A Decade of Big Data Literature: Analysis of Trends in Light of Bibliometrics. J. Supercomput. 2020, 76, 3555–3571.
  11. Chavez, H.; Albornoz, M.B.; Martín, F. ‘Big Data’ Research: A Bibliometric Analysis of the Scopus Database, 2009–2019. J. Scientometr. Res. 2022, 11, 64–78.
  12. Kalantari, A.; Kamsin, A.; Kamaruddin, H.S.; Ale Ebrahim, N.; Gani, A.; Ebrahimi, A.; Shamshirband, S. A Bibliometric Approach to Tracking Big Data Research Trends. J. Big Data 2017, 4, 30.
  13. Liu, X.; Sun, R.; Wang, S.; Wu, Y.J. The Research Landscape of Big Data: A Bibliometric Analysis. Libr. Hi Tech 2020, 38, 367–384.
  14. Parlina, A.; Ramli, K.; Murfi, H. Theme Mapping and Bibliometrics Analysis of One Decade of Big Data Research in the Scopus Database. Information 2020, 11, 69.
  15. Beer, D. How Should We Do the History of Big Data? Big Data Soc. 2016, 3, 2053951716646135.
  16. Mayer-Schönberger, V.; Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think; Houghton Mifflin Harcourt: Boston, MA, USA, 2013; ISBN 0-544-00269-5.
  17. Boyd, D.; Crawford, K. Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Inf. Commun. Soc. 2012, 15, 662–679.
  18. Sokiyna, M.Y.; Aqel, M.J.; Naqshbandi, O.A. Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming. J. Inf. Technol. Manag. 2020, 12, 100–113.
  19. Amoore, L. Cloud Ethics: Algorithms and the Attributes of Ourselves and Others; Duke University Press: Durham, NC, USA, 2020; ISBN 978-1-4780-0778-4.
  20. Crawford, K.; Schultz, J. Big Data and Due Process: Toward a Framework to Redress Predictive Privacy Harms. BCL Rev. 2014, 55, 93.
  21. Guay, R.; Birch, K. A Comparative Analysis of Data Governance: Socio-Technical Imaginaries of Digital Personal Data in the USA and EU (2008–2016). Big Data Soc. 2022, 9, 205395172211129.
  22. Cantorani, J.R.H.; de Oliveira, M.R.; Pilatti, L.A.; de Sousa, T.B. Agri-Food Sector: Contemporary Trends, Possible Gaps, and Prospective Directions. Metrics 2025, 2, 3.
  23. Colangelo, M.T.; Guizzardi, S.; Galli, C. Topic Modeling as a Tool to Identify Research Diversity: A Study Across Dental Disciplines. Metrics 2024, 1, 3.
  24. Vieira, E.d.S. A Bibliometric Analysis of Neonatal Condition Research in Africa: Volume, Impact, Themes, and Collaboration. Metrics 2025, 2, 2.
  25. Raban, D.R.; Gordon, A. The Evolution of Data Science and Big Data Research: A Bibliometric Analysis. Scientometrics 2020, 122, 1563–1581.
  26. Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to Conduct a Bibliometric Analysis: An Overview and Guidelines. J. Bus. Res. 2021, 133, 285–296.
  27. Tosi, D.; Kokaj, R.; Roccetti, M. 15 Years of Big Data: A Systematic Literature Review. J. Big Data 2024, 11, 73.
  28. Lyu, X.; Costas, R. Studying the Characteristics of Scientific Communities Using Individual-Level Bibliometrics: The Case of Big Data Research. Scientometrics 2021, 126, 6965–6987.
  29. MacFeely, S. The Big (Data) Bang: Opportunities and Challenges for Compiling SDG Indicators. Glob. Policy 2019, 10, 121–133.
  30. Singh, V.K.; Banshal, S.K.; Singhal, K.; Uddin, A. Scientometric Mapping of Research on ‘Big Data’. Scientometrics 2015, 105, 727–741.
  31. Tseng, S.F.; Won, Y.L.; Yang, J.M. A Bibliometric Analysis on Data Mining and Big Data. Int. J. Electron. Bus. 2016, 13, 38.
  32. Gupta, D.; Rani, R. A Study of Big Data Evolution and Research Challenges. J. Inf. Sci. 2019, 45, 322–340.
  33. Sampagnaro, G. Keyword Occurrences and Journal Specialization. Scientometrics 2023, 128, 5629–5645.
  34. Xu, J.; Bu, Y.; Ding, Y.; Yang, S.; Zhang, H.; Yu, C.; Sun, L. Understanding the Formation of Interdisciplinary Research from the Perspective of Keyword Evolution: A Case Study on Joint Attention. Scientometrics 2018, 117, 973–995.
  35. Choi, J.; Yi, S.; Lee, K.C. Analysis of Keyword Networks in MIS Research and Implications for Predicting Knowledge Evolution. Inf. Manag. 2011, 48, 371–381.
  36. Singh, V.K.; Singh, P.; Karmakar, M.; Leta, J.; Mayr, P. The Journal Coverage of Web of Science, Scopus and Dimensions: A Comparative Analysis. Scientometrics 2021, 126, 5113–5142.
  37. Baas, J.; Schotten, M.; Plume, A.; Côté, G.; Karimi, R. Scopus as a Curated, High-Quality Bibliometric Data Source for Academic Research in Quantitative Science Studies. Quant. Sci. Stud. 2020, 1, 377–386.
  38. Fruchterman, T.M.J.; Reingold, E.M. Graph Drawing by Force-Directed Placement. Softw. Pract. Exp. 1991, 21, 1129–1164.
  39. Narong, D.K.; Hallinger, P. A Keyword Co-Occurrence Analysis of Research on Service Learning: Conceptual Foci and Emerging Research Trends. Educ. Sci. 2023, 13, 339.
  40. Kipper, L.M.; Furstenau, L.B.; Hoppe, D.; Frozza, R.; Iepsen, S. Scopus Scientific Mapping Production in Industry 4.0 (2011–2018): A Bibliometric Analysis. Int. J. Prod. Res. 2020, 58, 1605–1627.
  41. Halevi, G.; Moed, H. The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature. Res. Trends 2012, 1, 2.
  42. Jamarani, A.; Haddadi, S.; Sarvizadeh, R.; Haghi Kashani, M.; Akbari, M.; Moradi, S. Big Data and Predictive Analytics: A Systematic Review of Applications. Artif. Intell. Rev. 2024, 57, 176.
  43. Selmy, H.A.; Mohamed, H.K.; Medhat, W. Big Data Analytics Deep Learning Techniques and Applications: A Survey. Inf. Syst. 2024, 120, 102318.
  44. Hu, J.; Zhang, Y. Discovering the Interdisciplinary Nature of Big Data Research through Social Network Analysis and Visualization. Scientometrics 2017, 112, 91–109.
  45. Hu, J.; Zhang, Y. Measuring the Interdisciplinarity of Big Data Research: A Longitudinal Study. Online Inf. Rev. 2018, 42, 681–696.
  46. Berloco, F.; Bevilacqua, V.; Colucci, S. Distributed Analytics for Big Data: A Survey. Neurocomputing 2024, 574, 127258.
  47. Liang, H.; Zhang, Z.; Hu, C.; Gong, Y.; Cheng, D. A Survey on Spatio-Temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Trans. Big Data 2024, 10, 174–193.
  48. Liu, L.; Li, J.; Lv, J.; Wang, J.; Zhao, S.; Lu, Q. Privacy-Preserving and Secure Industrial Big Data Analytics: A Survey and the Research Framework. IEEE Internet Things J. 2024, 11, 18976–18999.
  49. Madan, S.; Lentzen, M.; Brandt, J.; Rueckert, D.; Hofmann-Apitius, M.; Fröhlich, H. Transformer Models in Biomedicine. BMC Med. Inform. Decis. Mak. 2024, 24, 214.
  50. Babar, M.; Qureshi, B.; Koubaa, A. Review on Federated Learning for Digital Transformation in Healthcare through Big Data Analytics. Future Gener. Comput. Syst. 2024, 160, 14–28.
  51. Amaithi Rajan, A.; Vetriselvi, V. Systematic Survey: Secure and Privacy-Preserving Big Data Analytics in Cloud. J. Comput. Inf. Syst. 2024, 64, 136–156.
Figure 1. Methodological design and four-step procedure.
Figure 2. Growth in publications mentioning at least one data category 1993–2022.
Figure 3. Number of articles per data category (mentioned at least once) 1993–2022.
Figure 4. Co-occurrence of data categories: (a) phase 1: 1993–2012; (b) phase 2: 2013–2022; (c) across 30 years: 1993–2022.
Figure 5. Keyword density map of top 50 author keywords: (a) phase 1: 1993–2012; (b) phase 2: 2013–2022.
Figure 6. Distribution of all publications (1993–2022) across four top-level subject areas.
Figure 7. Co-occurrences of 7 main data categories with 4 top-level subject areas (health sciences; life sciences; physical sciences; social sciences).
Table 1. Overview of previous bibliometric studies on ‘big data’ research.

| Authors (Year) | Focus | Period | Search Query | Key Findings/Insights |
|---|---|---|---|---|
| Halevi and Moed (2012) | Evolution of ‘big data’ as research | 2000–2012 | ‘big data’ | Origins of ‘big data’ research |
| Singh et al. (2015) | Mapping of ‘big data’ research | 2010–2014 | ‘big data’ | Clusters, themes and collaborations |
| Tseng et al. (2016) | Co-evolution of data mining and ‘big data’ | 1983–2014 | ‘data mining’, ‘big data’ | Data mining research (1983-) preceding ‘big data’ research (1993-) |
| Kalantari et al. (2017) | Analysing ‘big data’ research trends | 1983–2014 | ‘data analytics’, ‘Hadoop’, ‘machine learning’, ‘MapReduce’, ‘large dataset’, ‘big data’, ‘data warehouse’, ‘predictive analytics’, ‘NoSQL’, ‘unstructured data’, ‘data science’, ‘sentiment analysis’, ‘data center’ | Evolution of research themes |
| Gupta and Rani (2019) | Evolution of ‘big data’ research | 2000–2017 | ‘big data’ | Main growth occurring post-2010 |
| Ahmad et al. (2020) | Trends in ‘big data’ literature | 2008–2017 | ‘big data’ | Key authors, journals, research trends |
| Liu et al. (2020) | ‘big data’ research landscape | 2013–2018 | ‘big data’ | Conceptual organization of research field |
| Raban & Gordon (2020) | Data science and ‘big data’ evolution | 2006–2019 | ‘big data’, ‘data science’ | Expansion and diversification into various domains |
| Lyu and Costas (2021) | ‘big data’ research communities | 2008–2017 | ‘big data’ | Formation of social structure (research communities) of the field |
| Chavez et al. (2022) | ‘big data’ research trends (2009–2019) | 2009–2019 | ‘big data’ | Main research areas and growing interdisciplinary applications |
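Table 2. Annual publication outputs * for top 13 data categories (1993–2022). (* Number of articles that include given data category at least once in title, abstract and/or keywords).

| Year | Big Data | Novel Data | Digital Data | Social Media Data | Intelligent Data | Mobile Phone Data | Smart Data | RFID | Clickstream Data | Emergent Data | Web Browsing Data | Digital Footprint Data | Online Activity Data | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1993 | 0 | 9 | 108 | 0 | 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 127 |
| 1994 | 0 | 16 | 115 | 0 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 140 |
| 1995 | 0 | 17 | 123 | 0 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 151 |
| 1996 | 0 | 30 | 155 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 202 |
| 1997 | 0 | 30 | 147 | 0 | 16 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 194 |
| 1998 | 1 | 24 | 122 | 0 | 17 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 166 |
| 1999 | 2 | 39 | 140 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 196 |
| 2000 | 1 | 45 | 147 | 0 | 17 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 212 |
| 2001 | 2 | 50 | 140 | 0 | 16 | 0 | 2 | 0 | 5 | 0 | 0 | 0 | 0 | 215 |
| 2002 | 1 | 58 | 128 | 0 | 12 | 1 | 2 | 0 | 7 | 0 | 0 | 0 | 0 | 209 |
| 2003 | 3 | 80 | 153 | 0 | 15 | 0 | 0 | 1 | 5 | 2 | 0 | 0 | 0 | 259 |
| 2004 | 3 | 101 | 133 | 0 | 24 | 0 | 4 | 4 | 7 | 1 | 0 | 0 | 0 | 277 |
| 2005 | 3 | 101 | 173 | 0 | 19 | 0 | 2 | 3 | 2 | 1 | 0 | 0 | 0 | 304 |
| 2006 | 6 | 111 | 183 | 0 | 32 | 0 | 4 | 5 | 6 | 2 | 0 | 0 | 0 | 349 |
| 2007 | 2 | 141 | 174 | 0 | 31 | 0 | 4 | 10 | 1 | 4 | 0 | 0 | 0 | 367 |
| 2008 | 1 | 152 | 138 | 0 | 18 | 1 | 4 | 9 | 1 | 4 | 1 | 0 | 0 | 329 |
| 2009 | 10 | 193 | 179 | 1 | 23 | 4 | 4 | 17 | 12 | 1 | 0 | 0 | 0 | 444 |
| 2010 | 9 | 208 | 156 | 3 | 27 | 1 | 4 | 22 | 5 | 4 | 0 | 0 | 0 | 439 |
| 2011 | 29 | 249 | 165 | 3 | 36 | 3 | 3 | 32 | 6 | 4 | 1 | 0 | 0 | 531 |
| 2012 | 197 | 266 | 174 | 16 | 28 | 12 | 7 | 24 | 6 | 0 | 0 | 0 | 0 | 730 |
| 2013 | 607 | 390 | 186 | 31 | 42 | 18 | 7 | 31 | 8 | 5 | 1 | 0 | 0 | 1326 |
| 2014 | 1602 | 385 | 215 | 47 | 41 | 30 | 7 | 18 | 9 | 5 | 0 | 0 | 0 | 2359 |
| 2015 | 2533 | 374 | 243 | 92 | 28 | 49 | 17 | 23 | 5 | 6 | 0 | 1 | 1 | 3372 |
| 2016 | 3633 | 467 | 310 | 146 | 47 | 49 | 25 | 17 | 14 | 9 | 1 | 1 | 1 | 4720 |
| 2017 | 4360 | 513 | 316 | 229 | 41 | 48 | 27 | 14 | 15 | 8 | 0 | 0 | 2 | 5573 |
| 2018 | 5572 | 670 | 336 | 261 | 62 | 70 | 30 | 13 | 14 | 12 | 0 | 0 | 1 | 7041 |
| 2019 | 7110 | 695 | 438 | 309 | 87 | 70 | 48 | 16 | 29 | 16 | 0 | 2 | 0 | 8820 |
| 2020 | 7771 | 896 | 482 | 407 | 112 | 110 | 62 | 9 | 24 | 14 | 2 | 3 | 1 | 9893 |
| 2021 | 8430 | 1003 | 554 | 497 | 143 | 113 | 53 | 13 | 35 | 10 | 2 | 2 | 3 | 10,858 |
| 2022 | 8095 | 944 | 530 | 442 | 148 | 94 | 49 | 15 | 24 | 8 | 5 | 5 | 1 | 10,360 |
| Total | 49,983 | 8257 | 6563 | 2484 | 1141 | 673 | 372 | 296 | 240 | 116 | 14 | 14 | 10 | 70,163 |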
Table 3. (a)–(c) Top 20 author keywords for phase 1, phase 2 and entire 30-year period. (Ranking 1–20; frequencies in brackets.)

| Rank | 3.a: 1993–2012 | 3.b: 2013–2022 | 3.c: 1993–2022 |
|---|---|---|---|
| 1 | Data mining (120) | ‘big data’ (17706) | ‘big data’ (17763) |
| 2 | GIS (64) | Machine learning (3803) | Machine learning (3820) |
| 3 | Big data (57) | Artificial intelligence (2035) | Artificial intelligence (2043) |
| 4 | RFID (52) | Deep learning (1884) | Deep learning (1884) |
| 5 | Remote sensing (44) | Social media (1711) | Social media (1728) |
| 6 | Digital data (41) | Data mining (1544) | Data mining (1664) |
| 7 | Intelligent data analysis (30) | Cloud computing (1539) | Cloud computing (1551) |
| 8 | Internet (28) | ‘big data’ analytics (1423) | ‘big data’ analytics (1424) |
| 9 | Data hiding (26) | Internet of Things (1215) | Internet of Things (1222) |
| 10 | Classification (25) | Mapreduce (902) | Mapreduce (909) |
| 11 | Neural networks (22) | Hadoop (897) | Hadoop (905) |
| 12 | Geographic information system (22) | COVID-19 (811) | COVID-19 (811) |
| 13 | Image processing (20) | Twitter (666) | Twitter (674) |
| 14 | Inflammation (19) | Industry 4.0 (588) | Privacy (589) |
| 15 | Data fusion (19) | ‘big data’ analysis (588) | Industry 4.0 (588) |
| 16 | Metadata (18) | Privacy (583) | ‘big data’ analysis (588) |
| 17 | Social media (17) | Sentiment analysis (573) | Clustering (587) |
| 18 | Machine learning (17) | Clustering (571) | Sentiment analysis (575) |
| 19 | Visualization (17) | Data analytics (564) | Data analytics (565) |
| 20 | Data analysis (17) | Data science (535) | Classification (553) |
Table 4. (a) Top author keywords for seven main data categories (1993–2012). (Ranking 1–20; frequencies in brackets). (b) Top author keywords for seven main data categories (2013–2022). (Ranking 1–20; frequencies in brackets).

(a)

| Rank | ‘Big Data’ | ‘Novel Data’ | Digital Data | Social Media Data | Intelligent Data | Mobile Phone Data | Smart Data |
|---|---|---|---|---|---|---|---|
| 1 | Hadoop (8) | Data mining (53) | Gis (61) | Social media (10) | Data mining (41) | Human mobility (2) | Data mining (4) |
| 2 | Cloud computing (8) | Data hiding (20) | Remote sensing (42) | Twitter (4) | Intelligent data analysis (30) | Reality mining (2) | GSM (2) |
| 3 | Mapreduce (6) | Inflammation (19) | Geographic information system (22) | Web search (2) | Machine learning (10) | Milan urban region (1) | Fuzzy logic (2) |
| 4 | Data mining (5) | Data fusion (13) | Image processing (18) | Community detection (2) | Data analysis (6) | Mobility mapping (1) | Middleware (2) |
| 5 | Data management (4) | Apoptosis (13) | Geographic information systems (16) | Data collection (2) | Decision support systems (6) | Italy (1) | Database density (1) |
| 6 | Data analysis (4) | Classification (12) | Watermarking (14) | Virtual worlds (2) | Visualization (6) | GIS (1) | Frequent pattern list (fpl) (1) |
| 7 | Visualization (4) | Atherosclerosis (11) | 1:250’000 geological map (14) | Information retrieval (1) | Information systems (5) | Spatial analysis (1) | Transaction pattern list (tpl) (1) |
| 8 | Internet of Things (3) | Gene expression (11) | Digital (13) | Diversity (1) | Data fusion (5) | Human dynamics (1) | Frequent itemsets (1) |
| 9 | Data compression (3) | Clustering (10) | Metadata (13) | Medical social media (1) | Intelligent data carrier (5) | Multi-agent model (1) | Smart das (1) |
| 10 | Analytics (3) | Development (9) | Internet (11) | Web data sharing (1) |  | Organizational dynamics (1) | M2M (1) |
| 11 | Deep analysis (3) | Immunohistochemistry (9) | Landslides (11) | Social media recommendation (1) | Intelligent systems (4) | Social computing (1) | SMS (1) |
| 12 | Communication studies (3) | Cancer (9) | Engineering geology (11) | Common preference group (1) | Classification (4) | Living labs (1) | Smart GIS (1) |
| 13 | Social media (3) | Mass spectrometry (9) | Digital radiography (11) | Locality sensitive hashing (1) | Data collection (4) | Sensor networks (1) | Bus-charging current (1) |
| 14 | Twitter (3) | Cytokines (8) | Groundwater (10) | Multiple neighbourhood similarity (1) | Support vector machine (4) | Stochastic process (1) | Disconnector (1) |
| 15 | Gartner research (2) | Pregnancy (8) | Limestone (9) | Visual categorization (1) | Olap (4) | Switch data (1) | TEV (1) |
| 16 | Security (2) | Neural networks (8) | Digital data acquisition (9) | Nearest neighbour method (1) | Neural networks (4) | Data mining (1) | Anti-electromagnetism interference (1) |
| 17 | Smart city (2) | Genetic algorithm (8) | Natural hazards (9) | Network analysis (1) | Decision trees (3) | Extract traffic conditions (1) | Transmission line (1) |
| 18 | Web 2.0 (2) | Obesity (8) | Holographic data storage (9) | Computational social science (1) | Artificial intelligence (3) | Traffic data (1) | Data acquiring systems (1) |
| 19 | Virtualization (2) | Data structures (8) | Security (8) | Digital methods (1) | Feature selection (3) | Maximal information coefficient (1) | Real time fault analysis (1) |
| 20 | Geographic information science (2) | Wireless sensor networks (7) | Telemedicine (8) | Politics (1) | Simulation (3) | Geographic information science (1) | Grid computing (1) |

(b)

| Rank | ‘Big Data’ | ‘Novel Data’ | Digital Data | Social Media Data | Intelligent Data | Mobile Phone Data | Smart Data |
|---|---|---|---|---|---|---|---|
| 1 | Machine learning (3772) | Machine learning (172) | ‘big data’ (170) | Social media (957) | Machine learning (67) | Human mobility (61) | ‘big data’ (40) |
| 2 | Artificial intelligence (1876) | Deep learning (130) | Machine learning (82) | Twitter (310) | Intelligent data analysis (53) | ‘big data’ (48) | Internet of Things (24) |
| 3 | Deep learning (1578) | Data mining (100) | Artificial intelligence (60) | Sentiment analysis (219) | Deep learning (45) | COVID-19 (30) | Machine learning (20) |
| 4 | Cloud computing (1458) | ‘big data’ (92) | Social media (53) | ‘big data’ (193) | Artificial intelligence (35) | Mobile phone (24) | Smart cities (8) |
| 5 | ‘big data’ analytics (1423) | Data augmentation (78) | COVID-19 (44) | Machine learning (164) | Internet of Things (30) | Machine learning (20) | Artificial intelligence (8) |
| 6 | Data mining (1316) | Data-driven (69) | Data (43) | COVID-19 (114) | ‘big data’ (28) | Call detail records (17) | Smart city (8) |
| 7 | Internet of Things (1121) | Inflammation (54) | Watermarking (40) | Natural language processing (93) | Data mining (27) | Mobility (17) | Deep learning (8) |
| 8 | Hadoop (871) | Data fusion (50) | Security (39) | Deep learning (76) | Fault diagnosis (15) | Mobile phones (15) | Industry 4.0 (7) |
| 9 | Mapreduce (870) | Clustering (46) | Encryption (38) | Data mining (71) | Intelligent data processing (14) | Activity space (13) | Smart data pricing (6) |
| 10 | Social media (784) | COVID-19 (40) | Privacy (32) | Text mining (67) | Cloud computing (13) | Urban mobility (10) | Crowdsourcing (6) |
| 11 | ‘big data’ analysis (588) | Classification (38) | GIS (32) | Social media analytics (65) | Clustering (11) | Social networks (9) | IoT (5) |
| 12 | COVID-19 (579) | Wireless sensor networks (32) | Digital (30) | Social networks (51) | Prediction (11) | Commuting (9) | Data management (5) |
| 13 | Industry 4.0 (561) | Artificial intelligence (30) | Digital health (28) | Facebook (45) | Feature extraction (11) | Estonia (9) | Cloud computing (5) |
| 14 | Data analytics (516) | Fault diagnosis (29) | Digital watermarking (27) | Topic modelling (41) | Support vector machine (10) | Segregation (9) | Data mining (5) |
| 15 | Privacy (512) | Data hiding (29) | Data mining (26) | Artificial intelligence (32) | Data fusion (10) | Land use (9) | Energy efficiency (5) |
| 16 | Data science (497) | Social media (28) | Cloud computing (26) | Social media analysis (30) | Classification (10) | Social network (8) | Wireless sensor network (5) |
| 17 | IoT (483) | Oxidative stress (27) | Digitalization (26) | Social network analysis (30) | Wireless sensor networks (10) | Travel behaviour (8) | COVID-19 (5) |
| 18 | Clustering (472) | Convolutional neural network (27) | Deep learning (26) | Cultural ecosystem services (29) | Data analysis (9) | Classification (7) | Data (4) |
| 19 | Blockchain (425) | Feature selection (26) | Datafication (26) | Content analysis (29) | IoT (9) | Data fusion (7) | Data analytics (4) |
| 20 | Classification (420) | Optimization (26) | Surveillance (25) | Opinion mining (28) | Internet of Things (IoT) (9) | Mobile phone data analysis (7) | Open data (4) |
Table 5. (a)–(d) Top author keywords per top-level subject areas. (Ranking 1–20; frequencies in brackets).

| Rank | 5.a: Physical Sciences | 5.b: Social Sciences | 5.c: Health Sciences | 5.d: Life Sciences |
|---|---|---|---|---|
| 1 | Big data (11,477) | Big data (6376) | Big data (2047) | Big data (1665) |
| 2 | Machine learning (2581) | Social media (960) | Machine learning (682) | Machine learning (562) |
| 3 | Deep learning (1502) | Machine learning (881) | Artificial intelligence (526) | Artificial intelligence (278) |
| 4 | Cloud computing (1368) | Big data analytics (664) | Social media (310) | Deep learning (220) |
| 5 | Data mining (1268) | Artificial intelligence (601) | COVID-19 (307) | Bioinformatics (159) |
| 6 | Big data analytics (1002) | Data mining (404) | Deep learning (218) | Precision science (128) |
| 7 | Artificial intelligence (981) | Twitter (392) | Precision medicine (195) | Data mining (120) |
| 8 | Internet of Things (959) | Internet of Things (298) | Data mining (170) | Genomics (112) |
| 9 | Social media (919) | Privacy (294) | Epidemiology (164) | Social media (103) |
| 10 | Mapreduce (846) | COVID-19 (288) | Public health (134) | Internet of Things (83) |
| 11 | Hadoop (825) | Deep learning (282) | Electronic health records (126) | Cloud computing (76) |
| 12 | Clustering (525) | Cloud computing (253) | Twitter (114) | Cancer (75) |
| 13 | Classification (476) | Data analytics (249) | Digital health (101) | COVID-19 (74) |
| 14 | IoT (457) | Sentiment analysis (248) | Natural language processing (96) | Proteomics (72) |
| 15 | Big data analysis (434) | Industry 4.0 (227) | Personalized medicine (92) | Systems biology (70) |
| 16 | Industry 4.0 (410) | Data science (223) | Data science (88) | Hadoop (66) |
| 17 | Sentiment analysis (404) | Analytics (202) | Genomics (86) | Inflammation (64) |
| 18 | Data analytics (400) | Text mining (198) | Healthcare (78) | Classification (60) |
| 19 | Twitter (377) | Ethics (174) | Ethics (78) | Data science (60) |
| 20 | Security (372) | Hadoop (158) | Bioinformatics (74) | Personal medicine (58) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Perez Karich, I.; Joss, S. Emergence and Evolution of ‘Big Data’ Research: A 30-Year Scientometric Analysis of the Knowledge Field. Metrics 2025, 2, 15. https://doi.org/10.3390/metrics2030015
