Emergence and Evolution of ‘Big Data’ Research: A 30-Year Scientometric Analysis of the Knowledge Field
Abstract
1. Introduction
- RQ1. How has the ‘big data’ research field grown across time, as measured by cumulative publication rates and the related occurrences and co-occurrences of 17 data categories?
- RQ2. How is the ‘big data’ research field conceptually characterized, as measured by occurrences of keywords and co-occurrence of data categories and keywords?
- RQ3. How is the ‘big data’ research field characterized in (inter)disciplinary terms, as measured by the distribution of publications, data categories and keywords across four main subject areas?
2. Literature Review
2.1. ‘Big Data’: From Concept to Broad Sociotechnical Application
2.2. Mapping the Research Field
2.3. Author Keywords and Interdisciplinarity
3. Materials and Methods
3.1. Selection of 17 Data Categories
3.2. Data Collection and Processing
3.3. Data Analysis
3.3.1. (Co-)Occurrence of 17 Data Categories and Author Keywords
- Data category frequencies. Across the entire corpus, we calculated in how many articles each of the 17 data categories appeared at least once in the title, abstract or author keywords. Each article was counted only once per data category regardless of multiple mentions of a given data category in title, abstract and author keywords in the same article.
- Temporal segmentation. Initially, 5-year intervals were applied to analyse the temporal appearances of the 17 data categories. After observing a major spike in overall publication numbers around 2012, with exponential growth thereafter largely due to the surge in articles mentioning ‘big data’, the temporal analysis was concentrated on two main periods: 1993–2011 (first period) and 2012–2022 (second period). This highlights which data categories gained traction in the initial phase and how they fared in the second phase once ‘big data’ established itself as the dominant category. The temporal segmentation into two phases also reflects observed thematic transitions in the co-occurrence of data categories and keywords, with ‘big data’ and related concepts emerging as central to the field after 2012.
- Focus on seven main data categories. Following the initial analysis of the 17 data categories, we concentrated further analysis on the seven most significant data categories, namely those that showed a sustained presence across the 30-year period and/or a total frequency of >1000. These are: ‘big data’, ‘novel data’, ‘digital data’, ‘social media data’, ‘intelligent data’, ‘mobile phone data’, ‘smart data’.
- Network analysis. Using social network analysis (SNA) methodology, we analyzed the data categories to identify conceptual connections in the literature. We created network graphs based on the Fruchterman–Reingold “spring” layout algorithm [38], using Python’s NetworkX library in JupyterLab. This force-directed method models nodes as mutually repelling bodies and edges as springs pulling connected nodes together, and iteratively minimizes a global “energy” function so that (a) highly connected nodes cluster centrally, (b) less-connected nodes are pushed to the periphery, and (c) edge crossings are reduced. We thus visualized co-occurrence relationships. Data categories (and keywords; see below) represent nodes, and their co-occurrence in the same article represents an edge, creating a network that visualizes the interconnectedness of concepts [7,8]. A threshold of at least five articles was used to count co-occurrences, thereby balancing the need to capture meaningful relationships against the need to filter out weak or incidental links. Lower thresholds would increase network density but risk overemphasizing noise; higher thresholds could exclude significant yet less frequent associations. This approach, informed by a previous scientometric study that used author keyword co-occurrence analysis [8], ensures clarity and the identification of robust relationships between data categories.
- Frequency and co-occurrence analysis. Across the entire corpus, a total of 315,235 Author Keywords were collected. The occurrence of each author keyword was measured across the entire dataset and across the specified time periods. The analysis also included author keyword pairs, that is, two keywords appearing together in the same article [39]. As a threshold, pairs were counted if they appeared in at least five articles.
- Network analysis. As outlined above, the same social network analysis technique was used to analyse and visualize author keyword co-occurrences.
- Keyword density mapping. To complement the keyword co-occurrence analysis, continuous 2-D density maps of author-keyword usage were generated for the two main phases (1993–2012; 2013–2022). This overlays a density surface on a two-dimensional embedding of the top author keywords [40]. Bubble size represents overall keyword frequency, and the background heat highlights regions where terms most densely co-occur, thus visualizing both core and peripheral themes at a glance. Procedurally, first, the 50 most frequent keywords in each phase were identified, followed by measuring how often each pair of keywords appeared in the same paper. The resulting pairwise counts were used to arrange the keywords on a two-dimensional map so that closely related terms sit near one another. Finally, a smooth ‘heat’ layer was overlaid to highlight regions where many keywords clump together, making it easy to see the field’s main thematic ‘hotspots’ [40].
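The counting and network steps listed above can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the authors' code: the toy corpus and the function names `count_occurrences`, `count_cooccurrences` and `spring_positions` are hypothetical. Only the five-article threshold and the use of NetworkX's Fruchterman–Reingold (“spring”) layout come from the text. NetworkX is imported lazily, so the counting functions run with the standard library alone.

```python
from collections import Counter
from itertools import combinations


def count_occurrences(articles):
    """Number of articles in which each term appears at least once."""
    counts = Counter()
    for terms in articles:
        counts.update(set(terms))  # set() counts a term once per article
    return counts


def count_cooccurrences(articles, min_articles=5):
    """Count unordered term pairs per article and keep those at or above the threshold."""
    pairs = Counter()
    for terms in articles:
        # sorted() makes each pair a canonical (a, b) tuple with a < b
        pairs.update(combinations(sorted(set(terms)), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_articles}


def spring_positions(edges, seed=42):
    """Fruchterman-Reingold layout of the thresholded network (requires NetworkX)."""
    import networkx as nx  # optional dependency, imported only when needed

    graph = nx.Graph()
    graph.add_weighted_edges_from((a, b, n) for (a, b), n in edges.items())
    return nx.spring_layout(graph, weight="weight", seed=seed)


# Hypothetical toy corpus: each article is the list of terms (data categories
# or author keywords) found in its title, abstract or author keywords.
articles = (
    [["big data", "social media data", "big data"]] * 5
    + [["big data", "digital data"]] * 4
    + [["novel data"]] * 2
)

occurrences = count_occurrences(articles)
edges = count_cooccurrences(articles, min_articles=5)
```

In this toy corpus the ‘big data’ and ‘digital data’ pair appears in only four articles and is therefore dropped by the threshold, while the ‘big data’ and ‘social media data’ pair is kept as an edge of weight 5. Passing `edges` to `spring_positions` would return one 2-D coordinate per node for plotting.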
3.3.2. Co-Occurrence of Data Categories with Author Keywords
3.3.3. Interdisciplinary Analysis Using ASJC Codes
- Data segmentation. The consolidated dataset was filtered by each of the top seven data categories, creating individual datasets (data frames) for each label.
- Disciplinary Categorization. The ASJC codes categorized articles into four main disciplinary segments: Physical Sciences, Life Sciences, Health Sciences, and Social Sciences.
- Keyword co-occurrence by discipline and data category. For each of these seven data frames (corresponding to each data category), the data was (i) further segmented by the four ASJC top-level subject areas, and (ii) the top 20 author keywords were calculated in each of these disciplinary segments.
- Thematic trajectory analysis. For each data category, the thematic trajectory was analysed across different disciplinary contexts. The examination of the top co-occurring keywords for each data category within each disciplinary segment provides insight into how ‘big data’ concepts are applied and understood in various disciplines as well as across the research field overall.
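The segmentation steps above can be sketched as follows. This is an illustrative sketch only: the record structure, the function name `top_keywords_by_area` and the partial `ASJC_AREA` mapping (four two-digit ASJC prefixes, one per subject area) are assumptions made for the example. The sketch also assumes that an article carrying codes from several areas is counted once in each of them, which the text does not specify.

```python
from collections import Counter, defaultdict

# Illustrative, deliberately partial mapping from two-digit ASJC prefixes to the
# four top-level subject areas (assumption: a full mapping would be used in practice).
ASJC_AREA = {
    "13": "Life Sciences",      # Biochemistry, Genetics and Molecular Biology
    "17": "Physical Sciences",  # Computer Science
    "27": "Health Sciences",    # Medicine
    "33": "Social Sciences",    # Social Sciences
}


def top_keywords_by_area(records, category, k=20):
    """Top-k author keywords per subject area for articles mentioning `category`."""
    per_area = defaultdict(Counter)
    for rec in records:
        if category not in rec["categories"]:
            continue  # data segmentation: keep only this data category
        # An article may belong to several subject areas; count it once in each.
        areas = {ASJC_AREA[c[:2]] for c in rec["asjc"] if c[:2] in ASJC_AREA}
        for area in areas:
            per_area[area].update(set(rec["keywords"]))
    return {area: counts.most_common(k) for area, counts in per_area.items()}


# Hypothetical records; the ASJC codes are examples only.
records = [
    {"categories": {"big data"}, "asjc": ["1702", "3312"],
     "keywords": ["machine learning", "privacy"]},
    {"categories": {"big data"}, "asjc": ["1705"],
     "keywords": ["machine learning"]},
    {"categories": {"smart data"}, "asjc": ["2718"],
     "keywords": ["digital health"]},
]

result = top_keywords_by_area(records, "big data", k=1)
```

Calling the function once per data category yields, for each of the four subject areas, the ranked keyword list that the thematic comparison draws on.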
4. Results
4.1. Publication Output 1993–2022
(Co-)Occurrences of 17 Data Categories
4.2. Conceptual Analysis Through Author Keywords
4.2.1. Author Keyword Occurrences
4.2.2. Keyword Density Overview
4.2.3. Co-Occurrences of Author Keywords with Data Categories
4.3. (Inter)Disciplinary Boundaries
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kitchin, R. The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences; Sage: Hong Kong, China, 2014; ISBN 1-4739-0826-4.
- Kitchin, R. The Data Revolution: A Critical Analysis of Big Data, Open Data and Data Infrastructures; Sage: Hong Kong, China, 2021; ISBN 1-5297-6677-X.
- Castells, M. The Rise of the Network Society. In The Information Age/Manuel Castells, 2nd ed.; with a new preface, [reprint]; Wiley-Blackwell: Malden, MA, USA, 2011; ISBN 978-1-4051-9686-4.
- Negroponte, N. Being Digital, 1st ed.; Hodder & Stoughton: London, UK, 1995; ISBN 978-0-340-64525-3.
- Fernández, A.; del Río, S.; López, V.; Bawakid, A.; del Jesus, M.J.; Benítez, J.M.; Herrera, F. Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. WIREs Data Min. Knowl. Discov. 2014, 4, 380–409.
- Kitchin, R.; McArdle, G. What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets. Big Data Soc. 2016, 3, 2053951716631130.
- Schraven, D.; Joss, S.; de Jong, M. Past, Present, Future: Engagement with Sustainable Urban Development through 35 City Labels in the Scientific Literature 1990–2019. J. Clean. Prod. 2021, 292, 125924.
- De Jong, M.; Joss, S.; Schraven, D.; Zhan, C.; Weijnen, M. Sustainable–Smart–Resilient–Low Carbon–Eco–Knowledge Cities; Making Sense of a Multitude of Concepts Promoting Sustainable Urbanization. J. Clean. Prod. 2015, 109, 25–38.
- Abdian, S.; Shahri, M.H.; Khadivar, A. A Bibliometric Analysis of Research on Big Data and Its Potential to Value Creation and Capture. Interdiscip. J. Manag. Stud. 2023, 16, 1–24.
- Ahmad, I.; Ahmed, G.; Shah, S.A.A.; Ahmed, E. A Decade of Big Data Literature: Analysis of Trends in Light of Bibliometrics. J. Supercomput. 2020, 76, 3555–3571.
- Chavez, H.; Albornoz, M.B.; Martín, F. ‘Big Data’ Research: A Bibliometric Analysis of the Scopus Database, 2009–2019. J. Scientometr. Res. 2022, 11, 64–78.
- Kalantari, A.; Kamsin, A.; Kamaruddin, H.S.; Ale Ebrahim, N.; Gani, A.; Ebrahimi, A.; Shamshirband, S. A Bibliometric Approach to Tracking Big Data Research Trends. J. Big Data 2017, 4, 30.
- Liu, X.; Sun, R.; Wang, S.; Wu, Y.J. The Research Landscape of Big Data: A Bibliometric Analysis. Libr. Hi Tech 2020, 38, 367–384.
- Parlina, A.; Ramli, K.; Murfi, H. Theme Mapping and Bibliometrics Analysis of One Decade of Big Data Research in the Scopus Database. Information 2020, 11, 69.
- Beer, D. How Should We Do the History of Big Data? Big Data Soc. 2016, 3, 2053951716646135.
- Mayer-Schönberger, V.; Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think; Houghton Mifflin Harcourt: Boston, MA, USA, 2013; ISBN 0-544-00269-5.
- Boyd, D.; Crawford, K. Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Inf. Commun. Soc. 2012, 15, 662–679.
- Sokiyna, M.Y.; Aqel, M.J.; Naqshbandi, O.A. Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming. J. Inf. Technol. Manag. 2020, 12, 100–113.
- Amoore, L. Cloud Ethics: Algorithms and the Attributes of Ourselves and Others; Duke University Press: Durham, NC, USA, 2020; ISBN 978-1-4780-0778-4.
- Crawford, K.; Schultz, J. Big Data and Due Process: Toward a Framework to Redress Predictive Privacy Harms. BCL Rev. 2014, 55, 93.
- Guay, R.; Birch, K. A Comparative Analysis of Data Governance: Socio-Technical Imaginaries of Digital Personal Data in the USA and EU (2008–2016). Big Data Soc. 2022, 9, 205395172211129.
- Cantorani, J.R.H.; de Oliveira, M.R.; Pilatti, L.A.; de Sousa, T.B. Agri-Food Sector: Contemporary Trends, Possible Gaps, and Prospective Directions. Metrics 2025, 2, 3.
- Colangelo, M.T.; Guizzardi, S.; Galli, C. Topic Modeling as a Tool to Identify Research Diversity: A Study Across Dental Disciplines. Metrics 2024, 1, 3.
- Vieira, E.d.S. A Bibliometric Analysis of Neonatal Condition Research in Africa: Volume, Impact, Themes, and Collaboration. Metrics 2025, 2, 2.
- Raban, D.R.; Gordon, A. The Evolution of Data Science and Big Data Research: A Bibliometric Analysis. Scientometrics 2020, 122, 1563–1581.
- Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to Conduct a Bibliometric Analysis: An Overview and Guidelines. J. Bus. Res. 2021, 133, 285–296.
- Tosi, D.; Kokaj, R.; Roccetti, M. 15 Years of Big Data: A Systematic Literature Review. J. Big Data 2024, 11, 73.
- Lyu, X.; Costas, R. Studying the Characteristics of Scientific Communities Using Individual-Level Bibliometrics: The Case of Big Data Research. Scientometrics 2021, 126, 6965–6987.
- MacFeely, S. The Big (Data) Bang: Opportunities and Challenges for Compiling SDG Indicators. Glob. Policy 2019, 10, 121–133.
- Singh, V.K.; Banshal, S.K.; Singhal, K.; Uddin, A. Scientometric Mapping of Research on ‘Big Data’. Scientometrics 2015, 105, 727–741.
- Tseng, S.F.; Won, Y.L.; Yang, J.M. A Bibliometric Analysis on Data Mining and Big Data. Int. J. Electron. Bus. 2016, 13, 38.
- Gupta, D.; Rani, R. A Study of Big Data Evolution and Research Challenges. J. Inf. Sci. 2019, 45, 322–340.
- Sampagnaro, G. Keyword Occurrences and Journal Specialization. Scientometrics 2023, 128, 5629–5645.
- Xu, J.; Bu, Y.; Ding, Y.; Yang, S.; Zhang, H.; Yu, C.; Sun, L. Understanding the Formation of Interdisciplinary Research from the Perspective of Keyword Evolution: A Case Study on Joint Attention. Scientometrics 2018, 117, 973–995.
- Choi, J.; Yi, S.; Lee, K.C. Analysis of Keyword Networks in MIS Research and Implications for Predicting Knowledge Evolution. Inf. Manag. 2011, 48, 371–381.
- Singh, V.K.; Singh, P.; Karmakar, M.; Leta, J.; Mayr, P. The Journal Coverage of Web of Science, Scopus and Dimensions: A Comparative Analysis. Scientometrics 2021, 126, 5113–5142.
- Baas, J.; Schotten, M.; Plume, A.; Côté, G.; Karimi, R. Scopus as a Curated, High-Quality Bibliometric Data Source for Academic Research in Quantitative Science Studies. Quant. Sci. Stud. 2020, 1, 377–386.
- Fruchterman, T.M.J.; Reingold, E.M. Graph Drawing by Force-Directed Placement. Softw. Pract. Exp. 1991, 21, 1129–1164.
- Narong, D.K.; Hallinger, P. A Keyword Co-Occurrence Analysis of Research on Service Learning: Conceptual Foci and Emerging Research Trends. Educ. Sci. 2023, 13, 339.
- Kipper, L.M.; Furstenau, L.B.; Hoppe, D.; Frozza, R.; Iepsen, S. Scopus Scientific Mapping Production in Industry 4.0 (2011–2018): A Bibliometric Analysis. Int. J. Prod. Res. 2020, 58, 1605–1627.
- Halevi, G.; Moed, H. The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature. Res. Trends 2012, 1, 2.
- Jamarani, A.; Haddadi, S.; Sarvizadeh, R.; Haghi Kashani, M.; Akbari, M.; Moradi, S. Big Data and Predictive Analytics: A Systematic Review of Applications. Artif. Intell. Rev. 2024, 57, 176.
- Selmy, H.A.; Mohamed, H.K.; Medhat, W. Big Data Analytics Deep Learning Techniques and Applications: A Survey. Inf. Syst. 2024, 120, 102318.
- Hu, J.; Zhang, Y. Discovering the Interdisciplinary Nature of Big Data Research through Social Network Analysis and Visualization. Scientometrics 2017, 112, 91–109.
- Hu, J.; Zhang, Y. Measuring the Interdisciplinarity of Big Data Research: A Longitudinal Study. Online Inf. Rev. 2018, 42, 681–696.
- Berloco, F.; Bevilacqua, V.; Colucci, S. Distributed Analytics for Big Data: A Survey. Neurocomputing 2024, 574, 127258.
- Liang, H.; Zhang, Z.; Hu, C.; Gong, Y.; Cheng, D. A Survey on Spatio-Temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Trans. Big Data 2024, 10, 174–193.
- Liu, L.; Li, J.; Lv, J.; Wang, J.; Zhao, S.; Lu, Q. Privacy-Preserving and Secure Industrial Big Data Analytics: A Survey and the Research Framework. IEEE Internet Things J. 2024, 11, 18976–18999.
- Madan, S.; Lentzen, M.; Brandt, J.; Rueckert, D.; Hofmann-Apitius, M.; Fröhlich, H. Transformer Models in Biomedicine. BMC Med. Inform. Decis. Mak. 2024, 24, 214.
- Babar, M.; Qureshi, B.; Koubaa, A. Review on Federated Learning for Digital Transformation in Healthcare through Big Data Analytics. Future Gener. Comput. Syst. 2024, 160, 14–28.
- Amaithi Rajan, A.; Vetriselvi, V. Systematic Survey: Secure and Privacy-Preserving Big Data Analytics in Cloud. J. Comput. Inf. Syst. 2024, 64, 136–156.
Authors (Year) | Focus | Period | Search Query | Key Findings/Insights |
---|---|---|---|---|
Halevi and Moed (2012) | Evolution of ‘big data’ as research | 2000–2012 | ‘big data’ | Origins of ‘big data’ research |
Singh et al. (2015) | Mapping of ‘big data’ research | 2010–2014 | ‘big data’ | Clusters, themes and collaborations |
Tseng et al. (2016) | Co-evolution of data mining and ‘big data’ | 1983–2014 | ‘data mining’, ‘big data’ | Data mining research (1983-) preceding ‘big data’ research (1993-) |
Kalantari et al. (2017) | Analysing ‘big data’ research trends | 1983–2014 | ‘data analytics’, ‘Hadoop’, ‘machine learning’, ‘MapReduce’, ‘large dataset’, ‘big data’, ‘data warehouse’, ‘predictive analytics’, ‘NoSQL’, ‘unstructured data’, ‘data science’, ‘sentiment analysis’, ‘data center’ | Evolution of research themes |
Gupta and Rani (2019) | Evolution of ‘big data’ research | 2000–2017 | ‘big data’ | Main growth occurring post-2010 |
Ahmad et al. (2020) | Trends in ‘big data’ literature | 2008–2017 | ‘big data’ | Key authors, journals, research trends |
Liu et al. (2020) | ‘big data’ research landscape | 2013–2018 | ‘big data’ | Conceptual organization of research field |
Raban & Gordon (2020) | Data science and ‘big data’ evolution | 2006–2019 | ‘big data’, ‘data science’ | Expansion and diversification into various domains |
Lyu and Costas (2021) | ‘big data’ research communities | 2008–2017 | ‘big data’ | Formation of social structure (research communities) of the field |
Chavez et al. (2022) | ‘big data’ research trends (2009–2019) | 2009–2019 | ‘big data’ | Main research areas and growing interdisciplinary applications |
Year | Big Data | Novel Data | Digital Data | Social Media Data | Intelligent Data | Mobile Phone Data | Smart Data | RFID | Clickstream Data | Emergent Data | Web Browsing Data | Digital Footprint Data | Online Activity Data | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1993 | 0 | 9 | 108 | 0 | 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 127 |
1994 | 0 | 16 | 115 | 0 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 140 |
1995 | 0 | 17 | 123 | 0 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 151 |
1996 | 0 | 30 | 155 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 202 |
1997 | 0 | 30 | 147 | 0 | 16 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 194 |
1998 | 1 | 24 | 122 | 0 | 17 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 166 |
1999 | 2 | 39 | 140 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 196 |
2000 | 1 | 45 | 147 | 0 | 17 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 212 |
2001 | 2 | 50 | 140 | 0 | 16 | 0 | 2 | 0 | 5 | 0 | 0 | 0 | 0 | 215 |
2002 | 1 | 58 | 128 | 0 | 12 | 1 | 2 | 0 | 7 | 0 | 0 | 0 | 0 | 209 |
2003 | 3 | 80 | 153 | 0 | 15 | 0 | 0 | 1 | 5 | 2 | 0 | 0 | 0 | 259 |
2004 | 3 | 101 | 133 | 0 | 24 | 0 | 4 | 4 | 7 | 1 | 0 | 0 | 0 | 277 |
2005 | 3 | 101 | 173 | 0 | 19 | 0 | 2 | 3 | 2 | 1 | 0 | 0 | 0 | 304 |
2006 | 6 | 111 | 183 | 0 | 32 | 0 | 4 | 5 | 6 | 2 | 0 | 0 | 0 | 349 |
2007 | 2 | 141 | 174 | 0 | 31 | 0 | 4 | 10 | 1 | 4 | 0 | 0 | 0 | 367 |
2008 | 1 | 152 | 138 | 0 | 18 | 1 | 4 | 9 | 1 | 4 | 1 | 0 | 0 | 329 |
2009 | 10 | 193 | 179 | 1 | 23 | 4 | 4 | 17 | 12 | 1 | 0 | 0 | 0 | 444 |
2010 | 9 | 208 | 156 | 3 | 27 | 1 | 4 | 22 | 5 | 4 | 0 | 0 | 0 | 439 |
2011 | 29 | 249 | 165 | 3 | 36 | 3 | 3 | 32 | 6 | 4 | 1 | 0 | 0 | 531 |
2012 | 197 | 266 | 174 | 16 | 28 | 12 | 7 | 24 | 6 | 0 | 0 | 0 | 0 | 730 |
2013 | 607 | 390 | 186 | 31 | 42 | 18 | 7 | 31 | 8 | 5 | 1 | 0 | 0 | 1326 |
2014 | 1602 | 385 | 215 | 47 | 41 | 30 | 7 | 18 | 9 | 5 | 0 | 0 | 0 | 2359 |
2015 | 2533 | 374 | 243 | 92 | 28 | 49 | 17 | 23 | 5 | 6 | 0 | 1 | 1 | 3372 |
2016 | 3633 | 467 | 310 | 146 | 47 | 49 | 25 | 17 | 14 | 9 | 1 | 1 | 1 | 4720 |
2017 | 4360 | 513 | 316 | 229 | 41 | 48 | 27 | 14 | 15 | 8 | 0 | 0 | 2 | 5573 |
2018 | 5572 | 670 | 336 | 261 | 62 | 70 | 30 | 13 | 14 | 12 | 0 | 0 | 1 | 7041 |
2019 | 7110 | 695 | 438 | 309 | 87 | 70 | 48 | 16 | 29 | 16 | 0 | 2 | 0 | 8820 |
2020 | 7771 | 896 | 482 | 407 | 112 | 110 | 62 | 9 | 24 | 14 | 2 | 3 | 1 | 9893 |
2021 | 8430 | 1003 | 554 | 497 | 143 | 113 | 53 | 13 | 35 | 10 | 2 | 2 | 3 | 10,858 |
2022 | 8095 | 944 | 530 | 442 | 148 | 94 | 49 | 15 | 24 | 8 | 5 | 5 | 1 | 10,360 |
Total | 49,983 | 8257 | 6563 | 2484 | 1141 | 673 | 372 | 296 | 240 | 116 | 14 | 14 | 10 | 70,163 |
Rank | 3.a: 1993–2012 | 3.b: 2013–2022 | 3.c: 1993–2022 |
---|---|---|---|
1 | Data mining (120) | ‘big data’ (17706) | ‘big data’ (17763) |
2 | GIS (64) | Machine learning (3803) | Machine learning (3820) |
3 | Big data (57) | Artificial intelligence (2035) | Artificial intelligence (2043) |
4 | RFID (52) | Deep learning (1884) | Deep learning (1884) |
5 | Remote sensing (44) | Social media (1711) | Social media (1728) |
6 | Digital data (41) | Data mining (1544) | Data mining (1664) |
7 | Intelligent data analysis (30) | Cloud computing (1539) | Cloud computing (1551) |
8 | Internet (28) | ‘big data’ analytics (1423) | ‘big data’ analytics (1424) |
9 | Data hiding (26) | Internet of Things (1215) | Internet of Things (1222) |
10 | Classification (25) | Mapreduce (902) | Mapreduce (909) |
11 | Neural networks (22) | Hadoop (897) | Hadoop (905) |
12 | Geographic information system (22) | COVID-19 (811) | COVID-19 (811) |
13 | Image processing (20) | Twitter (666) | Twitter (674) |
14 | Inflammation (19) | Industry 4.0 (588) | Privacy (589) |
15 | Data fusion (19) | ‘big data’ analysis (588) | Industry 4.0 (588) |
16 | Metadata (18) | Privacy (583) | ‘big data’ analysis (588) |
17 | Social media (17) | Sentiment analysis (573) | Clustering (587) |
18 | Machine learning (17) | Clustering (571) | Sentiment analysis (575) |
19 | Visualization (17) | Data analytics (564) | Data analytics (565) |
20 | Data analysis (17) | Data science (535) | Classification (553) |
Rank | ‘Big Data’ | ‘Novel Data’ | Digital Data | Social Media Data | Intelligent Data | Mobile Phone Data | Smart Data |
---|---|---|---|---|---|---|---|
(a) | |||||||
1 | Hadoop (8) | Data mining (53) | GIS (61) | Social media (10) | Data mining (41) | Human mobility (2) | Data mining (4) |
2 | Cloud computing (8) | Data hiding (20) | Remote sensing (42) | Twitter (4) | Intelligent data analysis (30) | Reality mining (2) | GSM (2) |
3 | Mapreduce (6) | Inflammation (19) | Geographic information system (22) | Web search (2) | Machine learning (10) | Milan urban region (1) | Fuzzy logic (2) |
4 | Data mining (5) | Data fusion (13) | Image processing (18) | Community detection (2) | Data analysis (6) | Mobility mapping (1) | Middleware (2) |
5 | Data management (4) | Apoptosis (13) | Geographic information systems (16) | Data collection (2) | Decision support systems (6) | Italy (1) | Database density (1) |
6 | Data analysis (4) | Classification (12) | Watermarking (14) | Virtual worlds (2) | Visualization (6) | GIS (1) | Frequent pattern list (fpl) (1) |
7 | Visualization (4) | Atherosclerosis (11) | 1:250’000 geological map (14) | Information retrieval (1) | Information systems (5) | Spatial analysis (1) | Transaction pattern list (tpl) (1) |
8 | Internet of Things (3) | Gene expression (11) | Digital (13) | Diversity (1) | Data fusion (5) | Human dynamics (1) | Frequent itemsets (1) |
9 | Data compression (3) | Clustering (10) | Metadata (13) | Medical social media (1) | Intelligent data carrier (5) | Multi-agent model (1) | Smart das (1) |
10 | Analytics (3) | Development (9) | Internet (11) | Web data sharing (1) | | Organizational dynamics (1) | M2M (1) |
11 | Deep analysis (3) | Immunohistochemistry (9) | Landslides (11) | Social media recommendation (1) | Intelligent systems (4) | Social computing (1) | SMS (1) |
12 | Communication studies (3) | Cancer (9) | Engineering geology (11) | Common preference group (1) | Classification (4) | Living labs (1) | Smart GIS (1) |
13 | Social media (3) | Mass spectrometry (9) | Digital radiography (11) | Locality sensitive hashing (1) | Data collection (4) | Sensor networks (1) | Bus-charging current (1) |
14 | Twitter (3) | Cytokines (8) | Groundwater (10) | Multiple neighbourhood similarity (1) | Support vector machine (4) | Stochastic process (1) | Disconnector (1) |
15 | Gartner research (2) | Pregnancy (8) | Limestone (9) | Visual categorization (1) | OLAP (4) | Switch data (1) | TEV (1) |
16 | Security (2) | Neural networks (8) | Digital data acquisition (9) | Nearest neighbour method (1) | Neural networks (4) | Data mining (1) | Anti-electromagnetism interference (1) |
17 | Smart city (2) | Genetic algorithm (8) | Natural hazards (9) | Network analysis (1) | Decision trees (3) | Extract traffic conditions (1) | Transmission line (1) |
18 | Web 2.0 (2) | Obesity (8) | Holographic data storage (9) | Computational social science (1) | Artificial intelligence (3) | Traffic data (1) | Data acquiring systems (1) |
19 | Virtualization (2) | Data structures (8) | Security (8) | Digital methods (1) | Feature selection (3) | Maximal information coefficient (1) | Real time fault analysis (1) |
20 | Geographic information science (2) | Wireless sensor networks (7) | Telemedicine (8) | Politics (1) | Simulation (3) | Geographic information science (1) | Grid computing (1) |
(b) | |||||||
1 | Machine learning (3772) | Machine learning (172) | ‘big data’ (170) | Social media (957) | Machine learning (67) | Human mobility (61) | ‘big data’ (40) |
2 | Artificial intelligence (1876) | Deep learning (130) | Machine learning (82) | Twitter (310) | Intelligent data analysis (53) | ‘big data’ (48) | Internet of Things (24) |
3 | Deep learning (1578) | Data mining (100) | Artificial intelligence (60) | Sentiment analysis (219) | Deep learning (45) | COVID-19 (30) | Machine learning (20) |
4 | Cloud computing (1458) | ‘big data’ (92) | Social media (53) | ‘big data’ (193) | Artificial intelligence (35) | Mobile phone (24) | Smart cities (8) |
5 | ‘big data’ analytics (1423) | Data augmentation (78) | COVID-19 (44) | Machine learning (164) | Internet of Things (30) | Machine learning (20) | Artificial intelligence (8) |
6 | Data mining (1316) | Data-driven (69) | Data (43) | COVID-19 (114) | ‘big data’ (28) | Call detail records (17) | Smart city (8) |
7 | Internet of Things (1121) | Inflammation (54) | Watermarking (40) | Natural language processing (93) | Data mining (27) | Mobility (17) | Deep learning (8) |
8 | Hadoop (871) | Data fusion (50) | Security (39) | Deep learning (76) | Fault diagnosis (15) | Mobile phones (15) | Industry 4.0 (7) |
9 | Mapreduce (870) | Clustering (46) | Encryption (38) | Data mining (71) | Intelligent data processing (14) | Activity space (13) | Smart data pricing (6) |
10 | Social media (784) | COVID-19 (40) | Privacy (32) | Text mining (67) | Cloud computing (13) | Urban mobility (10) | Crowdsourcing (6) |
11 | ‘big data’ analysis (588) | Classification (38) | GIS (32) | Social media analytics (65) | Clustering (11) | Social networks (9) | IoT (5) |
12 | COVID-19 (579) | Wireless sensor networks (32) | Digital (30) | Social networks (51) | Prediction (11) | Commuting (9) | Data management (5) |
13 | Industry 4.0 (561) | Artificial intelligence (30) | Digital health (28) | Facebook (45) | Feature extraction (11) | Estonia (9) | Cloud computing (5) |
14 | Data analytics (516) | Fault diagnosis (29) | Digital watermarking (27) | Topic modelling (41) | Support vector machine (10) | Segregation (9) | Data mining (5) |
15 | Privacy (512) | Data hiding (29) | Data mining (26) | Artificial intelligence (32) | Data fusion (10) | Land use (9) | Energy efficiency (5) |
16 | Data science (497) | Social media (28) | Cloud computing (26) | Social media analysis (30) | Classification (10) | Social network (8) | Wireless sensor network (5) |
17 | IoT (483) | Oxidative stress (27) | Digitalization (26) | Social network analysis (30) | Wireless sensor networks (10) | Travel behaviour (8) | COVID-19 (5) |
18 | Clustering (472) | Convolutional neural network (27) | Deep learning (26) | Cultural ecosystem services (29) | Data analysis (9) | Classification (7) | Data (4) |
19 | Blockchain (425) | Feature selection (26) | Datafication (26) | Content analysis (29) | IoT (9) | Data fusion (7) | Data analytics (4) |
20 | Classification (420) | Optimization (26) | Surveillance (25) | Opinion mining (28) | Internet of Things (IoT) (9) | Mobile phone data analysis (7) | Open data (4) |
Rank | 5.a: Physical Sciences | 5.b: Social Sciences | 5.c: Health Sciences | 5.d: Life Sciences |
---|---|---|---|---|
1 | Big data (11,477) | Big data (6376) | Big data (2047) | Big data (1665) |
2 | Machine learning (2581) | Social media (960) | Machine learning (682) | Machine learning (562) |
3 | Deep learning (1502) | Machine learning (881) | Artificial intelligence (526) | Artificial intelligence (278) |
4 | Cloud computing (1368) | Big data analytics (664) | Social media (310) | Deep learning (220) |
5 | Data mining (1268) | Artificial intelligence (601) | COVID-19 (307) | Bioinformatics (159) |
6 | Big data analytics (1002) | Data mining (404) | Deep learning (218) | Precision science (128) |
7 | Artificial intelligence (981) | Twitter (392) | Precision medicine (195) | Data mining (120) |
8 | Internet of Things (959) | Internet of Things (298) | Data mining (170) | Genomics (112) |
9 | Social media (919) | Privacy (294) | Epidemiology (164) | Social media (103) |
10 | Mapreduce (846) | COVID-19 (288) | Public health (134) | Internet of Things (83) |
11 | Hadoop (825) | Deep learning (282) | Electronic health records (126) | Cloud computing (76) |
12 | Clustering (525) | Cloud computing (253) | Twitter (114) | Cancer (75) |
13 | Classification (476) | Data analytics (249) | Digital health (101) | COVID-19 (74) |
14 | IoT (457) | Sentiment analysis (248) | Natural language processing (96) | Proteomics (72) |
15 | Big data analysis (434) | Industry 4.0 (227) | Personalized medicine (92) | Systems biology (70) |
16 | Industry 4.0 (410) | Data science (223) | Data science (88) | Hadoop (66) |
17 | Sentiment analysis (404) | Analytics (202) | Genomics (86) | Inflammation (64) |
18 | Data analytics (400) | Text mining (198) | Healthcare (78) | Classification (60) |
19 | Twitter (377) | Ethics (174) | Ethics (78) | Data science (60) |
20 | Security (372) | Hadoop (158) | Bioinformatics (74) | Personal medicine (58) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Perez Karich, I.; Joss, S. Emergence and Evolution of ‘Big Data’ Research: A 30-Year Scientometric Analysis of the Knowledge Field. Metrics 2025, 2, 15. https://doi.org/10.3390/metrics2030015