An Empirical Study on Visualizing the Intellectual Structure and Hotspots of Big Data Research from a Sustainable Perspective

Big data has been extensively applied in many fields and is in demand for sustainable development. However, the rapidly growing number of publications and the dynamic nature of research fronts make it challenging to understand the current research situation and the sustainable development directions of big data. In this paper, we conducted a visual bibliometric study of the big data literature from the Web of Science (WoS) between 2002 and 2016, involving 4927 effective journal articles in 1729 journals contributed by 16,404 authors from 4137 institutions. The bibliometric results reveal the current annual publications distribution, journals distribution and co-citation network, institutions distribution and collaboration network, authors distribution, collaboration network and co-citation network, and research hotspots. The results can help researchers worldwide to understand the panorama of current big data research, to find potential research gaps, and to focus on future sustainable development directions.


Introduction
With the growing popularity of mobile terminals, the Internet of Things (IoT), social networks, cloud computing and mobile commerce, myriad data are generated, and the era of big data is coming. The advent of big data has promoted the revolution of data-driven thinking and decision making. Governments, industry and academia have paid great attention to big data strategy, technologies and applications. More and more people worldwide have made tremendous efforts in large-scale heterogeneous data collection, organization, storage, analysis, mining and applications under the big data environment. Big data has become a hot topic of discussion. For example, Nature and Science published the special issues "Big Data" in 2008 and "Dealing with Data" in 2011, respectively. In May 2011, the McKinsey Global Institute (MGI) released the research report "Big Data: The Next Frontier for Innovation, Competition, and Productivity" [1]. In March 2012, the U.S. President's Office of Science and Technology Policy declared publicly that the United States government would invest [...]
[...] mapping analyses. For example, we use the ad hoc software tool SATI3.2 [40] to clean the data in the preprocessing stage, and apply UCINET6 and CiteSpace V to build networks and visualize scientific mapping. SATI3.2, developed by Qiyuan Liu at Zhejiang University (China), is also applied to field data extraction, item frequency statistics, co-occurrence matrix construction, and visual analysis based on NetDraw. It is freely downloadable at http://sati.liuqiyuan.com. UCINET6 for Windows, developed by Lin Freeman, Martin Everett and Steve Borgatti, is a software package for the analysis of social network data. It comes with the NetDraw network visualization tool and can be downloaded at https://sites.google.com/site/ucinetsoftware/home. CiteSpace V, developed by Professor Chaomei Chen at Drexel University (USA), is used for visual analysis and scientific mapping. It is a Java-based information visualization and scientific mapping software package and is freely available at http://cluster.cis.drexel.edu/~cchen/citespace/. Its main functions include co-word network analysis and co-citation network analyses of authors, documents, institutions and journals. More importantly, CiteSpace V facilitates the identification of the chronological patterns of a specific knowledge domain, including research hotspots, intellectual turning points, and citation bursts.
The aim of this article is to demonstrate visually the intellectual structure and hotspots in big data research from 2002 to 2016. In particular, the distribution characteristics, intellectual turning points, and emerging trends are examined from the following perspectives: publications, journals, institutions, authors, and keywords.

Publications Distribution
To evaluate the outcomes of big data research between 2002 and 2016, we collected 4927 journal articles from the WoS databases and tracked the annual publications distribution of big data research (shown in Figure 1). There were few journal articles on big data research before 2009. However, a growth spurt occurred from 2010 to 2016, when dozens, and eventually thousands, of journal articles emerged each year.
As shown in Figure 1, we roughly divided the development of big data research into two stages. Stage I (2002-2009) is an embryonic stage with few annual articles, which indicates that big data exploration had just started. The topics in this stage are mostly introductions of theories, techniques, and methods related to big data, such as data-mining application architecture [41], SINFONI [42], MapReduce [43], Hive [44], the pathologies of big data [45], large-scale electrophysiology of big data [46], and so on. Stage II (2010-2016) shows a rapid growth spurt in annual research outcomes. In this stage, there were four articles in 2010; by 2016, the number of annual articles had sharply increased to 2402, a roughly 600-fold increase over six years. Such a significant change is attributable, to a great extent, to the growing research enthusiasm of governments, scholars and enterprises, reflected in the research report "Big Data: The Next Frontier for Innovation, Competition, and Productivity" [1], the declaration "The Big Data Research and Development Initiative" [2], the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think" [47], and the worldwide opening of the big data subject. All of these effectively promoted the rapid development of scientific research related to big data, and the studies of big data gradually matured.
To further verify the rapid growth trend of the research literature related to big data in Stage II, we fitted a curve to the annual counts and found that it conforms to an exponential function: $y = 5.076e^{1.175t}$, where $y$ is the number of annual publications and $t$ is the time sequence between 2010 and 2016. According to the goodness-of-fit test, the closer $R^2$ (the coefficient of determination) is to 1, the better the regression curve fits the data. The quantitative result shows that $R^2 = 0.974$, which is very close to 1. This indicates that the fitted regression curve has good goodness of fit and is reliable for forecasting. Therefore, the annual publications of big data research between 2010 and 2016 grew exponentially, and big data has become a hot topic that deserves more attention from scholars worldwide.
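The fitting step can be illustrated with a minimal sketch. It assumes that the exponential model is fitted by ordinary least squares on ln(y), which is one common approach and not necessarily the procedure used in this study. The annual counts below are synthetic values generated from the reported curve, and the indexing t = 1, ..., 7 for 2010 to 2016 is likewise an assumption; neither comes from the WoS data.

```python
import math

def fit_exponential(t_values, y_values):
    """Fit y = a * exp(b * t) by least squares on ln(y); return (a, b, R^2 in log space)."""
    n = len(t_values)
    log_y = [math.log(y) for y in y_values]
    mean_t = sum(t_values) / n
    mean_log_y = sum(log_y) / n
    sxx = sum((t - mean_t) ** 2 for t in t_values)
    sxy = sum((t - mean_t) * (ly - mean_log_y) for t, ly in zip(t_values, log_y))
    b = sxy / sxx                          # slope of the straight line in log space
    a = math.exp(mean_log_y - b * mean_t)  # intercept transformed back to the original scale
    ss_res = sum((ly - (math.log(a) + b * t)) ** 2 for t, ly in zip(t_values, log_y))
    ss_tot = sum((ly - mean_log_y) ** 2 for ly in log_y)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, r_squared

# Synthetic counts generated from the reported curve (assumption: t = 1..7 for 2010..2016).
t = list(range(1, 8))
y = [5.076 * math.exp(1.175 * ti) for ti in t]

a, b, r_squared = fit_exponential(t, y)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 = {r_squared:.3f}")
```

Because the synthetic points lie exactly on the curve, the sketch recovers the reported coefficients with R^2 equal to 1 up to rounding; real annual counts scatter around the curve and give a lower value.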
Figure 2 shows the annual number of authors who published articles from 2002 to 2016. The curve in Figure 2 is similar to the annual publications distribution in Figure 1. There were four authors in 2002, and 14 authors in 2010. However, the number of authors sharply increased to 9558 in 2016, which shows that the number of annual authors has increased hundreds of times over the past several years.
To further evaluate the annual collaboration ratio of researchers in the big data research field, we depicted the average number of participants per article from 2002 to 2016 (shown in Figure 3). However, we excluded 2005 and 2007 for mathematical reasons. Figure 3 [...]

Core Journals Identification
In this section, we examined 1729 different academic journals. According to Price's law, core journals are the journals that published more than $N$ articles, where $N = 0.749 \times \sqrt{69} \approx 7$ and 69 is the number of articles in the most productive journal. According to the statistical analysis, there are 154 core journals. Table 1 lists the top 10 academic journals in descending order of publications. The core academic journal with the most publications in big data research is PLoS One (69), followed by IEEE Access (63), and Big Data (52). There is a narrow gap of less than five publications among Cluster Computing: The Journal of Networks, Software Tools and Applications (45), Neurocomputing (45), Journal of Supercomputing (43), and Concurrency and Computation: Practice and Experience (41). IEEE Network, Information Sciences, and International Journal of Distributed Sensor Networks have the same number of publications (31). In [...]

Journals Co-Citation Network
Journals co-citation analyses are usually employed to discover the journals that form the intellectual base of a knowledge domain. Figure 4 [...]
More interestingly, the nodes with purple tree rings around the outer rim indicate that some highly cited journals have a high betweenness centrality (betweenness centrality ≥ 0.23), such as Nature (0.54), Proceedings of the National Academy of Sciences of the United States of America (0.56), Nucleic Acids Research (0.31), and PLoS One (0.23). These pivotal journals make connections to others in the journal co-citation network (see Figure 4). Some big nodes with thinner purple rings indicate that a high co-citation frequency does not necessarily imply a high betweenness centrality. For example, Lecture Notes in Computer Science has a high co-citation frequency (1436) but a low betweenness centrality (0.04). Moreover, the journals in multidisciplinary sciences and computer science received more citations, which means that knowledge from these fields is a major intellectual resource for big data scholars. In addition, a journal with a significant co-citation burst is visualized as a node with red inner tree rings, and the size of the red inner rings represents the strength of the burst. As shown in Figure 4, Big Data Revolution is a journal with red inner tree rings, suggesting that its citations increased rapidly between 2014 and 2016.
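To make the notion of betweenness centrality concrete, the following minimal sketch computes it for a small unweighted, undirected toy graph using Brandes' algorithm. The graph, the node labels, and the normalization by (n - 1)(n - 2) are illustrative assumptions; CiteSpace may weight links or normalize scores differently.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Normalized betweenness centrality of an unweighted, undirected graph (Brandes)."""
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {v: [] for v in nodes}
        path_count = {v: 0 for v in nodes}
        distance = {v: -1 for v in nodes}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to the source.
        dependency = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    n = len(nodes)
    # Every unordered pair is visited from both endpoints, so dividing by
    # (n - 1)(n - 2) yields the share of node pairs that a node lies between.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: s * scale for v, s in score.items()}

# Toy graph: two triangles (A, B, C) and (D, E, F) joined by the single link C-D.
toy_graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
centrality = betweenness_centrality(toy_graph)
for node in sorted(centrality):
    print(node, round(centrality[node], 2))
```

In this toy graph only the two bridging nodes C and D receive a non-zero score, which mirrors the observation that pivotal nodes are those that connect otherwise separate groups, independent of how often they are cited.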

Core Institutions Identification
It is important to study the distribution of institutions in a research field. The number of publications is commonly an important index for measuring the academic level, scientific research ability, and status of authors and their institutions in a specific field. Core institutions are important leaders in a research field. However, the names of academic institutions might change over time. Therefore, to avoid inconsistent signatures, we first need to standardize the names of academic institutions. In this section, we retained the top-level names and constructed uniform names for academic institutions. Eventually, we obtained 4137 different institutions.
According to Price's law, core institutions are the institutions that published more than $N$ articles, where $N = 0.749 \times \sqrt{153} \approx 10$ and 153 is the number of articles of the most productive institution. According to the statistical analysis, there are 265 core institutions in the development history of big data research from 2002 to 2016. Table 3 [...]

Institutions Collaboration Network
To enhance overall research strength in a scientific field, scientific research collaboration is usually an important means, which allows researchers to leverage their own academic strengths and share information [48]. Moreover, the level of scientific research collaboration is one of the important indexes for evaluating the academic level, scientific research ability, and status of institutions in a specific field. To discuss the scientific research collaboration in the big data research field, we constructed a scientific research collaboration network (shown in Figure 5).
This scientific research collaboration network consists of 142 nodes and 342 links. Each node represents an institution, and is depicted with a series of tree rings across multiple time slices. The size of each node is proportional to the total number of publications of each institution [8]. Each link between two nodes represents a scientific research collaboration relationship, and the thickness of a link represents the scientific research collaboration strength [49]. As shown in Figure 5, there is wide scientific research collaboration among different institutions. For example, Chinese Academy of Sciences is a red tree ring node, which has the most publications (153) and cross-connects with University of Sydney, Harbin Institute of Technology, University of Science and Technology China, Peking University, Beijing Normal University, and Otto Von Guericke University. The gold-colored link with University of Sydney indicates that the first scientific research collaboration year is between 2014 and 2015. However, the nodes with more publications do not necessarily have higher betweenness centrality scores. As listed in Table 3, compared with Stanford University (0.29), Chinese Academy of Sciences has a lower betweenness centrality score (0.15). This means that Chinese Academy of Sciences plays a weaker intellectual pivotal role in the institutions collaboration network. Furthermore, University of South Carolina, with the highest betweenness centrality score (0.63), has a very low co-occurrence frequency. These results reveal that the current research relationship is rather weak and diffuse. In addition, three thicker lines, which are linked with Otto Von Guericke University (link strength: 0. [...]

Core Authors Identification
It is interesting to study the distribution of core authors in the big data research field. The number of publications is usually an important index for evaluating the academic level, advancement, and position of an author in a specific research field. In addition, core authors are particularly important leaders in a research field. However, the author names downloaded from the WoS may be full or abbreviated names. The same abbreviated name might stand for different full names. For example, Y ZHANG may represent Yin ZHANG, Yi ZHANG, or Yong ZHANG. Similarly, Y WANG may represent Yige WANG, Yi WANG, or Yuhang WANG, etc. Moreover, the same full name may refer to different authors. For example, Yin ZHANG can be the Yin ZHANG who comes from the School of Computer Science or Information Technology at Huazhong University of Science and Technology (HUST), or the Yin ZHANG who comes from the School of Economics and Law at Zhongnan University. They are different persons. To avoid inconsistent signatures, we therefore need to examine carefully the unique full names and affiliated institutions of the authors, count their articles, and rank the authors in descending order of articles. Eventually, we obtained 16,404 different authors who published 4927 articles from 2002 to 2016. This indicates that the average number of collaborators per article is between three and four in the big data research field. This result coincides with the publications distribution (see the "Publications Distribution" section).
According to Price's law, core authors are the authors who published more than $M$ articles, where $M = 0.749 \times \sqrt{18} \approx 3$ and 18 is the number of articles of the most prolific author. According to the statistical analysis, there are 229 core authors. Table 4 lists the top 10 most prolific authors by number of articles from 2002 to 2016. Among them, Ranjan, Rajiv ranks first with 18 articles, and Zomaya, Albert Y ranks second with 17 articles. If we do not exclude the collaborative articles, the top 10 authors published 138 articles, which account for 2.8% of all articles published from 2002 to 2016. This means that approximately 0.06% of all authors (10 of 16,404) published 2.8% of all articles between 2002 and 2016. This conforms to what is known as a "Matthew effect" in the core authors distribution. However, the number of core authors is only 229, which accounts for 1.4% of all authors. This means that 98.6% of authors are not core authors. This result shows that research strengths are still comparatively weak and fragmented. Moreover, from the geographical perspective, the core authors from Australia account for 50% of the top 10 core authors, which means that Australia currently has a stronger research strength in the big data field compared with other countries.
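The Price's law thresholds and the shares quoted in this section can be recomputed from the counts given in the text. The sketch below is only an arithmetic check; the function name `price_threshold` is a hypothetical helper, and the rule used in the study to round the raw thresholds (about 6.2, 9.3 and 3.2) to the reported 7, 10 and 3 is not stated in the text.

```python
import math

def price_threshold(max_output):
    """Raw Price's law threshold: 0.749 times the square root of the top producer's output."""
    return 0.749 * math.sqrt(max_output)

# Output of the most productive journal, institution and author, as reported in the text.
top_outputs = {"journals": 69, "institutions": 153, "authors": 18}
thresholds = {name: price_threshold(n_max) for name, n_max in top_outputs.items()}
for name, value in thresholds.items():
    print(f"{name}: raw threshold = {value:.2f}")

# Totals reported in the text.
total_articles = 4927
total_authors = 16404
core_authors = 229
top10_articles = 138

share_top10_articles = top10_articles / total_articles  # share of articles by the top 10 authors
share_top10_authors = 10 / total_authors                # share of authors in the top 10
share_core_authors = core_authors / total_authors       # share of authors who are core authors
authors_per_article = total_authors / total_articles    # distinct authors per article

print(f"top 10 authors: {share_top10_authors:.3%} of authors, {share_top10_articles:.1%} of articles")
print(f"core authors: {share_core_authors:.1%} of authors")
print(f"distinct authors per article: {authors_per_article:.2f}")
```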

Core Authors Collaboration Network
To understand the current research collaboration of core authors in more depth, we also conducted a social network analysis based on UCINET (shown in Figure 6). Because the original network has a low density (0.0240), we deleted some isolates and pendants (nodes with degree one) to make the network easier to read. Eventually, the core authors collaboration network consists of 44 nodes in four small networks. The two bigger networks have 25 nodes and 12 nodes, respectively. This means that most core author nodes tend to be isolates or pendants. In general, the overall core authors collaboration network is relatively decentralized. This result reveals that the research collaboration among core authors is not close enough in the big data field.
As shown in Figure 6, the size of each node represents its betweenness centrality score. According to the betweenness centrality measure, Rajiv Ranjan is the central node with the highest betweenness centrality, as it forms the densest bridges with other nodes. In addition, Laurence T. Yang, Albert Y. Zomaya, and Kim-Kwang Raymond Choo also have a high betweenness centrality. More interestingly, nine authors (Rajiv Ranjan, Albert Y. Zomaya, Lizhe Wang, Xuyun Zhang, Jinjun Chen, Laurence T. Yang, Chang Liu, Keqin Li, and Samee U. Khan) listed in Table 4 have close bonds with each other in the biggest network.
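The density measure and the pruning step described above can be sketched as follows. This is a minimal illustration on a toy graph with hypothetical author labels, not the UCINET procedure itself; in particular, it assumes that isolates and pendants are removed in a single pass based on the original degrees, whereas the text only states that some of them were deleted.

```python
def build_graph(nodes, edges):
    """Undirected graph stored as a dict that maps each node to its set of neighbours."""
    adjacency = {v: set() for v in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def density(adjacency):
    """Share of possible links that are present: 2E / (N * (N - 1))."""
    n = len(adjacency)
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) / 2
    return 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0

def drop_isolates_and_pendants(adjacency):
    """Keep only nodes whose original degree is at least two (single pass)."""
    keep = {v for v, neighbours in adjacency.items() if len(neighbours) >= 2}
    return {v: adjacency[v] & keep for v in keep}

def connected_components(adjacency):
    """Sets of nodes that are mutually reachable (the separate small networks)."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components

# Toy collaboration graph: two triangles, one pendant (a4) and one isolate (a5).
authors = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
links = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a1", "a4"),
         ("a6", "a7"), ("a7", "a8"), ("a6", "a8")]
graph = build_graph(authors, links)
pruned = drop_isolates_and_pendants(graph)
components = connected_components(pruned)

print("density:", density(graph))
print("nodes kept:", sorted(pruned))
print("component sizes:", sorted(len(c) for c in components))
```

Removing pendants can create new pendants, so an iterative variant would repeat the pruning until no node of degree below two remains; which variant fits the analysis above is not specified in the text.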

Authors Co-Citation Network
Unlike the core authors collaboration analysis, authors co-citation analysis focuses on the co-cited authors who published the co-cited articles. The authors co-citation relationship is critical for understanding the academic communication and knowledge base diffusion in a specific research field [11]. The more often two authors are co-cited, the closer their intellectual relationship is. Figure 7 shows the overall landscape view of the authors co-citation network in the big data research field. The top 50 most cited authors in each slice are used to construct the authors co-citation network based on 137,929 valid distinct references. This network consists of 262 nodes and 593 links. Moreover, this network has a very high modularity (0.9102), which suggests that the specialties in science mapping are clearly defined in terms of co-citation clusters. The mean silhouette score (0.4179) is relatively low, mainly because of the numerous small clusters [15]. Therefore, we just need to focus on the major clusters.
As shown in Figure 7, each node with a series of tree rings across multiple time slices represents an author. The size of each node is proportional to the author's total co-citation frequency. Each link between two nodes represents a co-citation relationship, and the thickness of a link shows the co-citation link strength [49]. For example, Dean J is the biggest tree rings node, which has the most co-citation articles (493) and cross-connects with White T, Wu XD, Wang Y, Zaharia M, Isard M, and Condie T. The green-colored link with Zaharia M indicates that the first co-citation year is 2012. In addition, three thicker lines, which are linked with Zaharia M (link strength: 0.57), Isard M (link strength: 0.53), and Condie T (link strength: 0.53) respectively, indicate some stronger co-citation relationships. Table 5 lists [...]
The node with purple tree rings around the outer rim indicates that this co-cited author has a high betweenness centrality, and this author tends to be a pivotal scholar whose work links different disciplines, research topics, or stages in the big data field. Table 6 lists all authors with high betweenness centrality (betweenness centrality ≥ 0.1). For example, Savage M (0.12) proposed "The Coming Crisis of Empirical Sociology" (2007) and "Contemporary Sociology and the Challenge of Descriptive Assemblage" (2009) to discuss the challenges of "social" transactional data and descriptive assemblage. Savage M is a milestone author who discusses how to develop sociology within the big data environment. Other authors with a high betweenness centrality include Manyika J (0.11), Thusoo A (0.11), Schadt EE (0.11), Barabasi AL (0.11), and Chaudhuri S (0.11). Thusoo A (2009; 2010) presented the well-known Hive, a petabyte-scale data warehouse using Hadoop. Schadt EE (2010) proposed computational solutions to large-scale data management and analysis. Barabasi AL (2010) discussed the emergence of scaling in random networks and how the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems. However, a high co-citation frequency and a high betweenness centrality do not necessarily coincide. Authors with a high betweenness centrality but a low co-citation frequency are visualized as small nodes with thicker purple tree rings, such as Savage M, Thusoo A, Schadt EE, Barabasi AL, and Chaudhuri S. Only a node with both a high co-citation frequency and a high betweenness centrality is a milestone author. For example, as listed in Table 5 [...]
In addition, a node with red inner rings in Figure 7 indicates a significant co-citation burst, which reveals that the co-citation frequency of the author increased rapidly within a given time period. The size of the red inner tree rings represents the strength of the burst. As shown in Figure 7, there are 25 nodes with red inner tree rings, which means that there are 25 authors with co-citation bursts in big data research from 2002 to 2016. These authors may have profound impacts on big data research, and their work deserves more attention because it may influence the sustainable development directions of big data research. Table 7 lists the top 25 cited authors with the strongest citation bursts. Among them, Ghemawat S, with the strongest citation burst (11.6177), demonstrated the Google file system, a scalable distributed file system for large distributed data-intensive applications, which guided big data storage research. Thusoo A, with the second strongest citation burst (9.8427), presented the well-known Hive based on Hadoop. In addition, Hey T, Armbrust M, Wang C, Cohen J, and Buyya R, etc. also made important contributions to the sustainable development of big data research from different perspectives.
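The counting behind an authors co-citation network can be sketched as follows: two authors are co-cited once for every citing article whose reference list contains both of them. The reference lists and the placeholder author labels below are invented for illustration, and the cosine normalization of link strength is a common choice that is assumed here; it is not confirmed to be the measure behind the link strengths reported above.

```python
import math
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count, per author, the citing articles and, per author pair, the co-citing articles."""
    author_counts = Counter()
    pair_counts = Counter()
    for cited_authors in reference_lists:
        unique_authors = sorted(set(cited_authors))  # one vote per citing article
        author_counts.update(unique_authors)
        for pair in combinations(unique_authors, 2):
            pair_counts[pair] += 1
    return author_counts, pair_counts

def cosine_link_strength(author_counts, pair_counts):
    """Normalize each co-citation count by the geometric mean of the two citation counts."""
    return {
        (a, b): count / math.sqrt(author_counts[a] * author_counts[b])
        for (a, b), count in pair_counts.items()
    }

# Hypothetical reference lists: the cited first authors of four citing articles.
reference_lists = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author A", "Author D"],
    ["Author C", "Author D"],
]
author_counts, pair_counts = cocitation_counts(reference_lists)
strength = cosine_link_strength(author_counts, pair_counts)

for pair, count in sorted(pair_counts.items()):
    print(pair, count, round(strength[pair], 2))
```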

Keywords Co-Word Network
Keywords usually convey the core content and principal research methods of each article. Keyword co-word analysis can be applied to identify research topics and monitor research frontiers of a knowledge domain [50]. To construct a reasonable keywords co-word network, SATI3.2 was used to extract the high-frequency keywords and form the keywords co-occurrence matrix. Moreover, keywords commonly need to be integrated and unified because of synonymy and polysemy. We removed some broad words (such as algorithm, model, design, analysis, research, etc.), and eventually obtained the top 80 keywords. Table 8 lists the top 80 high-frequency keywords.

To understand the relationships among these keywords, we constructed the keywords co-word network (shown in Figure 8). Each node represents a keyword, and the size of each node is proportional to the betweenness centrality of the keyword. It is not surprising that well-known topics, including data mining, cloud computing, machine learning, MapReduce, Hadoop, social media, and visualization, have higher co-occurrence frequencies and betweenness centrality scores. Besides these topics, we find that data science issues, including data privacy, data management, data protection, and data quality, are also gradually attracting researchers' attention. In addition, deep learning, algorithm, model, performance, and optimization are some interesting findings in big data research. These keywords reveal the popular research hotspots and will have a profound impact on the future sustainable development directions of big data research.

Discussion and Conclusions
In this study, we extracted the bibliometric data of 4927 effective journal articles listed in the WoS between 2002 and 2016, visualized the intellectual structure and hotspots of big data research from the bibliometric perspective, and presented the results in terms of publications distribution, journals distribution and co-citation network, institutions distribution and collaboration network, authors distribution, collaboration network and co-citation network, and keywords co-word network. The main findings of this study are as follows:

According to the publications distribution, we identified the annual growth trend of big data research output and authors, as well as the changes in the number of co-authors per article. Research output in the embryonic stage (2002-2009) was very small, but it grew exponentially from 2010 to 2016. In addition, the growth trend of the annual number of authors is similar to that of the annual publications. Moreover, we found that the average number of participants per article in the big data field was between three and four authors.
The core journal with the most publications was PLoS One, followed by IEEE Access and Big Data. However, the top five co-cited journals, which contributed to forming the sustainable intellectual base of big data, were Nature, Science, Lecture Notes in Computer Science, PLoS One, and Communications of the ACM. Among them, Nature had the highest betweenness centrality. Moreover, most of the top 10 co-cited journals belong to the categories of multidisciplinary sciences and computer science, which is closely related to the nature of big data science.
There was broad scientific research collaboration among institutions in big data research. The top three core institutions in terms of publications were the Chinese Academy of Sciences, Tsinghua University, and the University of California, Los Angeles. However, the institutions with the most publications had lower betweenness centrality scores, signifying that these institutions were still scattered and had not yet gained general recognition. Hence, the current research relationships among institutions were rather weak and diffuse in the big data research field. With the sustainable development and prosperity of big data research, these collaboration relationships are expected to strengthen and become increasingly stable.
According to the core authors identification, Australia currently had greater research strength in big data research than the USA and China. However, according to the authors co-citation analysis, the top 10 most co-cited authors mainly came from the USA and China. Moreover, some notable authors were also identified: those with the highest co-citation frequency, those with high betweenness centrality (pivotal scholars or intellectual turning points), and those with the strongest citation bursts. These authors have contributed to the sustainable development of big data from different perspectives and have had a profound impact on the big data field. More attention should be paid to their work.
Keywords co-word analysis detected the current research hotspots and emerging topics, including not only well-known research hotspots such as data mining, cloud computing, machine learning, MapReduce, Hadoop, social media, and visualization, but also emerging research topics such as data science (data privacy, data management, data protection, and data quality) and deep learning. Moreover, algorithm, model, performance, and optimization are also gradually attracting researchers' attention. In addition, keywords co-word analysis detected the current emerging and sustainable application areas of big data, such as social networks, smart cities, bioinformatics, crowdsourcing, ethics, genomics, GIS, healthcare, education, epidemiology, precision medicine, and energy.
As an emerging hot topic, big data has changed the lives of human beings and driven changes in thinking, decision making, and research paradigms. Moreover, big data itself constitutes an important strategic resource for understanding social trends, market changes, scientific and technological development, and national security. Many colleges and universities have established big data disciplines and courses. However, as a newly emerging cross-discipline, the sustainable development of big data still faces many complicated and difficult challenges, such as the heterogeneity and incompleteness of data, the efficiency of big data processing, big data security and privacy protection, and high energy consumption. On the one hand, these challenges indicate some sustainable development directions for future big data research. On the other hand, they are also unprecedented opportunities for the sustainable development of big data. With the continuing improvement of physical infrastructure and policy making at national and institutional levels, and further breakthroughs in information technologies (computer networks, distributed systems, cloud computing, data storage, machine learning, and so on), these issues will gradually be solved. A bright future for big data science is coming.

Figure 1. Annual publications distribution of big data research.

reveals a trend of collaboration among authors in the big data research field. In 2003, the average number of participants per article reached a maximum of five. However, the value fell to its lowest point in both 2004 and 2008, when every article had a single author. After 2008, this number continued to rise, reaching 3.98 in 2016. Moreover, there was only a slight fluctuation from 2012 to 2016, which indicates that the average number of participants per article in the big data field was between three and four authors. Research collaboration, to some extent, ensured the quality of the publications.

Figure 3. Average participants distribution per article.
In addition, according to the Journal Citation Reports in the WoS, IEEE Network has both the highest impact factor (IF, 7.230) and the highest immediacy index (1.638) among the top 10 most productive core academic journals of big data research. Moreover, the top 10 academic journals published 450 articles, which account for 9.1% of all articles published from 2002 to 2016. Put simply, 0.6% of the academic journals in the big data research field published 9.1% of all articles from 2002 to 2016. This conforms to what is known as a "Matthew effect" in the distribution of academic journals.
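The two percentages above follow directly from the totals reported in this study (4927 articles in 1729 journals, of which the top 10 journals published 450 articles). A short check, with a hypothetical helper name:

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Totals as reported in the study.
journal_share = share(10, 1729)    # top 10 of 1729 journals -> 0.6
article_share = share(450, 4927)   # 450 of 4927 articles   -> 9.1
```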


3.2.2. Journals Co-Citation Network
Journals co-citation analyses are usually employed to discover the journals that form the intellectual base of a knowledge domain. Figure 4 shows the highly cited journals co-citation network from 2002 to 2016. This network is constructed from the top 50 most cited references in each time slice, based on 337 iterations. It contains 195 journals and 489 links among them. Table 2 lists the top 10 most co-cited journals from 2002 to 2016. The journals with frequencies above 1000 are Nature (1899), Science (1844), Lecture Notes in Computer Science (1436), PLoS One (1210), Communications of the ACM (1197), and Proceedings of the National Academy of Sciences of the United States of America (1128). These six journals are the primary publishing outlets and the dominant citation sources for big data scholars, and contribute to the formation of the sustainable intellectual base of big data.
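As an illustration of how co-citation links and their strengths can be derived, the sketch below counts how often two journals appear together in the reference lists of citing articles and normalizes each count with the cosine measure. The normalization used for the link strengths reported in this study is not stated, so the cosine measure is an assumption, and the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cocitation_strengths(reference_lists):
    """Map each journal pair to its cosine-normalized co-citation strength.

    reference_lists holds, for every citing article, the journals it cites.
    Cosine normalization is an assumed choice, not taken from the study.
    """
    cited = Counter()      # number of citing articles per journal
    together = Counter()   # number of citing articles per journal pair
    for refs in reference_lists:
        journals = sorted(set(refs))  # count each journal once per article
        cited.update(journals)
        together.update(combinations(journals, 2))
    return {
        (a, b): n / sqrt(cited[a] * cited[b])
        for (a, b), n in together.items()
    }
```

A strength of 1 would mean the two journals are always cited together, while values near 0 indicate that they rarely share a citing article.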
3), University of Sydney (link strength: 0.23), and Harbin Institute of Technology (link strength: 0.21) respectively, indicate the stronger collaboration relationships. Moreover, two green lines, which are linked with Otto Von Guericke University and Harbin Institute of Technology, indicate that the first collaboration among them is in the 2012-2013 time slice.

3.4. Authors Distribution and Co-Citation Network
3.4.1. Core Authors Identification

Table 1. Top 10 most productive core academic journals.


Table 2. Frequency distribution and betweenness centrality of the most co-cited journals.

Table 3. Top 10 most prolific academic institutions.
Chang Liu | Engineering and Information Technology, University of Technology Sydney | Australia | 12
Keqin Li | Department of Computer Science, State University of New York, New Paltz | USA | 12
Francisco Herrera | Dept. of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada | Spain | 11
Samee U. Khan | Electrical and Computer Engineering, North Dakota State University | USA

Table 4
have close bonds with each other in the biggest network.
, Manyika J (0.11) at McKinsey global institute (MGI, San Francisco, CA, USA) firstly released the research report "Big Data: The Next Frontier for Innovation, Competition, and Productivity" in May 2011.This report is a milestone publication, which triggered the research enthusiasm of scholars worldwide.

Table 7. Top 25 Cited Authors with Strongest Citation Bursts.
