Emerging Scientific Field Detection Using Citation Networks and Topic Models—A Case Study of the Nanocarbon Field

In fields with high science linkage, such as the nanocarbon field, trends in academic papers are particularly important for identifying future technological trends. The use of the number of citations allows us to predict the qualitative trends on a paper-by-paper basis. At the same time, it is necessary to be able to comprehensively discuss both qualitative and quantitative aspects in the subject area. This study aimed to detect emerging areas in the nanocarbon field using network models and topic models. It was possible to not only construct a model that exceeded an 86.2% F1 measure but also to focus on an area that could not be detected by the prediction model. This was accomplished by focusing on paper units, such as the research on the chemical synthesis of zigzag single-walled carbon nanotubes. Thus, it is possible to obtain knowledge that contributes to diversified R&D strategies and innovation policies by considering the emergence of new fields from multiple perspectives.


Introduction
Academic research trends often help companies formulate research and development (R&D) strategies. Specialized knowledge is becoming more individualistic and, in highly fragmented fields, the results may change depending on the selection of participants [1]. In many academic fields, the number of publications is growing exponentially, and it is becoming increasingly difficult to obtain a comprehensive perspective of such fields [2]. As the amount of science and technology data increases, attempts are being made to contribute to innovation policies and R&D strategies by considering entire fields [3][4][5][6][7][8][9]. Alongside these efforts to obtain an overall perspective, studies have also investigated which kinds of research will attract attention in the future [10][11][12][13][14]. In fields such as the nanocarbon field, where science linkage (the closeness between technology and science) is high, the trends of academic papers are important for identifying future technological trends. There are many review papers on nanocarbons, including ones on exudation in carbon nanotube (CNT) polymer composites [15], chemical vapor deposition (CVD) of CNTs [16], and the application of CNTs as electrode materials in lithium-ion batteries [17]. Such reviews provide certain guidelines for overviewing work in this area from the past to the present. However, few studies have adequately discussed future projections.
Extensive research has been conducted on the prediction or identification of emerging science and technology fields. The prediction of emerging research areas has traditionally been studied in bibliometrics or library and information science. It is known to be useful to focus on the citation relationships of papers as a method for extracting the essence of a field. The number of citations is a useful indicator for evaluating the quality of research. The use of regression of the number of citations allows us to downsize the number of papers and to avoid selection bias [18][19][20]. Due to the current advancement of prediction algorithms, such as machine learning algorithms and the improvement of computing capabilities, it has become possible to extract patterns that contribute to predicting future trends.
This research-based forecasting of the future of science is needed when companies and governments discuss innovation policy, but a paper-by-paper perspective alone is insufficient information for actual decision-making. A semimacro perspective on what new areas are emerging is needed, not just a paper-by-paper view. This study focuses on the field of nanocarbons, not only in terms of papers but also in terms of emerging areas. A combination of micro and semimacro perspectives will enable us to understand the trends in the field in terms of both quality and quantity.

Literature Review
There has been discussion regarding the definition of an emerging paper based on indicators of emergence using similarity and entropy between papers [21]. Such indicators include an increase in the terms appearing in abstracts [22] and the cumulative number of scientific and technological documents in basic and applied research. Dong et al. [23] predicted the h-index of a book five years after publication. They defined the impact of a paper based on six factors (author, content, publisher, citation, coauthorship, and time series) and applied their approach to 200,000 computer science papers [23]. Chakraborty et al. [24] used data from 1.5 million computer science papers to categorize the time series of citation counts for several years after publication into six types. By combining the characteristics of the authors, academic societies, and keywords, Chakraborty et al. predicted the citation counts within five years. Wang et al. [25] focused on the power law describing the citation numbers of papers and formulated a citation number prediction method for future papers from the time series information of citation numbers five years after publication. Adams [26] showed that the number of citations 3-10 years after publication is correlated with the number of citations 1-2 years after publication in the fields of life science and physics. Li and Tong [27] formulated paper citation prediction as an optimization problem. The authors studied 50,000 computer science papers and estimated the number of citations 10 years after publication based on the information obtained 3 years after publication.
Reports that discuss emerging research using citation analysis can be broadly divided into those that involve cocitations [10,28], bibliographic coupling [29], and direct citations [30,31]. In many of these investigations, knowledge structures were abstracted as networks. Davletov et al. [32] estimated the number of citations 5 or 10 years after publication by using the time series information of citations several years after publication and the structure information of citation networks. This was accomplished by employing data from 27,000 arXiv energy physics papers [32], 150,000 computer science papers (Arnet Miner), and 200,000 papers (CiteSeerX). According to Davletov et al. [33], the time series of citations during the first two years after publication are important for prediction. Meanwhile, Mori et al. [34] focused on academic papers related to artificial intelligence to predict the emergence of papers from the perspective of increasing the citation count by using network, text, and cluster information, among other aspects. Sasaki et al. [35] attempted to extract emerging papers in the photovoltaic (PV) power generation field. Chen et al. [36] used cocitation networks and collaborative research networks in academic papers to focus on research across structural gaps in networks. Citation networks are connections of knowledge, at least based on the premise that knowledge is built on knowledge in academia.
Topic models are often used to predict trends in academic research. In recent years, Latent Dirichlet Allocation (LDA) has been used in numerous studies of scientific and technological bibliographic information [37][38][39]. For example, Jiang et al. [40] extracted common terms such as "fish," "species," "emission," "lake," "sediment," and "climate" using topic models from 1726 papers related to hydroelectric fields. Such terms facilitate understanding of the topic of interest, using all the papers published in a particular year as a parameter. However, a topic model evaluates a topic based on the many terms that have appeared up to that point. In other words, topic models can capture a large trend in a topic, but they do not guarantee the quality of the research. In addition, topic models do not contribute to the future prediction of a field, because they evaluate the emergence of many terms a posteriori.

Purpose and Contribution
As described above, it is possible to evaluate individual papers by using quality indexes such as the number of citations, but it is not sufficient to evaluate quantitatively the emerging fields of research. In addition, it is not sufficient to evaluate qualitatively and predict the future of research if we only evaluate topics that are emerging in large numbers at a given time, as in the case of the topic model. In forecasting the emergence of a field where both scientific and technological applications are mixed, such as nanocarbons, there are scientific and commercial perspectives. In such a complex field, there is a limit to the ability to predict and discuss the emerging technologies using only one forecasting method. However, not enough research exists to predict the emergence of the field from this perspective. In this study, we propose a prediction method that takes into account both the quality and quantity of emerging fields, by discussing them from both a micro perspective, which has not been sufficiently discussed in the past, and a semimacro perspective, which has focused on the cluster units.
The contribution of this study is to provide knowledge that will help companies and governments to predict the future from multiple perspectives when implementing innovation plans in highly uncertain fields, such as nanocarbons.

Overview
Figure 1 provides an overview of the method. As shown, the citation network was converted into an unweighted network with papers as nodes and citation relationships as links (Step 1 in Figure 1). Because core papers always constitute the largest component, direct citation is the most effective means of detecting research frontiers. Not all papers are closely related to the target field of nanocarbons. Papers that were not connected by citations to the largest component were considered digressional and were ignored in this study (Step 2 in Figure 1). The network was then divided into several clusters using the topological clustering method [41,42] (Step 3 in Figure 1). Topological clustering is a clustering method based on the graph structure of a network; here, we use modularity maximization. A cluster is a module in the citation network, that is, a densely connected group of papers obtained by partitioning the citation relations with the modularity (Q value) maximization method [42]. The modularity maximization method favors partitions in which links within clusters are dense and links between clusters are sparse. It searches for the partition that maximizes the modularity using a greedy algorithm. Q is an evaluation function of the degree of coupling within a cluster and between clusters and is given below.
$$Q = \sum_i \left( e_{ii} - a_i^2 \right) \tag{1}$$
Here, $e_{ii}$ is the ratio between the number of links connecting nodes that belong to the same cluster $i$ and the number of links in the entire network. Additionally, $a_i^2$ is the expected value of this ratio, that is, the fraction of links expected to fall within cluster $i$ relative to the total number of links.
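The modularity of a given partition can be computed directly from the definition of Q. The following is a minimal sketch, assuming an undirected, unweighted network given as an edge list; the function name `modularity`, the toy network, and the reading of $a_i$ as the fraction of link ends attached to cluster $i$ are illustrative assumptions, not taken from the study.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Compute Q = sum_i (e_ii - a_i^2) for an undirected, unweighted network.

    edges: list of (u, v) pairs, one entry per link.
    cluster_of: dict mapping each node to its cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)   # links with both ends inside cluster i
    link_ends = defaultdict(int)  # link ends attached to nodes of cluster i
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        link_ends[cu] += 1
        link_ends[cv] += 1
        if cu == cv:
            internal[cu] += 1
    q = 0.0
    for c in link_ends:
        e_ii = internal[c] / m           # observed fraction of links inside i
        a_i = link_ends[c] / (2 * m)     # fraction of link ends in i (assumed reading of a_i)
        q += e_ii - a_i ** 2
    return q


# Hypothetical toy network: two triangles joined by a single link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_two_clusters = modularity(edges, two_clusters)
q_one_cluster = modularity(edges, one_cluster)
```

In this toy case the two-triangle partition scores higher than the trivial single-cluster partition, which is the behavior a greedy maximization would exploit.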
Next, an emerging paper was defined, features were extracted from the cited networks, and the constructed machine learning model was evaluated (Step 4). Emerging papers were defined as papers that were ranked in the top 5% of the dataset each year and whose citations increased for three years after publication. Step 4 will be described in detail after the following task definition. Using the predicted emerging paper results, the topics of the clusters were further analyzed to address the emergence by area and paper (Step 5 in Figure 1).

Feature Extraction
The features of each paper were extracted from the bibliographic information and citation network of the obtained papers. These features served as the learning data for predicting emerging papers and were used as explanatory variables. The features in this study can be classified into four categories: network macro features; cluster features; network centrality features; reference paper features. Network macro features are typical features of the target citation network. These include: the number of papers (NW NODES); the total number of citations (NW EDGES); the maximum value of the modularity (Q value) of the clusters in the network (NW MAXQ). Cluster features describe the cluster to which the target paper belongs and include the maximum Q value of the cluster (CL QMAX), the number of nodes in the cluster (CL NODES), and the rank of the cluster (CL RANK). The network centrality features indicate how central a given paper is in the citation network. Specifically, these include: the degree centrality (CNT DEGRE) [43]; the betweenness centrality (CNT BETWE) [43]; the closeness centrality (CNT CLOSE) [43]; the eigenvector centrality (CNT EIGEN) [44]; the network constraint (CNT NETWO) [45]; the clustering coefficient (CNT CLUST) [46]; the PageRank (CNT PAGER) [47]; the hub score (CNT HUBSC) [48]; the authority score (CNT AUTHOR) [48]. Reference paper features were calculated from the papers that a given paper cites, using representative statistics (maximum, minimum, average, and total) as the feature values. For each paper, 15 features (3 network, 3 cluster, and 9 centrality features) can be calculated directly. In addition, the three cluster features and nine centrality features were calculated for each reference in the paper, and the maximum, minimum, average, and total values were used as features. Thus, the number of reference paper features was 48 (= 12 × 4).
The total number of features was 63 (= 15 + 48) for each paper. Table 1 summarizes all 63 features in the four abovementioned categories. These features were calculated on the largest connected component of the citation network in the target field and were used as explanatory variables in the prediction model.

Table 1 (excerpt).
CNT_AUTHOR: Authority score.
Property of reference: The sum of the features of paper sets that a paper cites.
CITING_MAX-[feature]: Maximum of the feature in question over the set of papers that a paper cites.
CITING_MIN-[feature]: Minimum of the feature in question over the set of papers that a paper cites.
CITING_AVG-[feature]: Average of the feature in question over the set of papers that a paper cites.
CITING_SUM-[feature]: Sum of the feature in question over the set of papers that a paper cites.
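The way the 63 features are assembled can be sketched as follows. This is an illustrative sketch only: the function `build_feature_vector`, the underscore spelling of the feature names (following the CNT_AUTHOR and CITING_ entries of Table 1), and the choice to assign 0.0 when a paper has no references are assumptions, since the text does not state how such papers are handled.

```python
NETWORK = ["NW_NODES", "NW_EDGES", "NW_MAXQ"]
CLUSTER = ["CL_QMAX", "CL_NODES", "CL_RANK"]
CENTRALITY = ["CNT_DEGRE", "CNT_BETWE", "CNT_CLOSE", "CNT_EIGEN", "CNT_NETWO",
              "CNT_CLUST", "CNT_PAGER", "CNT_HUBSC", "CNT_AUTHOR"]

# The four representative statistics taken over the cited papers.
STATS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda values: sum(values) / len(values),
    "SUM": sum,
}


def build_feature_vector(own, cited):
    """Combine 15 direct features with 12 x 4 = 48 reference paper features.

    own: dict with the 15 direct features of the target paper.
    cited: list of dicts, one per cited paper, holding that paper's
           cluster and centrality features.
    """
    vector = {name: own[name] for name in NETWORK + CLUSTER + CENTRALITY}
    for name in CLUSTER + CENTRALITY:
        values = [ref[name] for ref in cited]
        for stat, aggregate in STATS.items():
            # Assumption: a paper without references gets 0.0 for these features.
            vector[f"CITING_{stat}-{name}"] = aggregate(values) if values else 0.0
    return vector


# Hypothetical values, chosen only to make the aggregation easy to check.
own = {name: 1.0 for name in NETWORK + CLUSTER + CENTRALITY}
cited = [
    {name: 1.0 for name in CLUSTER + CENTRALITY},
    {name: 3.0 for name in CLUSTER + CENTRALITY},
]
vector = build_feature_vector(own, cited)
empty_vector = build_feature_vector(own, [])
```

The resulting dictionary has 15 + 48 = 63 entries, matching the feature count given in the text.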

Task Definition
In this study, all the features were calculated for the papers included in the largest connected component of the citation network and were treated as explanatory variables. The explained variable was whether the paper was an emerging paper. Emerging papers were defined as papers that were ranked in the top 5% of the dataset each year and whose citations increased for three years after publication. Specifically, papers in the top 5% in terms of the increase in the number of citations were flagged as positive examples, and papers in the bottom 50% were treated as negative examples. In other words, the emergence prediction problem in this study was considered to be a two-class classification problem, involving the identification of whether a paper satisfied the requirements for emergence within three years of publication. Logistic regression, which is a linear classifier, was adopted as the classifier, and LIBLINEAR was used for implementation. From the negative examples, the same number of papers as in the positive examples was randomly sampled eight times, so that eight datasets were constructed for each year. In addition, five-fold cross validation was performed on each model to avoid overfitting. The prediction model was trained on data indicating whether a paper had become an emerging paper three years after publication and was then applied to a group of papers published four years after the publication of the learning data as the prediction target. That is, when $t_1$ (= $t_0 + 4$) was set as the publication year of the prediction target papers, the learning data of $t_0 + 3$, which was the learning window, were applied to the papers from $t_1$. The predictions could then be evaluated three years after the publication of the target papers ($t_1 + 3$). This period was defined as the evaluation window. A schematic of the relationship between the learning and evaluation windows is provided in Figure 2.
Table 2 shows the correspondence between the learning and evaluation years for each model.
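The labeling, balanced sampling, and window arithmetic described in this section can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the tie handling (plain sorting), the random seed, and the dictionary keys in `windows` are all assumptions, and the classifier itself (logistic regression via LIBLINEAR) is left out.

```python
import random


def label_papers(citation_increase, top=0.05, bottom=0.50):
    """Split one publication year into positive and negative examples.

    citation_increase: dict mapping paper id -> increase in citations over
    the three years after publication.
    """
    ranked = sorted(citation_increase, key=citation_increase.get, reverse=True)
    n = len(ranked)
    positives = ranked[:max(1, int(n * top))]    # top 5%: emerging papers
    negatives = ranked[n - int(n * bottom):]     # bottom 50%: negative pool
    return positives, negatives


def balanced_datasets(positives, negatives, repeats=8, seed=0):
    """Draw as many negatives as positives, repeated to give eight datasets."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    return [(positives, rng.sample(negatives, len(positives)))
            for _ in range(repeats)]


def windows(t0):
    """One reading of the learning and evaluation windows for training year t0."""
    t1 = t0 + 4
    return {
        "learning_publication_year": t0,
        "labels_observed_year": t0 + 3,
        "target_publication_year": t1,
        "evaluation_year": t1 + 3,
    }


# Hypothetical year with 100 papers whose citation increase equals their index.
citation_increase = {f"p{i}": i for i in range(100)}
positives, negatives = label_papers(citation_increase)
datasets = balanced_datasets(positives, negatives)
schedule = windows(2007)
```

Under this reading, a model trained on papers published in 2007 is applied to papers published in 2011 and evaluated in 2014.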

Evaluation
The F1 measure, which is defined as the harmonic mean of the precision and recall, was used to evaluate the analytical model. The precision is the ratio of the number of papers correctly predicted as emerging to the number of papers predicted as emerging. The recall is the ratio of the number of papers correctly predicted as emerging to the number of papers that actually emerged. The F1 measure is widely used to evaluate prediction models. The definitions of precision and recall, which are commonly used in machine learning classification models, are shown below.
The precision is the fraction of the data predicted to be positive that is actually positive:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
The recall is the fraction of the actually positive data that was predicted to be positive:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
The F1 measure is the harmonic mean of the precision and recall:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$
Here, $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
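As a small worked example of these three measures, the sketch below computes them from binary labels. The function name and the example labels are hypothetical and are not results from this study; returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall, and F1 from two equal-length 0/1 sequences."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # emerging, predicted emerging
    fp = sum(1 for a, p in pairs if not a and p)      # not emerging, predicted emerging
    fn = sum(1 for a, p in pairs if a and not p)      # emerging, missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 2 false negatives.
actual = [1, 1, 1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
```

Here the precision is 3/4 and the recall is 3/5, so the F1 measure is 2/3.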

Appl. Syst. Innov. 2020, 3

Topic Extraction from Each Cluster
The topics of the papers belonging to each cluster were estimated using Latent Dirichlet Allocation (LDA) [49]. LDA is a topic model, that is, a probabilistic language model for estimating the contents of a target document (or group of documents). Since the LDA model assumes that a document (group) consists of multiple topics, it is suited to analyzing each cluster in a citation network as a unit. For example, suppose a group of papers has silicon-based solar photovoltaics (PV), thin-film solar PV, and dye-sensitized solar PV as topics. The model then assigns a probability to each topic mixture: for instance, the mixture (silicon, thin-film, dye-sensitized) = (0.1, 0.3, 0.6) is generated with probability 0.3, and the mixture (silicon, thin-film, dye-sensitized) = (0.6, 0.2, 0.2) is generated with probability 0.6. The graphical model is shown in Figure 3. Here, α is a parameter that determines the topic selection probability. Additionally, β is a parameter that determines the term generation probability for each topic. These parameters are estimated from N terms in each document and a set of M documents [49]. LDAvis is used to visualize LDA [50]. The saliency of term w over the topics t is defined by Equation (5). In addition, the number of topics included in each cluster is estimated [51].

$$\mathrm{saliency}(w) = \mathrm{frequency}(w) \cdot \sum_t p(t \mid w) \log \frac{p(t \mid w)}{p(t)} \tag{5}$$
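Equation (5) can be evaluated from a table of term counts per topic. The sketch below is illustrative only: it assumes raw term-topic counts as input (LDAvis works with model-estimated frequencies), and the function name and the toy counts are hypothetical.

```python
import math


def saliency(term_topic_counts):
    """Compute saliency(w) = frequency(w) * sum_t p(t|w) * log(p(t|w) / p(t)).

    term_topic_counts: dict mapping term -> dict mapping topic -> count.
    """
    topic_totals = {}
    for counts in term_topic_counts.values():
        for topic, count in counts.items():
            topic_totals[topic] = topic_totals.get(topic, 0) + count
    grand_total = sum(topic_totals.values())

    result = {}
    for term, counts in term_topic_counts.items():
        frequency = sum(counts.values())
        distinctiveness = 0.0
        for topic, count in counts.items():
            if count == 0:
                continue  # a zero-probability topic contributes nothing to the sum
            p_t_given_w = count / frequency
            p_t = topic_totals[topic] / grand_total
            distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        result[term] = frequency * distinctiveness
    return result


# Hypothetical counts for two topics of equal size.
counts = {
    "zigzag": {0: 8, 1: 0},   # concentrated in topic 0
    "carbon": {0: 4, 1: 4},   # spread evenly over both topics
    "device": {0: 0, 1: 8},   # concentrated in topic 1
}
scores = saliency(counts)
```

A term spread evenly over the topics ("carbon" in this toy input) receives a saliency of zero, while a term concentrated in one topic scores high, which is why salient terms are useful for characterizing a subcluster.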

Dataset
In this study, the analysis focuses on the field of nanocarbons. A nanocarbon material is a graphite-derived carbon material such as carbon nanotubes (CNTs), graphene, or fullerenes. Nanocarbon materials are employed in various devices, such as semiconductors, fuel cells, optical devices, and structural materials. This can be attributed to their excellent mechanical, electrical, and thermal properties. For example, the potential use of nanocarbon materials in energy fields [52][53][54][55][56] and space elevators has been discussed [57][58][59].
The Science Citation Index (SCI) and Social Science Citation Index (SSCI) databases indexed by Web of Science were used to extract papers with "((carbon and (nano* OR micro*)) or fullerene or Buckminsterfullerene or Buckminster-fullerene or C60 or C-60 or graphene or (filament* and carbon))" in the titles or keyword lists of papers published between 1 January 1970 and 31 November 2015. As a result, 411,084 papers satisfying these criteria were extracted.

Result of the Network Model
After constructing a citation network based on direct citations, 379,044 papers belonged to the largest connected component. The features listed in Table 1 were calculated for all papers belonging to the largest connected component. The negative cases were randomly selected from the papers published in the same year whose citation number increase was within the bottom 50%. Random sampling was conducted eight times. In other words, eight models were constructed in each experiment, and the averages of their results were evaluated. The evaluation results are listed in Table 3. Table 4 lists the features with high predictive contributions for the model constructed for each year. Table 5 lists the numbers of citations in 2014 for the top 10 predicted papers published in 2011, that is, three years after publication. Of the 10 predicted papers, nine satisfied the conditions of emerging papers in 2014. In other words, 90% of the 10 papers listed in Table 5 were in the top 5% in terms of citation increases in 2014. The papers are sorted by the calculated probability that the predicted paper will be an emerging paper.
Figure 5 shows the top 1000 papers by emerging score, aggregated by cluster down to the third layer of clustering (subclusters 1-1-1 through 3-3-3). Note that the first and second layers targeted only the top three clusters, while the third layer targeted all the clusters. This figure shows that the papers with the highest degrees of emergence are concentrated, at the cluster level, in subcluster 1-3-3.
This report focuses on subcluster 1-3-3, which has a small number of papers. The results of the topic model by LDA analysis for the emerging papers in subcluster 1-3-3 are shown in Figure 6, which provides an overview of the topic classification in subcluster 1-3-3. The image on the right shows the frequency distribution of the top 20 prominent terms extracted from the abstracts of papers belonging to subcluster 1-3-3. The chart on the left is a visualization of the principal component analysis obtained by classifying subcluster 1-3-3 into eight topics. In this figure, light blue represents the frequencies of terms corresponding to the highlighted topic, and red represents the estimated frequencies of those corresponding to unhighlighted topics. Terms with high estimated frequencies, such as "electronic," "armchair," "band," "zigzag," and "gap," are conspicuous in this figure. These considerations are addressed in greater detail in the discussion section.

Discussion
In this study, the model was applied to the nanocarbon field to predict whether a paper would become an emerging paper three years after publication. Nine out of the top ten predicted papers published in 2011 were confirmed to be emerging papers by definition. The F1 measure remained stable at around 0.8 across the years, and the model is considered to balance precision and recall. Table 4 indicates that the feature having the highest contribution in each case is the PageRank (CNT PAGER). PageRank was proposed as a method for evaluating the importance of web pages based on their link relationships, although it was used to evaluate scientific papers in this study. This feature can be interpreted as an index that increases when a paper with several citations is cited. Simultaneously, this indicator decreases the relative importance of papers with citations between local communities, such as cross-references. In this study, from the perspective of calculation cost, features such as the centralities were calculated with the citation network treated as an undirected network. Strictly speaking, the number of links of a paper is therefore the sum of the number of citing papers and cited papers. However, the number of citing papers in the year of publication is extremely small, and most links can be considered to correspond to cited papers. The next most important feature is the degree centrality (CNT DEGRE). This means that the more papers an article cites, the more likely it is to be ranked at the top of the list. The degree centrality is a characteristic feature. Because these two features are high, the papers that are expected to earn citations in the future (i.e., the emerging papers mentioned in this report) are those that have been appropriately researched in the subject field.
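To make the PageRank feature concrete, the following minimal sketch runs power iteration on an undirected network, in line with the undirected treatment described above. The damping factor of 0.85, the iteration count, the function name, and the toy network are assumptions; the text does not state the settings that were actually used.

```python
def pagerank(neighbors, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected network.

    neighbors: dict mapping each node to the list of nodes it is linked to.
    Every node is assumed to have at least one link, as is the case inside
    a connected component with more than one node.
    """
    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Each node passes its score in equal shares to its neighbors.
            share = damping * rank[node] / len(neighbors[node])
            for other in neighbors[node]:
                updated[other] += share
        rank = updated
    return rank


# Hypothetical network: paper "c" is linked to three otherwise isolated papers.
neighbors = {
    "c": ["a", "b", "d"],
    "a": ["c"],
    "b": ["c"],
    "d": ["c"],
}
ranks = pagerank(neighbors)
```

In this toy network the hub paper "c" receives the largest score, which illustrates why a paper with many links in the undirected citation network ranks highly on this feature.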
The top 10 papers predicted to be emerging are discussed here. Zhang et al. [60] focused on the mass production of CNTs. Initially, the arc discharge and laser evaporation methods were used for this purpose. The arc discharge method can produce high-quality CNTs with few defects; however, it cannot produce them in large quantities. Although laser evaporation can produce CNTs with relatively high purity, it is also considered unsuitable as an industrial manufacturing technique [64]. Against this background, CVD, which is said to be suitable for mass synthesis, has attracted attention. Based on proposals such as "Carbon multiwall nanotubes" by Professor Endo of Shinshu University and "CoMoCAT Process at SWeNT" by the University of Oklahoma, several manufacturing technologies have been pioneered and are already in practical use. Zhang et al. [60] comprehensively introduced and discussed research on not only CVD but also CNT mass production. The paper received 84 citations in three years, which demonstrates that it is drawing attention.
The article [61] ranked second in Table 5 is a review of the current situation and physical properties of oriented CNTs and their application areas. Among the CNT production technologies, CVD is also expected to provide a high degree of orientation. Diverse application areas, such as light emission, optical antennas, subwavelength light transmission, and PV power generation with nanocoaxial structures, are expected for such aligned CNTs [61]. Although this paper did not satisfy the conditions for an emerging paper as defined in this study, it still obtained a certain number of citations.
The paper ranked third focuses on mass production techniques for the three-dimensional assembly of CNTs and graphene [62]. For studies related to three-dimensional networks of CNTs and graphene, see Dasgupta et al. [70]. Research on porous films with three-dimensional structures is still in its initial stage, and materials that contribute to practical application are still needed [70]. Graphene is a single atomic plane of graphite crystal. In 2004, Novoselov et al. succeeded in extracting a thin piece of graphene by peeling off the surface of highly oriented pyrolytic graphite with adhesive tape and then repeatedly peeling the peeled surface. Since this report was published, the electrical, electronic, mechanical, and scientific properties of graphene have become clear [71].
In particular, the high electron mobility in graphene has been clarified, where electron mobility is a measure of how quickly electrons move through a solid. The paper ranked fourth is a review article that focuses on the high electron mobility of graphene and discusses its electrical properties and applications [63]. A theoretical value of 2,000,000 cm^2/Vs was predicted [64], and an experimental value of 200,000 cm^2/Vs was obtained [72]. Considering that the electron mobility in silicon is 1000 cm^2/Vs, the electron mobility of graphene is more than 100 times that of silicon. High electron mobility is an important factor in achieving high-speed transistors, for example. The paper ranked fourth was confirmed to have received 664 citations in 2014.
The paper ranked fifth is a comprehensive discussion of the physical properties of graphene. Graphene has a high electron mobility, high thermal stability, and excellent strength. In addition, this paper comprehensively describes the graphene-based applications in field-effect transistors, memory, solar devices, and sensing platforms. This paper had 587 citations in 2014.
The sixth article in Table 5 focuses on methods for the structural analysis of nanomaterials. Raman spectroscopy is one of the most effective methods for this purpose. In particular, the Raman spectra of carbon materials show G-band peaks derived from graphite structures and D-band peaks derived from defects. The ratios of these peaks can be used to evaluate the crystalline purity and defect concentration of nanocarbon materials. The article is a review that focuses on Raman spectroscopy of CNTs and graphene while summarizing related studies.
The paper ranked seventh in Table 5 is a review that comprehensively summarizes the prior literature on the reaction principle of the catalytic decomposition of methane, the shape of the resulting nanocarbon material, and the formation principle. Hydrogen and carbon can be produced using steam reforming of methane and a catalyst in a high-temperature section. The hydrogen produced can be used as fuel for fuel cells, and the process has attracted attention mainly as a means of producing hydrogen. At the same time, because the generated carbon can also be used in direct carbon fuel cells, the process is attracting attention from the perspective of nanocarbon material production.
CNTs are said to be toxic to humans because of their structural similarity to asbestos. Hence, toxicity reduction in other nanocarbons is a popular research topic for the use of nanocarbon materials. This paper [67], ranked eighth in Table 5, is an attempt to provide systematic knowledge in this field, called nanotoxicology. The authors also identified specific challenges for achieving low toxicity. The paper discusses techniques that lead to the biological and toxicological transformation of carbon nanomaterials through chemical changes.
The paper by Singh et al. [68] is an exhaustive summary of the history of graphene, its properties, its means of production, and its impact on applications in various fields. These include electrical devices, optronics devices, scientific sensor nanocomposites, and energy storage. As of 2014, the paper had 506 citations.
The paper ranked tenth is entitled "Carbonaceous nanomaterials for enhancement of TiO2 photocatalysis" [69]. Titanium oxide (TiO2) is generally used as a photocatalyst material. However, problems have been noted regarding its efficiency and narrow response range. Its properties can be changed considerably by combining TiO2 with nanocarbon materials. As a paper on photocatalysis using nanocarbon-TiO2, it presents guidelines on generation methods, features, and future directions. As of 2014, the paper had 232 citations.
Figure 6 shows that terms such as "electronic," "band," "zigzag," "gap," and "armchair" stand out among the terms with high estimated frequencies. It is known that the conductivity of a single-walled CNT (SWCNT) varies with the degree of the helix (i.e., chirality). For example, a zigzag-type structure has the characteristics of being one-third metal and two-thirds semiconductor. Meanwhile, a chiral-type structure has the characteristics of a semiconductor, and an armchair-type structure has the characteristics of a metal. In 2010, chemical synthesis of these structures was still problematic; however, in October 2011, chiral and armchair forms were successfully synthesized. The remaining zigzag form was also presented by Hitosugi et al. [73] in the Journal of the American Chemical Society, in a paper entitled "Bottom-up synthesis and thread-in-bead structures of finite (n,0)-zigzag SWCNTs." Thus, around 2011, reports of chemical syntheses of chiral, armchair, and zigzag SWCNTs were increasing. The terms "armchair," "zigzag," "gap," and "band" in Figure 6 exhibit the expected tendencies. Hitosugi et al. [74] also published a paper in 2011 related to Hitosugi et al. [73]. In fact, the number of citations reached 44 after three years, and the paper satisfies the definition of an emerging paper in this model. However, its emerging score was ranked only 11,932nd. This means that the article could not be identified by the paper-level emerging prediction model alone. Accordingly, it can be considered effective to a certain extent to identify research fields that will become popular in the future by examining emerging research at the granularity of not only papers but also terms.
The validity of the proposed method was tested in the field of nanocarbons in this study. We found that papers falling into the emerging research areas obtained by a combination of network analysis and topic models were not necessarily at the top of the predictive rankings obtained by network analysis alone. In other words, the paper-by-paper method for predicting emerging research was inadequate on its own to capture the quantitative trends of the field. In this study, the dataset on nanocarbons was extracted from Web of Science (WoS) as a case study, but the method can be applied as is to datasets from other fields. An important aspect of this is the need to confirm the accuracy of the predictions of emerging papers obtained from the citation network analysis. In particular, it is important to ensure that the accuracy is stable even if the time window is changed. As a preliminary experiment, we confirmed that the F1 measure exceeded 70% for several different fields, although the accuracy varied. This indicates that the validity of our method is not specific to the nanocarbon field. Applying the method to various fields to identify issues of interest in terms of both quality and quantity, at the level of both papers and topics, may help companies and countries that are sensitive to science and technology trends make decisions. For example, this method may be useful for companies considering the future direction of their areas of strength. By analyzing several related areas (e.g., any subdiscipline related to the material area), a country can obtain papers and topics that will contribute to the development of national innovation policies.
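For reference, the F1 measure mentioned above is the harmonic mean of precision and recall. The following is a minimal sketch of the calculation for one set of predictions; the paper identifiers and the function name are hypothetical and are not taken from the dataset used in this study.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for predicted vs. actual emerging papers."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: five papers predicted as emerging, six actually emerging.
predicted_emerging = {"p1", "p2", "p3", "p4", "p5"}
actual_emerging = {"p1", "p2", "p3", "p4", "p6", "p7"}

precision, recall, f1 = precision_recall_f1(predicted_emerging, actual_emerging)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Repeating this calculation for each time window or field is one way to check whether the accuracy remains stable, as discussed above.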

Conclusions
This study applied a model that predicts promising papers based on the vast amount of information on 411,084 scientific papers in the nanocarbon field. The purpose of this research was to predict the increase in the number of citations of a paper three years after its publication, based only on information available less than one year after its publication, in order to identify emerging areas earlier.
Unlike existing research, this investigation used various features, network indicators, and clustering results to predict the increase in the number of citations of a paper several years in advance, based on features available immediately after its publication. The features used in the prediction model mainly fall into four categories (network, cluster, centrality, and citation relationship features), and all of them can be constructed by observing a network. This investigation attempted to identify emerging research areas from not only the micro perspective (i.e., papers) but also the semimacro perspective (i.e., research fields). This was achieved by first identifying the emerging papers using the aforementioned network indices and then applying the topic model to the terms used in the papers in clusters with a high percentage of emerging papers. The predictive model of emerging papers itself achieved a certain level of accuracy in both the nanocarbon field and the PV power generation field, and a highly useful model was developed. The feature with the highest contribution was PageRank. This means that the number of citations of a paper is likely to increase if it is cited in a paper that has a large number of citations. In addition, the contribution of closeness centrality means that emerging papers are close to many other papers; hence, they are the focal papers in the field. These findings suggest that emerging papers are those that are thoroughly grounded in the research of the field and address issues that are valued by the community. The capabilities of the authors may also be among the indices worth quantifying. By examining the characteristic terms of subclusters with high expected proportions of emerging papers, it was possible to focus on research on the chemical synthesis of zigzag SWCNTs in the nanocarbon field. The emerging fields were thus successfully examined, not in units of papers, but rather as research areas.
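The step of moving from papers to research areas can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: the records, cluster labels, and field names are hypothetical, and simple document-frequency counting stands in for the topic model.

```python
from collections import Counter


def emerging_share_by_cluster(papers):
    """Fraction of papers predicted to be emerging in each cluster."""
    totals = Counter()
    emerging = Counter()
    for paper in papers:
        totals[paper["cluster"]] += 1
        if paper["predicted_emerging"]:
            emerging[paper["cluster"]] += 1
    return {cluster: emerging[cluster] / totals[cluster] for cluster in totals}


def top_terms(papers, cluster, k=3):
    """Most frequent terms in a cluster (a stand-in for a topic model)."""
    counts = Counter()
    for paper in papers:
        if paper["cluster"] == cluster:
            counts.update(set(paper["terms"]))  # count each term once per paper
    # Sort by descending count, then alphabetically, for a deterministic result.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered[:k]]


# Hypothetical records for illustration only.
papers = [
    {"cluster": "c1", "predicted_emerging": True, "terms": ["zigzag", "band", "gap"]},
    {"cluster": "c1", "predicted_emerging": True, "terms": ["zigzag", "armchair", "band"]},
    {"cluster": "c1", "predicted_emerging": False, "terms": ["zigzag", "synthesis"]},
    {"cluster": "c2", "predicted_emerging": False, "terms": ["toxicity", "asbestos"]},
    {"cluster": "c2", "predicted_emerging": True, "terms": ["toxicity"]},
    {"cluster": "c2", "predicted_emerging": False, "terms": ["toxicity", "cell"]},
    {"cluster": "c2", "predicted_emerging": False, "terms": ["cell"]},
]

share = emerging_share_by_cluster(papers)
best_cluster = max(share, key=share.get)
print(best_cluster, share[best_cluster], top_terms(papers, best_cluster, k=2))
```

The point of the design is that terms are inspected only within clusters where emerging papers are concentrated, so a paper with a low individual score can still surface through the terms of its cluster.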
The limitations of this study and directions for future research need to be addressed. This study defined emerging papers as the most-cited papers (within the top 5%) three years after publication. However, the interpretation of citation counts depends on the field and the training period. This can be rephrased as a dependence on the process by which knowledge is formed in each scientific field. Therefore, the robustness of the model against variations in these parameters is assumed to vary from field to field. Similarly, challenges remain regarding robustness across databases. In this study, the SCI and SSCI indexes in Web of Science (WoS) were used as the database. Until the creation of Scopus and Google Scholar in 2004, WoS had been the sole tool for citation analysis [75]. Even today, WoS is still one of the most effective databases for historical analysis, as it is known to have a longer recording period than Scopus. However, both WoS and Scopus are now known as leading databases, and the robustness of the method across them remains to be evaluated in a future study.
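The definition of an emerging paper given above can be expressed as a simple labeling rule. The sketch below is illustrative only: the citation counts are hypothetical, and the handling of ties and of very small samples (at least one paper is always labeled) is an assumption, since the text does not specify it.

```python
def label_emerging(citations_after_3y, top_fraction=0.05):
    """Label papers whose three-year citation count is in the top fraction.

    Assumptions: ties are broken by sort order, and at least one paper
    is always labeled as emerging.
    """
    n_top = max(1, int(len(citations_after_3y) * top_fraction))
    ranked = sorted(citations_after_3y, key=citations_after_3y.get, reverse=True)
    top = set(ranked[:n_top])
    return {paper: paper in top for paper in citations_after_3y}


# Hypothetical citation counts three years after publication for 20 papers.
citations = {f"p{i}": i for i in range(20)}
labels = label_emerging(citations)

print([paper for paper, is_emerging in labels.items() if is_emerging])
```

With a 5% threshold, only one of these 20 hypothetical papers is a positive example, which illustrates how few positive examples are available for training.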
Because only the top 5% of papers are considered emerging, the number of positive examples is small; this limits the prediction performance because there are few patterns to train on. The application of this method to multiple fields is being examined, and it will be necessary to discuss robustness against parameters and appropriate parameter settings. It is also necessary to devise a way of interpreting a subcluster in which a group of papers expected to be emerging is concentrated. At present, a certain amount of domain knowledge is essential to infer the type of field from the extracted terms, so a means of semantic interpretation using multiple terms needs to be devised. In the future, more sophisticated and stable models will be developed that can contribute to policy formulation and the anticipation of future trends in multiple fields.
As the amount of information increases and the structure of knowledge becomes more complex, it will become extremely difficult for companies to make R&D investment decisions and for governments to make decisions regarding resource allocation for science and technology policy. The outlook for trends in science and technology should be developed independently. Predictive models such as those investigated in this study can facilitate this decision-making. Methods that support the extraction of papers likely to be useful in the future from enormous amounts of information are expected to become increasingly common.