# Research Front Detection and Topic Evolution Based on Topological Structure and the PageRank Algorithm


## Abstract


## 1. Introduction

## 2. Related Works

#### 2.1. Research Front Detection and Topic Evolution

#### 2.2. Related Algorithms Used in this Study

## 3. The Proposed Research Front Detection and Topic Evolution Method

#### 3.1. Notations

Notation | Description |
---|---|
A/B | scientific documents A or B in the case study |
p/q | clusters p or q in this study |
\|p/q\| | number of scientific documents in clusters p or q |
N_{A} | number of scientific documents in the cluster that contains document A |
t | length of the time window |
d | damping factor introduced in the PageRank algorithm, which is set to 0.85 in this study |
C_{cite}(A/B) | collection of scientific documents that cite documents A or B |
C_{cited}(A/B) | collection of scientific documents that are cited by documents A or B |
P(A/B) | rank value of documents A or B in the cluster |
N_{in}(A/B) | in-degree of documents A or B, which equals the number of scientific documents that cite documents A or B |
N_{out}(A/B) | out-degree of documents A or B, which equals the number of scientific documents that are cited by documents A or B |
N_{cluster}(B) | number of scientific documents that are cited by document B and belong to the same cluster as document B |
N_{ci}(p,q) | number of citations between clusters q and p |
H(var) | function that returns the value of variable var if var is not equal to zero, and positive infinity otherwise |
H_{in}(A,B) | function that returns N_{in}(A) if document B cites document A, returns N_{in}(B) if document A cites document B, and returns positive infinity if documents A and B have no direct citation relationship |
H_{out}(A,B) | function that returns N_{out}(A) if document A cites document B, returns N_{out}(B) if document B cites document A, and returns positive infinity if documents A and B have no direct citation relationship |
S_{co}(A,B) | similarity between documents A and B based on relative co-citation [25] |
S_{bi}(A,B) | similarity between documents A and B based on relative bibliographic coupling [25] |
S(A,B) | similarity between documents A and B based on the traditional approach that combines relative co-citation and bibliographic coupling [10] |
${S}_{co}^{\prime}(A,B)$ | similarity between documents A and B based on extended co-citation |
${S}_{bi}^{\prime}(A,B)$ | similarity between documents A and B based on extended bibliographic coupling |
${S}^{\prime}(A,B)$ | similarity between documents A and B based on our proposed approach |
S_{cluster}(p,q) | similarity between clusters p and q |
F(p,x) | enhanced frequency of keyword x in cluster p, which is based on our proposed approach |
δ(A,x) | binary parameter, with 1 representing that document A contains keyword x, and 0 otherwise |
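
The helper functions H, H_{in}, and H_{out} are defined only in words in the notation list, so a minimal Python sketch may make their behavior easier to check. The edge-list input, the function names, and the `build_citation_sets` helper are illustrative assumptions and do not come from the article. When two documents cite each other, this sketch simply returns the first matching case, which is also an assumption.

```python
import math
from collections import defaultdict


def build_citation_sets(citations):
    """Build C_cite and C_cited from (citing, cited) pairs (hypothetical input format)."""
    cited_by = defaultdict(set)  # C_cite(X): documents that cite X
    cites = defaultdict(set)     # C_cited(X): documents that X cites
    for citing_doc, cited_doc in citations:
        cites[citing_doc].add(cited_doc)
        cited_by[cited_doc].add(citing_doc)
    return cited_by, cites


def H(var):
    """Return var if it is non-zero, otherwise positive infinity."""
    return var if var != 0 else math.inf


def H_in(a, b, cited_by):
    """N_in(a) if b cites a, N_in(b) if a cites b, else positive infinity."""
    if b in cited_by[a]:
        return len(cited_by[a])
    if a in cited_by[b]:
        return len(cited_by[b])
    return math.inf


def H_out(a, b, cites):
    """N_out(a) if a cites b, N_out(b) if b cites a, else positive infinity."""
    if b in cites[a]:
        return len(cites[a])
    if a in cites[b]:
        return len(cites[b])
    return math.inf


# Toy citation list: A cites B and X1, and Y1 cites B.
cited_by, cites = build_citation_sets([("A", "B"), ("Y1", "B"), ("A", "X1")])
h_in_ab = H_in("A", "B", cited_by)    # A cites B, so this is N_in(B)
h_out_ab = H_out("A", "B", cites)     # A cites B, so this is N_out(A)
h_in_none = H_in("Y1", "X1", cited_by)  # no direct citation between Y1 and X1
```

Returning positive infinity for unrelated documents means any term that divides by these values vanishes, which is presumably why the notation uses infinity rather than zero.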

#### 3.2. Scientific Document Clustering

… Y_{3}), and the number of documents which cite either document A or document B is five (i.e., documents Y_{1}, Y_{2}, Y_{3}, Y_{4}, and A). According to the definition of relative bibliographic coupling (Equation (2)), the similarity between documents A and B is 1/5, because the number of documents which are cited by both documents A and B is one (i.e., document X_{3}), and the number of documents which are cited by either document A or document B is five (i.e., documents X_{1}, X_{2}, X_{3}, X_{4}, and B). According to the definition of the approach that combines relative co-citation and bibliographic coupling (Equation (5)), the similarity between documents A and B is 2/10, because the number of documents that cite or are cited by both documents A and B is two (i.e., documents Y_{3} and X_{3}), and the number of documents that cite or are cited by either document A or document B is 10 (i.e., documents Y_{1}, Y_{2}, Y_{3}, Y_{4}, X_{1}, X_{2}, X_{3}, X_{4}, A, and B).
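
The counts in this worked example can be checked with a short Python sketch. The text fixes only the unions and intersections of the citing and cited sets, so the assignment of Y_{1}, Y_{2}, Y_{4}, X_{1}, X_{2}, and X_{4} to document A or B below is an assumption, as are the function names. The ratios follow the wording of the example (shared documents divided by documents linked to either one) and are not copied from the article's equations.

```python
from fractions import Fraction

# C_cite: documents that cite the key document (edge assignment is assumed).
citing = {
    "A": {"Y1", "Y2", "Y3"},
    "B": {"Y3", "Y4", "A"},   # A cites B
}
# C_cited: documents cited by the key document (edge assignment is assumed).
cited = {
    "A": {"X1", "X2", "X3", "B"},  # A cites B
    "B": {"X3", "X4"},
}


def relative_overlap(first, second):
    """Shared documents divided by documents linked to either one."""
    union = first | second
    return Fraction(len(first & second), len(union)) if union else Fraction(0)


def combined_similarity(a, b):
    """Sum the shared counts and the union counts over both link directions."""
    shared = len(citing[a] & citing[b]) + len(cited[a] & cited[b])
    total = len(citing[a] | citing[b]) + len(cited[a] | cited[b])
    return Fraction(shared, total) if total else Fraction(0)


s_co = relative_overlap(citing["A"], citing["B"])  # relative co-citation
s_bi = relative_overlap(cited["A"], cited["B"])    # relative bibliographic coupling
s = combined_similarity("A", "B")                  # combined (traditional) approach
```

`Fraction` reduces 2/10 to 1/5 automatically, so all three values compare equal to `Fraction(1, 5)` for this example.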

#### 3.3. Clustering Theme Detection

#### 3.4. Research Front Detection and Topic Evolution

## 4. Case Study and Experiments

#### 4.1. Dataset

#### 4.2. Data Preprocessing

#### 4.3. Experimental Design and Evaluation Index

…_{A} represents the silhouette value of document A; ${M}_{A}^{1}$ represents the mean similarity between document A and the scientific documents in the same cluster; and ${M}_{A}^{2}$ represents the mean similarity between document A and the scientific documents in the cluster which is most similar to the cluster that contains document A. Moreover, ${M}_{A}^{2}$ is calculated according to Equation (12).

…_{p} represents the number of clusters in the time window which contains document A, and H_{s}(A,B) represents the similarity function between documents A and B, which returns ${S}^{\prime}(A,B)$ if documents are clustered based on our proposed approach, and returns S(A,B) if documents are clustered based on the traditional approach [10]. In Equation (12), documents A and B belong to different clusters.
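
A small Python sketch can illustrate how a similarity-based silhouette value is computed from the two mean similarities described above. The equations themselves are not reproduced in this text, so the sketch assumes the standard similarity form of the silhouette [28], that is, the difference of the two means divided by the larger of them. It also assumes that the most similar cluster is the other cluster with the highest mean similarity to the document. The function names, the toy data, and the `sim` callback (standing in for H_{s}) are hypothetical.

```python
def mean_similarity(doc, members, sim):
    """Mean similarity between doc and the other documents in members."""
    others = [m for m in members if m != doc]
    return sum(sim(doc, m) for m in others) / len(others) if others else 0.0


def silhouette_value(doc, clusters, sim):
    """Similarity-based silhouette of doc (assumed form, see the lead-in)."""
    own = next(cid for cid, members in clusters.items() if doc in members)
    m1 = mean_similarity(doc, clusters[own], sim)  # within-cluster mean
    rival_means = [
        mean_similarity(doc, members, sim)
        for cid, members in clusters.items()
        if cid != own and members
    ]
    m2 = max(rival_means, default=0.0)  # mean for the closest other cluster
    top = max(m1, m2)
    return (m1 - m2) / top if top > 0 else 0.0


# Toy pairwise similarities; unlisted pairs default to 0.
toy_sim = {
    frozenset(("a", "b")): 0.5,
    frozenset(("a", "c")): 0.25,
    frozenset(("b", "c")): 0.25,
}


def sim(x, y):
    return toy_sim.get(frozenset((x, y)), 0.0)


clusters = {"p": {"a", "b"}, "q": {"c"}}
value_a = silhouette_value("a", clusters, sim)  # (0.5 - 0.25) / 0.5
```

Values close to 1 indicate that a document is much more similar to its own cluster than to any other, while negative values suggest it may sit in the wrong cluster.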

#### 4.4. Experiment Results

#### 4.4.1. Scientific Document Clustering

#### 4.4.2. Clustering Theme Detection

#### 4.4.3. Research Front Detection and Topic Evolution

## 5. Conclusions and Future Work

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

Time Window | Cluster (Size) | Clustering Theme |
---|---|---|
1993–1997 | Cluster 1 (35) | Neural network; uncertainty; prediction |
1993–1997 | Cluster 2 (31) | Association rule; knowledge discovery; clustering |
1993–1997 | Cluster 3 (19) | Knowledge discovery; machine learning; rule |
1998–2002 | Cluster 1 (325) | Protein; identification; neural network |
1998–2002 | Cluster 2 (207) | Neural network; knowledge discovery; decision tree |
1998–2002 | Cluster 3 (163) | Neural network; machine learning; genetic algorithm |
1998–2002 | Cluster 4 (151) | Knowledge discovery; rough set; machine learning |
1998–2002 | Cluster 5 (146) | Association rule; knowledge discovery; pattern |
1998–2002 | Cluster 6 (136) | Decision tree; machine learning; knowledge discovery |
2003–2007 | Cluster 1 (1597) | Clustering analysis; bioinformatics; gene expression |
2003–2007 | Cluster 2 (747) | Decision tree; machine learning; neural network |
2003–2007 | Cluster 3 (373) | Association rule; sequential pattern; knowledge discovery |
2003–2007 | Cluster 4 (344) | Association rule; knowledge discovery; frequent itemset |
2003–2007 | Cluster 5 (303) | Rough set; feature selection; genetic algorithm |
2003–2007 | Cluster 6 (224) | Sequential pattern; association rule; knowledge discovery |
2003–2007 | Cluster 7 (115) | Knowledge discovery; prediction; neural network |
2008–2012 | Cluster 1 (1830) | Clustering analysis; identification; bioinformatics |
2008–2012 | Cluster 2 (1617) | Support vector machine; decision tree; prediction |
2008–2012 | Cluster 3 (566) | Association rule; pattern; knowledge discovery |
2008–2012 | Cluster 4 (232) | Privacy; security; k-anonymity |
2008–2012 | Cluster 5 (200) | Sequential pattern; association rule; knowledge discovery |
2013–2017 | Cluster 1 (2117) | Clustering analysis; social network; big data |
2013–2017 | Cluster 2 (1750) | Support vector machine; prediction; neural network |
2013–2017 | Cluster 3 (1544) | Identification; gene expression; bioinformatics |
2013–2017 | Cluster 4 (930) | Association rule; sequential pattern; knowledge discovery |
2013–2017 | Cluster 5 (517) | Machine learning; prediction; decision tree |
2013–2017 | Cluster 6 (460) | Prediction; educational data mining; design |
2013–2017 | Cluster 7 (249) | Rough set; attribute reduction; approximation |
2013–2017 | Cluster 8 (189) | Differential privacy; k-anonymity; big data |

## References

1. Chen, C. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. J. Am. Soc. Inf. Sci. Technol. **2006**, 57, 359–377.
2. Wu, Y.; Jin, X.; Xue, Y.Z. Evaluation of research topic evolution in psychiatry using co-word analysis. Medicine **2017**, 96, e7349.
3. Liu, X.; Jiang, T.; Ma, F. Collective dynamics in knowledge networks: Emerging trends analysis. J. Informetr. **2013**, 7, 425–438.
4. Fujita, K.; Kajikawa, Y.; Mori, J.; Sakata, I. Detecting research fronts using different types of weighted citation networks. J. Eng. Technol. Manag. **2014**, 32, 129–146.
5. Chen, B.; Tsutsui, S.; Ding, Y.; Ma, F. Understanding the topic evolution in a scientific domain: An exploratory study for the field of information retrieval. J. Informetr. **2017**, 11, 1175–1189.
6. Boyack, K.W.; Klavans, R. Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? J. Assoc. Inf. Sci. Technol. **2010**, 61, 2389–2404.
7. Glänzel, W.; Thijs, B. Using ‘core documents’ for detecting and labelling new emerging topics. Scientometrics **2012**, 91, 399–416.
8. Yu, D.J.; Wang, W.R.; Zhang, S.; Zhang, W.Y.; Liu, R.Y. Hybrid self-optimized clustering model based on citation links and textual features to detect research topics. PLoS ONE **2017**, 12, e0187164.
9. Zhang, W.; Wang, X.G.; Zhao, D.L.; Tang, X.O. Graph degree linkage: Agglomerative clustering on a directed graph. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 428–441.
10. Bichteler, J.; Eaton, E.A., III. The combined use of bibliographic coupling and cocitation for document retrieval. J. Am. Soc. Inf. Sci. **1980**, 31, 278–282.
11. Shubankar, K.; Singh, A.P.; Pudi, V. A frequent keyword-set based algorithm for topic modeling and clustering of research papers. In Proceedings of the 3rd Conference on Data Mining and Optimization, Putrajaya, Malaysia, 28–29 June 2011; pp. 96–102.
12. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. **2003**, 3, 993–1022.
13. Kim, J.; Lee, E. Understanding review expertise of developers: A reviewer recommendation approach based on latent Dirichlet allocation. Symmetry **2018**, 10, 114.
14. Kim, M.; Gupta, B.B.; Rho, S. Crowdsourcing based scientific issue tracking with topic analysis. Appl. Soft Comput. **2018**, 66, 506–511.
15. Qiao, S.; Han, A. A way to construct evolution model of scientific papers based on the seed document and OLDA models. In Proceedings of the 2013 International Conference on Mechatronic Science, Electric Engineering and Computer, Shenyang, China, 20–22 December 2013; pp. 900–903.
16. Morris, S.A.; Yen, G.; Wu, Z.; Asnake, B. Time line visualization of research fronts. J. Am. Soc. Inf. Sci. Technol. **2003**, 54, 413–422.
17. Clauset, A.; Newman, M.E.; Moore, C. Finding community structure in very large networks. Phys. Rev. E **2004**, 70, 066111.
18. Brin, S.; Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. **1998**, 30, 107–117.
19. Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA **2002**, 99, 7821–7826.
20. Newman, M.E.J. Fast algorithm for detecting community structure in networks. Phys. Rev. E **2004**, 69, 066133.
21. dos Santos, C.K.; Evsukoff, A.G.; de Lima, B.S.L.P. Cluster analysis in document networks. In Proceedings of the Conference on Data Mining Protection, Univ Cadiz, Cadiz, Spain, 26–28 May 2008; pp. 95–104.
22. Chen, P.; Xie, H.; Maslov, S.; Redner, S. Finding scientific gems with Google’s PageRank algorithm. J. Informetr. **2007**, 1, 8–15.
23. Nykl, M.; Campr, M.; Jezek, K. Author ranking based on personalized PageRank. J. Informetr. **2015**, 9, 777–799.
24. Yu, D.J.; Wang, W.R.; Zhang, S.; Zhang, W.Y.; Liu, R.Y. A multiple-link, mutually reinforced journal-ranking model to measure the prestige of journals. Scientometrics **2017**, 111, 521–542.
25. Egghe, L.; Rousseau, R. Co-citation, bibliographic coupling and a characterization of lattice citation networks. Scientometrics **2002**, 55, 349–361.
26. Boyack, K.W.; Newman, D.; Duhon, R.J.; Klavans, R.; Patek, M.; Biberstine, J.R. Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLoS ONE **2011**, 6, e18029.
27. Dehdarirad, T.; Villarroya, A.; Barrios, M. Research trends in gender differences in higher education and science: A co-word analysis. Scientometrics **2014**, 101, 273–290.
28. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. **1987**, 20, 53–65.
29. Janssens, F.; Glänzel, W.; De Moor, B. A hybrid mapping of information science. Scientometrics **2008**, 75, 607–631.
30. Bafna, P.; Pramod, D.; Vaidya, A. Document clustering: TF-IDF approach. In Proceedings of the International Conference on Electrical, Electronics, and Optimization Techniques, Palnchur, India, 3–5 March 2016; pp. 61–66.

**Figure 3.** Mean silhouette value of document clustering based on our proposed approach and the traditional approach, with different time window lengths t.

**Figure 4.** Relationship between the number of documents (cumulative percentage of documents) and the published time interval between the documents and their corresponding references.

**Figure 5.** Evolution of the topics in the second largest cluster with a time window from 2013 to 2017.

Example | Relative Co-Citation | Relative Bibliographic Coupling | Extended Co-Citation | Extended Bibliographic Coupling | Traditional Approach | Our Proposed Approach |
---|---|---|---|---|---|---|
(1) | 1/3 | 1/3 | 1/2 | 1/2 | 1/3 | 1/2 |
(2) | 1/4 | 1/4 | 7/12 | 7/12 | 1/4 | 7/12 |
(3) | 1/4 | 1/4 | 5/12 | 5/12 | 1/4 | 5/12 |
(4) | 1/5 | 1/5 | 1/2 | 1/2 | 1/5 | 1/2 |

**Notes:** The circle represents a scientific document and the arrow represents citation direction. For example, document Y_{1} cites document A in the first example.

Cluster (Size) | Clustering Theme |
---|---|
Cluster 1 (2117) | Clustering analysis; social network; big data |
Cluster 2 (1750) | Support vector machine; prediction; neural network |
Cluster 3 (1544) | Identification; gene expression; bioinformatics |
Cluster 4 (930) | Association rule; sequential pattern; knowledge discovery |
Cluster 5 (517) | Machine learning; prediction; decision tree |
Cluster 6 (460) | Prediction; educational data mining; design |
Cluster 7 (249) | Rough set; attribute reduction; approximation |
Cluster 8 (189) | Differential privacy; k-anonymity; big data |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Xu, Y.; Zhang, S.; Zhang, W.; Yang, S.; Shen, Y.
Research Front Detection and Topic Evolution Based on Topological Structure and the PageRank Algorithm. *Symmetry* **2019**, *11*, 310.
https://doi.org/10.3390/sym11030310

**Chicago/Turabian Style**

Xu, Yangbing, Shuai Zhang, Wenyu Zhang, Shuiqing Yang, and Yue Shen.
2019. "Research Front Detection and Topic Evolution Based on Topological Structure and the PageRank Algorithm" *Symmetry* 11, no. 3: 310.
https://doi.org/10.3390/sym11030310