Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature: Structural Characteristics and Evolution

The subjects of literature are the direct expression of the author’s research results. Mining valuable knowledge helps to save time for the readers to understand the content and direction of the literature quickly. Therefore, the co-occurrence network of high-frequency words in the bioinformatics literature and its structural characteristics and evolution were analysed in this paper. First, 242,891 articles from 47 top bioinformatics periodicals were chosen as the object of the study. Second, the co-occurrence relationship among high-frequency words of these articles was analysed by word segmentation and high-frequency word selection. Then, a co-occurrence network of high-frequency words in bioinformatics literature was built. Finally, the conclusions were drawn by analysing its structural characteristics and evolution. The results showed that the co-occurrence network of high-frequency words in the bioinformatics literature was a small-world network with scale-free distribution, rich-club phenomenon and disassortative matching characteristics. At the same time, the high-frequency words used by authors changed little in 2–3 years but varied greatly in four years because of the influence of the state-of-the-art technology.


Introduction
Biotechnology has driven biological information research in the twenty-first century. Biological information improves the level of biological intelligence with the help of information technology. A large number of related articles have been published in this field, as it is one of the most closely followed research areas. Biological research results and findings are recorded in various forms of literature [1]. More than 1 million articles have been added to PubMed each year since 2011, and they can be accessed by searching on the website (https://www.ncbi.nlm.nih.gov/pubmed). Consequently, how to quickly collect the literature and use knowledge discovery and data mining methods to identify future research hotspots has become an urgent issue in scientific investigations. For example, many countries carry a high cancer burden, and comprehensive cancer nursing has become increasingly complicated and difficult [1]. Classifying the vast number of existing articles will help to identify the cause and effect of cancer and achieve the goal of preventing cancer.
The theory of complex networks has been widely applied in many fields [2–20]. First, social networks are its most classical application. For instance, complex networks have been used to detect relationships among people or networks online [2–5], and to study the impact of disasters on social networks [6], the trade-off between social utility and economic performance [7], the intention to migrate [8,9], and the unification of aging and frailty [10]. Second, the theory is also used for other networks [11–13], for example, traffic networks [11,12] and metro networks [13]. Co-word analysis is one direction for applying complex networks in text mining [14]. Many articles have reported on the application of co-word analysis to the biological literature. For example, a text and co-word matrix composed of 40 high-frequency words and 2945 articles was constructed to study cancer immunotherapy, which makes use of the immune system to treat cancer [14]. At the same time, complex networks, such as knowledge networks, social networks, shipping networks, and traffic networks, have been widely used in daily life; analysing their structural characteristics provides inspiration for work and life. Zarandi [15] proposed a novel community detection algorithm based on structural similarity, executed in two consecutive phases. İlhan [16] proposed a novel framework that examined various structural features of the network and detected the most prominent subset of community features to predict the future direction of community evolution. Zhao [17] reported several significant findings. The evolution of the knowledge flow network of a strategic alliance could produce a bifurcation phenomenon composed of saddle-node bifurcation and trans-critical bifurcation. Knowledge-embedded resource allocation was the most effective in improving the knowledge flow rate of networks and could further supply ample impetus for evolution. These findings were beneficial for understanding the key problems of each resource allocation model and the evolution of strategic alliances in knowledge flow networks. Li [18] investigated the evolution of the network structure in TFP-glass with increasing temperature to explore the plausible mechanism. The dissociation of the network structure in TFP-glass, which could be observed in all TFP-glass samples synthesized under different conditions, was believed to be the reason for the temperature dependency of the interfacial interaction. The mechanism underlying the dissociation was thoroughly investigated using two-dimensional infrared spectroscopy, dynamic rheological analysis, and XRD. Yang [19] explored a discussion network and its structural evolution based on an empirical study of a famous online discussion that happened in China in 2008, and found that the scale growth of the network had an S shape, the degree distribution followed a power law in the first half, and the network showed a degree of disassortativity. Wang [20] drew on the analysis ideas and methods of complex networks, used the standardized Laplace matrix and the K-means clustering method to divide a gene regulatory network into multiple communities, and demonstrated the gene interactions within each community and among communities.
The map of scientific knowledge is one of the hottest research methods in the field of international scientific metrology. It combines citation analysis and visualization technology in scientific metrology to realize the effective use of information and further generate new knowledge. The co-occurrence relationship is one of the most important aspects of the map of scientific knowledge. It has already been widely used in text mining [21], social network analysis [22], environmental analysis [23], and so forth. It is also applied in the field of biology to solve a range of related problems. For example, Kamneva [24] predicted co-occurrence between reference genomes from two 16S-based ecological datasets. Wang et al. [25,26] analysed the Protein Domain Co-occurrence Network for predicting protein and domain functions. Li et al. [27] showed that the signatures of ARG and MRG co-occurrence were much more frequent and that the co-occurrence structures in the habitat divisions were significantly different, which could be attributed to their distinct gene transfer potentials. However, few studies have focused on the hot or high-frequency words in the biological literature.
In this study, the co-occurrence word method (one of the most popular methods for the scientific knowledge map) was adopted to analyse 242,891 articles from 47 top bioinformatics journals published during 2013-2018. The hot words of the related subjects and contents in the field of biological information were identified, the relationships among the subjects were analysed, and the co-occurrence network of high-frequency words was then constructed using the words and terms appearing in the same article subject. Subsequently, the structural characteristics and evolution rules of the network were analysed, and the hot research trends in the field of bioinformatics were summarized. Python was used to carry out the experimental process, and Gephi (open-source and free software available at https://gephi.org/) was used to visualize the results.

Datasets
Journals on bioinformatics are so vast in number that analysing all of them was difficult. Therefore, only those journals that included Mathematical and Computational Biology, Biochemistry and Molecular Biology, Biotechnology and Applied Microbiology, and Multidisciplinary Science were chosen in this study from PubMed and Letpub. Finally, 47 journals were selected, and their names, IF, rank, areas and publishers are given in Table 1. A total of 242,891 articles were chosen from these journals during the last five years (2013-2018) to identify the recent research hotspots in the bioinformatics literature. A co-occurrence network of bioinformatics high-frequency words based on these 242,891 articles was constructed, and its process is given in Figure 1.
Appl. Sci. 2018, 8, x FOR PEER REVIEW

Word Segmentation and High-Frequency Words
An article usually comprises several thousand words (as in this study), so analysing the full text of all 242,891 articles would have been slow. Therefore, the subject of each article, comprising its title, summary, and keywords, was analysed instead of the full text. Each subject was split into words on spaces, and the resulting word sequence was recorded to represent the article. Next, function words, such as articles, prepositions, conjunctions, and other words without practical significance, were removed. The pseudocode of this section is shown in Figures A1-A3 in Appendix A. Let $c_i$ denote the $i$th word and $n_i$ its frequency. The probability of $c_i$ appearing in the subjects of all $N$ articles was calculated as $p_i = n_i/N$. The words were then sorted by frequency (or, equivalently, probability) in descending order, so that $n_i \ge n_j$ for all $i < j$ (equivalently, $p_i \ge p_j$). Finally, the number of high-frequency words $K$ was set, and the top $K$ words were taken as the high-frequency words.
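The segmentation and selection steps described above can be sketched in Python as follows. This is a minimal illustration and not the pseudocode of Appendix A: the stop-word list, the function names `tokenize` and `high_frequency_words`, and the three toy subjects are hypothetical, and $n_i$ is assumed to count the number of subjects containing the word so that $p_i = n_i/N$ stays between 0 and 1.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the paper does not publish its own.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "to", "with", "by"}


def tokenize(subject):
    """Split a subject (title + summary + keywords) on whitespace and drop function words."""
    words = [w.strip(".,;:()[]\"'").lower() for w in subject.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def high_frequency_words(subjects, k):
    """Return the top-k words as (word, n_i, p_i) tuples.

    Assumption: n_i is the number of subjects that contain the word,
    and p_i = n_i / N with N the number of subjects.
    """
    n_articles = len(subjects)
    counts = Counter()
    for subject in subjects:
        counts.update(set(tokenize(subject)))  # count each word once per subject
    return [(w, n, n / n_articles) for w, n in counts.most_common(k)]


# Toy data, invented for illustration only.
subjects = [
    "Analysis of protein metabolism in cancer",
    "Genetic analysis of protein expression",
    "Metabolism and the genetic study of cancer",
]
top = high_frequency_words(subjects, k=3)
for word, n_i, p_i in top:
    print(word, n_i, round(p_i, 3))
```

Counting each word once per subject is a design choice made here for the sketch; counting every occurrence instead would only require dropping the `set(...)` call.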

High-Frequency Word Co-Occurrence Matrix and Co-Occurrence Network
Let $e_{ij}$ be the number of article subjects, among all $N$ article subjects, in which the two high-frequency words $c_i$ and $c_j$ both appear. The co-occurrence relationship between any two high-frequency words was expressed by mutual information from information theory (pseudocode shown in Figure A4), which describes the degree of association between the two words. It was calculated using Equation (1):
$$I_{i,j} = \log_2 \frac{P_{i,j}}{P_i P_j} \tag{1}$$
where $P_{i,j}$ represents the probability of co-occurrence of $c_i$ and $c_j$, $P_i$ indicates the probability of occurrence of $c_i$, and $P_j$ indicates the probability of occurrence of $c_j$. The larger the value of $I_{i,j}$, the greater the co-occurrence degree of $c_i$ and $c_j$. The matrix $(I_{i,j})_{K \times K}$ is the high-frequency word co-occurrence matrix; because the relation is symmetric, $I_{i,j} = I_{j,i}$, the matrix can also be expressed as an upper or lower triangular matrix.
Mutual information was chosen over the raw co-occurrence count for the following reason. Assume 10,000 articles in which the high-frequency words $c_1$ and $c_2$ have frequencies $n_1 = 8000$ ($p_1 = 0.8$) and $n_2 = 7000$ ($p_2 = 0.7$), respectively, and co-occur 5000 times ($p_{1,2} = 0.5$); the mutual information is then $I_{1,2} = \log_2(0.5/(0.8 \times 0.7)) \approx -0.16$. Assume further that the high-frequency words $c_3$ and $c_4$ have frequencies $n_3 = 5000$ ($p_3 = 0.5$) and $n_4 = 5000$ ($p_4 = 0.5$), respectively, and co-occur 4500 times ($p_{3,4} = 0.45$); the mutual information is then $I_{3,4} = \log_2(0.45/(0.5 \times 0.5)) \approx 0.85$. Although $c_3$ and $c_4$ co-occurred fewer times than $c_1$ and $c_2$, they almost always appeared together. Therefore, they were considered to be in a co-occurrence relationship. Additionally, association rules [28] were also used to obtain the co-occurrence relationship among high-frequency words.
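The worked example can be checked numerically with a few lines of Python. The sketch below simply evaluates Equation (1) with a base-2 logarithm; the function name `mutual_information` and its argument names are illustrative, and the probabilities are assumed to be plain relative frequencies ($P_i = n_i/N$, $P_{i,j} = e_{ij}/N$).

```python
import math


def mutual_information(n_i, n_j, e_ij, n_articles):
    """Evaluate I_ij = log2(P_ij / (P_i * P_j)) from raw counts.

    Assumes e_ij > 0; a pair that never co-occurs has no finite value.
    """
    p_i = n_i / n_articles
    p_j = n_j / n_articles
    p_ij = e_ij / n_articles
    return math.log2(p_ij / (p_i * p_j))


# Word pairs from the example: 10,000 articles in total.
i_12 = mutual_information(8000, 7000, 5000, 10_000)
i_34 = mutual_information(5000, 5000, 4500, 10_000)
print(round(i_12, 2), round(i_34, 2))
```

With these inputs the first pair evaluates to roughly -0.16 and the second to roughly 0.85, so the pair with the smaller raw co-occurrence count receives the larger score.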

Average Path Length
The average path length of a network is the average of the shortest path lengths between all pairs of nodes in the network, and was calculated using Equation (2):
$$L = \frac{2}{N(N-1)} \sum_{i<j} d_{ij} \tag{2}$$
where $N$ is the number of nodes and $d_{ij}$ is the number of edges on the shortest path between high-frequency word nodes $i$ and $j$.

Clustering Coefficient
The clustering coefficient of the network is the average of the clustering coefficients of all nodes in the network, defined as follows:
$$C = \frac{1}{N} \sum_{i=1}^{N} \frac{2N_i}{k_i(k_i-1)} \tag{3}$$
where $k_i$ is the degree of node $i$, and $N_i$ is the number of edges among the $k_i$ neighbour nodes of node $i$.
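Both quantities can be computed with the standard library alone. The following is a minimal sketch under stated assumptions: the graph is undirected, unweighted, and connected, it is stored as a dictionary mapping each word to the set of its neighbours, and nodes with fewer than two neighbours contribute zero to the clustering coefficient. The function names and the toy graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges on the shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all node pairs (assumes a connected graph)."""
    n = len(adj)
    total = 0
    for source in adj:
        total += sum(shortest_path_lengths(adj, source).values())
    # Every unordered pair was counted twice, once from each endpoint.
    return total / (n * (n - 1))


def clustering_coefficient(adj):
    """Average over nodes of 2 * N_i / (k_i * (k_i - 1)); degree < 2 counts as 0."""
    total = 0.0
    for neigh in adj.values():
        k = len(neigh)
        if k < 2:
            continue
        links = sum(1 for u in neigh for v in neigh if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(round(average_path_length(adj), 3), round(clustering_coefficient(adj), 3))
```

For a network of only 50 nodes, as in this study, the quadratic cost of running one breadth-first search per node is negligible.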

Rich-Club Coefficient
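The rich-club coefficient is defined as follows:
$$\phi(k) = \frac{2E_{>k}}{N_{>k}(N_{>k}-1)} \tag{4}$$
where $N_{>k}$ is the number of nodes with degree larger than $k$, $E_{>k}$ is the number of edges among these nodes, and $N_{>k}(N_{>k}-1)/2$ represents the maximum possible number of edges among them.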
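As an illustration of this definition, the sketch below computes the rich-club coefficient for an undirected graph stored as a dictionary of neighbour sets. The function name, the choice to return `None` when fewer than two nodes exceed degree $k$, and the toy graph are assumptions made for the example.

```python
def rich_club_coefficient(adj, k):
    """phi(k) = 2 * E_{>k} / (N_{>k} * (N_{>k} - 1)) for an undirected graph."""
    rich = {node for node, neigh in adj.items() if len(neigh) > k}
    n_rich = len(rich)
    if n_rich < 2:
        return None  # undefined when fewer than two nodes have degree > k
    # Each edge inside the rich set is seen from both endpoints, hence // 2.
    e_rich = sum(1 for u in rich for v in adj[u] if v in rich) // 2
    return 2 * e_rich / (n_rich * (n_rich - 1))


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
for k in range(3):
    print(k, rich_club_coefficient(adj, k))
```

A value of 1 means that the nodes with degree above $k$ are fully connected to one another.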

Neighbour Average Degree
The neighbour average degree of node $i$ was calculated using Equation (5):
$$k_{nn,i} = \frac{1}{k_i} \sum_{j \in N_i} k_j \tag{5}$$
where $N_i$ is the set of neighbours of node $i$. The neighbour average degrees of all nodes with the same degree $k$ were then averaged:
$$k_{nn}(k) = \frac{1}{N_k} \sum_{i:\,k_i = k} k_{nn,i}$$
where $N_k$ is the number of nodes in the network with degree $k$.
If the value of $k_{nn}(k)$ increased with increasing $k$, high-connectivity nodes tended to connect with other high-connectivity nodes, and the network was assortative. Conversely, if the value of $k_{nn}(k)$ decreased with increasing $k$, the network was disassortative.
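The two averaging steps can be sketched as follows. The code assumes an undirected graph stored as a dictionary of neighbour sets; the function names and the star-shaped toy graph are hypothetical and serve only to show a clearly disassortative case.

```python
from collections import defaultdict


def neighbour_average_degree(adj):
    """k_nn,i: mean degree of the neighbours of node i (isolated nodes are skipped)."""
    return {
        node: sum(len(adj[j]) for j in neigh) / len(neigh)
        for node, neigh in adj.items()
        if neigh
    }


def knn_by_degree(adj):
    """k_nn(k): mean of k_nn,i over all nodes whose degree equals k."""
    groups = defaultdict(list)
    for node, value in neighbour_average_degree(adj).items():
        groups[len(adj[node])].append(value)
    return {k: sum(vals) / len(vals) for k, vals in sorted(groups.items())}


# Toy star graph: one hub joined to four leaves.
adj = {
    "hub": {"l1", "l2", "l3", "l4"},
    "l1": {"hub"},
    "l2": {"hub"},
    "l3": {"hub"},
    "l4": {"hub"},
}
knn = knn_by_degree(adj)
print(knn)
```

In the star graph the leaves (degree 1) have a neighbour average degree of 4 and the hub (degree 4) has a neighbour average degree of 1, so $k_{nn}(k)$ falls as $k$ grows, which is the disassortative pattern.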

Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature
The entries of the co-occurrence matrix were arranged in descending order of mutual information, and a threshold E was set: the top E co-occurring pairs of high-frequency words were taken as the edges. The high-frequency words joined by these E edges were the nodes, and the network formed by these nodes and edges was the co-occurrence network of high-frequency words. In this study, without loss of generality, K and E were chosen to be 500 and 200, respectively, for the following reasons: (1) if K was increased while E was fixed (e.g., K = 1000, E = 200), the co-occurrence network remained unchanged; (2) if E was increased while K was fixed (e.g., K = 500, E = 500), the newly added nodes had little influence on the structural characteristics of the network; (3) if both K and E were increased (e.g., K = 1000, E = 500), the nodes and edges were too many to display clearly in the network (shown in Figure A5). Therefore, the 500 most frequent words were chosen as the high-frequency words, the 200 pairs with the largest mutual information among them were taken as the edges, and the high-frequency words joined by these 200 edges were the nodes. The resulting co-occurrence network of high-frequency words is shown in Figure 2. In Figure 2, the size of a node indicates its degree, that is, the number of neighbour nodes connected directly to it; only 50 of the top 500 high-frequency words appear as nodes, because only these 50 are joined by the top 200 edges.
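The edge-selection rule described above can be sketched in a few lines of Python. The function name `build_network`, the dictionary layout for the mutual information values, and the toy scores are all hypothetical; the study itself used E = 200 on the top K = 500 words.

```python
def build_network(mi, num_edges):
    """Keep the num_edges word pairs with the largest mutual information.

    mi maps an unordered word pair (w1, w2) to its mutual information.
    Returns an adjacency dict: word -> set of neighbouring words.
    """
    top_pairs = sorted(mi, key=mi.get, reverse=True)[:num_edges]
    adj = {}
    for w1, w2 in top_pairs:
        adj.setdefault(w1, set()).add(w2)
        adj.setdefault(w2, set()).add(w1)
    return adj


# Toy mutual information values, invented for illustration only.
mi = {
    ("protein", "metabolism"): 0.9,
    ("genetic", "protein"): 0.7,
    ("study", "analysis"): 0.2,
    ("genetic", "study"): -0.1,
}
adj = build_network(mi, num_edges=2)
print(sorted(adj), len(adj))
```

Because nodes are added only when one of their edges is kept, the node count follows from the edge threshold, which matches the observation that 200 edges yielded 50 nodes.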
The topological structure of the co-occurrence network of the high-frequency words in bioinformatics articles is shown in Figure 2. In this network, a node represents a high-frequency word in all bioinformatics articles. An edge represents the co-occurrence relationship between two high-frequency words appearing in the subject of the same article simultaneously. The node degree is one of the key indicators for measuring a node's importance in the network. Nodes with a large degree are often considered high-connectivity nodes or hub nodes.
Generally, N is taken to be the number of nodes in the co-occurrence network of high-frequency words, and the co-occurrence relationship among the high-frequency words is represented as a binary adjacency matrix A(N,N). If a co-occurrence relationship exists between two high-frequency words i and j, the value of element $a_{ij}$ is 1; otherwise, its value is 0. A(N,N) is a symmetric matrix and can be used to calculate the structural characteristics of the network, such as the shortest path, network density, degree distribution, clustering coefficient, community structure, rich club, matching form, and so forth.
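A small sketch of the binary adjacency matrix and two quantities read directly from it, node degree (row sums) and network density, is given below. The word list, the edge list, and the function name are hypothetical.

```python
def adjacency_matrix(words, edges):
    """Build the symmetric 0/1 matrix A with a_ij = 1 if words i and j co-occur."""
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    a = [[0] * n for _ in range(n)]
    for w1, w2 in edges:
        i, j = index[w1], index[w2]
        a[i][j] = a[j][i] = 1
    return a


# Toy network, invented for illustration only.
words = ["protein", "metabolism", "genetic"]
edges = [("protein", "metabolism"), ("protein", "genetic")]
a = adjacency_matrix(words, edges)

degrees = [sum(row) for row in a]  # node degree = row sum
n = len(words)
density = sum(degrees) / (n * (n - 1))  # 2E / (N(N-1)) for an undirected graph
print(degrees, round(density, 3))
```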

Small-World Network Characteristics
Many real-world networks exhibit the structural characteristics of a small-world network. Compared with a random network of the same scale, a small-world network has a similar average path length and a higher clustering coefficient [29]. As described above, the number N of nodes in the co-occurrence network of high-frequency words in the bioinformatics literature in Figure 2 was 50. The average path length of the network was 1.9, and the clustering coefficient was 0.363. Compared with the corresponding random networks, the co-occurrence network of high-frequency words of the bioinformatics literature had the same level of average path length and a higher level of clustering coefficient, implying a clear small-world phenomenon. The results showed that, on average, any two high-frequency words of the bioinformatics literature were connected through at most one other high-frequency word. More than half of the high-frequency words had a direct co-occurrence relationship with each other, indicating a clear co-occurrence relationship among high-frequency words of the bioinformatics literature.
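For orientation, the random-network baseline can be approximated from N = 50 nodes and E = 200 edges. The sketch below uses the common textbook approximations for a random graph of the same size, a clustering coefficient of about ⟨k⟩/N and an average path length of about ln N / ln ⟨k⟩; whether the study used these formulas or simulated random networks is not stated, so the numbers are indicative only.

```python
import math

n_nodes = 50   # nodes in the network of Figure 2
n_edges = 200  # edge threshold E used in the study

mean_degree = 2 * n_edges / n_nodes                    # average degree <k>
c_random = mean_degree / n_nodes                       # approx. clustering of a random graph
l_random = math.log(n_nodes) / math.log(mean_degree)   # approx. average path length

print(mean_degree, round(c_random, 3), round(l_random, 2))
```

Under these approximations the random baseline has a clustering coefficient near 0.16 and an average path length near 1.9, which is in line with the comparison reported above (0.363 and 1.9 for the observed network).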
According to the small-world characteristics of the co-occurrence network of high-frequency words in the bioinformatics literature, any two high-frequency words have a direct or indirect co-occurrence relationship. The information in Figure 2 can provide readers or researchers with search suggestions. For example, if a researcher wants to query the literature related to "genetics", the platform should also automatically recommend the literature related to the high-frequency words "protein" and "metabolism", which have a direct co-occurrence relationship with "genetics".

Degree Distribution Characteristics
The degree distribution is one of the most important indicators for describing the structure of a complex network. In the existing literature, either P(k) (the distribution function of node degree) or P(≥k) (the cumulative degree distribution) is used to describe the degree distribution of nodes. P(k) is the ratio of the number of nodes with degree k to the total number of nodes, and P(≥k) is the ratio of the number of nodes with degree greater than or equal to k to the total number of nodes. Empirical studies show that a large number of real-world complex networks exhibit one of three types of degree distribution: scale-free, wide-scale, or single-scale. The cumulative distribution was used in this study to describe the degree distribution of the co-occurrence network of high-frequency words in the bioinformatics literature. The cumulative distribution of the network in Figure 2 is shown in Figure 3. Figure 3 shows that, as the degree k increases, the cumulative degree distribution curve declines quickly at first and more slowly afterwards, indicating that the node degree of the network was scale-free. The scale-free characteristic showed that a small number of nodes had a large number of connections and played a leading role in the operation of the network, whereas most nodes had only a small number of connections.
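The cumulative distribution P(≥k) defined above can be computed directly from the node degrees. The following is a minimal sketch, assuming the network is stored as a dictionary of neighbour sets; the function name and the toy graph are hypothetical.

```python
def cumulative_degree_distribution(adj):
    """P(>=k): fraction of nodes whose degree is at least k, for each observed k."""
    degrees = [len(neigh) for neigh in adj.values()]
    n = len(degrees)
    return {k: sum(1 for d in degrees if d >= k) / n for k in sorted(set(degrees))}


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
dist = cumulative_degree_distribution(adj)
print(dist)
```

Plotting these values on logarithmic axes is a common way to inspect whether the tail is approximately a straight line, as expected for a scale-free network.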

Rich-Club Phenomenon Characteristics
The rich-club phenomenon refers to the close connection among the more connected nodes (hub nodes) in the network, which form a core group in the network; it can be measured using the rich-club coefficient ϕ(k) [30], where E>k denotes the number of connections among nodes whose degrees are larger than k in the network.
The rich-club coefficient of the co-occurrence network of high-frequency words in the bioinformatics literature is shown in Figure 4. The coefficient increased with the node degree k, implying that the hub nodes were more densely connected to one another than the other nodes were, and thus formed a rich club. At the same time, it showed that the nodes with degree greater than 10 formed a fully connected graph. The rich-club phenomenon of the network showed that the words in the club were the core of the network, which controlled the composition of the high-frequency word nodes in the whole network.

Matching Form Characteristics
The matching form describes the relationship between the degree of a node and the degrees of its neighbour nodes in the network [31–33]. Figure 5 shows this relationship for the co-occurrence network of high-frequency words in the bioinformatics literature. The network showed disassortative mixing: nodes with high connectivity tended to connect with nodes with low connectivity. This also suggested that new words tended to connect to words with high connectivity in the process of network generation and evolution.

Evolution of the Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature
To trace the changing trend of high-frequency words in the bioinformatics literature, the 242,891 articles were divided by year: 41,457 articles in 2013, 42,049 in 2014, 43,114 in 2015, 45,257 in 2016, 60,404 in 2017, and 10,610 in 2018. Setting K = 500 and E = 200, the co-occurrence network of high-frequency words in the bioinformatics literature was obtained for each year from 2013 to 2018, as shown in Figure 6.
Figure 6 showed that the changes between two consecutive years were relatively small. For example, little change was observed in the high-frequency words of the bioinformatics literature between 2013 and 2014, and the hub nodes in both years were "metabolism", "study", "genetic", and so forth. However, from 2015 to 2016, the word "analysis" became increasingly important. A likely reason is that the application of big data to solve practical problems has become more common, which was well reflected in bioinformatics research and literature. From 2013 to 2015, three most important nodes were identified, whereas only one most important node was identified during 2016-2018, implying that, recently, more attention was paid to academic research and analysis in the bioinformatics literature instead of focusing on metabolism and genetics.
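One simple way to follow such changes in code is to list the highest-degree words of each yearly network and compare them. The sketch below is illustrative only: the function name `top_hubs` and the two tiny yearly networks are invented for the example and do not reproduce the networks of Figure 6.

```python
def top_hubs(adj, m):
    """Return the m words with the largest degree (ties broken alphabetically)."""
    return sorted(adj, key=lambda w: (-len(adj[w]), w))[:m]


# Hypothetical yearly networks (word -> set of neighbouring words).
networks = {
    2013: {
        "metabolism": {"study", "genetic", "protein"},
        "study": {"metabolism", "genetic"},
        "genetic": {"metabolism", "study"},
        "protein": {"metabolism"},
    },
    2016: {
        "analysis": {"study", "genetic", "protein"},
        "study": {"analysis"},
        "genetic": {"analysis"},
        "protein": {"analysis"},
    },
}
hubs = {year: top_hubs(adj, 1) for year, adj in networks.items()}
print(hubs)
```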

Matching form Characteristics
The matching form described the relationship between the node degree and the neighbour node degree of the network [31][32][33].Figure 5 shows the relationship between the node degree and the neighbour node degree of the co-occurrence network of high-frequency words in the bioinformatics literature.The network was a mixed network.The result showed that the node with high connectivity was easy to connect with the node with low connectivity in the co-occurrence network of high-frequency words in the bioinformatics literature.Meanwhile, this showed that new words tended to connect words with high connectivity in the process of network generation and evolution.

Matching form Characteristics
The matching form describes the relationship between the degree of a node and the degree of its neighbours [31][32][33]. Figure 5 shows this relationship for the co-occurrence network of high-frequency words in the bioinformatics literature. The network was disassortative: nodes with high connectivity tended to connect to nodes with low connectivity. This also showed that new words tended to connect to words with high connectivity in the process of network generation and evolution.

Figure 6 shows that the changes between two consecutive years were relatively small. For example, the high-frequency words in the bioinformatics literature changed little between 2013 and 2014, and the hub nodes in both years were "metabolism", "study", "genetic", and so forth. However, from 2015 to 2016, the word "analysis" became increasingly important. An analysis of the reasons indicated that the application of big data to solve practical problems has become more common, which was well reflected in bioinformatics research and literature. From 2013 to 2015, three most important nodes were identified, whereas only one was identified during 2016-2018, implying that the recent bioinformatics literature paid more attention to academic research and analysis than to metabolism and genetics.
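The neighbour-degree relationship described for Figure 5 can be illustrated with a minimal sketch. It assumes the network is stored as a dictionary mapping each word to the set of words it co-occurs with; the function name and the toy words are hypothetical and are not taken from the study's data. A curve that falls as the degree grows is the usual sign of disassortative matching.

```python
from collections import defaultdict


def average_neighbour_degree_by_degree(adjacency):
    """Return {k: mean neighbour degree averaged over all nodes of degree k}."""
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, neighbours in adjacency.items():
        k = degree[node]
        if k == 0:
            continue  # isolated nodes have no neighbours to average over
        node_mean = sum(degree[n] for n in neighbours) / k
        sums[k] += node_mean
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical toy network: one hub word linked to four peripheral words.
toy = {
    "analysis": {"gene", "protein", "cell", "model"},
    "gene": {"analysis"},
    "protein": {"analysis"},
    "cell": {"analysis"},
    "model": {"analysis"},
}
knn = average_neighbour_degree_by_degree(toy)
for k in sorted(knn):
    print(k, knn[k])
```

In this toy case the low-degree words have high-degree neighbours and the hub has low-degree neighbours, which is the pattern the text reports for the real network.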

Conclusions
Based on complex network theory, the associations among high-frequency words in the bioinformatics literature were abstracted into a network in which the high-frequency subject words were taken as nodes and the co-occurrence relationships among them as edges. The co-occurrence network of high-frequency words in the bioinformatics literature was constructed, and its structural characteristics and evolution were analysed. The main conclusions are summarized as follows:

• The co-occurrence network of high-frequency words in the bioinformatics literature was a small-world network. Any two high-frequency words were linked through at most one intermediate word, and more than half of the high-frequency words in the bioinformatics literature had direct co-occurrence relationships.

• The degree distribution of the co-occurrence network of high-frequency words in the bioinformatics literature was scale-free. A small number of nodes had high connectivity and played a leading role in the network. In contrast, most nodes had low connectivity, indicating that the factors explored by the authors of the bioinformatics literature were relatively concentrated.

• The co-occurrence network of high-frequency words in the bioinformatics literature showed the rich-club phenomenon. The high-frequency words in the club were the core words of the bioinformatics literature and reflected the focus of the authors' attention.

• The co-occurrence network of high-frequency words in the bioinformatics literature was disassortative. High-connectivity nodes tended to connect to nodes with low connectivity.

• The analysis of the evolution of the co-occurrence network of high-frequency words in the bioinformatics literature revealed that the high-frequency words changed little within 2-3 years. However, state-of-the-art technology was introduced gradually over time, and the authors' wording changed accordingly, for example towards big data and data analysis.
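Two of the structural checks summarised above can be sketched in a few lines. This is only an illustration under stated assumptions: the rich-club coefficient is taken in its common form, phi(k) = 2E_k / (N_k(N_k - 1)) over the nodes with degree greater than k, which may differ from the exact definition used in the study, and the longest shortest path is found by breadth-first search on a connected, undirected network. The function names and the toy network are hypothetical.

```python
from collections import deque
from itertools import combinations


def rich_club_coefficient(adjacency, k):
    """Edge density among nodes whose degree is greater than k."""
    rich = [node for node, nbrs in adjacency.items() if len(nbrs) > k]
    if len(rich) < 2:
        return None  # undefined with fewer than two rich nodes
    edges = sum(1 for u, v in combinations(rich, 2) if v in adjacency[u])
    return 2 * edges / (len(rich) * (len(rich) - 1))


def longest_shortest_path(adjacency):
    """Largest BFS distance between any two nodes (assumes a connected graph)."""
    longest = 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


# Hypothetical toy network: three mutually linked hubs, each with one leaf.
toy = {
    "A": {"B", "C", "a1"},
    "B": {"A", "C", "b1"},
    "C": {"A", "B", "c1"},
    "a1": {"A"},
    "b1": {"B"},
    "c1": {"C"},
}
phi = rich_club_coefficient(toy, 1)
longest = longest_shortest_path(toy)
print(phi, longest)
```

A coefficient close to 1 for large k means the best-connected words are densely linked to one another. A longest shortest path of 2 would correspond to the "at most one intermediate word" property stated in the first conclusion.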

Figure 1. Process of constructing the co-occurrence network of high-frequency words.

Figure 3. Cumulative distribution of the co-occurrence network of high-frequency words.

Figure 5. Average degree of the neighbours.

3.6. Evolution of the Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature
The 242,891 articles were divided by year (41,457 articles in 2013, 42,049 in 2014, 43,114 in 2015, 45,257 in 2016, 60,404 in 2017, and 10,610 in 2018), and the use of high-frequency words in each year was analysed to trace the changing trend of high-frequency words in the bioinformatics literature. Setting K = 500 and E = 200, the co-occurrence networks of high-frequency words in the bioinformatics literature from 2013 to 2018 were obtained, as shown in Figure 6.
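The per-year construction described above can be sketched as follows. This is a hedged illustration, not the study's implementation: it assumes that K is the number of most frequent words retained as nodes and that E is the minimum number of articles in which two words must co-occur for an edge to be kept, and neither meaning is confirmed in this section. The function name, the parameter names `top_k` and `min_cooccurrence`, and the tiny corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(documents, top_k, min_cooccurrence):
    """documents: list of token lists. Returns {(w1, w2): count} with w1 < w2."""
    # Count each word at most once per article.
    doc_sets = [set(tokens) for tokens in documents]
    frequency = Counter(word for words in doc_sets for word in words)
    # Assumption: K selects the top_k most frequent words as nodes.
    vocabulary = {word for word, _ in frequency.most_common(top_k)}
    pair_counts = Counter()
    for words in doc_sets:
        kept = sorted(words & vocabulary)
        pair_counts.update(combinations(kept, 2))
    # Assumption: E is a minimum co-occurrence count for keeping an edge.
    return {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}


# Hypothetical miniature corpus for a single year.
corpus_by_year = {
    2013: [
        ["gene", "metabolism", "study"],
        ["gene", "metabolism"],
        ["gene", "study"],
        ["protein"],
    ],
}
networks_by_year = {
    year: build_cooccurrence_network(docs, top_k=3, min_cooccurrence=2)
    for year, docs in corpus_by_year.items()
}
print(networks_by_year[2013])
```

Building one network per year with the same two settings keeps the yearly networks comparable, so changes in the hub words can be attributed to the literature rather than to the thresholds.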

Author Contributions: Conceptualization: T.L. and J.B.; methodology: T.L.; validation: J.B., X.Y. and Q.L.; formal analysis: T.L.; investigation: Q.L.; resources: J.B.; data curation: T.L.; writing-original draft preparation: T.L.; writing-review and editing: T.L.; visualization: X.Y.; supervision: Y.C.; project administration: T.L.; funding acquisition: T.L.

Funding: This research was funded by the National Natural Science Foundation of China, grant number 71271034, the National Social Science Foundation of China, grant number 15CGL031, the Fundamental Research Funds for the Central Universities, grant numbers 3132016306 and 3132018160, the Program for Dalian High Level Talent Innovation Support, grant number 2015R063, the National Natural Science Foundation of Liaoning Province, grant number 20180550307, and the National Scholarship Fund of China for Studying Abroad.