A Keyword-Based Literature Review Data Generating Algorithm—Analyzing a Field from Scientific Publications

Abstract: A scientific review is a type of article that summarizes the current state of a specific field, which is crucial for promoting the advancement of the science community. Authors need to read hundreds of research articles to prepare the data and insights for a comprehensive review, which is time-consuming and labor-intensive. In this work, we present an algorithm that automatically extracts keywords from the meta-information of each article and generates the basic data for review articles. Two different fields, communication engineering and lab-on-a-chip technology, were analyzed as examples. We first built an article library by downloading all the articles from the target journal using a Python-based crawler. Second, the rapid automatic keyword extraction algorithm was applied to the title and abstract of each article. Finally, we classified all extracted keywords into classes by calculating the Levenshtein distance between each pair of them. The results demonstrate the algorithm's capability not only to trace how communication engineering and lab-on-a-chip technology evolved over the past decades but also to summarize the analytical outcomes of data mining on the extracted keywords. Our algorithm is more than a useful tool for researchers preparing a review article; it can also be applied to quantitatively analyze the past and present of a specific research field and help authors predict its future trends.


Introduction
With the development of advanced science and emerging technologies, more and more interdisciplinary fields have become the frontiers of scientific research, which are crucial to the welfare of all human beings. For instance, virtual reality (VR) is currently widely used in surgical training programs for medical students by breaking the boundary between the real world and computer simulations [1][2][3][4]. Artificial intelligence (AI), the Internet of Things (IoT), and big data are helping radiologists develop deep neural networks for classification, detection, and segmentation tasks across different diseases that threaten the health of millions of patients [5,6]. Researchers and engineers from tissue engineering [7], genetic engineering [8], bioinformatics [9], biological systems engineering [10], biotechnology [11], and other fields are developing usable, tangible, or economically viable bioengineering products, such as the bionic eye [12] and supramolecular biomaterials [13], among others, for the benefit of all people.
However, the difficulty of these research projects increases significantly because researchers and engineers from multiple backgrounds need to work closely toward the same pivotal goal. For example, a computer engineer developing a deep-learning-based cancer classifier should at least have a basic understanding of how radiologists detect cancer using computed tomography (CT) [14,15]. To develop a feasible bionic eye, electronic engineers must reach out to tissue engineers for help in making the bionic eye biocompatible and minimizing rejection reactions [12]. These interdisciplinary research projects create an additional barrier for researchers who are not experts in a field but are willing to take advantage of the scientific achievements of others in their own research.
One way to overcome this barrier is to read the latest review article of a specific field to gain a preliminary comprehension of the corresponding research progress. A review article (also called a survey article) summarizes the current state of understanding on a topic, discusses the recent major advances and discoveries in the corresponding field, and gives readers ideas of where research might go next [16]. However, there are three issues with the current production of review articles. First, review articles in a given field are not published every year. A recently published review article can provide a comprehensive understanding of the latest research achievements; in other cases, however, the available review articles are outdated to a certain extent. Second, preparing the data for a review article is time-consuming and labor-intensive. As shown in Figure 1, authors (usually experts in the field) collect the results and data from hundreds of peer-reviewed articles and summarize the key points of each [17]. Thus, even preparing a single data table for a review article can take weeks or months. After that, authors need to find the implications behind these results and data in order to make their own predictions or comments on the field. Compared with drafting the manuscript, conducting a thorough literature survey is a far more painful part of preparing a journal submission. Last but not least, the number of published research articles is increasing tremendously. In 2018, one of the largest research article publishers, Elsevier, published more than 470,000 articles in 2500 journals [18]. Thus, summarizing the results from hundreds of peer-reviewed articles can only explore a limited fraction of a specific field. Fortunately, big data analysis has been widely applied in scientific research to find the cause and effect behind these large amounts of data [19][20][21][22].
Therefore, we are motivated by the question: is it possible for a non-expert researcher to summarize what has happened in a field in the past few years and what the major accomplishments of that field are? More specifically, is it possible to generate the data for a review article automatically? In this work, we present an algorithm that generates the data to be further used in a review article by leveraging automatic keyword extraction and similarity calculation. As examples, we summarized the major accomplishments of two different fields, communication engineering and lab on a chip, from two scientific journals, IEEE Transactions on Communications and Lab on a Chip, respectively.

Existing Guidelines for Preparing a Literature Review Article
Researchers have developed mature guidelines for writing different kinds of review articles in various fields. Machi et al. provided a classic six-step guide on how to write a review article [23]. Wee et al. discussed why and how to write a literature review paper, focusing on the importance of adding value rather than only providing an overview [17]. Torraco focused on how to organize and write an integrative literature review and discussed the contributions of review articles to the knowledge base of human resource development [24]. Randolph not only summarized the conventional steps for writing a review article but also provided a framework for self-evaluation [25]. Boote et al. suggested criteria to evaluate the quality of literature reviews in PhD and Master's dissertations [26]. Denney et al. described the structure, process, and art of writing a review article for both undergraduate and graduate students in the criminal justice field [27]. Levy et al. introduced a framework for conducting and writing an effective literature review in the information systems field [28]. Aveyard structured guidelines for writing a review article around eight questions to ask ourselves in the health and social care field [29]. Piper focused on how to write a systematic literature review in the medical field [30].

Materials and Methods
The major steps of our proposed method are summarized as follows: (1) establish an article library for a target field; (2) extract keywords from the title and abstract of each article; (3) calculate the similarity between different keywords and classify them into proper groups; (4) analyze the overall development of the target field from the post-processed data.
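Step (4) amounts to tallying how often each keyword group appears per publication year. A minimal sketch of that tally is given below; the data layout and all names are illustrative assumptions, not the paper's actual code.

```python
from collections import Counter, defaultdict

def occurrence_by_year(articles, groups):
    """Tally keyword-group occurrences per publication year.

    articles: list of (year, keywords) pairs, one per crawled article.
    groups:   dict mapping a group label to its set of member keywords.
    Returns {group_label: Counter({year: occurrences})}.
    """
    trend = defaultdict(Counter)
    for year, keywords in articles:
        for label, members in groups.items():
            hits = sum(1 for kw in keywords if kw in members)
            if hits:
                trend[label][year] += hits
    return trend

# Hypothetical usage with made-up data:
articles = [
    (2005, ["channel coding", "decoding"]),
    (2010, ["cooperative relay"]),
    (2010, ["channel coding"]),
]
groups = {
    "coding": {"channel coding", "decoding"},
    "cooperative": {"cooperative relay"},
}
trend = occurrence_by_year(articles, groups)
```

Plotting each group's yearly counts (or their ratios) against time produces trend curves of the kind shown for the three stages in the figures.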
We first used a Python-based web crawler to download the published articles from target peer-reviewed journals (please refer to the supplementary information for details). In this work, we chose two different peer-reviewed journals as examples to illustrate our algorithm. The first is IEEE Transactions on Communications, one of the top journals in the field of communication engineering [31]. The second is Lab on a Chip, one of the top journals in the field of devices and applications at the microscale and nanoscale [32].
During the crawling process, we stored the title, abstract, and publication year of each article in a MySQL database (please refer to the supplementary information for details). After that, the rapid automatic keyword extraction (RAKE) [33] algorithm was applied to extract keywords from both the title and the abstract. RAKE is a well-known keyword extraction method that uses a list of stop-words and phrase delimiters to detect the most relevant words or phrases in a piece of text. The pseudocode describing this process is summarized in Algorithm 1. We set the parameters of RAKE to extract two keywords from the title and three keywords from the abstract. We found that extracting two keywords from the title is sufficient to represent the main novelty and focus of a study, while limiting the abstract to three keywords prevents the inclusion of unrelated keywords, since an abstract usually provides more details of a study than the title. After extraction, each keyword was stored in the MySQL database as well.
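The core idea of RAKE, splitting text into candidate phrases at stop-words and punctuation, then scoring each phrase by the degree-to-frequency ratio of its words, can be sketched in a few lines. This is a simplified, self-contained illustration, not the RAKE library the paper uses; the stop-word list here is a tiny stand-in for a real one.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real RAKE setup uses a full one.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "on", "in", "to",
             "is", "we", "with", "from", "that", "this", "by", "each"}

def rake_keywords(text, n):
    """Minimal RAKE-style extraction: candidate phrases are maximal runs
    of non-stop-words; each phrase is scored by sum of degree(w)/freq(w)."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word scores: degree counts co-occurring words (including the word itself).
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    scores = {tuple(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(p) for p in ranked[:n]]

# Extracting two keywords from a (hypothetical) title, as in the paper's setup:
kws = rake_keywords(
    "automatic keyword extraction from the title and abstract of each article", 2)
```

Longer multi-word phrases score highly under this scheme, which is why RAKE tends to surface domain terminology rather than isolated common words.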

Algorithm 1 Extract keywords from target text.
Require: Target text, T; the number of extracted keywords, n.
Ensure: The keyword list, W = [w_0, w_1, ..., w_n].
1: r ← Rake()  ▷ Initialize a RAKE object.
2: W ← r.get_ranked_phrases(T)  ▷ Extract keywords with customized weights.
3: return W

The next procedure was to determine the similarity between different keywords by calculating the Levenshtein distance (LD) [34] and to classify them into various classes with a certain meaning. The pseudocode describing this process is summarized in Algorithm 2. At the beginning of this part of the algorithm, we chose the first keyword as w_i, fetched another keyword from the word set as w_k, and calculated the LD-based similarity ratio between the two words (line 11). If the similarity was larger than the threshold T_s, we classified the two words into the same group (an array); if not, we moved to the next keyword, assigned it as w_k, and calculated the similarity again. The threshold T_s is a critical parameter: if the value is too large, the algorithm might classify two highly related keywords into different word groups, and if it is too small, the algorithm might be over-inclusive and put unrelated keywords into the same group. We tried several different values of T_s and found that T_s = 0.6 made our algorithm work properly. After traversing all the other keywords in the word set, we replaced w_i with another unclassified keyword and repeated the steps above until every keyword had been classified into a proper group. At this point, we had a series of groups, each containing related keywords or phrases. After that, the occurrence of the keywords was counted by searching through all the post-classified keywords. Finally, we could use the resulting data to draw the developing tendency of a certain discipline in a specific period and find how the field evolved and advanced.
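The grouping step can be sketched in Python as follows. This is a minimal, self-contained illustration using a normalized Levenshtein similarity (1 − LD/max-length) with the paper's threshold of 0.6; the paper's actual implementation and similarity function may differ in detail.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalized LD similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def group_keywords(keywords, ts=0.6):
    """Greedy grouping: each unused keyword seeds a group and absorbs
    every later unused keyword whose similarity exceeds ts."""
    used, groups = set(), []
    for i, wi in enumerate(keywords):
        if wi in used:
            continue
        group = [wi]
        used.add(wi)
        for wk in keywords[i + 1:]:
            if wk not in used and similarity(wi, wk) > ts:
                group.append(wk)
                used.add(wk)
        groups.append(group)
    return groups

groups = group_keywords(["channel coding", "channel codes", "beamforming"])
# "channel coding" and "channel codes" end up together; "beamforming" stays alone.
```

Note the trade-off the paper describes: raising ts splits related variants into separate groups, while lowering it merges unrelated terms.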
Algorithm 2 Find related keyword groups by calculating the Levenshtein distance.

Require: The keyword list, W = [w_0, w_1, ..., w_n]; threshold used to define similarity, T_s.
Ensure: Different related keyword groups, S_0, S_1, ..., S_j.
1: i ← 0  ▷ Initialize the index for the main loop.
2: j ← 0  ▷ Initialize the index for the related keyword groups.
3: W_used ← list()  ▷ Define a list to store used keywords.
4: while i ≤ n do
5:   if w_i ∉ W_used then  ▷ Continue only if w_i has not been used.
6:     S_j ← list()  ▷ Initialize a new list for a related keyword group.
7:     W_used.append(w_i)  ▷ Mark w_i as used.
8:     S_j.append(w_i)  ▷ Append the compared word into S_j.
9:     for k = i + 1 → n do
10:      if w_k ∉ W_used then  ▷ Continue only if w_k has not been used.
11:        LD ← ratio(w_i, w_k)  ▷ Calculate the similarity between w_i and w_k.
12:        if LD > T_s then  ▷ Continue if the similarity is larger than T_s.
13:          S_j.append(w_k)  ▷ Add the related keyword into the same group.
14:          W_used.append(w_k)  ▷ Add the keyword into the used-keyword list.
15:        end if
16:      end if
17:    end for
18:    j ← j + 1
19:  end if
20:  i ← i + 1
21: end while
22: return S_0, S_1, ..., S_j

Figure 2B depicted the trend of how the three stages evolved. Table 1 listed the comparison between algorithm-generated keywords and author-selected keywords; the author-selected keywords were provided by the authors or editors when the articles were published in IEEE Transactions on Communications. This allowed us to verify the performance and robustness of our algorithm in terms of keyword extraction. In general, over 71% of the algorithm-generated keywords were included in the author-selected keywords, which indicates that our algorithm could imitate or replace the keyword-selection process to a certain extent.

Figure 3A showed the three stages of development of LoC technologies: Early days of miniaturization, Growth of LoC technologies, and Rise of LoC applications. Figure 3B depicted the trend of how the three stages evolved. Further details of the top extracted keywords and their occurrences in Lab on a Chip were shown in Table 2.
We have listed those keywords and their specific sub-keywords to demonstrate how they can be used to provide information for a scientific review. For example, the keyword detection may refer to seven different sub-keywords, electrical detection, cell detection, biological detection, sensitive detection, amperometric detection, fluorescence spectrum detection, and fluorimetric lead detection, which were directly extracted by our algorithm. Other top keywords (with their occurrence counts) and sub-keywords include:

Particle (33): wavelength particle trapping, inertial particle focusing, particle manipulation, particle inertial migration, efficient particle separation, particle tracking
Throughput (30): high throughput screening devices, high throughput drug testing, high throughput separation, throughput tracking, throughput rheology, throughput diffusion
Separation (22): multidimensional separations, multiway separation, electrokinetic separation, electrophoretic separation, chemical separations, 2D separations
Optofluidic (21): optofluidic lasers, optofluidic sers chip, droplet optofluidic imaging, optofluidic imaging system, optofluidic waveguide, optofluidic chip, optofluidic ultrahigh
Nano (17): nanoparticle assay, nanoparticles, nanoparticle separation
Surface (15): surface acoustic waves, standing surface acoustic wave microfluidics, surface acoustic wave driven fluid motion, surface acoustic wave enabled pipette
Digital (13): single digital microfluidic reactor chip, digital microfluidic chip, digital diffraction detection, digital biology, thermal digital microfluidic device, magnetic digital microfluidics
Fabrication (10): nanofabrication, design fabrication, fabrication platform, situ fabrication, rapid fabrication
Integrated (10): chip integrated system, integrated microsystem, integrated disposable dye clad leaky waveguide sensor, integrated optical leaky waveguide sensor, integrated microfluidic processor, integrated microfluidic device, integrated microfluidic uv absorbance detector
Acoustic (10): acoustic radiation force, acoustic streaming, acoustic trapping, driven acoustic streaming
Electrophoresis (9): capillary electrophoresis, microchip electrophoresis, microscale capillary electrophoresis, two-dimensional capillary gel electrophoresis, dielectrophoresis

Discussion
The first field we investigated was communication engineering, and we chose IEEE Transactions on Communications to test our algorithm. All the articles published in IEEE Transactions on Communications listed a few keywords provided by the authors or editors; thus, we could verify the keyword extraction performance by comparing algorithm-extracted keywords with author-provided keywords. In total, 11,199 articles from 1984 to 2019 were crawled and processed, and we specifically analyzed the keywords from the period 2001 to 2019.
According to the occurrence of keywords, we can see a three-stage distribution of keywords in the field of communications in Figure 2. The three stages were Information processing (blue arrow), Higher-level communication technologies (yellow arrow), and Advanced integration & IoT (gray arrow). We summarized all 13 keywords listed in Figure 2A, together with the corresponding sub-keywords, in Table 1; the algorithm-generated keywords were highly consistent with the author-provided keywords. The x-axis of Figure 2B shows the timeline of communication technology development, while the y-axis shows the rank of the three stages based on their corresponding keyword occurrence ratios. The figure clearly demonstrates that Information processing was the most popular research topic in the first stage of communication technology; with the further development of basic knowledge around 2006, higher-level communication technologies became more attractive from 2007 to 2013, and research on channels, codes, and related topics became the key focus. Throughout the analyzed period, Advanced integration & IoT grew steadily and gradually showed its importance. With the growing need for higher-level technologies and the application of the Internet of Things, Advanced integration & IoT gathered increasing research focus and became the most popular research topic in the field of communication engineering starting in 2014.
In the first stage (2001 to 2006), the most popular topics were channel coding, decoding, signal modulation, channel estimation, and signal detection, which are fundamental research topics in communications. In the second stage (2007 to 2013), keywords such as codes, cooperative, channel, and analysis appeared most frequently. Looking at the meaning and details behind these keywords, we find that different types of codes were investigated and the corresponding standards were established in this period; for instance, low-density parity-check codes, space-time codes, and polar codes were popular research topics. The keyword cooperative covered multi-user cooperative communications, cooperative relay systems, cooperative beamforming technology, and cooperative beamforming wireless systems. There was also considerable research on signal channels: keywords related to channel included fading channel, non-linear channel, multiple-input multiple-output (MIMO) channel, and mixed channel. In addition, the keyword analysis was important in all three stages and especially in the second: capacity, error, delay, and timing-jitter analysis made up the content of this keyword group. Compared with the previous period, researchers moved on to higher-level research and tried to analyze the performance of previously developed communication techniques. The third stage ran from 2012 to the present. In this stage, the advanced integration of different kinds of networks and the Internet of Things became the focus of the community. The number of research topics related to MIMO increased significantly, including millimeter-wave networks, wireless cellular networks, relay networks, and sensor networks, among others. What we can see across the three stages is the development of communications: at first, researchers focused on theoretical studies; after that, they moved to solving the signal transmission part of communications.
In the third period, researchers considered the system and different kinds of networks as a whole, and IoT became the most attractive application and research topic. The trend identified by our algorithm matched the development pattern of communications, and the example we provide can be seen as reflecting the development of a new generation of communication technology.
To evaluate the performance of our method, it was necessary to compare the similarities and differences between algorithm-generated keywords and author-selected keywords. From Table 1, we can easily see that some algorithm-generated keywords are exactly the keywords the authors selected, while others, not included in the author-selected keywords, show more explicit details. These results demonstrate the ability of the method to sort out keywords that could replace author-selected keywords.
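A comparison of this kind can be reduced to an inclusion rate: the fraction of algorithm-generated keywords that also appear among the author-selected ones. A minimal sketch follows; the exact matching rule used in the paper is not specified, so a case-insensitive exact match is assumed here, and the example data is made up.

```python
def inclusion_rate(generated, selected):
    """Fraction of algorithm-generated keywords found among author-selected keywords."""
    selected_lc = {s.lower() for s in selected}
    hits = sum(1 for g in generated if g.lower() in selected_lc)
    return hits / len(generated) if generated else 0.0

rate = inclusion_rate(
    ["MIMO channel", "channel estimation", "timing jitter"],
    ["mimo channel", "Channel Estimation", "fading channels"],
)
# 2 of 3 generated keywords match, so rate is 2/3.
```

A fuzzier matching rule, e.g. the same Levenshtein-based similarity used for grouping, would credit near-matches as well and typically raise the measured rate.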
The second field we investigated was microfluidics/lab on a chip (LoC), and we chose Lab on a Chip as our source journal. Lab on a Chip does not ask authors to list any keywords, so it was a good example for testing our algorithm in an emerging field.
Based on the extracted data from Lab on a Chip, we likewise divided the development of the LoC field into three stages in Figure 3. The three stages were Early days of miniaturization (gray arrow), Growth of LoC technologies (red arrow), and Rise of LoC applications (green arrow). We summarized all 14 keywords listed in Figure 3A, together with the corresponding sub-keywords, in Table 2. The x-axis of Figure 3B shows the timeline of LoC technology development, while the y-axis shows the rank of the three stages based on their corresponding keyword occurrence ratios. In the early days, researchers concentrated on miniaturization, fabrication, and integration issues. After solving these, the research focus moved to several LoC technologies, such as (cell/particle) detection, (cell/particle) separation, acoustic (microfluidics), and optofluidics; in this stage, researchers were focused on developing or inventing LoC-based technologies, so we used Growth of LoC technologies to characterize it. Since 2014, researchers have focused more on finding specific applications for LoC technologies; among the extracted keywords in this stage, we can see biological application keywords (cell, particle), chemical analysis keywords (surface, nano), and so on.

Conclusions
In conclusion, we propose an algorithm that automatically extracts keywords from the meta-information of peer-reviewed journal articles and uses these keywords as the basic data for scientific review articles. Our algorithm consists of two parts: the RAKE algorithm for keyword extraction, and Levenshtein distance for classifying related keywords into the same group. We first applied our algorithm to the field of communication engineering. From the keyword data, we divided the development of communication engineering from 2001 to the present into three stages and summarized the extracted keywords with their occurrences in a table that could be further used in a review article. We also compared the algorithm-generated keywords with author-selected keywords, which showed the robustness of our method. Furthermore, we tested our algorithm on an emerging field, lab on a chip. Based on the extracted keywords, we found a clear path of how LoC technology developed over the past decades: researchers first solved fabrication and integration issues, then moved on to developing new LoC-based technologies, and finally tried to apply LoC technology to solve real-world problems.

Limitations
The proposed algorithm has certain constraints and limitations. The extracted keywords are usually terminology, so human experts are still needed to explain the deeper significance behind the raw keywords. For example, everyone knows that 5G technology is being developed for next-generation mobile Internet connections, yet few people have a basic understanding of modulation, coding, or decoding (Figure 2A). Besides, keywords extracted from a single journal could be biased when predicting the overall development of a specific field: different journals in the same field can have distinct aims and scopes and tend to publish the articles most suitable to them. This issue could be addressed by applying our algorithm to multiple journals in the same field and analyzing the extracted keywords together.

Future Directions
With the tremendous increase in published research articles, our algorithm could be one solution for reducing the labor-intensive literature survey process. We are confident that our algorithm can provide not only the data but also the points of view that researchers would otherwise spend a lot of time discovering while coming to understand a new field or writing review articles. Moreover, nothing restricts our proposed method to the scientific field. It will be interesting to extend our method to more general domains and find the implications behind them. For example, one could apply our method to print media (e.g., newspapers) to find what kinds of news are more attractive to people of different generations and connect the findings with economic development.
Author Contributions: J.W. and G.S. contributed equally to this work. J.W. contributed to the idea, data mining, and manuscript; G.S. contributed to the idea and manuscript; C.W. contributed to the data collection and data mining; X.H. contributed to the paper review and manuscript; L.S. contributed to the paper review. All authors have read and agreed to the published version of the manuscript.