Electronics
  • Article
  • Open Access

5 August 2020

Deep Learning Based Biomedical Literature Classification Using Criteria of Scientific Rigor

1 Department of Software, Sejong University, Seoul 05006, Korea
2 Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do 446-701, Korea
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Internet of Things (IoT)-Based Wireless Health: Enabling Technologies and Applications

Abstract

A major obstacle to supporting evidence-based clinical decision-making is the accurate and efficient recognition of appropriate and scientifically rigorous studies in the biomedical literature. We trained a multi-layer perceptron (MLP) model on a dataset with two textual features, title and abstract. The dataset, consisting of 7958 PubMed citations classified into two classes, scientific rigor and non-rigor, is used to train the proposed model. We compare our model with other promising machine learning models such as Support Vector Machine (SVM), Decision Tree, Random Forest, and Gradient Boosted Tree (GBT) approaches. Based on its higher cumulative score, the deep learning model was chosen and tested on test datasets obtained by running a set of domain-specific queries. On the training dataset, the proposed deep learning model obtained a significantly higher accuracy of 97.3% and AUC of 0.993 than the competitors, but its recall of 95.1% was slightly lower than that of GBT. The trained model sustained its performance on the testing datasets. Unlike previous approaches, the proposed model does not require a human expert to create fresh annotated data; instead, we used studies cited in Cochrane reviews as a surrogate for quality studies on a clinical topic. We find that deep learning methods are beneficial for biomedical literature classification. Not only do such methods minimize the workload of feature engineering, but they also show better performance on large and noisy data.

1. Introduction

In the practice of modern medicine, providing appropriate evidence for clinical decisions plays a vital role in increasing the reliability of the system while practicing evidence-based medicine [1]. However, classifying a vast set of medical documents is daunting and time consuming. To solve these problems, research on document classification through machine learning is being actively carried out. Machine learning algorithms have proven especially useful in data mining applications, particularly where it is difficult for a human to understand and model the domain [2]. Artificial neural networks appear to further advance the area of biomedical literature mining and classification [3,4]. One of the advantages of deep learning over shallow machine learning is automatic feature engineering. However, the main issue for deep learning models is the acquisition of the pre-annotated data required for training and for testing their performance.
This study aims to build a deep learning binary classifier for identifying quality studies in the biomedical literature and to show its superiority over the promising machine learning approaches historically used for the same problem. To achieve this objective, we design a deep learning model based on a multi-layer feed-forward artificial neural network, also called a multi-layer perceptron (MLP), coupled with automatic engineering of features created from the title and abstract text of biomedical documents related to kidney disease. To obtain studies that are scientifically rigorous, we run a set of queries related to kidney disease on the Cochrane database. The same queries are executed on PubMed to obtain labels for studies that are not scientifically rigorous. The proposed model secures a better score than shallow machine learning algorithms.
The main contributions of this work are in the following areas: (i) the identification and preparation of data using the Cochrane and PubMed databases, (ii) the design of an artificial neural network (ANN)-based deep learning model, and (iii) empirical as well as qualitative insights into deep learning in comparison with shallow machine learning algorithms.

3. Methods

We propose an MLP model for evaluating studies for scientific rigor. As depicted in Figure 1, our proposed method consists of two steps. Step 1 is dedicated to acquiring identifiers of studies through queries to the sources and applying filters to remove duplicates in the data. Step 2 processes the deduplicated data to obtain the text of titles and abstracts, which is then preprocessed to create feature vectors on which the proposed MLP model classifies the studies into scientific rigor and non-rigor.
Figure 1. The high-level architecture of the proposed method for finding high-impact studies in biomedical literature.

3.1. Step 1—Preparation of Datasets

To collect high-quality studies, we use the Cochrane Library by executing a general-purpose query on kidney disease and obtain the identifiers, called PMIDs (PubMed Identifiers), for all the retrieved articles. The same query is executed on PubMed, and the PMIDs for all the retrieved articles are obtained. The records obtained from the Cochrane Library are labeled with the "scientific rigor" class, and the records obtained from PubMed are labeled with "scientific non-rigor." In the deduplication step, all duplicated items are removed from the collected PMIDs in the PubMed dataset in order to avoid assigning two labels to the same study. Identifiers in both datasets are submitted to the NCBI Entrez Programming Utilities (eUtils) [18]. eUtils is a service of NCBI that provides access to a total of 50 databases via a web interface and a public FTP (file transfer protocol) site [19]. Using the search and fetch functions of the API, we retrieve two data features (title and abstract) for each article. At this stage, the title and abstract are in their original format, consisting of the text as written in the publications. The combined dataset holds 7958 records, out of which 1083 are classified as scientific rigor and the rest as scientific non-rigor.
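For illustration, a minimal sketch of this retrieval step is given below, assuming the Biopython wrapper around eUtils; the e-mail address and PMIDs are placeholders rather than values used in the study.

```python
# Hypothetical sketch: fetching titles and abstracts for a list of PMIDs via
# NCBI eUtils through the Biopython Entrez wrapper. PMIDs and the e-mail
# address are placeholders, not values from the study.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # required by NCBI usage policy

def fetch_title_abstract(pmids):
    """Return a list of (pmid, title, abstract) tuples for the given PMIDs."""
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    results = []
    for article in records["PubmedArticle"]:
        citation = article["MedlineCitation"]
        pmid = str(citation["PMID"])
        art = citation["Article"]
        title = str(art.get("ArticleTitle", ""))
        abstract_parts = art.get("Abstract", {}).get("AbstractText", [])
        abstract = " ".join(str(p) for p in abstract_parts)
        results.append((pmid, title, abstract))
    return results

# Example call with placeholder PMIDs
print(fetch_title_abstract(["12345678", "23456789"]))
```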

3.2. Step 2—Development of Deep Learning Model

To achieve the optimal design for the MLP, we employ the Auto Model extension of RapidMiner, which provides a graphical visual environment for conveniently designing faster and better automated classification and prediction models [20]. To learn the classification model, the dataset is passed through several internal steps that include preprocessing, feature extraction, and feature engineering, as shown in the detailed diagram (Figure 2) of step 2 of our proposed methodology.
Figure 2. Abstract description of the proposed deep learning process model.

3.2.1. Preprocessing

Preprocessing consists of multiple sub-steps, such as role setting and transformation of the initial types. In the role-setting step, the roles of all attributes are changed to regular except for the class attribute, which is set to the 'label' role. In the transformation of initial types step, all text columns are transformed into polynomial columns. After initial preprocessing, the data is checked for missing values, which are filled where applicable; for instance, missing numeric values are replaced with the column average. Finally, records are filtered based on the no_missing_labels parameter, which keeps only those records that have no missing value in the special attribute with the label role. The data is then split into a training set and a validation set (hold-out data) with a ratio of 6:4, respectively.
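The following sketch illustrates an equivalent preprocessing flow outside RapidMiner, assuming a pandas data frame with title, abstract, and label columns; it is not the Auto Model pipeline itself, and the file and column names are illustrative.

```python
# Illustrative sketch of the preprocessing step: drop records with a missing
# label, fill missing values, and perform the 60/40 train/holdout split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("citations.csv")          # assumed columns: title, abstract, label

# Keep only records that have a value in the label (special) attribute
df = df.dropna(subset=["label"])

# Replace missing numeric values with the column average, text with empty strings
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna("")

# 60% training, 40% hold-out validation, stratified on the class label
train_df, holdout_df = train_test_split(
    df, test_size=0.4, stratify=df["label"], random_state=42
)
```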

3.2.2. Feature Extraction

Technically, this step performs tokenization, lowercasing, and calculation of TF-IDF (term frequency-inverse document frequency) values for each token across all records. The input text in nominal form is transformed into a vectorized format using TF-IDF. TF-IDF is a statistical value that reflects how important a word is to a document in a corpus. This step later helps in feature selection based on the importance determined by the TF-IDF value.
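A minimal sketch of this step is shown below, assuming scikit-learn in place of the RapidMiner text-processing operators; the variable names continue the preprocessing sketch above.

```python
# Tokenization, lowercasing, and TF-IDF weighting over the combined title and
# abstract text. The vocabulary is fitted on the training split only.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r"[A-Za-z]+")

train_text = (train_df["title"] + " " + train_df["abstract"]).tolist()
holdout_text = (holdout_df["title"] + " " + holdout_df["abstract"]).tolist()

X_train = vectorizer.fit_transform(train_text)
X_holdout = vectorizer.transform(holdout_text)
y_train = (train_df["label"] == "scientific rigor").astype(int)
y_holdout = (holdout_df["label"] == "scientific rigor").astype(int)
```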

3.2.3. Automatic Feature Engineering

Automatic feature engineering is a robust utility of RapidMiner that internally uses a deep learning MLP model to select a subset of features from the full feature set. The MLP comes with default parameters that are optimized for finding the best feature sets. Taking all records as a training dataset, the MLP is trained to obtain the final features using the default parameter settings provided by the Auto Model extension of RapidMiner.
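The exact internal procedure of the Auto Model utility is not exposed; the sketch below is a rough, hypothetical stand-in in which candidate subsets of the top-ranked TF-IDF terms are scored with a small MLP and the best subset is kept. The subset sizes and MLP settings are illustrative assumptions, not RapidMiner's defaults.

```python
# Hypothetical stand-in for MLP-driven feature subset selection.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def best_feature_subset(X, y, candidate_sizes=(500, 1000, 2000)):
    # Rank terms by their summed TF-IDF weight across the corpus
    term_scores = np.asarray(X.sum(axis=0)).ravel()
    ranking = np.argsort(term_scores)[::-1]

    best_cols, best_score = None, -np.inf
    for k in candidate_sizes:
        cols = ranking[:k]
        mlp = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200, random_state=0)
        score = cross_val_score(mlp, X[:, cols], y, cv=3, scoring="accuracy").mean()
        if score > best_score:
            best_cols, best_score = cols, score
    return best_cols

selected = best_feature_subset(X_train, y_train)
X_train_fe, X_holdout_fe = X_train[:, selected], X_holdout[:, selected]
```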

3.2.4. Classification Model

After obtaining the best feature sets through automatic feature engineering, a classification model is used to identify the scientifically rigorous studies. The classification model is the same deep learning MLP model as used for feature engineering. The parameter settings, described in Table 1, are the optimized default values provided by RapidMiner.
Table 1. Parameter settings of the proposed deep learning (multi-layer perceptron (MLP)) classification model.
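A minimal sketch of the classification step is shown below, assuming a scikit-learn MLP similar in spirit to the model of Table 1; the layer sizes and training settings are illustrative assumptions, not the exact values of Table 1. The variable names continue the sketches above.

```python
# Training an MLP classifier on the selected features and producing
# predictions plus per-class confidence (probability) values.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50),   # assumed architecture, not Table 1 values
    activation="relu",
    solver="adam",
    max_iter=300,
    random_state=42,
)
mlp.fit(X_train_fe, y_train)
holdout_pred = mlp.predict(X_holdout_fe)
holdout_conf = mlp.predict_proba(X_holdout_fe)   # confidence values per class
```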

3.2.5. Performance and Explanation

The trained classification model is applied to the hold-out datasets to measure performance in terms of metrics widely used in the AI domain, including accuracy, recall, precision, F-measure, and AUC. In addition to these metrics, we also explain the model's predictions in the form of a confidence value. A higher confidence value for a prediction indicates greater model trust in the classification of a study and leads users to accept the model's output with a firmer belief.
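The following sketch shows how these metrics can be computed on the hold-out set with scikit-learn; the variable names continue the sketches above, and the confidence values correspond to the predicted class probabilities.

```python
# Hold-out evaluation with the metrics reported in the paper.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

print("Accuracy :", accuracy_score(y_holdout, holdout_pred))
print("Precision:", precision_score(y_holdout, holdout_pred))
print("Recall   :", recall_score(y_holdout, holdout_pred))
print("F1       :", f1_score(y_holdout, holdout_pred))
print("AUC      :", roc_auc_score(y_holdout, holdout_conf[:, 1]))
```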

3.3. Model Selection

To test the hypothesis that a deep learning model can perform better than shallow machine learning algorithms, we choose four well-known algorithms that have been experimented with in multiple studies [7,13,15]: Naïve Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), and Gradient Boosted Trees (GBT). A brief description of the parameter settings for each algorithm is provided in Table 2.
Table 2. Descriptions of parameter settings of the shallow machine learning algorithms used for comparison with deep learning.
We compare the performance of these algorithms with the proposed MLP on the basis of an accumulated score obtained by summing the widely used performance metrics accuracy, recall, F1-measure, and AUC (area under the ROC curve). To evaluate the variations, we compare the performance of each algorithm on the title alone, on the abstract alone, and finally on their combination, as shown in the three compartments of Table 3.
Table 3. Results for learning methods on hold-out datasets.
Each model can be contrasted individually on the accuracy, AUC, recall, and F1-measure scores as well as on the overall score. The proposed MLP applied to the dataset with both title and abstract has the highest overall score. Comparing models on the title feature alone, the highest overall score (value: 255) is slightly lower than the highest overall score (value: 257) of models trained on the abstract feature. When the two features are combined, the overall score increases for almost all algorithms. The highest overall score (value: 283), achieved by the MLP trained on the combination of title and abstract, is considerably better than that of each of the machine learning algorithms. The second-highest performer, GBT, with an overall score of 262, is about 10% below the MLP score. The MLP also performs better on each individual metric: its accuracy (97.3%), AUC (0.993), recall (95.1%), and F1-measure (90.4%) are higher by about 2, 1, 7, and 7%, respectively, than those of the second-best performer. Based on these measurements, we conclude that the proposed deep learning model is the better performer on the given dataset, and we choose the MLP for testing on datasets from a selected clinical domain.
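As an illustration, the accumulated-score comparison could be reproduced along the lines of the sketch below; the shallow learners, their default settings, and the scaling of AUC to the same range as the percentage metrics are assumptions rather than the exact configuration of Tables 2 and 3.

```python
# Sketch of the accumulated-score comparison over shallow learners and the MLP.
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

candidates = {
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "GBT": GradientBoostingClassifier(random_state=0),
    "MLP": mlp,   # already trained in the earlier sketch
}

for name, model in candidates.items():
    if name != "MLP":
        model.fit(X_train_fe, y_train)
    pred = model.predict(X_holdout_fe)
    proba = model.predict_proba(X_holdout_fe)[:, 1]
    # Sum the four metrics on a common 0-100 scale to get an overall score
    overall = (100 * accuracy_score(y_holdout, pred)
               + 100 * recall_score(y_holdout, pred)
               + 100 * f1_score(y_holdout, pred)
               + 100 * roc_auc_score(y_holdout, proba))
    print(f"{name}: overall score = {overall:.1f}")
```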

4. Results and Discussion

We apply the MLP model to evaluate its performance on unseen test data retrieved with a set of real-world queries executed for two domains: kidney disease (the same domain used in training) and cancer (a different domain). As shown in Figure 3, the process steps used for data manipulation are similar to those used at the training stage of the model, i.e., preprocessing, feature extraction, and automatic feature engineering. The already-trained MLP classification model is loaded and applied to the feature sets created from the data retrieved with the queries.
Figure 3. The process steps of application of the trained deep learning model to the test dataset.
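A hypothetical sketch of this application step is shown below, reusing the fetch helper, the fitted vectorizer, the selected feature subset, and the trained MLP from the earlier sketches; the query string is a placeholder, not one of the queries listed in Table 4.

```python
# Applying the trained pipeline to articles retrieved by a new PubMed query.
from Bio import Entrez

handle = Entrez.esearch(db="pubmed", term="chronic kidney disease treatment", retmax=500)
pmids = Entrez.read(handle)["IdList"]
handle.close()

records = fetch_title_abstract(pmids)                  # helper from the Step 1 sketch
texts = [title + " " + abstract for _, title, abstract in records]

X_test = vectorizer.transform(texts)[:, selected]      # same features as in training
test_pred = mlp.predict(X_test)
test_conf = mlp.predict_proba(X_test)                  # confidence values per article
```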

4.1. Scenario 1: Results of Same Domain Test Queries

Hypothesis 1.
The proposed deep learning model yields equivalent or marginally lower accuracy for an unseen test dataset retrieved with real-world queries from the same domain.
We define four queries and collect articles by running each query on the PubMed database. The defined queries and the number of articles retrieved for each query are shown in Table 4.
Table 4. Same domain test queries and number of articles.
The results for each query are shown in Table 5. We can observe that there is minimal variation in the performance of the model across the queries. Additionally, each metric is only marginally lower than the corresponding performance on the hold-out dataset, which supports Hypothesis 1.
Table 5. Results of evaluating the performance of the proposed model on the test dataset collected through queries from the same domain, i.e., kidney disease.

4.2. Scenario 2: Different Domain Test Queries Results

Hypothesis 2.
The proposed deep learning model yields equivalent or marginally lower (by no more than 10%) accuracy for an unseen test dataset retrieved with real-world queries from a different domain.
To test this hypothesis, we collected data in the same way as for the other queries, but for a different domain (cancer). We collected 1022 high-quality literature records and 6569 general literature records. The results of this cross-domain evaluation of the proposed model are shown in Table 6.
Table 6. Results of evaluating the performance of the proposed model on the test dataset collected through queries for a different domain (i.e., cancer).
We can observe that the performance is slightly lower than the average score for the same domain; however, it is nearly equivalent to that of query 4, which has the lowest score among the four same-domain queries. This marginally lower performance (a drop of no more than 10%) relative to the score on the hold-out dataset supports Hypothesis 2.

4.3. Significant Findings

To our knowledge, this is the first study to use Cochrane Systematic Reviews for training deep learning techniques to identify scientifically sound studies in the biomedical literature in the kidney domain. Moreover, our proposed deep learning model performed well compared with state-of-the-art machine learning approaches. Beyond higher accuracy, we noticed that the proposed deep learning model performed well on unseen test data obtained with real-world queries. It showed excellent performance in experiments performed with multiple queries within the same domain. In addition, our model performed considerably well in another domain, cancer, and the results were consistent for testing datasets across different queries. The minimal variation across different queries and different domains qualifies our proposed model as useful for identifying high-quality medical articles in the biomedical literature.

4.4. Comparison with Prior Work

There have been numerous attempts to automate the recognition of high-quality medical literature using computing techniques. A comparison of previous studies and our proposed model is shown in Table 7. Although the comparison is thematic, as the overall objective of the mentioned studies is the same, i.e., the identification of high-impact studies in the biomedical literature, it provides a perspective on the superiority of the deep learning model over shallow machine learning methods. From a methods perspective, the study conducted by Del Fiol et al. [12] is closest to ours, as they also used a deep learning method in their experiments. The difference is that they used deep learning based on a CNN, while the proposed method uses an MLP, a deep learning model based on multi-layer feed-forward neural networks. They used clinical queries data as a surrogate for high-impact studies, while we used Cochrane citations as a surrogate for scientifically rigorous (high-impact) studies. Both studies used titles and abstracts as features for the experiment. In both cases, the data were noisy because of false positives in the dataset. The recall in both cases is similar, with about a 1% difference. However, there is a large difference in F-measure due to the higher precision of our approach. One possible reason is the noise in the data, i.e., the number of false positives. Another possibility is that using Cochrane citations as a surrogate for high-impact studies may be more effective than using clinical queries or clinical hedges.
Table 7. Performance comparison of the proposed model and past research.
Afzal et al. [15] used the Quality Recognition Model (QRM). The QRM is a supervised classification model based on the SVM machine learning algorithm, trained on a dataset of Clinical Hedges annotated by a team of professionals [21]. This was our prior work, in which we compared multiple machine learning algorithms and found SVM to be the best performer. In the present study, we compared our deep learning model with that best performer and other machine learning algorithms and found an increase of 5% in accuracy. The study conducted by Bian et al. [14] used 15,845 PubMed documents and obtained 77.5% recall using a Naïve Bayes classifier for identifying high-impact articles. The authors found that Scopus citation counts and journal impact factors are the two key features compared with other features such as the number of comments on PubMed, publication in high-impact journals, Altmetric scores, and other PubMed metadata.

4.5. Error Analysis

Cochrane Systematic Reviews are usually not up to date because of their dependency on human experts. Therefore, the training dataset lacks the most recent studies available in the primary literature, at least for the class of high-quality medical evidence. As a result, there is an imbalance between high-quality medical evidence literature and primary literature data, which may introduce errors. Moreover, the data may not be fully validated because the proposed model was trained on data labeled without the help of medical professionals.

4.6. Limitation and Future Work

We treated the literature cited in Cochrane Library reviews as high-quality literature data. Firstly, we need more high-quality medical evidence literature data, which are often not available in the Cochrane Library for specific topics. Secondly, up-to-date research papers were not included in the training data, as the Cochrane Library lacks reviews of recent research. Lastly, the proposed model uses only text data such as the title and abstract, and it has not been tested with additional features such as the year of publication and the journal impact factor.
These limitations could be addressed by enlarging the dataset, combining data retrieved through PubMed Clinical Queries with the Cochrane Reviews data. Alternatively, the expertise of medical professionals could be utilized to evaluate a dataset of testing queries: the experts' annotated data could be added to the training dataset at a particular stage and the model retrained.

5. Conclusions

In evidence-based medicine, clinicians need high-quality medical evidence to provide the best guidance. We proposed a deep learning model that automatically classifies the biomedical literature by learning from high-quality studies and general studies. The proposed model utilized data from Cochrane Systematic Reviews and PubMed primary literature. We evaluated the proposed model on different performance metrics, such as accuracy, F1-measure, and area under the curve (AUC), and obtained considerably better results than the competitors. We tested the model on medical queries from the same domain from which the training dataset was accumulated and also from a different domain. Experiments on four different queries in the same domain yielded performance similar to that evaluated on the training data, while the performance was slightly lower for the different domain. This research on the automatic identification of high-quality evidentiary documents will reduce the work hours of clinicians who practice evidence-based medicine. It will also increase users' confidence in decisions made by physicians assisted by decision-making systems. In this sense, our proposed model will be a useful addition to the scientific world as well as to real-world clinical setups, given its vast utility in clinical practice and research.

Author Contributions

M.A. is the principal researcher who conceptualized the idea. B.J.P. contributed towards data preparation and implementation. M.H. reviewed the text and validated the results. S.L. supervised the work and arranged funding for the project. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-0-01629) supervised by the IITP (Institute for Information & communications Technology Promotion). In addition, this work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00655), NRF-2016K1A3A7A03951968, and NRF-2019R1A2C2090504.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Heneghan, C.; Mahtani, K.R.; Goldacre, B.; Godlee, F.; Macdonald, H.; Jarvies, D. Evidence based medicine manifesto for better healthcare. BMJ 2017, 357, j2973. [Google Scholar] [CrossRef] [PubMed]
  2. Güiza, F.; Ramon, J.; Bruynooghe, M. Machine learning techniques to examine large patient databases. Best Pract. Res. Clin. Anaesthesiol. 2009, 23, 127–143. [Google Scholar]
  3. Zhang, Y.; Lin, H.; Yang, Z.; Wang, J.; Sun, Y.; Xu, B.; Zhao, Z. Neural network-based approaches for biomedical relation classification: A review. J. Biomed. Inform. 2019, 99, 103294. [Google Scholar] [CrossRef] [PubMed]
  4. Burns, G.A.; Li, X.; Peng, N. Building deep learning models for evidence classification from the open access biomedical literature. Database 2019, 2019, 1–9. [Google Scholar] [CrossRef] [PubMed]
  5. McCartney, M.; Treadwell, J.; Maskrey, N.; Lehman, R. Making evidence based medicine work for individual patients. BMJ 2016, 353, i2452. [Google Scholar] [CrossRef] [PubMed]
  6. Krauthammer, M.; Nenadic, G. Term identification in the biomedical literature. J. Biomed. Inform. 2004, 37, 512–526. [Google Scholar] [CrossRef] [PubMed]
  7. Kilicoglu, H.; Demner-Fushman, D.; Rindflesch, T.C.; Wilczynski, N.L.; Haynes, R.B. Towards Automatic Recognition of Scientifically Rigorous Clinical Research Evidence. J. Am. Med. Inform. Assoc. 2009, 16, 25–31. [Google Scholar] [CrossRef] [PubMed]
  8. Shi, H.; Liu, Y. Naïve Bayes vs. Support Vector Machine: Resilience to Missing Data; Springer: Berlin/Heidelberg, Germany, 2011; pp. 680–687. [Google Scholar]
  9. Khan, A.; Khan, A.; Baharudin, B.; Lee, L.H.; Khan, K.; Tronoh, U.T.P. A Review of Machine Learning Algorithms for Text-Documents Classification. J. Adv. Inf. Technol. 2010, 1, 4–20. [Google Scholar]
  10. Anderlucci, L.; Guastadisegni, L.; Viroli, C. Classifying textual data: Shallow, deep and ensemble methods. arXiv 2019, arXiv:1902.07068. [Google Scholar]
  11. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the EMNLP 2014-2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
  12. Del Fiol, G.; Michelson, M.; Iorio, A.; Cotoi, C.; Haynes, R.B. A Deep Learning Method to Automatically Identify Reports of Scientifically Rigorous Clinical Research from the Biomedical Literature: Comparative Analytic Study. J. Med. Internet Res. 2018, 20, e10281. [Google Scholar] [CrossRef] [PubMed]
  13. Sarker, A.; Mollá, D.; Paris, C. Automatic evidence quality prediction to support evidence-based decision making. Artif. Intell. Med. 2015, 64, 89–103. [Google Scholar] [CrossRef] [PubMed]
  14. Bian, J.; Morid, M.A.; Jonnalagadda, S.; Luo, G.; Del Fiol, G. Automatic identification of high impact articles in PubMed to support clinical decision making. J. Biomed. Inform. 2017, 73, 95–103. [Google Scholar] [CrossRef] [PubMed]
  15. Afzal, M.; Hussain, M.; Haynes, R.B.; Lee, S. Context-aware grading of quality evidences for evidence-based decision-making. Health Inform. J. 2017, 25, 146045821771956. [Google Scholar] [CrossRef] [PubMed]
  16. Satterlee, W.G.; Eggers, R.G.; Grimes, D.A. Effective Medical Education: Insights From the Cochrane Library. Obstet. Gynecol. Surv. 2008, 63, 329–333. [Google Scholar] [CrossRef] [PubMed]
  17. Armstrong, R.; Hall, B.J.; Doyle, J.; Waters, E. "Scoping the scope" of a cochrane review. J. Public Health (Bangkok) 2011, 33, 147–150. [Google Scholar] [CrossRef] [PubMed]
  18. Bethesda (MD): National Center for Biotechnology Information (US). Entrez Programming Utilities Help. 2010. Available online: https://www.ncbi.nlm.nih.gov/books/NBK25501/ (accessed on 1 July 2020).
  19. Winter, D.J. Rentrez: An R package for the NCBI eUtils API. R J. 2017, 9, 520–526. [Google Scholar] [CrossRef]
  20. Rapidminer Build Predictive Models, Faster & Better|RapidMiner Auto Model. Available online: https://rapidminer.com/products/auto-model/ (accessed on 18 July 2020).
  21. Wilczynski, N.L.; Morgan, D.; Haynes, R.B. An overview of the design and methods for retrieving high-quality studies for clinical care. BMC Med. Inform. Decis. Mak. 2005, 5, 1–8. [Google Scholar] [CrossRef] [PubMed]
