Deep Learning Based Biomedical Literature Classification Using Criteria of Scientific Rigor

Abstract: A major obstacle to evidence-based clinical decision-making is accurately and efficiently recognizing appropriate and scientifically rigorous studies in the biomedical literature. We trained a multi-layer perceptron (MLP) model on a dataset with two textual features, title and abstract. The dataset consists of 7958 PubMed citations classified into two classes: scientific rigor and non-rigor. We compare our model with other promising machine learning models such as Support Vector Machine (SVM), Decision Tree, Random Forest, and Gradient Boosted Tree (GBT) approaches. Based on its higher cumulative score, the deep learning model was chosen and tested on test datasets obtained by running a set of domain-specific queries. On the training dataset, the proposed deep learning model obtained significantly higher accuracy and AUC (97.3% and 0.993, respectively) than the competitors, but its recall of 95.1% was slightly lower than that of GBT. The trained model sustained its performance on the testing datasets. Unlike previous approaches, the proposed model does not require a human expert to create fresh annotated data; instead, we used studies cited in Cochrane reviews as a surrogate for quality studies in a clinical topic. We conclude that deep learning methods are beneficial for biomedical literature classification. Not only do such methods minimize the workload in feature engineering, but they also show better performance on large and noisy data.


Introduction
In modern medicine, providing appropriate evidence for clinical decisions plays a vital role in increasing the reliability of the system while practicing evidence-based medicine [1]. However, classifying a vast set of medical documents is daunting and time consuming. To address this problem, research on document classification through machine learning is being actively carried out. Additionally, machine learning algorithms have proved especially useful in data mining applications where it is difficult for humans to understand and model the domain [2]. Artificial neural networks appear to further advance the area of biomedical literature mining and classification [3,4]. One advantage of deep learning over shallow machine learning is automatic feature engineering. However, the main issue for deep learning models is acquiring the pre-annotated data required for training and for testing their performance.
This study aims to build a deep learning binary classifier for identifying quality studies in the biomedical literature and to demonstrate its superiority over the promising machine learning approaches historically used for the same problem. To achieve this objective, we design a deep learning model based on a multi-layer feed-forward artificial neural network, also called a multi-layer perceptron (MLP).

Background and Related Work
For more than two decades, evidence-based medicine has been part of the fabric of modern clinical practice and has contributed to many advances in healthcare [5]. To work in the domain of evidence-based medicine, clinicians need to keep up with current research findings. However, they face challenges in accessing scientifically sound studies in an accurate and timely manner. Biomedical knowledge has increased many-fold, which has spurred interest in techniques such as text mining and natural language processing (NLP) for dealing with the vast body of biomedical articles [6]. In addition, the biomedical literature is growing by more than one million studies per year, which makes it difficult for researchers to keep pace with the new knowledge published on different topics [3]. This rapid growth makes it increasingly time consuming for researchers to critically appraise research studies to find the evidence of clinical impact [7]. What is lacking are automated and reliable computerized methods to support researchers in recognizing studies of scientific rigor. With the advent of data-driven approaches, we have the opportunity to apply machine learning and deep learning algorithms to access and appraise the biomedical literature automatically.

Use of Machine Learning and Deep Learning for Biomedical Literature Classification
As a subfield of artificial intelligence, machine learning allows machines to learn from data through the design and development of intelligent algorithms and techniques. That is, the machine learns the patterns and characteristics of the data and uses them to evaluate and predict new data. Machine learning can be divided into supervised learning and unsupervised learning. In supervised learning, the algorithm learns from a labeled dataset, and the labels provide the criteria against which the accuracy of learning can be evaluated. Conversely, in unsupervised learning the algorithm is given unlabeled data and tries to understand the data by extracting its characteristics and patterns. There is a range of classification algorithms. For instance, the Naïve Bayes classifier is a typical generative classifier and is regarded as a special case of Bayesian Network classifiers [8]. The support vector machine (SVM) algorithm learns how important each training data point is for defining the decision boundary between the two classes. In general, only a subset of the training data, namely the data points located at the boundary between the two classes, influences the decision boundary. These data points are called support vectors.
Deep learning models are a special kind of machine learning that allows computational models to learn representations of data with multiple levels of abstraction through multiple processing layers. Deep learning discovers complex structures in big datasets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. There are three important types of neural networks used in deep learning models: convolutional neural networks (CNN), recurrent neural networks (RNN), and multi-layer artificial neural networks (ANN). Deep convolutional networks have made breakthroughs in image, video, voice, and audio processing, while recurrent networks have shed light on sequential information such as text and voice [8]. In contrast, a multi-layer ANN is a suitable option for the classification of textual data structured in a tabular form.
NLP, data mining, and machine and deep learning techniques work together to classify and discover patterns in the text of biomedical documents automatically. The primary objective of text mining is to allow users to obtain data from textual resources and to support activities such as retrieval, classification, and summarization [9]. Anderlucci et al. provide a detailed comparison of the shallow, ensemble, and deep learning methods used for classification of textual data [10]. Initially invented for computer vision, CNN models have subsequently been shown to be useful for NLP and have achieved excellent results in semantic parsing [11]. A CNN-based deep learning model by Del Fiol et al. [12] was trained using a large, noisy dataset of PubMed citations with title and abstract as features and obtained better results than competitors that include PubMed's Clinical Query Broad treatment filter and McMaster's text word search strategy. Using supervised machine learning methods, Sarkar et al. developed a model for identifying quality articles using data features of title, abstract, and others [13]. Bian et al. proposed a machine learning-based high-impact classifier [14] trained on a set of different features and claimed that it outperformed the high-quality Naïve Bayes classifier proposed by Kilicoglu et al. [7]. Afzal et al. compared different machine learning algorithms and chose a support vector machine (SVM) based model due to its higher performance [15]. As an extension to this work, we propose an MLP-based binary classifier and compare its performance with shallow and ensemble machine learning methods.

Use of Cochrane Reviews for Annotation of Scientifically Rigorous Studies
The Cochrane Collaboration is an international organization that prepares, maintains, and makes available systematic reviews of the benefits and risks of health care interventions. The Cochrane Library is commonly considered to be the best source of credible healthcare evidence [16]. Systematic reviews use a transparent and systematic process to construct study questions, look for studies, evaluate their quality, and synthesize qualitative or quantitative results [17]. The Cochrane Database of Systematic Reviews is the world's most vibrant resource of meta-analyses, with 54 active organizations responsible for organizing, advising on, and publishing systematic reviews. In this paper, we utilize studies cited in Cochrane reviews as a surrogate for quality studies. Because Cochrane reviews are recognized globally as the best standard in evidence-based medicine, we treat the primary documents referenced in the reviews as high-impact evidentiary articles.

Methods
We propose an MLP model for evaluating studies of scientific rigor. As depicted in Figure 1, our proposed method consists of two steps. Step 1 acquires the identifiers of studies through queries to the sources and applies filters to remove duplicates from the data. Step 2 processes the deduplicated data to get the text of titles and abstracts, which are then preprocessed to create feature vectors, and applies the proposed MLP model to classify the studies into scientific rigor and non-rigor.

Step 1-Preparation of Datasets
To collect high-quality studies, we execute a general-purpose query, "Kidney disease," on the Cochrane Library and obtain the PubMed identifiers (PMIDs) of all the retrieved articles. The same query is executed on PubMed, and the PMIDs of all the retrieved articles are obtained. The records obtained from Cochrane are labeled with the "scientific rigor" class, and the records obtained from PubMed are labeled with "scientific non-rigor." In the deduplication step, all duplicated items are removed from the PMIDs collected in the PubMed dataset in order to avoid assigning two labels to the same study. Identifiers in both datasets are uploaded using the NCBI Entrez Programming Utilities (eUtils) [18]. eUtils is a service of NCBI that provides access to a total of 50 databases via a web interface and a public FTP (file transfer protocol) site [19]. Using the search and fetch functions of the API, we retrieve two data features (title and abstract) for each article. At this stage, the title and abstract are in their original format, consisting of text as written in the publications. The combined dataset holds 7958 records, of which 1083 are classed as scientific rigor and the rest as scientific non-rigor.
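
The labeling and deduplication logic of Step 1 can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `label_pmids` and the example identifiers are hypothetical, and the sketch assumes that a PMID found in both sources keeps the "scientific rigor" label, which is one reading of the deduplication described above.

```python
def label_pmids(cochrane_pmids, pubmed_pmids):
    """Assign exactly one class label per PMID (hypothetical helper)."""
    rigor = set(cochrane_pmids)
    # Drop repeats within the PubMed results and any PMID already
    # labeled through Cochrane, so no study receives two labels.
    non_rigor = set(pubmed_pmids) - rigor
    labels = {pmid: "scientific rigor" for pmid in rigor}
    labels.update({pmid: "scientific non-rigor" for pmid in non_rigor})
    return labels


# Made-up identifiers, for illustration only
labels = label_pmids(["101", "102"], ["102", "103", "103", "104"])
for pmid in sorted(labels):
    print(pmid, labels[pmid])
```

Using sets keeps the deduplication independent of the order in which the two queries were run.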

Step 2-Development of Deep Learning Model
To achieve the optimal design for the MLP, we employ the Auto Model extension of RapidMiner, which provides a graphical environment for designing automated classification and prediction models faster and more conveniently [20]. To learn the classification model, the dataset is passed through several internal steps that include preprocessing, feature extraction, and feature engineering, as shown in the detailed diagram (Figure 2) of step 2 of our proposed methodology.

Preprocessing
Preprocessing consists of multiple substeps, such as role setting and the transformation of the initial types. In the role setting step, the roles of all attributes are changed to regular except for the class attribute, which is set to the 'label' role. In the transformation of the initial types step, all the text columns are transformed into polynominal (multi-valued nominal) columns. After initial preprocessing, the data is checked for missing values, which are filled where applicable; for instance, missing numeric values are replaced with the average value. Finally, the records are filtered based on the no_missing_labels parameter, which keeps only those records that do not have a missing value in the special attribute with the label role. The data is then split into a training set and a validation set (holdout data) with a ratio of 6:4, respectively.
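
The label filtering and the 6:4 split can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not RapidMiner's implementation: the function name `preprocess_and_split`, the random shuffle, and the fixed seed are all hypothetical choices, and the imputation of numeric values is omitted because the two features used here are text.

```python
import random


def preprocess_and_split(records, train_fraction=0.6, seed=0):
    """Drop unlabeled records, then split into training and holdout sets."""
    # Keep only records whose label is present (cf. no_missing_labels)
    labeled = [r for r in records if r.get("label") is not None]
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up records: ten labeled, one with a missing label
records = [{"title": f"Study {i}", "label": "scientific non-rigor"}
           for i in range(10)]
records.append({"title": "Unlabeled study", "label": None})

train, holdout = preprocess_and_split(records)
print(len(train), len(holdout))
```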

Table 2. Parameter settings for each algorithm.
Algorithm | Parameter Settings
Naïve Bayes | laplace correction (True)
Decision Tree | criterion (gain_ratio), maximum depth (10)

Feature Extraction
Technically, this step tokenizes the text, converts it to lower case, and calculates TF-IDF (term frequency-inverse document frequency) values for each token across all the records. The input text in nominal form is thus transformed into a vectorized format using TF-IDF. TF-IDF is a statistical value that reflects how important a word is to a document in a corpus. This step later supports feature selection based on the importance determined by the TF-IDF value.
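
To make the feature extraction step concrete, the following sketch tokenizes, lowercases, and weights a small corpus. It uses one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the exact weighting and normalization applied by RapidMiner are not stated here, so this formula, the function names, and the example texts are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def tf_idf(documents):
    """Return one {token: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return vectors


# Made-up titles, for illustration only
vectors = tf_idf(["Randomized trial of dialysis", "Dialysis case report"])
print(vectors[0])
```

In this variant, a token that occurs in every document (here "dialysis") receives a weight of zero, which is why such tokens contribute little to later feature selection.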

Automatic Feature Engineering
Automatic feature engineering is a robust utility of RapidMiner that internally uses a deep learning MLP model to select a subset of features from the full set of features. The MLP comes with default parameters that are optimized for finding the best feature sets. Taking all records as a training dataset, the MLP is trained to obtain the final features with the default parameter settings provided by the Auto Model extension of RapidMiner.

Classification Model
After obtaining the best feature sets through automatic feature engineering, a classification model is used to identify the studies of scientific rigor. The classification model is the same deep learning MLP model as used for feature engineering. The parameter settings, as described in Table 1, are the optimized default values provided by RapidMiner.
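
For readers unfamiliar with the model family, the forward pass of a small multi-layer feed-forward network is sketched below. It is purely illustrative: the layer sizes, the ReLU activation, the softmax output, and all weights are hypothetical and do not reproduce the Table 1 settings, and training by backpropagation is omitted.

```python
import math


def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw scores into probabilities that sum to one."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Return class probabilities [P(non-rigor), P(rigor)]."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical weights: 3 input features, 2 hidden units, 2 output classes
probs = mlp_forward(
    [0.2, 0.0, 0.5],
    hidden_w=[[1.0, -1.0, 0.5], [-0.5, 0.3, 1.0]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0], [-1.0, 1.0]],
    out_b=[0.0, 0.0],
)
print(probs)
```

In a real setting, the input vector would hold the TF-IDF values of the selected features, and the larger of the two output probabilities could serve as the confidence of the prediction.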

Performance and Explanation
The trained classification model is applied to the hold-out datasets to measure its performance in terms of metrics widely used in the AI domain, including accuracy, recall, precision, F-measure, and AUC. In addition to these metrics, we also explain the model's predictions in the form of a confidence value. A higher confidence value indicates that the model places more trust in its classification of a study, which leads users to accept the output of the model with firmer belief.
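
The threshold-based metrics named above follow directly from the counts of a confusion matrix, as the sketch below shows. The function name and the counts are made up for illustration and are not results from this study; AUC is left out because it requires the ranked confidence values rather than the four counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, treating "scientific rigor" as the positive class
m = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```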

Model Selection
To test the hypothesis that a deep learning model can perform better than shallow machine learning algorithms, we choose four well-known algorithms that have been used in multiple studies [7,13,15]: Naïve Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), and Gradient Boosted Trees (GBT). A brief description of the parameter settings for each algorithm is provided in Table 2. We compare the performance of these algorithms with the proposed MLP on the basis of an accumulated score obtained by summing widely used performance metrics: accuracy, recall, F1-measure, and AUC (area under the ROC curve). To evaluate the variations, we compare the performance of each algorithm on the title alone, on the abstract alone, and finally on the combination of the two, as shown in the three compartments of Table 3. The table shows the performance of each model on the accuracy, AUC, recall, and F1-measure scores individually as well as the overall score. The proposed MLP applied to the dataset with both title and abstract has the highest overall score. For models trained on the title feature alone, the highest overall score (value: 255) is slightly lower than the highest overall score (value: 257) of models trained on the abstract feature. When the two features are combined, the overall score increases for almost all algorithms. The highest overall score (value: 283), obtained by the MLP trained on the combination of title and abstract, is considerably better than that of each of the other machine learning algorithms. The overall score of the second-highest performer, GBT (value: 262), is about 7% lower than the MLP score. The MLP performs better not only on the overall score but also on each metric: its accuracy (97.3%), AUC (0.993), recall (95.1%), and F1-measure (90.4%) are higher by about 2, 1, 7, and 7%, respectively, than those of the second-best performer.
Based on these measurements, we conclude that the proposed deep learning model performs best on the given dataset; we therefore choose the MLP for testing on datasets from a selected clinical domain.
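
The selection rule based on the accumulated score can be expressed compactly. The sketch below is a simplified illustration: the model names and metric values are made up, all four metrics are assumed to be on the same percentage scale, and the function names are hypothetical.

```python
def accumulated_score(metrics):
    """Sum the four metrics used for model selection."""
    return sum(metrics[name] for name in ("accuracy", "recall", "f1", "auc"))


def rank_models(results):
    """Order model names from highest to lowest accumulated score."""
    return sorted(results,
                  key=lambda name: accumulated_score(results[name]),
                  reverse=True)


# Made-up metric values in percent, not taken from Table 3
results = {
    "model_a": {"accuracy": 90.0, "recall": 80.0, "f1": 82.0, "auc": 95.0},
    "model_b": {"accuracy": 92.0, "recall": 85.0, "f1": 84.0, "auc": 96.0},
}
ranking = rank_models(results)
print(ranking[0], accumulated_score(results[ranking[0]]))
```

Because the score is an unweighted sum, a model can rank first overall while trailing on an individual metric, so the per-metric comparison remains informative alongside the total.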

Results and Discussion
We apply the MLP model to unseen test data retrieved with a set of real-world queries executed for two domains: kidney disease (the same domain used in training) and cancer (a different domain). As shown in Figure 3, the process steps used for data manipulation are similar to the steps used at the training stage of the model, i.e., preprocessing, feature extraction, and automatic feature engineering. The already trained MLP classification model is loaded and applied to the feature sets created from the data retrieved with the queries.

Scenario 1: Results of Same Domain Test Queries
Hypothesis 1. The proposed deep learning model yields equivalent or marginally lower accuracy for an unseen test dataset retrieved with real-world queries from the same domain. We define four queries and collect the articles by running each query on the PubMed database. The defined queries and the number of articles retrieved for each query are shown in Table 4. The results for each query are shown in Table 5. We can observe that the performance of the model varies little across the queries. Additionally, the performance on each metric is marginally lower than the performance on the hold-out dataset, which supports Hypothesis 1.

Scenario 2: Results of Different Domain Test Queries
To test the model on a different domain (Hypothesis 2), we collected data in the same way as for the other queries, but for cancer. We collected 1022 high-quality studies and 6569 general studies. The cross-domain results of evaluating the proposed model are shown in Table 6. We can observe that the performance is slightly lower than the average score for the same domain; however, it is nearly equivalent to that of query 4, which has the lowest score among the four queries of the same domain. The performance is marginally lower than the score on the hold-out dataset (a drop of no more than 10%), which supports Hypothesis 2.

Significant Findings
To our knowledge, this is the first study to use Cochrane Systematic Reviews for training deep learning techniques to identify scientifically sound studies in the biomedical literature in the kidney domain. Our proposed deep learning model performed well compared with state-of-the-art machine learning approaches. Beyond higher accuracy, the model also performed well on unseen test data obtained with real-world queries. It showed excellent performance across multiple queries within the same domain and performed considerably well in another domain, cancer. The results were also consistent across the testing datasets of different queries. This minimal variation across different queries and different domains qualifies our proposed model as useful for identifying high-quality medical articles in the biomedical literature.

Comparison with Prior Work
There have been numerous attempts to automate the recognition of high-quality medical literature using computing techniques. A comparison of previous studies and our proposed model is shown in Table 7. Although the comparison is only thematic, since the studies share the same overall objective, i.e., the identification of high-impact studies in the biomedical literature, it provides a perspective on the superiority of the deep learning model over shallow machine learning methods. From a method perspective, the study conducted by Del Fiol et al. [12] is closest to ours, as they also used a deep learning method in their experiments. The difference is that they used deep learning based on a CNN, while the proposed method uses an MLP, a deep learning model based on multi-layer feed-forward neural networks. They utilized clinical queries data as a surrogate for high-impact studies, while we used Cochrane citations as a surrogate for scientific rigor (high-impact) studies. Both studies used titles and abstracts as features for the experiment. In both cases, the data were noisy because of the existence of false positives in the dataset. The recall in both cases is similar, differing by about 1%. However, there is a large difference in F-measure due to the higher precision of our approach. One possible reason is the noise in the data, i.e., the number of false positives. Another possibility is that Cochrane citations may be a more effective surrogate for high-impact studies than clinical queries or clinical hedges. Afzal et al. [15] used the Quality Recognition Model (QRM). The QRM is a supervised classification model based on the SVM machine learning algorithm and trained on a dataset of clinical hedges annotated by a team of professionals [21]. This was our prior work, in which we compared multiple machine learning algorithms and found SVM to be the best performer.
In this study, we compared our deep learning model with that best performer and other machine learning algorithms and found an increase of 5% in accuracy. The study conducted by Bian et al. [14] used 15,845 PubMed documents and obtained 77.5% recall using the high-impact Naïve Bayes classifier. The authors found that Scopus citations and journal impact factors are the two key features, as compared to other features such as the number of comments on PubMed, high-impact journals, Altmetric scores, and other PubMed metadata.

Error Analysis
Cochrane Systematic Reviews are usually not up to date because they depend on human experts. Therefore, the training dataset lacks the current studies available in the primary literature, at least for the class of high-quality medical evidence. As a result, there is an imbalance between the high-quality medical evidence literature and the primary literature data, which may cause errors. In addition, because the proposed model was trained on data labeled without the help of medical professionals, the dataset may contain labels that have not been fully validated.

Limitation and Future Work
We treated the studies cited in reviews of the Cochrane Library as high-quality literature data. Firstly, we need more high-quality medical evidence literature data, which is usually not available in the Cochrane Library for specific topics. Secondly, up-to-date research papers were not included in training the model, as the Cochrane Library lacks reviews of recent research. Lastly, the proposed model uses only text data, namely title and abstract, and has not been tested with additional features such as year of publication and impact factor.
These limitations could be addressed by enlarging the dataset, combining data retrieved through PubMed Clinical Queries with the Cochrane Reviews data. Alternatively, the expertise of medical professionals could be utilized to evaluate a dataset of testing queries: the expert-annotated data would be added to the training dataset at a particular stage and the model retrained.

Conclusions
In evidence-based medicine, clinicians need high-quality medical evidence to provide the best guidance. We proposed a deep learning model that automatically classifies biomedical literature by learning from data on high-quality studies and general studies. The proposed model utilized data from Cochrane Systematic Reviews and PubMed primary literature. We evaluated the proposed model on different performance measures, such as accuracy, F1-measure, and area under the curve (AUC), and obtained better results than the competitors. We tested the model on different medical queries in the same domain from which the training dataset was collected and also in a different domain. Experiments on four different queries in the same domain yielded performance similar to that obtained on the training data, while the performance was slightly lower for the different domain. This research on the automatic identification of high-quality evidentiary documents will reduce the work hours of clinicians who practice evidence-based medicine. It will also increase the confidence of users in the decisions made by physicians assisted by decision-making systems. In this sense, our proposed model will be a useful addition to the scientific world as well as to real-world clinical setups, given its utility in clinical practice and research.