A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

The objective of systematic reviews is to address a research question by summarizing relevant studies following a detailed, comprehensive, and transparent plan and search protocol to reduce bias. Systematic reviews are very useful in the biomedical and healthcare domain; however, the data extraction phase of the systematic review process necessitates substantive expertise and is labour-intensive and time-consuming. The aim of this work is to partially automate the process of building systematic radiotherapy treatment literature reviews by summarizing the required data elements of geometric errors of radiotherapy from relevant literature using machine learning and natural language processing (NLP) approaches. A framework is developed in this study that initially builds a training corpus by extracting sentences containing different types of geometric errors of radiotherapy from relevant publications. The publications are retrieved from PubMed following a given set of rules defined by a domain expert. Subsequently, the method develops a training corpus by extracting relevant sentences using a sentence similarity measure. A support vector machine (SVM) classifier is then trained on this training corpus to extract the sentences from new publications which contain relevant geometric errors. To demonstrate the proposed approach, we have used 60 publications containing geometric errors in radiotherapy to automatically extract the sentences stating the mean and standard deviation of different types of errors between planned and executed radiotherapy. The experimental results show that the recall and precision of the proposed framework are, respectively, 97% and 72%. The results clearly show that the framework is able to extract almost all sentences containing required data of geometric errors.


Introduction
Systematic reviews are a type of review that uses a rigorous and transparent approach to provide an evidence-based answer for a particular clinical question by summarizing relevant literature [1,2]. Systematic reviews are composed by summarizing different data elements of a particular topic, collected from relevant articles following a detailed, comprehensive, and transparent plan to reduce bias [2]. Some examples of such data elements include the population of an intervention, the inclusion criteria for testing the effect of a drug, etc. The experts manually extract these data elements from the relevant literature, following a predefined protocol, and build a systematic review, a process which typically requires a substantial amount of time [1]. The process of data element extraction to compose a systematic review is labour-intensive and time-consuming when dealing with large quantities of data [1][2][3]. The main challenge that such efforts face lies in the fact that the required data elements that need to be identified within a particular article lack particular, defined patterns of occurrence and are typically reported within tables or in plain text. Moreover, the data elements may occur in a variety of contexts within an article, rendering their identification extremely difficult.
Within the radiation therapy domain, systematic reviews are extremely useful for answering specific radiation therapy questions and supporting specific tasks, for example, identifying evidence of the effective reduction of geometric discrepancies in the radiation therapy of cancer patients. Geometrical uncertainties arise during the treatment process of the external beam radiotherapy of tumors [4]. The main sources of uncertainty are tumor delineation inaccuracies of the gross tumor volume, unknown extent of microscopic tumor, organ positional variation within the patient, and setup variations [4]. Any deviation between planned and executed radiotherapy, even a small one, constitutes a geometric error, or discrepancy [4]. Therefore, geometric errors have to be identified and removed for safe radiation therapy. Many recent studies address the measurement and reduction of geometrical errors [5][6][7].
In this work, we report on the development of a framework based on machine learning and natural language processing (NLP) for extracting sentences containing required data elements of geometric discrepancies in radiation therapy from relevant literature. The work was carried out in collaboration with a research radiation therapist at the Somerset NHS Foundation Trust in the UK, who collected the articles containing relevant data elements of geometric errors for experimental analysis. The relevant sentences from a number of articles containing required geometric errors were manually identified by a radiation therapist to evaluate the performance of the proposed framework. We experimentally evaluated the framework and report on its effectiveness and limitations for the data extraction of geometric errors of radiation therapy from relevant literature. The experimental results show that an SVM classifier can extract the sentences containing the required geometric errors with 97% recall and 72% precision, demonstrating the effectiveness of our approach.

Related Works
The purpose of radiation therapy, or radiotherapy, is to deliver doses of radiation to tumors while minimizing the risk of side effects in healthy tissues. Radiotherapy planning and delivery face many uncertainties [8]. Target volume definition, the first step in the treatment planning chain, is associated with substantial uncertainty [8]. The main sources of uncertainty are the tumor delineation inaccuracies of the gross tumor volume, the unknown extent of microscopic tumor, as well as the organ positional variation within a patient and setup variations [4]. A geometric error, or discrepancy, is any deviation between the planned and executed radiotherapy, even if it is small [4]. Hence, high geometrical accuracy is necessary for a safe clinical application of precise radiotherapy. Some recent studies describe the measurement and reduction of geometrical errors [5][6][7]. Several reviews discuss different types of errors in radiotherapy and the processes used to overcome these discrepancies [3][8][9][10]. Such reviews are extremely valuable, but they require research radiation therapists to manually explore the literature to identify case-specific geometric errors, as well as the corresponding measurements, in order to plan for safe doses. Such tasks are both resource- and time-intensive, since the required data elements do not follow particular or fixed reporting patterns. Machine learning approaches can potentially be used to address these challenges [2,11].
There are a growing number of efforts to identify data elements related to a number of diseases across both the scientific literature and social media datasets using machine learning and NLP techniques [1,12,13]. Goswami et al. developed a machine learning technique applying a random forest classifier to extract data elements of anxiety outcome measures from relevant literature [11], with potential to assist reviews with large numbers of studies synthesising these measures [3]. RobotReviewer is a web-based system that employs both machine learning and NLP to assess the Risk of Bias (RoB) in how a particular clinical study was performed [14]. Another recent study by Basu et al. describes a machine learning framework that applies an SVM classifier to identify relevant data elements of congestive heart failure from the literature [2]. Several PubMed-indexed systematic reviews of congestive heart failure were utilised to generate the training data in this study [2]. Hassanzadeh et al. proposed a framework for quantifying the semantic similarity of clinical evidence in the biomedical literature based on a series of component-level generic and domain-specific semantic similarity measures [15].
Different workshops on NLP were organised for the de-identification of protected health information from relevant medical records by the Informatics for Integrating Biology and the Bedside (i2b2) research group based at Harvard Medical School [16][17][18][19][20]. Yim et al. developed a sparse annotation method for tumour information extraction and built a conditional random field based system for entity and relation extraction for these characteristics [21]. Recently, Wang et al. published a review article of clinical information extraction applications [22]. They analysed different applications, based on machine learning and NLP techniques, for information extraction from various types of electronic health records [22].

Proposed Framework
To our knowledge, no existing study addresses the problem of automatically identifying data elements of geometric errors of radiotherapy from relevant publications. To address this need, a supervised machine learning framework is developed to extract the sentences containing the required geometric errors of radiotherapy from the relevant literature. The framework consists of two major parts, as described below.

Building Training Corpus
We used 60 articles in PDF format related to geometric errors of radiotherapy to conduct this study. Fitz (https://pypi.org/project/PyMuPDF/1.9.2, accessed on 17 March 2021), a Python module, was used to convert the PDFs to free text. A total of 52 out of 60 documents were randomly selected to build the training corpus, with two classes: geometric-errors and non-geometric-errors. In principle, the geometric-errors class should contain the sentences that report the required geometric errors of radiotherapy. Certain keywords related to the geometric errors of radiotherapy (e.g., geometric organ error) were used to identify whether a sentence belongs to the geometric-errors class. This set of keywords was defined by the domain expert and is reported in Table 1. The sentences that do not contain any of these keywords were used to form the non-geometric-errors class. An article may contain sentences that include some of the required keywords but no required data element, i.e., no decimal number. These sentences were discarded, as they were not relevant to either of the classes.
A sentence similarity measure was used to identify the sentences in a given article that represent the individual classes. The similarity measure, termed sent_sim, was defined in line with the Jaccard similarity measure [23]. The similarity between a keyword (say, kw) that represents a geometric error and a sentence (say, S) in a given article of the training set is defined as:

$$\mathrm{sent\_sim}(kw, S) = \frac{|T(kw) \cap T(S)|}{|T(kw)|}$$

Here, $T(kw)$ and $T(S)$ denote the set of words in kw and S, respectively. Note that the values of sent_sim lie in [0, 1], where 1 denotes the highest similarity. The aim of sent_sim is to identify how many words of kw exist in S, unlike traditional sentence similarity measures, such as Jaccard, which compute the similarity based on the common words of two sentences relative to all the words in both. If kw is a short phrase and S is a long sentence, but many words of kw exist in S, then the sent_sim score of kw and S will be high, indicating that the sentence S is relevant to kw. The sentences with a sent_sim score greater than or equal to a prefixed threshold α for at least one keyword, and containing a decimal number, were extracted from a given article to construct the geometric-errors class. The value of α was fixed experimentally as described in Section 5. Sentences with a sent_sim score of 0 for all the given keywords were used to form the non-geometric-errors class. The remaining sentences of the document were discarded. Algorithm 1 describes the detailed steps of the training corpus generation.
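The corpus construction described above can be illustrated with a minimal Python sketch. It assumes that sent_sim is the fraction of the keyword's words that also occur in the sentence, which matches the description above; the tokenisation, the regular expression used to detect a decimal number, the function names, and the example sentences are all illustrative assumptions rather than the authors' implementation.

```python
import re

# Assumption: a "decimal number" has digits on both sides of the decimal point.
DECIMAL = re.compile(r"\d+\.\d+")


def tokens(text):
    """Return the set of lower-cased alphabetic words in a text (assumed tokenisation)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def sent_sim(kw, sentence):
    """Fraction of the keyword's words that also occur in the sentence."""
    kw_tokens = tokens(kw)
    if not kw_tokens:
        return 0.0
    return len(kw_tokens & tokens(sentence)) / len(kw_tokens)


def build_corpus(sentences, keywords, alpha):
    """Split sentences into geometric-errors and non-geometric-errors classes.

    Sentences that match a keyword only partially, or that match but carry
    no decimal number, are discarded.
    """
    geometric, non_geometric = [], []
    for s in sentences:
        best = max(sent_sim(kw, s) for kw in keywords)
        if best >= alpha and DECIMAL.search(s):
            geometric.append(s)
        elif best == 0:
            non_geometric.append(s)
    return geometric, non_geometric


# Hypothetical sentences, written only for this illustration.
keywords = ["translational error", "rotational discrepancy"]
sentences = [
    "The mean translational error was 1.2 mm in the lateral direction.",
    "Translational error was discussed by the clinical team.",
    "Patients were immobilised with a thermoplastic mask.",
    "The random error was 0.8 mm.",
]
geometric, non_geometric = build_corpus(sentences, keywords, alpha=0.8)
```

In this toy run the first sentence enters the geometric-errors class, the third enters the non-geometric-errors class, and the other two are discarded: the second has no decimal number, and the fourth matches only one of the two words of a keyword.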

Extraction of Desired Data Elements
A machine learning framework was developed in the second stage, where the training corpus was used to train a classifier to determine whether a sentence from a test article contains any geometric error. The bag of words model was applied to generate features from free text. Unigrams, bigrams, and trigrams generated from sentences were used as features with the SVM classifier in the experimental analysis. Unigrams treat each unique word in a sentence as a feature [24]. A bigram or trigram, on the other hand, treats two or three consecutive words as a single feature, respectively [24]. Both bigrams and trigrams were used in this framework, since the training corpus contains many multi-word terms (e.g., rotational discrepancy, random displacement error) whose words should be kept together for analysis.
The conventional vector space model, which is widely used by text classification and clustering techniques [26], was used to represent each sentence as a vector [24,25]. Let $n$ be the number of sentences in the corpus and $m$ the number of unique terms, i.e., the number of unigrams, bigrams, and trigrams. Let $t_i$ denote the $i$th term and $tf_{ij}$ the frequency of $t_i$ in the $j$th sentence, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$. The entropy-based term weighting technique is used by many researchers to form a term-document matrix from free text data [27,28]. This method reflects the assumption that the more important term is the more frequent one that occurs in fewer documents, taking the distribution of the term over the corpus into account [28]. Thus, the weight of a term $t_i$ in the $j$th sentence, denoted by $W_{ij}$, is determined by the entropy-based technique (https://radimrehurek.com/gensim/models/logentropy_model.html, accessed on 17 March 2021) [28] as follows:

$$W_{ij} = \log(tf_{ij} + 1) \times \left(1 + \frac{\sum_{l=1}^{n} P_{il} \log P_{il}}{\log(n + 1)}\right), \qquad P_{ij} = \frac{tf_{ij}}{\sum_{l=1}^{n} tf_{il}}$$

Let $\vec{S_j}$ be the vector of a sentence $S_j$, where the $i$th component of the vector is $W_{ij}$, i.e., $\vec{S_j} = (W_{1j}, W_{2j}, \ldots, W_{mj})$, $\forall j = 1, 2, \ldots, n$. The cosine similarity is a commonly used measure to find similarity between documents [24][25][26]. Thus, the similarity between two sentences, say $S_j$ and $S_k$, can be defined as:

$$\cos(\vec{S_j}, \vec{S_k}) = \frac{\vec{S_j} \cdot \vec{S_k}}{|\vec{S_j}| \, |\vec{S_k}|}$$

Note that, because the term weights are non-negative, the cosine similarity is non-negative and ranges between 0 and 1, both inclusive. $\cos(\vec{S_j}, \vec{S_k}) = 1$ indicates that the sentences are exactly similar, and the similarity decreases as the value comes nearer to 0. The SVM classifier is used to classify the sentences of the test documents using the training corpus. Given a set of training documents in a vector space, SVM finds the best decision hyperplane that separates individual documents belonging to two different classes.
An SVM classifier extends its applicability to linearly non-separable data sets either by using soft margin hyperplanes or by mapping the original data vectors to a higher dimensional space in which the vectors are linearly separable. The linear kernel is recommended when a data set has a large number of features [29], since it has been reported that mapping the data to a higher dimensional space using a non-linear kernel does not result in substantial performance improvement [29]. Since free text data is high-dimensional, an SVM classifier with a linear kernel, which has been reported to perform well for text classification [29,30], is used in the experimental analysis.
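As an illustration of the feature generation step, the following minimal sketch enumerates the unigrams, bigrams, and trigrams of a sentence and counts them as a bag of words. The whitespace tokenisation and the function name are assumptions made for this example only.

```python
from collections import Counter


def ngram_features(sentence, max_n=3):
    """List all word n-grams of a sentence for n = 1..max_n, in order of n."""
    words = sentence.lower().split()  # assumed tokenisation: split on whitespace
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


features = ngram_features("random displacement error")
bag = Counter(features)  # bag of words: term -> frequency in the sentence
```

For the phrase "random displacement error", the sketch yields three unigrams, two bigrams, and one trigram, so the multi-word term is retained as a single feature alongside its individual words.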
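The weighting and similarity computations can be sketched in plain Python as follows. The sketch assumes the log-entropy form documented for the gensim model cited above, namely a local weight log(tf + 1) multiplied by a global weight 1 + Σ P log P / log(n + 1); the matrix layout (rows are terms, columns are sentences), the function names, and the toy frequencies are assumptions for illustration.

```python
import math


def log_entropy_weights(tf):
    """Weight a term-by-sentence frequency matrix; tf[i][j] is the count of term i in sentence j."""
    n = len(tf[0])  # number of sentences
    weights = []
    for row in tf:
        total = sum(row)
        entropy = 0.0
        for f in row:
            if f > 0:
                p = f / total
                entropy += p * math.log(p)
        global_weight = 1.0 + entropy / math.log(n + 1)
        weights.append([math.log(f + 1) * global_weight for f in row])
    return weights


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy frequencies: term 0 occurs only in sentence 0, term 1 occurs once in every sentence.
tf = [[2, 0, 0],
      [1, 1, 1]]
W = log_entropy_weights(tf)
# Sentence vectors are the columns of W.
s0, s1, s2 = ([W[i][j] for i in range(len(W))] for j in range(3))
```

In this toy matrix the term concentrated in a single sentence keeps its full local weight, while the term spread evenly over all sentences is down-weighted, which is the behaviour the entropy-based scheme is intended to produce.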
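To make the idea of a soft-margin linear decision hyperplane concrete, the sketch below trains a linear SVM from scratch by subgradient descent on the regularised hinge loss. This is a teaching illustration only and not the implementation used in the experiments; the learning rate, regularisation constant, number of epochs, and the two-dimensional toy data are all assumed values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Subgradient descent on (lam/2)*||w||^2 + hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (dot(w, x) + b)
            # The regularisation term shrinks w on every step.
            w = [wi * (1.0 - eta * lam) for wi in w]
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy, linearly separable data: +1 could stand for geometric-errors, -1 for non-geometric-errors.
X = [[2.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

With real sentence vectors the feature dimension is far larger, and an established library implementation of the linear SVM would normally be preferred over a hand-written trainer such as this one.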

Experimental Setup
The performance of logistic regression, random forest, and SVM classifiers was tested on the task of classifying the sentences of the test documents. The training set was used to tune the parameters of these classifiers, applying the 10-fold cross validation technique. The sentences of the test documents were then classified using the best set of parameters of each classifier. Each sentence was assigned either to the geometric-errors class or to the non-geometric-errors class. The code and data set that were used to implement the proposed framework are available on GitHub (https://github.com/tanmaybasu/A-Sentence-Classification-Framework, accessed on 17 March 2021).
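The 10-fold cross validation used for parameter tuning partitions the training set into ten folds, each of which serves once as the validation set. A minimal sketch of such a partition is given below; it assumes contiguous, unshuffled folds for simplicity, whereas in practice the sentences would usually be shuffled or stratified by class first. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k contiguous folds.

    Fold sizes differ by at most one when n_samples is not divisible by k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(25, k=10))
```

Each candidate parameter setting would be trained on the `train` indices and scored on the `validation` indices of every fold, and the setting with the best average score retained.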

Evaluation Measures
The performances of the individual classifiers were evaluated using precision, recall, and f-measure [25]. The precision and recall can be defined as:

$$\text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}, \qquad \text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$

Here, true positive represents the number of sentences correctly predicted as belonging to the geometric-errors class. False positive represents the number of sentences that are predicted as geometric-errors but are members of the non-geometric-errors class. False negative represents the number of sentences that, while predicted as non-geometric-errors, are members of the geometric-errors class. The f-measure can be defined in the following way:

$$\text{f-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The f-measure is the harmonic mean of precision and recall, and is therefore high only when both precision and recall are high [31]. The value of the f-measure is 1 when the values of precision and recall are both 1, and becomes 0 when the precision is 0, the recall is 0, or both are 0. Thus, the value of the f-measure ranges between 0 and 1. A high value of the f-measure indicates good performance of a classifier.
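The three measures can be computed from the counts of true positives, false positives, and false negatives as in the following minimal sketch, which assumes the standard definitions of precision, recall, and f-measure (the harmonic mean of the two). The counts in the example are hypothetical and are not taken from Table 2.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f-measure); each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 relevant sentences extracted, 2 irrelevant extracted, none missed.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=0)
```

With these counts the recall is perfect while the precision is reduced by the two false positives, the same pattern of high recall and lower precision that is desirable when missing a relevant sentence is costlier than reading an irrelevant one.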

Analysis and Results
The training and test corpora contained 9545 and 4336 sentences, respectively. The training corpus included 324 sentences belonging to the geometric-errors class and 9221 sentences that were members of the non-geometric-errors class. The remaining sentences of the training documents were discarded, as they were not related to either of the classes. Table 2 shows the performance of the proposed framework using the SVM classifier to classify the sentences of the eight test documents.
Note that the objective of this framework is to achieve a high accuracy in terms of identifying sentences containing measurements of different types of geometric errors from the test documents. Therefore, a high recall is desirable. The sentences of the test documents containing the required data elements were manually identified by a domain expert in radiation therapy to evaluate the performance of the framework.
Thus, the recall and precision scores for each test document were computed using the sentences manually identified by the domain expert and the sentences extracted by the proposed framework. The true positive, false positive, false negative, and recall and precision scores of each of the eight test documents are presented in Table 2. Almost all the test documents have zero false negatives, leading to a very good recall score and indicating that the proposed system is able to retrieve relevant information from these documents. Table 3 shows that the aggregate recall of the framework using the SVM classifier for the eight test documents is 0.97, while the aggregate precision score is 0.72. A moderate precision score is still acceptable in practice, since a reviewer would need to read, on average, only 1/0.72 ≈ 1.39 extracted sentences for each relevant sentence, as opposed to reading the entire document, which, on average, contains around 200 sentences.

Discussion
We developed a framework based on the bag of words model and an SVM classifier that partially automates the process of building systematic radiotherapy literature reviews by extracting relevant sentences from the literature. Logistic regression and random forest classifiers also perform well for text classification [11,32,33]. Hence, the performance of the proposed framework was also assessed using these classifiers. The aggregate precision and recall scores of the proposed framework using logistic regression, random forest, and SVM classifiers for the eight test documents are reported in Table 3. The f-measure scores, reported in Table 3, were computed from the aggregate precision and recall scores of the individual classifiers. It can be seen from Table 3 that the SVM classifier obtained the best performance in terms of precision, recall, and f-measure.
To our knowledge, the proposed framework is the first of its kind that can automatically extract geometric errors from relevant publications to expedite the process of systematic literature review. Goswami et al. developed a similar method based on term frequency and inverse document frequency (TF-IDF)-based term weighting scheme [24] to extract anxiety outcome measures for comfort intervention from relevant literature [11]. This approach used different articles collected from Medline, EMBASE, CINAHL, and AHMED related to anxiety outcome measures to build the training corpus [11]. However, a set of keywords, defined by the domain experts, was utilised to assess whether any of these keywords occurred in a sentence so as to generate a training corpus, unlike our method that employs a sentence-matching technique. Furthermore, the keywords that were identified by the domain experts for this study [11] are fairly simple, whereas the keywords used in our approach were much more complex. Let us consider the following sentence from test document 3 [34].
It may be noted that 'translational error' is one of the keywords in the proposed study and this sentence clearly describes a geometric error. However, the sentence refers to another author's work cited in this paper [34], and hence it is treated as a false positive by the domain expert. There are many such sentences in different test documents. As the proposed framework is based on the bag of words model and was trained on the sentences that contain the keywords, any similar sentence will be extracted as relevant. Thus, the number of false positives is high for some test documents, which results in a low aggregate precision score.
In principle, a high value of the sentence similarity threshold α is desirable in Algorithm 1 so as to avoid a performance degradation. On the other hand, a very high value of α (e.g., α = 0.95) may result in the inclusion of very few sentences in the geometric-errors class of the training set. In order to assess the necessary trade-off between these two, the value of α was determined experimentally: different training corpora were generated using different α values. Subsequently, the SVM classifier was applied to classify the sentences of these training corpora following the 10-fold cross validation technique. Finally, the training set for the α value that yielded the highest f-measure was used to classify the test documents. The value of α fixed in this way was used to report the results in Tables 2 and 3.
The performance of the proposed framework was also tested using the conventional TF-IDF-based term weighting scheme of the vector space model for text document representation [11,24] instead of the entropy-based technique. Additionally, the simple keyword matching technique that uses the sent_sim similarity measure to build the training corpus was applied directly to extract relevant sentences from the test documents, and its performance is reported in Table 4. The performance of the SVM classifier applying both the entropy-based feature weighting scheme and the TF-IDF-based feature weighting scheme is also reported in Table 4. Moreover, the performance of BioBERT [35], which is a pre-trained language representation model for the biomedical domain, is reported in Table 4. BERT (Bidirectional Encoder Representations from Transformers) is a contextualised word representation model that is based on a masked language model and pretrained using bidirectional transformers [36]. This deep learning architecture has been widely used in many NLP tasks over the last few years [35]. BERT was pretrained on general domain corpora, i.e., English Wikipedia and books [36]. BioBERT was initialised with weights from BERT and was pretrained on full text articles and abstracts from PubMed [35]. BioBERT performed very well for certain NLP tasks, e.g., sentence classification for relation extraction [35]. We used the BioBERT pretrained model (https://github.com/naver/biobert-pretrained, accessed on 17 March 2021) and then fine-tuned it on our training corpus. Subsequently, the sentences of the test documents were classified using this fine-tuned model and the sentence classification framework of BioBERT. It can be seen from Table 4 that the proposed framework, i.e., the entropy-based feature weighting scheme and SVM classifier, performs better than BioBERT and the other techniques in terms of the aggregate precision, recall, and f-measure scores of the eight test documents.
It is observed from Table 4 that BioBERT did not perform well on the test documents. We checked the vocabulary built by the pretrained BioBERT model on the PubMed corpus and noticed that it does not contain some useful words from the given keywords in Table 1, which appeared in many documents of the training and test corpora. Hence, it could not capture the semantic interpretation of these keywords from the given texts. The proposed framework has some limitations, although it has performed well empirically. The method extracts required geometric errors from relevant documents, but it cannot make any judgment on the extracted data. The framework works on free text documents and cannot read and extract data from figures or charts. Furthermore, the proposed framework is based on the bag of words model and therefore cannot apply any semantic interpretation to the text extracts. Deep learning-based document or word embeddings could potentially be employed to generate such semantic features from the documents. In this particular case, however, such deep learning approaches may not work well, since they require a large number of documents for training and the corpus used in this work is very small.
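The threshold selection described above amounts to a simple search over candidate α values. The sketch below expresses that loop with two hypothetical callables: `build_corpus(alpha)`, standing for Algorithm 1 run with a given threshold, and `cv_f_measure(corpus)`, standing for the 10-fold cross-validated f-measure of the SVM classifier on that corpus. The candidate values and the stand-in scores are toy placeholders chosen only to make the example runnable; they are not results from this study.

```python
def select_alpha(candidates, build_corpus, cv_f_measure):
    """Return (best_alpha, best_score) over the candidate thresholds."""
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        corpus = build_corpus(alpha)
        score = cv_f_measure(corpus)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Toy stand-ins (NOT experimental results): the "corpus" is just the alpha value,
# and the score is looked up in a made-up table.
toy_scores = {0.5: 0.60, 0.7: 0.75, 0.9: 0.70}
best_alpha, best_score = select_alpha(
    candidates=[0.5, 0.7, 0.9],
    build_corpus=lambda alpha: alpha,
    cv_f_measure=lambda corpus: toy_scores[corpus],
)
```

Because only the training documents are used inside the loop, the test documents play no part in choosing α.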
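The vocabulary check mentioned above can be sketched as a lookup of each keyword's words in a model vocabulary. The function name is hypothetical, and the small vocabulary used here is an invented example, not the actual BioBERT vocabulary; the keywords are two of those mentioned in the text.

```python
def missing_keyword_words(keywords, vocabulary):
    """Map each keyword to the list of its words that are absent from the vocabulary."""
    vocab = {entry.lower() for entry in vocabulary}
    missing = {}
    for kw in keywords:
        absent = [word for word in kw.lower().split() if word not in vocab]
        if absent:
            missing[kw] = absent
    return missing


# Invented toy vocabulary, for illustration only.
toy_vocabulary = ["error", "rotation", "##al", "translation"]
keywords = ["translational error", "rotational discrepancy"]
missing = missing_keyword_words(keywords, toy_vocabulary)
```

A word that is absent as a whole entry is not necessarily unusable by a subword-based model, since it may be split into smaller pieces, but such splitting can weaken the representation of a domain-specific term.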

Conclusions
A machine learning and NLP-based framework is proposed in this study to automatically build a training corpus, followed by a sentence classification framework to extract required geometric errors of radiotherapy from relevant literature. The sentence classification framework was developed based on the bag of words model for text feature generation, followed by an entropy-based feature weighting scheme and an SVM classifier. Although the SVM classifier extracted almost all the relevant sentences containing the measurement of different geometric errors, it extracted some false positive sentences as well from the test documents. In future, we plan to build a deep learning-based embedding by using a substantial number of relevant articles on geometric errors in radiotherapy from PubMed, Scopus, Wikipedia, and other relevant resources to properly derive the semantic interpretation of the contextual information. We also plan to include a direct feed into a systematic review paper and inferences over the extracted data that would be useful for clinical researchers. Finally, we plan to generalise our approach and assess its effectiveness for other diseases and clinical settings.
Author Contributions: T.B. contributed to the conception and design of the proposed framework and conducted the experimental analysis. S.G. collected the data and performed the manual annotations of the test documents to evaluate the performance of the framework. T.B., S.G. and G.V.G. took part in writing the manuscript and are accountable for the manuscript's contents. All authors have read and agreed to the published version of the manuscript.