Towards Identifying Author Confidence in Biomedical Articles

In an era where the volume of medical literature is increasing daily, researchers in the biomedical and clinical areas have joined efforts with language engineers to analyze the large amount of biomedical and molecular biology literature (such as PubMed), patient data, and health records. With such a huge number of reports, evaluating their impact has long stopped being a trivial task. In this context, this paper introduces a non-scientific factor that represents an important element in gaining acceptance for claims. We postulated that the confidence with which an author expresses their work plays an important role in shaping the first impression that influences the reader's perception of the paper. The results discussed in this paper were based on a series of experiments that were run using data from the open archives initiative (OAI) corpus, which provides interoperability standards to facilitate the effective dissemination of content. This method may be useful not only to its direct beneficiaries (i.e., authors engaged in medical or academic research), but also to researchers in the fields of biomedical text mining (BioNLP) and NLP in general.


Introduction
The interest in biomedical digital libraries, along with the continuous development of various qualitative and quantitative text analysis tools, has made language technologies the natural choice for analyzing the evolution of scientific life. Mining biomedical literature to extract the science behind it, such as concepts, patterns, or relations, is a very productive research area. The extraction of non-scientific information from biomedical data has recently seen an increase in interest, with applications ranging from the identification of speculative language to the retrieval of papers with a specific writing style, in an attempt to cope with different reader preferences.
This paper proposes a method to identify the degree of confidence that an author has in their own writing. The experiments and results discussed in this paper are based on a complex system that was run using a set of data extracted from the open archives initiative (OAI) corpus (https://www.openarchives.org/), consisting of over 12,000 papers in the malaria domain, extracted for the timeframe 2006-2017.
This study was guided by a legitimate question: What elements reveal the author's level of trust in their own scientific writing?
The paper is structured as follows: Section 2 briefly presents the relevant articles regarding the mining of biomedical literature, which reveal a wide interest in identifying the features that drive readers to choose a particular scientific article. Section 3 briefly describes the open archives initiative (OAI) corpus of full-text academic articles. Section 4 presents the architectural components used to identify the critical features for evaluating authors' confidence. Section 5 describes a new system based on the linguistic analysis of scientific biomedical articles at the lexical, syntactic, and semantic levels, and the results are presented in Section 6. The limitations of this methodology, which at this stage focused on three closely analyzed linguistic characteristics for recognizing authors' confidence, are presented in Section 7. A challenge for future work is to find reliable linguistic cues that generalize full confidence in the accuracy and integrity of the author's work.

Background
Biomedical text mining (BioNLP) uses sophisticated predictive models to understand, identify, and extract concepts from a large collection of scientific texts in the fields of medicine [1], biology, biophysics, chemistry, etc., to discover knowledge which can add value to biomedical research [2].
To this end, a wide range of language resources has been developed, including complex lexicons, thesauri, and ontologies covering the entire spectrum of clinical concepts. Keizer [3,4] and Cornet [5] described terminological systems and typologies intended to provide a uniform conceptual understanding.
Aside from mining knowledge, part of this new research direction tries to identify the factors that drive readers to choose one scientific article instead of another.
The retrieval of important literature represents a day-to-day activity for PhD students and scientific researchers, whether for finding the latest breakthroughs or for compiling the state of the art for an area of interest. In [6], a set of stylometric features was used to develop an author search tool which allowed users to find paragraphs written by a specific author or in a specific writing style, since the authors directly related an author's writing style to the readability of textual content [7,8,9]. Hyland [10] analyzed 240 texts to verify whether self-citation and exclusive first person pronouns influenced paper acceptance in eight disciplines.
An important research direction in the biomedical domain is the identification of hedges (i.e., speculative and tentative statements). For most natural language applications, hedging can be safely ignored; however, in the biomedical domain, it is essential to properly identify whether a relation between a drug and a disease is a fact or just speculation. Friedman et al. [11] discussed uncertainty in radiology reports and identified five levels of certainty. Other studies of the speculative aspects of biomedical text annotate speculations [12] and identify them through simple substring matching [13,14], using machine learning techniques with variants of the well-known "bag-of-words" approach [15], or by treating them as classification problems [16,17].
The inspiration for our research was the study in [18], which investigated the relation between an individual's self-reported confidence and the influence that they had within a freely interacting group. They concluded that the influence of an individual within a group was directly dependent on his or her confidence level.
In this context, we hypothesized that a confidently written scientific paper is more likely to be selected, either for reading or for acceptance by scientific journals, than a similar paper written in a less confident manner. Therefore, we developed an instrument for identifying an author's confidence, based on his or her writing style and other linguistic clues, such as passive versus active voice, first versus third person, etc.

Data Set
To identify author confidence, we collected a set of about 12,000 documents belonging to the open archives initiative (OAI) corpus, which contains articles from 2006 to 2017, in English. OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements. Over time, OAI has established itself as a promoter of broad access to digital resources for e-Scholarship, e-Learning, and e-Science.
The collection contained several XML files, each with around 25 scientific articles, selected from the OAI amongst articles which contained the term "malaria" in either the title or the abstract and belonged to the specified timeframe. The reason for selecting a specific disease was that we expected the articles to be comparable with regard to the medical terms used. The first step in our processing involved pre-processing the XML files to split each article into a separate file, which was then fed to the author confidence detection system.
An excerpt of the structure of the XML files for each article is presented in Figure 1. Each article was divided into its component sections, enclosed in the <sec> tag. Although the structure of each article differed according to the specific requirements of the publishing journal, some common sections appeared in most scientific writings: Abstract, introduction, methods, results, conclusions. Each section contained a <title> tag and a set of paragraphs (<p> tags). Owing to space restrictions, only the introduction section and part of the results section are presented in Figure 1.
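Under the layout just described (a <sec> element holding a <title> and <p> paragraphs), the pre-processing step can be sketched in Python, the implementation language of our system. The function name and the substring-based title matching are illustrative assumptions, not taken from the original code:

```python
import xml.etree.ElementTree as ET

def extract_sections(article_xml, wanted=("results", "conclusion")):
    """Pull the raw text of selected sections out of one article.

    Assumes the OAI-style layout described above: each section is a <sec>
    element holding a <title> and one or more <p> paragraphs. Section
    titles are matched by substring, so "conclusion" also catches
    "Conclusions".
    """
    root = ET.fromstring(article_xml)
    sections = {}
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").strip().lower()
        if any(w in title for w in wanted):
            paragraphs = [(p.text or "").strip() for p in sec.findall("p")]
            sections[title] = " ".join(paragraphs)
    return sections
```

Matching by substring rather than exact title is a deliberate choice here, since journals vary between singular and plural section headings.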

Methodology
While the study of the connection between discourse patterns and the personal identification of an author is decades old, the study of these patterns using language technologies is relatively recent. Following this more recent tradition, we framed the prediction of an author's confidence from a text as an important problem for the natural language processing domain. Confidence [19] is generally described as a state of being certain that either a hypothesis or prediction is correct or that a chosen course of action is the best or most effective. Different approaches consider confidence in terms of "appropriateness" or "trustworthiness" [20], or they correlate it with uncertainty. In [21], the authors described a function theory, called Dempster-Shafer (D-S), for evaluating the confidence of an argumentation. In [22], a trust case framework was used to check the argumentation used to demonstrate compliance with specific standards. Our system was implemented in Python, relying primarily on the NLTK (Natural Language ToolKit) package.
In the context of this study, structured argumentation, although it plays an important role in communication, is not enough. Automatically discovering whether an author is confident in their argumentation is a challenging task, which involves finding the author's sentiments, features determining their writing style, as well as information about the author's mastery of the scientific field.
The architecture of our proposed system is presented in Figure 2.
In order to determine the confidence of an author in their work, we proposed a system composed of three main modules: A preprocessing step, a parser, and a voting procedure. After extracting each article into a separate XML file, the preprocessing step retained only the text, deleting all the tags. Only two sections were analyzed for each article, the Results and Conclusion sections, since we found in a previous study that in these sections, authors were more likely to present their work in a confident or reluctant manner (see Appendix A). The raw text of the two sections was then cleaned to avoid sending unrecognized characters to the parser. The parser consisted of three modules: a lexical, a syntactic, and a semantic analyzer. The last step was the voting procedure, which took the scores from the three previous analyzers and merged them. The weights of each feature were empirically determined, based on annotated examples, but also on each feature's relevance. The system was initially tested with equal weights for all features, and its performance was recorded. Subsequently, the system was fine-tuned by running it with different values for the weights and saving the best performing version. A threshold was then used to decide whether the text was written in a confident manner or not. The next section describes the three analyzers in more detail.

System Description
Our system was based on a linguistic analysis of scientific biomedical articles, through exploring various lexical, syntactic, and semantic features. After the preprocessing step, the raw texts were fed to a parser with three modules, in a pipeline.
The first module was the lexical analyzer (see Figure 3), which tokenized the text to identify each word. From this step, the sentence length could be obtained.

[Diagram labels: XML files, Preprocessing, Lexical Analyzer, Syntactic Analyzer, Semantic Analyzer, Voting]
Figure 3. Description of the lexical analyzer.
We analyzed this feature since we noticed that sentences which were too long tended to be more difficult to follow. After this step, a lemmatiser identified the dictionary form of each word, which was useful for counting frequencies more accurately. Thus, the frequencies of unique unigrams and bigrams were computed and normalized by the length of the document and the number of tokens within it. Functional words were removed, and the number of medical terms in each document was computed. Although specialized language needs to be used to prove mastery of the domain, if the proportion of specialized words in a document is too high compared to words from the common vocabulary, reading and understanding the article becomes difficult.
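A minimal, dependency-free sketch of these lexical computations is given below. The regex tokenizer, the short function-word list, and the caller-supplied medical lexicon are simplifying assumptions, standing in for the NLTK tokenizer, lemmatiser, and the terminology resources actually used:

```python
import re
from collections import Counter

# Tiny illustrative stop list; the real system removed functional words
# using a full lexicon.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
                  "is", "was", "were", "that", "this", "it", "with"}

def lexical_features(text, medical_terms):
    """Average sentence length, normalized n-gram frequencies, and the
    share of medical terminology, as in the lexical analyzer.

    `medical_terms` is a caller-supplied set of domain terms (an
    assumption; the paper does not name the lexicon used).
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z]+", text.lower())
    avg_len = len(tokens) / len(sentences) if sentences else 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    unigrams = Counter(content)
    bigrams = Counter(zip(content, content[1:]))
    n = len(tokens) or 1  # normalize by the token count of the document
    return {
        "avg_sentence_length": avg_len,
        "unigram_freq": {w: c / n for w, c in unigrams.items()},
        "bigram_freq": {b: c / n for b, c in bigrams.items()},
        "medical_ratio": sum(1 for t in tokens if t in medical_terms) / n,
    }
```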
The second module was the syntactic analyzer, presented in Figure 4. The part-of-speech (POS) tagging was performed using the RACAI POS tagger (http://www.racai.ro/en/tools/) for English. We chose this tagger instead of NLTK's POS functionality since it facilitated the extraction of the verbal voice, a feature that we used further on. Once the parts of speech had been identified, we extracted two features: (1) The use of the passive or active voice, and (2) the preference for using the first or third person for both verbs and pronouns. We considered the voice of scientific articles to be relevant since, in argumentation theory, the active voice is preferred and considered to indicate more commitment. The passive voice, on the contrary, indicates a certain distance from what is being presented.
For instance, the sentence: "It has been shown that confident authors express themselves in an active voice." focuses on someone else, i.e., the one who made the statement, and it establishes a certain distance. On the contrary, the active version of this sentence shows more commitment and agreement: "Research has shown that confident authors express themselves in an active voice." The other relevant piece of information was the inflection of verbs and pronouns with regard to person. Writing in the first, second, or third person is referred to as the author's point of view [23]. The common tendency is to personalize the text of blogs, journals, or books by writing in the first person ("I" and "we"). However, this tactic is not common in academic writing.
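The passive construction in the first example can also be detected mechanically. The sketch below is a deliberately simplified heuristic (a form of "to be" immediately followed by a past-participle-like word), standing in for the tagger-based voice extraction performed with the RACAI tools:

```python
import re

def count_voice(sentences):
    """Rough active/passive tally over a list of sentences.

    A sentence is flagged passive when a form of "to be" is immediately
    followed by something that looks like a past participle (-ed/-en
    endings plus a few irregulars). This heuristic is an illustrative
    stand-in, not the tagger-based method described above.
    """
    be_forms = {"is", "are", "was", "were", "be", "been", "being"}
    irregular = {"shown", "given", "made", "done", "found", "known", "seen"}
    passive = 0
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for a, b in zip(tokens, tokens[1:]):
            if a in be_forms and (b.endswith("ed") or b.endswith("en")
                                  or b in irregular):
                passive += 1
                break  # count each sentence at most once
    return passive, len(sentences) - passive
```

Applied to the two example sentences above, the heuristic flags the "has been shown" variant as passive and the "Research has shown" variant as active.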
In science and mathematics, the first person is rarely used, being considered to move the focus of the statement from the research to the author. In medical texts, in some journals, it is generally acceptable to use the first person point of view in the abstracts, introductions, discussions, and conclusions, to refer to the group of researchers that were part of the study. The third person point of view is used when writing the methods and results sections. Adhering to this common practice shows knowledge of the usual norm, as well as showing rigorousness, and thus confidence.
The third-person point of view is generally used in scientific papers, though in different forms. Indefinite pronouns are used to refer back to the subject, while avoiding the use of masculine or feminine terminology.
The following sentence uses masculine terminology: "An author must ensure that he has used the proper person in his writing." An example of combined masculine and feminine terminology, which should be avoided as a factor of distraction if repeated, is: "An author must ensure that he or she has used the proper person in his or her writing."

The third and last module, the semantic analyzer, performed two types of analyses (see Figure 5): Sentiment identification and author profiling. The POS-tagged corpus of articles was filtered to identify the overall sentiment of each paper using the Stanford sentiment analysis tool (https://nlp.stanford.edu/sentiment/). Its deep learning model builds up a representation of a whole sentence based on its grammatical structure, computing the sentiment from how words compose the meaning of longer phrases by means of a recursive neural network. The sentiment is expressed as a polarity (i.e., the text tends to be positive or negative). After analyzing each sentence individually, a score for the entire document is given.

The collection of articles came with its own metadata (see Appendix B), from which we extracted information about the author's profile, i.e., the name of the author, the name of the journal, keywords, etc. To identify the importance of a paper in its domain, we searched the internet to assess the author's reputation and investigated the author's previous publications, considering the number of citations per paper and the number of article views. Additionally, we considered the number of times the article was cited and the total number of cited reference papers for each given article.
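The document-level aggregation can be sketched as follows. The mapping from the Stanford tool's five sentiment classes to a [-1, 1] polarity and the plain averaging are our assumptions, as the exact aggregation is not specified:

```python
def class_to_polarity(label):
    """Map a 5-point sentiment class (0 = very negative ... 4 = very
    positive, as output by the Stanford sentiment tool) onto [-1, 1]."""
    return (label - 2) / 2.0

def document_sentiment(sentence_labels):
    """Combine per-sentence class labels into one document-level score.

    The paper states only that a document score is derived from the
    sentence-level results; plain averaging is an assumption made here.
    """
    if not sentence_labels:
        return 0.0
    return (sum(class_to_polarity(label) for label in sentence_labels)
            / len(sentence_labels))
```

A document whose sentences were classed as positive (4), somewhat positive (3), and neutral (2) would thus receive an overall polarity of 0.5, i.e., towards the positive end.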
Each of the three main analyzers (lexical, syntactic, and semantic) returned a score for each article, and the final step combined the intermediate scores, with specific weights, to obtain the final result, which was a good predictor of whether a certain author had written their paper in a confident tone or not. The weights of each module were empirically identified, using information from the corpus, but also from various online good practice guides on how to write a scientific article.
For example, for the first, lexical module, the sentence length and the frequency of medical terms formed the module score. The weight for sentence length was 0.05 if the sentence was between 15 and 20 words, and 0 otherwise. Concerning the frequency of medical terms, if they represented less than 33% of all words in the article, the weight of this feature was 0.75 × the normalized frequency; otherwise, we used the weight 0.25. The linear combination of these two scores formed the overall score of the lexical module. Similarly, weights were computed for the active vs. passive voice and the 1st vs. 3rd person, and their combination formed the score of the syntactic module. As for the third, semantic module, a weight of 0.5 × the sentiment score was used for the sentiments identified, to which the author profile score was added. The latter score was computed as the sum of its different components: (a) 0.05 × publication number/10 for authors with fewer than 10 publications, and 0.05 otherwise; (b) 0.15 × number of views/1000 for fewer than 1000 views, and 0.05 otherwise; and (c) a citation score, similarly computed. All three scores (lexical, syntactic, and semantic) had equal weights in the total score.
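These scoring rules can be written out directly. The function names are illustrative, and the citation component, described only as "similarly computed", is left as an input rather than guessed at:

```python
def lexical_score(avg_sentence_len, medical_freq):
    """Lexical-module score, per the weights given in the text: 0.05 if
    the average sentence runs 15-20 words (0 otherwise), plus a
    medical-term component of 0.75 x normalized frequency below 33%,
    and 0.25 otherwise."""
    length_part = 0.05 if 15 <= avg_sentence_len <= 20 else 0.0
    med_part = 0.75 * medical_freq if medical_freq < 0.33 else 0.25
    return length_part + med_part

def author_profile_score(publications, views, citation_part):
    """Author-profile component of the semantic score. The publication
    and view rules are stated in the text; the citation component is
    only said to be "similarly computed", so it is passed in here."""
    pub = 0.05 * publications / 10 if publications < 10 else 0.05
    view = 0.15 * views / 1000 if views < 1000 else 0.05
    return pub + view + citation_part

def semantic_score(sentiment, publications, views, citation_part):
    """Semantic-module score: 0.5 x sentiment plus the profile score."""
    return 0.5 * sentiment + author_profile_score(
        publications, views, citation_part)
```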

Results
This section presents the results obtained for the three features (sentiment analysis, average number of words per sentence, and the frequency of medical terms) in evaluating an author's confidence (Figures 6, 7, and 8, respectively). We observed that the sentiment analysis and the medical term frequency produced distinctive results, suggesting that the word choices of confident authors reflected positive sentiments, and that the medical term frequency was in tandem with the first feature. The feature based on the average number of words per sentence showed an irregular behavior. This was to be expected, since a good argumentation, in both spoken and written form, contains no unnecessary words.

Sentiment Analysis
The computational treatment of sentiments, subjectivity, and opinions has recently attracted a great deal of attention, in part because of its potential applications. Sentiment analysis has proven useful for editorial sites, where companies create summaries of people's experiences, and opinions are extracted from reviews based on each review's polarity, i.e., positive or negative.
The identification of an author's confidence poses a significant challenge to data-driven methods, as it resists traditional techniques. In the present study, we used sentiment analysis to identify the author's level of confidence. In Figure 6, we show the results obtained after running the sentiment analysis tool. Our results indicated that most of the papers carried positive (towards 1) sentiments, and that confidence was directly linked to the positive expression of sentiment.

Average Words Per Sentence
When writing a scientific paper, the first quality, with precedence over all others, is clarity. According to the Oxford Academy (https://www.ox.ac.uk/sites/files/oxford/field/field_document/Tutorial%20essays%20for%20science%20subjects.pdf), it is highly recommended to use up to 15 words in a sentence; an author who consistently uses many more words per sentence reveals a lower degree of confidence in the work in question. This recommendation was supported by our findings, where an article marked as having a confident author would have an average sentence length in the range of 15-20 words. To demonstrate self-confidence, it is also essential to use the appropriate terms (in our case, medical terms) while avoiding jargon, because jargon is the secret language of the scientific field: It excludes the intelligent, otherwise well-informed, reader, and speaks only to the initiated. The statistical analysis of our corpus showed that articles marked as non-confident had either below 25% or above 40% medical terminology.
In this study, we have shown that it is possible to automatically identify the level of confidence that an author had when writing a scientific paper.

Discussion
In this paper, we have presented a method to extract non-scientific information from biomedical papers, more specifically, the confidence of an author regarding their work. Given this purpose, we explored the linguistic features that were predictive of the author's level of trust in their own scientific writing. While our focus was on a single disease ("malaria"), we chose a method that is generalizable to other diseases, given the similarities present across other medical subdomains.
We studied the relation between lexical analysis (frequencies of medical words, sentence length); syntactic features (POS tagging, voice and person of verbs and pronouns); and semantic features (sentiment analysis, author profiling), to automatically predict the author's confidence. The weights of each feature were empirically determined, based on annotated examples, but also on the feature's own relevance.
To improve the performance of our system, we intend to enrich the gold annotated corpus with articles for different diseases and to additionally use machine learning techniques for the classification task.
To further test our belief that author confidence influences the acceptance of papers in peer-reviewed journals, we intend to extend the study by analyzing the reviews from journals with an open review process.