A Natural Language Processing Approach to Automated Highlighting of New Information in Clinical Notes

Abstract: Electronic medical records (EMRs) have been used extensively in most medical institutions for more than a decade in Taiwan. However, information overload associated with rapid accumulation of large amounts of clinical narratives has threatened the effective use of EMRs. This situation is further worsened by the use of “copying and pasting”, leading to lots of redundant information in clinical notes. This study aimed to apply natural language processing techniques to address this problem. New information in longitudinal clinical notes was identified based on a bigram language model. The accuracy of automated identification of new information was evaluated using expert annotations as the reference standard. A two-stage cross-over user experiment was conducted to evaluate the impact of highlighting of new information on task demands, task performance, and perceived workload. The automated method identified new information with an F1 score of 0.833. The user experiment found a significant decrease in perceived workload associated with a significantly higher task performance. In conclusion, automated identification of new information in clinical notes is feasible and practical. Highlighting of new information enables healthcare professionals to grasp key information from clinical notes with less perceived workload.


Introduction
Electronic medical records (EMRs) have been developed in Taiwan for more than a decade [1]. They have been implemented in most of the larger medical institutions to store information about encounters and events between patients and healthcare systems [2]. The implementation of EMRs not only enables large-scale storage and collection of patient data but also makes the exchange of healthcare information between healthcare facilities possible through the Electronic Medical Record Exchange Center [2]. Almost all the medical complaints, diagnoses, processes of clinical care, laboratory results, and medication use from various departments of most domestic hospitals are readily available from EMRs. The immediate accessibility of patient information is likely to help healthcare professionals improve patient care delivery and enhance the quality of medical decision-making [3,4]. Moreover, the development of integrated clinical decision support systems and EMRs has achieved substantial success in reducing medical errors and improving patient safety [5].
Despite the positive changes brought by EMRs, patient information explosion associated with the rapid accumulation of a large amount of unstructured data has become a threat to the effective use of EMRs. The immense amount of information stored in EMRs can lead to information overload for healthcare professionals [6]. In addition, the use of EMRs may interfere with the interaction between patients and practitioners and cause dissatisfaction and burnout among healthcare professionals [7]. Furthermore, because current EMR systems generally provide several time-saving features such as the "copy-and-paste" function, physicians frequently use copying and pasting to reduce the omission of what they consider important information. Clinical notes have thus become full of redundant information and barely able to convey useful information [8]. These redundant contents also cause difficulties in reading and prolong reading time.
As more longitudinal clinical narratives are produced with the increasing number of healthcare encounters, the practice of copying and pasting inevitably generates more redundant information, which is actually noise and masks new and clinically relevant information within notes [9]. A systematic review found that 66% to 90% of clinicians routinely use copy-and-paste and approximately 80% of physicians use copy-and-paste regularly for inpatient documentation [10]. Such practices are similarly common among residents and attendings [11]. Despite the many deficits in notes written using copy-and-paste, approximately 80% of physicians agreed that copy-and-paste behaviors should continue [12] considering that almost half of their work time is spent on EMR-related work [13,14].
Now that the adoption of healthcare information technology has brought about lots of redundant information in EMRs, healthcare information technology should be able to help reduce the interference from redundant information. According to a survey on the usability of EMR systems, "ease of finding the required information on the screen" is the most desired requirement [15]. Hence, many studies have focused on the redesign of EMR interfaces to help clinicians keep track of relevant patient information. Well-designed data visualization in EMR systems not only allows healthcare professionals to communicate information efficiently and effectively but also improves data interpretation and clinical reasoning [16,17]. In addition, it is recommended that copied material be displayed in a different font or color so that it can be easily identified [10].
With the advances in natural language processing, investigators have tried to build applications around NLP technologies for summarization or extraction of needed information from longitudinal patient records within EMRs [17,18]. Among them, Zhang et al. have developed algorithms based on statistical language models to identify relevant new information in longitudinal clinical narratives [9,19]. Experimentation with a visualization tool for the presentation of new information also found some positive influences on the synthesis of patient information from EMRs [20]. Motivated by these works, this preliminary study aimed to investigate (1) the amount of new versus redundant information in inpatient clinical notes; (2) the accuracy of automated identification of new information in clinical notes; and (3) whether highlighting of new information affects the performance, task demands, and perceived workload of healthcare professionals in reviewing clinical notes.

Study Setting
This study was conducted in Ditmanson Medical Foundation Chia-Yi Christian Hospital, a 1000-bed teaching hospital located in southern Taiwan. It employs 3000 staff, with approximately 47,000 admissions, 1,110,000 outpatient visits, and 89,000 emergency visits per year. The study protocol was approved by the Ditmanson Medical Foundation Chia-Yi Christian Hospital Institutional Review Board (CYCH-IRB No.2018085).

Clinical Notes and Manual Annotation
A purposive sample of ten patients was selected for review of clinical notes from the inpatient population of medical wards and intensive care units. Patients selected for this study had to be hospitalized for more than 10 days, with complex conditions and multiple comorbidities. All clinical notes were checked to ensure that the clinicians participating in this study had never previously cared for the patient. Patient identifiers were replaced by a unique study identification number to ensure confidentiality; the requirement for informed consent was therefore waived.
Two experienced attending physicians (LCH and SFS) independently annotated the clinical notes of each patient. After reviewing the admission note, they evaluated the subsequent 9 days of progress notes to identify new information based on all preceding notes chronologically using their clinical judgment. Inter-rater agreement was assessed at the line level using the Kappa statistic. Discrepancies between the two annotators were arbitrated by consensus. The final set of annotations was used as the reference standard for automated highlighting of new information.
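The line-level agreement described above can be illustrated with a minimal sketch. It assumes that the Kappa statistic refers to Cohen's kappa for two raters and that each line carries a binary label (1 = new information, 0 = redundant); the function name and the example labels are hypothetical and are not taken from the study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same lines."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of lines")
    n = len(labels_a)
    # Observed agreement: share of lines given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten lines (not study data).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 8 of 10 lines while chance agreement is 0.5, so kappa is approximately 0.6.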

Automated Highlighting Using the Bigram Language Model
A statistical language model is a probability distribution over word sequences. It is useful in many natural language processing applications, such as speech recognition, text categorization, and information retrieval. An n-gram model is a type of language model that approximates the probability of observing a word based on the preceding n-1 words in a word sequence [21]. For example, a bigram model estimates the occurrence of a word in the context of the preceding one word.
The automated highlighting of clinical notes largely followed the method developed by Zhang et al. using the bigram language model [19,22]. All the clinical notes from the same patient were ordered chronologically. The text was preprocessed through sentence splitting, stop-word removal, spell checking, and stemming. Then a bigram language model was built based on preceding notes to identify new information in the target note. If a bigram had never appeared in any preceding notes, the sentence containing the bigram was considered new information and was thus highlighted [19].
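The core rule above (highlight a sentence if it contains a bigram never seen in the preceding notes) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: each line is treated as one sentence, the stop-word list is a tiny illustrative set, and spell checking and stemming are omitted. All function names are hypothetical.

```python
import re

# Illustrative stop-word list only; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "was", "with"}


def tokens(line):
    """Lowercase a line, split it into alphanumeric tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", line.lower())
    return [w for w in words if w not in STOP_WORDS]


def bigrams(line):
    """Set of adjacent token pairs in a line (empty if fewer than two tokens)."""
    toks = tokens(line)
    return set(zip(toks, toks[1:]))


def highlight_new_lines(preceding_notes, target_note):
    """Return one flag per target line: True if it contains an unseen bigram."""
    seen = set()
    for note in preceding_notes:
        for line in note:
            seen |= bigrams(line)
    # Lines with fewer than two tokens have no bigrams and are never flagged here.
    return [bool(bigrams(line) - seen) for line in target_note]


# Hypothetical notes, each given as a list of lines.
preceding = [["Fever and cough for three days.", "Start ceftriaxone 2 g daily."]]
target = ["Fever and cough for three days.", "Blood culture grew E. coli."]
flags = highlight_new_lines(preceding, target)
```

Here the copied first line is left unhighlighted and the second line is flagged as new. Keeping the seen bigrams in a set makes each lookup cheap, which matters as the number of preceding notes grows.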

Evaluation of Automated Highlighting
Precision (positive predictive value), recall (sensitivity), and F1 score were used to evaluate the performance of automated highlighting of new information against the reference standard at the line level. A true positive means that a line containing new information was identified by both automated and expert annotation. A false positive indicates that a line was highlighted as having new information by automated but not by expert annotation, while a false negative indicates that a line was highlighted as having new information by an expert but not by automated annotation. Precision was calculated as true positives divided by the sum of true positives and false positives, recall as true positives divided by the sum of true positives and false negatives, and F1 score as 2 times precision times recall, divided by the sum of precision and recall.
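The definitions above translate directly into a few lines of code. The sketch below assumes one boolean flag per line for both the automated output and the reference standard; the function name and the example flags are hypothetical.

```python
def line_level_scores(predicted, reference):
    """Precision, recall, and F1 for line-level highlighting flags."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same lines")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)      # highlighted by both
    fp = sum(1 for p, r in pairs if p and not r)  # automated only
    fn = sum(1 for p, r in pairs if r and not p)  # expert only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical flags for six lines (not study data).
automated = [True, True, False, True, False, False]
expert = [True, False, False, True, True, False]
precision, recall, f1 = line_level_scores(automated, expert)
```

With two true positives, one false positive, and one false negative, precision, recall, and F1 all equal 2/3 in this example.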
To determine the optimal number (N) of preceding notes for building the bigram language model, we varied N from 1 to 8. In other words, each target note was highlighted using a model built from its N most recent preceding notes, or from all preceding notes when fewer than N were available. The N value that achieved the highest F1 score was used in the user experiment.
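The windowing over preceding notes can be sketched as below, assuming the notes of one patient are held in a chronologically ordered list. The function names are hypothetical, and the F1 values passed to the selection helper are placeholders chosen for illustration, not the results reported in this study.

```python
def preceding_window(notes, target_index, max_preceding):
    """Return at most max_preceding notes immediately before the target note."""
    start = max(0, target_index - max_preceding)
    return notes[start:target_index]


def best_window(f1_by_n):
    """Pick the N with the highest F1; ties go to the first N encountered."""
    return max(f1_by_n, key=f1_by_n.get)


# Hypothetical chronological notes for one patient.
notes = ["admission", "day1", "day2", "day3", "day4", "day5"]
late_window = preceding_window(notes, 5, 4)   # four most recent notes before day5
early_window = preceding_window(notes, 2, 4)  # only two notes are available

# Placeholder F1 values for illustration only.
chosen_n = best_window({1: 0.70, 2: 0.75, 3: 0.74})
```

Early target notes simply use every note available, so the window only starts to drop older notes once more than N notes precede the target.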

User Experiment
A convenience sample of twelve clinicians from the staff of medical wards and intensive care units was recruited. The participants were contacted in person and were asked if they were willing to participate. Participation was voluntary and compensated. Age, gender, and years of experience in clinical practice were collected for each participant.
Four of the ten patients used in the first experiment were selected for the user experiment. The selection was made to balance the number of clinical notes and the amount of text as best as possible. The number of lines per progress note was similar among the four patients, ranging from 21 to 23. All of them had multiple underlying comorbidities and a complicated hospitalization course. Each patient had several active problems that needed further investigations to determine the etiology and repeated evaluations of the response to treatment. These intricate clinical scenarios may better represent the daily practice of the study participants.
The user experiment used a 2-period crossover design (Figure 1). The total number of participants was set at 12 so that each patient would appear six times in each condition (original and highlighted). Six participants first reviewed the clinical notes of two of the four patients in the original condition (period 1) and then reviewed the notes of the other two patients in the highlighted condition (period 2). After each study period, participants completed a National Aeronautics and Space Administration Task Load Index (NASA-TLX) questionnaire. The other six participants reviewed the clinical notes in the reverse order. That is, they first reviewed the notes of two patients in the highlighted condition and then the notes of the other two patients in the original condition.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 4 of 11

Measurement of Task Demands
This study used Morae Recorder, version 3.3.4 (TechSmith Corporation, Okemos, MI), to record screen captures and track mouse actions. All participants used the same computer and had access only to the clinical notes displayed in a standard web browser. Immediately before the experiment, participants were instructed on how to browse the clinical notes. They were asked to review the notes at their usual pace, and no time limit was set. The total numbers of mouse clicks, mouse wheel scrolls, and mouse movements (in pixels) and the total time for participants to complete each testing scenario were used to measure task demands.

Measurement of Task Performance
During the note review process, participants had to complete a 20-item task questionnaire for each patient. Task items mainly focused on clinical fact or event finding (e.g., "Does this patient have a history of hypertension?"), date finding (e.g., "When did this symptom start?"), and clinical comparisons (e.g., "Was the condition getting better?"). Half of the answers to these task items came from the admission note and the other half from the progress notes. Reference answers were also obtained from the two expert annotators. Answers of participants were scored as correct or incorrect against the reference answers. Each correct answer was given one point. In addition to the total scores, the scores were subtotaled separately for questions regarding the admission note and those regarding the progress notes. The scores were normalized from 0 to 100 to represent task performance.

Measurement of Perceived Workload
The NASA-TLX, a widely used tool to assess workload and effectiveness in humans [23], was applied to evaluate perceived workload. The NASA-TLX consists of six dimensions including mental demand, physical demand, temporal demand, overall performance, effort, and frustration level [24]. It has been used to quantify the perceived workload associated with the use of EMRs [25,26]. The workload score ranges from 0 to 100 for each dimension, with a higher score indicating a greater workload. To obtain the weight of each dimension of the NASA-TLX, each participant performed 15 separate pairwise comparisons of the 6 dimensions to determine the relative relevance of each dimension in the task of reviewing clinical notes. Next, an overall NASA-TLX score was obtained by multiplying the dimension score with the corresponding dimension weight, summing across all dimensions, and dividing by 15.
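The weighting procedure above can be made concrete with a short sketch. It assumes that each of the 15 pairwise comparisons yields one chosen dimension, so that a dimension's weight is the number of times it was chosen (0 to 5) and the weights sum to 15. The dimension keys, function names, and example ratings are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def weights_from_choices(choices):
    """Count how often each dimension was chosen across the 15 pairwise comparisons.

    choices maps each pair of dimensions to the one judged more relevant.
    """
    weights = dict.fromkeys(DIMENSIONS, 0)
    for pair in combinations(DIMENSIONS, 2):
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"choice for {pair} must be one of the two dimensions")
        weights[winner] += 1
    return weights


def overall_tlx(scores, weights):
    """Weighted NASA-TLX score: sum of score x weight over dimensions, divided by 15."""
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / 15


# Hypothetical participant who always prefers the dimension listed first.
choices = {pair: pair[0] for pair in combinations(DIMENSIONS, 2)}
weights = weights_from_choices(choices)

# Hypothetical ratings on the 0 to 100 scale (not study data).
scores = {"mental": 60, "physical": 20, "temporal": 40,
          "performance": 30, "effort": 50, "frustration": 10}
overall = overall_tlx(scores, weights)
```

In this example the weights are 5, 4, 3, 2, 1, and 0, and the overall score is 610/15, or about 40.7. A dimension that is never chosen receives a weight of zero and does not contribute to the overall score.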

Statistical Analysis
Given the small sample size in the user experiment, non-parametric statistical analyses were performed. Continuous variables were reported with medians and interquartile ranges. The Wilcoxon signed-rank test was performed for comparison between testing scenarios with original notes and those with highlighted notes because users were measured repeatedly [17]. Two-tailed p values < 0.05 were considered statistically significant. Statistical analyses were performed using Stata 15.1 (StataCorp, College Station, TX, USA).

Results
Table 1 lists the characteristics of the clinical notes selected for review. The number of progress notes within 9 days following admission ranged from 7 to 17. The average number of lines per progress note varied from 15 to 35 and increased with the number of progress notes (Figure 2A). The inter-rater agreement (kappa value) for manual annotation between the two attending physicians was 0.767, indicating substantial agreement. Based on the results of manual annotation, 34% to 78% of the lines of the progress notes were determined to contain new information. The proportion of lines with new information was negatively correlated with the average number of lines per progress note (Figure 2B).
Table 2 gives the performance of the automated identification of new information across different numbers of preceding notes used in the bigram language model. The highest F1 score (0.833) and accuracy rate (0.814) were achieved when at most four preceding notes were employed to build the bigram language model. Therefore, the optimal number (N) of preceding notes was set to 4.
Clinical notes from cases 1, 5, 6, and 10 (Table 1) were selected for the user experiment. The new information in each note was highlighted using the bigram language model mentioned above. Four physicians and eight nurse practitioners participated in the experiment. Table S1 lists the characteristics of the participants.
Table 3 gives descriptive statistics of task demands, performance, and perceived workload for each testing scenario. No significant differences in task demands were observed between scenarios as quantified by the time to completion as well as the total number of mouse clicks, mouse wheels, or mouse movements. As for task performance, there was no difference between scenarios in the sub-scores for questions regarding the admission note. In contrast, the overall scores and the sub-scores for questions regarding the progress notes were significantly higher in the testing scenario with notes in the highlighted condition. The overall perceived workload of reviewing highlighted notes was significantly lower than that of reviewing original notes. The workload in different dimensions of the NASA-TLX decreased significantly except for the dimension of perceived overall performance.
Table 3. Comparison of task demands, task performance, and perceived workload between scenarios using original notes and highlighted notes. Column headings: Testing Scenario with Original Notes; Testing Scenario with Highlighted Notes; P.
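The paired comparison between scenarios described under Statistical Analysis can be illustrated with a self-contained sketch of the Wilcoxon signed-rank test. It is not the Stata procedure used in the study: it assumes that zero differences are dropped, that tied absolute differences receive average ranks, and that an exact two-sided p value is obtained by enumerating every sign assignment (feasible for twelve participants). Stata may treat zeros and ties differently and may use a normal approximation, so its output can differ. The function name and the example scores are hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank_exact(first, second):
    """Return (W+, exact two-sided p) for paired samples; intended for small n."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # Assumption: pairs with a zero difference are discarded.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, positive in zip(ranks, signs) if positive)
        if abs(w - centre) >= observed - 1e-9:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical paired task scores for six participants (not study data).
original_scores = [70, 65, 80, 75, 60, 85]
highlighted_scores = [80, 75, 85, 90, 70, 80]
w_plus, p_value = wilcoxon_signed_rank_exact(original_scores, highlighted_scores)
```

In this toy example only one participant scored higher with the original notes, giving W+ = 1.5 and an exact two-sided p value of 6/64 (about 0.094).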

Effects of Information Redundancy
This study found that a substantial proportion of clinical notes contained redundant information instead of new information. The proportion of redundant information increased with the size of notes. However, the bigram language model could effectively identify new information based on preceding notes. By highlighting new information in clinical notes, healthcare professionals could more accurately extract relevant information from clinical notes. In the meantime, the perceived workload associated with reviewing clinical notes was significantly reduced even though the task demand did not change.
The user experiment showed that participants performed well in extracting information from admission notes and highlighted progress notes. In contrast, participants were less likely to accurately collect relevant information from original non-highlighted progress notes. This may be because admission notes contained only brand-new information while progress notes contained a lot of redundant information. With so much redundant information, users might be cognitively overloaded and unable to retrieve useful information, thus compromising their performance. This problem is likely to get worse with the increasing number of progress notes. A previous study revealed that the uniqueness of progress notes decreased dramatically over the course of hospitalization, with notes containing only 27.7% unique information at the end of hospitalization [27].

Merits of Highlighting New Information
The performance in extracting relevant information was significantly improved when new information in progress notes was highlighted. Text highlighting enhances not only searching but also reading performance [28]. In addition, the highlighting of new information may help users relieve information overload and concentrate on new information. The effect of highlighting can probably be explained by the psychological theory of human information processing. As proposed by Schneider and Shiffrin [29], "controlled processing" of information allows humans to read and understand information but requires attention and thus has limited capacity. Their experiments showed that when targets and distractors become more similar, the search tasks become more difficult, especially when the number of distractors increases. Similarly, the excessive redundant information in clinical notes can cause information overload and attention deficit. The highlighting of new information increases the contrast between targets and distractors, leading to a decrease in information overload and improvement in clinical reasoning.
Although a nonsignificant decrease in the time to completion was observed when highlighted notes were reviewed, there was no difference in mouse usage between testing scenarios. The possible reasons are as follows: First, clinical notes, whether highlighted or not, were presented to users in full rather than summarized form. Second, all the participants are trained healthcare professionals. They are accustomed to browsing the whole content of clinical notes to extract relevant information. Third, participants might not be confident in the accuracy of highlighting. Therefore, they would rather thoroughly review the clinical notes than skip non-highlighted redundant information.
It is worth noting that, despite similar task demands between testing scenarios, the overall perceived workload as assessed using the NASA-TLX was significantly reduced when new information was highlighted. Interestingly, the perceived overall performance, one of the six dimensions of the NASA-TLX, did not change between testing scenarios (Table 3). This dimension measures how successful the subject is in performing the task and how satisfied the subject is with their own performance. In the user experiment, even though the participants believed they performed equally well in both scenarios, their task performance scores were actually lower with the original notes (Table 3). In other words, the participants were unaware of how the increased workload had affected their performance.

Clinical Implications
The study findings suggest that the interference of redundant information with clinical practice is a real but under-recognized problem for healthcare professionals. This problem may become even worse in real-world settings where interruptions and multitasking are common [30], leading to medical errors and patient safety issues. The widespread use of EMRs has brought about several advantages, such as flexibility in storage and retrieval of data, easy access across different locations, and simultaneous use by multiple users. On the other hand, the use of EMRs is inevitably associated with some drawbacks. For example, it takes longer to read text on a computer screen than to read printed text [31]. Readers may need more cognitive effort to gain the same knowledge from a computer screen than from paper [32]. Furthermore, because of the increase in time spent on completing notes [7,13,14], routine use of copy-and-paste has become highly prevalent among physicians [10,11], thus generating a lot of redundant information in EMRs.
Moreover, the layout and structure of EMRs strongly impact the retrieval of information and sometimes influence clinical decision-making in fundamental ways. Poorly designed interactions with information technology can mislead decision-making and create medical errors, ending in patient harm [33]. An analysis found that 73% of EMR-associated patient safety issues were related to human-computer interaction [34]. Being overwhelmed by less important redundant information and failing to identify relevant new information may further interfere with decision-making [22]. Since it is unrealistic to get rid of all redundant information from EMRs, some measures should be taken to facilitate capturing relevant information more easily. In this regard, highlighting new information can effectively reduce the interference from redundant information and help healthcare professionals grasp key information from EMRs and ameliorate their perceived workload.

Limitations
This study has several limitations. First, this is a preliminary study with a small sample size from a single institution. Further studies that recruit larger samples are warranted to verify the impact on clinical practice. Second, the study took place in a controlled environment with minimum distracting elements. In real-world settings, healthcare professionals generally need to manage multiple patients in a short period of time, where a higher cognitive demand is required. Therefore, whether highlighting of new information is similarly effective or even more effective under real-world clinical practice conditions is open to question. Third, highlighted clinical notes were entirely new to the study participants, who might not have confidence in the accuracy of highlighting. They might even spend more time reading non-highlighted text. This might increase the time to completion and interfere with the accuracy of task performance. Fourth, because four consecutive preceding clinical notes are required to optimally determine whether the information is redundant, highlighting may not be worthwhile for patients with a short hospital stay. Finally, the majority of the study participants and the writers of clinical notes are not native English speakers even though clinical notes in the study hospital are documented in English. Language barriers may result in different behaviors in copying and pasting clinical text and incur extra cognitive load in extracting information.

Conclusions
The use of EMRs brings lots of redundant information, which is potentially harmful to patient safety. Nevertheless, this study shows that automated identification of new information in clinical notes is feasible and practical. Highlighting new information enables users of EMRs to better understand patients' conditions and complete their daily work with less perceived workload. In particular, it may help healthcare professionals grasp key information from clinical notes of unfamiliar patients in situations such as consultations and handovers between services, potentially improving the quality of medical decision-making.