Natural Language Processing-Based Deep Learning to Predict the Loss of Consciousness Event Using Emergency Department Text Records
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This is a well-written paper that proposes the use of natural language processing (NLP) algorithms to more accurately identify key clinical features, such as loss of consciousness. By doing so, healthcare providers can respond more quickly and potentially improve patient outcomes, enhancing the quality of patient care. There are some problems which must be solved before the paper is considered for publication. If the following problems are well addressed, this reviewer believes that the essential contribution of this paper is important for the healthcare sector.
1. In the ABSTRACT, the authors are suggested to start broad with the general background, then narrow in on the relevant topic that will be pursued in the paper. Maybe this part can be improved!
2. In this paper, there are two mentions of Table S1, but the reviewer was unable to find this table.
3. Regarding the issue that "the model trained solely on post-diagnosis notes did not show a sustained performance improvement over training iterations on the training dataset," please provide further clarification.
4. For the Korean text segments in Figure 4, please add some English annotations.
5. In the References section of this paper, some of the cited literature is quite outdated. Please add more recent references that are relevant to the current study.
Author Response
Reviewer #1.
Comments and Suggestions for Authors
This is a well-written paper that proposes the use of natural language processing (NLP) algorithms to more accurately identify key clinical features, such as loss of consciousness. By doing so, healthcare providers can respond more quickly and potentially improve patient outcomes, enhancing the quality of patient care. There are some problems which must be solved before the paper is considered for publication. If the following problems are well addressed, this reviewer believes that the essential contribution of this paper is important for the healthcare sector.
- In the ABSTRACT, the authors are suggested to start broad with the general background, then narrow in on the relevant topic that will be pursued in the paper. Maybe this part can be improved!
: Thank you for the comment. Following the reviewer's suggestion, we began the abstract with a general statement of the importance and challenges of EMRs as the general background, and then adjusted the narrative structure to narrow in on the specific aims, methodology, and key findings of the study.
Correction [page 1, line 15-28, Abstract]
The increasing adoption of electronic medical records (EMRs) presents a unique opportunity to enhance trauma care through data-driven insights. However, extracting meaningful and actionable information from unstructured clinical text remains a significant challenge. Addressing this gap, this study focuses on the application of natural language processing (NLP) techniques to extract injury-related variables and classify trauma patients based on the presence of loss of consciousness (LOC). A dataset of 23,308 trauma patient EMRs, including pre-diagnosis and post-diagnosis free-text notes, was analyzed using a bilingual (English and Korean) pre-trained RoBERTa model. Patients were categorized into four groups based on the presence of LOC and head trauma. To address class imbalance in LOC labeling, deep learning models were trained with weighted loss functions, achieving a high area under the curve (AUC) of 0.91. Local Interpretable Model-agnostic Explanations analysis further demonstrated the model's ability to identify critical terms related to head injuries and consciousness. NLP can effectively identify LOC in trauma patients' EMRs, with weighted loss functions addressing data imbalances. These findings can inform the development of AI tools to improve trauma care and decision-making.
- In this paper, there are two mentions of Table S1, but the reviewer was unable to find this table.
: Thank you for the comment. Supplemental Table S1 appears to have gone missing during the upload process. We have added Table S1 as shown below.
Table S1. Training parameters of the model and methods.
| Parameter | Search space | Selected value |
| --- | --- | --- |
| Architecture | klue/Roberta-[small, base] | klue/Roberta-small |
| Optimizer | AdamW | AdamW |
| Loss function | Cross-entropy, Focal loss | Cross-entropy (weighted) |
| Batch size | 16, 32, 64, 128 | 16 |
| Learning rate | [1e-04, 1e-08] | 2.4e-06 |
| Learning scheduler | ReduceLROnPlateau | ReduceLROnPlateau |
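As an illustration of the "Cross-entropy (weighted)" entry in Table S1, below is a minimal PyTorch sketch of how a weighted cross-entropy loss for the four LOC groups might be set up. The inverse-frequency weighting and all variable names are illustrative assumptions, not the authors' exact implementation; the class counts are taken from the Table S1 added in Round 2.

```python
import torch
import torch.nn as nn

# Class counts for the four LOC groups (C1, C2, C3, C99), from Table S1 (Round 2).
class_counts = torch.tensor([15775.0, 6240.0, 780.0, 513.0])

# Inverse-frequency weights: minority classes (C3, C99) contribute more per sample.
# The paper does not specify its exact weighting scheme; this is one common choice.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(16, 4)          # dummy model outputs: batch of 16, 4 classes
labels = torch.randint(0, 4, (16,))  # dummy ground-truth labels
loss = criterion(logits, labels)
print(loss.item())
```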
- Regarding the issue that "the model trained solely on post-diagnosis notes did not show a sustained performance improvement over training iterations on the training dataset," please provide further clarification.
: Thank you for the comment. The conclusions of this study, stated at the beginning of the Discussion section, are reproduced below (highlighted).
Correction [page 8, line 251-252, Discussion]
In this study, we evaluated the performance of an NLP-based RoBERTa model to automatically identify a patient's LOC within the EMR of the ED. The model effectively determined LOC status by analyzing both pre- and post-diagnosis notes, leveraging specific words and contextual cues in these free-text medical records. Our comparative analysis highlighted the importance of pre-diagnosis notes, which provide valuable information for assessing LOC. These results demonstrate the model's capability to accurately identify injury-related variables, specifically LOC, within unstructured ED documentation.
In other words, we conclude that the post-diagnosis notes did not provide additional information for classifying LOC or head trauma compared with the pre-diagnosis notes. We have added a sentence to the Results section to emphasize this point.
Correction [page 6, line 220-222, Results]
In other words, the post-diagnosis notes did not provide additional information for classifying LOC or head trauma compared with the pre-diagnosis notes.
- For the Korean text segments in Figure 4, please add some English annotations.
: Thank you for the comment. We decided that the unintelligible text in the figure could confuse readers, so we translated the vocabulary in Figure 4 into English and moved the Korean version to a Supplemental Figure.
- In the References section of this paper, some of the cited literature is quite outdated. Please add more recent references that are relevant to the current study.
: Thank you for the comment. We have incorporated more recent references that enhance the understanding of the clinical background and the characteristics of the data used in this study.
Reviewer 2 Report
Comments and Suggestions for Authors
The presented study aims to extract injury-related variables from medical records to perform a preliminary classification of patients, specifically focusing on cases involving loss of consciousness and head trauma. The authors also investigate the optimal data source (pre-diagnosis, post-diagnosis, or combined) for achieving high classification performance, a valuable consideration for potential real-world implementation.
ABSTRACT
It is not clear which classes the approach aims at identifying. Indeed, it initially suggests a focus on detecting loss of consciousness but later refers to head trauma. A clearer and more concise statement of the study's objective is needed.
INTRODUCTION
A comprehensive literature review is missing. If this is indeed a novel research direction, a discussion of the reasons why this approach has not been previously explored would be insightful to understand the contributions and relevance of the work.
A dedicated subsection should explicitly outline the novel contributions of the study.
For instance, the ability to handle multilingual text is a significant contribution.
MATERIAL AND METHODS
The current methodology relies on a single-shot holdout approach. To obtain a more reliable and generalizable assessment of the model's performance, k-fold cross-validation is recommended.
A brief overview of the RoBERTa model, including its architecture and key components, should be included in the introduction or materials and methods section.
RESULTS
The study's classification objective is not explicitly defined. It is unclear whether the goal is to distinguish between "no injury" and "loss of consciousness/head trauma" or to further categorize cases into "loss of consciousness only" and "head trauma only."
The practice of assessing model performance on both training and test sets is not standard. Typically, only the test set is used to evaluate a model's generalization ability. Why do the authors consider the performance on both sets?
The "C-x" conditions require a clearer definition.
Figure 4, presented in Chinese, is not accessible to many readers. A translation or an alternative visualization is necessary.
DISCUSSION
The potential benefits of identifying the optimal data source (pre-diagnosis, post-diagnosis, or combined) should be more explicitly discussed. Are the authors aiming to develop a rapid classification system based on pre-diagnosis data? Or are they exploring strategies to reduce computational complexity by identifying redundant information?
It would be valuable to provide information about the number of healthcare professionals involved in creating the medical records. Differences in writing style and terminology among clinicians can potentially impact the model's performance.
Author Response
Reviewer #2
The presented study aims to extract injury-related variables from medical records to perform a preliminary classification of patients, specifically focusing on cases involving loss of consciousness and head trauma. The authors also investigate the optimal data source (pre-diagnosis, post-diagnosis, or combined) for achieving high classification performance, a valuable consideration for potential real-world implementation.
ABSTRACT
- It is not clear which classes the approach aims at identifying. Indeed, it initially suggests a focus on detecting loss of consciousness but later refers to head trauma. A clearer and more concise statement of the study's objective is needed.
: Thank you for the comment. Following the reviewer's suggestion, we changed the abstract to the following:
Correction [page 1, line 15-28, Abstract]
The increasing adoption of electronic medical records (EMRs) presents a unique opportunity to enhance trauma care through data-driven insights. However, extracting meaningful and actionable information from unstructured clinical text remains a significant challenge. Addressing this gap, this study focuses on the application of natural language processing (NLP) techniques to extract injury-related variables and classify trauma patients based on the presence of loss of consciousness (LOC). A dataset of 23,308 trauma patient EMRs, including pre-diagnosis and post-diagnosis free-text notes, was analyzed using a bilingual (English and Korean) pre-trained RoBERTa model. Patients were categorized into four groups based on the presence of LOC and head trauma. To address class imbalance in LOC labeling, deep learning models were trained with weighted loss functions, achieving a high area under the curve (AUC) of 0.91. Local Interpretable Model-agnostic Explanations analysis further demonstrated the model's ability to identify critical terms related to head injuries and consciousness. NLP can effectively identify LOC in trauma patients' EMRs, with weighted loss functions addressing data imbalances. These findings can inform the development of AI tools to improve trauma care and decision-making.
INTRODUCTION
- A comprehensive literature review is missing. If this is indeed a novel research direction, a discussion of the reasons why this approach has not been previously explored would be insightful to understand the contributions and relevance of the work.
A dedicated subsection should explicitly outline the novel contributions of the study.
For instance, the ability to handle multilingual text is a significant contribution.
: Thank you for the comment. We have added a few sentences addressing this to the Introduction section, as follows:
Correction [page 2, line 57-60 & line 68-70, Introduction]
However, the lack of real-world datasets for multilingual text processing such as EMRs, the computational power required to handle complex content, and the lack of research on whether multilingual clinical NLP has helped with healthcare decisions have made it difficult to conduct multilingual NLP studies using EMRs.
This study aims to develop NLP algorithms for the automatic identification of injury-related variables within free-text medical records from emergency department (ED) settings. Specifically, the goal is to classify patients into four categories based on the presence or absence of head trauma and loss of consciousness (LOC). This classification not only identifies cases with LOC but also distinguishes them from other head trauma-related conditions, including cases with no LOC or missing LOC information. By providing a granular categorization of injury-related variables, the study supports clinical decision-making and contributes to more targeted trauma care. In addition, this study explores the feasibility of developing a multilingual NLP model (English and Korean) capable of processing and interpreting EMR data across diverse clinical contexts.
MATERIAL AND METHODS
- The current methodology relies on a single-shot holdout approach. To obtain a more reliable and generalizable assessment of the model's performance, k-fold cross-validation is recommended.
: Thank you for the comment. The reviewer’s comment emphasizes the critical importance of validating the objective performance of the proposed model, and we fully agree with the need for such analysis. To address this, we conducted 5-fold cross-validation to evaluate the model and performed 10 repeated experiments to ensure that the performance metrics reported in our study are not biased.
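As a hedged sketch of the evaluation protocol described above (5-fold cross-validation with 10 repeated experiments), the skeleton below shows one way to organize the runs; `train_and_evaluate` is a hypothetical placeholder for the authors' RoBERTa fine-tuning and scoring routine, not code from the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def train_and_evaluate(train_idx, test_idx):
    """Hypothetical placeholder: fine-tune the model on train_idx
    and return the AUROC measured on test_idx."""
    return float(np.random.uniform(0.85, 0.95))  # stand-in score

X = np.arange(23308)                      # one index per patient record
y = np.random.randint(0, 4, size=len(X))  # stand-in labels for the 4 LOC groups

scores = []
for repeat in range(10):                  # 10 repeated experiments
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    for train_idx, test_idx in skf.split(X, y):
        scores.append(train_and_evaluate(train_idx, test_idx))

print(f"AUROC over 10 x 5 runs: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```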
- A brief overview of the RoBERTa model, including its architecture and key components, should be included in the introduction or materials and methods section.
: Thank you for the comment. A brief description of the RoBERTa model and its characteristics has been included in the manuscript to justify its selection as the most suitable model for the data used in this study.
Correction [page 3, line 120-129, Methods]
The RoBERTa model is developed based on BERT, which employs the Transformer architecture. By optimizing the training process of the BERT model, RoBERTa achieves enhanced performance [10]. RoBERTa leverages byte pair encoding and dynamic masking, demonstrating superior performance compared to BERT and proving its suitability for handling complex sentence processing [ref]. The pre-diagnosis notes in the medical records exhibit a high proportion of non-professional terms because they are derived from patients' accounts of their experiences and the circumstances of injury. To achieve accurate classification performance, we employed byte pair encoding (BPE) to decompose sentences into sub-word units. We therefore utilized the RoBERTa model to develop the classification model.
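To make the sub-word segmentation concrete, here is a minimal sketch that loads the klue/roberta-small checkpoint named in Table S1 via Hugging Face Transformers; the sample sentence is invented, and this illustrates tokenizer behavior rather than the authors' preprocessing pipeline.

```python
from transformers import AutoTokenizer

# Tokenizer of the checkpoint selected in Table S1.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")

# Invented ED-note fragment; rare or colloquial words are split into sub-word units.
text = "Patient fell from a ladder, brief LOC reported."
print(tokenizer.tokenize(text))

# Encoded tensor input as it would be fed to the RoBERTa classifier.
encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"].shape)
```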
RESULTS
- The study's classification objective is not explicitly defined. It is unclear whether the goal is to distinguish between "no injury" and "loss of consciousness/head trauma" or to further categorize cases into "loss of consciousness only" and "head trauma only."
: Thank you for the comment. We added the following sentence to the ‘Introduction’ section to help readers understand the goals of this study.
Correction [page 2, line 61-70, Introduction]
This study aims to develop NLP algorithms for the automatic identification of injury-related variables within free-text medical records from emergency department (ED) settings. Specifically, the goal is to classify patients into four categories based on the presence or absence of head trauma and loss of consciousness (LOC). This classification not only identifies cases with LOC but also distinguishes them from other head trauma-related conditions, including cases with no LOC or missing LOC information. By providing a granular categorization of injury-related variables, the study supports clinical decision-making and contributes to more targeted trauma care.
- The practice of assessing model performance on both training and test sets is not standard. Typically, only the test set is used to evaluate a model's generalization ability. Why do the authors consider the performance on both sets?
: Thank you for the comment. We completely agree with the reviewer's comments. We included the AUROC of the validation dataset only to show the reader that the performance on the test dataset was neither overfitted nor underfitted, with no other intention. For now, we have left these results in; if the reviewer still thinks it best to remove them, we will amend the manuscript further.
- The "C-x" conditions require a clearer definition.
: Thank you for the comment. The “C-x” condition is related to labeling information, and we have added more information about this in the text and figure.
- Figure 4, presented in Chinese, is not accessible to many readers. A translation or an alternative visualization is necessary.
: Thank you for the comment. We decided that the unintelligible text in the figure could confuse readers, so we translated the vocabulary in Figure 4 into English and moved the Korean version to a Supplemental Figure.
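Since the abstract attributes the term-level importance shown in Figure 4 to a Local Interpretable Model-agnostic Explanations (LIME) analysis, below is a hedged sketch of how such explanations can be generated for a transformer text classifier. The `predict_proba` wrapper, the untuned checkpoint, the class names, and the example sentence are all illustrative assumptions, not the authors' code.

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-small", num_labels=4)  # untuned weights, for illustration only
model.eval()

def predict_proba(texts):
    """Map a list of note strings to class probabilities, as LIME expects."""
    enc = tokenizer(list(texts), return_tensors="pt",
                    padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["C1", "C2", "C3", "C99"])
exp = explainer.explain_instance(
    "Patient struck head on pavement, brief loss of consciousness.",  # invented
    predict_proba, num_features=10)
print(exp.as_list())  # (term, weight) pairs of the kind visualized in Figure 4
```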
DISCUSSION
- The potential benefits of identifying the optimal data source (pre-diagnosis, post-diagnosis, or combined) should be more explicitly discussed. Are the authors aiming to develop a rapid classification system based on pre-diagnosis data? Or are they exploring strategies to reduce computational complexity by identifying redundant information?
: Thank you for the comment. In response to the reviewer's comments, we have added the following discussion about the optimal data source.
Correction [page 9, line 281-291, Discussion]
The choice of data source—pre-diagnosis, post-diagnosis, or combined notes—plays a critical role in the development of efficient and accurate NLP-based classification models. Pre-diagnosis notes, which are recorded upon a patient’s arrival at the emergency department, contain a high proportion of patient-reported symptoms and triage assessments. These notes are available earlier in the clinical workflow, making them particularly valuable for rapid classification. Our findings demonstrate that models trained solely on pre-diagnosis notes achieved comparable performance to those trained on combined notes. This suggests that pre-diagnosis data alone can effectively support the rapid identification of LOC information, especially in time-sensitive scenarios where immediate clinical decisions are critical. In addition, utilizing pre-diagnosis data exclusively can offer a practical advantage by reducing data processing requirements and computational complexity.
- It would be valuable to provide information about the number of healthcare professionals involved in creating the medical records. Differences in writing style and terminology among clinicians can potentially impact the model's performance.
: Thank you for the comment. We found that 12-13 ED physicians were working during this period, but our study did not include variables that would have allowed us to identify differences in individual style or terminology, so we were unable to conduct further analysis in this area. We have described this further in the 'Limitation' section.
Correction [page 10, line 331-335, Discussion]
Lastly, differences in writing style and terminology among ED clinicians may influence the model's performance. However, we did not capture this stylistic and terminological information in our study. Further studies should incorporate such variables to evaluate and mitigate potential biases stemming from inter-clinician variability in EMR documentation.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the effort of the authors in revising the paper. This new version of the manuscript has significantly improved in terms of clarity and relevance. By explaining the meaning of the classes, highlighting the innovative contribution, detailing the classification method, and translating the Korean-language figures, the authors have made the work more accessible and impactful. Also, from an applicative perspective, I believe that the finding that the model can effectively utilize pre-diagnosis data is particularly promising and has the potential to contribute to early disease detection.
However, a few concerns are still present and must be addressed:
1. To enhance the clarity of the dataset, a Table detailing the composition of the train and test sets would be beneficial. This Table should include the number of patients per class and the percentage of patients in each class allocated to the train and test sets.
2. To provide a more comprehensive understanding of the model's performance, the results in Table 2 of the new version of the manuscript should be reported with confidence intervals or standard deviations for the various k-fold phases.
In addition, minor concerns may also be addressed:
1. An interesting extension of this work would be to evaluate the model's performance when retrained solely on pre-diagnosis data. This analysis could provide valuable insights into the model's ability to identify potential health issues early on.
2. Given that the model translates all text into English before classification, it would be insightful to discuss the potential impact of using a monolingual dataset. If the authors anticipate similar performance, it would be valuable to compare their approach to other methods in the literature that employ similar techniques, such as:
i) Yang, Zhongliang, et al. "Clinical assistant diagnosis for electronic medical record based on convolutional neural network." Scientific reports 8.1 (2018): 6329.
ii) Kavuluru, Ramakanth, Anthony Rios, and Yuan Lu. "An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records." Artificial intelligence in medicine 65.2 (2015): 155-166.
Author Response
Q1. To enhance the clarity of the dataset, a Table detailing the composition of the train and test sets would be beneficial. This Table should include the number of patients per class and the percentage of patients in each class allocated to the train and test sets.
A1. To clarify the data splitting used for model training and testing, a new Table S1 has been created and is presented in the Results section, showing the patient distribution across LOC groups. Table S1 includes the data splitting ratio and the number of patients in each group within the training and test datasets.
(Correction, page 4, line 177)
“… splitting for model training.” To “… splitting for model training (Table S1).”
Table S1. Distribution of patients across the entire dataset based on LOC status.
| LOC status | Total | Training | Test |
| --- | --- | --- | --- |
| C1 | 15,775 | 11,042 | 4,733 |
| C2 | 6,240 | 4,368 | 1,872 |
| C3 | 780 | 546 | 234 |
| C99 | 513 | 359 | 154 |
| (Split ratio, %) | (100) | (70) | (30) |
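As a sketch of how the 70/30 stratified split summarized in Table S1 can be reproduced, consider the following; the label array is reconstructed from the class counts in the table, while the random seed and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from Table S1 (C1, C2, C3, C99).
counts = {"C1": 15775, "C2": 6240, "C3": 780, "C99": 513}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])
X = np.arange(len(y))  # stand-in indices for the 23,308 patient records

# A stratified 70/30 split preserves per-class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

for label in counts:
    print(label, int((y_train == label).sum()), int((y_test == label).sum()))
```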
Q2. To provide a more comprehensive understanding of the model's performance, the results in Table 2 of the new version of the manuscript should be reported with confidence intervals or standard deviations for the various k-fold phases.
A2. The standard deviations for the performance validated through 5-fold cross-validation are presented in Table S3. The results in Table S3 indicate that pre-diagnosis notes play a crucial role in LOC classification. To provide a comprehensive understanding of the importance of pre-diagnosis notes, additional content has been included in the Discussion section.
(Correction, page 9, line 288-290)
… combined notes. Notably, significant differences in model performance were observed when using only pre-diagnosis notes or only post-diagnosis notes, as evaluated through 5-fold cross-validation (Student's t-test, p < 1.16e-119) (Table S3). This suggests …
Table S3. Classification performance according to input data evaluated using 5-fold cross-validation.
| Input data | AUROC | Accuracy | Precision | Recall | Weighted F1 |
| --- | --- | --- | --- | --- | --- |
| Pre-/post-diagnosis notes | 0.914±0.004 | 0.894±0.009 | 0.674±0.026 | 0.746±0.010 | 0.900±0.007 |
| Pre-diagnosis note | 0.907±0.004 | 0.886±0.009 | 0.664±0.024 | 0.742±0.012 | 0.892±0.007 |
| Post-diagnosis note | 0.747±0.005 | 0.785±0.007 | 0.572±0.062 | 0.461±0.011 | 0.777±0.005 |
Performance scores are reported as the mean ± standard deviation.
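To show how the mean ± SD entries in Table S3 and the t-test in the added Discussion sentence can be computed, here is a hedged sketch; the per-fold AUROC values are invented stand-ins, since only the aggregated statistics are published.

```python
import numpy as np
from scipy import stats

# Invented per-fold AUROC values, roughly consistent with Table S3's mean +/- SD;
# the actual fold-level scores are not published.
pre_only = np.array([0.905, 0.910, 0.903, 0.911, 0.906])
post_only = np.array([0.745, 0.752, 0.741, 0.749, 0.748])

print(f"pre-diagnosis only : {pre_only.mean():.3f} +/- {pre_only.std(ddof=1):.3f}")
print(f"post-diagnosis only: {post_only.mean():.3f} +/- {post_only.std(ddof=1):.3f}")

# Two-sample Student's t-test between the two input configurations,
# mirroring the comparison cited in the Discussion.
t, p = stats.ttest_ind(pre_only, post_only)
print(f"t = {t:.2f}, p = {p:.2e}")
```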
(Minor comments)
Q3. An interesting extension of this work would be to evaluate the model's performance when retrained solely on pre-diagnosis data. This analysis could provide valuable insights into the model's ability to identify potential health issues early on.
A3. Thank you for the comment. We are not sure whether we understood the question correctly, but in the revised results in Table 2 and Table S3, we present the performance for training with pre-diagnosis notes only, training with post-diagnosis notes only, and training with both together. The results show that the pre-diagnosis notes carry more LOC-related information than the post-diagnosis notes. If this does not adequately answer the question, please elaborate on it once more.
Q4. Given that the model translates all text into English before classification, it would be insightful to discuss the potential impact of using a monolingual dataset. If the authors anticipate similar performance, it would be valuable to compare their approach to other methods in the literature that employ similar techniques, such as:
- i) Yang, Zhongliang, et al. "Clinical assistant diagnosis for electronic medical record based on convolutional neural network." Scientific reports 8.1 (2018): 6329.
- ii) Kavuluru, Ramakanth, Anthony Rios, and Yuan Lu. "An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records." Artificial intelligence in medicine 65.2 (2015): 155-166.
A4. Thank you for the comment. We are also curious about the impact of utilizing a monolingual dataset. In the case of the papers you mentioned, we expect that the results would have been better if the transformer-based RoBERTa model had been used. However, we believe that a discussion of this expectation could cloud the results of the study, so we have added a short comment on bilinguality below.
(Correction, page 9-10, line 319-328)
In this study, we used a bilingual (English and Korean) dataset, in which all text was translated into English prior to the classification task. This approach was chosen to leverage pre-trained language models like RoBERTa, which are optimized for English text processing. Our model effectively captures contextual information, achieving a high AUROC score of 0.91. However, it is important to consider the potential impact of using monolingual datasets directly, particularly in scenarios where translation may result in the loss of nuanced clinical details specific to the source language. Therefore, the integration of multilingual and monolingual approaches presents a promising avenue for future research. Comparing performance metrics across translation-based and native-language models will help determine the optimal approach for specific clinical contexts.
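The manuscript does not name the translation tooling used for the Korean-to-English step. Purely as a hypothetical sketch under that assumption, a pre-translation stage could look like the following, using a public MarianMT checkpoint:

```python
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical choice of a public Korean->English model; the paper does not
# state which translation system was actually used.
checkpoint = "Helsinki-NLP/opus-mt-ko-en"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

def translate_ko_to_en(texts):
    """Translate a batch of Korean note fragments into English."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Invented example: "The patient briefly lost consciousness."
print(translate_ko_to_en(["환자가 잠시 의식을 잃었다."]))
```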