International Classification of Diseases Prediction from MIMIC-III Clinical Text Using Pre-Trained ClinicalBERT and NLP Deep Learning Models Achieving State of the Art

Abstract: The International Classification of Diseases (ICD) is a widely used framework for assigning diagnosis codes to patients' electronic health records. These codes summarize the diagnoses and procedures recorded during a patient’s hospitalisation. This study aims to build a predictive model for ICD codes based on clinical text from the Medical Information Mart for Intensive Care III (MIMIC-III), a sizable, de-identified, and publicly accessible repository of medical records. Leveraging natural language processing techniques and deep learning architectures, we constructed a pipeline to distill pertinent information from the MIMIC-III dataset. Our method predicts diagnosis codes from unstructured data, such as discharge summaries and notes describing symptoms. We used state-of-the-art deep learning algorithms, such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, bidirectional LSTM (BiLSTM), and BERT models, after tokenizing the clinical text with Bio-ClinicalBERT, a pre-trained model from Hugging Face. To evaluate the efficacy of our approach, we conducted experiments using the discharge dataset within MIMIC-III. With the BERT model, our methodology predicted the top 10 and top 50 diagnosis codes within the MIMIC-III dataset with average accuracies of 88% and 80%, respectively. In comparison to recent studies by Biseda and Kerang, as well as Gangavarapu, which reported F1 scores of 0.72 in predicting the top 10 ICD-10 codes, our model demonstrated better performance, with an F1 score of 0.87. Similarly, in predicting the top 50 ICD-10 codes, previous research achieved an F1 score of 0.75, whereas our method attained an F1 score of 0.81. These results underscore the better performance of deep learning models over conventional machine learning approaches in this domain.
The ability to predict diagnoses early from clinical notes holds promise in assisting physicians in determining effective treatments, thereby reshaping the conventional paradigm of diagnosis-then-treatment care. Our code is available online.


Introduction
Background
The MIMIC-III database stands as a significant tool for researchers, clinicians, and students keen on delving into critical care medicine to enhance patient outcomes [1]. It offers access to real-world data, enabling the examination and testing of hypotheses concerning the treatment of critically ill patients. With its application in over 1000 research studies and citations in more than 3500 scientific papers, its impact on medical research is profound. A distinct aspect of the MIMIC-III database is its inclusion of detailed clinical notes [2]. These notes, produced by healthcare providers, offer narrative accounts of patient care, presenting deep insights into the management of critically ill patients. These narratives are instrumental in uncovering trends and patterns in patient treatment, enriching the database's value for research purposes.
Perhaps the most well-known work on ICD prediction using the MIMIC-III dataset is the 2018 study by James Mullenbach et al., entitled "Explainable Prediction of Medical Codes from Clinical Text". This paper is renowned for introducing and applying the CAML (Convolutional Attention for Multi-Label Classification) model, which combines convolutional neural networks (CNNs) with an attention mechanism to predict ICD codes from clinical text. This work was among the first to incorporate explainability into ICD prediction models, a crucial advancement for fostering trust and understanding in healthcare applications that depend on AI predictions. Although it is no longer the top performer in terms of raw accuracy metrics, owing to continual improvements in the field, the CAML model demonstrated competitive results on MIMIC-III at the time. Its explainability features underscored its significance [3]. This work has spurred further research in the domain of medical code prediction from clinical texts, influencing methodologies across both academic and practical healthcare settings. Notably, the study titled "An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment Using MIMIC-III Clinical Notes" has become a cornerstone in the field, comparing various deep learning approaches for ICD-9 code prediction and establishing a benchmark for future research. It underscores the potential of deep learning to automate ICD coding tasks [4].
Another influential subsequent work is the 2021 paper "TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding" [5]. This study introduces a transformer-based architecture named TransICD, which employs a code-wise attention mechanism. This mechanism enables the model to concentrate on specific segments of the clinical notes that are pertinent to each ICD code prediction. The paper is notable for its high micro-AUC score of 0.923, although it does not detail the exact F1 scores for the top 10 and top 50 ICD predictions.
It is also important to acknowledge the foundational work in NLP deep learning models, particularly the transformer architecture introduced by Vaswani et al. in their landmark 2017 paper, "Attention is All You Need" [6]. While this paper did not focus on clinical applications, it has significantly influenced NLP through its effectiveness in tasks like machine translation and text summarization. Understanding transformers is essential for grasping many clinical NLP studies that utilize this architecture. Subsequently, Google AI's development of BERT in 2018, based on the transformer's encoder mechanism, marked a pivotal shift in how contextual information is processed in NLP models. Moreover, recent research has explored further innovations based on the transformer architecture, such as the Transformer-in-Transformer (TNT) model, which offers a novel approach to visual recognition tasks. Although the TNT model is primarily designed for visual tasks, its methodological innovations provide useful parallels for text-based applications like ICD prediction [7]. Similarly, the Multi-Generator Orthogonal GAN (MGO-GAN) introduces a novel approach utilizing multiple generators to enhance output diversity. This method could analogously enhance the diversity in ICD code prediction from clinical texts, potentially capturing a broader array of diagnoses from complex medical narratives [8].
In this context, our current paper utilizes various deep learning models, including RNN, LSTM, and BERT, to predict ICD codes from clinical text data in the MIMIC-III dataset. Our focus is on comparing their performance, particularly against the transformer-based BERT model, which remains a benchmark in many NLP tasks [9].

Exploratory Data Analysis
The MIMIC-III dataset covers a broad spectrum of patient demographics, with a predominance of older adults and males. It encompasses a wide array of clinical notes, diagnostic codes, and additional related details. The figures summarize the exploratory data analysis. Figure 1 illustrates the age distribution of patients through a histogram, with a pronounced peak in the 60-70 age bracket. This suggests a predominant grouping of patients within this age interval. The distribution is concentrated towards the right of the age axis, indicating a larger share of older patients than younger ones. Figure 2 shows a bar chart of the gender distribution within the dataset, comparing male (M) and female (F) patients. The male patient count is noticeably higher, as seen in the taller bar for males, highlighting a gender disparity in the dataset. Figure 4 features a bar chart displaying, as an example, the ten most common ICD-9 diagnosis codes. The chart, with the y-axis for occurrence counts and the x-axis for the codes, shows a clear standout, with the code 401.9 occurring significantly more often than its counterparts. These visualizations and statistics can help researchers and analysts better understand the characteristics and structure of the MIMIC-III dataset before conducting further analyses.
For our study, two relevant tables are considered: note events and ICD-9 diagnosis. The note events table has more than 2 million rows and columns for patient ID, admission ID, and discharge note text. The notes contain details such as medical history, symptoms, medications, lab tests, hospital course, and final diagnosis, including the ICD-9 code given by doctors. The ICD diagnosis table has 651,000 rows and columns for patient ID, admission ID, and ICD-9 diagnosis codes. There are 6984 unique codes. Each time a patient is admitted, they may receive between 1 and 38 diagnosis codes, whose order indicates the importance of their conditions and reasons for their visit. In summary, the two key tables contain patient admission records with unstructured discharge note text and structured ICD-9 diagnosis codes for analysis and mapping between text and codes. Table 1 describes the size of the tables and their respective unique values in the initial dataset.

Data Processing
The first step was to examine the list of ICD-9 diagnosis codes present in the MIMIC-III dataset. Subsequently, these codes were matched with their respective ICD-10 counterparts, and the accuracy of this mapping was validated using a Python script. After that, the notes and diagnosis tables from MIMIC-III were merged based on unique patient and hospital admission IDs. This created a unified dataset with each patient's admission ID, ICD-10 codes, and discharge summary text. The data were then filtered to create multiple datasets: one with the top 10 ICD-10 codes by frequency, one with the top 50, and one with all codes.
The distributions across these datasets were compared. To mitigate potential out-of-memory issues when processing the full dataset, randomized samples of 30%, 70%, and 100% of the full dataset were taken. This allows initial testing on smaller sizes before scaling up. The result of these steps was a set of processed and sampled datasets containing patients' admission IDs, ICD-10 codes, and textual discharge summaries, ready for the application of natural language processing and machine learning models to predict diagnosis codes from the text. The multiple sampled datasets allow model performance to be tested at different data volumes. Table 2 describes the diagnosis statistics.
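The processing steps described in this section can be sketched as follows. This is a minimal illustration and not the script used in the study: the rows are hypothetical stand-ins for the two MIMIC-III tables, the field names (`subject_id`, `hadm_id`, `icd9`, `text`) and helper names are assumptions, and the three-entry ICD-9 to ICD-10 mapping is illustrative only.

```python
import random
from collections import Counter

# Hypothetical miniature stand-ins for the notes and diagnosis tables.
notes = [
    {"subject_id": 1, "hadm_id": 100, "text": "chest pain, hypertension"},
    {"subject_id": 2, "hadm_id": 200, "text": "shortness of breath"},
    {"subject_id": 3, "hadm_id": 300, "text": "elevated blood pressure"},
]
diagnoses = [
    {"subject_id": 1, "hadm_id": 100, "icd9": "4019"},
    {"subject_id": 1, "hadm_id": 100, "icd9": "4280"},
    {"subject_id": 2, "hadm_id": 200, "icd9": "4280"},
    {"subject_id": 3, "hadm_id": 300, "icd9": "4019"},
    {"subject_id": 3, "hadm_id": 300, "icd9": "25000"},
]
# Illustrative mapping only; a real mapping covers thousands of codes.
ICD9_TO_ICD10 = {"4019": "I10", "4280": "I50.9", "25000": "E11.9"}


def merge_tables(notes, diagnoses, mapping):
    """Join notes and diagnoses on (patient ID, admission ID), mapping ICD-9 to ICD-10."""
    codes_by_admission = {}
    for row in diagnoses:
        icd10 = mapping.get(row["icd9"])
        if icd10 is not None:  # skip codes without a known ICD-10 counterpart
            key = (row["subject_id"], row["hadm_id"])
            codes_by_admission.setdefault(key, []).append(icd10)
    merged = []
    for note in notes:
        key = (note["subject_id"], note["hadm_id"])
        if key in codes_by_admission:
            merged.append({"hadm_id": note["hadm_id"],
                           "text": note["text"],
                           "icd10": codes_by_admission[key]})
    return merged


def keep_top_k(merged, k):
    """Keep only the k most frequent codes and drop admissions left without a code."""
    counts = Counter(code for row in merged for code in row["icd10"])
    top = {code for code, _ in counts.most_common(k)}
    filtered = []
    for row in merged:
        kept = [code for code in row["icd10"] if code in top]
        if kept:
            filtered.append({**row, "icd10": kept})
    return filtered


def sample_fraction(rows, fraction, seed=0):
    """Draw a reproducible random sample, e.g. 30% or 70% of the rows."""
    size = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)


merged = merge_tables(notes, diagnoses, ICD9_TO_ICD10)
top2 = keep_top_k(merged, k=2)
subset = sample_fraction(merged, 0.7)
```

On real data the same join would more likely be done with a dataframe library; plain dictionaries are used here only to keep the sketch self-contained.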

Methodology
Our methodology consists of the following steps: data pre-processing, building the language model, and building the classifier model. Specifically, we use Python 3.10 for data preprocessing and Python, NumPy, Pandas, and Sklearn for feature extraction. PyTorch is the main framework for training and testing models. We used Jupyter Notebooks to run our experiments on a private cloud platform called Runpod.io. Figure 5 describes the methodology used. The diagram gives an overview of our pipeline for processing and classifying medical text notes from electronic health records using machine learning models. The stages of the pipeline are briefly explained as follows.
MIMIC-III Database: This is a publicly available dataset that contains de-identified health-related data associated with over forty thousand patients who stayed in critical care units. The pipeline uses two main tables from this database:
- NOTEEVENTS: This table includes admission text notes, which are free-text descriptions of patient encounters.
- DIAGNOSIS-ICD: This table lists the ICD-9 diagnosis codes for the conditions diagnosed during the hospital stay.
Data Pre-Processing: Relevant data from the NOTEEVENTS and DIAGNOSIS-ICD tables are merged to create a new dataset. Stop words (commonly used words that usually do not contain important meaning, like "the", "is", etc.) are removed from the text to reduce noise and focus on significant words. The most common ICD-9 codes (top 10/50) are extracted and then mapped to ICD-10, which is a more current and detailed classification system for medical diagnoses. The text from the notes is tokenized, which means it is split into meaningful pieces (tokens) such as words or terms, and then these tokens are associated with the corresponding diagnostic labels (this process is called label encoding).
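Stop-word removal and label encoding can be illustrated with a short sketch. The stop-word list, the regular expression used to split words, and the function names are assumptions made for this example, and it assumes one diagnosis label per note; the study itself tokenizes with Bio-ClinicalBERT rather than with this simple word splitter.

```python
import re

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "with", "was", "to"}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_label_index(labels):
    """Assign a stable integer ID to each distinct diagnosis code."""
    return {code: idx for idx, code in enumerate(sorted(set(labels)))}


def encode_labels(labels, label_index):
    """Replace each diagnosis code with its integer ID (label encoding)."""
    return [label_index[code] for code in labels]


tokens = remove_stop_words("The patient is admitted with hypertension.")
labels = ["I10", "I50.9", "I10"]
label_index = build_label_index(labels)
encoded = encode_labels(labels, label_index)
```

Sorting the codes before numbering them keeps the encoding reproducible across runs, which matters when encoded datasets are saved and reused.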
Data Modeling: Bio-ClinicalBERT [10] is utilized as the primary tokenizer for the text notes. This is a version of the BERT model that has been pre-trained on biomedical and clinical text, making it more effective for understanding medical language. Classifier models are built, and their hyperparameters are tuned. Hyperparameters are the settings for the algorithm that guide the training process and are set before the training starts.
Classifier: Tokens generated from the text are used for classification purposes. The main classifier models are recurrent neural networks (RNNs), long short-term memory (LSTM) networks, bidirectional LSTM (BiLSTM) [11], and BERT (Bidirectional Encoder Representations from Transformers). These are different neural network architectures commonly used in natural language processing tasks. The performance of these models is analyzed using metrics like the F1 score (the harmonic mean of precision and recall, which balances the two), precision (the number of true positive results divided by the number of all results predicted as positive), and recall (the number of true positive results divided by the number of positives that should have been retrieved). Overall, this pipeline is a structured approach to converting free-text medical notes into structured data that can be analyzed and used for various purposes, such as predicting diagnoses, by leveraging advanced machine learning techniques.

Experimental Setup
Data Splitting: The dataset was split into 80% training data and 20% test data using the scikit-learn library in Python. This ensured that we had sufficient data to train the models while retaining a subset to evaluate performance. The train-test split allows for an unbiased assessment of the models.
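The 80/20 split can be sketched in a few lines. The study uses scikit-learn for this step; the standard-library function below is a simplified stand-in written for illustration, and its name, seed, and toy records are assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records standing in for (note, label) pairs.
records = [f"note_{i}" for i in range(10)]
train_rows, test_rows = split_train_test(records)
```

Fixing the seed makes the split reproducible, so that different models are compared on the same held-out notes.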
Input Encoding: The text data were then encoded into numeric vectors suitable for machine learning using a pre-trained Bio-ClinicalBERT tokenizer from Hugging Face. This language representation model is designed specifically for the biomedical domain, allowing it to better handle medical terminology. The texts were tokenized and encoded into input vectors for the training and test sets.
Model Selection: Based on initial experiments, several model architectures were selected for comparison: recurrent neural networks (RNNs), long short-term memory (LSTM), bidirectional LSTM, and BERT fine-tuning. These represent both traditional and cutting-edge deep learning approaches for NLP text classification tasks.
Evaluation Metric: The weighted average F1 score was chosen as the single metric to track during experiments. The F1 score balances precision and recall, while the weighting accounts for class imbalance. This offers a comprehensive assessment of performance. Additionally, precision, recall, and accuracy are used to assess differences in performance across datasets and classifier models.
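The weighted average F1 score can be made concrete with a small worked example. The sketch below assumes single-label predictions and weights each class's F1 score by its share of the true labels; the function names and toy labels are illustrative and are not taken from the study's code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true frequency."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, label) * count / total
               for label, count in support.items())


y_true = ["I10", "I10", "I10", "I50.9"]
y_pred = ["I10", "I10", "I50.9", "I50.9"]
score = weighted_f1(y_true, y_pred)
```

In this toy case the frequent class I10 has an F1 of 0.8 and the rare class I50.9 has an F1 of about 0.67, so the weighted average of about 0.77 sits closer to the score of the frequent class.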
Model Optimization: To improve results, various optimization techniques were employed:
• Hyperparameter tuning to find optimal model configurations.
Final Model Selection: Finally, the best-performing model architecture was selected based on the experiments. The top model was retrained on the full 80% training corpus and saved for future use. The pre-processed encodings were also retained for reuse in subsequent experiments.

Results and Discussions
Table 3 illustrates the performance of each model on its respective dataset, focusing on the top 10 and top 50 ICD-10 diagnosis codes. For top 10 ICD-10 prediction, BERT performs best, with accuracy above 87%, compared with 81% for a single LSTM model. Performance decreased when we predicted the top 50 ICD-10 codes, where we obtained an accuracy of 81% for the BERT model and 67% for a single LSTM model. The precision and recall scores for the top 10 are also better than those for the top 50 data. For these three metrics, we report average values rather than micro- or macro-level figures.
The best results were achieved using the hyperparameters obtained after model tuning. Table 4 provides a summary of the best hyperparameters for the different models, including RNN, LSTM, BiLSTM, and BERT, with both top 10 and top 50 diagnoses. The hyperparameters include the batch size, number of epochs, embedding dimension, hidden dimension, optimizer, activation function, dropout rate, and learning rate.
Table 5 presents a comparison between the main findings of previous studies and the results we obtained. Our research indicates that models previously considered as having lower performance exhibited suboptimal results primarily because of inadequately chosen hyperparameters and the absence of a fine-tuned decision boundary. Our updated comparison shows that training the models with our configuration reduced the gap between the highest and lowest F1 scores. This is consistent with the results reported in the latest ICD-10 prediction research [20]. Additionally, Figures A1-A4 in Appendix A illustrate the precision, recall, and F1 score for the LSTM/BERT classifiers built. Overall, the classifier with the top 10 diagnoses has higher scores than the classifier with the top 50 diagnoses.
Previous studies have explored the feasibility of deep learning models for predicting ICD-10 codes. However, it is important to note that these deep learning models did not demonstrate high performance when applied to the MIMIC-III database.
To summarize, this experimentation utilizes a diverse array of deep learning models, including RNN, LSTM, BiLSTM, and BERT, with a specific emphasis on the Bio-ClinicalBERT model, which is pre-trained on biomedical texts. This specialization enhances the model's ability to interpret clinical language effectively. Furthermore, these results showcase significant advancements in the automation of ICD coding and present, to our knowledge, the most comprehensive F1 score metrics available to date. The F1 score is a widely used measure of the balance between precision and recall in classification tasks.

Limitation and Future Work
One of the main challenges we faced during our work was a lack of computational resources to execute the high-end operations necessary for training and optimizing complex models like RNN, LSTM, and BERT. Indeed, handling the extraction of 7 GB from the MIMIC-III dataset, which has an initial total size of 3 TB and consists of 26 tables, demands significant computational resources and time. Using a private cloud server equipped with an RTX A6000 GPU helped us overcome these resource limitations, enabling more efficient data processing and model training.
During the training of RNNs, LSTM, and particularly BERT models using the MIMIC-III dataset, we encountered several additional challenges. Firstly, the complexity and heterogeneity of healthcare data present in MIMIC-III can lead to issues such as imbalanced classes and missing values, which significantly affect the performance of predictive models. Addressing these data quality issues required sophisticated preprocessing steps, which themselves are resource-intensive.
Moreover, the temporal dependencies and high dimensionality of the data make RNNs and LSTMs computationally expensive to train. These models also suffer from issues like vanishing and exploding gradients, making it challenging to train deep networks effectively without careful tuning of hyperparameters and the adoption of techniques like gradient clipping and batch normalization. BERT and other transformer-based models, while powerful in capturing contextual information from clinical notes, demand even more computational resources due to their attention mechanisms and large number of parameters. Training these models from scratch on a dataset like MIMIC-III can be prohibitively expensive, often necessitating the use of pre-trained models followed by fine-tuning on specific tasks. However, the adaptation of these models to domain-specific medical language and tasks requires careful calibration and validation to ensure that the models do not perpetuate biases or errors inherent in the training data.
Future work will focus on enhancing prediction models for ICD codes or diagnoses by using an ensemble approach rather than relying on single models. Such an approach may leverage the strengths of various model architectures to improve accuracy and robustness. Additionally, refinements are necessary to boost the performance and accuracy of models when predicting a larger number of diagnoses, such as the top 20, top 50, or even more than 100 diagnoses.
Furthermore, more advanced validation techniques, such as k-fold cross-validation, will be explored to ensure the robustness and generalizability of the models. Unlike the traditional approach of splitting the dataset into a fixed training and test set, k-fold cross-validation provides a more comprehensive evaluation of model performance by partitioning the data into multiple subsets for training and validation. This helps in assessing the model's performance across different subsets of the data and provides a more reliable estimate of its performance on unseen data.
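The partitioning behind k-fold cross-validation can be sketched as follows. This is an illustrative standard-library version, not part of the study; the function name, the seed, and the choice of five folds over ten samples are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample is used for validation exactly once and for training in the remaining folds, so averaging a metric over the k folds uses all of the data for evaluation.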
Lastly, addressing the limitations in explainability and transparency of these complex models is crucial, especially in a high-stakes field like healthcare. Developing methods to interpret model decisions and ensure they align with clinical reasoning will be critical in future work, enabling clinicians to trust and effectively use AI-driven tools in their decision-making processes.

Conclusions
In conclusion, this research examines the efficacy of deep learning models such as LSTM and BERT architectures, with particular emphasis on BERT, for automated extraction of medical concepts from clinical notes in the MIMIC-III database. Empirical results demonstrate that deep learning natural language processing techniques can effectively encode clinical texts and assign appropriate ICD codes without manual supervision. The proposed methodology establishes a competitive baseline for concept extraction, achieving strong diagnostic code prediction from discharge summaries. Compared to the top 10 ICD code prediction with an F1 score of 0.72 [21], we achieved a better F1 score of 0.87. Similarly, compared to the top 50 ICD code prediction with an F1 score of 0.75 [22,23], we achieved an F1 score of 0.81. Moreover, the generalizability of the current LSTM/BERT models creates promise for holistic, unified systems that can simultaneously extract multiple data types, such as diagnosis codes, from unstructured electronic health records. This research thereby underscores the capability of artificial intelligence methods to unlock clinical knowledge from textual data sources and meaningfully impact healthcare delivery. Furthermore, large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. Indeed, in the latest paper of Zelalem et al. [24], self-verification represents a crucial milestone in harnessing the capabilities of LLMs within healthcare contexts. As LLMs consistently enhance their overall

Figure 3 offers a deeper dive into the dataset's note categories, displaying a bar chart of the variety of note types, where "Nursing/other" is the most frequent category.

Table 2. Statistics for diagnosis tables with top 10 and top 50 prevalent codes.

Table 3. Summary of results of our experiments.

Table 4. Summary of best hyperparameter values.

Table 5. Comparative evaluation of different studies from the literature review.