Digital-Reported Outcome from Medical Notes of Schizophrenia and Bipolar Patients Using Hierarchical BERT

Abstract: Patient-reported (PRO) and clinician-reported (CRO) outcomes are assessment instruments that are completed by patients and trained healthcare professionals, respectively. A PRO is a report of the direct experience of the patient with a given disease condition. A CRO is an assessment of the condition of the patient by the healthcare provider. PROs may not be accessible to all patients, especially those suffering from severe disease conditions. CROs are time-consuming and therefore administered infrequently. In the present study, we introduce a new form of assessment, the digital-reported outcome (DRO), which is automatically derived from the medical notes of the patient. DROs have a low overhead and can be generated at each patient's visit to complement other outcome-assessment instruments and enhance clinical decision support by identifying at-risk patients. In this study, a DRO is developed to evaluate the functional impairment in the daily activities of two cohorts of patients suffering from bipolar disorder and schizophrenia. The input of the DRO is a single medical note from the electronic medical record of the patient. This note is submitted to a hierarchical bidirectional encoder representations from transformers (BERT) model. First, a sentence-level embedding is produced for each sentence in the note using a token-level attention mechanism. Second, an embedding for the entire note is constructed using a sentence-level attention mechanism. Third, the final embedding is classified using a feed-forward neural network. The model is trained to classify patients into moderate or severe functioning impairment levels according to the general assessment of functioning (GAF) scale, a CRO instrument for the assessment of the impact of mental illness on the daily activities of the patient. The DRO is validated using medical notes that were labeled by multiple healthcare providers from different healthcare institutions. The results indicate that a general DRO is able to classify patients from the two cohorts according to the two functioning impairment levels (severe versus moderate) prior to the onset of disease with an AUC of 76%. Disease-specific DROs are only applicable after the onset of the disease and produced AUCs of nearly 85%. The methodology introduced in the present paper is practical and can support the automated monitoring of the severity of the functioning impairment of bipolar and schizophrenia patients. Extending the proposed DRO to other psychiatric conditions and types of impairments is the subject of ongoing research work.


Introduction
Schizophrenia and bipolar disorder affect millions of people worldwide [1]. Years of evidence-based medicine have increased our understanding of risk factors and optimal treatments for mental diseases. However, these diseases have a high burden due to their clinical heterogeneity, variances in severity, and progression paths [2,3]. This presents a unique opportunity for the use of machine learning (ML) in mental health to aid in the diagnosis, treatment, and monitoring of patients at risk. For instance, an ML model was shown to enhance suicide prevention by using data from multiple sources such as IoT devices and social media to identify patterns in the data associated with suicidal behavior [4]. In other studies, natural language processing (NLP) techniques were used to understand psychopathology [5] and to detect mood changes from social media data [6].
The aim of the present study is to develop a low-maintenance, low-overhead digital instrument (DRO) for the assessment of the severity of functioning impairment in patients with mental illness. Two different types of instruments are currently being used for this purpose: patient-reported outcomes (PRO) and clinician-reported outcomes (CRO). Example PROs include the patient health questionnaire (PHQ) [7] and the young mania rating scale (YMRS) [8]. The PHQ went through several revisions in order to reduce the time necessary for its completion. This PRO was initially a revision of another instrument called the primary care evaluation of mental disorders. Subsequently, it was further reduced into PHQ-9, a PRO focusing on depression [9]. YMRS is a PRO that consists of 11 questions, thereby making it easy to administer [8].
While PROs are completed by the patients or their caregivers, CROs are administered by healthcare providers and are time-consuming. The positive and negative syndrome scale (PANSS) [10] is a CRO that consists of 30 questions covering three components: a positive, a negative, and a general component. The general assessment of functioning (GAF) [11] is an assessment of the psychological, social, and occupational functioning of the patient. This CRO is a revised version of the global assessment scale [12], which was initially introduced in 1970. Healthcare providers assign a severity score to the patient ranging from 0 to 100, where higher scores represent superior functioning in daily activities. The GAF score is primarily derived from information in the patient's medical record or information gathered during the patient's encounter with the healthcare provider.
The GAF was widely used. It was included in the Diagnostic and Statistical Manual of Mental Disorders (DSM) version IV. It was then replaced by the WHO Disability Assessment Schedule 2.0 (WHODAS 2.0) [13]. This latter CRO consists of 36 questions addressing cognition, mobility, self-care, getting along, life activities, and participation [14]. While the WHODAS 2.0 is more recent, the present study used medical notes that were annotated according to the GAF score. Because this CRO was administered over an extended period of time, it provides a large volume of data annotated by multiple healthcare providers, which is needed for the development and validation of the proposed DRO. Ideally, the DRO must learn from the available data to assess the functional impairment of the patients with a level of accuracy comparable to that produced by healthcare providers.

Related Work
The purpose of the proposed DRO is to classify each medical note according to the functional impairment level of the patient suffering from bipolar disorder or schizophrenia. Medical notes consist of text data, and their processing relies on advanced NLP techniques. Specifically, large language models have emerged as an enabler for various applications that require the processing of text data from various sources. Language models leverage the concept of transfer learning. They are pre-trained on a large corpus of text data and subsequently fine-tuned for various downstream applications as needed. These models rely on a self-supervised training methodology [15] coupled with a transformer architecture [16] and were shown to facilitate the extraction of efficient contextualized features from text data. Self-supervised training eliminates the need for labeled data, thereby allowing language models to be trained with large corpora. Language models start by tokenizing the text following a pre-established vocabulary. The size of the vocabulary is a trade-off between fewer out-of-vocabulary words and higher model complexity. The encoder-decoder architecture and self-attention mechanism of the transformer architecture enable the capture of complex dependencies from text data. This mechanism was initially introduced for machine translation in [17]. The encoder-decoder transformer architecture was extended by allowing the encoder to identify important input keywords for each target keyword produced by the decoder (i.e., self-attention). Vaswani et al. [16] subsequently proposed the first model based entirely on attention by replacing the recurrent layers common in previous encoder-decoder architectures with multi-headed self-attention.
The present paper investigates three language models: BERT [18], BERT mini [19], and clinical BERT [20]. As mentioned above, self-attention in language models attempts to learn the dependencies between word pairs. The attention range and the size of the input text have a direct impact on the computational complexity of the model. For instance, the BERT [18] and Longformer [21] language models can process at most 512 and 4096 tokens, respectively, at a time. Moreover, BERT was trained on a general English corpus from sources such as Wikipedia. This language model is large and produces embeddings with dimension 768. BERT mini is a distilled version of BERT which generates embeddings of size 256. A reduced embedding size has the benefit of requiring a smaller dataset for fine-tuning the model to the downstream application. Clinical BERT is a BERT language model that was fine-tuned with emergency medical notes, making it more suitable for medical applications.
The capabilities of the language models have been demonstrated for various applications, including question-answering [22] and text summarization [23]. These example applications focus on general corpora from Wikipedia and news abstracts, respectively. In the medical domain, language models were used for multi-task information extraction from medical notes [24]. They were also used for feature extraction from Twitter posts to identify patients at risk of developing depression [25]. In [26], the BERT language model was used to detect incoherence among sentences transcribed from interviews of schizophrenia patients. The language models BERT and clinical BERT were also successfully used to extract phenotypes related to major depressive disorder (MDD) from medical notes [27].
Despite the above-mentioned successful applications, only a small fraction (around 6% according to [28]) of research studies use medical notes. Medical notes pose a major challenge to pre-trained language models: the input token limit can be restrictive [29], especially if the related tokens that attention must connect are scattered throughout the long text of the medical note [30]. To overcome this constraint, some studies either truncate the input text or rely on topic modeling in order to restrict text processing to a reduced set of selected keywords [31,32]. These strategies are able to address the input limit constraint of language models. However, they can overlook crucial information or are unable to leverage relevant context embodied in dispersed keywords in the medical notes. Therefore, a strategy that can accommodate the entire medical note is needed.
In the present study, we adapt the hierarchical attention network (HAN) initially introduced in [33] to develop the proposed DRO. HAN is a language model architecture that can accommodate long input text by progressively building the embedding representation of the text. HAN starts with the GloVe [34] embeddings for the tokens in the vocabulary. The size of the GloVe vocabulary is small, and the embeddings are non-contextual. Therefore, in the present study, we replace the GloVe embeddings with embeddings from BERT, a language model that considers bi-directional context and benefits from a larger vocabulary. This is especially beneficial for the semantic disambiguation of homonyms and negations prevalent in medical notes. Moreover, as mentioned above, a larger vocabulary has more coverage, with fewer out-of-vocabulary words. The hierarchical architecture was also explored in a recent study [35] where the target application consisted of automatically assigning the proper International Classification of Diseases (ICD) codes [36] to medical notes. In this case, a single attention mechanism was applied at the sentence level and the sentence embeddings were then combined using average pooling. In contrast, the hierarchical mechanism used in the proposed DRO applies a dual-level attention mechanism at both the token and sentence levels.
The methodology and application introduced in the present study extend the use of machine learning models to further improve the quality of care for patients suffering from bipolar disorder and schizophrenia. These models apply the transformer-based architecture to the entire medical note by using a hierarchical attention mechanism that can overcome the input size limit of traditional language models.

Methods
The aim of the present study is to develop a DRO that can reliably identify bipolar and schizophrenia patients with severe functioning impairment in their daily activity from their medical notes. The DRO consists of two stages. First, an embedding of the medical note is produced using the hierarchical attention mechanism. Second, this embedding is submitted to a neural network that classifies the functioning impairment of the patient into two classes: moderate or severe. The study cohort, the architecture of the classifier, and the training and evaluation methodologies are described next.

Study Cohort
The patient data used in the present study were obtained from the Indiana Network for Patient Care (INPC) over the period from 1 January 2005 to 31 December 2019. INPC is an operational community-wide electronic health network that currently includes data from 19 hospitals in seven healthcare systems, the Marion County Health Department, and various physician practices.
Two cohorts of psychiatric patients are considered:
• Schizophrenia: This cohort includes patients with at least two clinical visits with schizophrenia disease codes and anti-psychotic medication use for at least 3 continuous months.
• Bipolar: The bipolar cohort consists of patients with at least two clinical visits with bipolar type I or mixed bipolar disease codes and anti-psychotic medication use for at least 3 continuous months.
It is possible for patients to be assigned to both the schizophrenia and bipolar cohorts if they satisfy both sets of selection criteria.
Patients were included in the study if they met the selection criteria for one of the cohorts. The index date for the patients in both cohorts is defined as the first date of diagnosis. Additionally, patients had to have at least one interaction with INPC every year during the study period and be 18 years or older on the index date. These criteria ensured that the patients included in the study are adults and active users of INPC. Patients are excluded if they belong to a protected group (prisoners, patients living in nursing homes, etc.) or have schizophrenia or bipolar diagnoses during the incubation period (i.e., from 1 January 2005 to 31 December 2005). The latter exclusion criterion was enforced to ensure that each patient had at least one year of medical data prior to the index date.
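The inclusion and exclusion rules above can be summarized as a single predicate. The following is a minimal sketch only: the `Patient` record and all of its field names are hypothetical and do not reflect the INPC data model, and the way each criterion is derived from raw encounter data is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from datetime import date

# Incubation period stated in the text (1 January 2005 to 31 December 2005).
INCUBATION_START = date(2005, 1, 1)
INCUBATION_END = date(2005, 12, 31)


@dataclass
class Patient:
    """Hypothetical, pre-aggregated patient record (all field names are assumptions)."""
    birth_date: date
    index_date: date            # first date of diagnosis
    qualifying_visits: int      # visits carrying the cohort's disease codes
    antipsychotic_months: int   # longest continuous anti-psychotic use, in months
    active_every_year: bool     # at least one INPC interaction per study year
    protected_group: bool       # e.g., prisoners, nursing-home residents


def age_on(birth: date, day: date) -> int:
    """Age in whole years on a given day."""
    years = day.year - birth.year
    if (day.month, day.day) < (birth.month, birth.day):
        years -= 1
    return years


def is_eligible(p: Patient) -> bool:
    """Apply the cohort, age, activity, and exclusion criteria from the text."""
    meets_cohort = p.qualifying_visits >= 2 and p.antipsychotic_months >= 3
    adult = age_on(p.birth_date, p.index_date) >= 18
    diagnosed_in_incubation = INCUBATION_START <= p.index_date <= INCUBATION_END
    return (meets_cohort and adult and p.active_every_year
            and not p.protected_group and not diagnosed_in_incubation)
```

A patient who satisfies the criteria of both cohorts would simply pass this check once per cohort, each time with that cohort's visit counts.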

Data Annotation and Preprocessing
Annotating sufficient medical notes to enable the training and validation of the proposed DRO is time-consuming and costly. To overcome this challenge, the present study relies on medical notes that were scored according to the GAF scale by multiple healthcare providers from several institutions affiliated with INPC. These medical notes are semi-structured and follow the DSM-IV format. The structured section consists of multiple axes where Axis I includes information on clinical disorders such as major psychiatric disorders and Axis V includes the GAF score for the specific visit. This section is followed by an unstructured free-text section used by healthcare providers to document the status of the patient. The structured section is used to determine the label (i.e., impairment level) of each data sample, and the unstructured section is the input of the proposed DRO.
The GAF impairment scale consists of ten levels where the lower and higher levels describe patients with poor and good functioning, respectively [37]. One important advantage of the GAF scale is that a score is assigned to each medical note. Most other instruments (e.g., PHQ, PANSS, and YMRS) are the result of a survey questionnaire. Therefore, there is no direct relation between the medical notes in the patient record and the assigned score.
Despite the benefits of using an already annotated large dataset for the development of the DRO, there are two issues that needed to be addressed. First, when each medical note is considered independently, the assigned GAF score can be subjective. This issue is mitigated by the large number of medical notes in each cohort and the diverse pool of healthcare providers that authored and scored these notes. Second, the distribution of the medical notes across the 10 levels of the GAF score is not uniform. Therefore, the original GAF scale was mapped to a modified scale with only two levels:
• The severe impairment level corresponds to a GAF score of 50 or below. As per the GAF scale, scores in this range are assigned when the patient is in danger of severely hurting themselves or others; exhibits delusions; experiences hallucinations; or shows major impairment in several areas, such as work, school, family relations, judgment, thinking, or mood.
• The moderate impairment level corresponds to a GAF score of 51 or above. This GAF range is characterized by the absence or presence of moderate symptoms such as occasional panic attacks, depressed mood, mild insomnia, and difficulty concentrating.
Multi-class classifiers require training datasets proportional in size to the number of classes with sufficient samples in each class. The revised binary functioning impairment scale for daily activities (i.e., severe versus moderate) overcomes this issue while still identifying the most vulnerable patients from the bipolar and schizophrenia cohorts. Similar mappings of the GAF scale were previously used to reduce inter-rater variability and to mitigate the non-uniform distribution of the GAF scores [12].
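The mapping from the GAF score to the binary label can be illustrated with a short function. This is a sketch under one stated assumption: the cut is placed between 50 and 51, which matches the boundary between the 41 to 50 and 51 to 60 bands of the GAF scale. The function name and the label strings are illustrative.

```python
def impairment_label(gaf_score: int) -> str:
    """Map a GAF score (0 to 100) to the binary functioning impairment level.

    Assumption: scores of 50 or below are 'severe' and scores of 51 or above
    are 'moderate', following the GAF band boundary between 41-50 and 51-60.
    """
    if not 0 <= gaf_score <= 100:
        raise ValueError(f"GAF score out of range: {gaf_score}")
    return "severe" if gaf_score <= 50 else "moderate"
```

For example, `impairment_label(45)` yields `"severe"` and `impairment_label(65)` yields `"moderate"`.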
As mentioned above, the GAF score, which is reported in the structured section of the medical note, is used to obtain the appropriate functioning impairment label needed to train the proposed DRO. The unstructured section is representative of medical notes that are entered by healthcare providers during an encounter with the patient. This free-text section is preprocessed and used as input to the DRO. During preprocessing, XML tags are removed, the text is forced to lowercase, and non-alphanumeric characters are replaced with spaces. Moreover, all the hex, Unicode symbols, and numbers are removed while preserving sentence delimiters. Sentence delimiters are used to split the text into sentences. Sentences with more than 30 words are split into sentences with at most 25 words, thus ensuring that the number of tokens associated with each sentence is within the imposed limits of pre-trained language models. Finally, common stop-words are removed using the NLTK standard English stop-word collection [38], and the mention of drug names is replaced by their respective anatomical therapeutic chemical (ATC) [39,40] group name.
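The preprocessing steps described above can be sketched with the standard library alone. The code below is a simplified illustration, not the pipeline used in the study: the stop-word set is a tiny placeholder for the NLTK English collection, the regular expressions are assumptions about how tags, symbols, and sentence delimiters are recognized, and the replacement of drug names by ATC group names is omitted.

```python
import re

# Tiny illustrative stop-word list (assumption); the study uses the NLTK
# English stop-word collection instead.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was"}


def split_long(words, limit=30, chunk=25):
    """Split a sentence longer than `limit` words into chunks of at most `chunk` words."""
    if len(words) <= limit:
        return [words]
    return [words[i:i + chunk] for i in range(0, len(words), chunk)]


def preprocess(note):
    """Return the note as a list of sentences, each a list of lowercase words."""
    text = re.sub(r"<[^>]+>", " ", note).lower()   # remove XML tags, force lowercase
    sentences = []
    for raw in re.split(r"[.!?]+", text):          # split on sentence delimiters
        raw = re.sub(r"[^a-z\s]", " ", raw)        # drop digits, symbols, non-ASCII
        for piece in split_long(raw.split()):      # enforce the sentence length limit
            kept = [w for w in piece if w not in STOP_WORDS]
            if kept:                               # skip sentences left empty
                sentences.append(kept)
    return sentences
```

Splitting long sentences before removing stop-words follows the order given in the text; the resulting word lists would then be passed to the tokenizer of the chosen language model.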

Hierarchical Attention Model
The proposed DRO follows the HAN architecture [33] and consists of three components: (1) the token encoder [41], (2) the sentence encoder, and (3) the prediction layer (Figure 1). The token encoder generates the sentence-level embedding by passing the BERT, BERT mini, or clinical BERT token embeddings through a bi-directional gated recurrent unit (GRU). The original HAN architecture used GloVe embeddings. In the present study, BERT embeddings are used because they are context-sensitive.
Each medical note consists of $n$ sentences, and each sentence $i \in [1, n]$ is a sequence of $L_i$ tokens $T_{ij}$ where $j \in [1, L_i]$. The forward GRU reads the sentence $i$ starting from token $T_{i1}$ to $T_{iL_i}$, and the backward GRU reads the sentence starting from token $T_{iL_i}$ down to $T_{i1}$. An annotation $h_{ij}$ for the token $T_{ij}$ is obtained by concatenating the hidden layers of the forward and backward GRU encoders. An attention mechanism is then used to identify the tokens that are relevant to the impairment class. The aggregate representation of these tokens constitutes the overall sentence embedding $s_i$ as shown in Equation (1).

$$s_i = \sum_{j=1}^{L_i} \alpha_{ij} h_{ij} \qquad (1)$$

where $\alpha_{ij} = \mathrm{Attention}(h_{ij}, u_j) = \mathrm{Softmax}(\mathrm{Tanh}(W_j h_{ij} + b_j)^{T} u_j)$, $W_j$ is a weight matrix, $b_j$ is a bias vector, and $u_j$ is a context vector. To generate the sentence embedding $s_i$, the annotation $h_{ij}$ of the token $T_{ij}$ is first processed by a linear layer with a Tanh activation function. This layer is followed by a Softmax layer after multiplication by the context vector $u_j$. The output of the Softmax layer, $\alpha_{ij}$, represents the normalized attention weights.
The second component of HAN shown in Figure 1 is the sentence encoder. This component follows the same approach as the token encoder to produce the embedding of the entire medical note. However, instead of starting from the token embeddings, it uses the sentence embeddings produced by the token encoder. The third component is the prediction layer. This layer uses the output of the sentence encoder to classify the medical note according to the appropriate functioning impairment level.
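The attention pooling shared by the token and sentence encoders can be sketched in a few lines of plain Python. This is an illustration of the mechanism described for Equation (1) (a Tanh linear layer, a dot product with a context vector, a Softmax, and a weighted sum of the annotations), not the trained model: the parameters `W`, `b`, and `u` are passed in as small hand-written lists, whereas in the DRO they are learned, and the GRU that produces the annotations is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W, b, u):
    """Pool annotation vectors H into one embedding using attention.

    H: list of annotation vectors (one per token or sentence)
    W: weight matrix as a list of rows, b: bias vector, u: context vector
    Returns (pooled_embedding, attention_weights).
    """
    scores = []
    for h in H:
        # Linear layer followed by Tanh.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_r)
                  for row, b_r in zip(W, b)]
        # Similarity with the context vector.
        scores.append(sum(v * c for v, c in zip(hidden, u)))
    alpha = softmax(scores)  # normalized attention weights
    dim = len(H[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]
    return pooled, alpha
```

Calling `attention_pool` once per sentence on the token annotations gives the sentence embeddings; calling it again on the sentence-level annotations gives the note embedding that is fed to the prediction layer.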
The above architecture is used to develop two types of DROs. The first type, general DRO (GDRO), is trained using the medical notes from the two cohorts prior to the index date. Three language models are evaluated for the token-level embeddings: BERT, BERT mini, and clinical BERT. The second type is a disease-specific DRO. To produce each disease-specific DRO, GDRO is fine-tuned with medical notes from each of the cohorts after the index date. The two disease-specific DROs are labeled SDRO and BDRO for the schizophrenia and bipolar disease conditions, respectively.

Results
The number of patients in the bipolar and schizophrenia cohorts is 1746 and 1767, respectively. These patients can have one or more medical notes over the study period. Table 1 shows the distribution of these notes across the two impairment levels before and after the index date. For the two cohorts, the percentage of notes assigned to the severe functioning impairment level is higher than that of those assigned to the moderate functioning impairment level. Moreover, for both the bipolar and schizophrenia cohorts, there is a significant increase (≈20%) in the percentage of notes assigned to the severe impairment level and consequently a significant decrease in the percentage of notes assigned to the moderate impairment level during the post-index period compared to the pre-index period. This trend indicates a deterioration in the impairment level of the patients post-diagnosis.
The number of samples from each cohort used to train and test the DRO models is shown in Table 2. The notes available for the development of GDRO are from the pre-index period, whereas the notes available for the fine-tuning of the disease-specific DROs (i.e., BDRO and SDRO) are from the post-index period. To compare the different DROs and adjust for class imbalance, an equal number of notes was randomly selected from the two impairment levels followed by a 67/33 training/testing dataset split from each impairment level. This two-thirds/one-third split ratio was selected to ensure sufficient testing data from each impairment level. For GDRO, 2324 notes are selected from each impairment level and cohort over the pre-index period. This number is dictated by the severe impairment level class of the bipolar cohort, which has the lowest number of samples over the pre-index period (Table 1). A total of 9296 samples with an equal number of samples from the two impairment levels and the two cohorts are collected for the training and testing of GDRO (Table 2). Similarly, 1487 notes were selected from the two impairment levels for the fine-tuning and evaluation of each disease-specific DRO. This number was dictated by the moderate impairment level of the schizophrenia cohort during the post-index period.
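The balanced sampling and the 67/33 split can be sketched as follows. This is a minimal illustration, assuming that the notes are already grouped by impairment level; the function name, the rounding of the split point, and the fixed random seed are assumptions and are not taken from the study.

```python
import random


def balanced_split(notes_by_level, n_per_level, train_fraction=0.67, seed=0):
    """Sample the same number of notes per impairment level, then split each
    level into training and testing subsets.

    notes_by_level: dict mapping a level name to the list of its notes
    Returns (train, test) as lists of (note, level) pairs.
    """
    rng = random.Random(seed)
    cut = round(train_fraction * n_per_level)   # assumed rounding rule
    train, test = [], []
    for level, notes in notes_by_level.items():
        picked = rng.sample(notes, n_per_level)  # sampling without replacement
        train += [(note, level) for note in picked[:cut]]
        test += [(note, level) for note in picked[cut:]]
    return train, test


# Sample count check for GDRO: 2324 notes x 2 impairment levels x 2 cohorts.
GDRO_TOTAL = 2324 * 2 * 2
```

The value of `n_per_level` is bounded by the smallest class, which is why the text reports 2324 notes per level and cohort for GDRO and 1487 notes per level for the disease-specific DROs.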

Impairment Classifiers
The DRO models are trained using the Adam optimizer [42] over seven epochs with a batch size of 16 and a learning rate of $5 \times 10^{-5}$. The values of the hyper-parameters are established using a three-fold nested cross-validation. Moreover, three language models for token-level embeddings are evaluated: BERT, BERT mini, and clinical BERT.
The mean and standard deviation of the AUC, accuracy, sensitivity, and specificity of each DRO are included in Table 3. These performance metrics were obtained using a three-fold cross-validation. The Softmax function in the prediction layer (Figure 1) produces a probability score indicating the likelihood that a medical note belongs to either the severe or moderate impairment level. The AUC is derived from this probability and is a measure of the area under the receiver operating characteristic (ROC) curve. This curve is a plot of the sensitivity of the classifier against (1 − specificity) and represents the ability of the classifier to discriminate between the two impairment levels as the discrimination threshold is varied. The ROC curves for all DRO models developed using the BERT, BERT mini, and clinical BERT embeddings are shown in Figure 2. The accuracy, sensitivity, and specificity metrics are calculated based on the binary assignment derived by using a specific threshold. The values of the metrics shown in Table 3 are obtained using a threshold of 0.5. Varying the threshold can increase either the sensitivity or the specificity and decrease the other.
The general, pre-index GDRO has an AUC greater than 75% with all three embeddings (Figure 2). BDRO and SDRO, the disease-specific DROs, have higher AUCs, with the AUC of BDRO exceeding 85%. For most of the models, the standard deviation of the AUC is less than 1%, indicating that the models are stable. Moreover, the variation in AUC due to the use of different token-level embeddings is less than 2%. According to [43], an AUC in the range of 70% to 80% is considered acceptable for diagnostic applications, and a range of 80% to 90% is considered excellent.
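To make the relation between the probability score, the threshold, and the reported metrics concrete, the sketch below computes the AUC and the threshold-dependent metrics from a list of labels and scores. It is an illustration with hypothetical function names, assuming that label 1 denotes the severe impairment level; the AUC is computed with the pairwise-ranking definition, which is equivalent to the area under the ROC curve.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(labels, scores, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a given decision threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }
```

The AUC does not depend on the threshold, whereas lowering the threshold in `threshold_metrics` raises the sensitivity at the cost of the specificity, which is the trade-off described above.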
Despite the lack of variability in AUC resulting from different token embeddings, Table 3 shows that there could be significant variances in sensitivity and specificity. For instance, the disease-specific SDRO for schizophrenia shows lower sensitivity and higher specificity with the token embedding BERT compared to BERT mini and clinical BERT. This observation indicates that, in practice, a calibration process is needed to select the appropriate cut-off threshold for each token embedding prior to deploying the DRO in production. This threshold will allow healthcare systems to adjust the true positive and true negative rates of the model to the desired referral rates for patients with severe impairment.
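One simple way to calibrate the cut-off is to pick, on held-out data, the highest threshold that still reaches a target sensitivity. The sketch below illustrates this idea only; it is not a calibration procedure from the study, and the function name, the target value, and the convention that a note is flagged as severe when its score is at or above the threshold are assumptions.

```python
import math


def threshold_for_sensitivity(labels, scores, target):
    """Return the highest threshold whose sensitivity is at least `target`.

    labels: 1 for severe, 0 for moderate (assumed encoding)
    scores: predicted probability of the severe level
    A note is predicted severe when score >= threshold.
    """
    pos_scores = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive samples to calibrate on")
    # Number of positives that must be captured; the small epsilon guards
    # against floating-point noise in target * len(pos_scores).
    needed = max(1, math.ceil(target * len(pos_scores) - 1e-9))
    return pos_scores[needed - 1]
```

Running this separately for each token-level embedding would give each DRO its own cut-off, so that models with similar AUCs can be compared at the same sensitivity.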

Attention Mechanism
In order to investigate which sentences the DRO model is attending to in the medical notes, the weights from the attention layers are visualized in Figures 3 and 4 for samples taken from the bipolar severe and moderate impairment levels, respectively. These heatmaps show the combination of the token-level and the sentence-level attentions $\alpha_{ij}\alpha_i$ after normalization. The terms $\alpha_{ij}$ and $\alpha_i$ represent the token-level (Equation (1)) and the sentence-level attentions, respectively. The combination of the two attentions scales up with important tokens in important sentences. Figure 3 shows that BDRO is attending to patient-reported outcomes (e.g., "chronic feelings emptiness chronic suicidal ideations") and clinician-reported outcomes (e.g., "mental status time discharge alert oriented time place person situation"). In general, the model assigns high weights to sentences indicative of severe impairment (e.g., "active suicidal homicidal ideations") and low weights to sentences that show moderate impairment (e.g., "denying feeling hopeless worthless"). The latter example also illustrates the importance of context (i.e., "denying") for the correct classification of the medical notes.
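The combined weights used for the heatmaps can be sketched as the product of each token-level weight with the weight of its sentence. The normalization step is not specified in the text, so the code below assumes a division by the largest combined weight in the note; this choice and the function name are illustrative.

```python
def combined_attention(token_alphas, sentence_alphas):
    """Combine token-level and sentence-level attention weights.

    token_alphas: one list of token weights per sentence
    sentence_alphas: one weight per sentence
    Returns the products scaled to [0, 1] (assumed max-normalization).
    """
    raw = [[a_i * a_ij for a_ij in row]
           for row, a_i in zip(token_alphas, sentence_alphas)]
    peak = max(max(row) for row in raw)
    return [[value / peak for value in row] for row in raw]
```

With this scaling, a token reaches a value close to 1 only when it is important within its sentence and the sentence is important within the note.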
Compared to the severe impairment (Figure 3), the moderate impairment example (Figure 4) shows fewer sentences with high attention weights. Moreover, most of the sentences attended to reflect moderate functioning impairment (e.g., "patient reporting worsening hallucinations" and "slightly irritable early morning").

Discussion
Machine learning (ML) methodologies have been effectively utilized to enhance clinical decision support for mental health. Among others, these methodologies were used to estimate treatment outcomes for patients suffering from depression [44]; to identify bipolar patients from a cohort of psychiatric patients [45]; and to identify obsessive-compulsive disorder symptoms from the medical records of patients diagnosed with schizophrenia, schizoaffective disorder, or bipolar disorder [46].
The present paper contributes to these applications with a DRO framework that can be deployed alongside the electronic health record of a healthcare system in order to continuously monitor the severity of the functioning impairment of bipolar and schizophrenia patients. The results show that the GDRO was able to identify psychiatric patients with severe functioning impairment from the two target cohorts with an AUC of 76% prior to the onset of the disease. After the onset of the disease, bipolar and schizophrenia patients with severe functioning impairment were correctly classified with AUCs of 86% and 84%, respectively, using the respective disease-specific DROs.
The proposed DROs not only show good performance according to accepted standards [43] but also utilize data that are readily available for all patients. Compared to other frameworks that rely on the results of MRI images or gene expression profiles [45], which may not be available for all patients, the proposed DROs use only medical notes routinely collected during the encounters between the patients and their healthcare providers.
The present study demonstrated the potential of the proposed DROs to enhance the healthcare services of patients suffering from bipolar disorders and schizophrenia. However, these DROs suffer from a few limitations. First, it is desirable to extend the training and validation of the GDRO to other mental disease conditions. Moreover, this model had approximately a 10% lower AUC than the disease-specific BDRO and SDRO. Access to additional training data from other patient cohorts may help improve the performance of GDRO and its transfer learning potential to disease-specific DROs for other mental disease conditions. Second, the medical notes used in the development of the DROs were annotated according to the GAF score. The GAF scale was widely used and was shown to align with several other scales such as the psychiatric Apgar scale [47] and the Zung depression scale [48]. However, it is being replaced by the WHODAS scale. Therefore, future work should investigate the development of DROs using the new WHODAS scale once sufficient samples become available. Third, the results show that the three token-level embeddings produced DROs with comparable AUCs. However, these DROs had different sensitivities and specificities. In the short term, future work should investigate these differences and propose automated calibration mechanisms that can help select the appropriate cut-off threshold for each type of token-level embedding. In the long term, it may be beneficial to develop a language model that is specifically trained with routine care medical notes.

Conclusions
An automated method for classifying the medical notes of patients with psychiatric disorders according to their impairment severity can help improve healthcare for the patients and enhance resource allocation for healthcare institutions. Several previous studies demonstrated the utility of ML models in identifying symptoms, predicting readmission, and assessing treatment outcomes for psychiatric patients. The present study extends these studies to the assessment of the functioning impairment severity in daily activities for two cohorts of patients suffering from bipolar disorder and schizophrenia. The proposed DROs can be deployed in healthcare systems and used to monitor these patients. Patients assigned to the severe impairment level may need additional evaluation using other CROs and laboratory tests. Future work will explore the application of the proposed framework to other mental disease conditions, will consider training and validating the framework with medical notes that are scored using the more recent WHODAS impairment scale, and will seek additional training data in order to improve the accuracy and generalizability of the proposed DROs.

Institutional Review Board Statement:
This study was approved by the Review Board of Indiana University (IRB number: 2011632512). We confirm that all of the methods were performed in accordance with the relevant guidelines and regulations.

Informed Consent Statement:
The Ethics Committee of Indiana University waived the need for informed consent due to the retrospective nature of the study.

Data Availability Statement:
The data that support the findings of this study are available from the Regenstrief Institute, but restrictions apply to the availability of these data, which were used under a research agreement for the current study and so are not publicly available. Data are, however, available from the corresponding author upon reasonable request and with permission of the Regenstrief Institute.

Acknowledgments:
The authors would like to thank Jarod Baker and Anna Roberts of the Regenstrief Institute for their support. The high-performance computing services used in this study are supported in part by Lilly Endowment, Inc. through its support for the Indiana University Pervasive Technology Institute.

Conflicts of Interest:
Ben Miled has a financial interest in DigiCare Realized and could benefit from the results of this research. Boustani serves as a Chief Scientific Officer and Co-Founder of BlueAgilis, Inc. and as the Chief Health Officer of DigiCare Realized, Inc. He has equity interest

Figure 1. Hierarchical attention architecture of the DRO model.

Figure 2. AUC-ROC curves for all the proposed DROs. (a) DROs with BERT embeddings, (b) DROs with BERT mini embeddings, and (c) DROs with clinical BERT embeddings.

Figure 3. Attention weights produced by BDRO for a section of a medical note in the medical record of a bipolar patient with severe impairment.

Figure 4. Attention weights produced by BDRO for a section of a medical note in the medical record of a bipolar patient with moderate impairment.

Table 1. Distribution of the medical notes across the two impairment levels before and after the patient's index date.

Table 2. Number of severe and moderate medical notes in the training and testing datasets of each DRO model.

Table 3. Three-fold cross-validation means (standard deviation) of the AUC, accuracy, sensitivity, and specificity of the DRO models with the BERT, BERT mini, and clinical BERT embeddings.