Word2vec Word Embedding-Based Artificial Intelligence Model in the Triage of Patients with Suspected Diagnosis of Major Ischemic Stroke: A Feasibility Study

Background: The possible benefits of using semantic language models in the early diagnosis of major ischemic stroke (MIS) based on artificial intelligence (AI) are still underestimated. The present study aims to assess the feasibility of a word2vec word embedding-based model in decreasing the risk of false negatives during the triage of patients with suspected MIS in the emergency department (ED). Methods: The main ICD-9 codes related to MIS were used for the 7-year retrospective data collection of patients managed at the ED with a suspected diagnosis of stroke. The data underwent "tokenization" and "lemmatization", and the word2vec word-embedding algorithm was used for text data vectorization. Results: Out of 648 MIS cases, the model successfully identified 83.9%, with an area under the curve of 93.1%. Conclusions: Natural language processing (NLP)-based models in triage have the potential to improve the early detection of MIS and to actively support the clinical staff.


Introduction
Major ischemic stroke (MIS) affects over 600,000 patients/year, being among the top five causes of death and the first cause of disability in the United States [1]. The MIS evolution time is 10 h on average (range 6-18 h), and it has been estimated that the patient loses 1.9 million neurons for each minute that MIS is left untreated [2]. The misdiagnosis of MIS has been associated with false positives (stroke mimics) and false negatives (stroke chameleons) in up to 26% and 43% of cases, respectively [3]. Randomized trials demonstrated that the best outcome is achievable within 4.5 h from the onset of stroke [4][5][6][7][8]. Accordingly, an early and accurate diagnosis of possible MIS patients and their aggressive treatment are mandatory [2,3,9-12]. While vital, the involvement of human resources such as nurses, neurologists, and radiologists has been reported to act as a time-limiting step in the stroke triage and imaging pathway, especially because this expertise may not be available at all sites or times [2]. These are the main reasons for the increasing interest in the automatization of the acute management of MIS. Machine learning-based technology has already been used in acute ischemic and hemorrhagic stroke imaging [2,13,14]. However, semantic language models and their potential advantages in the optimization of MIS management still remain largely underestimated.
The aim of the present study is to test the feasibility of the implementation of the word2vec word embedding-based AI model in decreasing the risk of false negatives during the triage of patients with a suspected diagnosis of MIS in the emergency department (ED).

Methods
The Python code for this project is available in the GitHub repository at the following link: https://github.com/pimorandi/MIS_in_ED_admissions (accessed on 14 November 2022).

Data Collection
The study was approved by the Internal Review Board of Humanitas Research Hospital. The patients' data were retrospectively collected from clinical notes at triage of the ED and referred to the timeframe January 2015-March 2021.
Admission diagnoses were derived from the assigned International Classification of Diseases 9th revision (ICD-9) code after the first visit. The ICD-9 codes specifically selected for their relevance to an MIS were as follows: 434.01 (cerebral thrombosis with cerebral infarction); 434.90 (cerebral artery occlusion, unspecified without mention of cerebral infarction); 434.91 (cerebral artery occlusion, unspecified with cerebral infarction).

Text Preprocessing
The text data underwent "tokenization" consisting of some preprocessing steps to clean and normalize the variables and to separate the paragraphs into words (tokens). Text words were lowercased and normalized through the removal of punctuation, numbers, and non-ASCII characters. A white space character was used as a delimiter for each token, transforming the paragraphs into lists of tokens. Stop words, such as prepositions and articles, were removed to further clean the texts from undesired tokens. The last preprocessing step was the "lemmatization", aimed at reducing the number of different tokens. The TreeTagger library was used for this step [15].
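The preprocessing steps above can be sketched in plain Python. The stop-word list and lemma dictionary below are tiny illustrative stand-ins: the study used a full Italian stop-word list and the TreeTagger lemmatizer, not the toy mapping shown here.

```python
import re
import string

# Toy Italian stop-word list and lemma dictionary, for illustration only;
# the study removed a full stop-word list and lemmatized with TreeTagger.
STOP_WORDS = {"il", "la", "di", "a", "in", "con", "e", "un", "una"}
LEMMAS = {"afasico": "afasia", "disartrico": "disartria"}  # hypothetical mapping

def preprocess(paragraph: str) -> list[str]:
    """Tokenize, normalize, and lemmatize one triage note paragraph."""
    text = paragraph.lower()
    text = text.encode("ascii", errors="ignore").decode()              # drop non-ASCII
    text = re.sub(r"[0-9]", "", text)                                  # drop numbers
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = text.split()                                              # whitespace delimiter
    tokens = [t for t in tokens if t not in STOP_WORDS]                # remove stop words
    return [LEMMAS.get(t, t) for t in tokens]                          # toy lemmatization

print(preprocess("Paziente con afasia e disartria, PA 140/90."))
# ['paziente', 'afasia', 'disartria', 'pa']
```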

Text Data Vectorization
The word2vec word-embedding artificial intelligence algorithm was used for the text data vectorization. To produce the embedding, word2vec builds a shallow neural network able to predict a word given its context. The values assumed by the intermediate layer during this prediction are then used as the embedding for the given word. The embedding dimension N chosen in this setup is 300, meaning that each word is mapped to a numerical vector of 300 dimensions (Figure 1). The training of the word2vec model was performed using the Gensim Python library [16].
The final vector for each paragraph was obtained by averaging the embeddings of its tokens.

Classification and Model Training
Prior to training, we applied Propensity Score Matching (PSM) [17] on the available confounders (age and gender) to mitigate bias that might skew the model's results. This methodology was devised to retain 100 controls with matched confounders for each MIS sample. Model performance was evaluated via stratified five-fold cross-validation using the scikit-learn Python library [18]. The chosen model was a Gradient Boosted Classification Tree (LightGBM library [19]), and the optimal hyper-parameters were selected using a Bayesian optimization framework (scikit-optimize library [20]). A logistic regression and a single hidden-layer neural network were also tested; their performance can be found in Appendix A. The chosen optimization metric was the F1 score, the harmonic mean of precision and recall, which is particularly suited to imbalanced datasets. To deal with the data imbalance, different weights were associated with the two classes. Figure 2 summarizes the flowchart of the data collection and processing.
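The cross-validation loop with class re-weighting can be sketched as follows. This is a sketch under substitutions: scikit-learn's GradientBoostingClassifier stands in for LightGBM, synthetic data stands in for the paragraph vectors, and the Bayesian hyper-parameter search is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced toy data standing in for the 300-d paragraph vectors (~1% positives).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.99],
                           random_state=0)

# Stratified five-fold CV; class imbalance handled by per-sample weights,
# mirroring the per-class weighting described in the study.
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    w = compute_sample_weight("balanced", y[tr])
    clf = GradientBoostingClassifier(random_state=0).fit(X[tr], y[tr], sample_weight=w)
    scores.append(f1_score(y[te], clf.predict(X[te])))

print(round(float(np.mean(scores)), 3))  # mean F1 across the five folds
```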

Dataset
The dataset was composed of 305,227 ED admissions, divided into 648 MIS and 304,579 non-MIS. The number of female admissions in these two groups was 305 (47.1%) and 148,464 (48.7%), respectively. The mean age was 75 years (Q1 = 63.9, Q3 = 83.9) for MIS observations and 55 years (Q1 = 38.4, Q3 = 73.8) for non-MIS (Table 1). Since age is strictly correlated with the outcome, the control class had to be subsampled to account for its covariate effect using a PSM technique. The subsampling ratio was 100:1, so for each MIS observation, 100 control observations were selected. After PSM, both gender and age had a non-significant p-value with respect to the outcome. The final cohort was composed of 65,448 observations, divided into 648 MIS and 64,800 controls (Table 2).
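The 100:1 PSM subsampling can be sketched on synthetic data as follows. This is a simplified nearest-propensity-score matching on toy ages and sexes; the study's exact matching procedure may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic cohort: MIS cases are older on average (toy data, not the study's).
n_cases, n_controls = 20, 5000
age = np.concatenate([rng.normal(75, 10, n_cases), rng.normal(55, 18, n_controls)])
sex = rng.integers(0, 2, n_cases + n_controls)
y = np.concatenate([np.ones(n_cases), np.zeros(n_controls)])
X = np.column_stack([age, sex])

# 1. Estimate propensity scores from the confounders (age, sex).
ps = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# 2. For each case, keep the 100 controls with the nearest propensity scores.
k = 100
case_ps = ps[y == 1].reshape(-1, 1)
ctrl_ps = ps[y == 0].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=k).fit(ctrl_ps)
_, idx = nn.kneighbors(case_ps)  # idx[i] = matched control indices for case i

print(idx.shape)  # (20, 100): 100 matched controls per case
```

Matching here is with replacement (a control can match several cases); matching without replacement requires removing each selected control from the pool.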

Classification
Table 3 shows the average performance in both the training and test steps of the cross-validation using different metrics. As can be seen, the model is able to learn and generalize to new data. Figure 3 plots the mean ROC curves for the training and test steps of the cross-validation.

The word2vec algorithm identified the top 15 words positively correlated with the MIS diagnosis, using the cosine similarity between the average text vector of stroke patients and the individual word vectors as the metric. Dysarthria and aphasia were the text words most strongly correlated with the correct diagnosis of MIS (Figure 4). Afasia or afasico/a: aphasia/aphasic (masculine and feminine adjective); clonie: clonic movements; disartria/disatria: dysarthria (the second word is misspelled); disartrico/a: dysarthric (masculine and feminine adjective); disorientamento: disorientation; eloquio: speech; espressivo: expressive, a type of aphasic speech (e.g., expressive aphasia); ipostenia/ipoastenia: weakness (the second word is misspelled); plegia: plegia (paralysis); sguardo: gaze.
A brief analysis of the model's predictive performance stratified by color code (Table 4) shows that, for patients labeled as low priority (green) at ED entrance, the model correctly identifies MIS patients when the clinical staff do not: 61.3% of these patients would have been assigned as low priority when in reality they were MIS patients. Of course, due to the low precision for green codes (0.009), the model would trigger far too many false positives to be implemented in an actual clinical setting.
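The cosine-similarity ranking behind the top-word list can be illustrated with toy vectors; the real analysis used the trained 300-dimensional embeddings, and the three-dimensional vectors below are fabricated purely for demonstration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors (hypothetical stand-ins for the trained embeddings).
words = {
    "disartria": np.array([0.9, 0.1, 0.0]),
    "afasia":    np.array([0.8, 0.3, 0.1]),
    "cefalea":   np.array([0.1, 0.9, 0.2]),
}

# Average text vector of MIS patients (toy stand-in).
mis_centroid = np.array([1.0, 0.2, 0.05])

# Rank words by similarity to the MIS centroid, most similar first.
ranking = sorted(words, key=lambda w: cosine(words[w], mis_centroid), reverse=True)
print(ranking[0])  # disartria
```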

Diagnosis of Major Ischemic Stroke
The present study strived to test the feasibility of implementing an NLP-based classification model to optimize the acute management of MIS from triage clinical notes. More than 80% of strokes result from ischemic damage to the brain due to an acute reduction in the blood supply. The goal in the management of acute ischemic stroke is early arterial recanalization to limit the brain damage, since any delay in starting treatment is associated with worse physical and cognitive outcomes, with a high level of disability and comorbidities [2,21,22]. Although faster triage, improvements in neuroimaging techniques, thrombolysis, and thrombectomy represent the major advances in MIS management, the overall outcome of patients affected by stroke still largely depends on a prompt and accurate diagnosis on admission to the ED [12,23-28]. Based on our results, keyword-based analysis seems promising and may lead to a more rapid diagnosis of stroke. The cross-validation performance shows that stroke patients were identified with a recall of 83.9% and an AUC of 93.1%. Dysarthria and aphasia were the text words most strongly correlated with the stroke diagnosis. It is noteworthy that the model was still able to correctly associate a suspected diagnosis of stroke with misspelled text words accidentally recorded during triage: "disatria" instead of "disartria" (dysarthria) is one example. The practical implications of such a model in daily practice would be non-negligible, since it may contribute to the optimization of the acute management of patients affected by MIS. In a combined vision, where machine learning models are integrative rather than substitutive of human resources, a computer alert generated by the algorithm may help nurses and other staff to more rapidly recognize those patients suspected to be affected by ischemic stroke.
Further algorithms such as those reported in the present study may also be adopted for hemorrhagic stroke, as well as other vascular and non-vascular pathologies of the central nervous system for which a multifactorial genesis is now recognized [29][30][31][32][33].

Word2vec Word Embedding-Based Artificial Intelligence Model
One-hot encoding and word embedding are two of the most popular approaches to vector representation in natural language processing. Word2vec is an algorithm, introduced in 2013, that uses a neural network model to learn word associations from a large corpus and, once trained, can identify words with similar meanings from their surrounding context. It represents each word by a list of numbers called a vector; vectors are derived by a simple mathematical function such that words with related meanings are assigned similar vectors [34]. The choice of a word2vec embedding-based algorithm let us work on a large volume of data in a simple way. From the numeric vector (whose length we fixed at 300), we trained a statistical model on the representations produced by the word2vec algorithm. Another algorithm that could be used because of its ease of implementation is one-hot encoding, which works faster than word embedding: every word has its own position in a vector, but in this process the semantic meaning of the word in a sentence is lost. One-hot encoding was one of the first techniques used in artificial intelligence models, but with the advent of word embedding it has become largely obsolete, especially in scientific fields. Furthermore, with one-hot encoding, the size of the vector grows with the vocabulary, so the resulting embedding matrix can become difficult to handle; it therefore does not work well in applications that require a large amount of data.
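The contrast between the two representations can be made concrete with a toy vocabulary; the random values below stand in for trained embeddings and carry no semantics themselves.

```python
import numpy as np

vocab = ["afasia", "disartria", "ipostenia", "cefalea", "vertigini"]

# One-hot: vector length equals the vocabulary size, and grows with it.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Dense embedding: fixed length (4 here for display; 300 in the study),
# independent of vocabulary size. Random stand-ins for trained values.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=4) for w in vocab}

print(one_hot("afasia").shape)    # (5,) -- grows with the vocabulary
print(embedding["afasia"].shape)  # (4,) -- fixed; semantic only after training
```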
Word2vec could be a good middle ground: the precision of word embedding depends on the volume of the dataset, so it works well on large datasets, obtaining the best word embedding with the smallest matrix. Other algorithms for word embedding include GloVe and FastText. With word2vec, we train a neural network with a single hidden layer to predict a target word based on its context. With FastText, each word is decomposed into character n-grams, which helps generate better embeddings for rare or out-of-vocabulary words; a major limitation of this algorithm is that the embedding takes longer to compute, and the required memory grows with the dataset, so in this respect it is no better than one-hot encoding. GloVe is a word-embedding technique similar to word2vec, but it differs in that it is a count-based rather than a predictive model. In fact, GloVe focuses on word co-occurrences over the whole corpus, while word2vec leverages co-occurrence within a local context (neighboring words); GloVe embeddings relate to the probability that two words appear together. Word-embedding techniques, compared with count-based methods, are used in different language tasks such as semantic relatedness, synonym detection, concept categorization, and analogy. With word2vec, large improvements in accuracy are obtained at a much lower computational cost; for example, it takes less than a day to learn high-quality word vectors. As reported, the continuous training of the model on data collected from other clinical studies is a key aspect for its further improvement and optimization [35,36].
Lastly, it should be highlighted that the word2vec model has a non-negligible rate of false negatives. Although this raises concerns about the overall accuracy, it must be stressed that, in the authors' experience, the model proved able to emulate human performance, decreasing the rate of human error while retaining the clinical biases. For this reason, the model cannot theoretically exceed overall human performance. We consider this an intrinsic limitation of the model rather than a weakness of the study. Other promising scenarios are worthy of mention, since they may prove more accurate in the near future, as suggested by some groups [37][38][39][40].

Limitations of the Study
The first limitation of the present study lies in the exclusion of hemorrhagic stroke and TIA, considering only MIS. Furthermore, this word embedding-based model did not explore the vital signs, which are extremely useful for detecting a patient's critical issues. Using word2vec, we obtained a classification of the words most strongly associated with MIS in terms of clinical features, but the algorithm does not establish a definite diagnosis of the disease. With AI models, it would be easy to create a warning signal based on those "embedded words", popping up on the triage nurses' computers, but the meaning of such an "alert" must be evaluated case by case. For example, one of the words most associated with the stroke diagnosis, according to the word2vec model, is "disorientation", but this clinical feature is observed in only a few patients. Another limitation of the algorithm is that the detection of true positives is not well balanced by the identification of true negatives, which could overestimate the real impact of the disease in triage. With word2vec, the word embedding obtained is "static", meaning that the model has no awareness of the context in which a word is found. By using recurrent neural networks, the word embedding could become dynamic and more accurate: such models are able to detect hidden relationships between inputs as well as provide precise sequence prediction of words, giving a high level of accuracy to the results. Future perspectives could involve dynamic, contextual models of word embedding such as BERT. Outcome selection is another limitation of this study, since we only used the ICD-9 code at hospital discharge. Ideally, we would need verified outcomes at 14/28 days and 6 months for every suspected case of MIS at ED admission that was not hospitalized; those outcomes would further alleviate clinical and other biases.

Conclusions
The present feasibility study demonstrated that the word2vec word embedding-based AI model was reliable in identifying a suspected diagnosis of MIS during patients' triage in the ED.
Further studies on larger patient cohorts are mandatory to definitively validate the proposed model.

Institutional Review Board Statement: All procedures performed in the study were in accordance with the ethical standards of the institution and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In addition to the Gradient Boosted Trees, the model selection process also considered a logistic regression and a feed-forward neural network. The logistic regression underwent the same hyper-parameter optimization described in Section 2.4. The neural network is composed of a single hidden layer whose dimensionality has been set by manual investigation to six neurons. The performances are shown below. As can be seen, both of these models seem to lead to better classifications compared to the Gradient Boosted Trees, but a more in-depth analysis of the performances across color codes shows that the ensemble method generalizes better to low priority code (green).
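The two baseline models can be sketched with scikit-learn on synthetic data; the real features were the paragraph vectors, and the hyper-parameters here (beyond the six hidden neurons named above) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Imbalanced toy data standing in for the 300-d paragraph vectors.
X, y = make_classification(n_samples=1500, n_features=20, weights=[0.95],
                           random_state=0)

# The two Appendix A baselines: a logistic regression and a single
# hidden-layer network with six neurons, scored with F1 over five folds.
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
mlp = MLPClassifier(hidden_layer_sizes=(6,), max_iter=500, random_state=0)

for name, clf in [("logreg", logreg), ("mlp", mlp)]:
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(name, round(float(f1), 3))
```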