This paper is an extension of the work originally presented at the 16th International Conference on Wearable, Micro and Nano Technologies for Personalized Health [1]. According to Dhamdhere et al., 80% of all healthcare data is unstructured [2]. Besides MRI, CT, and other imaging data, a large amount of the medical data collected in medical records is written in natural language. Moreover, despite the adoption of electronic medical records (EMRs), free text is still widely used because of advantages such as more precise and flexible descriptions. However, free narrative text does not satisfy the criteria of semantic interoperability: it cannot be analyzed by statistical tools, cannot be processed by decision support systems, and is not available for improving case-based systems. At the same time, the demand for medical free-text processing systems is increasing.
There are two approaches to natural language text processing: a traditional one, based on language rules and domain ontologies, and methods that utilize machine learning. Machine learning algorithms show very good results in natural language processing (NLP) tasks [3], including data extraction [7] and named entity recognition [8].
A multi-layer perceptron (MLP) is an artificial neural network composed of an input layer that receives the data, an output layer that makes a decision or prediction about the input, and one or more hidden layers in between. The specific feature of this type of network is that each neuron in one layer is connected to every neuron in the next; such layers are also called fully-connected or dense. As with any supervised learning technique, training involves adjusting the parameters of the model, particularly the weights and biases, in order to minimize the error.
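As an illustration, a minimal classifier of this kind can be sketched with Keras. The input dimensionality, layer widths, and optimizer below are illustrative assumptions, not the exact configuration used in this work.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    n_features = 300  # e.g., dimensionality of a word2vec-based sample vector (assumption)
    n_classes = 24    # one output unit per target entity

    model = Sequential([
        Dense(128, activation="relu", input_shape=(n_features,)),  # hidden (dense) layer
        Dense(64, activation="relu"),                              # hidden (dense) layer
        Dense(n_classes, activation="softmax"),                    # output layer: prediction
    ])
    # Training adjusts the weights and biases to minimize the error (loss).
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])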
One of the successful approaches is to use convolutional neural networks (CNNs) for text classification [11]. CNNs have shown superior results in image recognition [12], and this neural network architecture can also be applied to vector-represented text.
Convolutional neural networks (CNNs) [12] differ from fully-connected networks and have a more complex structure. CNNs have achieved state-of-the-art performance in image classification, speech recognition, and sentence classification. Convolutional layers of CNNs, in contrast to fully-connected layers, are only connected to a small region of the previous layer and have a much smaller number of parameters. Technically, this allows us to use deeper models requiring the same memory but having better performance.
There are three main types of layers required to construct a CNN: the convolutional layer, the pooling layer, and the fully-connected layer. The parameters of a convolutional layer are a set of learnable filters. Each filter slides across the input data, represented in the form of a matrix, and computes a convolution; as a result, we obtain an activation map that gives the responses of the filter at every spatial position. During training, the filters learn to activate on inputs that carry features indicative of the desired output. Pooling layers are put in between convolutional layers to reduce the spatial size of the representation, reduce computation, and prevent overfitting. Max pooling combines the outputs of a cluster of neurons into a single value, the maximum over the cluster. The dense layers applied in a CNN are the same layers as described in the MLP section; they make the final prediction and return the predicted class.
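To make this concrete, a one-dimensional convolutional text classifier along these lines can be sketched in Keras. The sequence length, vector size, number of filters, and kernel width below are illustrative assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

    seq_len, vec_size, n_classes = 50, 300, 24  # assumed sample shape and label count

    model = Sequential([
        # 128 learnable filters of width 3 slide over the sequence of word vectors
        Conv1D(128, kernel_size=3, activation="relu", input_shape=(seq_len, vec_size)),
        # max pooling keeps only the strongest response of each filter
        GlobalMaxPooling1D(),
        # a fully-connected layer makes the final class prediction
        Dense(n_classes, activation="softmax"),
    ])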
Long short-term memory networks (LSTMs) [14] are a type of recurrent neural network (RNN) developed to enable a network to take into account information that was previously fed into it and may be useful for the current prediction. RNNs are applied to sequential data such as time series, text, or any other sequence. In contrast to CNNs and conventional feed-forward networks, RNNs are able to model sequential data because they have a state variable that keeps track of patterns in the data. The state variables are updated over time, and RNNs are able to predict the next value of the sequence.
Advanced RNNs such as LSTMs are widely used in language modelling and machine translation. LSTMs perform better than ordinary RNNs when provided with more data, can store memory for much longer, and can also learn long-term dependencies. An LSTM unit comprises a memory block (a cell) and three regulators called gates: an input gate, an output gate, and a forget gate. The cells are responsible for remembering, and the gates manipulate this memory. Each gate is a continuous function with values between 0 and 1, where the value controls how much information flows through the gate.
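A minimal LSTM classifier over token sequences can be sketched in Keras as follows; the vocabulary size and layer widths are illustrative assumptions, not the configuration used in this work.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    vocab_size, n_classes = 10000, 24  # assumed vocabulary and label count

    model = Sequential([
        Embedding(vocab_size, 300),  # maps token ids to 300-dimensional vectors
        LSTM(64),                    # the cell state carries context along the sequence
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])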
All the neural networks mentioned above require text represented in some numerical form. Word representation is an important and non-trivial task in NLP [15]. A good representation should preserve the semantics and context of words in a language. Word2vec [16] is a technique used to learn word embeddings, or feature representations of words. Word2vec is a distributed representation: the semantics of a word is captured by the activation pattern of the resulting vector, in contrast to a simpler one-hot representation. Word2vec analyzes a given text corpus and obtains numerical representations (vectors) such that words occurring in similar contexts have similar representations, that is, their vectors are located close to each other in a multidimensional vector space. In this way, the semantic meanings of words are learned from their contexts.
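For illustration, training such a model with the gensim library (version 4 API) might look like the following sketch; the toy corpus and hyperparameters are assumptions for demonstration only.

    from gensim.models import Word2Vec

    # toy corpus: an iterable of tokenized sentences (illustrative only)
    sentences = [
        ["patient", "complains", "of", "chest", "pain"],
        ["chest", "pain", "on", "exertion"],
    ]

    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)
    # words from similar contexts receive nearby vectors
    print(model.wv.most_similar("pain"))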
Another issue is that, like any supervised learning method, these algorithms require large labelled datasets for training. Collecting such datasets is a hard and complicated task that takes a lot of human time and effort.
Existing works report successful results in different NLP applications in medicine. Danilov et al. [17] obtained a mean absolute error (MAE) of 2.8 days in hospital stay prediction. Zhou et al. [18] achieved 93% accuracy for medical event detection in Chinese clinical notes. However, every reported system aims at, and is trained to solve, its exact task. Additionally, despite the progress in the NLP field, the question of better embeddings is still open [19]. In this work, we implemented three different embedding models and compared their performance when applied to neural network classifiers. We developed a prototype of a medical data extraction system for extracting diagnoses from free-text medical records in Russian.
The list of entities for extraction comprised 24 items, with a total of 983 samples. These 24 entities were the most common diagnoses and complaints in the processed records. Each sample in the obtained dataset was a pair consisting of a snippet of text and the name of the entity. The SNOMED codes of the entities and their distribution in the dataset are shown in Table 1.
For word embedding, we applied three word2vec models. Table 2 provides the characteristics of the models. The “Corpus” column indicates which corpus was used to train the model; “Total words” is the total number of words, or tokens, in the corpus; “Normalization” is “yes” if tokens were normalized; “Words in the model” is the number of words included in the model and having a vector representation; and “Vector size” is the dimensionality of the vectors in the model.
For model 1 and model 2, we used the same corpus of 220 medical records, composed of 1,418,728 words in total. Model 3 was adopted without any changes. It was originally trained on a corpus of 788 million words, and the words in the model were normalized. The final number of word vectors in the model was 248,000.
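A pre-trained model of this kind can typically be reused via gensim; in the sketch below, the file name is a placeholder, and the exact key format (plain words versus lemmas with part-of-speech tags) depends on how the model was preprocessed.

    from gensim.models import KeyedVectors

    # placeholder file name for the downloaded pre-trained model
    wv = KeyedVectors.load_word2vec_format("pretrained_ru_word2vec.bin", binary=True)
    print(wv.most_similar("пациент"))  # key format depends on the model's normalization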
According to the pipeline described in Section 2.2 of this work, model 1 corresponded to pipeline 1 and did not include normalization. Because of this, all words in the model remained in the form in which they originally appeared in the text, which led to a larger number of words in the resulting model. Model 1 had vectors for 7505 words, whereas model 2, trained on the corpus of the same medical records, had only 3879 word representations: for model 2, the corpus was normalized before training, which decreased the number of unique words.
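As an illustration of this normalization step, Russian tokens can be lemmatized with a morphological analyzer such as pymorphy2 before word2vec training; this is only a sketch, and the actual normalizer used in this work may differ.

    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def normalize(tokens):
        # replace each token with its normal form (lemma), shrinking the vocabulary
        return [morph.parse(token)[0].normal_form for token in tokens]

    print(normalize(["боли", "болям", "болью"]))  # inflected forms collapse to "боль"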
3. Results
Table 3 provides F-scores for the three prediction models based on MLP, CNN, and LSTM, with the sample representations corresponding to the three pipelines. The highest score (0.9763) was achieved by the CNN with samples embedded using the biggest word2vec model, trained on the RNC and Wikipedia (pipeline 3).
Scores grouped by pipeline and grouped by predictive model are shown in Figure 2. In this figure, it can be seen how better word representation influences the efficiency of each model (Figure 2a) and which model fits each pipeline better (Figure 2b).
4. Discussion and Further Work
Lemmatization is often used in NLP, including in preprocessing for word embedding. However, a lot of information goes missing when words are reduced to lemmas. We showed that lemmatization before obtaining a word2vec representation led to better performance for all three classifiers. After lemmatization, the obtained model had far fewer elements but more contexts for each one, allowing a more precise representation of each element.
One of the key points of our research was comparing different word2vec models in order to evaluate their efficiency in the task of entity extraction. The three pipelines differed mostly in their word embedding models; thus, considering the results presented in Figure 2, we can draw conclusions about the models. The pre-trained model from the Russian National Corpus was the biggest model in our research and was applied within pipeline 3. The number of words in its training corpus dramatically exceeded that of the other two models. As a result, the score for the third pipeline achieved the highest values with all classifiers. Our own word embedding models turned out to be too small: the corpus collected for training the word2vec models did not reach a sufficient scale. Because of this, we cannot draw a conclusion about whether embedding models trained on texts from the relevant field perform better than models trained on non-specific texts. To make such a conclusion, much bigger corpora of medical texts have to be used.
Three classifiers were applied to extract entities from snippets of text. The MLP and CNN classifiers showed similar results with all three embedding models. The MLP exceeded the convolutional network on pipeline 2, which used the embedding model trained on medical records with preliminary lemmatization. Nevertheless, the highest F-score was achieved by the CNN, which slightly exceeded the MLP when the biggest word2vec model was applied.
LSTMs have great potential in dealing with free text. However, the format of our training data did not allow the advantages of this kind of neural network to be realized: using LSTMs requires text data as sequences of tokens, not as snippets of fixed length.
The presented work is not yet a real-life task but a simplification, as can be seen from the number of entities for extraction and the size of the training dataset. The number of entities was limited to 24, and all samples in the dataset were collected manually. We expect that results on real data would not be as high. To improve on this, the training dataset should be extended with more entities and more samples.
The next step in this direction is collecting a bigger, comprehensive corpus of medical texts that includes not only records but also articles and guidelines. With such a corpus, we will be able to train a word2vec model tailored to the medical field. A further step is the development of a complete data extraction system, with an increased training dataset and a user interface. The dataset should be extended in both directions: adding new entities and gathering more samples for each entity. Because only an entity with enough samples in the dataset can be successfully extracted, gathering a properly labelled dataset is the most challenging task. Nonetheless, such a system could easily be adapted and deployed within one particular department for structuring and secondary use of data from unstructured narrative medical records.