Improving Medical Entity Recognition in Spanish by Means of Biomedical Language Models

Abstract: Named Entity Recognition (NER) is an important task used to extract relevant information from biomedical texts. Recently, pre-trained language models have made great progress in this task, particularly in English. However, the performance of pre-trained models in the Spanish biomedical domain has not been evaluated in an experimentation framework designed specifically for the task. We present an approach for named entity recognition in Spanish medical texts that makes use of pre-trained models from the Spanish biomedical domain. We also use data augmentation techniques to improve the identification of less frequent entities in the dataset. The domain-specific models improved the recognition of named entities in the domain, beating all the systems evaluated in the eHealth-KD challenge 2021. Language models from the biomedical domain seem to be more effective in characterizing the specific terminology involved in this NER task, where most entities belong to the "Concept" type, which covers a great number of medical concepts. Regarding data augmentation, only back translation slightly improved the results. Clearly, the most frequent types of entities in the dataset are better identified. Although the domain-specific language models outperformed most of the other models, the multilingual generalist model mBERT obtained competitive results.


Introduction
Healthcare professionals generate huge amounts of medical literature and clinical data that are stored digitally, resulting in a large availability of medical texts, many of which are stored in an unstructured text format. Thus, one of the most important challenges in medical data processing is the transformation of unstructured information into well-defined data.
NER is a Natural Language Processing task to identify entities in text and classify them into predefined categories. In the medical domain, NER plays a crucial role by extracting medical terminology, i.e., meaningful text segments, such as diseases, symptoms, drugs, etc.
A recent survey of state-of-the-art proposals for NER [1] focused on deep learning approaches for the recognition of generic Named Entities (NEs) (e.g., organization, person, and location) in English. It concluded that deep-learning-based NER benefits from the advances made in pre-trained embeddings for language modeling, without the need for complicated feature engineering. Focused on clinical texts, ref. [2] presents a survey on NER and Relationship Extraction, concluding that language-model-based approaches are likely to continue to increase in the coming years. In addition, the authors of [3] compared domain-specific and generalist models for NER in clinical trials in English, and concluded that the domain models performed better. NER in the medical domain, and in languages other than English, is hampered by the scarcity of corpora and other resources and tools. In the case of Spanish, some annotated biomedical corpora have recently been created for certain (sub)categories of entities, e.g., only entities of a specific family of diseases [4][5][6][7][8][9]. An imbalance between classes of entities is common, leading supervised systems to recognize mainly the most frequent entities. Fortunately, although most of the available language models have been trained on general texts, models from the biomedical domain are becoming available.
Regarding the work for Spanish, ref. [10] presented FreelingMed, an extension of the Freeling Spanish analyzer [11] that enriches the resources of Freeling with medical dictionaries and SNOMED-CT. Other works, such as [12,13], have focused on crosslingual approaches in order to take advantage of English resources and make different projections into Spanish. However, both works take oracle terms as their starting point. Reference [14] presented UMLSMapper, a lexically/knowledge-driven system that relies on several terminological resources from UMLS. In [15], UMLSMapper is combined with crosslingual approaches, obtaining very promising results. Proposals for Spanish NER based on Bidirectional Long Short-Term Memory (Bi-LSTM) networks and Conditional Random Fields (CRFs) are presented in [7,16]. In [17], a Bi-LSTM network is used to solve the NER task on clinical notes in Spanish and Swedish, evaluating several types of embeddings generated from both in-domain and out-of-domain text corpora; the authors concluded that in-domain embeddings improve the NER task compared to shallow learning methods. In [18], a BERT language model pre-trained on Spanish biomedical literature and fine-tuned for detecting pharmacological substances, compounds, and proteins is presented. In 2020, the Cantemist (https://temu.bsc.es/cantemist/ (accessed on 3 July 2023)) evaluation campaign was presented, with the aim of exploring the automatic detection of mentions of tumor morphology in medical documents in Spanish, as well as the assignment of eCIE-O codes (ICD-O stands for International Classification of Diseases for Oncology; it is an extension of the International Statistical Classification of Diseases and Related Health Problems applied to the specific domain of tumor diseases, and is the standard coding for the diagnosis of neoplasms: https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_o_3.html (accessed on 3 July 2023)). In this case, the NEs involved are very specific to tumor morphology. The two best participant teams were [19,20]. In [19], NER is regarded as a machine reading comprehension problem, whose task is to answer questions about different types of entities based on given passages. The authors used a BERT model which was further pre-trained on the Cantemist corpus. The team in [20] used an end-to-end deep-learning-based system with pre-trained BERT models as the basis for the semantic representation of the texts.
In this work, we focus on the NER scenario proposed in the 2021 eHealth Knowledge Discovery (eHealth-KD) challenge (https://ehealthkd.github.io/2021 (accessed on 3 July 2023)) [21]. The goal was to identify four types of entities that are relevant terms representing semantically important elements in a sentence. The most successful systems in this NER challenge were based on the use of contextual language models. The winning team was PUCRJ-PUCPR-UFMG [22], with a transformer-based model, the multilingual version of BERT [23] (mBERT), in an end-to-end architecture, which not only addresses the NER task but jointly extracts relationships between entities and tackles other eHealth-KD tasks. The second best team was Vicomtech [24], with the IXAmBERT transformer model [25], a multilingual model for English, Spanish, and Basque, and a classifier formed by a Neural Network that received the input tokens and jointly produced predictions for the NER and relation extraction tasks. The third best team was IXA [26], with a system designed as a pipeline of classifiers, each independently tuned for NER and relation extraction. For the NER task, texts were encoded using an XLM-RoBERTa transformer model [27] and a Feedforward Neural Network (FNN) was used as the classifier. The top three teams in the NER task approached it jointly with the relation extraction task, indicating that joint training improves entity identification.
To summarize, previous work shows that the use of BERT-type generalist large language models improves on the results of other non-transformer-based approaches. On the other hand, the limited availability of open annotated Spanish corpora hampers the fair comparison of different approaches. In this sense, the eHealth-KD challenge provided an experimentation framework that allows different techniques and models to be compared. In the latest 2021 campaign, participants did not use contextual language models from the biomedical domain, but instead used generalist contextual language models.
In this paper, we present a proposal for recognizing named entities in Spanish medical texts using transfer learning and data augmentation techniques. Our goal is to improve the identification of medical entities in Spanish using recent public domain-specific language models, trying to overcome the limitation imposed by the imbalance between different types of entities. We hypothesize that the use of domain models can improve NER task performance in the experimental framework of the eHealth-KD Challenge 2021. As far as we know, no previous work has used this combination of techniques to improve the identification of biomedical entities in Spanish. The key contributions are as follows: (i) the use and fine-tuning of public transformer-based models previously trained on Spanish biomedical datasets to improve results in the NER task; and (ii) the selection of back translation as the data augmentation technique to mitigate data imbalance.

Materials and Methods
The eHealth-KD 2021 challenge proposed a framework for the automatic sentence-level annotation of multi-token entities and binary relations among them, attempting to capture a large part of the factual semantics. Thus, two tasks were addressed: entity recognition and relation extraction. Here, however, we only focus on the entity recognition task.

NER Task
Four types of entities are considered [21]:
• Concept: identifies a relevant term, concept, or idea.
• Action: identifies a process or modification of other entities. It can be indicated by a verb or verbal construction, such as "afecta" (affects), but also by nouns, such as "exposición" (exposition), where it denotes the act of being exposed to the sun, and "daños" (damages), where it denotes the act of damaging the skin (see Figure 1).
• Predicate: identifies a function or filter over another set of elements, which has a semantic label in the text, such as "mayores" (older), and is applied to an entity, such as "personas" (people), with some additional arguments such as "60 años" (60 years) (see Figure 1).
• Reference: identifies a textual element that refers to an entity, in the same sentence or a different one, which can be indicated by textual clues such as "esta" (this), "aquel" (that one), etc.
Thus, not only biomedical terms are considered named entities in this challenge, but also some general language elements.

Figure 1. Examples of named entities, where text in green with labels "Con", "C" refers to the "Concept" type; text in red to the "Action" type; text in yellow to the "Predicate" type; and text in grey with the "Ref" label refers to the "Reference" type [21].
Figure 1 shows the entities appearing in a set of sentences with their respective entity types. Note that some entities, such as "vías respiratorias" (airways) and "60 años" (60 years), are multi-word entities. When managing multi-word entities, the IOB (Inside-Outside-Beginning) notation is often used to represent them. IOB provides labels to indicate the entity boundaries: B-entity (first word of the entity); I-entity (subsequent words); and O (non-entity words) [28]. For example, in Figure 1, the text segment "... las vías respiratorias" is represented as O for "las" (the), B-Concept for "vías", and I-Concept for "respiratorias".
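As a minimal illustration of the IOB scheme (using simple whitespace tokens and hypothetical span indices rather than the challenge's actual annotation format):

```python
def iob_labels(tokens, entities):
    """Assign IOB labels to tokens, given entity spans as (start, end, type).

    `end` is exclusive, as in Python slicing. Tokens outside every span
    keep the default "O" label.
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"            # first word of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"            # subsequent words
    return labels

# "vías respiratorias" is a multi-word Concept entity (tokens 1-2)
tokens = ["las", "vías", "respiratorias"]
print(iob_labels(tokens, [(1, 3, "Concept")]))
# → ['O', 'B-Concept', 'I-Concept']
```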

Dataset
The dataset provided by the organizers contains sentences extracted from MedlinePlus (https://medlineplus.gov/ (accessed on 1 September 2023)) (a health information resource), Wikinews (https://www.wikinews.org/ (accessed on 1 September 2023)) (news), and the CORD-19 corpus [29] (scholarly articles about COVID-19), all related to health topics. All sentences are in Spanish, except those from the CORD-19 corpus, which are in English. The dataset is divided into three collections: training, development, and testing, as shown in Table 1 [21]. Figure 2 shows the frequency of each type of entity and the total number of entities. The training dataset is unbalanced in terms of the number of entities of each type, the most frequent being "Concept" and the least frequent "Reference". The development dataset presents a similar imbalance, but it includes part of the CORD-19 corpus. We use both datasets for training the models. Note that there are only 50 English documents about COVID-19 (from the CORD-19 corpus) in the datasets used for training the models, whereas half of the test dataset consists of that type of document. Transfer learning techniques may reduce the difficulty of correctly identifying and classifying COVID-19-related entities in the test dataset. On the other hand, the imbalance among the different types of entities can be a key factor in the performance of the models for the types that have very few examples. To overcome this limitation, we used data augmentation techniques.

Evaluation Metrics
For evaluation, we used Precision, Recall, and F1-Score metrics as defined by the eHealth-KD organizers, where "correct" (C), "partial" (P), "missing" (M), "incorrect" (I), and "spurious" (S) matches are based on the start and end of text spans and the corresponding entity type.
A "correct" match occurs when both the spans and the entity type are equal; when the start and end values match, but not the type, this is an "incorrect" match; a "partial" match occurs when the [start, end] intervals overlap only partially; "missing" matches are those that appear in the gold standard but not in the output file; and "spurious" matches are those that appear in the output file but not in the gold standard. Thus, the Precision (P), Recall (R), and F1-Score metrics are defined as follows (Equations (1)-(3), respectively).
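The match counts above combine into the scores as follows; this minimal sketch assumes the usual eHealth-KD convention that a partial match counts as half a correct one:

```python
def precision(c, p, i, s):
    """Precision from correct, partial, incorrect, and spurious counts."""
    # a partial match contributes half of a correct match
    return (c + 0.5 * p) / (c + p + i + s)

def recall(c, p, i, m):
    """Recall from correct, partial, incorrect, and missing counts."""
    return (c + 0.5 * p) / (c + p + i + m)

def f1(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# e.g. 8 correct, 2 partial, 0 incorrect, 1 spurious, 1 missing
prec = precision(8, 2, 0, 1)  # (8 + 1) / 11
rec = recall(8, 2, 0, 1)      # (8 + 1) / 11
score = f1(prec, rec)
```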

Models and Techniques
In the following, we present the pre-trained models, the data augmentation techniques, and the proposed models.

Pre-Trained Language Models
Two generalist and three domain-specific models were selected. One of these models was trained on multilingual texts, and the other four on Spanish texts only. Generalist models:
• mBERT: the BERT multilingual base model [30], trained on 104 languages with data from Wikipedia to perform Masked Language Modeling (MLM).
• BETO: a Spanish version of BERT [31], trained with texts from Wikipedia, Wikinews, and Wikiquotes in Spanish, among other sources.
Domain-specific models:
• RoBERTa Bio [32]: based on the RoBERTa model [33], but trained with several Spanish biomedical corpora, such as the Spanish Biomedical Crawled Corpus [34] and SciELO Spain (https://scielo.isciii.es/scielo.php (accessed on 13 July 2023)).
• RoBERTa Clinical: trained with the same resources as RoBERTa Bio, plus a corpus of clinical reports with more than 278,000 documents and clinical notes.
• RoBERTa NER: fine-tuned for a NER task. This model is a refinement of RoBERTa Bio, fine-tuned with the PharmaCoNER dataset (https://temu.bsc.es/pharmaconer (accessed on 13 July 2023)), which is annotated with substance, protein, and compound entities.

Data Augmentation
Class imbalance is a common problem in this task. We propose the use of data augmentation techniques on the minority classes "Predicate" and "Reference" to mitigate it. We implemented two approaches to increase the number of samples: entity synonym generation and Back Translation (BT).
Synonyms were generated using WordNet (https://wordnet.princeton.edu/ (accessed on 17 July 2023)), a lexical database that groups nouns, verbs, adjectives, and adverbs into sets of synonyms (synsets), each of which expresses a distinct concept. In particular, we used the Open Multilingual Wordnet (https://github.com/globalwordnet/OMW (accessed on 17 July 2023)), which aims to facilitate the use of wordnets in multiple languages. For each entity of the classes we wanted to augment, we selected its most frequent synonym.
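The selection step can be sketched as follows; the synonym table here is a hypothetical stand-in for the candidates the Open Multilingual Wordnet would return, annotated with assumed corpus frequencies:

```python
from collections import Counter

# Hypothetical candidates with made-up frequencies; the actual pipeline
# obtains candidates from the Open Multilingual Wordnet synsets instead.
SYNONYMS = {
    "esta": Counter({"esa": 12, "aquella": 5}),
    "mayores": Counter({"ancianos": 9, "viejos": 4}),
}

def most_frequent_synonym(word):
    """Return the most frequent synonym of `word`, or None if unknown."""
    candidates = SYNONYMS.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]
```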
Back Translation [35] consists of translating the entities into another language, from Spanish to English in this case, and then translating them back into Spanish, on the assumption that some of the resulting entities will not be exactly the same as the originals but will have the same meaning. For the translations, we used the models provided by the Language Technology Research Group at the University of Helsinki, both for Spanish into English (https://huggingface.co/Helsinki-NLP/opus-mt-es-en (accessed on 17 July 2023)) and for English into Spanish (https://huggingface.co/Helsinki-NLP/opus-mt-en-es (accessed on 17 July 2023)).
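A sketch of this round trip, with the translation models injected as plain callables; in our setup these would wrap the two Helsinki-NLP MarianMT models above (e.g. via the transformers translation pipeline), while the dictionary-based translators in the test below are toy stand-ins:

```python
def back_translate(text, to_en, to_es):
    """Round-trip a Spanish string through English to obtain a paraphrase.

    `to_en` and `to_es` are translation callables, e.g. wrappers around
    the Helsinki-NLP opus-mt-es-en and opus-mt-en-es models.
    """
    english = to_en(text)
    return to_es(english)

def augment(entities, to_en, to_es):
    """Append a back-translated variant of each entity that actually changed."""
    augmented = list(entities)
    for entity in entities:
        variant = back_translate(entity, to_en, to_es)
        if variant and variant != entity:
            augmented.append(variant)
    return augmented
```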
As a result of each augmentation process applied to the training and development collections, the frequency of the entity classes "Predicate" and "Reference" doubled: from 955 to 1910 in the case of "Predicate", and from 274 to 548 in the case of "Reference".

NER Models
We used the pre-trained models presented in Section 2.4.1 as base models and fine-tuned them for the NER task using the training and development collections. These base models were pre-trained using Masked Language Modeling (MLM), except for the RoBERTa NER model. During fine-tuning, a base model receives as input a sequence of tokens, i.e., a sentence from the dataset, and returns a sequence of labels, one per input token. Thus, the architecture of the model is that of BERT, fine-tuned for the NER task. Figure 3 shows the pipeline of the proposed system, including a BT data augmentation step. The input words are a sample of the training data; the rest of the data presented are a simulation. First, the input data can be augmented by increasing the number of samples of the entities via BT (step 1 in Figure 3). Then, the input sentence is tokenized (step 2) using the WordPiece tokenization method [36]. This strategy tries to achieve a good balance between vocabulary size and out-of-vocabulary words. The algorithm segments words into smaller parts and builds the vocabulary from combinations of these individual parts. Each of these parts or tokens is converted to a 768-dimensional vector. Then, the input embedding is passed through a series of embedding layers (step 3), whose output is the input of a classification layer (step 4). Specifically, the encoder module consists of 12 multi-head attention layers, where the self-attention mechanism is implemented, while the classification module is formed by a Feed Forward layer and a Softmax layer. Regarding the hyperparameters, the models were trained for 25 and 40 epochs, with a batch size of 64 and a learning rate of 2 × 10 −5 . We used the Adam optimizer [37].
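One detail of the tokenization step is that WordPiece may split a single annotated word into several subtokens, so the word-level IOB labels must be propagated to the subtoken level before fine-tuning. A minimal sketch, with a stub tokenizer standing in for the real WordPiece vocabulary:

```python
def align_labels(words, labels, tokenize):
    """Spread word-level IOB labels over subword tokens.

    `tokenize` maps one word to its list of subword pieces (a stand-in
    for the WordPiece tokenizer of the pre-trained model). The first
    piece keeps the word's label; the following pieces of an entity
    word receive the matching I- label so entity boundaries stay
    consistent.
    """
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.append(label)
        inside = "O" if label == "O" else "I-" + label.split("-", 1)[1]
        token_labels.extend([inside] * (len(pieces) - 1))
    return tokens, token_labels
```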
Text preprocessing techniques such as stemming and lemmatization, as well as the removal of punctuation marks, were tried but discarded, as they did not improve the results. In the end, only non-alphanumeric characters and accents were removed. Adding an extra layer to process the POS tags of the entities was also tested, but the results did not improve.
Regarding computational resources, training was performed on a private computer with the following characteristics: an AMD Ryzen 9 5900HX CPU, 16 GB of RAM, and a 1 TB M.2 SSD. No GPUs, cloud computing services, or corporate servers were used. The training time with this configuration was between 4 and 5 h.

Results
Table 2 presents the results. The first column shows the pre-trained model and the number of epochs used for fine-tuning. Since half of the documents in the test collection are written in English, we also break down the overall results by language. The table is organized in three parts: the first corresponds to the original datasets, the second to the datasets augmented with WordNet, and the third to the datasets augmented with Back Translation.
Table 2. Joint results by language of the fine-tuned transformer models: the first part corresponds to the original datasets, the second part to the datasets augmented with WordNet, and the third part to those augmented with Back Translation, both augmentations applying to the entity classes "Predicate" and "Reference". RoB Cli stands for RoBERTa Clinical, RoB Bio stands for RoBERTa Bio, and RoB NER stands for RoBERTa NER. The best partial results are in bold and the best overall results are also underlined.

Error Analysis
The first part of Table 4 shows the evaluation metrics of the best model (RoBERTa Bio-BT40-BT) for each type of entity in IOB notation, while the second part shows the metrics for each entity class, calculated as a weighted average of the metrics for the corresponding IOB tags. Clearly, the model performs better on the entity components with a larger number of samples. On the other hand, for tags with a smaller number of samples, such as I-Action and I-Predicate, no case is correct, but since their frequency is small, this hardly penalizes the overall results. The model is more accurate in recognizing the beginnings of entities (B tags) than the remaining components (I tags), probably because the majority of entities consist of one word. In the case of the B-Reference class, some correct predictions were achieved despite the small sample size, which may be due to the fact that it is one of the classes to which data augmentation was applied. The I-Reference class does not appear in the test set.
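The class-level rows of Table 4 are obtained by support-weighted averaging of the per-tag metrics; a small sketch (the values below are made up for illustration, not the actual Table 4 numbers):

```python
def weighted_average(per_tag):
    """Combine per-tag metrics into one class-level score, weighted by support.

    `per_tag` maps an IOB tag to a (metric_value, support) pair, e.g. the
    F1 of the tag and its number of gold instances.
    """
    total = sum(support for _, support in per_tag.values())
    if total == 0:
        return 0.0
    return sum(value * support for value, support in per_tag.values()) / total

# e.g. combining B-Concept and I-Concept into a single "Concept" score
concept = {"B-Concept": (0.80, 300), "I-Concept": (0.60, 100)}
score = weighted_average(concept)  # (0.80*300 + 0.60*100) / 400 = 0.75
```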

Discussion
Most of our systems, based on current public domain-specific language models, improve on the results of the best eHealth challenge participants, which were based on generalist language models. The best proposal so far was that of reference [22], based on mBERT, but with an end-to-end architecture that also extracts relationships between entities. Our approach of fine-tuning mBERT for the NER task obtains better results. The second [24] and third [26] best results so far were also based on generalist language models, but with a final classification layer using neural networks. These approaches have also been surpassed. All of our systems based on domain-specific models and fine-tuned with the original datasets outperform the best eHealth participant, indicating that they are a good choice even without any data augmentation.
Two of the domain-specific models, RoBERTa Clinical and RoBERTa Bio, beat the other models in the two scenarios we evaluated: the as-is dataset and the dataset enriched with data augmentation techniques. The main remaining limitation is that, although data augmentation by BT improved the results of the NER task, the difficulty of recognizing certain entities that are very poorly represented in the datasets persists.
The only generalist model to compete with the domain-specific models was mBERT, the multilingual version of BERT, which achieved second place using the as-is corpus and third place using the augmented corpus, possibly because of its good performance on English. If we only consider the results on the Spanish sentences with the original datasets, the best configuration of mBERT moves from second to fifth place. mBERT improved on the results of the Spanish generalist model BETO, even when considering only the Spanish part of the dataset. This indicates that a generalist model trained with huge amounts of text from different sources and languages can also obtain competitive results.
Interestingly, the domain-specific model RoBERTa NER, which was fine-tuned for a NER task, showed the worst performance of the three. This may be because the types of entities defined in the eHealth campaign framework do not fully correspond to those defined in other contexts. It is the only domain-specific model that performs worse than the generalist models when data augmentation is applied.
Of the two data augmentation strategies we studied, only Back Translation improves the results, especially with the RoBERTa Bio model, regardless of the number of epochs, which makes it the most suitable model for this task within this experimental framework. The improvement comes from the Spanish part, as data augmentation does not improve the results on the English part in any way. Regarding how the best models handle the different types of entities, the most frequent types are clearly better identified.

Conclusions
In this paper, we have presented a system for NER in Spanish medical texts using transformers, transfer learning, and data augmentation techniques. The results allow us to conclude that these techniques are suitable for the task. Our hypothesis that domain-specific models fine-tuned for the Spanish NER task can improve performance has been confirmed. We used the experimental framework proposed in the latest eHealth challenge, outperforming the best participating systems, which used generalist models. The Back Translation data augmentation technique slightly improved the results, which invites further research along this line. A possible line of future work could be prompting large language models to generate new data. In this work, we have not focused on changing the properties of the models (such as the optimizer, number of layers, and learning rate) or on parameter optimization, so a future line of work would be to explore other possible configurations of the system pipeline. Another direction to explore would be the use of transfer learning with biomedical texts in other languages, such as English, which could be used in conjunction with multilingual models such as mBERT. Moreover, the dataset provided by the eHealth challenge is small and combines entities specific to the biomedical domain with other more general ones, so another next step could be to evaluate our proposals on other Spanish corpora with different characteristics and annotation guidelines.

Figure 2. Frequencies by type of entity in the dataset.

Figure 3. Pipeline of the proposed model.

Table 1. Composition of the collections and their size in numbers of sentences.

Table 4. Evaluation metrics for each entity in IOB notation (first part) and calculated as the weighted average of the metrics for each entity class (second part).