On the Use of Parsing for Named Entity Recognition

Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.


Introduction
Named entity recognition (NER) is a task originally defined at the 6th Message Understanding Conference in 1996 [1], and it consists in finding relevant named entities in the text belonging to a set of predefined categories. Typically, the categories considered include personal names, organizations, locations, dates or times (e.g., [2]), but they can be more fine-grained in specialized settings. For example, information about protein-protein interactions can be extracted by relating protein entities [3], drug-drug interactions from drug entities [4], or adverse drug events by relating drug entities to disease entities [4,5]. As a result, NER is a challenging problem that requires advanced natural language processing (NLP) techniques, as entities tend to have numerous synonyms and variations that include long phrases and abbreviations [6].
Currently, NER is essential to any information extraction task, while also being the basis of other related or dependent tasks, from relation and event extraction to knowledge discovery and management [7], semantic indexing or question answering [8], with their performance being conditioned by the effectiveness of the entity recognition process. Effective NER is also crucial for the anonymization of documents required in some domains (e.g., clinical documents) before making them available for research purposes [9]. All these tasks can be applied after recognizing the entities involved through a pipeline architecture [10] or by using joint models to learn entities, relations, and/or events at the same time [4].
Most approaches to NER are shallow, sequence labeling systems that are directly trained to recognize entities without regard for the structure or meaning of the text. However, analyzing said structure is helpful for NER, as it provides cues both for detecting the presence of entities (e.g., a direct object of the verb "prescribe" in English, or "pautar" in Spanish, will typically indicate the presence of an entity of type drug) and for delimiting the exact span of entities (e.g., in "post-COVID-19 pneumonia pulmonary fibrosis," the constituent boundaries of "pulmonary fibrosis" delimit an entity distinct from "pneumonia"). Syntactic parsing, the task of determining the structure of a sentence, is a key task in many NLP applications that need to process text or utterances beyond a shallow level, extracting meaning or relations between objects, entities, or events referred to in the text. Syntactic parsing takes two main forms, depending on the intended output: in dependency parsing, the syntactic structure is expressed by means of binary directed relations between words, called dependencies; while in constituency parsing (or phrase structure parsing), it is represented as a phrase structure tree that divides the sentence recursively into its constituent units. Semantic parsing goes a step further by converting natural language sentences to logical forms following various representation languages, such as Abstract Meaning Representations (AMRs) [11].
In this article, we focus on the use of the information resulting from the parsing process in the NER task. Toward this aim, we start in Section 2 by defining NER, discussing the use of sequence labeling in NLP, and framing NER as a sequence labeling task. We also review the main resources applied to the task and the evaluation measures commonly used. In Section 3 we define parsing and analyze the latest developments in this area that make parsing a convenient tool to be used in NER. We continue in Section 4 with a study of the most relevant NER systems that use parsing information, including both those that use it as a source of features in a sequence labeling setting and those that make use of parsing results to guide the NER process. In Section 5 we discuss the results achieved, summarize the different approaches in tabular form, present a new proposal for the use of parsing in NER based on the consideration of the parsing process itself as a sequential labeling process, and we compare it with a sequence-to-sequence approach. In Section 6 we analyze work related to this article, to finally elaborate the conclusions in Section 7.

Named Entity Recognition
Named entity recognition is the task of locating references to entities in texts and classifying them into predefined categories. NER is a crucial component of any text mining application, as has been repeatedly shown in many different domains, from fashion industry intelligence [12] to legal document mining [13], although it has proven especially relevant in biomedical applications [6,14-17], where the categories of entities to extract include chemical terms [18,19], pharmacological substances [20,21], other drug-related information such as dosages or adverse events [22], diseases [23], problems, tests and treatments [24,25], or genes and proteins [15], among others. From an implementation point of view, NER has been linked since its inception with the sequence labeling paradigm.

NER as a Sequence Labeling Task
Sequence labeling (also known as sequence tagging) is a type of classification task where the input is a sequence of observed values and the output is a categorical label for each member of the sequence. Sequence labeling has long been used to model and solve NLP tasks, where the input values are typically words, although, depending on the task, they can also be smaller units like individual characters [26] or larger units like sentences [27]. To a certain extent, in the context of NLP, sequence labeling can be considered analogous to a software-engineering design pattern [28], since it provides a template of how to solve a problem that can be reused and adapted to different tasks.
It is worth clarifying that sequence labeling models are not to be confused with sequence-to-sequence (seq2seq) models [29]. The main difference is that while sequence labeling models assign exactly one categorical label to each word of the sentence, seq2seq models generate another sequence as output, which can be of arbitrary length. This means that seq2seq models can be used for tasks that do not fit the sequence labeling framework, such as machine translation. Unfortunately, it also means that seq2seq models require more complex architectures to run (with the use of neural attention being a must) and are significantly slower.
Part-of-Speech (PoS) tagging is probably the most archetypical example of sequence labeling applied to NLP, because the task itself is defined as assigning one label (a part of speech, such as verb or noun) to each element (word) in a sequence. Thus, any implementation of PoS tagging could be said, by definition, to be performing a form of sequence labeling, although pioneering implementations [30] were task-specific (as well as language- and tagset-specific), since they were based on handwritten rules. However, more modern approaches feature trainable models that learn the correspondence between words and tags in context as a supervised learning problem, and thus are instantiations of true generic sequence labeling models, applied to the particular task of PoS tagging. These include early statistical taggers [31], trainable rule-based taggers [32], Hidden Markov Model (HMM) taggers [33], maximum-entropy taggers [34] and, lately, deep learning approaches [35].
Chunking, sometimes referred to as shallow parsing, was incorporated into the sequence labeling paradigm in the mid-1990s. This task consists in finding relevant phrases (typically, verb, noun and/or prepositional phrases) in a text. While early approaches were ad hoc [36], the problem was reformulated as a tagging problem by Ramshaw and Marcus [37], who introduced an encoding scheme called IOB (or BIO) tagging. In this approach, each word is assigned a tag "I", "O" or "B" depending on whether it occurs "Inside," "Outside," or at the "Beginning" of a chunk. Since then, most approaches to chunking, including machine learning approaches [38] and more recent deep learning approaches [39], have used the IOB tagging scheme or variants of it.
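As an illustration of the tagging scheme just described, the following minimal sketch encodes a set of chunks as one tag per word. It assumes the common variant in which every chunk starts with a "B" tag and tags carry the chunk type (e.g., "B-NP"); the function name `spans_to_iob`, the half-open span convention, and the example sentence are our own illustrative choices, not taken from [37].

```python
from typing import List, Tuple

def spans_to_iob(n_tokens: int, chunks: List[Tuple[int, int, str]]) -> List[str]:
    """Encode non-overlapping chunks as one IOB tag per token.

    Each chunk is (start, end, label), with `end` exclusive (an assumption
    made for this sketch).
    """
    tags = ["O"] * n_tokens               # every token starts "Outside"
    for start, end, label in chunks:
        tags[start] = f"B-{label}"        # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens are "Inside"
    return tags

# Hypothetical example sentence and chunk annotation.
tokens = ["The", "doctor", "prescribed", "a", "new", "drug", "yesterday"]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 6, "NP")]
tags = spans_to_iob(len(tokens), chunks)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because the output has exactly one label per word, any generic sequence labeling model can be trained on it without knowing that the labels encode phrases.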
Advances in machine learning (and in particular, new architectures for sequence labeling) have made it possible to apply the pattern to a wider range of tasks. In this line, after the introduction of Conditional Random Fields (CRF) [40] and the averaged perceptron [41], sequence labeling was applied to model a broader spectrum of problems. In sentiment analysis, Choi et al. [42] used a CRF architecture for opinion source extraction, while Jakob and Gurevych [43] did so for opinion target extraction. In shallow discourse parsing, Ghosh et al. [44] used a sequence labeling model to extract arguments, given discourse connectives; and in question answering, Yao et al. [45] presented the first sequence labeling approach to answer extraction. All of these approaches obtained competitive results at the time and were based on IOB tagging or variants thereof. With a different kind of encoding, Frazee [46] applied a CRF architecture to semantic role labeling, predicting argument labels directly (and empty labels for words that did not play the role of arguments). Similarly, language identification in code-switching texts was addressed by Sikdar and Gambäck [47] using a CRF architecture where each word was directly tagged with its corresponding language. On a different note, the task of extractive summarization was also addressed as a sequence labeling task by Shen et al. [48]. In this case, the elements of the input sequence are sentences rather than words and the output labels are binary, representing whether each sentence is chosen to be part of the summary or not.
NER is closely related to chunking, in the sense that the goal is to extract relevant segments of words in a text (although the nature of those segments changes), and the concept of IOB tagging can also be applied. Therefore, it is not surprising that sequence labeling approaches to NER, based on variants of IOB tagging, have been popular from early years and that they have followed the general evolution of sequence labeling architectures outlined above. This way, early approaches used HMM [49] followed by techniques like Maximum Entropy (MaxEnt) [50], CRF [51], Support Vector Machines (SVM) [17], and Structural Support Vector Machines (S-SVM) [52]. Lately, following the general trends in NLP, deep learning techniques have become popular [53,54], including approaches like Convolutional Neural Networks (CNN) [55], Capsule Networks [56,57], Bidirectional Long Short-Term Memory (BiLSTM) [58], or the combination of BiLSTM and CRF [59]. Neural techniques also made it viable to define sequence labeling models on characters instead of words. Misawa et al. [60] used an architecture combining Long Short-Term Memory (LSTM), CNN, and CRF for a NER task in Japanese where entities do not necessarily follow word boundaries. Krantz et al. [26] used a similar architecture combining LSTM, CNN, and CRF for language-agnostic syllabification.
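To make the link between tag sequences and entities concrete, the sketch below recovers typed entity spans from a sequence of IOB-style tags, which is the step a sequence labeling NER system needs after tagging. The function name `iob_to_spans`, the half-open spans, the lenient handling of an "I" tag without a preceding "B", and the example sentence are assumptions made for illustration only.

```python
from typing import List, Tuple

def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode IOB tags such as "B-PER", "I-PER", "O" into (start, end, type)
    spans, with `end` exclusive."""
    spans = []
    start, label = None, None
    # A sentinel "O" at the end closes any entity still open.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (
            tag.startswith("I-") and start is not None and tag[2:] == label
        )
        if continues:
            continue                       # still inside the current entity
        if start is not None:
            spans.append((start, i, label))  # close the entity ending at i
            start, label = None, None
        if tag != "O":
            # "B-X" opens an entity; a stray "I-X" is leniently treated
            # as an opening tag as well.
            start, label = i, tag[2:]
    return spans

# Hypothetical tagger output.
tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = iob_to_spans(tags)
for start, end, label in entities:
    print(label, " ".join(tokens[start:end]))
```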

Shared Tasks and Data Sets for NER
Since its inception, the NER task has been characterized by the gradual development of collaborative annotated resources that were materialized mainly through the organization of shared tasks. In these competitive evaluation workshops, the organizers provide annotated data sets for training in advance, which are used by participating teams from all over the world to fine-tune their systems. Later, test data sets are released for a limited period of time before the official results are provided by the participants. After a shared task has finished, these data sets are used to evaluate emerging new systems, thus enabling comparison between systems. A list of the most relevant shared tasks on NER follows: , and dedicated to NLP in Iberian Languages, has always paid attention to entity recognition and processing. At first they proposed tasks on abbreviation recognition and resolution in the Spanish biomedical domain [84,85] and, after that, they decided to include tracks on Portuguese NER [86], Spanish NER [87] and Spanish eHealth Knowledge Discovery [88].
In addition, the following are some of the data sets that are routinely used by NER researchers for evaluation purposes:
• GENIA (http://www.geniaproject.org/genia-corpus) [89], a collection created to support the development and evaluation of information extraction and text mining systems for the molecular biology domain. It contains subcorpora annotated with PoS, constituency (phrase structure), terms, events, relations, and coreferences.
• OntoNotes (https://www.gabormelli.com/RKB/OntoNotes_Corpus) [90], a data set on English, Chinese, and Arabic resulting from the project with the same name. Apart from named entities from several domains (weblogs, news, talk shows, broadcast, Usenet newsgroups, and conversational telephone speech) it also contains structural information (constituency trees and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference

Evaluation Measures for NER Systems
Three measures are usually considered for evaluating NER systems: precision, recall, and F-measure. In the context of NER, the entities labeled in the test data set are considered as ground truth. To compute the metrics, we must consider the numbers of true positives (TP), false positives (FP), and false negatives (FN) with respect to said ground truth, where:
• A true positive is counted for each entity that is returned by a NER system and also appears in the ground truth;
• A false positive is counted for each entity that is returned by a NER system but does not appear in the ground truth;
• A false negative is counted for each entity that is not returned by a NER system but does appear in the ground truth.
Precision, P, refers to the percentage of a system's results that correspond to correctly recognized entities and is computed as indicated below in Equation (1). Recall, R, refers to the percentage of the total entities in a text that are successfully recognized by a system and is computed as indicated in Equation (2). Finally, F-measure combines precision and recall by means of their harmonic mean, as shown in Equation (3).
\[ P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad (1) \]
\[ R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad (2) \]
\[ F = \frac{2 \cdot P \cdot R}{P + R} \qquad (3) \]
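The three measures can be computed directly from the sets of gold and predicted entities, as in the minimal sketch below. It assumes exact matching, i.e., a predicted entity counts as a true positive only if both its boundaries and its type coincide with a gold entity; shared tasks may apply other matching criteria. The function name and the example entities are hypothetical.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start, end, type); an assumed representation

def precision_recall_f(gold: Iterable[Entity], predicted: Iterable[Entity]):
    """Return (P, R, F) under exact span-and-type matching."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # returned and in the ground truth
    fp = len(predicted - gold)   # returned but not in the ground truth
    fn = len(gold - predicted)   # in the ground truth but not returned
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return p, r, f

# Hypothetical gold annotation and system output for one sentence.
gold = [(0, 2, "PER"), (4, 6, "ORG"), (7, 8, "LOC")]
predicted = [(0, 2, "PER"), (4, 5, "ORG"), (7, 8, "LOC"), (2, 3, "PER")]
p, r, f = precision_recall_f(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this example TP = 2, FP = 2, and FN = 1: the entity predicted with a wrong right boundary counts both as a false positive and as a false negative, which is why boundary errors are penalized so heavily under exact matching.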
Nowadays, NER is far from being a solved task. Even in a well-studied and resource-rich language like English, state-of-the-art approaches obtain F-measures around 80% in some biomedical data sets (e.g., 80.5% on GENIA [95] or 77.1% on the ShARe/CLEF eHealth Task 1 Corpus [96]). As such, NER is currently a highly active topic in NLP research, the subject of frequent shared tasks with a high number of participants. Examples from 2020 are the Spanish CANTEMIST-NER shared task, with 23 participating teams, featured in IberLEF 2020 [97]; or the English W-NUT-2020 Task 1, with 13 participants, featured in EMNLP 2020 [98].

Syntax and Semantic Parsing as Building Blocks for NLP Applications
In spite of some claims about the possibility that large language models could make explicit syntax redundant, which are based on artificial benchmarks [99], the trend in real applications is just the opposite. As a matter of fact, in recent years, improvements in syntactic parsing models have made the constructions resulting from parsing more and more commonly used in various downstream tasks like machine translation [100], opinion mining [101], relation and event extraction [102], question answering [103], or summarization [104]. On the semantic side, semantic parsers have increased their accuracy to the point of becoming useful in applications like summarization [105] or machine translation [106].
In order to make successful use of parsing in NLP applications, we need (1) efficient and accurate parsing models, and (2) an adequate way of using the obtained structures.

Recent Advancements in Parsing Efficiency and Accuracy
The availability of more powerful machine learning architectures has greatly improved syntactic parsing accuracy, both in dependency parsing [107,108] and in constituency parsing [109]. Moreover, parsing algorithms have been subject to a process of simplification, which has resulted in models that are simpler, more generic and easier to tune:

• In the context of dependency parsing, transition-based parsers that used to require rich feature models to attain an acceptable accuracy with pre-neural models [110] have become viable with a minimal set of generic features by using BiLSTMs [111].
• Mildly non-projective exact-inference parsers that were barely implementable in practice due to the complexity of features needed [112] have become viable, too [113].
• In the case of constituency parsing, transition-based parsers with simple features [114] and reduced transition sequences [115] now obtain good results.
On the semantic side, semantic parsers have also begun to have sufficient accuracy to prove useful in NLP applications [105,106], hinting that semantic representations can be as versatile (or more so) as syntax, as long as accuracy becomes good enough.

Recent Advancements in the Representation of Parsing Results
The use of the linguistic structures resulting from parsing in NLP applications is not a trivial problem at all. This is in contrast to PoS tagging, where the fact that PoS tags are simply a sequence of one tag per word provides a highly standard and universal way of using such information in any neural NLP model: in the form of embeddings that are plugged as input to the network [116]. This makes it extremely easy to plug them into different models, regardless of architecture, or to try embeddings of different kinds of linguistic units (e.g., fine or coarse-grained PoS tags, lemmata, etc.). However, with constructions resulting from parsing, the situation is very different: since syntactic trees (unlike PoS tags) are structures that go beyond the linear order of the words in the sentence, it is not so obvious how they can be used in a way that takes full advantage of syntax while being modular and pluggable, i.e., not conditioning the rest of the model or requiring special resources.
A classic way to use parsing is to extract features from parse trees (such as individual dependencies from dependency trees) and inject them into the application model, as Joshi and Penstein-Rosé [117] or Vilares et al. [118] do with dependency trees for opinion mining. While this approach is relatively simple and pluggable, it does not really make full use of syntax as it is applying a lossy encoding of dependency trees into "bags of dependencies," without regard for the overall structure or the relations between dependencies. In a much more involved approach, Socher et al. [119] syntactically annotated a sentiment treebank to then train a recursive neural network that learns how to apply semantic composition for sentiment analysis. Although this does exploit syntax more fully, it has no modularity as it requires training a special ad hoc model, apart from requiring an ad hoc corpus where tree nodes are annotated with sentiment, a rarely available resource. Vilares et al. [120] avoided the need for special corpora by just computing a dependency parse of the target sentence and then applying handwritten rules to extract polarity from it, but, again, this is an ad hoc approach and cannot be used to improve existing models or to try different syntactic representations. In a similar way, several recent papers have used tree-specific encoders like graph convolutional networks [101]; but these are specialized models that are not easily pluggable.
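The "bag of dependencies" idea can be sketched in a few lines. The example below is only an illustration under our own assumptions (the triple format, the feature string layout, and the example parse are hypothetical and are not the feature templates of [117] or [118]): each dependency becomes an independent feature, so the tree that connected them is no longer recoverable from the features alone.

```python
from collections import Counter
from typing import Iterable, Tuple

Dependency = Tuple[str, str, str]  # (head word, relation, dependent word)

def bag_of_dependencies(deps: Iterable[Dependency]) -> Counter:
    """Turn a dependency tree into an unordered multiset of features."""
    return Counter(f"{head}-{rel}->{dep}" for head, rel, dep in deps)

# Hypothetical parse of "the doctor prescribed the drug".
deps = [
    ("prescribed", "nsubj", "doctor"),
    ("doctor", "det", "the"),
    ("prescribed", "obj", "drug"),
    ("drug", "det", "the"),
]
features = bag_of_dependencies(deps)

# The encoding is order-insensitive: listing the arcs in any other order
# yields exactly the same features.
assert features == bag_of_dependencies(reversed(deps))
print(sorted(features))
```

Such features are easy to append to any classifier input, which explains their popularity, but relations between arcs (for instance, that "doctor" and "drug" share the same governing verb) must be rediscovered by the model, if at all.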
Recently, a new paradigm of parsing as sequence labeling has arisen; in it, parsing is performed by encoding syntactic trees into sequences of one categorical label per word. This paradigm has been successfully applied to both constituency parsing [121,122] and dependency parsing [123], or even both at once [124]. Apart from providing very fast and environmentally friendly [125] parsers for practical applications, this approach makes it possible to plug syntax into models in the same generic way as for PoS tags, as well as opening possibilities for multitask learning with other sequence labeling tasks, such as NER. Unfortunately, to the best of our knowledge, no sequence labeling approach is available for semantic parsing at the moment.
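To give an intuition of how a tree can be flattened into one label per word, the sketch below encodes each word's head as a signed offset relative to the word's own position, paired with the dependency relation, and shows that the tree can be recovered from the labels. This relative-offset encoding is a deliberately simple choice made for illustration; the encodings actually evaluated in [123] may differ, and the function names and example parse are hypothetical.

```python
from typing import List, Tuple

def encode_tree(heads: List[int], rels: List[str]) -> List[str]:
    """Encode a dependency tree as one label per word.

    heads[i] is the 1-based position of the head of word i+1 (0 for the root).
    """
    labels = []
    for position, (head, rel) in enumerate(zip(heads, rels), start=1):
        offset = "ROOT" if head == 0 else f"{head - position:+d}"
        labels.append(f"{offset}_{rel}")
    return labels

def decode_labels(labels: List[str]) -> Tuple[List[int], List[str]]:
    """Invert encode_tree, recovering heads and relations."""
    heads, rels = [], []
    for position, label in enumerate(labels, start=1):
        offset, rel = label.split("_", 1)
        heads.append(0 if offset == "ROOT" else position + int(offset))
        rels.append(rel)
    return heads, rels

# Hypothetical parse of "The doctor prescribed aspirin".
words = ["The", "doctor", "prescribed", "aspirin"]
heads = [2, 3, 0, 3]
rels = ["det", "nsubj", "root", "obj"]
labels = encode_tree(heads, rels)
print(list(zip(words, labels)))
```

Since the labels are aligned with the words, they can be predicted by the same kind of tagger used for PoS tagging or NER, or embedded and fed to a NER model just like PoS tags.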

Parsing for NER
As explained before, syntactic information has been shown to be an important asset to improve the accuracy of various NLP tasks. Thus, it is to be expected that such information can be used to improve NER as well as related tasks like relation extraction. However, existing approaches to integrate syntax into NER systems have been affected by the difficulty of using syntax in downstream applications in a pluggable way, as we discussed in the previous section. In the same line as before, some models use syntactic information extracted from parse trees as a feature for standard sequence labeling NER architectures. However, these strategies are limited to using very specific syntactic features and cannot take advantage of the whole parse tree. On the other end of the spectrum, there are models that do use complete parse trees, but have to resort to ad hoc, complex architectures to do so. The most relevant recent references for both approaches are discussed in the rest of this section.

Syntactic Information as a Feature for Sequence Labeling NER
Sasano and Kurohashi [126] presented a system for Japanese NER based on an SVM classifier that uses several types of structural information, such as that obtained from the head verb of a sentence by means of a syntactic parser and the surface case of the phrase that includes a target entity. To deal with head verbs that do not appear in the training data, case frames are introduced. Case frames describe what kinds of cases each verb can have and what kinds of nouns can fill a case slot. They are learned from a corpus of five hundred million sentences: firstly, entities are detected by a primitive NER system that uses only local features; secondly, case frames are constructed from the sentences containing such entities. Thus, if a given threshold percentage of the examples of a case are classified as pertaining to a certain entity class, the corresponding label is attached to the case. By using all structural information, the performance improves significantly for all data sets, which means that structural information improves the performance of Japanese NER. In particular, syntactic features improve the performance not dramatically, but consistently and independently from the data set. This result also shows that case frame features are general features that can be effective for data from different domains.
In their work [127], Ling and Weld described FIGER, a fine-grained entity recognizer that identifies references to entities in natural language text and labels them with appropriate tags from a set of 112 tags. The training set for these tags is created by exploiting the anchor links in Wikipedia text to automatically label entity segments with suitable tags. A CRF model is trained for segmentation, identifying the boundaries of each text segment that mentions an entity. An adapted perceptron algorithm is used as the final classifier in charge of assigning tags to the detected entities, considering both word-based features (including unigram and bigram features) and syntax-based features such as the head of the segment containing the entity and the syntactic dependency of said head. Compared to standard NER systems, FIGER shows a higher performance. An error analysis detects that most errors originate from noise in the training data. It is worth mentioning that training data were created without resorting to parsing.
The proposed approach of Luo et al. [128] to chemical NER is based on a neural network. The base classifier is an attention-based bidirectional LSTM with a conditional random field layer, thus trying to leverage document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. This approach achieves better performance on chemical compound and drug name recognition than other state-of-the-art methods, while requiring little feature engineering. In particular, the authors investigated the effect of linguistic features such as PoS tags and chunks obtained through shallow parsing. The baselines take word and character embeddings as inputs to the model while the additional features are introduced into the deep learning classifier as additional embeddings. Without the attention mechanism, the highest F-score is achieved when only the chunking embedding is added, the main reason being that some entity boundary errors can be revised by the chunking information. When only the PoS embedding is added, the model achieves a smaller improvement. However, with the attention mechanism in place, the contribution of these features to the performance of the model is negligible.
Although these studies demonstrate the potential benefits of incorporating syntactic information, they are limited in either treating noisy syntactic information as gold references for training their taggers, or using direct concatenation to combine that information with context information without weighing it with respect to its contribution to the NER task. Tian et al. [129] tried to find a better way to incorporate syntactic information into deep learning models for NER. For this purpose, they built BioKMNER, a NER model for biomedical texts based on Key-Value Memory Networks (KVMN) [130]. They parsed biomedical text sentences to extract three types of morpho-syntactic information: PoS tags, syntactic constituents, and dependency relations. The KVMN weighs the corresponding syntactic information (values) according to the importance of context features (keys) and combines the weighted syntactic information with the output of the encoder Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (bioBERT) [131]. Finally, the decoder receives the combined embedding and tags the input sequence accordingly. BioKMNER outperforms baselines without memories and achieves new state-of-the-art results on four biomedical data sets.
The system by Tian et al. [129] is based on bioBERT [131], a pre-trained biomedical language model designed for biomedical text mining tasks. It is worth remarking that there is a recent trend in end-to-end NLP systems that use powerful pre-trained language models with huge parameter spaces based on transformers to solve a variety of tasks, as in the case of the Bidirectional Encoder Representations from Transformers (BERT) [132] or Generative Pre-trained Transformer 3 (GPT-3) models [133]. It is important to highlight that it has been shown that explicit syntactic information helps these pre-trained transformer models [134], despite the fact that some authors have questioned the use of explicit syntax in these settings [99].

Using Complete Parse Trees for NER
Shi et al. [135] proposed to use a statistical parsing technique to simultaneously identify biomedical named entities and extract subcellular localization relations for bacterial proteins. In their approach, sentences are automatically annotated by a statistical parser. Then, the constituency parse trees are decorated with annotations on relevant protein, bacterium, and location named entities; and annotations on the path linking related entities in the parse tree of each sentence. Experiments with purely supervised learning showed that in order to be effective, the model required a large curated set to minimize the sparse data problem. Unfortunately, domain-specific annotated corpora are rare and expensive, so the authors decided to add noisy texts (i.e., with automatically labeled named entities and relations) to the training set. By doing this, the system reaches a competitive performance.
Finkel and Manning [136] proposed a joint model of context-free parsing and named entity recognition, based on a discriminative CRF-based constituency parser. They found that combining parsing and NER improves performance on both tasks: the joint model produces an output with consistent parse structure and named entity spans and, moreover, does a better job at both tasks than separate models with the same features. This joint model proceeds as follows. In a first phase, the set of constituency trees from the OntoNotes 2.0 data set is modified. As named entities correspond to phrasal nodes, the labels of such phrasal nodes and their descendants in each parse tree are augmented with the type of those named entities. Some additional manual modifications are required, such as removing final periods from the entity annotation, flattening nested noun phrases, and moving adjectives from nested noun phrases into the main noun phrase. The augmented nodes give rise to extra copies of the source grammar where named entities are taken into account. This representation is even able to handle nested named entities in a natural way, although the data set used for evaluation does not contain such entities. This grammar is used by a CRF-based parser that considers features over both the parse rules and the named entities. In the testing phase, the parser analyzes each sentence, and the named entities are extracted from the parse tree. As we can see, this approach differs from the others described in this article in that NER is not considered as a sequence labeling task but rather as a by-product of a parsing process. One of the practical difficulties of this approach is that the size of the corpus employed is much smaller than the treebanks on which parsers are routinely trained, at least for English.
Instead of constituency parsers, Jie et al. [137] proposed a NER model guided by a dependency parser. The basis of this approach is that named entities tend to be covered by single or multiple adjacent dependency arcs, since certain internal structures are expected to exist for most named entities that convey semantically meaningful information. As a result, words inside each named entity typically do not have dependencies with words outside the entities, except for certain words such as head words, which often have incoming arcs from outside words. Thus, the authors derive a semi-Markov CRF model by restricting the space of all possible combinations of entities to those that strictly contain only valid spans, where a valid span either consists of a single word, or is a word sequence that is covered by a chain of dependency arcs where no arc is covered by another. This model performs competitively with respect to conventional linear CRF-based models and exhibits the same time complexity.
Finally, Yu et al. [95] introduced a method to handle both flat and nested named entities by adopting ideas from the biaffine dependency parsing model [107]. The particularity of this system is that instead of using the information resulting from the parsing process, it uses the parsing process to derive the plausibility of each of the possible entities found in the text, without actually building a parse tree. Toward this aim, the authors use a biaffine model on top of a multilayer BiLSTM to assign scores to all possible spans in a sentence. The results are used to rank the candidate spans by their scores and return the top-ranked spans that comply with constraints for flat or nested NER. The experiments show that the system improves over state-of-the-art results on three nested NER corpora and five flat NER corpora. The biaffine mapping and the BERT embedding used as input to the BiLSTM are the components that contribute most to the accuracy of the system.
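The ranking and selection step can be sketched in a few lines of Python. The fragment below assumes that the span scores have already been produced by some scorer (the biaffine model and the BiLSTM encoder are not reproduced), and it uses one plausible formulation of the constraints: for nested NER a span is rejected only if it partially overlaps a higher-ranked span, whereas for flat NER any overlap causes rejection. The function names and the candidate scores are hypothetical.

```python
def crosses(a, b):
    """True if two inclusive spans partially overlap (neither contains the other)."""
    (i, j), (k, l) = a, b
    return i < k <= j < l or k < i <= l < j


def overlaps(a, b):
    """True if two inclusive spans share at least one token."""
    (i, j), (k, l) = a, b
    return i <= l and k <= j


def decode(candidates, nested):
    """Greedy decoding over (start, end, label, score) candidates.

    Spans are visited from the highest to the lowest score and kept when they
    do not clash with an already selected span.
    """
    clash = crosses if nested else overlaps
    selected = []
    for start, end, label, _ in sorted(candidates, key=lambda c: c[3], reverse=True):
        if not any(clash((start, end), (s, e)) for s, e, _ in selected):
            selected.append((start, end, label))
    return sorted(selected)


# Made-up scores for illustration only.
candidates = [
    (0, 4, "ORG", 0.9),
    (2, 4, "LOC", 0.8),
    (3, 5, "PER", 0.7),
    (6, 6, "PER", 0.6),
]
print(decode(candidates, nested=True))
print(decode(candidates, nested=False))
```

With these candidates, the nested setting keeps the LOC span contained in the ORG span and discards only the crossing PER span, while the flat setting keeps a single span per overlapping region.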

Discussion
Parsing makes it possible to represent the structure of a text and has been repeatedly shown to be useful for improving NER accuracy. Table 1 compiles a summary of the main characteristics of the most relevant NER systems that make use of information derived from parsing. As we explained in the previous section, existing approaches have been limited by the difficulty of integrating hierarchical information such as a parse tree into a task that is linear in nature. Thus, either they make limited use of such syntactic information [128,129], or they develop ad hoc architectures that result in more complex, less generic, and less efficient models [95,136,137]. What we know for sure is that the use of information from parsers is beneficial but, since they have been tested on different data sets, it is difficult to determine which of those approaches for incorporating parsing information is more effective in general terms. Paradoxically, this issue stems from the success of the NER task: In recent years NER has been applied to so many domains that it has been necessary to create at least one data set for each of them, which has the pernicious effect that the research community has dispersed, with researchers creating systems to work effectively in a particular domain. At this point, a promising line of research that has not been tried yet is to integrate the hierarchical information provided by parsing processes into a linear setting by casting syntactic parsing itself as a sequence labeling task [121,123]. Until recently, full syntactic parsing was considered infeasible in practice within the sequence labeling framework: While it was theoretically possible to cast it as sequence labeling, learning algorithms like averaged perceptron or CRF were not powerful enough to achieve practical results. 
As an example, Spoustová and Spousta [138] presented a sequence labeling approach to dependency parsing as early as 2010, but they reported accuracies 5%-10% behind the state of the art of the time. Thus, while their work was an interesting exploration and proof of concept, it could hardly be considered a competitive system. It was not until recent years, with the popularization of dense vector representations of linguistic units (embeddings [139]) and the use of recurrent neural networks (especially BiLSTMs) to enrich these representations with context information [140], that generic sequence labeling models started being capable of doing full syntactic parsing. In a pioneering work, Gómez-Rodríguez and Vilares [121] introduced an encoding to represent any constituency tree for a sentence of length n as a sequence of n labels. Several sequence labeling architectures were tried, showing that BiLSTMs were capable of achieving good parsing accuracy (and very fast speeds) where simpler architectures failed. In later work [123], the same approach was tried with dependency parsing, exploring four different ways to cast the problem as sequence labeling and achieving competitive accuracies with two of them, including the one where Spoustová and Spousta [138] had previously obtained impractical results using pre-neural techniques.
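To give an intuition of what casting dependency parsing as sequence labeling means, the following Python sketch encodes each word's head as an offset relative to the word's own position, combined with the dependency relation, and then decodes the labels back into a tree. This relative-offset encoding is chosen here only because it is easy to show; the label format and the function names are assumptions of this sketch, and it should not be read as the exact encoding evaluated in [123].

```python
def encode(heads, rels):
    """Turn a dependency tree into one label per word.

    heads[i] is the 1-based position of the head of word i + 1 (0 for the
    root); each label stores the signed distance to the head and the relation.
    """
    return [f"{h - i:+d}_{r}" for i, (h, r) in enumerate(zip(heads, rels), start=1)]


def decode(labels):
    """Recover heads and relations from the label sequence."""
    heads, rels = [], []
    for i, label in enumerate(labels, start=1):
        offset, rel = label.split("_", 1)
        heads.append(i + int(offset))
        rels.append(rel)
    return heads, rels


# "John visited Paris": "visited" is the root and governs the other two words.
heads = [2, 0, 2]
rels = ["nsubj", "root", "obj"]
labels = encode(heads, rels)
print(labels)
assert decode(labels) == (heads, rels)
```

Since the output is exactly one label per word, any generic tagger can be trained to predict these labels, and its output can be aligned token by token with the label sequence of a NER tagger.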
This new paradigm of parsing as sequence labeling will be usable not only to integrate deep syntactic trees with NER, but to do it in such a way that we will effectively use the full syntactic information without needing to forgo the standard sequence labeling architectures of NER. Thus, we will obtain NER systems that are fast, scalable, and easily integrable with upstream tasks while also boosting accuracy, thanks to the use of deep syntax. This approach can also be extended to semantic parsing, which generates meaning representations that go beyond syntax trees. However, this will require a reduction of semantic parsing to sequence labeling.
Some NER systems, notably [129], resort to pre-trained language models. End-to-end models based on large pre-trained language models suffer from high computational costs, with the associated environmental costs [141]; reduced inclusivity in multilingual settings (e.g., GPT-3 is currently only available for English, and training it for a new language has been estimated to cost more than USD 4 million with current hardware [142]); as well as lack of explainability, which can be provided with parsing. In this respect, a practical characteristic of sequence labeling approaches to parsing is that they are more efficient than seq2seq models. For example, the single-core speeds of the seq2seq constituent parsers of Fernández-González and Gómez-Rodríguez [143], albeit optimized for speed, are an order of magnitude slower than those of sequence labeling constituent parsers [121,122]. This is compounded by the fact that sequence labeling is much easier to parallelize, so that the differences can be even larger in multi-core settings. For all these reasons, and while recognizing the usefulness of end-to-end setups and large pre-trained models, non-end-to-end setups that use intermediate tasks explicitly are still preferable if we wish to achieve efficient, green, inclusive, and explainable systems, and will continue to be in the foreseeable future.

Related Work
A number of articles have reviewed the state of the art in NER at a given moment, but none of them had the use of information derived from parsing processes as its main focus, as is the case here.
The work of Nadeau and Sekine [144], for example, is a classical reference that reviews 15 years of research in NER, from 1991 to 2006. They detected that early systems were making use of handcrafted rule-based algorithms, while modern systems most often resorted to machine learning techniques. Handcrafted systems provided good performance at a relatively high system engineering cost. For machine learning systems, a prerequisite was the availability of a large collection of annotated data, a rather rare resource and limited in domain and language coverage. Indeed, most of the work at that point had concentrated on limited domains and textual genres such as news articles and web pages. The application of syntactic information was limited to the use of fixed syntactic constructions for finding candidate named entities and to the use of syntactic relations (e.g., subject-object) to discover more accurate contextual evidence around entities.
Regarding the reviews carried out in the last decade, Vazquez et al. [145] studied the achievements in the recognition of chemical entities mentioned in text, the determination of their chemical structures, and the identification of relationships between chemicals and other entities. It must be taken into account that chemicals may be referenced in documents in a variety of forms: systematic nomenclatures, common names, trade names, database identifiers, or IUPAC International Chemical Identifier strings; with different types of names having different word morphologies. They classified NER approaches into three categories: dictionary-based, morphology-based, and context-based, the latter category being the only one that involves some form of syntactic parsing guided by manual rules, in contrast to current parsing techniques based on treebank data. At that time, hand-made context-free rules had been proposed to describe a kind of "chemical language." Shallow or template-based parsing had also been considered to mine relationships for entities such as proteins and genes, pharmacogenomics entities, or drug and cytochrome proteins. As a result, parsing was limited most of the time to determining certain components of sentences (e.g., subjects), which were then used in a template matching strategy. NER for the chemical domain was reviewed again a few years later by Eltyeb and Salim [146]. They considered a different classification of NER systems in this domain: dictionary-based, rule-based, machine learning-based, and hybrid approaches. Rule-based systems used a set of hand-made rules to extract the names of entities. The handcrafted models consisted of pattern-based and context-based rules, the latter involving, as before, the use of shallow parsing.
In the biomedical domain, Campos et al. [147] analyzed machine learning tools for NER in this context, where it is used to detect entities such as gene, protein, drug, and disease names. It is a complex domain, where many entities are descriptive (e.g., "normal thymic epithelial cells"), several entity names can share one head noun, one entity name can have several forms of spelling, and ambiguous abbreviations are frequently used, among other phenomena. The authors detected that three approaches were used at that time to deal with this variety in entity forms: rule-based approaches for names with a strongly defined orthographic and morphological structure, dictionary-based approaches for closely defined vocabularies of names (e.g., diseases and species), and machine learning approaches for highly dynamic vocabularies of names exhibiting strong variability (e.g., genes and proteins). They focused their survey on the latter and they detected that shallow syntactic parsing benefits pre-processing of gene and protein names, particularly when using chunking to divide the text into syntactically correlated parts of words (e.g., noun or verb phrases). They also observed that, given that these linguistic units only provide a local analysis of some tokens in the sentence, additional information can be derived from dependency parsing to collect the relations between a wider range of tokens. NER for the biomedical domain was reviewed again several years later by Alshaikhdeeb and Ahmad [148]. At that time, most methods were relying on machine learning techniques and they reviewed some features that could be used in such techniques, such as morphological features, dictionary-based features, lexical features and distance-based features, but not syntactic features.
A different perspective was taken by Marrero et al. [149], who analyzed the evolution of NER from a theoretical and practical point of view, arguing that the task was actually far from being solved and showing the consequences for the development and evaluation of tools. They focused their review around what the task of NER is in itself, analyzing the different meanings of the term named entity. They also analyzed the resources and metrics that were used to solve the task and to measure the results attained, concluding that systems were overfitting to the training corpora, leading to serious limitations in the external validity of NER evaluations, given that systems did not perform well in general but for a particular user and document type.
Another context-specific review work is that of Shaalan [150], who studied the features of common tools used for NER in the Arabic language. This language poses particular challenges for NER, such as the use of the Arabic script; the co-existence of Classical Arabic, Modern Standard Arabic, and Colloquial Arabic dialects; lack of capitalization; lack of uniformity in writing styles; optional short vowels; and agglutination. As an illustration of the complexity of the task in this language, two mentions of entities may appear in one word, given that a pronominal can appear as a suffix pronoun to a nominal. One of the primary approaches for Arabic NER was based on handcrafted local grammatical rules. The structure of Arabic sentences allows a named entity to appear anywhere in the sentence and at different distances from lexical triggers, which complicated the structure of the rules. This led to using base-phrase chunks such as noun phrases and verb phrases, identified by means of shallow syntactic parsing. The other primary approach was based on machine learning classifiers, where syntactic information could also be used, giving rise to hybrid approaches.
Goyal et al. [151] presented the status of NER techniques developed by the research community and identified the issues and challenges (nested entities, ambiguity, annotation of training data, lack of resources) as well as factors (language, text genre, text domain) affecting NER performance, all of them to be considered when designing these systems. They found that earlier systems were most often based on handcrafted rules, including rules based on syntactic-lexical patterns to identify and classify named entities. These systems are highly efficient because they exploit the properties of language-related knowledge, employing domain-specific features to obtain sufficient accuracy. However, they are quite expensive, domain-specific and non-portable. In the case of NER systems based on machine learning, some of them consider chunks of text detected by means of shallow parsing as features.
Névéol et al. [152] offered an overview of clinical NLP for non-English languages. In the case of NER, they found that, similar to approaches for English, the methods for other languages are rule-based, statistical, or a combination of both. Although they did not consider parsing to be one of the most widely used resources, they cited it as one of the NLP techniques used in NER systems. In the clinical domain, NER essentially focuses on two types of entities: personal health identifiers in the context of clinical document de-identification and clinical entities such as diseases, signs/symptoms, procedures or medications, as well as their context of occurrence: negation, assertions, and experiencer (i.e., whether the entities are relevant to the patient or a third party such as a family member or organ donor). They claimed that negation may be easily adapted between languages of the same family that express negation using similar syntactic structures.
Yadav and Bethard [54] presented a survey of deep neural network architectures for NER and contrasted them with previous approaches to NER based on feature engineering and other supervised or semi-supervised learning algorithms. With respect to the use of parsing and syntax, they only briefly cited that shallow syntactic knowledge can be useful as a feature for unsupervised NER systems.
More recently, Hahn and Oleynik [153] reported the latest developments in medical NER for two selected semantic classes, diseases and drugs (or medications), and relations between them. They focused their review on the methodological paradigm shift from standard machine learning techniques to deep learning. They concluded that deep-learning-based approaches outperform classical machine learning ones but, at the same time, small-sized and access-limited corpora create intrinsic problems for data-greedy deep learning. The same applies to special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies. No mention was made of the use of syntactic information beyond indicating that clinical notes and reports often exhibit syntactically ill-formed, telegraphic language.
Finally, Li et al. [53] reviewed in detail existing deep learning techniques for NER, systematically categorizing approaches based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Although several deep artificial neural network models try to represent long-distance dependencies that have a clear syntactic component, they do so in an implicit way, through a sequential process in which relevant information is remembered and propagated so that the representation associated with a given word can include non-local information coming from a different location in the sentence. However, some systems resort to information derived by parsers, such as dependency roles, to build complex distributed representations of words. Minaee et al. [154] presented a comprehensive survey of deep learning models for other classification tasks.
IOB is the most widely used labeling scheme in NER, but it is not the only one. Recently, Zhong et al. [155] proposed several constituent-based labeling schemes instead of the traditional IOB positional labeling scheme. (The term "constituent" in "constituent-based tagging scheme" does not refer to constituents in the sense of constituency parsing, but to each of the elements that constitute, that is, are part of, a named entity or time expression.)
More specifically, they define a TOMN scheme to model temporal expressions, where T refers to Time token, M to Modifier, N to Numeral and O to Outside time expressions; and an UGTO scheme to model named entities, where U refers to Uncommon word, G to General modifier, T to Trigger word and O to Outside named entities. Experimental results show that CRF-based methods using these constituent-based labeling schemes perform equally to, or more effectively than, representative state-of-the-art methods on time expression extraction and named entity extraction.
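For reference, the positional IOB scheme mentioned above can be illustrated with a short Python sketch that converts entity spans into tags and back. The fragment assumes the IOB2 variant, in which every entity starts with a B- tag, and end-exclusive token offsets; the function names are introduced here for illustration. The TOMN and UGTO schemes of [155] are not reproduced, since they depend on token-level lexical classes that are not detailed in this article.

```python
def spans_to_iob(n_tokens, spans):
    """Convert (start, end, label) spans, end exclusive, into IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for k in range(start + 1, end):
            tags[k] = f"I-{label}"
    return tags


def iob_to_spans(tags):
    """Recover (start, end, label) spans from a sequence of IOB2 tags."""
    spans, start, label = [], None, None
    for k, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, k, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = k, tag[2:]
    return spans


# Tokens: John Smith visited Paris
tags = spans_to_iob(4, [(0, 2, "PER"), (3, 4, "LOC")])
print(tags)
print(iob_to_spans(tags))
```

The tag assigned to a token depends only on its position within the entity (beginning, inside, or outside), which is what distinguishes positional schemes from the constituent-based ones described above.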

Conclusions
Written text is the fundamental element by which human beings record their ideas, desires, aspirations, creations, and the events that occur in their environment; it is ultimately the main medium by which knowledge is transmitted. A clear example of this can be found in articles published in scientific journals like this one. The huge amount of text that is currently generated on a daily basis makes its manual examination unfeasible, making it necessary to create automatic tools, that is, NLP tools, to extract knowledge from it. NER emerges as a basic NLP task for this purpose. NER is a difficult task. Probably because of this, most systems have opted for an approach based on sequence labeling that makes a limited use of the inherent structure of text. Although this approach has not been able to solve the NER task, it has managed to yield systems with sufficient performance to be applicable in practice. In particular, recent developments in neural architectures have allowed us to increase sequence labeling performance in NLP tasks [27,35,156-158], in part due to the use of contextualized embeddings from language models like Embeddings from Language Models (ELMo) or BERT [159,160]. However, to continue improving the performance of NER systems, it is necessary to incorporate the information provided by the techniques that can analyze, process, and elaborate the structural information of sentences, that is, parsing. Throughout this article we have shown how NER systems that use parsing information manage to improve over the results of those that do not use it, and how improvements in parsing techniques in recent years allow a smoother and more efficient incorporation of structural information into NER systems.
Regarding the future evolution of the integration of syntactic and semantic information in NER systems, we advise casting parsing itself as a sequence labeling task, making it much more straightforward to integrate with NER. This way, we will be able to use complete syntactic trees, while at the same time not having to resort to non-standard architectures and retaining the simplicity, genericity, and efficiency of a sequence labeling architecture for NER. This approach can also be used to apply the parsing component of a NER system in a multilingual setting, thanks to the availability of Universal Dependencies (UD) (https://universaldependencies.org/), a unified parsing framework that currently supports 111 languages, with 32 more to be added soon. For example, this approach can be applied to incorporate a parsing component into a NER system in Arabic [161], Persian [162], or French [163] by training a parser with the UD treebank for each of those languages. Moreover, the strategy used in [120,164] to build multilingual sentiment analysis systems can be applied toward building truly multilingual NER systems.
Finally, we would like to point out what we consider to be the main challenges for the successful application of parsing in NER systems:
• A standard framework for NER resources. Although the availability of UD makes it possible to have a multilingual parsing component with common annotation criteria across languages, the same is not the case for the rest of the components of a NER system. The NER community needs to move in this direction, which will also facilitate the creation of truly multi-domain NER systems.
• Larger data sets. The deep learning techniques that currently represent the state of the art in both parsing and NER require the largest possible data sets to exploit their full potential.
• Semantic parsing. The latest developments in syntactic parsing indicate that the proposed approach is fast enough to be applicable in large-scale NER systems, as well as accurate enough to provide useful information for the task. It is still unknown whether similar performance can be achievable with respect to semantic parsing, although the prospects are encouraging.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.