Review On Collocations and Their Interaction with Parsing and Translation

We address the problem of automatically processing collocations—a subclass of multi-word expressions characterized by a high degree of morphosyntactic flexibility—in the context of two major applications, namely, syntactic parsing and machine translation. We show that parsing and collocation identification are processes that are interrelated and that benefit from each other, inasmuch as syntactic information is crucial for acquiring collocations from corpora and, vice versa, collocational information can be used to improve parsing performance. Similarly, we focus on the interrelation between collocations and machine translation, highlighting the use of translation information for multilingual collocation identification, as well as the use of collocational knowledge for improving translation. We give a panorama of the existing relevant work, and we parallel the literature surveys with our own experiments involving a symbolic parser and a rule-based translation system. The results show a significant improvement over approaches in which the corresponding tasks are decoupled.


Introduction
Multi-word expressions, "idiosyncratic interpretations that cross word boundaries" [1], are widely acknowledged as a key problem for Natural Language Processing (NLP). Indeed, they are seen as a "pain in the neck" for NLP [1] or a "hard nut to crack" [2]. Multi-word expressions (henceforth, MWEs) cover a broad range of phenomena, such as named entities, multi-word function words, nominal compounds, verb-particle constructions, verbal expressions, idioms, proverbs, and so on, which all have in common the fact that they have to be treated as a whole rather than on a word-by-word basis and, therefore, require a special, holistic treatment in NLP systems.
A particularly important subclass of MWEs is represented by so-called "institutionalized" phrases, or collocations (e.g., heavy rain, heavy smoker, serious injury, to meet a need, to extend thanks, deeply in love). Collocations are expressions that are relatively regular from a syntactic and semantic point of view, but are statistically idiosyncratic. The component words are, in principle, associated by means of regular grammatical processes, such as the combination of a noun with a modifier yielding a nominal phrase whose meaning can be deduced from the meaning of the parts. Yet, what is peculiar, idiosyncratic or irregular about such expressions is that they are highly preferred over alternative lexicalizations: compare, for instance, traffic light, which is a collocation, an institutionalized phrase, with combinations like *traffic director or *intersection regulator, which barely occur in language (example from [1]). Such combinations are highly language-specific, and they indicate, to a large extent, the degree of fluency of a language utterance or of the output produced by an NLP system. Even if they are decomposable into parts, they still have to be treated by computational systems in a holistic way, to avoid unnatural or awkward formulations.
According to several researchers (e.g., [4][5][6]), collocations are the most numerous amongst all types of MWEs. As a matter of fact, "no piece of natural spoken or written English is totally free of collocation" [7]. The importance of collocations "stands in their omnipresence" [5]. It is also important to note that, unlike most other types of multi-word expressions, collocations may occur in a wide range of syntactic patterns. The following is a list of syntactic configurations commonly associated with collocations in English: adjective-noun (heavy smoker), noun-(predicate)-adjective (effort [be] devoted), noun-noun (suicide attack), noun-preposition-noun (round of negotiations), noun-preposition (inquiry into), adjective-preposition (crazy about), subject-verb (problem occurs), verb-object (meet requirement), verb-preposition-argument (bring to boil), verb-preposition (depend on), adverb-verb (fully support), adverb-adjective (highly important), adjective-coordination-adjective (nice and warm).
In addition, lexicographic evidence shows that this list can be considerably extended [8]. Summing up, collocations are idiosyncratic syntagmatic combinations, which are not restricted to a given word class or to a given set of syntactic patterns [9].
Researchers have long attempted to characterize the phenomenon of word collocation by addressing it from many different angles. However, there is still no agreed-upon definition, and the collocation concept is generally accompanied by vagueness and confusion. Collocations remain less studied and less well understood than other types of MWEs, notably, idioms (kick the bucket), light-verb constructions (to take a walk) or verb-particle constructions (to look up).
In this article, we put emphasis on a particular aspect that distinguishes collocations from other expressions and that contributes to making them particularly difficult to process by computational systems: the high morphosyntactic flexibility of collocations. The component words in a collocation may, in principle, undergo the full range of morphological and syntactic transformations that are possible for regular combinations in language (see Examples 1 and 2).
In contrast, other expressions, like named entities (New York City), compounds (wheel chair) or idioms (to be over the moon 'to be extremely pleased'), are relatively fixed, or frozen, this characteristic acting as a useful discrimination feature and permitting a more local (and, therefore, more computationally inexpensive) automatic treatment.
With respect to the flexible nature of collocations, it is worth noticing that in the fields of NLP and translation, the ISO 12620 data categories standard describes collocations as "[a] recurrent word combination characterized by cohesion in that the components of the collocation must co-occur within an utterance or series of utterances, even though they do not necessarily have to maintain immediate proximity to one another" (emphasis added). This definition highlights an essential feature of collocations, namely, the discontiguity of the component words, which is the consequence of the syntactic flexibility of these expressions.
Indeed, this discontiguity is arguably one of the biggest challenges that NLP systems face when processing collocations. As collocations exhibit (almost) full syntactic variability, processing them requires dealing with the wide range of syntactic transformations in which collocations can occur. Their high variability calls for linguistically sophisticated approaches, capable of accurately identifying collocations in many syntactic contexts and of accounting for long-distance dependencies, in order to ultimately enable their proper treatment in applications such as parsing or translation.
Another aspect on which our work is particularly focused is the integration of collocations into actual NLP pipelines, i.e., their use in client natural language applications. Developing accurate collocation identification techniques has been a main concern in the NLP field for a couple of decades already; however, the exploitation of collocations in other NLP applications has received considerably less attention. In this article, we address the problem of connecting the application of collocation extraction (or identification) with two major NLP applications, namely, syntactic parsing and machine translation. We investigate whether a synergetic approach, one in which information is shared between the task of collocation identification and the other two tasks, is more efficient than the standard approach, in which the tasks are performed independently of each other.
The article explores four main directions of work that have been pursued to a greater or lesser extent in the NLP field until now. By focusing on the interrelation between the tasks of collocation identification and syntactic parsing, we look at the benefits that can be obtained from relying on syntactic parsing for collocation extraction, and, vice versa, from using collocations during parsing. Then, by focusing on the interrelation between collocation identification and translation, we investigate whether translation technologies can contribute to the task of automatically detecting collocations in text corpora and, conversely, whether collocations are useful for machine translation. For each main topic, a literature overview is provided, paralleled by reports on our own experiments, confirming that a synergetic approach is preferable to approaches in which the tasks are handled individually. Backed by findings from other studies on synergetic approaches (for instance, on using parsing for semantic analysis [10]), these results suggest that NLP work, which is often fragmented, would benefit from greater interaction between the various tasks.
The article is structured as follows. In Section 2, we focus on using syntactic parsing for collocation extraction. We survey related work and outline our own extraction methodology, which relies on full syntactic parsing for acquiring collocations from text corpora in several languages. In Section 3, we review the work in which translation-related technologies, such as sentence and word alignment, are used for identifying collocations. Furthermore, we outline our own method of detecting translation equivalents for collocations by exploiting translation archives. Sections 4 and 5 focus on exploiting collocational knowledge for parsing and translation. In Section 4, we discuss the extent to which collocations are presently taken into account in parsing systems; then, we present an approach in which collocation identification and syntactic analysis are performed simultaneously, rather than separately, as in previous work. Section 5 addresses the question of integrating collocations and other types of multi-word expressions into machine translation systems. It also presents a study aimed at assessing the impact of collocations on the results of an in-house rule-based translation system. Finally, Section 6 concludes the paper by taking stock of the current treatment of collocations and, where appropriate, indicating more adequate processing alternatives.

Using Parsing for Collocation Identification
The development of collocation extraction as an area of research has seen the increasing adoption of linguistic analysis as an essential preprocessing step. This step allows for a more accurate identification of candidates, which are then scored by using statistical methods and, in particular, so-called association measures (e.g., mutual information, t-score, z-score, χ², log-likelihood ratio; see [11][12][13][14] for descriptions and comparative evaluations of association measures).
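As a rough illustration of how association measures score a candidate pair, the sketch below computes three of the measures named above from co-occurrence counts, using their standard textbook formulations. The function names and the count-based interface are assumptions made for this example; they are not taken from any of the cited systems.

```python
import math

def contingency(f_xy, f_x, f_y, n):
    """Observed 2x2 table for a pair (x, y) in a sample of n candidate pairs."""
    return (
        f_xy,                    # x with y
        f_x - f_xy,              # x without y
        f_y - f_xy,              # y without x
        n - f_x - f_y + f_xy,    # neither x nor y
    )

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of the pair."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected frequency, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def log_likelihood_ratio(f_xy, f_x, f_y, n):
    """Log-likelihood ratio over the four cells of the contingency table."""
    observed = contingency(f_xy, f_x, f_y, n)
    rows = (f_x, n - f_x)
    cols = (f_y, n - f_y)
    expected = (
        rows[0] * cols[0] / n,
        rows[0] * cols[1] / n,
        rows[1] * cols[0] / n,
        rows[1] * cols[1] / n,
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

When the two words co-occur exactly as often as chance predicts, all three measures are zero; the more the observed frequency exceeds the expected one, the higher the scores.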
The preprocessing techniques gradually evolved from shallower to deeper forms of analysis, as increasingly advanced technology became available, from tokenization, stemming and lemmatization to chunking, shallow parsing, dependency parsing or full parsing. The need for performing a linguistic analysis of the input text is justified by the necessity to account for the high morphosyntactic variation characterizing collocations. Stubbs [15] analyzed, for instance, the occurrences of the pair bear-resemblance in a corpus and found the following distribution of inflected forms for the verbal component, to bear: bears 18%, bear 11%, bore 11%, bearing 4%. Put together, these forms make up a high proportion (44%) of the total number of collocates of the noun resemblance (1,085). Example 1 summarizes this information in Stubbs' notation.
Example 1. Morphological variation in collocations: Stubbs' notation for grouping inflected forms of collocates.
resemblance 1,085 <bears 18%, bear 11%, bore 11%, bearing 4%> 44%
This example illustrates the importance of performing a lexical analysis of the input text in order to better pinpoint potential collocations. As a matter of fact, a large body of collocation extraction work [16][17][18][19] relies on lexical analysis, combined with part-of-speech (POS)-based filtering of combinations considered in a five-word window called collocational span.
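The grouping in Example 1 can be reproduced with a few lines of code. The sketch below sums the shares of the inflected forms under their lemma; the shares are those quoted above, while the form-to-lemma table is a hand-written stand-in for a real lemmatizer.

```python
from collections import defaultdict

# Shares (in % of the 1,085 collocates of "resemblance") as quoted in Example 1.
FORM_SHARES = {"bears": 18, "bear": 11, "bore": 11, "bearing": 4}

# Hypothetical form-to-lemma table; a real system would call a lemmatizer here.
LEMMA_OF = {"bears": "bear", "bear": "bear", "bore": "bear", "bearing": "bear"}

def group_by_lemma(form_shares, lemma_of):
    """Sum the shares of all inflected forms that map to the same lemma."""
    totals = defaultdict(int)
    for form, share in form_shares.items():
        totals[lemma_of.get(form, form)] += share
    return dict(totals)

lemma_shares = group_by_lemma(FORM_SHARES, LEMMA_OF)
```

Without lemmatization, the strongest single form accounts for 18% of the collocates; grouped under the lemma bear, the four forms account for 44%.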
In addition to the lexical analysis, the syntactic analysis of the input text has often been argued to be necessary, particularly for languages that exhibit a freer word order, such as German or Korean. For such languages, the extraction techniques developed for English (e.g., Xtract [20]) are inefficient, since they fail to recover systematic long-distance dependencies and to account for the positional ambiguity of arguments. As reported, for instance, by Breidt [21], even distinguishing subjects from objects in German is difficult without parsing. The author therefore proposed to shorten the collocational span to three words in order to exclude nouns that are unrelated to verbs. This strategy led to increased precision, but the improvement came at the expense of recall. Similarly, Kim et al. [22] reported that a technique like Xtract [20], which is very popular for English and is based on selecting collocation candidates among the word pairs co-occurring at a stable distance in text, is completely unsuitable for Korean, because of the high syntactic flexibility.
Given the marked flexibility of collocations, some researchers have indicated that collocation extraction should ideally rely on the syntactic analysis of the source corpora [12,13,20,23,24]. However, despite their theoretical arguments, parsing has only been used in a minority of practical works. In such (exceptional) cases, collocation candidates are identified as pairs of words in a syntactic relationship, rather than as pairs of words in a collocational span, as in the prevailing syntax-free approaches. There are reports, for instance, on collocation work making use of full parsing for German [25], Chinese [26] and Dutch [27]. Similar work has been performed for English [28], a language in which collocation extraction experiments have also been carried out by exploiting manually annotated syntactic treebanks [29,30]. In addition, dependency parsing has been used for a number of languages, including English [31,32], French [33] and Czech [34]. Furthermore, a relatively larger amount of work has been devoted to collocation extraction based on shallow parsing, e.g., for English [35], German [36], French [11,37,38] and, notably, in the Sketch Engine multilingual system [39].
An important factor differentiating these syntax-based extraction systems is the performance of the parser involved. In some cases, the authors report a rather high parsing error rate, as well as robustness issues, leading them to exclude longer sentences made up of 20 words or more [27,31,32]. In other cases, the grammatical coverage of the parser is reported as limited, the extraction system being unable to deal with certain types of syntactic transformations, like relativization [28].
Apart from the underlying preprocessing technology and the specific association measures used to rank candidates, the extraction systems also vary greatly in the range of the syntactic configurations they take into account. Some systems identify candidates of a single type or a few specific syntactic types, e.g., verb-preposition [29], preposition-noun-preposition, prepositional phrase-verb [27], verb-object, noun-adjective, verb-adverb [26] or noun phrases [11,37,38]. Yet, other systems aim at a broader coverage, e.g., [25,39].
Although generally considered necessary for obtaining high-quality results, syntax-based extraction is not always seen as a viable solution in the NLP community. Sometimes, it is discarded because of the unavailability of syntactic parsers; in other cases, the decision not to preprocess the source corpora with a syntactic parser is justified by various arguments, such as time inefficiency, lack of precision or lack of robustness. Adding to the skepticism, no comparative evaluation has been performed to clearly prove the superiority of syntax-based extraction over the simpler syntax-free alternative.
Our own work [44] was devoted to devising a fully-fledged extraction methodology based on full syntactic parsing. We relied on the Fips multilingual parser [40] to preprocess the source corpora and to select as candidates the combinations of words found in specific syntactic relations (see Section 1). A range of association measures can be applied to rank the selected candidates; the measure proposed by default is the log-likelihood ratio, which is argued to be efficient for both high-frequency and low-frequency data [41]. The extraction system, initially developed for English and French, was later extended to the new languages supported by the Fips parser, i.e., Spanish, Italian, German, Greek and Romanian.
Fips is a robust symbolic parser based on generative grammar concepts. It performs a "deep" syntactic analysis of the input sentence, making use of co-indexation to keep track of extraposed constituents, i.e., constituents that "moved" from the initial (canonical) position to the surface position, due to syntactic transformations, such as the ones shown in Example 2.
Example 2. Syntactic variation in collocations: sample transformations.
1. relativization: various global challenges that we inevitably have to face
2. passivisation: the challenges faced by the pharmaceutical industry today
3. interrogation: Which challenges do online media face in terms of press freedom?
The parser's ability to account for extraposition is essential for coping with cases of long-distance dependency in collocations, when the component words may not occur in the same clause. In the relativization case of Example 2, for instance, the verb face occurs in the subordinate clause, while its object challenges is in the main clause. As can be seen from Example 3 showing the (simplified) parsing output, the parser correctly identifies the "deep" object of face by creating a co-indexation chain (labeled i) that contains the empty constituent in the object position of the verb face, e_i, the relative pronoun that and the noun phrase headed by challenges. Thanks to this mechanism, the verb-object pair face-challenge can be successfully identified as a potential collocation.
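The role of the co-indexation chain can be pictured with a toy data structure. The sketch below is only an illustration of the idea: the Constituent class, its fields and the deep_argument function are hypothetical and do not reflect the actual output format or internals of Fips.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Constituent:
    kind: str                      # "np", "relpro" or "empty" (hypothetical labels)
    lemma: Optional[str] = None    # lemma of the lexical head, None if empty
    index: Optional[str] = None    # co-indexation label, e.g. "i"

def deep_argument(slot, chains):
    """Return the lemma filling an argument slot.

    If the slot holds an empty constituent, follow its co-indexation chain
    and return the head of the noun phrase found in that chain.
    """
    if slot.kind != "empty":
        return slot.lemma
    for member in chains.get(slot.index, []):
        if member.kind == "np":
            return member.lemma
    return None

# "various global challenges that we inevitably have to face"
antecedent = Constituent("np", "challenge", "i")
relative_pronoun = Constituent("relpro", "that", "i")
empty_object = Constituent("empty", None, "i")   # object position of "face"
chains = {"i": [antecedent, relative_pronoun, empty_object]}

candidate_pair = ("face", deep_argument(empty_object, chains))
```

Here the object slot of face is empty on the surface, and the chain lookup recovers challenge, so the verb-object pair face-challenge becomes available as a candidate.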
Two large-scale evaluation experiments have been performed to assess the impact of parsing on the quality of collocation extraction results. The results obtained using the syntax-based extraction method have been compared against those of the syntax-free baseline. The baseline consists of applying the so-called sliding window method to lemmatized and POS-filtered data, which means that all possible combinations within a five-word window that comply with the selected POS patterns are considered as candidates. The log-likelihood ratio measure [41] was applied in both cases.
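A minimal sketch of such a sliding window baseline is given below, under the assumption that the input is already lemmatized and POS-tagged. The tag names, the two POS patterns and the function name are illustrative choices for this example, not the configuration used in the experiments.

```python
from collections import Counter

def window_candidates(tokens, patterns, span=5):
    """Collect candidate pairs from (lemma, pos) tokens.

    Every ordered pair of tokens that falls inside a window of `span` words
    and whose POS tags match one of `patterns` is counted as a candidate.
    """
    counts = Counter()
    for i, (lemma_i, pos_i) in enumerate(tokens):
        # The window covers token i and the span - 1 tokens that follow it.
        for j in range(i + 1, min(i + span, len(tokens))):
            lemma_j, pos_j = tokens[j]
            if (pos_i, pos_j) in patterns:
                counts[(lemma_i, lemma_j)] += 1
    return counts

# Hypothetical tagging of "the challenges faced by the pharmaceutical industry today".
SENTENCE = [
    ("the", "DET"), ("challenge", "NOUN"), ("face", "VERB"), ("by", "ADP"),
    ("the", "DET"), ("pharmaceutical", "ADJ"), ("industry", "NOUN"), ("today", "ADV"),
]
PATTERNS = {("VERB", "NOUN"), ("ADJ", "NOUN")}

candidates = window_candidates(SENTENCE, PATTERNS)
```

On this passive sentence, the window yields pharmaceutical-industry but also the spurious pair face-industry, and it misses face-challenge because the object precedes the verb. This is the kind of noise and silence that a syntactic analysis is meant to avoid.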
The first experiment was performed in a monolingual setting, on French data from the Hansard corpus of Canadian Parliament proceedings, totaling about 1.2 million words. The top 500 pair types were manually evaluated by three judges, and the results showed a statistically significant improvement in precision with respect to the syntax-free baseline: 99% vs. 78.3% in terms of grammaticality, and 65.9% vs. 57% in terms of lexicographic interest of the results. (A two-sample t-test was conducted to compare the number of grammatical pairs in the two methods' output; the difference was significant: t(982) = 10.78, p < 0.001. A similar two-sample t-test was conducted to compare the number of pairs considered worth storing in a lexicon.)
The second experiment was performed cross-linguistically, on French, English, Italian and Spanish parallel data from the Europarl corpus of European Parliament proceedings, totaling about 3.7 million words on average per language. A stratified sampling strategy was used to select the evaluation data, with sequences of 50 pair types taken from various levels in the output list (top-0%, 1%, 3%, 5% and 10%). A total of 2,000 pair types were manually evaluated by teams of two judges. The results showed, again, statistically significant improvements over the baseline (88.8% vs. 33.2% average grammatical precision; 43.2% vs. 17.2% average lexicographic precision; and 32.9% vs. 12.8% average collocational precision).
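The stratified sampling used in the second experiment can be sketched as follows. The sketch assumes that each level denotes a starting position in the ranked output list, expressed as a fraction of its length, from which 50 consecutive pair types are taken; this reading and the function name are assumptions. It is consistent with the reported total of 2,000 pair types (5 levels × 50 pairs × 4 languages × 2 methods), although this breakdown is an inference and is not stated explicitly in the text.

```python
def stratified_sample(ranked, levels=(0.0, 0.01, 0.03, 0.05, 0.10), size=50):
    """Take `size` consecutive items starting at each relative position."""
    samples = {}
    for level in levels:
        start = round(len(ranked) * level)
        samples[level] = ranked[start:start + size]
    return samples

# Hypothetical ranked output list of 10,000 pair types (identified by rank).
ranked_output = list(range(10_000))
samples = stratified_sample(ranked_output)

pairs_per_list = sum(len(s) for s in samples.values())   # 5 levels x 50 pairs
total_evaluated = pairs_per_list * 4 * 2                  # 4 languages, 2 methods
```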
It is important to note that the very top results of the baseline are relatively noise-free, since the association measure succeeds in eliminating many erroneous pairs from the top positions. However, the precision degrades rapidly for the items on lower positions: as the frequency of pairs decreases, the measure alone is inefficient in removing the noise. In contrast, the syntax-based approach guarantees a better global quality of the results, meaning that even candidates with lower scores are noise-free. This is particularly important, since lexical data has a skewed, Zipfian distribution, and most of the candidate combinations receive low scores because of their low co-occurrence frequency; yet, they are potentially interesting from a lexicographic point of view. Carrying out lexicographic work on a noise-free list is one of the main benefits of syntax-based approaches to collocation extraction.
The results of these experiments dispelled the doubts about the feasibility of a syntax-based approach to collocation extraction and showed the benefits obtained over the syntax-free approach. They confirmed that parsing information contributes to a statistically significant increase in collocation extraction performance and corroborated similar findings obtained by using syntax-based approaches to other tasks, e.g., term extraction [42], semantic role labeling [10] and semantic similarity computation [43].

Using Translation for Collocation Identification
In this section, we focus on translation-based approaches to the identification of multi-word expressions (MWEs) in general, and of collocations in particular. In the work reported in the literature on the topic, we distinguish between two distinct trends:
• Approaches that exploit translation archives, represented by parallel corpora or source-target pairs of monolingual corpora, for identifying translation equivalents for MWEs/collocations;
• Approaches that take into account word alignment information for detecting and ranking monolingual MWE/collocation candidates.

The First Trend: Exploiting Corpora for Collocation Identification
The first trend is the more traditional and more popular one, represented, for instance, by [26,45–49]. The work in [45][46][47] was devoted to acquiring translation equivalents for noun phrases, whereas [26,48,53] deal with collocations of several syntactic types and [49] with MWEs in general. In what follows, we provide further details on this work. (Note that evaluation results are systematically reported in this type of work, whereas for collocation extraction, they may be missing or replaced by small output samples.)
In [45], Kupiec identifies noun phrase correspondences between English and French by relying on the sentence-aligned Hansard parallel corpus. Both source and target corpora are POS-tagged, then NPs are detected with a finite-state recognizer. The matching is done by using Expectation Maximization (EM), an iterative re-estimation algorithm. The precision reported is 90% for the top 100 translations obtained.
In the same vein, van der Eijk [46] used a similar method for the language pair Dutch-English, except that the matching is done using two main heuristics: the target noun phrase is selected depending on (a) its frequency in the target sentences; and (b) its relative position in the source sentence. The reported performance was lower (68% precision and 64% coverage), which was explained by the fact that the evaluation was performed on a larger test set of 1,100 noun phrases.
In addition, Dagan and Church [47] made use of word alignment to find candidate translations for noun phrases in parallel corpora. Once the source noun phrases have been identified, the text span between the alignments of the first and the last words of the phrase is proposed as a translation candidate. Candidates are sorted in decreasing frequency order. The authors argue that unlike the previous systems, their system, Termight, has the advantage of finding translations even for infrequent terms. The method was tested on 192 English-German correspondences and achieved a precision of 40% (when the first translation alternative was considered only).
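The span-based proposal step described above can be sketched in a few lines. The function names, the representation of the word alignment as a dictionary from source positions to target positions, and the English-French toy sentence are assumptions made for this illustration; they are not taken from Termight.

```python
from collections import Counter

def candidate_translation(source_span, alignment, target_tokens):
    """Propose the target text between the alignments of the first and last
    words of a source noun phrase.

    source_span: (first, last) token positions of the source noun phrase.
    alignment:   dict mapping source positions to target positions.
    """
    first, last = source_span
    a, b = alignment[first], alignment[last]
    lo, hi = min(a, b), max(a, b)
    return target_tokens[lo:hi + 1]

def rank_by_frequency(candidates):
    """Sort candidate translations in decreasing frequency order."""
    return Counter(" ".join(c) for c in candidates).most_common()

# Hypothetical aligned sentence pair: "press freedom matters".
TARGET = ["la", "liberté", "de", "la", "presse", "compte"]
ALIGNMENT = {0: 4, 1: 1, 2: 5}   # press -> presse, freedom -> liberté, matters -> compte

proposal = candidate_translation((0, 1), ALIGNMENT, TARGET)
```

Because only the first and last words of the phrase need to be aligned, the proposal also covers target words that have no direct counterpart in the source phrase, such as de la in this example.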
All the systems discussed above are limited to a particular type of construction, namely, the nominal compound, which is relatively fixed. In contrast, the Champollion system built by Smadja et al. [48] is the first system devoted to collocations proper, and it can handle both rigid and flexible combinations. It relies on the Xtract collocation extractor for English [20]. For each source collocation, it attempts to detect a translation equivalent in the aligned French sentences from the Hansard corpus. The matching relies on a statistical correlation metric, the Dice coefficient. The method used requires an additional post-processing step in which the order of words in a flexible collocation is decided, given that no syntactic analysis is performed on the target side. The system has been evaluated by three annotators and showed a precision of 77% and 61%, respectively, on two different test sets of 300 collocations each. These collocations have been randomly selected among the medium-frequency results. The difference in the precision obtained is explained by the lower frequency of collocations from the second set.
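For reference, the Dice coefficient of a source item and a target item is twice their joint frequency divided by the sum of their individual frequencies. The sketch below computes it over sentence-aligned data; the toy corpus, the set-based representation of sentences and the function names are assumptions made for this illustration and do not reproduce Champollion's actual procedure.

```python
def dice(f_xy, f_x, f_y):
    """Dice coefficient from joint and individual frequencies."""
    return 2 * f_xy / (f_x + f_y) if (f_x + f_y) else 0.0

def dice_in_aligned_corpus(pairs, source_item, target_item):
    """Count in how many aligned sentence pairs each item occurs, then apply Dice.

    pairs: list of (source_items, target_items), each a set of items found
           in the source sentence and in its aligned target sentence.
    """
    f_x = sum(1 for src, _ in pairs if source_item in src)
    f_y = sum(1 for _, tgt in pairs if target_item in tgt)
    f_xy = sum(1 for src, tgt in pairs if source_item in src and target_item in tgt)
    return dice(f_xy, f_x, f_y)

# Hypothetical aligned data: source collocations and target words per sentence pair.
ALIGNED = [
    ({"meet need"}, {"répondre", "besoin"}),
    ({"meet need"}, {"besoin", "répondre"}),
    ({"heavy rain"}, {"pluie", "besoin"}),
    ({"heavy rain"}, {"pluie"}),
]

score_besoin = dice_in_aligned_corpus(ALIGNED, "meet need", "besoin")
score_pluie = dice_in_aligned_corpus(ALIGNED, "meet need", "pluie")
```

In this toy corpus, besoin co-occurs with meet need in both of its sentence pairs but also appears once elsewhere, so its score is high without being maximal, whereas pluie never co-occurs with meet need and scores zero.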
The method of Lü and Zhou [26] can also deal with flexible collocations; moreover, these are validated syntactic constituents, as they are extracted using a parser. The syntactic types considered are verb-object, adjective-noun and adverb-verb. Collocations are extracted from monolingual corpora in English and Chinese by applying the log-likelihood ratio measure on syntactically related pairs identified by a dependency parser. The matching between a source and target collocation is performed by using a statistical translation model, which estimates word translations with EM. The method (whose reported coverage is 83.98%) was evaluated on a test set of 1,000 randomly selected collocations. It achieves between 50.85% and 68.15% accuracy, depending on the syntactic type. The availability and the quality of bilingual dictionaries are essential for the performance of this method.
Our own method for finding translation equivalents for collocations in parallel corpora [53] consists of source-side and target-side collocation extraction performed with the system presented in Section 2 and a linguistically-motivated matching procedure for pairing a source collocation and its potential translations. The matching procedure takes into account the syntactic type of the collocation, by requiring the translation candidates to be of a compatible syntactic type (for instance, a verb-object collocation in English may have either a verb-object or a verb-preposition-argument equivalent in French: face challenge, relever défi; meet need, répondre à besoin). The frequency is also taken into account when trying to pinpoint the translation equivalent. The most frequent target candidate is retained as the potential translation; in the case of tied results, these are ranked according to their association score. Information from bilingual dictionaries is also used, if available for the specific words involved in the source collocation. More precisely, we exploit the translations of the collocation base only: according to theoretical stipulations [5], the base, i.e., the semantically autonomous component of a collocation, can be translated literally, whereas the collocate, the component that is semantically dependent on the base, cannot. For instance, in meet need, the base need is translated literally as besoin; therefore, the combinations with the noun besoin are considered as potential translation equivalents. The method has been evaluated on 4,000 pairs extracted from the Europarl corpus [55] in English, French, Spanish and Italian. The results showed a performance of 81.6% according to the F-measure (84.1% precision and 79.2% recall), which means that our method compares favorably against previous methods.
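The matching procedure can be summarized by the following sketch, which applies the three criteria in turn: compatible syntactic type, dictionary translation of the base, and frequency with the association score as a tie-breaker. The data structures, the compatibility table, the frequencies and the scores are hypothetical and serve only to illustrate the order of the steps; the behaviour when the dictionary filter leaves no candidate is likewise an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Collocation:
    base: str        # semantically autonomous component
    collocate: str   # component that depends on the base
    syn_type: str    # syntactic type, e.g. "verb-object"

# Hypothetical compatibility table between source and target syntactic types.
COMPATIBLE = {"verb-object": {"verb-object", "verb-preposition-argument"}}

def match(source, targets, base_translations, compatible=COMPATIBLE):
    """Pick a translation equivalent for `source`.

    targets: list of (Collocation, frequency, association_score) extracted
             from the target sentences aligned with the source occurrences.
    base_translations: dict from a source base to its dictionary translations.
    """
    allowed = compatible.get(source.syn_type, {source.syn_type})
    pool = [t for t in targets if t[0].syn_type in allowed]
    known = base_translations.get(source.base)
    if known:  # the dictionary is used only when it has an entry for the base
        pool = [t for t in pool if t[0].base in known]
    if not pool:
        return None  # assumption: no equivalent is proposed in this case
    # Most frequent candidate first; ties are broken by the association score.
    return max(pool, key=lambda t: (t[1], t[2]))[0]

source = Collocation("need", "meet", "verb-object")
targets = [
    (Collocation("besoin", "répondre", "verb-preposition-argument"), 7, 41.0),
    (Collocation("besoin", "satisfaire", "verb-object"), 3, 20.0),
    (Collocation("besoin", "grand", "adjective-noun"), 9, 50.0),
]
best = match(source, targets, {"need": {"besoin"}})
```

In this toy case, the adjective-noun candidate is discarded despite its higher frequency, because its syntactic type is not compatible with a verb-object source, and the verb-preposition-argument candidate with répondre is retained.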
The more recent method of Bai et al. [49] detects English equivalents for MWEs extracted from Chinese corpora using a parser. The matching method is very similar to that in [48], also making use of the Dice coefficient, but applying additional frequency filters. The performance has been measured extrinsically by performing a task-based evaluation, in which the extracted translation equivalents have been used for statistical machine translation; these have been found to lead to a significant improvement of translation results.

The Second Trend: Exploiting Word Alignments
Next to the approaches aimed at detecting translation equivalents from (parallel) corpora, we find an emergent class of approaches in which translation information, in particular, word alignment, is used for the monolingual identification of MWEs [50,51]. Such alignment-based approaches rely on tools that use standard models derived from statistical machine translation.
The hypothesis put forward, for instance, by Villada Moirón and Tiedemann [50] is that the idiosyncrasy of MWEs is reflected in their translational entropy: unlike for regular (compositional) constructions, the components of an idiomatic expression are not translated literally, but are harder to translate. Therefore, there is a larger variety of translation links for them and, consequently, a higher average entropy for such expressions. The authors first extract verb-preposition-noun candidates in Dutch using parsing and association measures, then compute their average translational entropy into English, Spanish and German in order to re-rank these candidates. They evaluate their method on the top 200 candidates by comparing their new ranking against the old ranking using UAP, the uninterpolated average precision [52]. It was found that the alignment significantly improves the ranking of candidate expressions and, thus, the extraction performance.
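One way to make the notion of translational entropy concrete is to compute, for each component word, the Shannon entropy of the distribution of its translation links and to average it over the components of the expression. The sketch below follows this generic formulation; the exact definition used in [50] may differ in its details, and the function names, the input format and the example links are assumptions.

```python
import math
from collections import Counter

def translational_entropy(links):
    """Shannon entropy (in bits) of the target words a source word is linked to.

    links: list of target words, one per alignment link of the source word.
    """
    counts = Counter(links)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def average_entropy(links_per_word):
    """Average translational entropy over the component words of an expression."""
    entropies = [translational_entropy(links) for links in links_per_word.values()]
    return sum(entropies) / len(entropies)

# Hypothetical alignment links for the components of two candidate expressions.
literal = {"w1": ["a", "a", "a", "a"], "w2": ["b", "b", "b", "b"]}
idiomatic = {"w1": ["a", "c", "d", "e"], "w2": ["b", "f", "b", "g"]}

literal_score = average_entropy(literal)
idiomatic_score = average_entropy(idiomatic)
```

A component that is always linked to the same target word has zero entropy, while a component linked to many different target words has high entropy, which is the signal used for re-ranking the candidates.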
In the work of Caseli et al. [51], the word alignment is used for the actual identification of MWEs, as opposed to their mere ranking as in [50]. The authors consider as MWE candidates the word sequences that are aligned systematically to the same target sequence, regardless of the length of the latter. Candidates are POS-filtered, and a frequency threshold is applied. When evaluated by taking into account human judgments, their method showed an overall precision of 49.28%. The precision was found to be higher for high-frequency candidates and for specific patterns, such as verb-preposition/particle.

Exploiting Collocations for Syntactic Parsing
As stated by Sag et al. [1], MWEs are a key problem for Natural Language Processing, because they cannot be treated by compositional methods, as these would lead to overgeneration (e.g., the production of a phrase like *intersection regulator instead of traffic light). Moreover, they cannot be analyzed by means of regular grammatical processes since they are idiosyncratic (for instance, the expression in line lacks the determiner: *in the line).
To account for the MWEs present in language, practical NLP work has generally adopted a solution consisting of listing MWEs in a lexicon and tokenizing the input text by using a words-with-spaces approach to recognize MWEs. This approach suffers from two main drawbacks. The first is the lack of flexibility, as many expressions allow lexical material to occur between the component words, and the words-with-spaces approach is inadequate for handling such situations. The second is the lexical proliferation problem, as expressions generally exhibit variation, and listing all possible forms in a lexicon would be impractical. The words-with-spaces approach therefore fails to generalize and handles variation badly [1].
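The lack of flexibility of the words-with-spaces approach is easy to reproduce. The sketch below merges contiguous token sequences listed in a lexicon into single tokens; the lexicon entries, the underscore convention and the function name are assumptions made for this illustration.

```python
def words_with_spaces(tokens, lexicon):
    """Merge contiguous token sequences listed in `lexicon` into single tokens.

    lexicon: set of tuples of tokens; the longest match at each position wins.
    """
    merged = []
    longest = max((len(entry) for entry in lexicon), default=1)
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No lexicon entry starts at this position: keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical lexicon with one fixed expression and one flexible collocation.
LEXICON = {("traffic", "light"), ("face", "challenges")}

fixed = words_with_spaces("the traffic light changed".split(), LEXICON)
flexible = words_with_spaces("the challenges faced by the industry".split(), LEXICON)
```

The fixed expression traffic light is recognized, but the passive realisation of face challenges is left untouched: the components appear in a different order, in a different inflected form and with other material around them. Covering such cases would require listing every variant in the lexicon, which is the lexical proliferation problem.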
These problems are even more acute in the case of collocations. As stated in Section 1, collocations are the most flexible amongst all kinds of MWEs, which makes the words-with-spaces approach completely inadequate for their representation and treatment. There is no systematic restriction on the number of forms of a collocation component (e.g., a verb), the order of components of a collocation or the number of words that may intervene between them (see Example 2). Collocations are situated at the intersection of lexicon and grammar; therefore, they cannot be accounted for solely by the lexical component of a parsing system. Instead, they have to be integrated into the grammatical component as well, as the parser has to consider all their possible syntactic realisations.
As an alternative to the words-with-spaces preprocessing approach, collocations could be recognized by the parser after the analysis of the input sentence has been performed, following an approach such as the one described in Section 2. Again, this approach is not fully appropriate from a parsing point of view, and the reason lies with the important observation that prior collocational knowledge is highly relevant for parsing. Collocational preferences are, along with other types of information, like selectional restrictions and subcategorization frames, a major means of structural disambiguation. In fact, collocational relations between the words in a sentence proved useful for selecting the most plausible among all the parse trees generated for a sentence [56][57][58][59]. As we will show later in this section, a more suitable approach is the inclusion of collocations into the grammatical component of the parser, so that the identification of collocations and the construction of a parse tree become interacting processes, which take place simultaneously and inform each other.
In what follows, we review the extent to which MWE/collocational knowledge has been incorporated into parsing systems in order to improve their performance. A number of studies have provided empirical evidence that, indeed, the recognition of MWEs/collocations leads to better parsing results. For instance, Brun [60] compared the coverage of a French parser with and without terminology recognition in the preprocessing stage. She found that the integration of 210 nominal terms in the preprocessing components of the parser led to a significant reduction of the number of alternative parses (from an average of 4.21 to 2.79). The author reports that the eliminated parses were semantically undesirable and that no valid analyses were ruled out. Similarly, Zhang et al. [61] extended a lexicon with 373 additional MWE lexical entries and obtained a significant increase in the grammatical coverage of an English parser (of 14.4 percentage points, from 4.3% to 18.7%).
In the cases mentioned above, a words-with-spaces approach has been used to represent MWEs. In contrast, Alegria et al. [62] and Villavicencio et al. [63] adopted a compositional approach to the encoding of MWEs, able to capture more morphosyntactically-flexible MWEs. Alegria et al. [62] showed that by using an MWE processor in the preprocessing stage of their parser (in development) for Basque, a significant improvement in the POS tagging precision is obtained. Villavicencio et al. [63] found that the simple addition of 21 new MWEs to the lexicon led to a significant increase in the grammar coverage (from 7.1% to 22.7%), without altering the grammar accuracy.
In addition, an area of intensive research in parsing is specifically concerned with the use of lexical preferences, co-occurrence frequencies, collocations and contextually similar words for prepositional phrase (PP) attachment disambiguation. Thus, a large number of unsupervised methods [56,64,65], supervised methods [57,58] and combined methods [66] have been developed to this end. However, as pointed out in [56], the bottleneck of this strand of work is that the parsers lack precisely the kind of corpus-based information that is required to resolve ambiguity. Performing parsing and identifying collocations is therefore a circular problem, to which the previous literature provided no solution.
In the remainder of this section, we outline a novel collocation processing approach implemented in relation to the development of the Fips parser, consisting of the simultaneous execution of the two tasks, sentence analysis and collocation identification [67]. Briefly put, the idea is that when the parser processes a lexical item that is marked in the lexicon as part of a collocation, in attempting to attach it to another item, it first checks the collocation lexicon for a syntactically compatible entry and, if found, it gives high priority to the structure in which the two components are attached. Thus, the collocation identification mechanism is incorporated within the constituent attachment procedure of the parser.
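The idea of giving priority to attachments that match a collocation lexicon entry can be pictured with the toy sketch below. It is not the Fips implementation: the lexicon format, the function names, the relation labels and the numeric scores are all assumptions made for illustration.

```python
# Hypothetical collocation lexicon: (lemma 1, lemma 2, syntactic relation).
COLLOCATIONS = {("break", "record", "verb-object")}

def attachment_priority(word1, word2, relation, base_score=1.0, bonus=10.0):
    """Score a candidate attachment; boost it if it realizes a known collocation.

    base_score and bonus are arbitrary illustrative values.
    """
    if (word1, word2, relation) in COLLOCATIONS:
        return base_score + bonus
    return base_score

def best_attachment(candidates):
    """Pick the highest-priority attachment among (word1, word2, relation) triples."""
    return max(candidates, key=lambda c: attachment_priority(*c))

# Two competing analyses of the same word pair (invented example).
candidates = [
    ("break", "record", "verb-adjunct"),
    ("break", "record", "verb-object"),
]
```

Here the verb-object analysis is preferred because it is syntactically compatible with a lexicon entry, which mirrors, in a very reduced form, the check performed during constituent attachment.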
It is well known that, given the high frequency of lexical ambiguities and the high level of non-determinism of natural language grammars, grammar-based parsers are faced with a number of alternatives, which grows exponentially with the length of the input sentence. Parsing algorithms use various heuristics to limit the number of alternatives and, thus, to ensure that the parsing performance is satisfactory for processing large corpora. Collocations contribute crucially to the disambiguation process, by helping the parser through the maze of alternatives. The identification of collocations is therefore not a burden (an additional task to solve), but a process that helps the parser. Collocations are generally made of highly ambiguous words, and identifying them helps to decide among alternatives. For instance, in break a record, both components are ambiguous if taken in isolation, and it is their combination that helps the parser to select the appropriate categories and readings.
Apart from lexical disambiguation, collocations also contribute to structural disambiguation, as illustrated in Example 4 below for the phrase human resource management. To decide between competing analyses, e.g., one in which the word human is attached to resource and one in which it is attached to management, the parser exploits information from its collocation lexicon. Provided that the entry human resource is found, listed as an adjective-noun pair, the parser favors the first analysis over the second, as it accommodates the structure specified in the lexicon.
To measure the impact of coupling the collocation identification and the parsing tasks, we carried out experiments in which we compared the new version of the parser with the previous version, which does not use collocations for attachment decisions. We assessed the impact of the procedure interconnecting parsing and collocation identification on the performance of both tasks. On a corpus of news articles from The Economist [68] totaling slightly more than 0.5 million words, we obtained an appreciable increase in the coverage of the parser, expressed in terms of the number of completely parsed sentences (83.3% vs. 81.7%), as well as an increase in collocation identification precision (93.7% vs. 81.6%).
The results are in line with previous reports on the impact of incorporating MWEs into parsing systems; the difference lies in the fact that the syntactic flexibility of MWEs is fully taken into account in our approach. Together, these results show the significant role played by these expressions in the performance of language analysis systems.

Exploiting Collocations in Machine Translation
In addition to being useful for NLP applications concerned with language analysis, collocational information derived from corpora is crucial for applications dealing with text production, such as natural language generation and machine translation. Collocations are considered a key factor in producing more acceptable output in these applications [28,69].
In spite of their relatively transparent meaning, collocations pose significant problems from the perspective of language production, since they are "idioms of encoding" [70]. The lexical selection is restricted to the conventionalized form, which is language dependent. Therefore a regular selection and, in the case of machine translation, a literal translation are inappropriate, as they may lead to unnatural, if not awkward, formulations, known as anti-collocations [30] (e.g., *accuse delay, a literal translation of the French accuser retard, 'experience delay').
To illustrate the importance of collocations for machine translation, consider the French combinations grande attention, grande diversité and grande vitesse, in which the adjective grande 'big' modifies the nouns attention 'attention', diversité 'diversity' and vitesse 'speed'. A literal translation will lead to inadequate formulations in English: *big attention, *big diversity, *big speed. The right translations, great attention, wide range and high speed, show the necessity of using collocations in the target language: the same adjective, grande, is translated in three different ways, depending on the noun it modifies.
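The contrast between literal and collocation-aware lexical selection can be sketched with a simple lookup. Both dictionaries below are hypothetical and contain only the examples discussed above plus one invented regular combination; a real system would operate on syntactic structures rather than on word pairs.

```python
# Hypothetical French-to-English collocation lexicon, built from the examples above.
COLLOCATION_TRANSLATIONS = {
    ("grande", "attention"): "great attention",
    ("grande", "diversité"): "wide range",
    ("grande", "vitesse"): "high speed",
}

# Hypothetical word-by-word dictionary, used as a fallback.
WORD_TRANSLATIONS = {
    "grande": "big",
    "attention": "attention",
    "diversité": "diversity",
    "vitesse": "speed",
    "maison": "house",
}

def translate_pair(adjective, noun):
    """Prefer a stored collocation translation; otherwise translate word by word."""
    key = (adjective, noun)
    if key in COLLOCATION_TRANSLATIONS:
        return COLLOCATION_TRANSLATIONS[key]
    return f"{WORD_TRANSLATIONS[adjective]} {WORD_TRANSLATIONS[noun]}"
```

In this sketch, grande vitesse is rendered as high speed through the collocation entry, whereas a regular combination such as grande maison falls back to the literal big house.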
As discussed in Section 3, a considerable amount of work has been devoted to the extraction of translation equivalents from corpora, e.g., [26,45,47,48], and to the representation of collocational knowledge into computational lexica for machine translation and natural language generation [69,71]. However, there are very few reports on the actual use of collocational knowledge in such systems.
One such report refers to the Logos machine translation system, which uses collocations extracted with the method of Orliac and Dillinger [28]. It is argued that context-dependent selection of target lexical items, enabled by collocations, "achieves significant improvement in readability and perceived quality of the translation produced" [28]. Another report, by Liu et al. [72], concerns the integration of collocations into a statistical machine translation (SMT) system. The authors show that their method significantly improves both word alignment and translation quality. In a subsequent experiment, Liu et al. [73] use source language collocations for reordering in SMT, again achieving significant improvements.
More generally, as far as MWEs are concerned, there is additional evidence coming from reports showing that the incorporation of bilingual MWEs into SMT systems leads to an increase in the quality of translation results. For instance, Bai et al. [49] added 1,171 Chinese-English MWEs into their SMT systems and obtained a significant improvement in the Bilingual Evaluation Understudy (BLEU) score. Similarly, Tsvetkov and Wintner [74] reported that by adding 2,955 MWE translation pairs into a Hebrew-to-English SMT system, they obtained a statistically significant improvement of BLEU and Meteor scores. Furthermore, Bouamor et al. [75] used different strategies to integrate into their English-French SMT system bilingual phrases extracted from a 100,000 sentence training corpus, finding an increase in the BLEU and Meteor scores.
It is worth noting that phrase-based SMT systems already incorporate MWE/collocational knowledge as an effect of training their language and translation models on large (parallel) corpora. These systems are successful in dealing with local collocations, but are arguably ill-suited for handling collocations whose components are not in close proximity to one another. As Babych et al. [76] put it, "SMT output is often surprisingly good with respect to short distance collocations, but often (...) correct choices are missed in cases where selectional restrictions take effect on distant words." In the same vein, Bod [77] points out that discontiguous phrases represent a real challenge for SMT systems, and he provides empirical evidence that such phrases contribute significantly to improving the translation accuracy. Indeed, we also found that SMT systems are very sensitive to the syntactic environment of source collocations, as well as to the lexical environment [78]. As can be seen in Example 5, the same source collocation is correctly translated from English into French when found in a given context and incorrectly translated in another context:
Example 5. Collocation translations from English to French.
1. the people who rely on us to give full support when it is needed
   les gens qui comptent sur nous pour apporter un soutien complet quand il est nécessaire
2. and it is certainly right to give massive support to these areas [...]
   et il est certainement droit de *donner un soutien massif à ces domaines.
More recently, Carpuat and Diab [79] provided further evidence on the impact of MWEs on SMT performance. By integrating 500 English multi-words from WordNet (corresponding to about 900 tokens) as "words with spaces" into English-Arabic SMT through segmentation of the training and test sentences, they obtained an increase in performance in terms of BLEU and Translation Edit Rate (TER). An additional strategy consisted of identifying the 500 most frequent n-grams in the phrase table of the system and biasing the system towards using phrases that do not break these n-grams. This strategy had a smaller, but still positive, effect on translation performance in terms of automatic metric scores.
In our own work, we assessed the impact of collocational knowledge on a rule-based translation system, namely, the Its-2 system [54] based on the Fips parser (cf. Section 2). Collocations have been integrated into this system in an indirect manner, by adding them into the underlying parsing system. More precisely, the new parsing strategy integrating collocation identification (as described in Section 4) replaced the old parsing strategy. The evaluation was performed on 200 randomly sampled sentences from Europarl, half in English and half in Italian, which contained verb-object collocations. The sentences were translated into French, and the output was manually evaluated by two judges. For both language pairs, the results showed a statistically significant improvement in collocation translation adequacy when collocational knowledge is integrated in this specific way into the translation system. These findings are in line with the ones previously mentioned in relation to SMT. They confirm the positive impact of collocations in the rule-based machine translation scenario.
Unlike related work, we performed a focused evaluation concerned with the quality of translations proposed for collocations, as opposed to the overall sentence translation quality. We were reluctant to measure the impact on the BLEU score, since this metric is more suited to an overall assessment of the target sentence and the context could easily mask the effect of choosing a wrong translation for collocations. By giving equal weight to the words in a sentence, BLEU underestimates the importance of choosing the right collocate for a base word. Our evaluation strategy corresponds to what was later termed evaluation focused on linguistic checkpoints [80], i.e., evaluation of machine translation performance for specific linguistic phenomena.
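The masking effect can be made concrete with a deliberately simplified computation. The sketch below uses clipped unigram precision only, which is one ingredient of BLEU rather than BLEU itself, and the sentence pair is invented for illustration.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision of a hypothesis against a single reference."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Each hypothesis word counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref[word]) for word, count in hyp.items())
    return matched / sum(hyp.values())

# Invented sentence pair: the hypothesis differs only in the collocate of "attention".
reference = "the committee paid great attention to the report on road safety"
wrong_collocate = "the committee paid big attention to the report on road safety"
```

In this toy case, the wrong collocate affects one token out of eleven, so the word-level score remains above 0.9 even though the collocation itself is mistranslated.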
Further investigation is required in order to check whether the positive result of improved collocation translation is accompanied by a similar improvement in the overall sentence quality. However, given the massive presence of collocations in language and their role in language fluency, we hypothesize that improving the translation of collocations is one of the main ingredients in improving the overall quality of translations.

Conclusions
Multi-word expressions have long been the subject of an important body of NLP work. A lot of progress has been achieved, in particular in developing technologies for acquiring specific types of MWEs, such as collocations, from text corpora. However, this research remains somewhat endogenous: despite the widely recognized importance of such expressions for parsing and translation, few efforts have been devoted to actually integrating the acquired expressions in these applications.
In this paper, we focused on the interaction between multi-word expressions in general, and collocations in particular, on the one hand, and the applications of syntactic parsing and machine translation, on the other hand. We highlighted the existing work on the use of collocational information for improving parsing and translation performance and, vice versa, the work on the use of parsing and translation information for improving corpus-based collocation identification. In addition to giving a panorama of the previous work in these areas, we described our own methods and experiments, which stem from sustained work devoted to collocation processing in a multilingual, syntactically-aware environment. We showed that parsing and translation technologies can contribute significantly to the task of automatic detection of collocations in text corpora. We also focused on the exploitation of collocations in parsing and machine translation and presented experimental results showing the benefits that can be obtained by following a collocation-aware approach in both tasks.
One of the most sensitive issues in relation to MWE/collocation processing is their syntactic flexibility. Our work is specifically focused on this issue and complements existing words-with-spaces approaches, which are easier to implement, but less adequate for modeling MWEs (cf. [1]). We expect a further integration of parsing and translation technologies in the NLP field in the future, as already witnessed by the increasing interest in syntax-based SMT, and we hope that (flexible) MWEs will take a more prominent place in both the parsing and translation fields. We hope that our present findings will contribute to the understanding of the mutual role that collocational knowledge and parsing/translation information play in better processing natural language.