The Relation Dimension in the Identification and Classification of Lexically Restricted Word Co-Occurrences in Text Corpora

The speech of native speakers is full of idiosyncrasies. Especially prominent are lexically restricted binary word co-occurrences of the type high esteem, strong tea, run [an] experiment, war break(s) out, etc. In lexicography, such co-occurrences are referred to as collocations. Due to their semi-decompositional nature, collocations are of high relevance to a large number of natural language processing applications as well as to second language learning. A substantial body of work exists on the automatic recognition of collocations in textual material and, increasingly, also on their semantic classification, even if not yet in mainstream research. In particular, classification with respect to the lexical function (LF) taxonomy, which is the most detailed semantically oriented taxonomy of collocations available to date, has proved to be of real use to human speakers and machines alike. The most recent approaches in the field are based on multilingual neural graph transformer models that use explicit syntactic dependencies. Our goal is to explore whether the extension of such a model by a semantic relation extraction network improves its classification performance, or whether it already learns the corresponding semantic relations from the dependencies and the sentential contexts, such that an additional relation extraction network would not improve the overall performance. The experiments show that the semantic relation extraction layer does improve the overall performance of a graph transformer. However, the improvement is modest, which suggests that graph transformers already learn, to a certain extent, the semantics of the dependencies between the collocation elements.


Introduction
The language of native speakers often contains wordings with varying degrees of semantic decomposition. The most overt are idioms, such as, e.g., it is a piece of cake meaning 'easy' or to be under the weather meaning 'to feel ill', but even more frequent and relevant to the language proficiency of a speaker or machine are lexically restricted binary word co-occurrences, in which one of the two syntactically bound lexical items restricts the selection of the other item. Consider a statement in a CNN sports commentary from 9 July 2022: Having taken an early lead, Rybakina almost gave up her advantage soon after, needing to fend off multiple break points before eventually taking a two-game lead in the set.
This short statement already contains three such co-occurrences: take [a] lead, give up advantage, and fend off [a] break point, with lead, advantage, and break point restricting the selection of take, give up, and fend off, respectively. While for an English native speaker, these co-occurrences may appear to involve no idiosyncrasy, the idiosyncrasy becomes clearly recognizable from a multilingual angle. Thus, in German, instead of the literal nehmen, one would use gehen 'go' to translate take in take [a] lead: in Führung gehen, lit. 'to go into lead'. In Spanish, give up in co-occurrence with ventaja 'advantage' will be translated as ceder 'cede': ceder [una] ventaja, and in French, fend off will be translated in the context of break point as repousser 'repel': repousser [un] point de rupture. In contrast, in the context of walk, take would be translated into German as machen 'make': einen Spaziergang machen, lit. 'make a walk'; in the context of rights, give up would be translated into Spanish as renunciar 'renounce': renunciar a sus derechos, lit. 'renounce to one's rights'; and in the context of competition, fend off would be translated into French as contrer: contrer la concurrence. Note that, at the same time, lead, advantage, break point, walk, rights, and competition will always be translated literally.
Lexically restricted word co-occurrences are of extremely high relevance to second language learners [1][2][3][4][5][6] and to many Natural Language Processing (NLP) applications, including, e.g., natural language generation [7,8], machine translation [9], semantic role labeling [10], and word sense disambiguation [11]. Thus, language learners must memorize them by heart and master even fine-grained semantic differences between them (as Mel'čuk and Wanner [12] show, while a certain semantic feature-based analogy (or, in other words, generalization) in the formation of such co-occurrences is possible, even for co-occurrences with emotion nouns, which are considered to be very homogeneous, major divergences prevail). For instance, a language learner should know the difference between lodge [a] complaint and voice [a] complaint or between receive [a] compensation and have [a] compensation.
Automatic text generation appears more natural if it uses idiosyncratic word co-occurrences (compare, e.g., John walks every morning on the beach vs. John takes a walk on the beach every morning); semantic role labeling needs to capture in take a walk that John is not the recipient of the walk, but rather the actor; and machine translation from English to German needs to translate take in co-occurrence with walk as machen 'make' and not as nehmen 'take'. Language models as commonly used in modern NLP partially capture such co-occurrences, but it has been shown that even if downstream applications use state-of-the-art language models, they benefit from additional information on restricted lexical co-occurrence; see, e.g., the experiments of Maru et al. on Word Sense Disambiguation [11], in which they use explicit lists of semantically labeled lexical co-occurrences. Therefore, it is surprising that so far, the automatic acquisition and semantic labeling of lexically restricted co-occurrences has not been given close attention in mainstream research on distributional semantics. Most of the work has focused on the mere identification of statistically significant lexical co-occurrences in text corpora; see, among others, [7,[13][14][15]. This is insufficient. Firstly, not all statistically significant word co-occurrences are, in fact, lexically restricted, and, secondly, as illustrated above, in order to be of real use, their semantics must also be known.
In a series of experiments, Wanner et al. (see, for instance, [16][17][18][19]) work with precompiled lists of co-occurrences, which they classify with respect to the fine-grained semantic typology of lexical functions (LFs) [20]. In another work [21], they go one step further by identifying instances from precompiled co-occurrence lists in text corpora and then using their sentential contexts as additional information for classification. Most recently, Espinosa-Anke et al. [22] use in their graph transformer-based model, in addition to the sentential context, the syntactic dependencies between the elements of the co-occurrences and thus take into account lexicographic studies [20] that identify syntactic dependency as one of the prominent characteristics of the individual semantic categories of the co-occurrences. However, another crucial feature of lexically restricted co-occurrences has so far remained unconsidered in NLP: between its elements, not only a syntactic but also a semantic dependency holds (although Espinosa-Anke et al. explicitly point out the existence of a semantic dependency between the elements of lexically restricted word co-occurrences, they do not model it in explicit terms: the elements that form the co-occurrence are identified separately using the BIO-tagging strategy [23]).
In view of the continuous significant advances shown by semantic relation extraction techniques ( http://nlpprogress.com/english/relationship_extraction.html, accessed on 15 July 2022), our goal is to explore whether the extension of a graph transformer-based model for the identification and classification of lexically restricted co-occurrences by a state-of-the-art semantic relation extraction network can contribute to an increase of the performance. Our exploration is motivated by the fact that (i) current approaches still struggle with correctly classifying some less frequent categories of lexically restricted word co-occurrences, and (ii) especially semantically similar categories are systematically confused, which is detrimental in particular for such applications as second language learning. Semantic relation techniques are a promising means to remedy this problem.
To this end, we design an LF relation-extraction model that is inspired by [24] and carry out two kinds of experiments with this model. In the first experiment, we tackle a classical classification task to assess the stand-alone performance of the adapted model and thus its suitability to form part of an extended graph transformer-based model. Given a dataset with lexically restricted co-occurrences in their sentential contexts, we classify the co-occurrences with respect to the LF typology. In the second experiment, we use the output of the identification stage of the model of Espinosa-Anke et al. [22] and feed it as input to an LF relation extraction network. Our experiments show that relation extraction indeed helps to increase the quality of lexically restricted word co-occurrence identification and classification. In particular, relation extraction ensures a better distinction between some of the categories of co-occurrences that are notoriously confused by other state-of-the-art techniques. On the other hand, the quality increase is limited, which allows for some conclusions with respect to an implicit representation of semantic relation information by graph transformers.
The contributions of our work can thus be summarized as follows:
• We adapt a generic relation extraction framework to the problem of lexically restricted word co-occurrence classification.
• We show that neural relation extraction techniques, which have not been used for the classification of lexically restricted word co-occurrences so far, are a suitable means to account for the relational nature of such co-occurrences and can compete in their performance with, e.g., the most recent BIO tagging technique. This means that if the goal is to simultaneously identify lexically restricted co-occurrences and classify the semantic relations within them, relation extraction techniques can be successfully used.
• We demonstrate that the neural relation extraction techniques distinguish better between notoriously confused categories of lexically restricted word co-occurrences than the state-of-the-art techniques. This can be of high relevance to applications that focus on these categories.
• Our contrastive analysis of the BIO tagging technique, which receives as input syntactic dependency information only, and the proposed relation extraction technique suggests that the graph transformers used in the BIO tagging technique already capture the category-specific semantics of the lexically restricted word co-occurrences. This outcome contributes to the research on the types of knowledge captured by neural models.
The remainder of the article is structured as follows. The next section (Section 2) contains some background on lexically restricted co-occurrences. In Section 3, we provide an overview of the state of the art of the research on the identification and classification of such co-occurrences. Section 4 introduces the Graph-to-Collocation Transformer model (G2C-Tr) of Espinosa-Anke et al. [22] and its extension by a relation extraction network. Section 5 describes the experiments that have been carried out and their results, which are discussed in Section 6. Section 7, finally, draws the conclusions from our experiments and outlines some lines of relevant future work.

Background on Lexically Restricted Word Co-Occurrences
In lexicology and lexicography, lexically restricted word co-occurrences have been studied under the heading of collocations [25][26][27][28]. The item that restricts the selection of the other item is referred to as the base, and the restricted item is the collocate. Note, however, that the original notion of collocation as introduced by J.R. Firth [29] is broader: it merely implies statistically significant word co-occurrence. In other words, any combination of words that appears together sufficiently often is considered to be a collocation, among them, e.g., doctor-hospital, hospital-pandemic, or pandemic-mask. As can be observed, in these examples, no lexical restriction is imposed by one of the lexical items on the other item, i.e., there is no base and no collocate. Most of the work on automatic word co-occurrence identification is based on this notion of collocation (see Section 3 below). Obviously, this is not to say that the two notions are disjoint. On the contrary, lexically restricted co-occurrences will often (although by far not always) be statistically significant; see, e.g., strong tea, contagious disease, or come [to] power.
In contrast to the mainstream research in NLP, we use the notion of collocation as introduced in lexicology/lexicography. As already mentioned above, this notion implies that between the base and the collocate of a concrete co-occurrence, a specific semantic relation, which is expressed by the collocate, and a specific syntactic dependency hold. Based on this relation and dependency, collocations can be typified. For instance, take a walk, give a lecture, make a proposal belong to the same type: take, give, and make express the same semantic relation with their respective base (namely 'perform' or 'carry out'), and all of them take their respective base as a direct object. The same applies to thunderous applause, heavy storm, high temperature: here, thunderous, heavy, and high all express the relation 'intense'.
The most fine-grained semantically oriented typology of collocations available to date is the typology of lexical functions (LFs) [20]. An LF is defined as a function f(B) that delivers for a base B a set of synonymous collocates that express the meaning of f. LFs are assigned Latin abbreviations as labels; cf., e.g., "Oper1" ("operare" 'perform'): Oper1(walk) = {take, do, have}; "Magn" ("magnum" 'big'/'intense'): Magn(applause) = {thunderous, deafening, loud, . . . }. Each LF can also be considered a specific lexico-semantic relation between the base and the collocate of the collocation in question [30]. Table 1 displays the subset of the relations we experiment with, along with their corresponding LF names and illustrative examples.
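The functional view of LFs sketched above can be illustrated with a few lines of Python (a purely illustrative sketch; the table entries are the examples from the text):

```python
# An LF maps a base B to the set of synonymous collocates f(B) that
# express the meaning of f. The entries below reproduce the Oper1 and
# Magn examples given in the text.

LF_TABLE = {
    ("Oper1", "walk"): {"take", "do", "have"},
    ("Magn", "applause"): {"thunderous", "deafening", "loud"},
}

def lf(name: str, base: str) -> set[str]:
    """Return the set of collocates f(B) for an LF name and a base B."""
    return LF_TABLE.get((name, base), set())
```

Note that the function is defined per base: Magn(applause) yields a non-empty set, while Magn applied to a base absent from the table yields the empty set.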

Related Work: Research on Lexically Restricted Word Co-Occurrences in NLP
With the increasingly prominent objective to understand in depth the behavior of neural language models, the research on restricted word co-occurrences is about to go beyond the core tasks of their identification and classification. Thus, there has been some recent tentative research on the use of lexical co-occurrences as probes in the context of the exploration of the representation of non-compositional meaning in neural language models [32,33]. Still, the two core tasks, i.e., (i) identification of lexical co-occurrences in text corpora, and (ii) semantic classification of available co-occurrence instances (for instance, in terms of the LF or any other available typology), continue to be active research topics. Most often, one of these two tasks is addressed. However, more recently, both tasks have also been tackled, either in sequence or together, by one model. In what follows, we review some representative works for each of these constellations.

Identification of Lexical Co-Occurrences
As already mentioned in the Introduction and in Section 2, most of the work on lexical co-occurrences focuses on the identification of statistically significant word co-occurrences in text corpora. The most straightforward of these works operate with the co-occurrence frequency of n-grams as an association measure [34,35]. The majority uses statistical measures such as, e.g., (Pointwise) Mutual Information (PMI) [13] or its variants [14,36], likelihood-ratio [37], or t-score [38]. For detailed contrastive reviews of different association measures for restricted word co-occurrence identification, see, among others, [15,[39][40][41]. In some cases, the statistical measures are complemented by morphological [42,43] and/or syntactic [44][45][46] patterns that are characteristic of restricted word co-occurrences. Several authors focus on one syntactic pattern only, such as, e.g., Breidt [38], who targets verb-noun co-occurrences in German, or even on a specific profile of a single syntactic pattern. In this context, in particular, the identification of support (or light) verb constructions (SVCs/LVCs) has been a prominent topic of research. SVCs are captured by the Oper- (and partially by the Real-) families of LFs; cf., e.g., [32,[47][48][49][50][51].

Classification of Precompiled Lists of Lexical Co-Occurrences
The lists of co-occurrences identified in text corpora or retrieved from collocation dictionaries, which use a classification schema that is either too broad or too heterogeneous for use in NLP and/or SLL (such as, e.g., the Oxford Collocations Dictionary or the MacMillan Collocations Dictionary) or different in its nature (such as, e.g., the REDES dictionary of Spanish (REDES presents restricted word co-occurrences in the entries for collocates rather than bases)), can serve as input to classification algorithms. The most common classification schema has been the LF typology or its generalization, although some other more coarse-grained schemata have also been applied; see, e.g., [21,52].
To the best of our knowledge, Wanner [16] was the first to propose the classification of lexically restricted co-occurrences, i.e., collocations in the sense used in this paper, with respect to the LF typology on a limited number of precompiled instances of the nine most common LFs in Spanish. The experiments involved runs on collocations from the field of emotions and runs on collocations from a variety of other semantic fields. The idea was that the semantics of an LF can be captured by the semantic features of the collocates and bases of a representative selection of instances of this LF, such that when these features are combined, a prototypical representation (or centroid) of this LF is obtained. The semantic profile of a new binary word co-occurrence can then be matched against the morphosyntactic pattern of each LF and the semantic profile of its centroid. The LF with the best match is chosen as the LF to be assigned to the candidate co-occurrence. Features assigned to a lexical item in EuroWordNet [53] (i.e., the members of its synset and its base and top concepts) serve as semantic features of the profile of this item. The quality (precision, recall and F1-score) obtained during the emotion field experiments was high (the lowest F1-score was 0.78 for IncepFunc1; for ContOper1 and FinFunc0, an F1-score of 1.0 was achieved); for the cross-field runs, the figures were lower but still reasonable (the lowest F1-score was 0.58 for Real2). In [17,18], the same lists of LF instances and the same EuroWordNet-based semantic representation of LF instances have been used to test three standard Machine Learning techniques: Nearest Neighbors (NN), Naïve Bayes (NB), and Tree-Augmented Network (TAN), with a comparable resulting performance. 
Gelbukh and Kolesnikova [54] carried out similar experiments with a broad range of traditional ML techniques and a somewhat more restricted semantic representation in terms of the hypernyms of the verbal and noun elements of collocations.

Joint Identification and Classification of Lexical Co-Occurrences
The identification of one specific type of lexical co-occurrence, such as, e.g., the SVCs mentioned in Section 3.1, or preposition-verb constructions [55,56], can be considered, in a sense, as joint identification and classification. Examples of the simultaneous identification and semantic classification of lexical co-occurrences in terms of a more varied typology (such as the LF typology) are [22,57,58].

The Relation Dimension in Collocation Classification
As repeatedly pointed out above, between the lexical items of a collocation, a syntactic dependency and a semantic dependency hold. Espinosa-Anke et al. [22] exploit the syntactic dependency to identify and classify collocations in corpora using a graph transformer model that identifies and classifies the collocation elements separately in a sequence-tagging manner for English, French, and Spanish. The assessment of the performance (precision, recall, and F1-score) figures and the confusion matrices provided in [22] suggest that the performance ceiling has not been reached yet. Especially, the confusion between instances of common LFs such as Oper1, Real1, and Real2 calls for further research. One option that needs exploration is whether an explicit consideration of the semantic dependency between the base and collocate elements would contribute to an increase of the classification performance.
For this purpose, we adapt the relation extraction-driven model outlined in [24] for collocation classification on top of the predictions made by the model of Espinosa-Anke et al. (=base model), such that it accounts even for slight differences in meaning and is thus able to better distinguish between semantically similar LFs. Since the base model provides the position of the extracted collocation parts in the considered sequence of tokens, its outcome can be used as input for advanced relation classification models that require the explicit position of the entities participating in the relation as a signal for classification. Figure 1 sketches the overall architecture of our model.
In what follows, we first introduce Espinosa-Anke's original Graph-to-Collocation Transformer (G2C-Tr) model that serves as a collocation candidate provider (the introduction mirrors the description in [22]) and then discuss the architecture of the relation extraction-driven classification model in more detail.

G2C Transformer Model
G2C-Tr is a suite of BERT-based multitask models for the joint binary classification of a sentence with respect to the occurrence of any LF-instances in it and the LF-instance BIO sequence tagging in this sentence. The task of sentence classification has been added to create a multitask setup because multitasking proved to improve the performance of the neural models for each of the tasks involved [59].
The upper part of Figure 1 illustrates the model. Given the input sentence W = (w_1, w_2, . . . , w_N), a pre-trained Universal Dependency (UD) parser DP() is used to obtain the dependency graph G and the Part-of-Speech (PoS) tags P = (p_1, p_2, . . . , p_N), which serve as input to G2C-Tr. G2C-Tr predicts the BIO-tagged sentence Y = (y_1, y_2, . . . , y_N) as follows:

Y = Dec(Enc(W, G, P)), with (G, P) = DP(W).

Enc() computes the contextualized vector embeddings H as the sum of pre-trained token embeddings of BERT, position embeddings, and PoS tag embeddings, using a modified version of the Transformer attention mechanism to inject the syntactic dependency information. In each Transformer layer, given Z_n = (z_1, z_2, . . . , z_T) as the output representations of the previous layer, the attention weights are calculated as a Softmax over the attention scores α_ij, which are defined as:

α_ij = (z_i W_Q)(z_j W_K + r_ij W_RA)^T / √d,

where W_Q, W_K ∈ R^{d_h×d} are learned query and key parameters, and W_RA ∈ R^{(2|G|+1)×d} is the graph relation embedding matrix, learned during training; d_h is the dimension of the hidden vectors, d is the head dimension of the self-attention module, and |G| is the overall number of dependency labels. r_ij is the one-hot vector representing both the label and the direction of the syntactic relation between the tokens x_i and x_j, such that r_ij W_RA selects the embedding vector of the appropriate syntactic relation.
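The dependency-aware attention score can be sketched numerically as follows. This is a toy NumPy illustration under our reading of the mechanism, not the authors' implementation; all dimensions, names, and the random relation assignments are invented:

```python
import numpy as np

# Toy sketch of dependency-aware attention: each key is shifted by a
# learned embedding of the (labeled, directed) syntactic relation between
# the two tokens, i.e. score_ij = (z_i W_Q) . (z_j W_K + r_ij W_RA) / sqrt(d).

rng = np.random.default_rng(0)
T, d_h, d, n_rel = 4, 8, 8, 5              # tokens, hidden dim, head dim, 2|G|+1

Z = rng.normal(size=(T, d_h))              # previous-layer representations
W_Q = rng.normal(size=(d_h, d))            # learned query parameters
W_K = rng.normal(size=(d_h, d))            # learned key parameters
W_RA = rng.normal(size=(n_rel, d))         # graph relation embedding matrix
rel_id = rng.integers(0, n_rel, size=(T, T))  # relation index per token pair

def attention_weights(Z, W_Q, W_K, W_RA, rel_id, d):
    Q, K = Z @ W_Q, Z @ W_K
    R = W_RA[rel_id]                       # (T, T, d): r_ij W_RA via table lookup
    scores = np.einsum("id,ijd->ij", Q, K[None, :, :] + R) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)  # row-wise softmax

A = attention_weights(Z, W_Q, W_K, W_RA, rel_id, d)
```

Because r_ij is one-hot, the product r_ij W_RA reduces to a table lookup, which is what the indexing `W_RA[rel_id]` performs here.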
To obtain the output representations (H), Vaswani et al.'s [60] original mechanism for the position-wise feed-forward layer and layer normalization is used.
Dec() calculates the sentence classification output as

ŷ_i = Softmax(W_S h_1),

with i as the index of the sentence that is to be classified, h_1 as the hidden state of the first pooled special token (CLS in the case of BERT), and W_S as a learned classification matrix. For sequence tagging, this equation is extended such that the sequence [h_2, . . . , h_T] is fed to word-level softmax layers:

ŷ_n = Softmax(W_T h_n),

where h_n is the hidden state corresponding to w_n and W_T is a learned tagging matrix. Finally, the joint model combines both architectures and is fine-tuned, end-to-end, by minimizing the sum of the cross-entropy losses of the two tasks:

L = L_sentence + L_tagging.
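The two decoder heads and the joint objective can be sketched numerically as follows. This is a toy NumPy illustration with invented dimensions and toy gold labels, not the authors' code:

```python
import numpy as np

# Sketch of the two decoder heads: a sentence-level softmax over the pooled
# CLS state h_1, token-level softmaxes over h_2..h_T, and a joint
# cross-entropy objective summing the losses of the two tasks.

rng = np.random.default_rng(1)
T, d_h, n_sent, n_tag = 5, 8, 2, 7     # tokens, hidden dim, label set sizes

H = rng.normal(size=(T, d_h))          # encoder output; H[0] = CLS state
W_s = rng.normal(size=(d_h, n_sent))   # sentence-classification head
W_t = rng.normal(size=(d_h, n_tag))    # BIO sequence-tagging head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p_sent = softmax(H[0] @ W_s)           # does the sentence contain an LF instance?
p_tag = softmax(H[1:] @ W_t)           # one BIO-tag distribution per word token

y_sent, y_tag = 1, np.array([0, 2, 3, 0])   # toy gold labels
loss = -np.log(p_sent[y_sent]) - np.log(p_tag[np.arange(T - 1), y_tag]).sum()
```

In training, `loss` would be minimized end-to-end over both heads at once, which is the multitask setup described above.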

Semantic Relation-Extraction Driven Collocation Classification
Standard approaches to supervised relation extraction that rely on deep neural networks usually encode the input sequence in terms of an attention layer to predict the type of relation between a specified pair of entities [61,62]. It is also common to introduce either mention pooling to perform classification only over encoded representations of entities or positional embeddings that indicate the tokens of the entities between which a relation holds in the input sequence to affect encoding itself [63][64][65]. In [24], the authors showed that special entity markers (functional tokens placed before and after the tokens of an entity) introduced into the token sequence instead of the traditional extra token type embedding layer lead to more accurate results. We adapt this model to our problem; cf. Figure 2.
The input to the model is a relation statement r = (x, s_1, s_2) that contains a sequence of tokens x and the entity span identifiers s_1 = (i, j) and s_2 = (k, l). Before the sequence is introduced into the encoder, x is augmented by four reserved word pieces, [E1], [/E1], [E2], and [/E2], which mark the beginning and end of each entity mention in the relation statement:

x̃ = [x_0, . . . , [E1], x_i, . . . , x_j, [/E1], . . . , [E2], x_k, . . . , x_l, [/E2], . . . , x_n].

In addition, the entity indices are updated to account for the inserted tokens: s̃_1 = (i + 1, j + 1) and s̃_2 = (k + 3, l + 3). Given the last hidden layer of the transformer network defined as H = [h_0, . . . , h_n] for n = |x|, the concatenation of the final hidden states corresponding to the respective start entity markers is used to represent the relation in the encoder outcome: r_h = ⟨h_i | h_{k+2}⟩. It is worth noting that the authors of the model also tried two other items of the encoder outcome for classification, namely, the CLS token and the entity mention states, but concluded that the states corresponding to the start entity markers lead to the best scores. This representation is fed into a fully connected layer that either contains a linear activation or performs layer normalization [66]. The choice of this post-Transformer layer is treated as a hyper-parameter. Furthermore, a classification layer W ∈ R^{K×H} is introduced, where H is the size of the relation representation and K is the number of relation types. The classification loss is the standard cross-entropy of the softmax of r_h W^T with respect to the true relation type.
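The marker insertion and the index update can be sketched as follows. This is a minimal illustration; the function name and the token-level (rather than word-piece) granularity are our assumptions:

```python
# Sketch of the entity-marker augmentation: [E1]/[/E1] wrap the first
# entity span, [E2]/[/E2] the second, and the span indices are shifted by
# the number of markers inserted before them: s1 -> (i+1, j+1),
# s2 -> (k+3, l+3).

def add_entity_markers(tokens, s1, s2):
    (i, j), (k, l) = s1, s2  # inclusive token spans; s1 assumed before s2
    marked = (tokens[:i] + ["[E1]"] + tokens[i:j + 1] + ["[/E1]"]
              + tokens[j + 1:k] + ["[E2]"] + tokens[k:l + 1] + ["[/E2]"]
              + tokens[l + 1:])
    return marked, (i + 1, j + 1), (k + 3, l + 3)

tokens = ["John", "takes", "a", "walk", "every", "morning"]
x, s1, s2 = add_entity_markers(tokens, (1, 1), (3, 3))
# the start markers [E1] and [E2] end up at positions i and k + 2; the
# hidden states at these positions are concatenated to represent the relation
start_positions = (s1[0] - 1, s2[0] - 1)
```

Running this on the example sentence yields John [E1] takes [/E1] a [E2] walk [/E2] every morning, with the start markers at positions 1 and 5, i.e., at i and k + 2 as in the text.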

Experiments
Extending the original G2C-Tr model by the dedicated relation-based collocation classification model described above, we obtain the reinforced model architecture depicted in Figure 1, with two transformers dedicated specifically to collocation extraction and collocation classification, respectively. In the follow-up experiments, we assess to what extent this extension improves the performance of the overall model.
In this section, we present the datasets that we used, the setup of the experiments, including the details of the training and the combination of the pre-trained models selected for each transformer block in Figure 1, and the results of the experiments.

Datasets
In order to be able to compare the performance of the extended model that we propose with the performance of the original G2C-Tr model [22], we use the same English, French, and Spanish datasets compiled from the 2019 Wikipedia dumps using the lists of LF instances of Fisas et al. [67] as seed lists. The dumps are preprocessed (removing metadata and markup) and parsed with UDPipe 2.5 (https://ufal.mff.cuni.cz/udpipe, accessed on 1 July 2022). From the parsed dumps, for each LF instance encountered in the lists of LF instances, sentences that contain this instance (with one of its valid dependency patterns) are extracted. Only those sentences are selected in which the lemmas of the base and collocate elements have the same PoS as specified in the list of LF instances compiled in [67]. In order to further minimize the number of the remaining erroneous samples in which the base and the collocate items do not form a collocation (as, e.g., in Conceding defeat, Cavaco Silva said he wished his rival "much success in meeting his duties for the good of all Portuguese", where between success and meet an indirect dependency relation holds, and the two words meet and success form, in principle, a collocation, but not in this sentence), an additional manual validation was performed. For each LF and each syntactic dependency pattern between the base and the collocate elements of this LF, three sentences from the preliminary dataset were randomly picked. In case the base and the collocate elements did not form an instance of this LF, all sentences with the considered dependency pattern between the base and collocate elements were removed from the dataset (for this purpose, an expanded list of expected syntactic dependencies between the base and the collocate elements is used, namely, 'acl', 'acl:relcl', 'advcl', 'advmod', 'amod', 'case', 'compound', 'conj', 'csubj', 'nmod', 'nsubj', 'nsubj:pass', 'obj', 'obl', 'obl:npmod', and 'xcomp').
This allowed us to take into account mistakes of the parser while not excluding correct examples with wrongly assigned dependencies. Table 2 displays the counts of the individual LF instances in the obtained English, French, and Spanish corpora. The corpora are annotated with sentence-level LF labels and, following the BIO sequence annotation schema, with BI labels for both elements of each instance, the base and the collocate ('B-<LF>b' and 'I-<LF>b' for the base, 'B-<LF>c' and 'I-<LF>c' for the collocate, and 'O' for other tokens); see Figure 1 for an illustration. The BIO annotation has the advantage that it facilitates a convenient labeling of multi-word elements, and the separate annotation of the base and collocate elements allows for a flawless annotation of cases in which the base and the collocate are not adjacent. As the sentence label, the most frequent LF in the sentence is chosen, and the first one in case of a draw.
For the training of the relation classification transformer, we created an additional dataset, in which we introduced entity markers into the input sequences (e_1 for the base and e_2 for the collocate), such that the <LF> tag indicates the default order of the base and the collocate (e.g., "Oper1(e_2, e_1)" is a verb-object construction in which the verbal collocate precedes the base). In case a sentence contains several collocations, we use each of them for an individual training example by copying the sentence and introducing entity markers for a single collocation at a time. Thus, the number of examples is equal to the number of annotated collocations in our corpora. We did not introduce negative examples of any special "not-a-collocation" class in order to ensure that the extension transformer cannot "cancel" a collocation extracted by the base G2C-Tr model but can only refine the lexical function assignment.
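The expansion of multi-collocation sentences into per-collocation training examples can be sketched as follows (function, variable, and marker names are our assumptions, not the paper's code):

```python
# A sentence containing several collocations is copied once per collocation,
# and entity markers are introduced for a single collocation at a time, so
# the number of training examples equals the number of annotated collocations.

def make_examples(tokens, collocations):
    """collocations: list of (lf_label, base_span, collocate_span),
    with spans as inclusive (start, end) token indices."""
    examples = []
    for lf_label, base_span, coll_span in collocations:
        marked = list(tokens)
        # insert markers into the rightmost span first, so that the
        # indices of the earlier span remain valid
        for (i, j), name in sorted([(base_span, "E1"), (coll_span, "E2")],
                                   key=lambda s: s[0][0], reverse=True):
            marked[j + 1:j + 1] = [f"[/{name}]"]
            marked[i:i] = [f"[{name}]"]
        examples.append((lf_label, marked))
    return examples

sent = ["She", "takes", "a", "walk", "and", "gives", "a", "lecture"]
collocs = [("Oper1", (3, 3), (1, 1)), ("Oper1", (7, 7), (5, 5))]
examples = make_examples(sent, collocs)
```

Here the toy sentence contains two Oper1 instances (take [a] walk, give [a] lecture) and therefore yields two training examples, each with markers around exactly one collocation.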

Setup of the Experiments
For the experiments, the obtained datasets were split into training, development, and test subsets in an 80-10-10 proportion in terms of unique LF instances, such that all occurrence samples of a single LF instance appear in only one of the subsets. Sentences with several collocations that belonged to different splits were dropped. Since collocations have different frequencies in the corpus, not every split leads to the same proportion in terms of the overall number of samples. Therefore, for each LF, we additionally distributed the collocations so as to ensure an approximate 80-10-10 split not only in terms of the number of LF instances in general but also in terms of the number of instances per LF.
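An instance-wise split of this kind can be sketched as follows. This is a simplified illustration (it omits the dropping of cross-split multi-collocation sentences, and all names are our assumptions):

```python
import random

# 80-10-10 split per LF in terms of unique LF instances: all sentence
# samples of a given instance (e.g., every sentence containing
# Oper1(walk)=take) land in the same subset, and each LF is itself
# divided approximately 80-10-10.

def split_instances(samples, seed=0):
    """samples: list of (lf, instance, sentence) tuples."""
    by_lf = {}
    for lf, inst, sent in samples:
        by_lf.setdefault(lf, {}).setdefault(inst, []).append(sent)
    train, dev, test = [], [], []
    rng = random.Random(seed)
    for lf, instances in by_lf.items():
        keys = sorted(instances)
        rng.shuffle(keys)
        n = len(keys)
        cut1, cut2 = int(0.8 * n), int(0.9 * n)
        for idx, key in enumerate(keys):
            bucket = train if idx < cut1 else dev if idx < cut2 else test
            bucket.extend((lf, key, s) for s in instances[key])
    return train, dev, test
```

Splitting on instance keys rather than on individual sentences is what guarantees that no LF instance leaks across the training, development, and test subsets.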
In order to be able to clearly distinguish the contribution of our extension to the final figures, we used for the experiments the same versions of the Transformers for the G2C-Tr model as in [22]: BERT-large (https://huggingface.co/bert-large-uncased, accessed on 1 July 2022) for English and XLM-RoBERTa-base (https://huggingface.co/xlm-roberta-base, accessed on 1 July 2022) for Spanish and French, and considered models trained both with and without information about PoS tags.
As for the relation classification, we used RoBERTa-large for English (https://huggingface.co/roberta-large, accessed on 1 July 2022), XLM-RoBERTa-large for Spanish (https://huggingface.co/xlm-roberta-large, accessed on 1 July 2022), and the monolingual CamemBERT-large for French (https://huggingface.co/camembert/camembert-large, accessed on 1 July 2022). About 20% of the development set was used for the evaluation of intermediate checkpoints during the training phase, and the three best checkpoints were evaluated on the entire development set in order to select the model to be used for the G2C-Tr extension. The batch size was 16; the models were trained for 10 epochs (about 169,000 steps for English, 90,000 steps for Spanish, and 83,000 steps for French). Since, in contrast to the G2C-Tr base model, which may predict a base without a collocate and vice versa, our model extracts only complete collocations, we also removed unpaired B-I tags from the output of the base model and re-evaluated it to make the scores of the two models comparable.
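The removal of unpaired tags from the base model's output can be sketched as below; the function name and tag spellings are illustrative, assuming the 'B-<LF>b'/'B-<LF>c' convention introduced earlier:

```python
def drop_unpaired(labels):
    """Replace base tags without a matching collocate tag (and vice versa)
    with 'O', so that only complete collocations remain."""
    lfs_b = {lab[2:-1] for lab in labels if lab.startswith("B-") and lab.endswith("b")}
    lfs_c = {lab[2:-1] for lab in labels if lab.startswith("B-") and lab.endswith("c")}
    complete = lfs_b & lfs_c  # LFs for which both elements were predicted
    return ["O" if lab != "O" and lab[2:-1] not in complete else lab
            for lab in labels]

# A predicted base (Oper1) and collocate (Magn) without their counterparts
# are both discarded:
print(drop_unpaired(["O", "B-Magnc", "O", "B-Oper1b", "O"]))
# -> ['O', 'O', 'O', 'O', 'O']
```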

Results of the Experiments
In what follows, we first present the performance of the relation classification model with respect to the individual LFs and then show the results of the evaluation of the entire model, i.e., the extended G2C-Tr model. Tables 3-5 provide precision (P), recall (R), and F1-scores (F1) achieved on the training, development, and test set, respectively, per LF for each considered language (with I as the set of correctly identified instances of an LF f, A as the set of all instances identified as instances of f, and B as the set of all instances of f in the test set, precision is defined as P = |I|/|A|, recall as R = |I|/|B|, and F1 as the harmonic mean of P and R: F1 = 2PR/(P + R)). Within each LF, examples were split into two groups depending on the order of the collocation parts in a sentence, i.e., when the collocate precedes the base (e 2 , e 1 ) and when the base precedes the collocate (e 1 , e 2 ). The columns "#" show the number of examples in each group, with the total size of a set at the bottom. Table 6 reports the average F1-scores of (i) the original G2C-Tr model, i.e., for the separate identification of the base and collocate elements across all LFs, without the requirement that both elements be identified; (ii) the original G2C-Tr model, but limited to cases when both elements are identified; (iii) the G2C-Tr model extended by the relation extraction layer. We provide the results for these three constellations because the first was used in [22] to report on the performance of the G2C-Tr model, while the second and third allow for a direct comparison between the original G2C-Tr model and the relation extraction-extended model. Configurations with access to PoS embeddings ("G2C+PoS") and without PoS embeddings ("G2C-PoS") are assessed. Figures 3-5 display the corresponding confusion matrices for the test set.
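The evaluation measures defined above can be computed straightforwardly over sets of identified instances; a minimal sketch:

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over sets of identified LF instances:
    P = |I|/|A|, R = |I|/|B|, F1 = 2PR/(P + R), with I = predicted & gold,
    A = predicted, B = gold."""
    correct = predicted & gold
    p = len(correct) / len(predicted) if predicted else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf1({"a", "b", "c", "d"}, {"a", "b", "e"})
# p = 2/4 = 0.5, r = 2/3, f1 = 4/7
```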

Discussion
Let us first discuss the performance of our relation extraction-based collocation classification model and then assess to what extent it contributes to the improvement of the basic G2C-Tr model.

Discussion of Relation Extraction-Based Collocation Classification
Already a first cursory glance at Tables 4 and 5 provides some interesting details on the grammatical constructions associated by our model with the instances of the individual LFs and thus also encountered in the corpora, sometimes in contrast to the expectations motivated by the canonical word order. For instance, the canonical word order in English and Spanish in adjective-noun LF instances is (e 2 , e 1 ), as, e.g., AntiMagn: low (=e 2 ) temperature (=e 1 ), Magn: high (=e 2 ) temperature (=e 1 ), Ver: comfort (=e 2 ) temperature (=e 1 ), and AntiVer: freezing (=e 2 ) temperature (=e 1 ). The model seems to learn both the canonical and the non-canonical orders rather well. It is to be noted, however, that for Magn in the English test set, the (e 1 , e 2 ) pattern (as, e.g., in The temperatures were high) is recognized considerably better. For Ver, the same tendency can be identified, although far less clearly. For French, we see that in the case of AntiMagn, AntiVer, and Magn, the adjectives more often precede the noun, while for Ver, they follow the noun (most of the values are very close to 1.00 because we chose a checkpoint of the model generated at the moment when the model started to overfit and the scores on the validation set stopped increasing). Especially for English, the canonical order of the collocation elements is also not reflected in the performance of the recognition of several verb-noun LFs. Thus, Oper1 and Real2 instances, which canonically instantiate the pattern (e 2 , e 1 ), show in Table 5 a better performance for (e 1 , e 2 ), and FinFunc0, which canonically instantiates the pattern (e 1 , e 2 ), shows a better performance for (e 2 , e 1 ). In Spanish and French, this "deviance" is observed to a much lesser degree. Since for English it is also much less pronounced on the development data (Table 4), we may attribute it to an insufficient number of distinct LF instances in our training data.
Tables 4 and 5 also show a similar overall performance for English, French and Spanish, although the differences across the individual LFs are significant, ranging from an F1-score between 0.95 and 1.0 for the (e 2 , e 1 ) pattern of FinFOper1 instances in English, French, and Spanish on both the development and test data to an F1-score between 0.01 and 0.95 for the pattern (e 1 , e 2 ) of Real1. However, in general, we can conclude that our relation extraction-based model is able to capture various syntactic patterns of the LF instances.
We cannot compare the performance of our model with the performance of other semantic classification models known from the literature (see Section 3) since they were trained and tested on other data; however, the confusion matrices in Figures 3-5 suggest that our model does not confuse Magn with AntiMagn and only moderately AntiMagn with Magn LF instances and thus overcomes, like [22], the notorious problem of automatic distinction of antonyms. Furthermore, it practically does not confuse Magn and Ver instances, which is a challenge still experienced in [22].
Some semantically very similar LFs may still be confused. For instance, for English, CausFunc0 and CausFact0 can be confused with Real1 instances, and Real1 instances can be confused with Real2 instances. In general, the confusion matrices tell us that the model behaves similarly across languages, even if some language-specific confusion glitches can be observed; see also the analysis in [22] in this respect.

Does Relation Extraction Improve the Overall Performance of Collocation Classification?
The good performance of our relation extraction-driven collocation classification model, as presented in Section 5.3 and discussed above, justifies its use as an extension of the G2C-Tr model in order to assess to what extent an explicit consideration of relation information helps in collocation recognition and classification. For this assessment, let us look at Table 6, which contrasts the results obtained with the basic G2C-Tr model with the performance of the G2C-Tr model extended by a relation extraction network. We can observe that the extension of the G2C-Tr model by a relation extraction network improves the classification performance of the original model when both elements of an LF instance have been identified and when PoS information is taken into account. When PoS information is not considered, the performance of the original and the extended models is nearly the same for French and Spanish, while for English, the original model is better. Interestingly enough, for English, PoS information does not contribute to the performance but rather degrades it; a deeper linguistic analysis is needed to determine why this is the case.
Overall, we can conclude that for applications such as second language learning, for which a correct distinction between instances of semantically similar LFs is crucial (cf., e.g., Magn(voice) = loud vs. Ver(voice) = clear or IncepOper1(war) = launch vs. Oper1(war) = wage), our extended model brings an advantage compared to the original G2C-Tr model. From the perspective of the research on neural collocation classification, we can state that the introduction of an additional explicit relation extraction network into a graph transformer-based collocation identification and classification model does not lead to a considerably higher classification performance. This confirms that when fed with syntactic dependencies between collocation elements in the training material, a graph transformer is also able to learn to a major extent the semantic relations between them, such that an additional explicit semantic relation extraction layer is of limited use only.

Conclusions and Future Work
Given that between the elements of a collocation, i.e., a lexically restricted word co-occurrence, both a syntactic and a semantic dependency hold, we explored to what extent the addition of an explicit semantic relation extraction layer to a graph transformer model, which operates on syntactic dependencies, improves the overall performance of the identification and classification of collocations with respect to the LF taxonomy. Our experiments have shown that the transformer is already able to capture the semantic relations and that although the additional relation extraction layer helps to somewhat improve the performance, this improvement is limited. It is also to be noted that the semantic relation extraction layer still does not fully solve the problem of the confusion of instances of syntactically and/or semantically similar LFs, although an improvement compared to the original G2C-Tr model can be observed.
Along with these valuable newly gained insights, the presented work reveals some limitations. In particular to be mentioned is the fact that so far, the relation extraction model has been applied only to the task of classification of lexically restricted word co-occurrences.
In our future work, we want to explore the potential of a relation extraction-based model for the joint identification and classification of LF instances. Furthermore, in our study, we did not pursue the question of how the exploitation of information on syntactic sentence structures would influence the quality of the classification of LF instances with the canonical vs. non-canonical order of their elements. Finally, a more thorough linguistic analysis of the confusion matrices would certainly contribute to a better understanding and thus also to a solution of the classification of LF instances. It should also be clear to the reader that the number of different instances of certain LFs that are available so far for the training and fine-tuning of ML models is very limited. This restricts the potential of the currently explored models (including the one presented in this work). To advance significantly, either novel models must be researched that are able to learn (as humans do) from only a few training examples, or substantially larger datasets need to be compiled.

Acknowledgments: Many thanks to Luis Espinosa-Anke, Alireza Mohammadshahi, and James Henderson for the collaboration on the G2C-Tr model and to Beatriz Fisas, Alba Táboas and Inmaculada López for their contributions to the original LF instance lists.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

BERT          Bidirectional Encoder Representations from Transformers
BIO           Begin-In-Out (tagging strategy)
CLS (token)   Special classification token (used in the BERT model)
F1 (score)    Harmonic mean between precision and recall
Gr2C          Graph-to-Collocation
Gr2C-Tr