1. Introduction
Natural language processing (NLP) is a significant domain of artificial intelligence, with applications ranging from language translation to text classification and information retrieval. NLP allows computers to interpret and process human language, enabling them to perform tasks such as understanding and answering questions, summarizing texts, and detecting sentiment. However, some linguistic phenomena can prevent machines (and sometimes even humans) from understanding language correctly. One such phenomenon is the multiword expression (MWE): a group of words that functions as a unit and conveys a meaning that is not the sum of the meanings of its component words (i.e., the expression lacks compositionality). Examples of MWEs include idioms (e.g., “break a leg”, used to wish someone good luck), collocations (e.g., “take an exam”), and compounds (e.g., “ice cream”), with different authors assuming a broader or narrower definition of the term. The number of MWEs in a language is relatively high. The authors of [1] synthesized papers reporting the number or proportion of MWEs in different languages: English, with an almost equal number of MWEs and single words; French, with 3.3 times more MWE adverbs than single-word adverbs and 1.7 times more MWE verbs than single-word verbs; and Japanese, in which 44% of verbs are MWEs. Despite being so numerous in the dictionary, MWEs occur with low frequency in corpora [2].
Identifying and processing MWEs is crucial for various NLP tasks [3]. In machine translation, for instance, the correct translation of an MWE often depends on the specific context in which it appears. If an MWE is translated literally rather than appropriately localized for the target language, the resulting translation may be difficult for native speakers to understand or may convey the wrong meaning [4]. In text classification, MWEs are considered essential clues to the sentiment or topic of a text [5]. Additionally, MWEs can help disambiguate the meaning of a query, improving the accuracy of search engines in information retrieval [6].
Notable recent progress in the field has been made by the PARSEME community [7], which evolved from the COST action of the same name dedicated to parsing and MWEs (https://typo.uni-konstanz.de/parseme/ last accessed on 21 April 2023). Their activity has two significant outcomes: (i) a multilingual corpus annotated for verbal MWEs (VMWEs) in 26 languages by more than 160 native annotators, with three versions released so far (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2282, https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2842, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3367 last accessed on 21 April 2023) [8,9,10]; and (ii) a series of shared tasks (also three editions so far) dedicated to the automatic and semi-supervised identification of VMWEs in texts [11,12,13], in which these corpora were used for training and testing the participating systems.
Developing systems that can handle multiple languages is another important area of NLP. The ability to accurately process and analyze text in various languages is becoming increasingly valuable as the world grows more globalized and interconnected. For example, multilingual NLP systems can improve machine translation, allowing computers to translate text accurately from one language to another. This is particularly useful when there is a need to communicate with speakers of different languages, such as in global business or international relations. Beyond its practical applications, multilingual NLP is also an important area of study from a theoretical perspective, as research in this field can shed light on the underlying principles of language processing and how these principles differ across languages [14,15].
Multilingual Transformer models have become a popular choice for multilingual NLP tasks due to their ability to handle multiple languages and achieve strong performance on a wide range of tasks. Based on the Transformer architecture [16], these models are pre-trained on large amounts of multilingual data and can be fine-tuned for specific NLP tasks, such as language translation or text classification. Influential models in this area include multilingual bidirectional encoder representations from transformers (mBERT) [17], the cross-lingual language model (XLM) [18], XLM-RoBERTa (XLM-R) [19], and multilingual bidirectional and auto-regressive transformers (mBART) [20]. One of the essential benefits of multilingual Transformer models is their ability to transfer knowledge between languages: they learn shared representations of different languages, allowing them to perform well on tasks in languages they have not been specifically trained on. Thus, multilingual Transformer models are a good choice for NLP tasks that involve multiple languages, such as machine translation or cross-lingual information retrieval [21].
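To make the fine-tuning step concrete, the snippet below is a minimal sketch (using the Hugging Face transformers library, with an illustrative three-label tagging scheme rather than any label set from this work) of loading mBERT with a token-classification head, the standard way a pre-trained multilingual encoder is adapted to a labeling task such as MWE identification:

```python
# Minimal sketch: adapting a pre-trained multilingual encoder to a token-labeling
# task. The value num_labels=3 (e.g., a B/I/O-style scheme) is illustrative only.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-multilingual-cased"  # mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

# A single forward pass; during fine-tuning, per-token labels would also be passed
# so that the cross-entropy loss is computed over the tagging scheme.
tokens = tokenizer("He kicked the bucket yesterday.", return_tensors="pt")
outputs = model(**tokens)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```

Because the encoder weights are shared across languages, the same fine-tuned model can then be applied to sentences in any language covered by the multilingual pre-training.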
In this work, we leverage the knowledge developed in these two research areas (i.e., MWEs and multilingual NLP) to improve the results obtained at the PARSEME 1.2 shared task [13]. We explore the benefits of combining them in a single system by jointly fine-tuning the mBERT model on all languages simultaneously and evaluating it on each language separately. In addition, we try to improve the performance of the overall system by employing two mechanisms: (i) the newly introduced lateral inhibition layer [22] on top of the language model and (ii) adversarial training [23] between languages. Other researchers have experimented with the latter mechanism and have shown that it can provide better results in the right setting [24]; however, we are the first to experiment with and show the advantages of lateral inhibition in multilingual adversarial training.
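As an illustration of the second mechanism, the sketch below shows one common way to implement adversarial training between languages, using a gradient reversal layer and a language discriminator. It is a simplified PyTorch example under our assumptions, not our exact implementation; the lateral inhibition layer of [22], which in our system sits between the mBERT output and the tagging head, is only indicated by a comment.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LanguageDiscriminator(nn.Module):
    """Tries to predict the language of a sentence from its pooled encoder representation.

    Because gradients are reversed before reaching the encoder, the encoder is pushed
    towards representations from which the language cannot be recovered, i.e., towards
    language-independent embeddings.
    """
    def __init__(self, hidden_size: int, num_languages: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_languages),
        )

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        reversed_features = GradientReversal.apply(pooled_output, self.lambd)
        return self.net(reversed_features)

# Sketch of the combined objective during multilingual fine-tuning (pseudocode):
#   encoder_output = mbert(input_ids, attention_mask)        # shared mBERT encoder
#   gated_output   = lateral_inhibition(encoder_output)      # layer from [22], omitted here
#   tagging_loss   = tagging_head_loss(gated_output, mwe_labels)
#   language_loss  = cross_entropy(discriminator(pooled(encoder_output)), language_ids)
#   loss = tagging_loss + language_loss
```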
Our results demonstrate that, by employing lateral inhibition and multilingual adversarial training, we improve the results obtained by MTLB-STRUCT [25], the best system in edition 1.2 of the PARSEME competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 languages for unseen MWE identification. Furthermore, averaged across all languages, our highest-performing methodology achieves F1-scores of 71.37% and 43.26% for global and unseen MWE identification, respectively. Thus, we obtain an improvement of 1.23% for the former and a gain of 4.73% for the latter compared to the MTLB-STRUCT system.
The rest of the paper is structured as follows.
Section 2 summarizes the contributions of the PARSEME 1.2 competition and the main multilingual Transformer models. The following section,
Section 3, outlines the methodology employed in this work, including data representation, lateral inhibition, adversarial training, and how they were employed in our system.
Section 4 describes the setup (i.e., dataset and training parameters) used to evaluate our models.
Section 5 presents the results, and
Section 6 details our interpretation of their significance. Finally, our work is concluded in
Section 7 with potential future research directions.
5. Results
The results of our evaluation for both monolingual and multilingual training, with and without lateral inhibition and adversarial training, for all 14 languages, are displayed in
Table 2. We improved the performance of MTLB-STRUCT, the best overall system according to the competition benchmark (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_40_Shared_Task last accessed on 21 April 2023), on 11 out of 14 languages for global MWE prediction (the three remaining languages being German, Italian, and Romanian) and on 12 out of 14 languages for unseen MWE prediction (the two remaining languages being German and Greek). Among the cases where our methods underperformed, the only substantial gap occurred for German, where our best system was behind the MTLB-STRUCT system by approximately 3.43% on global MWE prediction and approximately 6.57% on unseen MWE prediction. We believe this is because the MTLB-STRUCT team employed the German BERT (https://huggingface.co/bert-base-german-cased last accessed on 21 April 2023) for this language, whereas we still used the mBERT model.
For global MWE prediction, we improved the performance in 11 languages: the highest F1-score was obtained by the monolingual training once (i.e., Chinese), by the simple multilingual training three times (i.e., Greek, Irish, and Turkish), by the multilingual training with lateral inhibition three times (i.e., French, Hebrew, and Polish), by the multilingual adversarial training once (i.e., Basque), and by the multilingual adversarial training with lateral inhibition three times (i.e., Hindi, Portuguese, and Swedish). For unseen MWE prediction, we achieved better results in 12 languages: the simple multilingual training obtained the highest F1-score only once (i.e., Swedish), the multilingual training with lateral inhibition three times (i.e., French, Turkish, and Chinese), the multilingual adversarial training five times (i.e., Irish, Hebrew, Hindi, Polish, and Romanian), and the multilingual adversarial training with lateral inhibition three times (i.e., Basque, Italian, and Portuguese). The monolingual training did not achieve the highest F1-score for unseen MWE prediction in any language. These findings are summarized in
Table 3.
We further compared the average scores across all languages obtained by our systems. In
Table 4, we compare our results with those obtained by each system at the latest edition of the PARSEME competition (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_50_Shared_task_results last accessed on 21 April 2023): MTLB-STRUCT [25], Travis-multi/mono [33], Seen2Unseen [34], FipsCo [10], HMSid [35], and MultiVitamin [32]. For global MWE identification, we outperformed the MTLB-STRUCT results with all the multilingual training experiments, the highest average F1-score being obtained by the simple multilingual training without lateral inhibition or adversarial training: it achieved an average F1-score of 71.37%, an improvement of 1.23% over the MTLB-STRUCT F1-score (i.e., 70.14%). For unseen MWE identification, we improved the average results obtained by MTLB-STRUCT with all the methodologies employed in this work; the highest average F1-score, 43.26%, was obtained by the multilingual adversarial training, outperforming the MTLB-STRUCT system by 4.73%.
6. Discussion
According to our experiments, the average MWE identification performance can be improved by approaching this problem with a multilingual NLP system, as described in this work. An interesting perspective on our results is how much improvement we achieved compared to the best system of the PARSEME 1.2 competition. These results are shown at the top of
Figure 2 for global MWE prediction and at its bottom for unseen MWE prediction. In general, the most significant relative improvements were achieved for Irish, where multilingual training combined with adversarial training boosted the performance by 45.32% for global MWE prediction and by 90.78% for unseen MWE prediction. On the other hand, for the same language, the monolingual training decreased the system's performance on global MWE prediction by 8.71% and only slightly increased it, by 2.86%, on unseen MWE prediction. We believe that these improvements in Irish were due to the benefits brought by the multilingual training, since this language contained the fewest training sentences (i.e., 257 sentences), and previous research has shown that superior results are obtained when such fine-tuning mechanisms are employed [59]. However, Hindi also has a small number of training samples (i.e., 282 sentences), yet our multilingual training results are worse than those for Irish. We assume that this is the outcome of the language inequalities present in the mBERT pre-training data [60] and of the linguistic isolation of Hindi, since there are no related languages in the fine-tuning data [61].
The second highest improvements for global MWE prediction were achieved for Swedish, with 2.45% for the monolingual training, 4.26% for the multilingual training, 4.17% for the multilingual training with lateral inhibition, 4.65% for the multilingual adversarial training, and 5.92% for the multilingual adversarial training with lateral inhibition. We observe a relatively large difference between the first and the second place, but we believe again that this is due to the small number of sentences for Irish compared to Swedish. On the other hand, the results for unseen MWE prediction show that the second highest improvements were attained for Romanian, with 43.62% for the monolingual training, 44.00% for the multilingual training, 32.56% for the multilingual training with lateral inhibition, 49.47% for the multilingual adversarial training, and 40.32% for the multilingual adversarial training with lateral inhibition. In addition, the improvements are more uniform for unseen MWE prediction than for global MWE prediction.
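Assuming the relative improvements discussed above and plotted in Figure 2 follow the usual definition of relative gain over the baseline F1-score, the minimal sketch below shows the computation; the numbers in it are purely illustrative and are not taken from Table 2.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain (in percent) of our F1-score over the baseline F1-score."""
    return (ours - baseline) / baseline * 100.0

# Purely illustrative values, not results from Table 2:
print(round(relative_improvement(ours=44.0, baseline=30.0), 2))  # 46.67
```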
7. Conclusions and Future Work
Failure to identify MWEs can lead to misinterpretation of text and errors in NLP tasks, making this an important area of research. In this paper, we analyzed the performance of MWE identification in a multilingual setting, training the mBERT model on the combined PARSEME 1.2 corpus covering all 14 languages in its composition. In addition, to boost the performance of our system, we employed lateral inhibition and language adversarial training in our methodology, intending to create embeddings that are as language-independent as possible. Our evaluation results highlighted that, through this approach, we managed to improve the results obtained by MTLB-STRUCT, the best system of the PARSEME 1.2 competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 for unseen MWE identification. Thus, with the highest average F1-scores of 71.37% for global MWE identification and 43.26% for unseen MWE identification, we surpass MTLB-STRUCT by 1.23% on the former task and by 4.73% on the latter.
Possible future work directions involve analyzing how language-independent the features produced by mBERT become when lateral inhibition and adversarial training are involved, together with an analysis of further models that produce multilingual embeddings, such as XLM or XLM-R. In addition, we intend to analyze these two methodologies, with possible extensions, for multilingual training beyond MWE identification, targeting tasks such as language generation or named entity recognition. Finally, since the languages in the PARSEME 1.2 dataset may share similar linguistic properties, we would like to explore how language groups improve each other's performance in the multilingual scenario.