1. Introduction
Natural language processing (NLP) is a significant domain of artificial intelligence, with applications ranging from language translation to text classification and information retrieval. NLP allows computers to interpret and process human language, enabling them to perform tasks such as understanding and answering questions, summarizing texts, and detecting sentiment. However, some linguistic phenomena can prevent machines (and sometimes even humans) from understanding language correctly. One such phenomenon is the multiword expression (MWE): a group of words that functions as a unit and conveys a meaning that is not the sum of the meanings of its component words (i.e., the expression lacks compositionality). Examples of MWEs include idioms (e.g., “break a leg”, used to wish someone good luck), collocations (e.g., “take an exam”), and compounds (e.g., “ice cream”), with different authors assuming a broader or narrower definition of the term. The number of MWEs in a language is relatively high. The authors of [1] synthesized papers reporting the number or proportion of MWEs in different languages: English, with an almost equal number of MWEs and single words; French, with 3.3 times more MWE adverbs than single-word adverbs and 1.7 times more MWE verbs than single-word verbs; and Japanese, in which 44% of verbs are MWEs. Despite being so numerous in the dictionary, MWEs occur with low frequency in corpora [2].
Identifying and processing MWEs is crucial for various NLP tasks [3]. In machine translation, for instance, the correct translation of an MWE often depends on the specific context in which it appears. If an MWE is translated literally rather than appropriately localized for the target language, the resulting translation may be difficult for native speakers to understand or may convey the wrong meaning [4]. In text classification, MWEs are considered essential clues to the sentiment or topic of a text [5]. Additionally, MWEs can help disambiguate the meaning of a query, improving the accuracy of search engines in information retrieval [6].
Notable recent progress in the field has been made by the PARSEME community [7], which evolved from the COST action of the same name dedicated to parsing and MWEs (https://typo.uni-konstanz.de/parseme/ last accessed on 21 April 2023). Their activity has two significant outcomes: (i) a multilingual corpus annotated for verbal MWEs (VMWEs) in 26 languages by more than 160 native annotators, with three versions released so far (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2282, https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2842, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3367 last accessed on 21 April 2023) [8,9,10]; and (ii) a series of shared tasks (also three editions so far) dedicated to the automatic and semi-supervised identification of VMWEs in texts [11,12,13], in which these corpora were used for training and testing the participating systems.
Developing systems that can handle multiple languages is another important area of NLP. The ability to accurately process and analyze text in various languages is becoming increasingly valuable as the world grows more globalized and interconnected. For example, multilingual NLP systems can improve machine translation, allowing computers to translate text accurately from one language to another. This is particularly useful when there is a need to communicate with speakers of different languages, such as in global business or international relations. Beyond its practical applications, multilingual NLP is also an important area of study from a theoretical perspective, as research in this field can shed light on the underlying principles of language processing and how these principles differ across languages [14,15].
Multilingual Transformer models have become a popular choice for multilingual NLP tasks due to their ability to handle multiple languages and achieve strong performance on a wide range of tasks. Based on the Transformer architecture [16], these models are pre-trained on large amounts of multilingual data and can be fine-tuned for specific NLP tasks, such as language translation or text classification. Influential models in this area include multilingual bidirectional encoder representations from transformers (mBERT) [17], the cross-lingual language model (XLM) [18], XLM-RoBERTa (XLM-R) [19], and multilingual bidirectional and auto-regressive transformers (mBART) [20]. One of the essential benefits of multilingual Transformer models is their ability to transfer knowledge between languages: they learn shared representations of different languages, allowing them to perform well on tasks in languages they have not been specifically trained on. Thus, multilingual Transformer models are a good choice for NLP tasks that involve multiple languages, such as machine translation or cross-lingual information retrieval [21].
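To make the fine-tuning step concrete, the snippet below is a minimal sketch (using the Hugging Face transformers library, with an illustrative three-label tagging scheme rather than any label set from this work) of loading mBERT with a token-classification head, the standard way a pre-trained multilingual encoder is adapted to a labeling task such as MWE identification:

```python
# Minimal sketch: adapting a pre-trained multilingual encoder to a token-labeling
# task. The value num_labels=3 (e.g., a B/I/O-style scheme) is illustrative only.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-multilingual-cased"  # mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

# A single forward pass; during fine-tuning, per-token labels would also be passed
# so that the cross-entropy loss is computed over the tagging scheme.
tokens = tokenizer("He kicked the bucket yesterday.", return_tensors="pt")
outputs = model(**tokens)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```

Because the encoder weights are shared across languages, the same fine-tuned model can then be applied to sentences in any language covered by the multilingual pre-training.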
In this work, we leverage the knowledge developed in these two research areas (i.e., MWEs and multilingual NLP) to improve the results obtained at the PARSEME 1.2 shared task [13]. We explore the benefits of combining them in a single system by jointly fine-tuning the mBERT model on all languages simultaneously and evaluating it on each language separately. In addition, we try to improve the performance of the overall system by employing two mechanisms: (i) the newly introduced lateral inhibition layer [22] on top of the language model and (ii) adversarial training [23] between languages. Other researchers have experimented with the latter mechanism and have shown that it can provide better results in the right setting [24]; however, we are the first to experiment with and show the advantages of lateral inhibition in multilingual adversarial training.
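As an illustration of the second mechanism, the sketch below shows one common way to implement adversarial training between languages, using a gradient reversal layer and a language discriminator. It is a simplified PyTorch example under our assumptions, not our exact implementation; the lateral inhibition layer of [22], which in our system sits between the mBERT output and the tagging head, is only indicated by a comment.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LanguageDiscriminator(nn.Module):
    """Tries to predict the language of a sentence from its pooled encoder representation.

    Because gradients are reversed before reaching the encoder, the encoder is pushed
    towards representations from which the language cannot be recovered, i.e., towards
    language-independent embeddings.
    """
    def __init__(self, hidden_size: int, num_languages: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_languages),
        )

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        reversed_features = GradientReversal.apply(pooled_output, self.lambd)
        return self.net(reversed_features)

# Sketch of the combined objective during multilingual fine-tuning (pseudocode):
#   encoder_output = mbert(input_ids, attention_mask)        # shared mBERT encoder
#   gated_output   = lateral_inhibition(encoder_output)      # layer from [22], omitted here
#   tagging_loss   = tagging_head_loss(gated_output, mwe_labels)
#   language_loss  = cross_entropy(discriminator(pooled(encoder_output)), language_ids)
#   loss = tagging_loss + language_loss
```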
Our results demonstrate that, by employing lateral inhibition and multilingual adversarial training, we improve the results obtained by MTLB-STRUCT [25], the best system in edition 1.2 of the PARSEME competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 languages for unseen MWE identification. Furthermore, averaged across all languages, our highest-performing methodology achieves F1-scores of 71.37% and 43.26% for global and unseen MWE identification, respectively. Thus, we obtain an improvement of 1.23% for the former and a gain of 4.73% for the latter compared to the MTLB-STRUCT system.
The rest of the paper is structured as follows.
Section 2 summarizes the contributions of the PARSEME 1.2 competition and the main multilingual Transformer models. The following section,
Section 3, outlines the methodology employed in this work, including data representation, lateral inhibition, adversarial training, and how they were employed in our system.
Section 4 describes the setup (i.e., dataset and training parameters) used to evaluate our models.
Section 5 presents the results, and
Section 6 details our interpretation of their significance. Finally, our work is concluded in
Section 7 with potential future research directions.
5. Results
The results of our evaluation for both monolingual and multilingual training, with and without lateral inhibition and adversarial training, for all 14 languages, are displayed in
Table 2. We improved the performance of MTLB-STRUCT, the best overall system according to the competition benchmark (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_40_Shared_Task last accessed on 21 April 2023), on 11 out of 14 languages for global MWE prediction (the three remaining languages being German, Italian, and Romanian) and on 12 out of 14 languages for unseen MWE prediction (the two remaining languages being German and Greek). Among the cases where our methods underperformed, the only substantial gap occurred for German, where our best system was behind the MTLB-STRUCT system by approximately 3.43% on global MWE prediction and approximately 6.57% on unseen MWE prediction. We believe this is because the MTLB-STRUCT team employed the German BERT (https://huggingface.co/bert-base-german-cased last accessed on 21 April 2023) for this language, whereas we still used the mBERT model.
For global MWE prediction, we improved the performance in 11 languages: the highest F1-score was obtained by the monolingual training once (i.e., Chinese), by the simple multilingual training three times (i.e., Greek, Irish, and Turkish), by the multilingual training with lateral inhibition three times (i.e., French, Hebrew, and Polish), by the multilingual adversarial training once (i.e., Basque), and by the multilingual adversarial training with lateral inhibition three times (i.e., Hindi, Portuguese, and Swedish). For unseen MWE prediction, we achieved better results in 12 languages: the simple multilingual training obtained the highest F1-score only once (i.e., Swedish), the multilingual training with lateral inhibition three times (i.e., French, Turkish, and Chinese), the multilingual adversarial training five times (i.e., Irish, Hebrew, Hindi, Polish, and Romanian), and the multilingual adversarial training with lateral inhibition three times (i.e., Basque, Italian, and Portuguese). The monolingual training did not achieve the highest F1-score for unseen MWE prediction in any language. These findings are summarized in
Table 3.
We further compared the average scores across all languages obtained by our systems. In
Table 4, we compare our results with those obtained by each system at the latest edition of the PARSEME competition (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_50_Shared_task_results last accessed on 21 April 2023): MTLB-STRUCT [25], Travis-multi/mono [33], Seen2Unseen [34], FipsCo [10], HMSid [35], and MultiVitamin [32]. For global MWE identification, we outperformed the MTLB-STRUCT results with all the multilingual training experiments, the highest average F1-score being obtained by the simple multilingual training without lateral inhibition or adversarial training: it achieved an average F1-score of 71.37%, an improvement of 1.23% over the MTLB-STRUCT F1-score (i.e., 70.14%). For unseen MWE identification, we improved the average results obtained by MTLB-STRUCT with all the methodologies employed in this work; the highest average F1-score, 43.26%, was obtained by the multilingual adversarial training, outperforming the MTLB-STRUCT system by 4.73%.
6. Discussion
According to our experiments, the average MWE identification performance can be improved by approaching this problem with a multilingual NLP system, as described in this work. An interesting perspective on our results is how much improvement we achieved compared to the best system of the PARSEME 1.2 competition. These results are shown at the top of
Figure 2 for global MWE prediction and at its bottom for unseen MWE prediction. In general, the most significant relative improvements were achieved for Irish, where multilingual training combined with adversarial training boosted the performance by 45.32% for global MWE prediction and by 90.78% for unseen MWE prediction. On the other hand, for the same language, the monolingual training decreased the system's performance on global MWE prediction by 8.71% and only slightly increased it, by 2.86%, on unseen MWE prediction. We believe that these improvements in Irish were due to the benefits brought by the multilingual training, since this language contained the fewest training sentences (i.e., 257 sentences), and previous research has shown that superior results are obtained when such fine-tuning mechanisms are employed [59]. However, Hindi also has a small number of training samples (i.e., 282 sentences), yet our multilingual training results are worse than those for Irish. We assume that this is the outcome of the language inequalities present in the mBERT pre-training data [60] and of the linguistic isolation of Hindi, since there are no related languages in the fine-tuning data [61].
The second highest improvements for global MWE prediction were achieved for Swedish, with 2.45% for the monolingual training, 4.26% for the multilingual training, 4.17% for the multilingual training with lateral inhibition, 4.65% for the multilingual adversarial training, and 5.92% for the multilingual adversarial training with lateral inhibition. We observe a relatively large difference between the first and the second place, but we believe again that this is due to the small number of sentences for Irish compared to Swedish. On the other hand, the results for unseen MWE prediction show that the second highest improvements were attained for Romanian, with 43.62% for the monolingual training, 44.00% for the multilingual training, 32.56% for the multilingual training with lateral inhibition, 49.47% for the multilingual adversarial training, and 40.32% for the multilingual adversarial training with lateral inhibition. In addition, the improvements are more uniform for unseen MWE prediction than for global MWE prediction.
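Assuming the relative improvements discussed above and plotted in Figure 2 follow the usual definition of relative gain over the baseline F1-score, the minimal sketch below shows the computation; the numbers in it are purely illustrative and are not taken from Table 2.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain (in percent) of our F1-score over the baseline F1-score."""
    return (ours - baseline) / baseline * 100.0

# Purely illustrative values, not results from Table 2:
print(round(relative_improvement(ours=44.0, baseline=30.0), 2))  # 46.67
```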
7. Conclusions and Future Work
Failure to identify MWEs can lead to misinterpretation of text and errors in NLP tasks, making this an important area of research. In this paper, we analyzed the performance of MWE identification in a multilingual setting, training the mBERT model on the combined PARSEME 1.2 corpus covering all 14 languages in its composition. In addition, to boost the performance of our system, we employed lateral inhibition and language adversarial training in our methodology, intending to create embeddings that are as language-independent as possible. Our evaluation results highlighted that, through this approach, we managed to improve the results obtained by MTLB-STRUCT, the best system of the PARSEME 1.2 competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 for unseen MWE identification. Thus, with the highest average F1-scores of 71.37% for global MWE identification and 43.26% for unseen MWE identification, we surpass MTLB-STRUCT by 1.23% on the former task and by 4.73% on the latter.
Possible future work directions involve analyzing how language-independent the features produced by mBERT become when lateral inhibition and adversarial training are involved, together with an analysis of further models that produce multilingual embeddings, such as XLM or XLM-R. In addition, we intend to analyze these two methodologies, with possible extensions, for multilingual training beyond MWE identification, targeting tasks such as language generation or named entity recognition. Finally, since the languages in the PARSEME 1.2 dataset may share similar linguistic properties, we would like to explore how language groups improve each other's performance in the multilingual scenario.