Search Results (19)

Search Parameters:
Keywords = multilingual comparable corpus

23 pages, 1604 KiB  
Article
Fine-Tuning Large Language Models for Kazakh Text Simplification
by Alymzhan Toleu, Gulmira Tolegen and Irina Ualiyeva
Appl. Sci. 2025, 15(15), 8344; https://doi.org/10.3390/app15158344 - 26 Jul 2025
Viewed by 342
Abstract
This paper addresses the text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct a human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity, significantly above GPT-4o-mini. Error analysis shows that the remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh's agglutinative morphology and flexible syntax.
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
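
As a rough illustration of the data side of such instruction tuning (not the authors' pipeline; the prompt template and file name are invented for this sketch), complex–simple pairs can be serialized as instruction-tuning records:

```python
# Minimal sketch: serializing complex-simple pairs as instruction-tuning
# records for a multilingual LLM. The prompt wording is hypothetical; the
# paper also compares English vs. Kazakh prompt languages.
import json

def make_record(complex_sent: str, simple_sent: str) -> dict:
    instruction = "Simplify the following Kazakh sentence while preserving its meaning."
    return {"instruction": instruction, "input": complex_sent, "output": simple_sent}

pairs = [("(complex Kazakh sentence)", "(simple Kazakh sentence)")]  # placeholders
with open("kazsim_train.jsonl", "w", encoding="utf-8") as f:
    for cx, sp in pairs:
        f.write(json.dumps(make_record(cx, sp), ensure_ascii=False) + "\n")
```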

20 pages, 1955 KiB  
Article
Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models
by Svitlana Biloshchytska, Arailym Tleubayeva, Oleksandr Kuchanskyi, Andrii Biloshchytskyi, Yurii Andrashko, Sapar Toxanov, Aidos Mukhatayev and Saltanat Sharipova
Appl. Sci. 2025, 15(12), 6707; https://doi.org/10.3390/app15126707 - 15 Jun 2025
Viewed by 589
Abstract
This study presents an advanced hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, LSH, LSA, and LDA, and is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 text fragments manually modified using typical techniques such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The results show that the hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and demonstrating accuracy comparable to the BERT model while requiring substantially lower computational resources. The hybrid model proved highly effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, making it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. Moreover, this study highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language.
(This article belongs to the Section Computing and Artificial Intelligence)
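
A minimal sketch of the statistical half of such a hybrid detector (the strings and the 0.8 threshold are illustrative, not values from the paper): character n-gram TF-IDF vectors compared by cosine similarity, a representation that tolerates agglutinative suffixation better than word-level matching:

```python
# Sketch: character n-gram TF-IDF plus cosine similarity as a cheap
# near-duplicate signal. Threshold and texts are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "original fragment of Kazakh text",
    "original fragment of Kazakh text, lightly paraphrased",
]  # placeholders for corpus fragments
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
print("near-duplicate" if sim > 0.8 else "distinct", round(sim, 3))
```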

18 pages, 332 KiB  
Article
Weakly-Supervised Multilingual Medical NER for Symptom Extraction for Low-Resource Languages
by Rigon Sallauka, Umut Arioz, Matej Rojc and Izidor Mlakar
Appl. Sci. 2025, 15(10), 5585; https://doi.org/10.3390/app15105585 - 16 May 2025
Cited by 1 | Viewed by 567
Abstract
Patient-reported health data, especially patient-reported outcome measures, are vital for improving clinical care but are often limited by memory bias, cognitive load, and inflexible questionnaires. Patients prefer conversational symptom reporting, highlighting the need for robust methods in symptom extraction and conversational intelligence. This study presents a weakly-supervised pipeline for training and evaluating medical Named Entity Recognition (NER) models across eight languages, with a focus on low-resource settings. A merged English medical corpus, annotated using the Stanza i2b2 model, was translated into German, Greek, Spanish, Italian, Portuguese, Polish, and Slovenian, preserving the entity annotations for medical problems, diagnostic tests, and treatments. Data augmentation addressed the class imbalance, and the fine-tuned BERT-based models consistently outperformed baselines. The English model achieved the highest F1 score (80.07%), followed by German (78.70%), Spanish (77.61%), Portuguese (77.21%), Slovenian (75.72%), Italian (75.60%), Polish (75.56%), and Greek (69.10%). Compared to the existing baselines, our models demonstrated notable performance gains, particularly in English, Spanish, and Italian. This research underscores the feasibility and effectiveness of weakly-supervised multilingual approaches for medical entity extraction, contributing to improved information access in clinical narratives, especially in under-resourced languages.
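
For readers unfamiliar with the model class, here is a sketch of the inference side of a fine-tuned token-classification NER model via Hugging Face Transformers; the checkpoint below is a public general-domain stand-in, not the authors' medical model:

```python
# Sketch: token-classification inference. "dslim/bert-base-NER" is a public
# general-domain stand-in; the paper fine-tunes BERT-based models on
# weakly-labeled medical text in eight languages.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")
for ent in ner("The patient reports chest pain and was given aspirin."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```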

22 pages, 487 KiB  
Article
From Fact Drafts to Operational Systems: Semantic Search in Legal Decisions Using Fact Drafts
by Gergely Márk Csányi, Dorina Lakatos, István Üveges, Andrea Megyeri, János Pál Vadász, Dániel Nagy and Renátó Vági
Big Data Cogn. Comput. 2024, 8(12), 185; https://doi.org/10.3390/bdcc8120185 - 10 Dec 2024
Viewed by 1907
Abstract
This research paper presents findings from an investigation into the semantic similarity search task in the legal domain, using a corpus of 1172 Hungarian court decisions. The study establishes the groundwork for an operational semantic similarity search system designed to identify cases with comparable facts using preliminary legal fact drafts. Evaluating such systems often poses significant challenges, given the need for thorough document checks, which can be costly and limit evaluation reusability. To address this, the study employs manually created fact drafts for legal cases, enabling reliable ranking of original cases within retrieved documents and quantitative comparison of various vectorization methods. The study compares twelve different text embedding solutions (the most recent became available just a few weeks before the manuscript was written), identifying Cohere's embed-multilingual-v3.0, Beijing Academy of Artificial Intelligence's bge-m3, Jina AI's jina-embeddings-v3, OpenAI's text-embedding-3-large, and Microsoft's multilingual-e5-large models as top performers. To overcome the transformer-based models' context window limitation, we investigated chunking, striding, and last chunk scaling techniques, with last chunk scaling significantly improving embedding quality. The results suggest that the effectiveness of striding varies with token count. Notably, striding with 16 tokens yielded optimal results, representing 3.125% of the context window size for the best-performing models. The results also suggest that, among the models with an 8192-token context window, the bge-m3 model is superior to the jina-embeddings-v3 and text-embedding-3-large models at capturing the relevant parts of a document when the text contains a significant amount of noise. The validity of the approach was evaluated and confirmed by legal experts. These insights led to an operational semantic search system for a prominent legal content provider.
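
A minimal sketch of one plausible reading of the chunking, striding, and last-chunk-scaling steps (the windowing scheme and the weighting are our assumptions, not the paper's exact formulation): split a long token sequence into overlapping windows, embed each, and down-weight the final, usually shorter, window:

```python
# Sketch: overlapping windows with a small stride, mean-pooled with the
# last (shorter) window down-weighted by its relative length. This is one
# plausible reading of "last chunk scaling", not the paper's exact recipe.
import numpy as np

def windows(tokens, size=512, stride=16):
    step = size - stride  # consecutive windows share `stride` tokens
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def pooled_embedding(tokens, embed_fn, size=512, stride=16):
    chunks = windows(tokens, size, stride)
    vecs = np.stack([embed_fn(chunk) for chunk in chunks])  # one vector per chunk
    weights = np.ones(len(chunks))
    weights[-1] = len(chunks[-1]) / size  # last-chunk scaling
    return (vecs * weights[:, None]).sum(axis=0) / weights.sum()

# Toy usage with a stand-in embedder:
emb = pooled_embedding(list(range(1200)), lambda chunk: np.random.rand(8))
```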

25 pages, 1115 KiB  
Article
Explainable Pre-Trained Language Models for Sentiment Analysis in Low-Resourced Languages
by Koena Ronny Mabokela, Mpho Primus and Turgay Celik
Big Data Cogn. Comput. 2024, 8(11), 160; https://doi.org/10.3390/bdcc8110160 - 15 Nov 2024
Viewed by 2542
Abstract
Sentiment analysis is a crucial tool for measuring public opinion and understanding human communication across digital social media platforms. However, due to linguistic complexities and limited data or computational resources, it is under-represented in many African languages. While state-of-the-art Afrocentric pre-trained language models (PLMs) have been developed for various natural language processing (NLP) tasks, their applications in eXplainable Artificial Intelligence (XAI) remain largely unexplored. In this study, we propose a novel approach that combines Afrocentric PLMs with XAI techniques for sentiment analysis. We demonstrate the effectiveness of incorporating attention mechanisms and visualization techniques in improving the transparency, trustworthiness, and decision-making capabilities of transformer-based models when making sentiment predictions. To validate our approach, we employ the SAfriSenti corpus, a multilingual sentiment dataset for South African under-resourced languages, and perform a series of sentiment analysis experiments. These experiments enable comprehensive evaluations, comparing the performance of Afrocentric models against mainstream PLMs. Our results show that the Afro-XLMR model outperforms all other models, achieving an average F1-score of 71.04% across five tested languages, and the lowest error rate among the evaluated models. Additionally, we enhance the interpretability and explainability of the Afro-XLMR model using Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). These XAI techniques ensure that sentiment predictions are not only accurate and interpretable but also understandable, fostering trust and reliability in AI-driven NLP technologies, particularly in the context of African languages.
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)
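
To illustrate the LIME side in general terms (the three-class stub classifier below is a placeholder, not the Afro-XLMR model), any text classifier exposing class probabilities can be explained like this:

```python
# Sketch: LIME over any classifier_fn that maps a list of texts to class
# probabilities. The constant-probability stub stands in for a fine-tuned PLM.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):  # stand-in for the PLM's softmax outputs
    return np.tile([0.2, 0.3, 0.5], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["negative", "neutral", "positive"])
exp = explainer.explain_instance("what a wonderful day", predict_proba, num_features=5)
print(exp.as_list())  # (token, weight) pairs driving the prediction
```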

19 pages, 914 KiB  
Article
A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things
by Yu Weng, Shumin Dong and Chaomurilige
Mathematics 2024, 12(4), 598; https://doi.org/10.3390/math12040598 - 17 Feb 2024
Cited by 2 | Viewed by 1687
Abstract
With the expansion of Internet of Things (IoT) and artificial intelligence (AI) technologies, multilingual scenarios are gradually increasing, and applications based on multilingual resources are also on the rise. In this process, apart from the need to construct multilingual resources, privacy protection issues such as data privacy leakage are increasingly prominent. Comparable corpora are important for multilingual language information processing in the IoT, yet multilingual comparable corpora that preserve privacy are rare, so there is an urgent need to construct such a corpus resource. This paper proposes a method for constructing a privacy-preserving multilingual comparable corpus, taking Chinese–Uighur–Tibetan IoT-based news as an example: it maps texts in the different languages to a unified vector space to avoid exposing sensitive information, then calculates the similarity between texts in different languages and uses it as a comparability index to construct comparable relations. Through a decision-making mechanism that minimizes impossibility, it identifies comparable multilingual text pairs based on chapter size, realizing the construction of a privacy-preserving Chinese–Uighur–Tibetan comparable corpus (CUTCC). Evaluation experiments demonstrate the effectiveness of our proposed provable method, which outperforms baselines by 77% in accuracy rate, by 34% in recall rate, and by 47.17% in F value. The CUTCC provides valuable privacy-preserving data resources and language services for multilingual IoT scenarios.
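
In general form, the comparability step resembles the following sketch (the public multilingual checkpoint is a stand-in and does not necessarily cover Uighur or Tibetan; the 0.5 threshold is illustrative):

```python
# Sketch: embed texts from different languages into one vector space and
# use cosine similarity as a comparability index. Checkpoint and threshold
# are illustrative stand-ins, not the paper's setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_a = "news article in language A"  # placeholder text
doc_b = "news article in language B"  # placeholder text
emb = model.encode([doc_a, doc_b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print("comparable pair" if score > 0.5 else "not comparable", round(score, 3))
```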

12 pages, 318 KiB  
Article
esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish
by Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Ksenia Kharitonova and Zoraida Callejas
Appl. Sci. 2023, 13(22), 12155; https://doi.org/10.3390/app132212155 - 8 Nov 2023
Cited by 1 | Viewed by 2328
Abstract
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data, and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling, but there are notable limitations in the results for some languages, including Spanish: these datasets are either smaller than those for other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. For several languages, it is the most extensive corpus with this level of quality in content extraction, cleaning, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.
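
As a toy illustration of boundary-preserving deduplication (one of several methods the paper describes; this exact-match hash variant is our simplification):

```python
# Sketch: exact-match paragraph deduplication by content hash, keeping
# paragraph boundaries intact. Real pipelines also use fuzzy methods.
import hashlib

def dedup_paragraphs(doc: str) -> str:
    seen, kept = set(), []
    for para in doc.split("\n\n"):
        digest = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(para)
    return "\n\n".join(kept)

print(dedup_paragraphs("Hola mundo.\n\nHola mundo.\n\nAdios."))
```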

44 pages, 1396 KiB  
Article
Subordination in Turkish Heritage Children with and without Developmental Language Impairment
by Nebiye Hilal Șan
Languages 2023, 8(4), 239; https://doi.org/10.3390/languages8040239 - 19 Oct 2023
Cited by 3 | Viewed by 3914
Abstract
A large body of cross-linguistic research has shown that complex constructions, such as subordinate constructions, are vulnerable in bilingual children with DLD, whereas they are robust in bilingual children with typical language development; therefore, they are argued to constitute a potential clinical marker for identifying DLD in bilingual contexts, especially when the majority language is assessed. However, it is not clear whether this also applies to heritage contexts, particularly those in which the heritage language is affected by L2 contact-induced phenomena, as in the case of Heritage Turkish in Germany. In this study, we compare subordination in 13 Turkish heritage children with and without DLD (age range 5;1–11;6) to that in 10 late successive (lL2) BiTDs (age range 7;2–12;2) and 10 Turkish adult heritage bilinguals (age range 20;3–25;10), analyzing subordinate constructions using both Standard and Heritage Turkish as reference varieties. We further investigate which background factors predict performance on subordinate constructions. Speech samples were elicited using the sentence repetition task (SRT) from the TODİL standardized test battery and the Multilingual Assessment Instrument for Narratives (MAIN). A systematic analysis of the corpus of subordinate clauses produced in the SRT and the MAIN narrative production and comprehension tasks shows that heritage children with TD and with DLD may not be differentiated through these tasks, especially when their utterances are scored against Standard Turkish as the baseline; however, they may be differentiated if Heritage Turkish is taken as the baseline. The age of onset of the second language (AoO_L2) was the leading predictor of performance in subordinate clause production in the SRT and in both MAIN tasks, regardless of whether Standard Turkish or Heritage Turkish was used as the reference variety in scoring.
(This article belongs to the Special Issue Bilingualism and Language Impairment)

21 pages, 1731 KiB  
Article
Analyzing Indo-European Language Similarities Using Document Vectors
by Samuel R. Schrader and Eren Gultepe
Informatics 2023, 10(4), 76; https://doi.org/10.3390/informatics10040076 - 26 Sep 2023
Cited by 2 | Viewed by 3342
Abstract
The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of a set of 532,092 Bible verses, with 24,186 identical verses translated into each language. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from said tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
(This article belongs to the Special Issue Digital Humanities and Visualization)
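
A compact sketch of method (A) under stand-in data (random vectors replace the trained document-vector centroids, and the language subset is illustrative):

```python
# Sketch: hierarchical clustering over per-language centroid vectors.
# Random vectors stand in for centroids of trained document embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
langs = ["en", "de", "fr", "it", "ru", "fa"]  # illustrative subset of the 22
centroids = rng.normal(size=(len(langs), 300))
Z = linkage(centroids, method="average", metric="cosine")
tree = dendrogram(Z, labels=langs, no_plot=True)  # or plot to view the tree
print(tree["ivl"])  # leaf order of the reconstructed hierarchy
```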

18 pages, 874 KiB  
Article
Multilingual Multiword Expression Identification Using Lateral Inhibition and Domain Adaptation
by Andrei-Marius Avram, Verginica Barbu Mititelu, Vasile Păiș, Dumitru-Clementin Cercel and Ștefan Trăușan-Matu
Mathematics 2023, 11(11), 2548; https://doi.org/10.3390/math11112548 - 1 Jun 2023
Cited by 4 | Viewed by 1941
Abstract
Correctly identifying multiword expressions (MWEs) is an important task for most natural language processing systems, since their misidentification can result in ambiguity and misunderstanding of the underlying text. In this work, we evaluate the performance of the mBERT model for MWE identification in a multilingual context by training it on all 14 languages available in version 1.2 of the PARSEME corpus. We also incorporate lateral inhibition and language adversarial training into our methodology to create language-independent embeddings and improve the model's ability to identify multiword expressions. The evaluation of our models shows that the approach employed in this work achieves better results than the best system of the PARSEME 1.2 competition, MTLB-STRUCT, on 11 out of 14 languages for global MWE identification and on 12 out of 14 languages for unseen MWE identification. Additionally, averaged across all languages, our best approach outperforms the MTLB-STRUCT system by 1.23% on global MWE identification and by 4.73% on unseen MWE identification.
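
The base setup, before the paper's lateral-inhibition and adversarial additions, amounts to mBERT with a token-classification head over BIO-style MWE tags; a minimal sketch (the three-label scheme is illustrative):

```python
# Sketch: mBERT with a token-classification head for BIO-style MWE tags.
# Lateral inhibition and language adversarial training are omitted here.
from transformers import AutoTokenizer, AutoModelForTokenClassification

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)  # e.g. O / B-MWE / I-MWE
enc = tok("El equipo llevó a cabo el plan.", return_tensors="pt")
logits = model(**enc).logits  # (1, seq_len, 3): per-token tag scores
print(logits.argmax(-1))
```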

17 pages, 635 KiB  
Article
Building Neural Machine Translation Systems for Multilingual Participatory Spaces
by Pintu Lohar, Guodong Xie, Daniel Gallagher and Andy Way
Analytics 2023, 2(2), 393-409; https://doi.org/10.3390/analytics2020022 - 1 May 2023
Cited by 2 | Viewed by 3196
Abstract
This work presents the development of the translation component in a multistage, multilevel, multimode, multilingual and dynamic deliberative (M4D2) system, built to facilitate automated moderation and translation in the languages of five European countries: Italy, Ireland, Germany, France and Poland. Two main topics were to be addressed in the deliberation process: (i) the environment and climate change; and (ii) the economy and inequality. In this work, we describe the development of neural machine translation (NMT) models for these domains for six European languages: Italian, English (included as the second official language of Ireland), Irish, German, French and Polish. As a result, we generate 30 NMT models, initially baseline systems built using freely available online data, which are then adapted to the domains of interest in the project by (i) filtering the corpora, (ii) tuning the systems with automatically extracted in-domain development datasets and (iii) using corpus concatenation techniques to expand the amount of data available. We compare our results produced by the domain-adapted systems with those produced by Google Translate, and demonstrate that fast, high-quality systems can be produced that facilitate multilingual deliberation in a secure environment.
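
Step (i), corpus filtering, can be as simple as keeping parallel pairs whose source side matches in-domain cues; a toy sketch (the keyword list and sentence pairs are invented for illustration):

```python
# Sketch: naive in-domain filtering of a parallel corpus by source-side
# keyword cues. Real pipelines typically use data-selection scores instead.
CLIMATE_CUES = {"climate", "emission", "emissions", "renewable", "biodiversity"}

def in_domain(src: str) -> bool:
    tokens = {t.strip(".,;:!?").lower() for t in src.split()}
    return bool(tokens & CLIMATE_CUES)

pairs = [
    ("Renewable energy cuts emissions.", "Les énergies renouvelables réduisent les émissions."),
    ("The match ended in a draw.", "Le match s'est terminé par un nul."),
]
domain_pairs = [(s, t) for s, t in pairs if in_domain(s)]
print(len(domain_pairs), "in-domain pair(s) kept")
```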

17 pages, 2134 KiB  
Article
Spoken Language Identification System Using Convolutional Recurrent Neural Network
by Adal A. Alashban, Mustafa A. Qamhan, Ali H. Meftah and Yousef A. Alotaibi
Appl. Sci. 2022, 12(18), 9181; https://doi.org/10.3390/app12189181 - 13 Sep 2022
Cited by 30 | Viewed by 5912
Abstract
Following recent advancements in deep learning and artificial intelligence, spoken language identification applications are playing an increasingly significant role in our day-to-day lives, especially in the domain of multilingual speech recognition. In this article, we propose a spoken language identification system that operates on sequences of feature vectors. The proposed system uses a hybrid Convolutional Recurrent Neural Network (CRNN), which combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), for spoken language identification on seven languages, including Arabic, chosen from subsets of the Mozilla Common Voice (MCV) corpus. The proposed system exploits the advantages of both CNN and RNN architectures to construct the CRNN architecture. At the feature extraction stage, it compares Gammatone Cepstral Coefficient (GTCC) features and Mel Frequency Cepstral Coefficient (MFCC) features, as well as a combination of both. Finally, the speech signals are represented as frames and used as the input to the CRNN architecture. The experimental results indicate higher performance with combined GTCC and MFCC features than with GTCC or MFCC features used individually. The average accuracy of the proposed system was 92.81% in the best spoken language identification experiment. Furthermore, the system can learn language-specific patterns in various filter-size representations of speech files.
(This article belongs to the Special Issue Recent Trends in Natural Language Processing and Its Applications)
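
The MFCC half of the feature stage is standard; a sketch via librosa (the file path is hypothetical, and GTCC extraction would require a separate gammatone filterbank toolkit):

```python
# Sketch: frame-level MFCC features as CRNN input. "clip.wav" is a
# hypothetical file; GTCCs would need a gammatone filterbank package.
import librosa

y, sr = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
frames = mfcc.T  # one 13-dim feature vector per frame
print(frames.shape)
```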

22 pages, 2952 KiB  
Article
Fake News Spreaders Detection: Sometimes Attention Is Not All You Need
by Marco Siino, Elisa Di Nuovo, Ilenia Tinnirello and Marco La Cascia
Information 2022, 13(9), 426; https://doi.org/10.3390/info13090426 - 9 Sep 2022
Cited by 39 | Viewed by 5637
Abstract
Guided by a corpus linguistics approach, in this article we present a comparative evaluation of State-of-the-Art (SotA) models, with a special focus on Transformers, for the task of detecting Fake News Spreaders (i.e., users who share Fake News). First, we explore the reference multilingual dataset for the task, exploiting corpus linguistics techniques such as the chi-square test, keywords, and Word Sketch. Second, we perform experiments on several models for Natural Language Processing. Third, we perform a comparative evaluation using the most recent Transformer-based models (RoBERTa, DistilBERT, BERT, XLNet, ELECTRA, Longformer) and other deep and non-deep SotA models (CNN, MultiCNN, Bayes, SVM). The CNN outperforms all the other models evaluated and, to the best of our knowledge, any existing approach on the same dataset. Fourth, to better understand this result, we conduct a post-hoc analysis to investigate the behaviour of the best-performing black-box model. This study highlights the importance of choosing a suitable classifier for the specific task, and to make an educated decision, we propose the use of corpus linguistics techniques. Our results suggest that large pre-trained deep models like Transformers are not necessarily the first choice when addressing a text classification task like the one presented in this article. All the code developed to run our tests is publicly available on GitHub.
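
One of the corpus-linguistics techniques named above, the chi-square test, can surface class-discriminating terms before any classifier is chosen; a toy sketch with invented two-document data:

```python
# Sketch: chi-square scores over a bag-of-words matrix to rank terms by
# how strongly they separate spreaders from non-spreaders. Data is toy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

texts = ["breaking share this now", "lovely walk in the park today"]
labels = [1, 0]  # 1 = fake news spreader, 0 = other (illustrative)
vec = CountVectorizer()
X = vec.fit_transform(texts)
scores, _ = chi2(X, labels)
ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda p: -p[1])
print(ranked[:5])
```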

11 pages, 1578 KiB  
Article
K-EPIC: Entity-Perceived Context Representation in Korean Relation Extraction
by Yuna Hur, Suhyune Son, Midan Shim, Jungwoo Lim and Heuiseok Lim
Appl. Sci. 2021, 11(23), 11472; https://doi.org/10.3390/app112311472 - 3 Dec 2021
Cited by 9 | Viewed by 3118
Abstract
Relation Extraction (RE) aims to predict the correct relation between two entities in a given sentence. To obtain the proper relation, it is essential to comprehend the precise meaning of the two entities as well as the context of the sentence. In contrast to RE research in English, Korean RE studies that focus on the entities and preserve Korean linguistic properties are rare. Therefore, we propose K-EPIC (Entity-Perceived Context representation in Korean) to ensure an enhanced capability for understanding the meaning of entities while considering the linguistic characteristics of Korean. We present experimental results on the BERT-Ko-RE and KLUE-RE datasets with four different types of K-EPIC methods, utilizing entity position tokens. To compare the ability of Korean pre-trained language models to understand entities and context, we analyze HanBERT, KLUE-BERT, KoBERT, KorBERT, KoELECTRA, and multilingual BERT (mBERT). The experimental results demonstrate that the F1 score increases significantly with K-EPIC and that the language models trained on Korean corpora outperform the baseline.
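
The core idea of entity position tokens can be shown in a few lines; the marker strings below are illustrative, not necessarily the tokens used in the paper:

```python
# Sketch: wrapping the two entities in position tokens before encoding, so
# the model can attend to entity spans. Marker strings are illustrative;
# in practice they would also be added to the tokenizer's vocabulary.
def mark_entities(sent: str, e1: str, e2: str) -> str:
    sent = sent.replace(e1, f"[E1] {e1} [/E1]", 1)
    sent = sent.replace(e2, f"[E2] {e2} [/E2]", 1)
    return sent

print(mark_entities("서울은 한국의 수도이다.", "서울", "한국"))
# -> [E1] 서울 [/E1]은 [E2] 한국 [/E2]의 수도이다.
```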

16 pages, 443 KiB  
Article
NASca and NASes: Two Monolingual Pre-Trained Models for Abstractive Summarization in Catalan and Spanish
by Vicent Ahuir, Lluís-F. Hurtado, José Ángel González and Encarna Segarra
Appl. Sci. 2021, 11(21), 9872; https://doi.org/10.3390/app11219872 - 22 Oct 2021
Cited by 14 | Viewed by 2496
Abstract
Most of the models proposed in the literature for abstractive summarization are generally suitable for the English language but not for other languages. Multilingual models were introduced to address that language constraint, but, despite their applicability being broader than that of monolingual models, their performance is typically lower, especially for minority languages like Catalan. In this paper, we present a monolingual model for abstractive summarization of textual content in the Catalan language. The model is a Transformer encoder-decoder pretrained and fine-tuned specifically for Catalan using a corpus of newspaper articles. In the pretraining phase, we introduced several self-supervised tasks to specialize the model on the summarization task and to increase the abstractivity of the generated summaries. To study the performance of our proposal in languages with more resources than Catalan, we replicated the model and the experimentation for the Spanish language. The usual evaluation metrics, not only the most widely used ROUGE measure but also more semantic ones such as BertScore, do not make it possible to correctly evaluate the abstractivity of the generated summaries. In this work, we therefore also present a new metric, called content reordering, to evaluate one of the most common characteristics of abstractive summaries: the rearrangement of the original content. We carried out exhaustive experimentation to compare the performance of the monolingual models proposed in this work with two of the most widely used multilingual models in text summarization, mBART and mT5. The experimental results support the quality of our monolingual models, especially considering that the multilingual models were pretrained with many more resources than ours. Likewise, the pretraining tasks are shown to have helped increase the degree of abstractivity of the generated summaries. To our knowledge, this is the first work to explore a monolingual approach to abstractive summarization in both Catalan and Spanish.
(This article belongs to the Special Issue Current Approaches and Applications in Natural Language Processing)
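
For the standard half of such an evaluation, ROUGE between reference and generated summaries is routine; a sketch with the rouge-score package (the strings are placeholders, and the paper's content reordering metric is its own contribution, not reproduced here):

```python
# Sketch: ROUGE-1 and ROUGE-L between a reference and a generated summary.
# Placeholder strings; the paper's "content reordering" metric is separate.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
scores = scorer.score("el resumen de referencia", "el resumen generado por el modelo")
print(round(scores["rougeL"].fmeasure, 3))
```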
