Search Results (25)

Search Parameters:
Keywords = Arabic corpora

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 216
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to combat misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata, covering speaker attributes and generation information, is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared them with two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baselines, but its performance declined under cross-dataset evaluation. These results underscore the importance of data construction: detectors generalize best when exposed to diverse attack types, and continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora should further improve detectors’ robustness.
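
Audio anti-spoofing results are often reported with threshold-free metrics alongside accuracy. As a minimal illustration (not taken from the paper; the labels and scores below are made up), here is one common way to compute an equal error rate for a binary bona fide/fake detector:

```python
# A minimal sketch, not from the paper: EER for hypothetical detector scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])               # 1 = bona fide, 0 = fake
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2])   # detector's bona fide scores
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```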

14 pages, 2851 KB  
Article
Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models
by Khalid Almeman
Data 2025, 10(12), 208; https://doi.org/10.3390/data10120208 - 12 Dec 2025
Viewed by 835
Abstract
The development of Natural Language Processing applications tailored for diverse Arabic-speaking users requires specialized Arabic corpora, which are currently lacking in existing Arabic linguistic resources. Therefore, in this study, a multidialectal parallel Arabic corpus is built, focusing on the travel and tourism domain. By leveraging the text generation and dialectal transformation capabilities of Large Language Models, an initial set of approximately 100,000 parallel sentences was generated. Following a rigorous multi-stage deduplication process, 50,010 unique parallel sentences were retained, spanning Modern Standard Arabic (MSA) and five major Arabic dialects: Saudi, Egyptian, Iraqi, Levantine, and Moroccan. This study presents the detailed methodology of corpus generation and refinement, describes the characteristics of the generated corpus, and provides a comprehensive statistical analysis highlighting the corpus size, lexical diversity, and linguistic overlap between MSA and the five dialects. The corpus represents a valuable resource for researchers and developers working on Arabic dialect processing and AI applications that require nuanced contextual understanding.
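
As a rough illustration of the kind of multi-stage deduplication the abstract describes (a hypothetical sketch, not the authors’ pipeline), one stage can match rows exactly and a second can match after light normalization:

```python
# A minimal sketch, not the paper's pipeline: exact-match dedup first, then
# dedup after light normalization. Row layout is hypothetical.
import re

def normalize(text: str) -> str:
    """Strip Arabic diacritics (U+064B-U+0652) and collapse whitespace."""
    text = re.sub(r"[\u064B-\u0652]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(rows):
    seen_exact, seen_norm, unique = set(), set(), []
    for row in rows:
        key_exact = tuple(row)                       # (msa, saudi, egyptian, ...)
        key_norm = tuple(normalize(s) for s in row)
        if key_exact in seen_exact or key_norm in seen_norm:
            continue
        seen_exact.add(key_exact)
        seen_norm.add(key_norm)
        unique.append(row)
    return unique
```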

30 pages, 1673 KB  
Article
Adversarially Robust Multitask Learning for Offensive and Hate Speech Detection in Arabic Text Using Transformer-Based Models and RNN Architectures
by Eman S. Alshahrani and Mehmet S. Aksoy
Appl. Sci. 2025, 15(17), 9602; https://doi.org/10.3390/app15179602 - 31 Aug 2025
Cited by 1 | Viewed by 1778
Abstract
Offensive language and hate speech have a detrimental effect on victims and have become a significant problem on social media platforms. Research on automatically detecting Arabic offensive language and hate speech remains limited compared with high-resource languages such as English, owing to scarce resources, few annotated corpora, and the difficulty of Arabic morphological analysis. Moreover, many social media users who post profanities modify their text while preserving its meaning, thereby deceiving detection methods that block offensive phrases. This study therefore proposes an adversarially robust multitask learning framework for detecting Arabic offensive language and hate speech. For this purpose, the OSACT2020 dataset was used, augmented with additional posts collected from the X social media platform. To improve contextual understanding, classification models were built from four pre-trained Arabic language models integrated with various sequential layers, then trained and evaluated in three settings: single-task learning with the original dataset, single-task learning with the augmented dataset, and multitask learning with the augmented dataset. The multitask MARBERTv2+BiGRU model achieved the best results, with an 88% macro-F1 for hate speech and 93% for offensive language on clean data. To improve the model’s robustness, adversarial samples were then generated using character- and sentence-level attacks, which subtly change the text to mislead the model while maintaining its overall appearance and meaning. The clean model’s performance dropped significantly under attack, especially for hate speech, falling to a 74% macro-F1; adversarial training, which re-trains the model on both clean and adversarial data, recovered the results to a 78% macro-F1 for hate speech, and input transformation techniques boosted the macro-F1 further to 81%. Notably, the adversarially trained model maintained high performance on clean data, demonstrating both robustness and generalization.
(This article belongs to the Special Issue Machine Learning Approaches in Natural Language Processing)
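
As an illustration of the character-level attacks the abstract mentions (a minimal sketch with a hypothetical confusable map, not the authors’ implementation), visually similar Arabic letters can be swapped so the text still reads the same to humans but changes the token sequence:

```python
# A minimal sketch, not the authors' attack: perturb a fraction of characters
# using a hypothetical map of visually confusable Arabic letters.
import random

CONFUSABLES = {"ا": "أ", "أ": "ا", "ه": "ة", "ة": "ه", "ي": "ى", "ى": "ي"}

def perturb(text: str, rate: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    swappable = [i for i, c in enumerate(chars) if c in CONFUSABLES]
    for i in rng.sample(swappable, k=int(len(swappable) * rate)):
        chars[i] = CONFUSABLES[chars[i]]
    return "".join(chars)
```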

19 pages, 626 KB  
Article
A Kazakh–Chinese Cross-Lingual Joint Modeling Method for Question Understanding
by Yajing Ma, Yingxia Yu, Han Liu, Gulila Altenbek, Xiang Zhang and Yilixiati Tuersun
Appl. Sci. 2025, 15(12), 6643; https://doi.org/10.3390/app15126643 - 12 Jun 2025
Viewed by 871
Abstract
Current research on intelligent question answering mainly focuses on high-resource languages such as Chinese and English, with limited studies on question understanding and reasoning in low-resource languages. In addition, when question understanding tasks are modeled jointly, the interdependence among subtasks can cause errors to accumulate during the interaction phase, degrading the prediction performance of the individual subtasks. To address the error propagation caused by sentence-level intent encoding in the joint modeling of intent recognition and slot filling, this paper proposes a Cross-lingual Token-level Bi-Interactive Model (Bi-XTM). The model introduces a novel subtask interaction method that uses the token-level intent output distribution as additional information for the slot vector representation, effectively reducing error propagation and enhancing the information exchange between intent and slot vectors. Meanwhile, to address the scarcity of Kazakh (Arabic alphabet) corpora, this paper constructs a cross-lingual joint question understanding dataset for the Xinjiang tourism domain, named JISD, comprising 16,548 Chinese samples and 1399 Kazakh samples. This dataset provides a new resource for the joint cross-lingual intent recognition and slot filling task. Experimental results on the publicly available multilingual question understanding dataset MTOD and the newly constructed dataset demonstrate that Bi-XTM achieves state-of-the-art performance in both monolingual and cross-lingual settings.
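
The core idea, feeding token-level intent distributions back into the slot representation, can be sketched as follows (a minimal PyTorch illustration of the general mechanism, not the Bi-XTM architecture itself):

```python
# A minimal sketch of token-level intent-slot interaction, not the paper's
# model: each token's intent distribution is concatenated to its hidden state
# before slot classification.
import torch
import torch.nn as nn

class TokenLevelInteraction(nn.Module):
    def __init__(self, hidden: int, n_intents: int, n_slots: int):
        super().__init__()
        self.intent_head = nn.Linear(hidden, n_intents)
        self.slot_head = nn.Linear(hidden + n_intents, n_slots)

    def forward(self, token_states):                 # (batch, seq, hidden)
        intent_logits = self.intent_head(token_states)
        intent_probs = intent_logits.softmax(dim=-1)
        slot_in = torch.cat([token_states, intent_probs], dim=-1)
        return intent_logits, self.slot_head(slot_in)
```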

22 pages, 7770 KB  
Article
Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation
by Azzah Allahim and Asma Cherif
Appl. Sci. 2024, 14(23), 11104; https://doi.org/10.3390/app142311104 - 28 Nov 2024
Cited by 3 | Viewed by 2813
Abstract
The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora and investigating how varying hyperparameter values affect model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. Using Word2Vec and FastText, we experimented with different hyperparameter configurations, such as vector size, window size, and training algorithm (CBOW and skip-gram), to analyze their impact on model quality. Our models were evaluated on a range of NLP tasks, including sentiment analysis, similarity tests, and an analogy test adapted specifically for Arabic. The findings revealed that both corpus size and hyperparameter settings had notable effects on performance. In the analogy test, a larger vocabulary significantly improved outcomes, with the FastText skip-gram models excelling at solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving 99% and 90% accuracy in sentiment analysis and the analogy test, respectively, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings.
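
For readers unfamiliar with these hyperparameters, a minimal gensim sketch of the knobs the paper varies (toy corpus; the settings are illustrative, not the study’s exact configurations):

```python
# A minimal sketch with a toy corpus: vector_size, window, and sg (0 = CBOW,
# 1 = skip-gram) are the hyperparameters the abstract discusses.
from gensim.models import FastText, Word2Vec

sentences = [["مرحبا", "بالعالم"], ["نموذج", "تضمين", "الكلمات"]]

w2v = Word2Vec(sentences, vector_size=300, window=5, sg=1, min_count=1)
ft = FastText(sentences, vector_size=300, window=3, sg=1, min_count=1)

# FastText builds vectors from character n-grams, so it can embed unseen words.
print(ft.wv.most_similar("نموذج", topn=2))
```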

18 pages, 871 KB  
Article
Advancing AI-Driven Linguistic Analysis: Developing and Annotating Comprehensive Arabic Dialect Corpora for Gulf Countries and Saudi Arabia
by Nouf Al-Shenaifi, Aqil M. Azmi and Manar Hosny
Mathematics 2024, 12(19), 3120; https://doi.org/10.3390/math12193120 - 5 Oct 2024
Cited by 6 | Viewed by 5223
Abstract
This study harnesses the linguistic diversity of Arabic dialects to create two expansive corpora from X (formerly Twitter). The Gulf Arabic Corpus (GAC-6) includes around 1.7 million tweets from six Gulf countries (Saudi Arabia, UAE, Qatar, Oman, Kuwait, and Bahrain), capturing a wide range of linguistic variation. The Saudi Dialect Corpus (SDC-5) comprises 790,000 tweets, offering in-depth insights into five major regional dialects of Saudi Arabia (Hijazi, Najdi, Southern, Northern, and Eastern) and reflecting the complex linguistic landscape of the region. Both corpora are thoroughly annotated with dialect-specific seed words and geolocation data, achieving high annotation reliability, as indicated by Cohen’s Kappa scores of 0.78 for GAC-6 and 0.90 for SDC-5. The annotation process leverages AI-driven techniques, including machine learning algorithms for automated dialect recognition and feature extraction, to enhance the granularity and precision of the data. These resources contribute significantly to Arabic dialectology and to the development of AI algorithms for linguistic data analysis, improving the design and efficiency of AI systems and supporting diverse applications in next-generation AI technologies.
(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0)
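
Cohen’s Kappa, the agreement statistic reported for both corpora, can be computed with scikit-learn; a minimal sketch with made-up dialect labels:

```python
# A minimal sketch with made-up labels: Kappa corrects raw agreement for the
# agreement expected by chance between two annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["najdi", "hijazi", "najdi", "southern", "hijazi", "najdi"]
annotator_b = ["najdi", "hijazi", "eastern", "southern", "hijazi", "najdi"]

print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```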

15 pages, 382 KB  
Article
Domain Adaptation for Arabic Machine Translation: Financial Texts as a Case Study
by Emad A. Alghamdi, Jezia Zakraoui and Fares A. Abanmy
Appl. Sci. 2024, 14(16), 7088; https://doi.org/10.3390/app14167088 - 13 Aug 2024
Cited by 2 | Viewed by 3095
Abstract
Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora, but generic NMT systems perform poorly on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed that often yield better translation quality than generic NMT systems. While there has been steady progress in NMT for English and other European languages, domain adaptation for Arabic has received little attention in the literature. The current study therefore explores the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a previously unexplored domain: financial news articles. To this end, we developed a parallel Arabic-English (AR-EN) corpus in the financial domain to benchmark different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that fine-tuning pre-trained NMT models on a few well-aligned in-domain AR-EN segments led to noticeable improvement, and ChatGPT’s translation quality was superior to the other models in both automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT for financial-domain transfer learning. To support research on domain translation, we have made our datasets and fine-tuned models available.
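
As a minimal sketch of the starting point for such fine-tuning (the checkpoint below is a real public Arabic-English NMT model, though not necessarily the one the paper used, and the financial sentence is invented):

```python
# A minimal sketch: load a pretrained AR-EN NMT model as the base that
# in-domain fine-tuning would start from, and translate one sentence.
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-ar-en"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["ارتفعت أرباح الشركة في الربع الثالث"], return_tensors="pt")
out = model.generate(**batch, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```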

17 pages, 1397 KB  
Article
A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models
by Faisal Qarah and Tawfeeq Alsanoosy
Appl. Sci. 2024, 14(13), 5696; https://doi.org/10.3390/app14135696 - 29 Jun 2024
Cited by 14 | Viewed by 8352
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks thanks to their capacity to capture deep contextualized information through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis: tokens become the input for other natural language processing (NLP) tasks, such as semantic parsing and language modeling. However, little research has evaluated the impact of tokenization on Arabic language models. This study aims to address that gap by evaluating the performance of various tokenizers for Arabic large language models (LLMs). In this paper, we analyze the differences between the WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models, one per tokenizer, and measuring each model’s performance on seven NLP tasks using 29 different datasets. Overall, the model pretrained on text tokenized with SentencePiece significantly outperforms the two models that use the WordPiece and BBPE tokenizers. These results will help researchers develop better models, select the most suitable tokenizer, improve feature engineering, and make models more efficient, ultimately advancing a range of NLP applications.
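
A minimal sketch of training the kind of SentencePiece tokenizer the study found strongest (the corpus path and hyperparameters are illustrative, not the paper’s settings):

```python
# A minimal sketch with a hypothetical corpus file (one sentence per line).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="arabic_corpus.txt",      # hypothetical path
    model_prefix="ar_sp",
    vocab_size=32000,
    character_coverage=0.9995,      # high coverage suits Arabic script
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="ar_sp.model")
print(sp.encode("المكتبات", out_type=str))  # subword pieces of "the libraries"
```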

27 pages, 5204 KB  
Article
AraFast: Developing and Evaluating a Comprehensive Modern Standard Arabic Corpus for Enhanced Natural Language Processing
by Asmaa Alrayzah, Fawaz Alsolami and Mostafa Saleh
Appl. Sci. 2024, 14(12), 5294; https://doi.org/10.3390/app14125294 - 19 Jun 2024
Cited by 6 | Viewed by 3462
Abstract
This paper examines the effectiveness of a Modern Standard Arabic corpus, AraFast, for training transformer models on natural language processing tasks in Arabic. Four experiments were conducted to evaluate AraFast across different configurations: segmented, unsegmented, and mini versions. The main outcomes are as follows. First, transformer models trained with the larger and cleaner versions of AraFast, especially for question answering, demonstrate the impact of corpus quality and size on model efficacy. Second, a dramatic reduction in training loss was observed with the mini version of AraFast, underscoring the importance of optimizing corpus size for effective training. Third, the segmented text format also reduced training loss, highlighting segmentation as a beneficial strategy in Arabic NLP. Finally, the study identifies challenges in managing noisy data derived from web sources, which significantly hindered model performance. These findings collectively demonstrate the critical role of well-prepared, segmented, and clean corpora in advancing Arabic NLP capabilities, and the insights from AraFast’s application can guide the development of more efficient NLP models and future research on Arabic language processing tools.
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)

15 pages, 909 KB  
Article
A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization
by Canglan Liu, Wushouer Silamu and Yanbing Li
Appl. Sci. 2023, 13(19), 10589; https://doi.org/10.3390/app131910589 - 22 Sep 2023
Cited by 3 | Viewed by 2604
Abstract
Low-resource languages often lack sufficient data, which leads to poor machine translation quality. One way to address this is data augmentation: creating new data by transforming existing data, for example by flipping, cropping, rotating, or adding noise. In low-resource machine translation, pseudo-parallel corpora have traditionally been generated by randomly replacing words, but this can introduce ambiguity, since the same word may have different meanings in different contexts. This study proposes a new approach that instead generates pseudo-parallel corpora by replacing phrases. Its performance is compared with other data augmentation methods, and combining it with those methods improves performance further. To enhance the model’s robustness, R-Drop regularization, an effective method for improving translation quality, is also applied. The proposed method was tested on Chinese–Kazakh (Arabic script) translation tasks, yielding improvements of 4.99 and 7.7 for Chinese-to-Kazakh and Kazakh-to-Chinese translation, respectively. Combining phrase-replacement pseudo-parallel corpora with R-Drop regularization thus yields a significant advance in machine translation performance for low-resource languages.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
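
R-Drop, as commonly described, runs the same batch through the model twice (dropout makes the two passes differ) and adds a symmetric KL term pulling the two output distributions together; a minimal PyTorch sketch under that assumption:

```python
# A minimal sketch of the commonly described R-Drop loss, not the paper's
# training code. The model must be in train mode so dropout is active.
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha=5.0):
    logits1 = model(inputs)              # two stochastic forward passes
    logits2 = model(inputs)
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    kl = F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1),
                  reduction="batchmean") + \
         F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1),
                  reduction="batchmean")
    return ce / 2 + alpha * kl / 2
```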

14 pages, 2627 KB  
Article
AraMAMS: Arabic Multi-Aspect, Multi-Sentiment Restaurants Reviews Corpus for Aspect-Based Sentiment Analysis
by Alanod AlMasaud and Heyam H. Al-Baity
Sustainability 2023, 15(16), 12268; https://doi.org/10.3390/su151612268 - 11 Aug 2023
Cited by 8 | Viewed by 3733
Abstract
The abundance of data on the internet makes analysis a must, and aspect-based sentiment analysis helps extract valuable information from textual data. Because Arabic resources are limited, this paper enriches the Arabic dataset landscape by creating AraMA, the first and largest Arabic multi-aspect corpus. AraMA comprises 10,750 Google Maps reviews of restaurants in Riyadh, Saudi Arabia, covering four aspect categories (food, environment, service, and price) and four sentiment polarities (positive, negative, neutral, and conflict). Every AraMA review is labeled with at least two aspect categories. A second version, named AraMAMS, includes only reviews labeled with at least two different sentiments, making it the first Arabic multi-aspect, multi-sentiment dataset; it contains 5312 reviews covering the same aspect categories and sentiment polarities. Both corpora were evaluated using naïve Bayes (NB), support vector classification (SVC), linear SVC, and stochastic gradient descent (SGD) models. On AraMA, the aspect categories task achieved a 91.41% F1 measure with the SVC model, while on AraMAMS the best F1 measure for the aspect categories task reached 91.70% with the linear SVC model.
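
A minimal scikit-learn sketch of this kind of classical setup, TF-IDF features feeding a linear SVC for the aspect categories task (the toy reviews and feature settings are illustrative, not the paper’s configuration):

```python
# A minimal sketch with toy data: character n-gram TF-IDF handles Arabic
# morphology reasonably well without a stemmer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = ["الطعام لذيذ جدا", "الخدمة بطيئة", "الأسعار مرتفعة", "المكان جميل"]
aspects = ["food", "service", "price", "environment"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LinearSVC())
clf.fit(reviews, aspects)
print(clf.predict(["الاكل رائع"]))  # expect: food
```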

16 pages, 2468 KB  
Article
Contact-Induced Change in an Endangered Language: The Case of Cypriot Arabic
by Spyros Armostis and Marilena Karyolemou
Languages 2023, 8(1), 10; https://doi.org/10.3390/languages8010010 - 26 Dec 2022
Cited by 4 | Viewed by 8001
Abstract
Cypriot Arabic (CyAr) is a severely endangered Semitic variety spoken by Cypriot Maronites. It belongs to the group of “peripheral varieties” of Arabic that were separated from the core Arabic-speaking area and came into contact with non-Semitic languages. Although interest in the study of CyAr has been renewed since the turn of the century, some aspects of its structure remain poorly understood. In this paper, we present and analyze a number of developments in CyAr induced by contact with Cypriot Greek. Our methodology makes a novel contribution to the description of this underrepresented variety: it draws not only on existing linguistic descriptions and text corpora in the literature, but mainly on a vast corpus of naturalistic oral speech data from the Archive of Oral Tradition of CyAr. Our analysis reveals the complexity of the investigated contact phenomena and the differing degrees to which borrowings are integrated into the lexico-grammatical system of CyAr.
(This article belongs to the Special Issue Investigating Language Contact and New Varieties)
26 pages, 7820 KB  
Article
Employing Energy and Statistical Features for Automatic Diagnosis of Voice Disorders
by Avinash Shrivas, Shrinivas Deshpande, Girish Gidaye, Jagannath Nirmal, Kadria Ezzine, Mondher Frikha, Kamalakar Desai, Sachin Shinde, Ankit D. Oza, Dumitru Doru Burduhos-Nergis and Diana Petronela Burduhos-Nergis
Diagnostics 2022, 12(11), 2758; https://doi.org/10.3390/diagnostics12112758 - 11 Nov 2022
Cited by 18 | Viewed by 2814
Abstract
The presence of laryngeal disease affects vocal fold dynamics and thus causes changes in pitch, loudness, and other characteristics of the human voice. Many frameworks based on the acoustic analysis of speech signals have been created in recent years; however, they are evaluated on just one or two corpora and are not independent of voice illness and human bias. In this article, a unified wavelet-based paradigm for evaluating voice diseases is presented that is independent of voice disease, human bias, and dialect. Because a voice disorder affects the dynamics of the vocal folds and thereby modifies the sound source, inverse filtering is used to capture the modified voice source. Fundamental-frequency-independent statistical and energy metrics are then derived from each spectral sub-band to characterize the retrieved voice source. Speech recordings of the sustained vowel /a/ were collected from four datasets in German, Spanish, English, and Arabic to run several intra- and inter-dataset experiments. The classifiers’ performance indicators show that energy and statistical features uncover vital information about a variety of clinical voices, so the suggested approach can serve as a complementary means for the automatic medical assessment of voice diseases.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
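
A minimal sketch of per-sub-band energy and statistical features from a wavelet decomposition (illustrative only; the paper’s exact wavelet, decomposition depth, and feature set may differ):

```python
# A minimal sketch, not the paper's feature extractor: energy plus simple
# statistics per wavelet sub-band of a (random stand-in) voice signal.
import numpy as np
import pywt

signal = np.random.randn(16000)            # stand-in for 1 s of /a/ at 16 kHz
coeffs = pywt.wavedec(signal, "db4", level=5)

features = []
for band in coeffs:                        # approximation + detail sub-bands
    z = (band - band.mean()) / band.std()
    features += [np.sum(band ** 2),        # sub-band energy
                 np.mean(band), np.std(band),
                 float(np.mean(z ** 4))]   # kurtosis
print(len(features), "features")
```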

19 pages, 360 KB  
Article
The Saudi Novel Corpus: Design and Compilation
by Tareq Alfraidi, Mohammad A. R. Abdeen, Ahmed Yatimi, Reyadh Alluhaibi and Abdulmohsen Al-Thubaity
Appl. Sci. 2022, 12(13), 6648; https://doi.org/10.3390/app12136648 - 30 Jun 2022
Cited by 6 | Viewed by 4436
Abstract
Arabic has recently received significant attention from corpus compilers, leading to the creation of many Arabic corpora covering various genres, most notably newswire. Yet Arabic novels, and specifically those authored by Saudi writers, lack sufficient digital datasets to support corpus-linguistic and stylistic studies, leaving Arabic behind English and other European languages in this respect. In this paper, we present the Saudi Novels Corpus, built to be a valuable resource for the linguistic and stylistic research communities. We describe the procedures we followed and the decisions we made in creating the corpus, clarifying the design criteria, data collection methods, annotation process, and encoding, and we present preliminary results from an analysis of the corpus content. We consider the work described in this paper an initial step toward bridging the existing gap between corpus linguistics and Arabic literary texts; further work is planned to improve the quality of the corpus by adding advanced features.
(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)

16 pages, 1094 KB  
Article
Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
by Natalia Levshina
Entropy 2022, 24(2), 280; https://doi.org/10.3390/e24020280 - 16 Feb 2022
Cited by 16 | Viewed by 5692
Abstract
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, contextual informativity (average surprisal given the previous context) has been shown to correlate more strongly with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more typologically diverse sample of languages than previous work (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish, and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word length in UTF-8 characters and, for some of the languages, in phonemes, as well as word frequency, informativity given the previous word, and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures across languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
(This article belongs to the Special Issue Information-Theoretic Approaches to Explaining Linguistic Structure)
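
The law of abbreviation can be checked on any corpus with a rank correlation between word frequency and word length; a minimal sketch on a toy English word list (not the paper’s data or surprisal-based measures):

```python
# A minimal sketch: Spearman correlation between frequency and length.
# A negative rho is consistent with Zipf's law of abbreviation.
from collections import Counter
from scipy.stats import spearmanr

text = "the cat sat on the mat and the cat saw the very small mat".split()
freq = Counter(text)
words = list(freq)
rho, p = spearmanr([freq[w] for w in words], [len(w) for w in words])
print(f"Spearman rho = {rho:.2f}")  # frequent words tend to be short
```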
