Search Results (30)

Search Parameters:
Keywords = rule-based machine translation

25 pages, 887 KB  
Article
Transformation of Real-World Contracts to Smart Contracts for Blockchain Applications
by Cecilia E. Chen, Xuanyu Liu, Limin Jia, Bo Liang, Yan Zhu and Tong Wu
Electronics 2026, 15(7), 1514; https://doi.org/10.3390/electronics15071514 - 3 Apr 2026
Viewed by 459
Abstract
The widespread adoption of smart contracts, self-executing agreements on the blockchain, is hindered by the complexity of translating real-world contracts, often written in multiple languages, into their digital counterparts. This paper addresses this challenge by introducing an innovative approach based on Contract Text Markup Language (CTML), an extensible markup language specifically designed to facilitate the automatic generation of smart contracts from multilingual contracts. CTML overcomes traditional method limitations by employing a two-stage transformation process: (1) Contract Abstraction and Markup: CTML redefines grammar rules and incorporates encoding extensions to transform multilingual contracts into structured, marked-up contracts. This process effectively abstracts the essential details of the original contract, enabling language-agnostic interpretation. (2) Domain-Specific Language (DSL) Translation and Smart Contract Code Generation: The marked-up contract is then seamlessly translated into a DSL program, capturing the legal concepts in a machine-readable format. Finally, the DSL program is automatically compiled into executable smart contract code, ready for deployment on the blockchain. The effectiveness of the proposed approach is demonstrated using a legal contract in both English and Chinese. Therefore, the CTML-based approach can automatically generate smart contracts from multilingual contracts, enabling a more inclusive and accessible smart contract ecosystem. Full article
(This article belongs to the Section Computer Science & Engineering)
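
The two-stage pipeline described in the abstract (markup and abstraction, then DSL translation and code generation) can be illustrated with a toy sketch. The `<clause>` tag, its attribute names, and the DSL syntax below are invented for illustration only and are not CTML's actual grammar.

```python
import re

# A toy "marked-up" clause; tag and attribute names are hypothetical.
MARKUP = '<clause party="Buyer" action="pay" amount="1000" deadline_days="30"/>'

def parse_markup(markup):
    # Stage 1 (abstraction and markup): reduce a clause to a
    # language-agnostic attribute dictionary.
    return dict(re.findall(r'(\w+)="([^"]*)"', markup))

def to_dsl(fields):
    # Stage 2 (DSL translation): render the abstracted clause as a tiny
    # invented DSL statement, which a compiler would then turn into
    # executable smart contract code.
    return (f'obligation {fields["party"]} must {fields["action"]} '
            f'{fields["amount"]} within {fields["deadline_days"]} days')
```

Because stage 1 yields a language-agnostic attribute dictionary, the same stage-2 translation would apply whether the source contract was English or Chinese.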

40 pages, 13676 KB  
Review
Interfacial Interactions of Nanoparticles and Molecular Nanostructures with Model Membrane Systems: Mechanisms, Methods, and Applications
by Konstantin Balashev
Membranes 2026, 16(4), 134; https://doi.org/10.3390/membranes16040134 - 1 Apr 2026
Viewed by 1146
Abstract
This review surveys how nanoparticles and biomolecular nanosized structures interact with model membrane systems, and how these interfacial processes govern their performance in drug and gene delivery, antimicrobial strategies, biosensing, and nanotoxicology. The nanostructures covered include polymeric nanoparticles, lipid-based carriers, peptide nanostructures, dendrimers, and multifunctional hybrids. Model membranes span Langmuir monolayers, supported lipid bilayers, vesicles/liposomes across sizes, and emerging hybrid or asymmetric constructs that better approximate native complexity. Mechanistically, interactions follow recurrent routes—surface adsorption, bilayer insertion, pore formation, and lipid extraction/reorganization—regulated by particle size, morphology, charge, ligand architecture, and lipophilicity, in conjunction with membrane composition, phase state, curvature, and asymmetry. A multiscale toolkit links structure, mechanics, and dynamics: Langmuir troughs and Brewster Angle Microscopy map thermodynamics and mesoscale morphology; atomic force microscopy and quartz crystal microbalance with dissipation resolve nanoscale topography and viscoelasticity; fluorescence microscopy/spectroscopy reports on localization and packing; neutron and X-ray reflectometry quantify vertical structure; molecular dynamics provides atomistic pathways and design hypotheses. Historically, the field advanced from early monolayers and bilayers, through the fluid mosaic model, to raft microdomains and modern biomimetic systems, enabling increasingly realistic experiments. Key advances include cross-method integration linking experimental observations with image-based computational models; persistent debates concern the translation from simplified models to living membranes, the role of dynamic coronas, and scale/force-field limits in simulations. 
Future efforts should prioritize hybrid models incorporating proteins and asymmetric lipidomes, standardized reporting and reference systems, rigorous coupling of experiments with calibrated simulations and machine learning, and alignment with safety-by-design and regulatory expectations, thereby shifting interfacial measurements from descriptive observation to predictive design rules. Full article

25 pages, 1436 KB  
Article
Entropy-Augmented Forecasting and Portfolio Construction at the Industry-Group Level: A Causal Machine-Learning Approach Using Gradient-Boosted Decision Trees
by Gil Cohen, Avishay Aiche and Ron Eichel
Entropy 2026, 28(1), 108; https://doi.org/10.3390/e28010108 - 16 Jan 2026
Viewed by 722
Abstract
This paper examines whether information-theoretic complexity measures enhance industry-group return forecasting and portfolio construction within a machine-learning framework. Using daily data for 25 U.S. GICS industry groups spanning more than three decades, we augment gradient-boosted decision tree models with Shannon entropy and fuzzy entropy computed from recent return dynamics. Models are estimated at weekly, monthly, and quarterly horizons using a strictly causal rolling-window design and translated into two economically interpretable allocation rules: a maximum-profit strategy and a minimum-risk strategy. Results show that the top-performing strategy, the weekly maximum-profit model augmented with Shannon entropy, achieves an accumulated return exceeding 30,000%, substantially outperforming both the baseline model and the fuzzy-entropy variant. On monthly and quarterly horizons, entropy and fuzzy entropy generate smaller but robust improvements by maintaining lower volatility and better downside protection. Industry allocations display stable and economically interpretable patterns: profit-oriented strategies concentrate primarily in cyclical and growth-sensitive industries such as semiconductors, automobiles, technology hardware, banks, and energy, while minimum-risk strategies consistently favor defensive industries including utilities, food, beverage and tobacco, real estate, and consumer staples. Overall, the results demonstrate that entropy-based complexity measures improve both economic performance and interpretability, yielding industry-rotation strategies that are simultaneously more profitable, more stable, and more transparent. Full article
(This article belongs to the Special Issue Entropy, Artificial Intelligence and the Financial Markets)
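
As a rough illustration of the entropy features involved, the sketch below computes Shannon entropy over a discretized window of daily returns. The three-state bucketing and the 0.005 threshold are arbitrary illustrative choices, not the paper's feature construction; fuzzy entropy (not shown) replaces hard state counts with similarity-weighted pattern matches.

```python
from collections import Counter
from math import log2

def bucket_returns(returns, threshold=0.005):
    # Hypothetical 3-state discretization of daily returns: down / flat / up.
    return ['down' if r < -threshold else 'up' if r > threshold else 'flat'
            for r in returns]

def shannon_entropy(states):
    # Shannon entropy (in bits) of the empirical state distribution of a
    # recent return window; higher values mean less predictable dynamics.
    counts = Counter(states)
    n = len(states)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A rolling feature would apply `shannon_entropy(bucket_returns(window))` to each trailing window before feeding it to the gradient-boosted trees alongside the raw return features.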

23 pages, 7685 KB  
Article
Literal Pattern Analysis of Texts Written with the Multiple Form of Characters: A Comparative Study of the Human and Machine Styles
by Kazuya Hayata
Entropy 2026, 28(1), 36; https://doi.org/10.3390/e28010036 - 27 Dec 2025
Viewed by 423
Abstract
Setting aside languages with no written form, texts in nearly every language on this planet are written in a single script. But every rule has its exceptions, and Japanese is a very rare one: its texts are written in a mixture of three kinds of characters. In European languages, by contrast, no text is written in a mixture of the Latin, Cyrillic, and Greek alphabets. For several currently available Japanese texts, we conduct a quantitative analysis of how the three character types are mixed, using a methodology that applies a binary pattern approach to the character-class sequence generated from each text. Specifically, we consider the texts of the former and present constitutions as well as a famous American story that has been translated into Japanese at least 13 times. For the latter, a comparison is made between the human translations and four machine translations by DeepL and Google Translate. As metrics of divergence and diversity, the Hellinger distance, chi-square value, normalized Shannon entropy, and Simpson's diversity index are employed. Numerical results suggest that, in terms of entropy, the 17 translations fall into three clusters, and that overall the machine-translated texts exhibit higher entropy than the human translations. This finding suggests that the present method can provide a useful tool for stylometry and author attribution. Comparison with the diversity index further confirms the capabilities of the entropic measure. Lastly, in addition to the abovementioned texts, applicability to the Japanese version of the periodic table of the elements is investigated. Full article
(This article belongs to the Special Issue Entropy-Based Time Series Analysis: Theory and Applications)
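
The divergence and diversity metrics named in the abstract are standard and easy to state. The sketch below computes three of them from character-class probability vectors (e.g. the kanji/hiragana/katakana proportions of a text); it is a generic illustration, not the paper's code.

```python
from math import log, sqrt

def normalized_entropy(p):
    # Shannon entropy divided by log(k): 1 = uniform mix of the k classes,
    # 0 = a single class used exclusively.
    k = len(p)
    return -sum(x * log(x) for x in p if x > 0) / log(k)

def simpson_diversity(p):
    # 1 - sum(p_i^2): probability that two sampled characters
    # belong to different classes.
    return 1 - sum(x * x for x in p)

def hellinger(p, q):
    # Hellinger distance between two distributions, in [0, 1].
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)
```

Clustering the translations then reduces to comparing these scalar measures, or the pairwise Hellinger distances, across the 17 texts.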

25 pages, 1910 KB  
Review
Natural Language Processing in Generating Industrial Documentation Within Industry 4.0/5.0
by Izabela Rojek, Olga Małolepsza, Mirosław Kozielski and Dariusz Mikołajewski
Appl. Sci. 2025, 15(23), 12662; https://doi.org/10.3390/app152312662 - 29 Nov 2025
Cited by 1 | Viewed by 1887
Abstract
Deep learning (DL) methods have revolutionized natural language processing (NLP), enabling industrial documentation systems to process and generate text with high accuracy and fluency. Modern deep learning models, such as transformers and recurrent neural networks (RNNs), learn contextual relationships in text, making them ideal for analyzing and creating complex industrial documentation. Transformer-based architectures, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), are ideally suited for tasks such as text summarization, content generation, and question answering, which are crucial for documentation systems. Pre-trained language models, tuned to specific industrial datasets, support domain-specific vocabulary, ensuring the generated documentation complies with industry standards. Deep learning-based systems can use sequential models, such as those used in machine translation, to generate documentation in multiple languages, promoting accessibility, and global collaboration. Using attention mechanisms, these models identify and highlight critical sections of input data, resulting in the generation of accurate and concise documentation. Integration with optical character recognition (OCR) tools enables DL-based NLP systems to digitize and interpret legacy documents, streamlining the transition to automated workflows. Reinforcement learning and human feedback loops can enhance a system’s ability to generate consistent and contextually relevant text over time. These approaches are particularly effective in creating dynamic documentation that is automatically updated based on data from sensors, registers, or other sources in real time. The scalability of DL techniques enables industrial organizations to efficiently produce massive amounts of documentation, reducing manual effort and improving overall efficiency. 
NLP has become a fundamental technology for automating the generation, maintenance, and personalization of industrial documentation within the Industry 4.0, 5.0, and emerging Industry 6.0 paradigms. Recent advances in large language models, search-assisted generation, and multimodal architectures have significantly improved the accuracy and contextualization of technical manuals, maintenance reports, and compliance documents. However, persistent challenges such as domain-specific terminology, data scarcity, and the risk of hallucinations highlight the limitations of current approaches in safety-critical manufacturing environments. This review synthesizes state-of-the-art methods, comparing rule-based, neural, and hybrid systems while assessing their effectiveness in addressing industrial requirements for reliability, traceability, and real-time adaptation. Human–AI collaboration and the integration of knowledge graphs are transforming documentation workflows as factories evolve toward cognitive and autonomous systems. The review included 32 articles published between 2018 and 2025. These bibliometric findings suggest that the high percentage of conference papers (69.6%) may indicate a field still in its conceptual phase, which contextualizes the articles' emphasis on proposed architectures rather than on their industrial validation. Most research was conducted in computer science, suggesting early stages of technological maturity. The leading countries were China and India, but neither had large publication counts, nor were leading researchers or affiliations observed, suggesting significant research dispersion. However, the most frequently observed SDGs indicate a clear health context, focusing on “industry innovation and infrastructure” and “good health and well-being”. Full article
(This article belongs to the Special Issue Emerging and Exponential Technologies in Industry 4.0)

21 pages, 1605 KB  
Article
Risk Management Challenges in Maritime Autonomous Surface Ships (MASSs): Training and Regulatory Readiness
by Hyeri Park, Jeongmin Kim, Min Jung, Suk-young Kang, Daegun Kim, Changwoo Kim and Unkyu Jang
Appl. Sci. 2025, 15(20), 10993; https://doi.org/10.3390/app152010993 - 13 Oct 2025
Viewed by 2525
Abstract
Maritime Autonomous Surface Ships (MASSs) raise safety and regulatory challenges that extend beyond technical reliability. This study builds on a published system-theoretic process analysis (STPA) of degraded operations that identified 92 loss scenarios. These scenarios were reformulated into a two-round Delphi survey with 20 experts from academic, industry, seafaring, and regulatory backgrounds. Panelists rated each scenario on severity, likelihood, and detectability. To avoid rank reversal, common in the Risk Priority Number, an adjusted index was applied. Initial concordance was low (Kendall’s W = 0.07), reflecting diverse perspectives. After feedback, Round 2 reached substantial agreement (W = 0.693, χ2 = 3265.42, df = 91, p < 0.001) and produced a stable Top 10. High-priority items involved propulsion and machinery, communication links, sensing, integrated control, and human–machine interaction. These risks are further exacerbated by oceanographic conditions, such as strong currents, wave-induced motions, and biofouling, which can impair propulsion efficiency and sensor accuracy. This highlights the importance of environmental resilience in MASS safety. These clusters were translated into five action bundles that addressed fallback procedures, link assurance, sensor fusion, control chain verification, and alarm governance. The findings show that Remote Operator competence and oversight are central to MASS safety. At the same time, MASSs rely on artificial intelligence systems that can fail in degraded states, for example, through reduced explainability in decision making, vulnerabilities in sensor fusion, or adversarial conditions such as fog-obscured cameras. Recognizing these AI-specific challenges highlights the need for both human oversight and resilient algorithmic design. They support explicit inclusion of Remote Operators in the STCW convention, along with watchkeeping and fatigue rules for Remote Operation Centers. 
This study provides a consensus-based baseline for regulatory debate, while future work should extend these insights through quantitative system modeling. Full article
(This article belongs to the Special Issue Risk and Safety of Maritime Transportation)
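
Kendall's W, the concordance statistic used above to track consensus across Delphi rounds, can be computed from a raters-by-items score matrix. Below is a minimal sketch without the tie correction (real scenario ratings, which contain many ties, would need it).

```python
def kendalls_w(ratings):
    # Kendall's coefficient of concordance for m raters scoring n items.
    # `ratings` is a list of m rows of n scores; W = 1 means perfect
    # agreement on the ordering, W = 0 means no agreement.
    m, n = len(ratings), len(ratings[0])
    rank_rows = []
    for row in ratings:
        # Convert each rater's scores to ranks 1..n (ties not corrected).
        order = sorted(range(n), key=lambda i: row[i])
        ranks = [0] * n
        for r, i in enumerate(order, start=1):
            ranks[i] = r
        rank_rows.append(ranks)
    totals = [sum(col) for col in zip(*rank_rows)]
    mean = m * (n + 1) / 2
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With 20 raters and 92 scenarios, the Round 2 value of W = 0.693 reported above corresponds to substantial (though not perfect) agreement on the risk ordering.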

27 pages, 913 KB  
Article
Criticality Assessment of Wind Turbine Defects via Multispectral UAV Fusion and Fuzzy Logic
by Pavlo Radiuk, Bohdan Rusyn, Oleksandr Melnychenko, Tomasz Perzynski, Anatoliy Sachenko, Serhii Svystun and Oleg Savenko
Energies 2025, 18(17), 4523; https://doi.org/10.3390/en18174523 - 26 Aug 2025
Cited by 3 | Viewed by 1220
Abstract
Ensuring the structural integrity of wind turbines is crucial for the sustainability of wind energy. A significant challenge remains in transitioning from mere defect detection to objective, scalable criticality assessment for prioritizing maintenance. In this work, we propose a novel comprehensive framework that leverages multispectral unmanned aerial vehicle (UAV) imagery and a novel standards-aligned Fuzzy Inference System to automate this task. Our contribution is validated on two open research-oriented datasets representing small on- and offshore machines: the public AQUADA-GO and Thermal WTB Inspection datasets. An ensemble of YOLOv8n models trained on fused RGB-thermal data achieves a mean Average Precision (mAP@.5) of 92.8% for detecting cracks, erosion, and thermal anomalies. The core novelty, a 27-rule Fuzzy Inference System derived from the IEC 61400-5 standard, translates quantitative defect parameters into a five-level criticality score. The system’s output demonstrates exceptional fidelity to expert assessments, achieving a mean absolute error of 0.14 and a Pearson correlation of 0.97. This work provides a transparent, repeatable, and engineering-grounded proof of concept, demonstrating a promising pathway toward predictive, condition-based maintenance strategies and supporting the economic viability of wind energy. Full article
(This article belongs to the Special Issue Optimal Control of Wind and Wave Energy Converters)
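
A fuzzy inference system of this kind can be sketched compactly: fuzzify crisp defect measurements with membership functions, fire AND-rules, and defuzzify to a criticality level. The two inputs, membership breakpoints, and four rules below are invented for illustration and bear no relation to the paper's 27 IEC 61400-5-derived rules.

```python
def tri(x, a, b, c):
    # Triangular membership function peaking at b, zero outside (a, c).
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def criticality(size_mm, delta_t):
    # Toy two-input FIS with weighted-average defuzzification onto a
    # 1-5 criticality scale (all breakpoints are hypothetical).
    small, large = tri(size_mm, -1, 0, 10), tri(size_mm, 5, 15, 100)
    cool, hot = tri(delta_t, -1, 0, 8), tri(delta_t, 4, 12, 50)
    # Rule strength = min (fuzzy AND); each rule targets a level anchor.
    rules = [(min(small, cool), 1), (min(small, hot), 3),
             (min(large, cool), 3), (min(large, hot), 5)]
    num = sum(w * lvl for w, lvl in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 1.0
```

A standards-derived system like the paper's would replace these toy rules with ones traceable to the defect classes and severity criteria of IEC 61400-5.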

17 pages, 609 KB  
Article
GPT-Based Text-to-SQL for Spatial Databases
by Hui Wang, Li Guo, Yubin Liang, Le Liu and Jiajin Huang
ISPRS Int. J. Geo-Inf. 2025, 14(8), 288; https://doi.org/10.3390/ijgi14080288 - 24 Jul 2025
Cited by 2 | Viewed by 4078
Abstract
Text-to-SQL for spatial databases enables the translation of natural language questions into corresponding SQL queries, allowing non-experts to easily access spatial data, which has gained increasing attention from researchers. Previous research has primarily focused on rule-based methods. However, these methods have limitations when dealing with complicated or unknown natural language questions. While advanced machine learning models can be trained, they typically require large labeled training datasets, which are severely lacking for spatial databases. Recently, Generative Pre-Trained Transformer (GPT) models have emerged as a promising paradigm for Text-to-SQL tasks in relational databases, driven by carefully designed prompts. In response to the severe lack of datasets for spatial databases, we have created a publicly available dataset that supports both English and Chinese. Furthermore, we propose a GPT-based method to construct prompts for spatial databases, which incorporates geographic and spatial database knowledge into the prompts and requires only a small number of training samples, such as 1, 3, or 5 examples. Extensive experiments demonstrate that incorporating geographic and spatial database knowledge into prompts improves the accuracy of Text-to-SQL tasks for spatial databases. Our proposed method can help non-experts access spatial databases more easily and conveniently. Full article
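
The prompt-construction idea generalizes readily: assemble the schema, injected geographic/spatial-SQL knowledge, and a handful of examples ahead of the target question. The section headers, the trailing `SELECT` completion cue, and the PostGIS-flavored strings in the sketch below are illustrative assumptions, not the paper's actual template.

```python
def build_spatial_prompt(schema, knowledge, examples, question):
    # Assemble a few-shot Text-to-SQL prompt for a spatial database.
    # `examples` is a list of (question, sql) pairs, e.g. 1, 3, or 5 shots.
    parts = ["### Spatial database schema", schema,
             "### Geographic and spatial SQL knowledge", knowledge,
             "### Examples"]
    for q, sql in examples:
        parts += [f"-- Question: {q}", sql]
    # End with the target question and a completion cue for the model.
    parts += [f"-- Question: {question}", "SELECT"]
    return "\n".join(parts)
```

The resulting string would be sent to the GPT model, whose completion (starting from `SELECT`) is the candidate spatial SQL query.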

24 pages, 939 KB  
Review
Advances in Amazigh Language Technologies: A Comprehensive Survey Across Processing Domains
by Oussama Akallouch, Mohammed Akallouch and Khalid Fardousse
Information 2025, 16(7), 600; https://doi.org/10.3390/info16070600 - 13 Jul 2025
Cited by 1 | Viewed by 4371
Abstract
The Amazigh language, spoken by millions across North Africa, presents unique computational challenges due to its complex morphological system, dialectal variation, and multiple writing systems. This survey examines technological advances over the past decade across four key domains: natural language processing, speech recognition, optical character recognition, and machine translation. We analyze the evolution from rule-based systems to advanced neural models, demonstrating how researchers have addressed resource constraints through innovative approaches that blend linguistic knowledge with machine learning. Our analysis reveals uneven progress across domains, with optical character recognition reaching high maturity levels while machine translation remains constrained by limited parallel data. Beyond technical metrics, we explore applications in education, cultural preservation, and digital accessibility, showing how these technologies enable Amazigh speakers to participate in the digital age. This work illustrates that advancing language technology for marginalized languages requires fundamentally different approaches that respect linguistic diversity while ensuring digital equity. Full article

19 pages, 16096 KB  
Article
Evaluating Translation Quality: A Qualitative and Quantitative Assessment of Machine and LLM-Driven Arabic–English Translations
by Tawffeek A. S. Mohammed
Information 2025, 16(6), 440; https://doi.org/10.3390/info16060440 - 26 May 2025
Cited by 7 | Viewed by 5657
Abstract
This study investigates translation quality between Arabic and English, comparing traditional rule-based machine translation systems, modern neural machine translation tools such as Google Translate, and large language models like ChatGPT. The research adopts both qualitative and quantitative approaches to assess the efficacy, accuracy, and contextual fidelity of translations. It particularly focuses on the translation of idiomatic and colloquial expressions as well as technical texts and genres. Using well-established evaluation metrics such as bilingual evaluation understudy (BLEU), translation error rate (TER), and character n-gram F-score (chrF), alongside the qualitative translation quality assessment model proposed by Juliane House, this study investigates the linguistic and semantic nuances of translations generated by different systems. This study concludes that although metric-based evaluations like BLEU and TER are useful, they often fail to fully capture the semantic and contextual accuracy of idiomatic and expressive translations. Large language models, particularly ChatGPT, show promise in addressing this gap by offering more coherent and culturally aligned translations. However, both systems demonstrate limitations that necessitate human post-editing for high-stakes content. The findings support a hybrid approach, combining machine translation tools with human oversight for optimal translation quality, especially in languages with complex morphology and culturally embedded expressions like Arabic. Full article
(This article belongs to the Special Issue Machine Translation for Conquering Language Barriers)
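
Of the metrics listed, chrF is simple enough to sketch from scratch. The version below is a simplified character n-gram F-score (whitespace stripped, uniform averaging over n-gram orders), not the exact reference implementation used in evaluation toolkits.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with whitespace removed, as chrF typically does.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Average character n-gram precision and recall over orders 1..max_n,
    # combined with an F-beta score (beta=2 weights recall, as in chrF).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if p + r else 0.0
```

Because it operates on characters rather than words, chrF is comparatively forgiving of the rich morphology that complicates word-level metrics like BLEU for Arabic.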

23 pages, 711 KB  
Article
Comparison of Grammar Characteristics of Human-Written Corpora and Machine-Generated Texts Using a Novel Rule-Based Parser
by Simon Strübbe, Irina Sidorenko and Renée Lampe
Information 2025, 16(4), 274; https://doi.org/10.3390/info16040274 - 28 Mar 2025
Cited by 1 | Viewed by 2244
Abstract
As the prevalence of machine-written texts grows, it has become increasingly important to distinguish between human- and machine-generated content, especially when such texts are not explicitly labeled. Current artificial intelligence (AI) detection methods primarily focus on human-like characteristics, such as emotionality and subjectivity. However, these features can be easily modified through AI humanization, which involves altering word choice. In contrast, altering the underlying grammar without affecting the conveyed information is considerably more challenging. Thus, the grammatical characteristics of a text can be used as additional indicators of its origin. To address this, we employ a newly developed rule-based parser to analyze the grammatical structures in human- and machine-written texts. Our findings reveal systematic grammatical differences between human- and machine-written texts, providing a reliable criterion for the determination of the text origin. We further examine the stability of this criterion in the context of AI humanization and translation to other languages. Full article

19 pages, 3746 KB  
Article
The Impact of the Human Factor on Communication During a Collision Situation in Maritime Navigation
by Leszek Misztal and Paulina Hatlas-Sowinska
Appl. Sci. 2025, 15(5), 2797; https://doi.org/10.3390/app15052797 - 5 Mar 2025
Cited by 1 | Viewed by 1752
Abstract
In this paper, the authors draw attention to the significant impact of the human factor during collision situations in maritime navigation. The problems in the communication process between navigators are so excessive that the authors propose automatic communication. This is an alternative method to the current one. The presented system comprehensively performs communication tasks during a sea voyage. To reach the mentioned goal, AI methods of natural language processing and additional properties of metaontology (ontology supplemented with objective functions) are applied. Dedicated to maritime transport applications, the model for translating a natural language into an ontology consists of multiple steps and uses AI methods of classification for the recognition of a message from the ship’s bridge. The reverse model is also multi-stage and uses a created rule-based knowledge base to create natural-language sentences built on the basis of the ontology. Validation of the model’s accuracy results was conducted through accuracy assessment coefficients for information classification, commonly used in science. Receiver operating characteristic (ROC) curves represent the results in the datasets. The presented solution of the designed architecture of the system as well as algorithms developed in the software prototype confirmed the correctness of the assumptions in the described study. The authors demonstrated that it is feasible to successfully apply metaontology and machine learning methods in the proposed prototype software for ship-to-ship communication. Full article
(This article belongs to the Section Marine Science and Engineering)

19 pages, 2296 KB  
Article
A Hybrid Approach to Ontology Construction for the Badini Kurdish Language
by Media Azzat, Karwan Jacksi and Ismael Ali
Information 2024, 15(9), 578; https://doi.org/10.3390/info15090578 - 19 Sep 2024
Cited by 1 | Viewed by 4074
Abstract
Semantic ontologies have been widely utilized as crucial tools within natural language processing, underpinning applications such as knowledge extraction, question answering, machine translation, text comprehension, information retrieval, and text summarization. While the Kurdish language, a low-resource language, has been the subject of some ontological research in other dialects, a semantic web ontology for the Badini dialect remains conspicuously absent. This paper addresses this gap by presenting a methodology for constructing and utilizing a semantic web ontology for the Badini dialect of the Kurdish language. A Badini annotated corpus (UOZBDN) was created and manually annotated with part-of-speech (POS) tags. Subsequently, an HMM-based POS tagger model was developed using the UOZBDN corpus and applied to annotate additional text for ontology extraction. Ontology extraction was performed by employing predefined rules to identify nouns and verbs from the model-annotated corpus and subsequently forming semantic predicates. Robust methodologies were adopted for ontology development, resulting in a high degree of precision. The POS tagging model attained an accuracy of 95.04% when applied to the UOZBDN corpus. Furthermore, a manual evaluation conducted by Badini Kurdish language experts yielded a 97.42% accuracy rate for the extracted ontology. Full article
(This article belongs to the Special Issue Knowledge Representation and Ontology-Based Data Management)
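The rule-based extraction step described in the abstract can be illustrated with a minimal sketch: given a POS-tagged sentence, pair each verb with its nearest surrounding nouns to form a semantic predicate triple. The tag names (`NOUN`, `VERB`), the triple shape, and the English sample tokens are assumptions for readability, not the paper's actual rules or UOZBDN data.

```python
def extract_predicates(tagged_tokens):
    """Return (noun, verb, noun) triples found in one POS-tagged sentence."""
    triples = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag != "VERB":
            continue
        # nearest noun before and after the verb
        before = next((w for w, t in reversed(tagged_tokens[:i]) if t == "NOUN"), None)
        after = next((w for w, t in tagged_tokens[i + 1:] if t == "NOUN"), None)
        if before and after:
            triples.append((before, word, after))
    return triples

sentence = [("cat", "NOUN"), ("chases", "VERB"), ("mouse", "NOUN")]
print(extract_predicates(sentence))  # [('cat', 'chases', 'mouse')]
```

A real Badini pipeline would need dialect-specific word order (e.g. verb-final clauses) and a richer tag set, but the lookup-and-pair pattern is the same.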
20 pages, 2098 KB  
Article
Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method
by Fenfang Li, Zhengzhang Zhao, Li Wang and Han Deng
Appl. Sci. 2024, 14(7), 2989; https://doi.org/10.3390/app14072989 - 2 Apr 2024
Cited by 4 | Viewed by 2251
Abstract
Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Currently, most automatic sentence segmentation in Tibetan relies on rule-based methods, statistical learning, or a combination of the two; these approaches place high demands on the corpus and on the researchers' linguistic expertise, and manual annotation is costly. In this study, we explore Tibetan SBD using deep learning. First, we analyze the characteristics of Tibetan and various subword techniques, selecting Byte Pair Encoding (BPE) and SentencePiece (SP) for text segmentation and training Bidirectional Encoder Representations from Transformers (BERT) pre-trained language models. Second, we study Tibetan SBD with different BERT pre-trained language models, learning the ambiguity of the shad (“།”) at different positions in modern Tibetan texts and using the model to determine whether a given shad functions as a sentence delimiter. We also introduce four BERT-based models, BERT-CNN, BERT-RNN, BERT-RCNN, and BERT-DPCNN, for performance comparison. Finally, to verify the performance of pre-trained language models on the SBD task, we conduct SBD experiments on both the publicly available Tibetan pre-trained language model TiBERT and the multilingual pre-trained language model Multi-BERT. The experimental results show that the F1 score of the BERT (BPE) model trained in this study reaches 95.32% on 465,669 Tibetan sentences, nearly five percentage points higher than BERT (SP) and Multi-BERT. The SBD method based on pre-trained language models lays the foundation for building datasets for later Tibetan pre-training, summary extraction, and machine translation tasks. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
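The BPE segmentation the abstract selects can be sketched in its classic form: repeatedly merge the most frequent adjacent symbol pair in a frequency-weighted vocabulary. This is a toy sketch on English words for readability; the paper's actual Tibetan training would use a full BPE/SentencePiece implementation, and the corpus here is hypothetical.

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} vocabulary."""
    # represent each word as a tuple of single-character symbols
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge everywhere it occurs
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(bpe_merges({"low": 5, "lower": 2, "lowest": 1}, 2))
# [('l', 'o'), ('lo', 'w')]
```

The learned merge rules are then replayed, in order, to segment unseen text into subwords, which is what feeds the BERT tokenizer.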
13 pages, 415 KB  
Article
Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach
by Saima Shaukat, Muhammad Asad and Asmara Akram
Appl. Sci. 2023, 13(8), 5103; https://doi.org/10.3390/app13085103 - 19 Apr 2023
Cited by 7 | Viewed by 3904
Abstract
Lemmatization aims at returning the root form of a word. A lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks, including Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies have focused on developing lemmatizers using rule-based approaches for English and other highly resourced languages. However, there have been no thorough efforts to develop a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms, which makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between part-of-speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
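The dictionary lookup approach with PoS influence described above can be sketched as a keyed lookup with a fallback. The dictionary entries, the `(word, PoS)` keying, and the English examples are hypothetical placeholders, not the paper's actual Urdu resources.

```python
# Hypothetical lemma dictionary keyed on (surface form, PoS tag).
LEMMA_DICT = {
    ("running", "VERB"): "run",
    ("books", "NOUN"): "book",
}

def lemmatize(word, pos=None):
    """Look up (word, PoS); fall back to a PoS-agnostic match, then to the word itself."""
    if pos is not None and (word, pos) in LEMMA_DICT:
        return LEMMA_DICT[(word, pos)]
    # PoS-agnostic fallback: first entry matching the surface form alone
    for (w, _), lemma in LEMMA_DICT.items():
        if w == word:
            return lemma
    return word  # out-of-dictionary words are returned unchanged

print(lemmatize("running", "VERB"))  # run
print(lemmatize("books"))            # book
print(lemmatize("unknown"))          # unknown
```

Keying on the PoS tag is what lets the lookup disambiguate forms whose lemma depends on word class, which is the relationship the paper's PoS experiments probe.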