MDPI - Publisher of Open Access Journals

16 pages, 1651 KiB

Open Access Article

Modular Pipeline for Text Recognition in Early Printed Books Using Kraken and ByT5

by Yahya Momtaz, Lorenza Laccetti and Guido Russo

Electronics 2025, 14(15), 3083; https://doi.org/10.3390/electronics14153083 - 1 Aug 2025

Viewed by 193

Early printed books, particularly incunabula, are invaluable archives of the beginnings of modern educational systems. However, their complex layouts, antique typefaces, and page degradation caused by bleed-through and ink fading pose significant challenges for automatic transcription. In this work, we present a modular pipeline that addresses these problems by combining modern layout analysis and language modeling techniques. The pipeline begins with historical layout-aware text segmentation using Kraken, a neural network-based tool tailored for early typographic structures. Initial optical character recognition (OCR) is then performed with Kraken’s recognition engine, followed by post-correction using a fine-tuned ByT5 transformer model trained on manually aligned line-level data. By learning to map noisy OCR outputs to verified transcriptions, the model substantially improves recognition quality. The pipeline also integrates a preprocessing stage based on our previous work on bleed-through removal using robust statistical filters, including non-local means, Gaussian mixtures, biweight estimation, and Gaussian blur. This step enhances the legibility of degraded pages prior to OCR. The entire solution is open, modular, and scalable, supporting long-term preservation and improved accessibility of cultural heritage materials. Experimental results on 15th-century incunabula show a reduction in the Character Error Rate (CER) from around 38% to around 15% and an increase in the Bilingual Evaluation Understudy (BLEU) score from 22 to 44, confirming the effectiveness of our approach. This work demonstrates the potential of integrating transformer-based correction with layout-aware segmentation to enhance OCR accuracy in digital humanities applications. Full article
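The Character Error Rate reported in this abstract is conventionally defined as the character-level edit distance between the OCR output and the verified transcription, divided by the length of the transcription. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical line: verified transcription vs. noisy OCR output.
    print(character_error_rate("in principio erat verbum", "in prinripio erat uerbum"))
```

Because the denominator is the reference length, the value can exceed 1 when the OCR output contains many spurious insertions.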

(This article belongs to the Special Issue Electronics and Computer Science for Cultural Heritage: Advancements, Preservation, and Applications, 2nd Edition)

23 pages, 2410 KiB

Open Access Article

A Semi-Automatic Framework for Practical Transcription of Foreign Person Names in Lithuanian

by Gailius Raškinis, Darius Amilevičius, Danguolė Kalinauskaitė, Artūras Mickus, Daiva Vitkutė-Adžgauskienė, Antanas Čenys and Tomas Krilavičius

Mathematics 2025, 13(13), 2107; https://doi.org/10.3390/math13132107 - 27 Jun 2025

Viewed by 319

Abstract

We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant. Full article

(This article belongs to the Section E1: Mathematics and Computer Science)

36 pages, 2061 KiB

Open Access Article

A Symmetric Dual-Drive Text Matching Model Based on Dynamically Gated Sparse Attention Feature Distillation with a Faithful Semantic Preservation Strategy

by Peng Jiang and Xiaodong Cai

Symmetry 2025, 17(5), 772; https://doi.org/10.3390/sym17050772 - 15 May 2025

Viewed by 759

Abstract

A new text matching model based on dynamic gated sparse attention feature distillation with a faithful semantic preservation strategy is proposed to address the fact that text matching models are susceptible to interference from weakly relevant information and that they find it difficult to obtain key features that are faithful to the original semantics, resulting in a decrease in accuracy. Compared to the traditional attention mechanism, with its high computational complexity and difficulty in discarding weakly relevant features, this study designs a new dynamic gated sparse attention feature distillation method based on dynamic gated sparse attention, aiming to obtain key features. Weakly relevant features are obtained through the synergy of dynamic gated sparse attention, a gradient inversion layer, a SoftMax function, and projection theorem literacy. Among these, sparse attention enhances weakly correlated feature capture through multimodal dynamic fusion with adaptive compression. Then, the projection theorem is used to identify and discard the noisy features in the hidden layer information to obtain the key features. This feature distillation strategy, in which the semantic information of the original text is decomposed into key features and noise features, forms an orthogonal decomposition symmetry in the semantic space. A new variety of faithful semantic preservation strategies is designed to make the key features faithful to the original semantic information. This strategy introduces an interval loss function and calculates the angle between the key features and the original hidden layer information with the help of cosine similarity in order to ensure that the features reflect the semantics of the original text. This can further update the iterative key features and thus improve the accuracy. 
The strategy builds a feature fidelity verification mechanism with a symmetric core of bidirectional considerations of semantic accuracy and correspondence to the original text. The experimental results show that the accuracies are 89.10% and 95.01% on the English datasets MRPC and Scitail, respectively; 87.8% on the Chinese dataset PAWX; and 80.32% and 80.27% on the Ant Gold dataset, respectively. Meanwhile, the accuracy and Macro-F1 on the KUAKE-QTR dataset are 70.10% and 68.08%, respectively, which are better than those of other methods. Full article
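The orthogonal decomposition described in this abstract can be illustrated with plain vector projection: a hidden vector is split into a component along a noise direction and an orthogonal remainder, and cosine similarity then measures how faithful the remainder is to the original. This is only a sketch of the underlying geometry. The abstract does not say how the model obtains the noise direction, so it is passed in here as a given vector, and all names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(hidden, noise_direction):
    """Split `hidden` into (key, noise) with key orthogonal to noise_direction."""
    scale = dot(hidden, noise_direction) / dot(noise_direction, noise_direction)
    noise = [scale * d for d in noise_direction]      # projection onto the noise direction
    key = [h - n for h, n in zip(hidden, noise)]      # orthogonal remainder
    return key, noise


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    hidden = [3.0, 4.0]            # hypothetical hidden-layer vector
    noise_direction = [1.0, 0.0]   # assumed to be known for this illustration
    key, noise = decompose(hidden, noise_direction)
    print(key, noise, cosine_similarity(key, hidden))
```

A fidelity loss of the kind the abstract mentions would penalize key features whose cosine similarity to the original hidden vector falls below some margin; the exact form of that loss is not given in the abstract.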

(This article belongs to the Section Mathematics)

25 pages, 866 KiB

Open Access Article

Hybrid Deep Neural Network with Domain Knowledge for Text Sentiment Analysis

by Jawad Khan, Niaz Ahmad, Youngmoon Lee, Shah Khalid and Dildar Hussain

Mathematics 2025, 13(9), 1456; https://doi.org/10.3390/math13091456 - 29 Apr 2025

Cited by 2 | Viewed by 837

Abstract

Sentiment analysis (SA) analyzes online data to uncover insights for better decision-making. Conventional text SA techniques are effective and easy to understand but encounter difficulties when handling sparse data. Deep Neural Networks (DNNs) excel in handling data sparsity but face challenges with high-dimensional, noisy data. Incorporating rich domain semantic and sentiment knowledge is crucial for advancing sentiment analysis. To address these challenges, we propose an innovative hybrid sentiment analysis approach that combines established DNN models like RoBERTa and BiGRU with an attention mechanism, alongside traditional feature engineering and dimensionality reduction through PCA. This leverages the strengths of both techniques: DNNs handle complex semantics and dynamic features, while conventional methods shine in interpretability and efficient sentiment extraction. This complementary combination fosters a robust and accurate sentiment analysis model. Our model is evaluated on four widely used real-world benchmark text sentiment analysis datasets: MR, CR, IMDB, and SemEval 2013. The proposed hybrid model achieved impressive results on these datasets. These findings highlight the effectiveness of this approach for text sentiment analysis tasks, demonstrating its ability to improve sentiment analysis performance compared to previously proposed methods. Full article

(This article belongs to the Special Issue High-Dimensional Data Analysis and Applications)

29 pages, 63247 KiB

Open Access Article

Minimizing Bleed-Through Effect in Medieval Manuscripts with Machine Learning and Robust Statistics

by Adriano Ettari, Massimo Brescia, Stefania Conte, Yahya Momtaz and Guido Russo

J. Imaging 2025, 11(5), 136; https://doi.org/10.3390/jimaging11050136 - 28 Apr 2025

Viewed by 529

Abstract

Over the last decades, many ancient manuscripts have been digitized all over the world, particularly in Europe. The use of these huge digital archives is often limited by the bleed-through effect caused by the acidic nature of the inks used, which results in very noisy images. Several authors have recently worked on bleed-through removal, using different approaches. With the aim of developing a bleed-through removal tool capable of batch application to a large number of images, on the order of hundreds of thousands, we applied four machine learning and robust statistical methods to two medieval manuscripts. The methods used are (i) non-local means (NLM); (ii) Gaussian mixture models (GMMs); (iii) biweight estimation; and (iv) Gaussian blur. The results on the two manuscripts show that these methods are, in general, quite effective in bleed-through removal, but the method has to be selected according to the characteristics of the manuscript. For example, if there is no ink fading and the difference between bleed-through pixels and the foreground text is clear, a stronger model can be used without the risk of losing important information. Conversely, if the distinction between bleed-through and foreground pixels is less pronounced, it is better to use a weaker model to preserve useful details. Full article
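Of the four methods listed, biweight estimation is probably the least familiar. The sketch below shows a one-step Tukey biweight location estimate on a list of intensity values, to illustrate why such an estimator resists outliers. It is a generic textbook form under stated assumptions: the tuning constant `c = 6.0` is a common default, the sample values are hypothetical, and the abstract does not describe how the authors apply the estimator to manuscript pixels.

```python
from statistics import mean, median


def biweight_location(values, c=6.0):
    """One-step Tukey biweight location estimate (robust alternative to the mean)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])   # median absolute deviation
    if mad == 0:
        return center                                  # no spread to weight by
    numerator = 0.0
    denominator = 0.0
    for v in values:
        u = (v - center) / (c * mad)
        if abs(u) < 1:                                 # points with |u| >= 1 get zero weight
            weight = (1 - u * u) ** 2
            numerator += (v - center) * weight
            denominator += weight
    return center + numerator / denominator


if __name__ == "__main__":
    # Hypothetical intensities with one extreme outlier.
    samples = [10, 11, 9, 10, 12, 10, 200]
    print(mean(samples), biweight_location(samples))
```

The outlier receives zero weight, so the estimate stays close to the bulk of the data while the ordinary mean is pulled far away from it.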

(This article belongs to the Section Document Analysis and Processing)

18 pages, 2018 KiB

Open Access Article

Adapting a Large-Scale Transformer Model to Decode Chicken Vocalizations: A Non-Invasive AI Approach to Poultry Welfare

by Suresh Neethirajan

AI 2025, 6(4), 65; https://doi.org/10.3390/ai6040065 - 25 Mar 2025

Cited by 2 | Viewed by 1341

Abstract

Natural Language Processing (NLP) and advanced acoustic analysis have opened new avenues in animal welfare research by decoding the vocal signals of farm animals. This study explored the feasibility of adapting a large-scale Transformer-based model, OpenAI’s Whisper, originally developed for human speech recognition, to decode chicken vocalizations. Our primary objective was to determine whether Whisper could effectively identify acoustic patterns associated with emotional and physiological states in poultry, thereby enabling real-time, non-invasive welfare assessments. To achieve this, chicken vocal data were recorded under diverse experimental conditions, including healthy versus unhealthy birds, pre-stress versus post-stress scenarios, and quiet versus noisy environments. The audio recordings were processed through Whisper, producing text-like outputs. Although these outputs did not represent literal translations of chicken vocalizations into human language, they exhibited consistent patterns in token sequences and sentiment indicators strongly correlated with recognized poultry stressors and welfare conditions. Sentiment analysis using standard NLP tools (e.g., polarity scoring) identified notable shifts in “negative” and “positive” scores that corresponded closely with documented changes in vocal intensity associated with stress events and altered physiological states. Despite the inherent domain mismatch—given Whisper’s original training on human speech—the findings clearly demonstrate the model’s capability to reliably capture acoustic features significant to poultry welfare. Recognizing the limitations associated with applying English-oriented sentiment tools, this study proposes future multimodal validation frameworks incorporating physiological sensors and behavioral observations to further strengthen biological interpretability. 
To our knowledge, this work provides the first demonstration that Transformer-based architectures, even without species-specific fine-tuning, can effectively encode meaningful acoustic patterns from animal vocalizations, highlighting their transformative potential for advancing productivity, sustainability, and welfare practices in precision poultry farming. Full article

(This article belongs to the Special Issue Artificial Intelligence in Agriculture)

20 pages, 682 KiB

Open Access Article

Sentence Interaction and Bag Feature Enhancement for Distant Supervised Relation Extraction

by Wei Song and Qingchun Liu

AI 2025, 6(3), 51; https://doi.org/10.3390/ai6030051 - 4 Mar 2025

Viewed by 925

Abstract

Background: Distant supervision employs external knowledge bases to automatically match with text, allowing for the automatic annotation of sentences. Although this method effectively tackles the challenge of manual labeling, it inevitably introduces noisy labels. Traditional approaches typically employ sentence-level attention mechanisms, assigning lower weights to noisy sentences to mitigate their impact. But this approach overlooks the critical importance of information flow between sentences. Additionally, previous approaches treated an entire bag as a single classification unit, giving equal importance to all features within the bag. However, they failed to recognize that different dimensions of features have varying levels of significance. Method: To overcome these challenges, this study introduces a novel network that incorporates sentence interaction and a bag-level feature enhancement (ESI-EBF) mechanism. We concatenate sentences within a bag into a continuous context, allowing information to flow freely between them during encoding. At the bag level, we partition the features into multiple groups based on dimensions, assigning an importance coefficient to each sub-feature within a group. This enhances critical features while diminishing the influence of less important ones. In the end, the enhanced features are utilized to construct high-quality bag representations, facilitating more accurate classification by the classification module. Result: The experimental findings from the New York Times (NYT) and Wiki-20m datasets confirm the efficacy of our suggested encoding approach and feature improvement module. Our method also outperforms state-of-the-art techniques on these datasets, achieving superior relation extraction accuracy. Full article

(This article belongs to the Section AI Systems: Theory and Applications)

29 pages, 3481 KiB

Open Access Article

Translation Can Distort the Linguistic Parameters of Source Texts Written in Inflected Language: Multidimensional Mathematical Analysis of “The Betrothed”, a Translation in English of “I Promessi Sposi” by A. Manzoni

by Emilio Matricciani

AppliedMath 2025, 5(1), 24; https://doi.org/10.3390/appliedmath5010024 - 4 Mar 2025

Viewed by 1836

Abstract

We compare, mathematically, the text of a famous Italian novel, I promessi sposi, written by Alessandro Manzoni (source text), to its most recent English translation, The Betrothed by Michael F. Moore (target text). The mathematical theory applied does not measure the efficacy and beauty of texts, only their underlying mathematical structure and similarity. The translation theory adopted by the translator is the “domestication” of the source text because English is not as economical in its use of subject pronouns as Italian. A domestication index measures the degree of domestication. The modification of the original mathematical structure has several consequences for the short-term memory buffers required of the reader and for the theoretical number of patterns used to construct sentences. The geometrical representation of texts and the related probability of error indicate that the two texts are practically uncorrelated. A fine-tuning analysis shows that linguistic channels are very noisy, with very poor signal-to-noise ratios, except the channels related to characters and words. Readability indices also differ. In conclusion, a blind comparison of the linguistic parameters of the two texts would be unlikely to indicate that they refer to the same novel. Full article

14 pages, 223 KiB

Open Access Proceeding Paper

Handling Semantic Relationships for Classification of Sparse Text: A Review

by Safuan and Ku Ruhana Ku-Mahamud

Eng. Proc. 2025, 84(1), 61; https://doi.org/10.3390/engproc2025084061 - 17 Feb 2025

Viewed by 724

Abstract

The classification of sparse text, common in short or specialized content, is challenging for natural language processing. These challenges stem from high-dimensional data and scarce relevant features because sparse text can result from noisy, short, or contextually limited inputs. This paper reviews approaches for handling semantic relationships in sparse text classification. Approaches like FastText and Latent Dirichlet Allocation are discussed for addressing feature sparsity while maintaining semantic integrity. Embedding techniques, such as Word2Vec and BERT, are crucial for capturing contextual meanings and improving accuracy. Recent advances include hybrid models that combine deep learning and traditional methods for better performance. These approaches work across various datasets, including social media and scientific publications. Finally, progress in using semantic relationships for sparse text classification is reviewed, and open challenges and future research directions are identified to better integrate semantic understanding in sparse text classification. Full article

(This article belongs to the Proceedings of The 8th Mechanical Engineering, Science and Technology International Conference)

18 pages, 4280 KiB

Open Access Article

Language-Guided Semantic Clustering for Remote Sensing Change Detection

by Shenglong Hu, Yiting Bian, Bin Chen, Huihui Song and Kaihua Zhang

Sensors 2024, 24(24), 7887; https://doi.org/10.3390/s24247887 - 10 Dec 2024

Viewed by 1266

Abstract

Existing learning-based remote sensing change detection (RSCD) commonly uses semantic-agnostic binary masks as supervision, which hinders their ability to distinguish between different semantic types of changes, resulting in a noisy change mask prediction. To address this issue, this paper presents a Language-guided semantic clustering framework that can effectively transfer the rich semantic information from the contrastive language-image pretraining (CLIP) model for RSCD, dubbed LSC-CD. The LSC-CD considers the strong zero-shot generalization of the CLIP, which makes it easy to transfer the semantic knowledge from the CLIP into the CD model under semantic-agnostic binary mask supervision. Specifically, the LSC-CD first constructs a category text-prior memory bank based on the dataset statistics and then leverages the CLIP to transform the text in the memory bank into the corresponding semantic embeddings. Afterward, a CLIP adapter module (CAM) is designed to fine-tune the semantic embeddings to align with the change region embeddings from the input bi-temporal images. Next, a semantic clustering module (SCM) is designed to cluster the change region embeddings around the semantic embeddings, yielding the compact change embeddings that are robust to noisy backgrounds. Finally, a lightweight decoder is designed to decode the compact change embeddings, yielding an accurate change mask prediction. Experimental results on three public benchmarks including LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that the proposed LSC-CD achieves state-of-the-art performance in terms of all evaluated metrics. Full article

(This article belongs to the Special Issue Image Processing and Analysis for Object Detection: 2nd Edition)

22 pages, 9696 KiB

Open Access Article

Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval

by Qiang Zou, Shuli Cheng, Anyu Du and Jiayi Chen

Entropy 2024, 26(11), 911; https://doi.org/10.3390/e26110911 - 27 Oct 2024

Viewed by 1643

Abstract

Deep hashing technology, known for its low-cost storage and rapid retrieval, has become a focal point in cross-modal retrieval research as multimodal data continue to grow. However, existing supervised methods often overlook noisy labels and multiscale features in different modal datasets, leading to higher information entropy in the generated hash codes and features, which reduces retrieval performance. The variation in text annotation information across datasets further increases the information entropy during text feature extraction, resulting in suboptimal outcomes. Consequently, reducing the information entropy in text feature extraction, supplementing text feature information, and enhancing the retrieval efficiency of large-scale media data are critical challenges in cross-modal retrieval research. To tackle these, this paper introduces the Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval (TEGAH) framework. TEGAH incorporates a deep text feature extraction network and a multiscale label region fusion network to minimize information entropy and optimize feature extraction. Additionally, a Graph-Attention-based modal feature fusion network is designed to efficiently integrate multimodal information, enhance the affinity of the network for different modes, and retain more semantic information. Extensive experiments on three multilabel datasets demonstrate that the TEGAH framework significantly outperforms state-of-the-art cross-modal hashing methods. Full article

(This article belongs to the Section Multidisciplinary Applications)

13 pages, 5724 KiB

Open Access Article

Comparative Approach to De-Noising TEMPEST Video Frames

by Alexandru Mădălin Vizitiu, Marius Alexandru Sandu, Lidia Dobrescu, Adrian Focșa and Cristian Constantin Molder

Sensors 2024, 24(19), 6292; https://doi.org/10.3390/s24196292 - 28 Sep 2024

Viewed by 1242

Abstract

Analysis of unintended compromising emissions from Video Display Units (VDUs) is an important topic in research communities. This paper examines the feasibility of recovering the information displayed on the monitor from reconstructed video frames. The study holds particular significance for our understanding of security vulnerabilities associated with the electromagnetic radiation of digital displays. Considering the amount of noise that reconstructed TEMPEST video frames have, the work in this paper focuses on two different approaches to de-noising images for efficient optical character recognition. First, an Adaptive Wiener Filter (AWF) with adaptive window size implemented in the spatial domain was tested, and then a Convolutional Neural Network (CNN) with an encoder–decoder structure that follows both classical auto-encoder model architecture and U-Net architecture (auto-encoder with skip connections). These two techniques resulted in an improvement of more than two times on the Structural Similarity Index Metric (SSIM) for AWF and up to four times for the SSIM for the Deep Learning (DL) approach. In addition, to validate the results, the possibility of text recovery from processed noisy frames was studied using a state-of-the-art Tesseract Optical Character Recognition (OCR) engine. The present work aims to bring to attention the security importance of this topic and the non-negligible character of VDU information leakages. Full article
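The SSIM figures in this abstract compare a de-noised frame against a reference. The sketch below computes the standard SSIM expression over a whole image treated as a single window, using the usual constants K1 = 0.01 and K2 = 0.03. It is a simplification: reference implementations compute SSIM over local windows and average the results, and the abstract does not state which variant the authors used. The pixel values are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """SSIM over two equally sized flat pixel lists, using one global window."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    clean = [0, 255, 0, 255, 128, 128]      # hypothetical reference pixels
    noisy = [30, 200, 60, 220, 100, 150]    # hypothetical reconstructed pixels
    print(global_ssim(clean, clean), global_ssim(clean, noisy))
```

An identical pair scores 1, and any difference in mean, contrast, or structure pushes the score below 1.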

(This article belongs to the Section Sensing and Imaging)

22 pages, 749 KiB

Open Access Article

Improving Distantly Supervised Relation Extraction with Multi-Level Noise Reduction

by Wei Song and Zijiang Yang

AI 2024, 5(3), 1709-1730; https://doi.org/10.3390/ai5030084 - 23 Sep 2024

Cited by 3 | Viewed by 1412

Abstract

Background: Distantly supervised relation extraction (DSRE) aims to identify semantic relations in large-scale texts automatically labeled via knowledge base alignment. It has garnered significant attention due to its high efficiency, but existing methods are plagued by noise at both the word and sentence level and fail to address these issues adequately. The former level of noise arises from the large proportion of irrelevant words within sentences, while noise at the latter level is caused by inaccurate relation labels for various sentences. Method: We propose a novel multi-level noise reduction neural network (MLNRNN) to tackle both issues by mitigating the impact of multi-level noise. We first build an iterative keyword semantic aggregator (IKSA) to remove noisy words, and capture distinctive features of sentences by aggregating the information of keywords. Next, we implement multi-objective multi-instance learning (MOMIL) to reduce the impact of incorrect labels in sentences by identifying the cluster of correctly labeled instances. Meanwhile, we leverage mislabeled sentences with cross-level contrastive learning (CCL) to further enhance the classification capability of the extractor. Results: Comprehensive experimental results on two DSRE benchmark datasets demonstrated that the MLNRNN outperformed state-of-the-art methods for distantly supervised relation extraction in almost all cases. Conclusions: The proposed MLNRNN effectively addresses both word- and sentence-level noise, providing a significant improvement in relation extraction performance under distant supervision. Full article

(This article belongs to the Section AI Systems: Theory and Applications)

18 pages, 37868 KiB

Open Access Article

3D Character Animation and Asset Generation Using Deep Learning

by Vlad-Constantin Lungu-Stan and Irina Georgiana Mocanu

Appl. Sci. 2024, 14(16), 7234; https://doi.org/10.3390/app14167234 - 16 Aug 2024

Viewed by 3043

Abstract

Besides video content, a significant part of entertainment is represented by computer games and animations such as cartoons. Creating such entertainment is based on two fundamental steps: asset generation and character animation. The main problem stems from its repetitive nature and the needed amounts of concentration and skill. The latest advances in deep learning and generative techniques have provided a set of powerful tools which can be used to alleviate these problems by facilitating the tasks of artists and engineers and providing a better workflow. In this work we explore practical solutions for facilitating and hastening the creative process: character animation and asset generation. In character animation, the task is to either move the joints of a subject manually or to correct the noisy data coming out of motion capture. The main difficulties of these tasks are their repetitive nature and the needed amounts of concentration and skill. For the animation case, we propose two decoder-only transformer-based solutions, inspired by the current success of GPT. The first, AnimGPT, targets the original animation workflow by predicting the next pose of an animation based on a set of previous poses, while the second, DenoiseAnimGPT, tackles the motion capture case by predicting the clean current pose based on all previous poses and the current noisy pose. Both models obtained good performance on the CMU motion dataset, with the generated results being imperceptible to the untrained human eye. Quantitative evaluation was performed using the mean absolute error between the ground truth motion vectors and the predicted motion vectors. The errors for AnimGPT and DenoiseAnimGPT were 0.345 and 0.2513, respectively (for 50 frames), which indicates better performance compared with other solutions. For asset generation, diffusion models were used. 
Using image generation and outpainting, we created a method that generates good backgrounds by combining the idea of text conditioned generation and text conditioned image editing. A time coherent algorithm that creates animated effects for characters was obtained. Full article
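The evaluation metric named in the abstract, mean absolute error between ground-truth and predicted motion vectors, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_absolute_error`, the list-of-lists pose layout (one list of joint values per frame), and the toy values are all assumptions made for the example.

```python
def mean_absolute_error(truth, predicted):
    """Average absolute difference over every value of every frame.

    Both arguments are sequences of frames; each frame is a sequence of
    floats (a hypothetical flattened motion vector).
    """
    if len(truth) != len(predicted):
        raise ValueError("sequences must contain the same number of frames")

    total = 0.0
    count = 0
    for true_pose, pred_pose in zip(truth, predicted):
        if len(true_pose) != len(pred_pose):
            raise ValueError("each pair of frames must have the same length")
        for t, p in zip(true_pose, pred_pose):
            total += abs(t - p)
            count += 1

    if count == 0:
        raise ValueError("cannot compute the error of empty sequences")
    return total / count


# Toy data (hypothetical): 2 frames with 3 values each.
truth = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
predicted = [[0.5, 1.0, 1.5], [1.0, 2.0, 1.0]]
mae = mean_absolute_error(truth, predicted)
print(f"MAE over {len(truth)} frames: {mae:.4f}")
```

A lower value means the predicted poses lie closer to the ground truth on average; the figures reported above are averages of this kind over 50 frames.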

(This article belongs to the Special Issue Applications of Artificial Intelligence and Machine Learning in Games)

23 pages, 1009 KiB

Open Access Article

Enhancement of English-Bengali Machine Translation Leveraging Back-Translation

by Subrota Kumar Mondal, Chengwei Wang, Yijun Chen, Yuning Cheng, Yanbo Huang, Hong-Ning Dai and H. M. Dipu Kabir

Appl. Sci. 2024, 14(15), 6848; https://doi.org/10.3390/app14156848 - 5 Aug 2024

Cited by 1 | Viewed by 2712

An English-Bengali machine translation (MT) application can convert an English text into a corresponding Bengali translation. To build a better model for this task, we can optimize English-Bengali MT. MT for languages with rich resources, like English-German, started decades ago. However, MT for languages lacking many parallel corpora remains challenging. In our study, we employed back-translation to improve the translation accuracy. Back-translation yields a pseudo-parallel corpus, and the generated (pseudo) corpus can be added to the original dataset to obtain an augmented dataset. However, the new data can be regarded as noisy because they are generated by models that may not be as well trained or as well evaluated as human translators. Since the raw output of a translation model is a probability distribution over candidate words, different decoding methods are used to make the model more robust, such as beam search, top-k random sampling, and random sampling with temperature T. Notably, top-k random sampling and random sampling with temperature T are more commonly used and more effective decoding methods than beam search. To this end, our study compares LSTM (Long Short-Term Memory, as a baseline) and Transformer. Our results show that Transformer (BLEU: 27.80 in validation, 1.33 in test) outperforms LSTM (3.62 in validation, 0.00 in test) by a large margin in the English-Bengali translation task. (Evaluating LSTM and Transformer without any augmented data is our baseline study.) We also incorporate two decoding methods, top-k random sampling and random sampling with temperature T, for back-translation that help improve the translation accuracy of the model. The results show that data generated by back-translation without top-k or temperature sampling (“no strategy”) help improve the accuracy (BLEU 38.22, +10.42 on validation; 2.07, +0.74 on test). Specifically, back-translation with top-k sampling is less effective (k = 10, BLEU 29.43, +1.83 on validation; 1.36, +0.03 on test), while sampling with a proper value of T makes the model achieve a higher score (T = 0.5, BLEU 35.02, +7.22 on validation; 2.35, +1.02 on test). This implies that in English-Bengali MT, we can augment the training set through back-translation using random sampling with a proper temperature T.
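The two sampling strategies compared in the abstract can be sketched in a few lines. This is a generic illustration of top-k random sampling and random sampling with temperature T over one decoding step, not the authors' implementation: the function names, the toy logits, and the fixed seed are assumptions made for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; T < 1 sharpens, T > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def sample_top_k(logits, k, rng):
    """Keep the k highest-scoring tokens, renormalize, and draw one index."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:k]
    probs = softmax([logits[i] for i in kept])
    return rng.choices(kept, weights=probs, k=1)[0]


# Hypothetical scores for a four-token vocabulary at one decoding step.
logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

print("T = 0.5 sample:", sample_with_temperature(logits, 0.5, rng))
print("top-k (k = 2) sample:", sample_top_k(logits, 2, rng))
```

In back-translation, a step like this would be repeated token by token to produce each synthetic source sentence, so the choice of k or T controls how diverse, and how noisy, the pseudo-parallel corpus becomes.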
