Search Results (7)

Search Parameters:
Keywords = manually aligned corpus

20 pages, 1405 KiB  
Article
Multimodal Pragmatic Markers of Feedback in Dialogue
by Ludivine Crible and Loulou Kosmala
Languages 2025, 10(6), 117; https://doi.org/10.3390/languages10060117 - 22 May 2025
Viewed by 574
Abstract
Historically, the field of discourse marker research has moved from relying on intuition to increasingly ecological data, with written, spoken, and now multimodal corpora available to study these pervasive pragmatic devices. For some topics, video is necessary to capture the complexity of interactive phenomena, such as feedback in dialogue. Feedback is the process of communicating engagement, alignment, and affiliation (or lack thereof) to the other speaker, and it has recently attracted considerable attention from fields such as psycholinguistics, conversation analysis, and second language acquisition. Feedback can be expressed by a variety of verbal/vocal and visual/gestural devices, from questions to head nods and, crucially, discourse or pragmatic markers such as “okay, alright, yeah”. Verbal/vocal and visual/gestural forms often co-occur, which calls for more investigation of their combinations. In this study, we analyze multimodal pragmatic markers of feedback in a corpus of French dialogues, where all feedback devices have previously been categorized as either “alignment” (expression of mutual understanding) or “affiliation” (expression of shared stance). After describing the distribution and forms within each modality taken separately, we focus on notable multimodal combinations, such as [negative oui ‘yes’ + head tilt] or [mais oui ‘but yes’ + forward head move], showing how the visual modality can affect the semantics of verbal markers. In doing so, we contribute to defining multimodal pragmatic markers, a status so far restricted to verbal markers and manual gestures, at the expense of other devices in the visual modality.
(This article belongs to the Special Issue Current Trends in Discourse Marker Research)
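
By way of illustration, a minimal sketch of how [verbal + visual] feedback combinations of the kind the abstract mentions might be tallied from an annotated corpus; the records, forms, and counts below are hypothetical placeholders, not the authors' data.

```python
from collections import Counter

# Hypothetical annotation records: each feedback event pairs a verbal/vocal
# form with a visual/gestural form and a functional category
# ("alignment" = mutual understanding, "affiliation" = shared stance).
events = [
    {"verbal": "oui", "visual": "head tilt", "function": "affiliation"},
    {"verbal": "mais oui", "visual": "forward head move", "function": "affiliation"},
    {"verbal": "okay", "visual": "head nod", "function": "alignment"},
    {"verbal": "oui", "visual": "head nod", "function": "alignment"},
]

# Count how often each [verbal + visual] combination expresses each function.
combos = Counter((e["verbal"], e["visual"], e["function"]) for e in events)
for (verbal, visual, function), n in combos.most_common():
    print(f"[{verbal} + {visual}] -> {function}: {n}")
```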

26 pages, 12966 KiB  
Article
Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants
by Alexander Hartelt, Tim Eipert and Frank Puppe
Appl. Sci. 2024, 14(16), 7355; https://doi.org/10.3390/app14167355 - 20 Aug 2024
Cited by 2 | Viewed by 1339
Abstract
Manual transcription of music is tedious work that can be greatly facilitated by optical music recognition (OMR) software. However, OMR software is error-prone, in particular for older handwritten documents. This paper introduces and evaluates a pipeline that automates the entire OMR workflow in the context of the Corpus Monodicum project, enabling the transcription of historical chants. In addition to typical OMR tasks such as staff line detection, layout detection, and symbol recognition, the rarely addressed tasks of text and syllable recognition and the assignment of syllables to symbols are tackled. For quantitative and qualitative evaluation, we use documents written in the square notation developed in the 11th–12th centuries, but the methods apply to many other notations as well. Quantitative evaluation measures the number of interventions necessary for correction, which is about 0.4% for layout recognition including the division of text into chants, 2.4% for symbol recognition including pitch and reading order, and 2.3% for syllable alignment with correct text and symbols. Qualitative evaluation showed an efficiency gain by a factor of about 9 compared to manual transcription with an elaborate tool. In a second use case with printed chants in similar notation from the “Graduale Synopticum”, the evaluation results for symbols are much better, except for syllable alignment, indicating the difficulty of this task.
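
As a hedged illustration of the intervention-count evaluation the abstract describes, the sketch below approximates an intervention rate as the edit distance between a predicted and a ground-truth symbol sequence; the sequences are hypothetical placeholders, not the paper's evaluation data.

```python
# Intervention rate approximated with Levenshtein distance: the share of
# edit operations a human corrector would have to perform.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

predicted    = ["clef", "c4", "d4", "e4", "f4"]   # OMR output (hypothetical)
ground_truth = ["clef", "c4", "d4", "f4", "f4"]   # manual transcription

rate = levenshtein(predicted, ground_truth) / len(ground_truth)
print(f"intervention rate: {rate:.1%}")  # 20.0% for this toy example
```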

15 pages, 557 KiB  
Article
WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation
by Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao and Tadahiro Matsumoto
Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140 - 26 Feb 2023
Cited by 8 | Viewed by 3612
Abstract
Movie and TV subtitles are frequently employed in natural language processing (NLP) applications, but few Japanese-Chinese bilingual corpora are accessible as datasets for training neural machine translation (NMT) models. In our previous study, we constructed a corpus of considerable size containing bilingual Japanese-Chinese text data by collecting subtitle text from websites that host movies and television series. The unsatisfactory translation performance of the initial corpus, the Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by its limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues in the construction of WCC-JC 1.0 and built the WCC-JC 2.0 corpus, first collecting subtitle data from movie and TV series websites and then manually aligning a large number of high-quality sentence pairs. Our efforts resulted in a new corpus of about 1.4 million sentence pairs, an 87% increase over WCC-JC 1.0, making WCC-JC 2.0 among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess WCC-JC 2.0, we calculated BLEU scores relative to other comparable corpora and manually evaluated the translation results generated by models trained on it. We provide WCC-JC 2.0 as a free download for research purposes only.
(This article belongs to the Special Issue Natural Language Processing and Information Retrieval)
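
For readers wanting to reproduce the kind of corpus-level BLEU comparison mentioned in the abstract, a minimal sketch using the sacrebleu library (`pip install sacrebleu`); the sentences below are hypothetical placeholders, not WCC-JC 2.0 data.

```python
import sacrebleu

# Hypotheses from a trained NMT model and one aligned reference stream.
hypotheses = ["the cat sat on the mat", "he plays the piano well"]
references = [["the cat sat on the mat", "he plays piano very well"]]

# For Chinese-target evaluation, pass tokenize="zh" so BLEU is computed
# over appropriately segmented text.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```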

16 pages, 783 KiB  
Article
Data-Driven Approach for Spellchecking and Autocorrection
by Alymzhan Toleu, Gulmira Tolegen, Rustam Mussabayev, Alexander Krassovitskiy and Irina Ualiyeva
Symmetry 2022, 14(11), 2261; https://doi.org/10.3390/sym14112261 - 27 Oct 2022
Cited by 5 | Viewed by 2674
Abstract
This article presents an approach for spellchecking and autocorrection using web data for morphologically complex languages (in this case, the Kazakh language), which can be considered an end-to-end approach in that it does not require any manually annotated word–error pairs. A sizable amount of noisy web data is crawled and used as a basis for inferring knowledge of misspellings together with their correct forms. Using the extracted corpus, a sub-string error model and a context model for morphologically complex languages are trained separately, and the two models are then integrated with a regularization parameter. A sub-string alignment model is applied to extract symmetric and non-symmetric patterns in the two sequences of a word–error pair. The model calculates the probabilities of the symmetric and non-symmetric patterns of a given misspelling and its candidates to obtain a suggestion list. Based on the proposed method, a Kazakh spellchecking and autocorrection system, which we refer to as QazSpell, is developed. Several experiments evaluate the proposed approach from different angles. The results show that the approach achieves good results using only the error model, and performance is further boosted after integrating the context model. In addition, QazSpell outperforms commercial analogs in terms of overall accuracy.
(This article belongs to the Section Computer)
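
A minimal sketch of the model-integration step the abstract describes, assuming a log-linear combination of error-model and context-model probabilities under a regularization weight; the candidates, probabilities, and lambda value are hypothetical placeholders, not the paper's trained models.

```python
import math

def rank_candidates(candidates, error_prob, context_prob, lam=0.7):
    """Rank candidate corrections for an observed misspelling by
    score(c) = log P_error(misspelling | c) + lam * log P_context(c)."""
    scored = [
        (c, math.log(error_prob[c]) + lam * math.log(context_prob[c]))
        for c in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical candidates for one misspelling, with placeholder probabilities.
candidates   = ["qalam", "qala"]
error_prob   = {"qalam": 0.30, "qala": 0.10}   # P(observed error | candidate)
context_prob = {"qalam": 0.02, "qala": 0.005}  # context/language-model score

for cand, score in rank_candidates(candidates, error_prob, context_prob):
    print(cand, round(score, 3))
```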

12 pages, 295 KiB  
Article
Zero-Shot Topic Labeling for Hazard Classification
by Andrea Rondinelli, Lorenzo Bongiovanni and Valerio Basile
Information 2022, 13(10), 444; https://doi.org/10.3390/info13100444 - 21 Sep 2022
Cited by 6 | Viewed by 2691
Abstract
Topic classification is the task of mapping text onto a set of meaningful labels known beforehand. This scenario is very common in both academia and industry whenever a large corpus of documents must be categorized according to a set of custom labels. The standard supervised approach, however, requires thousands of documents to be manually labelled, plus additional effort every time the label taxonomy changes. To obviate these downsides, we investigated the application of a zero-shot approach to topic classification. In this setting, a subset of the topics, or even all of them, is not seen at training time, challenging the model to classify corresponding examples using additional information. We first show how zero-shot classification can perform the topic-classification task without any supervision. Secondly, we build a novel hazard-detection dataset by manually selecting tweets gathered by LINKS Foundation for this task, on which we demonstrate the effectiveness of our cost-free method on a real-world problem. The idea is to leverage a pre-trained text embedder (MPNet) to map both text and topics into the same semantic vector space, where they can be compared. We demonstrate that these semantic spaces are better aligned when their dimension is reduced, keeping only the most useful information. We investigated three dimensionality reduction techniques, namely linear projection, autoencoding, and PCA. Using the macro F1-score as the standard metric, PCA was the best-performing technique, recording improvements on each dataset in comparison with the baseline.
(This article belongs to the Section Artificial Intelligence)
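
A minimal sketch of this zero-shot pipeline, assuming the sentence-transformers MPNet checkpoint all-mpnet-base-v2 and scikit-learn's PCA (`pip install sentence-transformers scikit-learn`); the texts, labels, and toy reduced dimension are illustrative choices, not the authors' exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-mpnet-base-v2")

texts  = ["Flood waters are rising near the river bank",
          "Smoke visible over the hills"]
topics = ["flood", "wildfire", "earthquake"]

# Embed texts and topic labels into the same semantic space,
# then reduce dimension (a toy reduction on this tiny sample).
emb = model.encode(texts + topics)
emb = PCA(n_components=2).fit_transform(emb)

# Assign each text the closest topic label by cosine similarity.
sims = cosine_similarity(emb[:len(texts)], emb[len(texts):])
for text, row in zip(texts, sims):
    print(text, "->", topics[int(np.argmax(row))])
```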

12 pages, 465 KiB  
Article
EANT: Distant Supervision for Relation Extraction with Entity Attributes via Negative Training
by Xuxin Chen and Xinli Huang
Appl. Sci. 2022, 12(17), 8821; https://doi.org/10.3390/app12178821 - 2 Sep 2022
Cited by 2 | Viewed by 1945
Abstract
Distant supervision for relation extraction (DSRE) automatically acquires large-scale annotated data by aligning a corpus with a knowledge base, which dramatically reduces the cost of manual annotation. However, this technique is plagued by noisy data, which seriously affects model performance. In this paper, we introduce negative training to filter out such noise. Specifically, we train the model with complementary labels based on the idea that “the sentence does not express the target relation”. The trained model can then discriminate noisy data from the training set. In addition, we believe that additional entity attributes (such as descriptions, aliases, and types) can provide more information for sentence representation. On this basis, we propose EANT, a DSRE model with entity attributes via negative training. While filtering noisy sentences, EANT also relabels some false negative sentences and converts them into useful training data. Our experimental results on the widely used New York Times dataset show that EANT can significantly improve relation extraction performance over state-of-the-art baselines.
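
A minimal PyTorch sketch of the complementary-label (negative training) loss idea the abstract describes: push down the probability of the possibly noisy assigned relation via -log(1 - p_y), rather than pushing it up. The shapes, class count, and data are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)   # batch of 4, 5 relation classes
noisy_labels = torch.tensor([2, 0, 3, 1])        # distantly supervised labels

# Probability the model assigns to each (possibly wrong) distant label.
probs = F.softmax(logits, dim=-1)
p_assigned = probs.gather(1, noisy_labels.unsqueeze(1)).squeeze(1)

# Negative training: "this sentence does NOT express the target relation",
# so minimize -log(1 - p) for the assigned label.
negative_loss = -torch.log(1.0 - p_assigned + 1e-12).mean()
negative_loss.backward()
print(float(negative_loss))
```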

21 pages, 4662 KiB  
Article
Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach
by Arne Defauw, Sara Szoc, Anna Bardadym, Joris Brabers, Frederic Everaert, Roko Mijic, Kim Scholte, Tom Vanallemeersch, Koen Van Winckel and Joachim Van den Bogaert
Informatics 2019, 6(3), 35; https://doi.org/10.3390/informatics6030035 - 1 Sep 2019
Cited by 2 | Viewed by 6244
Abstract
To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual websites and aligned into datasets for training. Many tools exist for the automatic alignment of such datasets; however, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem and trained our algorithm on a manually labeled dataset in the FR–NL language pair. The algorithm uses shallow features together with features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source, and the cosine distance between sentence embeddings of the source and the target, were the two most important features for misalignment detection. Using gold standards for alignment, we demonstrated that our model can substantially increase the quality of alignments in a corpus, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
(This article belongs to the Special Issue Advances in Computer-Aided Translation Technology)
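
A minimal sketch of the supervised-regression framing, assuming precomputed per-pair features (normalized Levenshtein distance to the translated source, embedding cosine distance, length ratio) and scikit-learn; the feature values and labels are hypothetical placeholders, since a real system would compute them with an MT step and a sentence embedder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per sentence pair (hypothetical values):
# [levenshtein(target, translated_source) normalized,
#  cosine distance between source/target embeddings,
#  length ratio]
X = np.array([
    [0.05, 0.10, 1.00],   # well aligned
    [0.10, 0.15, 0.95],   # well aligned
    [0.70, 0.80, 0.40],   # misaligned
    [0.65, 0.75, 2.10],   # misaligned
])
y = np.array([1.0, 1.0, 0.0, 0.0])  # 1 = genuine translation pair

# Regress alignment quality; high predicted score -> keep the pair.
model = LinearRegression().fit(X, y)
print(model.predict([[0.08, 0.12, 1.05]]))
```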
