Article

A Compression-Based Multiple Subword Segmentation for Neural Machine Translation

Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka 820-8502, Japan
* Author to whom correspondence should be addressed.
Academic Editor: Manohar Das
Electronics 2022, 11(7), 1014; https://doi.org/10.3390/electronics11071014
Received: 28 February 2022 / Revised: 19 March 2022 / Accepted: 21 March 2022 / Published: 24 March 2022
(This article belongs to the Special Issue Data Compression and Its Application in AI)
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches; however, because compression-based approaches are deterministic, generating multiple segmentations of the same input is difficult. To overcome this difficulty, we focus on a stochastic string algorithm, called locally consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, particularly when learning from small training data.
Keywords: byte-pair encoding; locally consistent parsing; vocabulary; word embedding
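
To make the determinism issue mentioned in the abstract concrete, the following is a minimal sketch of the BPE/BPE-dropout baseline, not of the proposed LCP-dropout, whose details are not given on this page. It assumes the usual formulation in which BPE learns an ordered list of merges and BPE-dropout randomly skips candidate merges at segmentation time. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy word list are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered merge list from a list of words (plain, deterministic BPE)."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption
        # made here only so that the sketch is reproducible).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges, dropout=0.0, rng=random):
    """Segment `word` with the learned merges.

    With dropout=0.0 the result is always the same (deterministic BPE).
    With dropout>0.0 each candidate merge is skipped with that probability
    at every step, so repeated calls can yield different segmentations.
    """
    rank = {pair: r for r, pair in enumerate(merges)}
    symbols = tuple(word)
    while len(symbols) > 1:
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols = symbols[:i] + (symbols[i] + symbols[i + 1],) + symbols[i + 2:]
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=3)
print(segment("lower", merges))                # deterministic segmentation
rng = random.Random(0)
print([segment("lower", merges, dropout=0.5, rng=rng) for _ in range(3)])
```

Calling `segment` with `dropout=0.0` always returns the same split, which is the limitation the abstract attributes to compression-based methods; a nonzero `dropout` is the mechanism by which BPE-dropout obtains multiple segmentations of one word.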
MDPI and ACS Style

Nonaka, K.; Yamanouchi, K.; I, T.; Okita, T.; Shimada, K.; Sakamoto, H. A Compression-Based Multiple Subword Segmentation for Neural Machine Translation. Electronics 2022, 11, 1014. https://doi.org/10.3390/electronics11071014
