Open Access Article

Self-Supervised Contextual Data Augmentation for Natural Language Processing

Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea
* Author to whom correspondence should be addressed.
Symmetry 2019, 11(11), 1393; https://doi.org/10.3390/sym11111393
Received: 10 October 2019 / Revised: 5 November 2019 / Accepted: 7 November 2019 / Published: 11 November 2019
In this paper, we propose a novel data augmentation method that takes the context of the target data into account via self-supervised learning. Instead of looking for exact synonyms of masked words, the proposed method finds words that can replace the originals while fitting the surrounding context. For self-supervised learning, we can employ the masked language model (MLM), which masks a specific word within a sentence and predicts the original word. The MLM learns the context of a sentence through asymmetrical inputs and outputs. However, rather than using the existing MLM as-is, we propose a label-masked language model (LMLM) that incorporates label information into the mask tokens, so that the MLM can be applied effectively to labeled data. The augmentation method first trains the LMLM in a self-supervised manner and then generates augmented data with the trained model. Through experiments on several text classification benchmark datasets, including the Stanford Sentiment Treebank-5 (SST5), Stanford Sentiment Treebank-2 (SST2), Subjectivity (Subj), Multi-Perspective Question Answering (MPQA), Movie Reviews (MR), and Text Retrieval Conference (TREC) datasets, we demonstrate that the proposed method improves the classification accuracy of recurrent neural network and convolutional neural network based classifiers. In addition, because the proposed method does not rely on external data, it eliminates the time that would otherwise be spent collecting external data or pre-training on it.
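The sketch below illustrates the general idea of label-conditioned contextual word replacement. It is not the authors' LMLM implementation: instead of adding label information to the mask tokens themselves, it approximates the conditioning by prepending the class label as ordinary context before an off-the-shelf masked language model fills in the masked positions. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

```python
# Minimal illustrative sketch of contextual data augmentation with a masked
# language model, loosely conditioned on the class label by prepending it to
# the input. This approximates, but is NOT, the paper's LMLM.
# Assumes: pip install torch transformers
import random
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def augment(sentence: str, label: str, mask_prob: float = 0.15) -> str:
    """Replace a few words in `sentence` with contextual predictions,
    conditioning (loosely) on `label` by prepending it to the input."""
    words = sentence.split()
    n_mask = max(1, int(len(words) * mask_prob))
    mask_idx = random.sample(range(len(words)), n_mask)

    # Mask the chosen word positions.
    masked = [tokenizer.mask_token if i in mask_idx else w
              for i, w in enumerate(words)]
    # Prepend the label as plain context (a stand-in for label-aware masking).
    text = f"{label} : " + " ".join(masked)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    input_ids = inputs["input_ids"][0]
    mask_positions = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_ids = logits[0, mask_positions].argmax(dim=-1)

    # Substitute the predicted tokens back in and decode.
    new_ids = input_ids.clone()
    new_ids[mask_positions] = predicted_ids
    decoded = tokenizer.decode(new_ids, skip_special_tokens=True)
    # Strip the prepended label prefix before returning.
    return decoded.split(":", 1)[-1].strip()

if __name__ == "__main__":
    print(augment("the movie was surprisingly good and well acted", "positive"))
```

In the paper's setting, the augmented sentences produced this way would keep their original labels and be added to the training set of the downstream classifier.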
Keywords: data augmentation; self-supervised learning; natural language processing; text classification
MDPI and ACS Style

Park, D.; Ahn, C.W. Self-Supervised Contextual Data Augmentation for Natural Language Processing. Symmetry 2019, 11, 1393.

