Mask Transformer: Unpaired Text Style Transfer Based on Masked Language

Abstract: Currently, most text style transfer methods encode the text into a style-independent latent representation and decode it into new sentences with the target style. Due to the limitations of the latent representation, previous works can hardly produce satisfactory target-style sentences, especially in terms of preserving the semantics of the original sentence. We propose a "Mask and Generation" structure, which obtains an explicit representation of the content of the original sentence and generates the target sentence with a transformer. This explicit representation is a masked text in which the words with strong style attributes are masked. Therefore, it preserves most of the semantic meaning of the original sentence. In addition, because it is the input of the generator, it also simplifies generation compared with current work that generates the target sentence from scratch. As the explicit representation is readable, the model has better interpretability: we can clearly see which words changed and why they changed. We evaluate our model on two review datasets with quantitative, qualitative, and human evaluations. The experimental results show that our model generally outperforms other methods in terms of transfer accuracy and content preservation.


Introduction
The text style transfer task aims to change the stylistic attributes of sentences (e.g., emotions), while retaining the style-independent content of the context as much as possible. This method can be widely used to transfer review sentiment, rewrite news, change dialogue emotion, and so on. For example, a positive comment "The restaurant's dishes are very delicious, I highly recommend it!" can be transformed into a negative comment "The restaurant's dishes are a bit disappointing, and I will never come again!" Currently, there is no specific and common definition of text style, so we usually set the definition depending on the task. In addition, due to the high construction costs of parallel corpus, the current research is mainly conducted using nonparallel corpora [1].
Currently, a deep neural network model based on the seq2seq framework is the main text style transfer method. The first class of methods focuses on disentangling the content and style in the latent space and encodes sentences into latent representations in the semantic space and style space, respectively. After disentangling the content, the style-independent representation will be decoded into a sentence with the target style via a generative model [2,3]. Another class of methods attempts to learn the mixed content and style distribution in the latent space and directly map it in the latent space to complete the style transfer [4,5].
Both classes of methods have certain problems. In the first class of methods, the quality of the separated hidden vector is difficult to evaluate, and the latent representation struggles to keep the rich semantic information of the original sentence because of its limited capacity, especially for long text. In the direct transfer methods, the limitation of the latent space likewise causes part of the semantic meaning of the original sentence to be lost.
To address these problems, we propose a "Mask and Generation" structure. In the mask part, we locate the words with higher style attributes in the sentence and replace them with mask symbols. We follow [7] to train a self-attention [8] style classifier [9] in which the learned attention weights can be used to analyze the style attribute of the words in the sentences. The larger the weight is, the stronger the style attribute. Using this attention mechanism and the style dictionary method, we can find the words with strong style attributes in the sentence. Then, we mask those words to turn the sentence into a neutral one. In the generation part, we take the sentence with the mask symbols as input and use the powerful self-attention mechanism of the transformer to generate a new sentence according to the target style. Different from [10], who filled [11] the mask positions in the sentence, we generate a new sentence with the target style by training a transformer-based [12] generation model on the input masked sentence and the specified style. This approach finds an explicit neutral representation of the sentence via the mask and then flexibly generates the sentence with the target style through the transformer. This not only ensures that the model can capture rich semantic information for the conversion but also maintains high style transfer accuracy.
Our contributions are summarized as follows:
1. We propose a "Mask and Generation" structure, which obtains an explicit representation of the content of the original sentence and generates the target sentence with a transformer. This explicit representation is a masked text in which the words with strong style attributes are masked. Therefore, it preserves most of the semantic meaning of the original sentence and also simplifies the generation process.
2. As the explicit representation is readable, the model has better interpretability: we can clearly see which words changed and why they changed.
3. We use a self-attention mechanism and the style dictionary method to analyze the contribution of each component in the sentence to the style attributes.
4. We generate the target sentence by training a transformer-based generation model on the input masked sentence and the specified style.
The experimental results show that our method generally outperforms most models, especially in style accuracy and content preservation.


Related Work
In the early stages of the research, text style transfer methods tended to find separate style and content representations of sentences. Reference [13] proposed a cross-aligned autoencoder with adversarial training [14] to learn a shared latent content distribution and a separated latent style distribution. Reference [15] focused on disentangling style and content representations, designing a multitask and adversarial loss for a variational autoencoder to ensure the separation. In addition, there are some methods that implicitly find the neutral representation of a sentence. Reference [16] argued that text style transfer can be accomplished using a combination of delete, modify, and generate operations: words with strong style attributes are deleted to find the style-independent part of the sentence, and new sentences are produced by the modify and generate operations. Reference [10] designed a two-step "Mask and Infill" approach that masks sentimental tokens and predicts words according to the target sentiment.
Although separating the style and content in a sentence helps in the text style transfer task, it is hard to guarantee the quality of the separated style distribution and content distribution. In addition, the fixed-size latent representation limits the generation ability of the model, and it inevitably loses some content information from the original sentence. Therefore, some researchers have proposed methods that do not need to separate the style and content. Reference [7] was inspired by the cycle style transfer method [17] in computer vision and proposed a cycle reinforcement learning method for nonparallel sentiment transfer tasks. Reference [18] utilized the powerful attention mechanism in the transformer to implement the direct transfer of text style. Reference [19] proposed a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context. Reference [20] defined a generative probabilistic model that treats a nonparallel corpus in two domains as a partially observed parallel corpus.
The direct transfer method also loses semantic meaning from the original sentence because of the limitation of the latent space. In addition, because of its poor interpretability, it has difficulty dealing with some adaptation problems or incorporating external information. Besides, most previous work used recurrent neural networks (RNNs) as encoders and decoders [6], which are limited by their weak ability to capture long-term dependencies. In summary, previous works can hardly produce satisfactory target-style sentences, especially in terms of preserving the semantics of the original sentence.
Building on previous work, our method explicitly obtains a sentence's neutral representation by masking the words with strong style attributes. This greatly enhances the preservation of the original content as well as the interpretability of style transfer. Different from previous work that directly predicts the masked words, we use the powerful attention mechanism in the transformer to regenerate a sentence from the neutral masked sentence according to the target style.

Approach
In this section, we will introduce our text style transfer method in detail. Section 3.1 is the basic definition of the problem, Section 3.2 is an overview of our model, and Sections 3.3-3.5 will show the details of the modules one by one.

Problem Formalization
In this paper, we define the text style transfer problem as follows. Consider a collection of datasets $\{D_i\}_{i=1}^{K}$, where each dataset $D_i$ consists of many natural language sentences. All sentences in a single dataset $D_i$ share some specific characteristics (for example, they are all positive reviews of a specific product), and we call such shared characteristics the style of these sentences. In other words, the style is defined by the distribution of the dataset. Suppose we have $K$ different datasets $D_i$. Then, we can define $K$ different styles, and each style is represented by the symbol $s^{(i)}$. The goal of text style transfer is the following: given a sentence $x$ of any style and a target style $\hat{s} \in \{s^{(i)}\}_{i=1}^{K}$, rewrite $x$ as a new sentence $\hat{x}$ with style $\hat{s}$ while retaining the content as much as possible.
Appl. Sci. 2020, 10, 6196

Model Overview
To solve the style transfer problem defined above, our goal is to learn a model that takes $(\hat{s}, x)$ as the input, where $x$ is the sentence with the original style attribute and $\hat{s}$ is the target style, and outputs a sentence $\hat{x}$ with style $\hat{s}$.
Our method consists of three parts: a Masker module, a Style Generator module, and a discriminator module, as shown in Figure 2. In Section 3.3, the Masker module combines the advantages of the two methods by using the self-attention classification model and the auxiliary style dictionary to find the words with strong style attributes in the sentence. After that, it will perform the mask operation to obtain the masked sentence. In Section 3.4, we take the masked sentence and the specified target style as the input, and the Style Generator will regenerate a sentence with the target style. Unlike the method of filling the mask position, our method will regenerate a complete sentence based on the masked sentence, so that the generated sentence is more flexible in terms of structure and semantics. In addition, a major problem in text style transfer is that there are not enough parallel corpora. Therefore, we cannot directly train our style conversion model in a supervised manner. In Section 3.5, we will introduce a discriminator-based approach [21,22] to conducting supervised training using nonparallel corpora. Finally, we will combine these three parts to train our style transfer model through the learning algorithm in Section 3.6.

Masker
We first introduce the method based on the self-attention classification model and then introduce an assisted style dictionary method. Finally, we propose a fusion method that combines the advantages of both methods.

Self-Attention Classifier-Based Method
For a sentence, each word contributes differently to the sentence's style. If a word or phrase contributes more to the style of a sentence, that component has a stronger style attribute. In other words, if we can find the components with large style contributions in each sentence and mask them, we obtain masked sentences that tend to be neutral. This is an approach to approximately obtain style-independent representations. For a sentence $x = (t_1, t_2, \ldots, t_N)$ with $N$ words, we use a bidirectional LSTM (long short-term memory) network to encode the sentence and concatenate the forward hidden state and backward hidden state of each word to obtain the final hidden state:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}], \qquad H = (h_1, h_2, \ldots, h_N),$$

where $N$ is the length of the given sentence. The self-attention mechanism calculates an attention weight vector $a$. The weighted hidden state vector $c$ is then obtained by multiplying the hidden state matrix $H$ by $a$. Finally, we convert $c$ into a probability distribution $y$ through the Softmax layer:

$$a = \mathrm{softmax}\left(w \tanh\left(W H^{\top}\right)\right), \qquad c = a H, \qquad y = \mathrm{softmax}\left(W' c^{\top}\right),$$

where $w$, $W$, and $W'$ are the network parameters. $W$ maps the hidden layer vector to a high-dimensional space to learn the impact of the input on the label. $w$ maps the resulting vector to a scalar, and the attention weights are obtained by applying Softmax over the sequence. $W'$ maps the output vector after the weighted summation to the category dimension, and the probability of each category is obtained through Softmax. After sufficient training, the classifier can achieve 97% accuracy. We use $cls_\rho$ to represent the attention-based classification model, where $\rho$ represents the model parameters.
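As an illustration of the attention pooling step, the following minimal sketch computes the attention weights and the weighted hidden state from precomputed hidden states. It assumes the common form a = softmax(w tanh(W Hᵀ)); the tanh nonlinearity, the function names `softmax` and `attention_pool`, and the plain-list data layout are assumptions made for this example, and the bidirectional LSTM that would produce `H` is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W, w):
    """Attention pooling over hidden states (illustrative sketch).

    H: N x d list of hidden states (one row per word).
    W: k x d projection matrix, w: length-k vector.
    Returns (a, c): the attention weights over the N words and the
    attention-weighted sum of the hidden states.
    The tanh nonlinearity is an assumption, not taken from the text.
    """
    scores = []
    for h in H:
        projected = [math.tanh(sum(row[j] * h[j] for j in range(len(h)))) for row in W]
        scores.append(sum(w[i] * projected[i] for i in range(len(w))))
    a = softmax(scores)
    dim = len(H[0])
    c = [sum(a[t] * H[t][j] for t in range(len(H))) for j in range(dim)]
    return a, c
```

The weights `a` are what the Masker later inspects: a word with a larger weight is treated as carrying a stronger style attribute.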
The attention values calculated during classification can be extracted and used to analyze how much each component of the sentence contributes to the style attribute. Because sentences differ in length, using an average attention value or a fixed value as the threshold [7] has certain limitations and cannot adapt to each sentence. We propose a mask method based on a proportion of the sentence length. For example, the words of a sentence are sorted according to their attention values, and then the top 15% of the words are masked. This strategy adapts to various sentence lengths.
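The proportion-based masking can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mask symbol `<mask>`, the function name `mask_by_attention`, and the choice to round the number of masked words up (with at least one word masked) are assumptions.

```python
import math

MASK = "<mask>"  # hypothetical mask symbol


def mask_by_attention(tokens, weights, proportion=0.15):
    """Mask the top `proportion` of tokens ranked by attention weight.

    Rounding up and masking at least one token are assumptions made
    for this sketch; the text only states "the top 15% of the words".
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    if not tokens:
        return []
    k = max(1, math.ceil(proportion * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    to_mask = set(ranked[:k])
    return [MASK if i in to_mask else tok for i, tok in enumerate(tokens)]
```

For a five-word sentence, a proportion of 0.15 masks only the single word with the largest attention weight, while a twenty-word sentence would have three words masked.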

Style-Dictionary-Based Method
We define this method as follows. For the datasets $\{D_i\}_{i=1}^{K}$, let $\mathrm{count}(u, D_i)$ denote the number of times the n-gram $u$ appears in dataset $D_i$. The smoothed frequency ratio represents the significance of $u$ relative to the dataset $D_i$ and is calculated as follows:

$$\mathrm{score}(u, i) = \frac{\mathrm{count}(u, D_i) + \lambda}{\sum_{j=1, j \neq i}^{K} \mathrm{count}(u, D_j) + \lambda}, \tag{4}$$

where $\lambda$ is the smoothing parameter. When $\mathrm{score}(u, i)$ is greater than the threshold $\gamma_{score}$, the n-gram $u$ is considered to be a component with a large contribution to the style attribute. Finally, these n-grams form the auxiliary style dictionary $v_i$.
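A minimal sketch of building the auxiliary style dictionary is shown below. It assumes that the smoothed frequency ratio compares the count of an n-gram in dataset D_i against its total count in all other datasets, with λ added to both numerator and denominator; this exact form, whitespace tokenization, and the helper names `ngrams`, `count_ngrams`, and `style_dictionary` are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, orders=(1, 2)):
    """Count 1-grams and 2-grams over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in orders:
            counts.update(ngrams(tokens, n))
    return counts


def style_dictionary(datasets, i, lam=1.0, threshold=0.75):
    """Return {n-gram: score} for n-grams whose score exceeds the threshold.

    Assumed score: (count in D_i + lam) / (count in other datasets + lam).
    """
    counts = [count_ngrams(d) for d in datasets]
    result = {}
    for u, c in counts[i].items():
        others = sum(counts[j][u] for j in range(len(datasets)) if j != i)
        score = (c + lam) / (others + lam)
        if score > threshold:
            result[u] = score
    return result
```

With a larger threshold, only n-grams that are much more frequent in one dataset than in the others remain in the dictionary.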

Fusion Method
In this paper, we propose a fusion method that combines the advantages of these two approaches described above. The auxiliary-style-dictionary based method has the advantages of stability and scalability. When the threshold γ score is larger, the words in style dictionary v i will be more characteristic. The self-attention classifier-based method has the advantages of flexibility and self-adaptation. It can automatically analyze the style attributes of each component of the sentence and find some components with potential style attributes.
The specific process is as follows. First, calculate the style dictionary $v_i$ through (4). For each sample, mask all the words appearing in the style dictionary $v_i$. If the words masked by the auxiliary style dictionary method do not reach the mask proportion, the self-attention classifier-based method masks additional words according to the attention value of each component. For a sample $x$ in dataset $D$, we use $\tilde{x}$ to represent the masked sentence, that is, $\tilde{x} = \mathrm{mask}(x)$, and the dataset of masked sentences is denoted as $D_{mask}$.
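The fusion procedure can be sketched as below. The sketch assumes a unigram-only style dictionary, a hypothetical `<mask>` symbol, and a target count obtained by rounding the mask proportion up; the function name `fusion_mask` is also an assumption. Dictionary words are masked first, and attention weights are consulted only when the target proportion has not been reached.

```python
import math

MASK = "<mask>"  # hypothetical mask symbol


def fusion_mask(tokens, weights, style_words, proportion=0.15):
    """Mask dictionary words first, then top-attention words if needed.

    `style_words` is a set of unigrams from the style dictionary
    (multi-word entries are ignored in this simplified sketch).
    """
    masked = [tok in style_words for tok in tokens]
    target = max(1, math.ceil(proportion * len(tokens))) if tokens else 0
    remaining = sorted(
        (i for i in range(len(tokens)) if not masked[i]),
        key=lambda i: weights[i],
        reverse=True,
    )
    for i in remaining:
        if sum(masked) >= target:
            break
        masked[i] = True
    return [MASK if m else tok for tok, m in zip(tokens, masked)]
```

When the dictionary already covers enough words, the attention weights are not used at all, which reflects the stability of the dictionary method; otherwise the classifier supplies the remaining positions.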

Style Generator
In the Style Generator, we chose the standard Transformer model, following the classic encoder-decoder structure. For the input $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N)$, the transformer encoder $\mathrm{Encoder}_\theta(\tilde{x})$ maps it to a latent continuous representation $c = (c_1, c_2, \ldots, c_N)$. Then, the transformer decoder $\mathrm{Decoder}_\theta(c)$ generates the conditional probability of the output $y = (y_1, y_2, \ldots, y_N)$ through an autoregressive calculation as follows:

$$p_\theta(y \mid \tilde{x}) = \prod_{t=1}^{N} p_\theta(y_t \mid c, y_1, \ldots, y_{t-1}).$$

For each time step $t$, the probability of generating a word in the decoder is calculated by a Softmax layer:

$$p_\theta(y_t \mid c, y_1, \ldots, y_{t-1}) = \mathrm{softmax}(o_t),$$

where $o_t$ is the logit vector output by the decoder.
To apply the target style control to the generation, we additionally add a mark of the target style before the input, similar to the <cls> mark for BERT. That is, the encoder becomes $\mathrm{Encoder}_\theta(s, \tilde{x})$. Therefore, the model can calculate the output probability conditioned on the input $\tilde{x}$ and the target style $s$:

$$p_\theta(y \mid s, \tilde{x}) = \prod_{t=1}^{N} p_\theta(y_t \mid c, y_1, \ldots, y_{t-1}), \qquad c = \mathrm{Encoder}_\theta(s, \tilde{x}).$$

We denote the Style Generator model as $f_\theta$, where $\theta$ represents the model parameters. Then, the predicted sentence calculated above is denoted by $f_\theta(s, \tilde{x})$.
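To make the style conditioning and the autoregressive factorization concrete, here is a small sketch. The style-token format `<style>`, the function names, and the representation of each decoding step as a dictionary from words to probabilities are assumptions for illustration; a real decoder would produce these distributions from its logits.

```python
import math


def build_encoder_input(style, masked_tokens):
    """Prepend a style mark to the masked sentence (hypothetical token format)."""
    return [f"<{style}>"] + list(masked_tokens)


def sequence_log_prob(step_distributions, target_tokens):
    """Log-probability of a target sentence under per-step distributions.

    Implements log p(y) = sum_t log p(y_t | ...), where step_distributions[t]
    is the Softmax output at step t given as a {word: probability} dict.
    """
    if len(step_distributions) != len(target_tokens):
        raise ValueError("one distribution is required per target token")
    return sum(math.log(dist[tok]) for dist, tok in zip(step_distributions, target_tokens))
```

The product of per-step probabilities becomes a sum of logarithms, which is the quantity that the reconstruction objective later maximizes.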

Discriminator
The purpose of introducing the discriminator module is to solve the problem of training with nonparallel corpora. When we take samples from the dataset $D$ and obtain $(s, \tilde{x})$ through the Masker module, the lack of a parallel corpus means that we cannot obtain a reference sentence for $f_\theta(\hat{s}, \tilde{x})$ when the target style $\hat{s} \neq s$. Therefore, we introduce a discriminator module to provide supervised training signals from nonparallel corpora.
For the data $(s, \tilde{x})$, we can naturally restore $\tilde{x}$ to the sentence $x$ according to its original style $s$. This self-supervised reconstruction gives the model a certain style transfer ability. For a target style $\hat{s} \neq s$, we train a discriminator network to constrain the optimization direction of the generation module so that it better generates target-style sentences.
The discriminator network we use includes a Transformer encoder, which is used to distinguish the styles of sentences. The style control information of the discriminator network will be passed to the generation module. Different from the traditional discriminator, in order to better guide the generation module during training, we refer to the discriminator training method of [18] and use two different discriminator structures. We denote the discriminant model as d ϕ , where ϕ is the model parameter.

Conditional Discriminator
Similar to the discriminator in conditional GANs (Generative Adversarial Networks), the conditional discriminator makes decisions based on the input sentence and style. Specifically, the conditional discriminator d ϕ needs to complete a binary classification task, and its inputs are the sentence x and the matching style s. The output of the discriminator d ϕ (s, x) determines whether the style of the input sentence x is s.
In the discriminator training process, for the style $s$, the positive samples are the real sentence $x$ and the reconstructed sentence $f_\theta(s, \tilde{x})$, and the negative sample is the transfer sentence $f_\theta(\hat{s}, \tilde{x})$ with target style $\hat{s} \neq s$. In the training process of the Style Generator, the goal of the generator $f_\theta$ is to maximize the probability that the discriminator judges $d_\phi(\hat{s}, f_\theta(\hat{s}, \tilde{x}))$ to be true.

Multi-Class Discriminator
Compared to the former, the multiclass discriminator only uses one sentence as its input, and its goal is to judge the style of the sentence. Unlike traditional discriminators, for a task with $K$ styles, the multiclass discriminator performs a $(K + 1)$-way classification. $K$ of the categories are the $K$ styles, and the remaining category is reserved for the transfer sentence $f_\theta(\hat{s}, \tilde{x})$ with target style $\hat{s} \neq s$. The purpose of this design is to help the generation module learn more accurate knowledge from the discriminator. As the transfer sentence $f_\theta(\hat{s}, \tilde{x})$ is usually poor at the beginning of training, assigning these sentences to a separate class pushes the generator closer to the distribution of real sentences during the iterative training process.
In the discriminator training process, the real sentence $x$ and the reconstructed sentence $f_\theta(s, \tilde{x})$ are labeled as style $s$, and the transfer sentence $f_\theta(\hat{s}, \tilde{x})$ is labeled as class 0. In the training process of the Style Generator, the goal of $f_\theta$ is to maximize the probability that the discriminator judges $f_\theta(\hat{s}, \tilde{x})$ to have style $\hat{s}$.
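The labeling schemes of the two discriminators can be summarized with the following sketch. The function names, the tuple layout, and the convention that styles are numbered 1 to K are assumptions for illustration. For the conditional discriminator, the sketch additionally assumes that the transfer sentence is paired with its target style when used as a negative sample.

```python
FAKE_CLASS = 0  # class reserved for transfer sentences in the multiclass setting


def conditional_examples(x, s, s_hat, reconstructed, transferred):
    """Return (style, sentence, label) triples; label 1 means "style matches".

    Assumption: the negative sample pairs the transfer sentence with s_hat.
    """
    return [(s, x, 1), (s, reconstructed, 1), (s_hat, transferred, 0)]


def multiclass_examples(x, s_index, reconstructed, transferred):
    """Return (sentence, class) pairs for a (K + 1)-way classifier.

    Styles are assumed to be numbered 1..K so that class 0 is free
    for transfer sentences.
    """
    return [(x, s_index), (reconstructed, s_index), (transferred, FAKE_CLASS)]
```

In both cases the generator is trained against these labels: it tries to make the discriminator assign the transfer sentence to the target style rather than to the negative or fake class.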

Training Algorithm
This section will mainly introduce the training algorithm of each module.

Masker Training Algorithm
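The training algorithm of the Masker module mainly trains a classification model based on self-attention. Its goal is to determine the style category of each sentence, and the attention weights are obtained as a by-product. This is the basis for analyzing and masking the sentence. The loss function for the Masker is the cross-entropy loss of the classification problem; that is,

$$L_{cls}(\rho) = -\mathbb{E}_{(x, s) \sim D}\left[\log p_\rho(s \mid x)\right],$$

where $p_\rho(s \mid x)$ is the probability that $cls_\rho$ assigns to style $s$ and dataset $D$ is the original training set.

Discriminator Training Algorithm
The learning algorithm of the discriminator mainly trains a classification model based on the Transformer encoder. Its goal is to distinguish between the original sentence $x$, the reconstructed sentence $f_\theta(s, \tilde{x})$, and the transfer sentence $f_\theta(\hat{s}, \tilde{x})$. The loss function for the discriminator is the cross-entropy loss of the classification problem.
For the conditional discriminator,

$$L_{dis}(\phi) = -\mathbb{E}\left[\log p_\phi(l \mid s, x)\right],$$

and for the multiclass discriminator,

$$L_{dis}(\phi) = -\mathbb{E}\left[\log p_\phi(l \mid x)\right],$$

where $l$ is the label of the sample and the expectation is taken over a dataset that consists of $x$, $f_\theta(s, \tilde{x})$, and $f_\theta(\hat{s}, \tilde{x})$. The details are given in Algorithm 1.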

Style Generator Training Algorithm
The training of the Style Generator is divided into two parts. One part is the case where the target style $\hat{s} = s$, and the other part is the case where the target style $\hat{s} \neq s$.
Sentence reconstruction: For the case when the target style $\hat{s} = s$, we can directly apply a training method that reconstructs the masked sentence $\tilde{x}$ into the original sentence $x$ by using its own supervision information. Specifically, when using the style $s$ and the masked sentence $\tilde{x}$ as input, the model output $f_\theta(s, \tilde{x})$ should be as close as possible to the original sentence $x$. The training goal is to minimize the negative log-likelihood:

$$L_{self}(\theta) = -\mathbb{E}_{\tilde{x} \sim D_{mask}}\left[\log p_\theta(y = x \mid s, \tilde{x})\right],$$

where the dataset $D_{mask}$ is obtained by masking the sentences in the original training set $D$.
Style generation: For the case when the target style $\hat{s} \neq s$, we introduce a loss function to control the generation of the style, so that the transferred sentences are closer to the distribution of the real sentences and the reconstructed sentences.
Using the conditional discriminator, we can obtain the probability that $d_\phi(\hat{s}, f_\theta(\hat{s}, \tilde{x}))$ is true. The goal of the generator is to minimize the negative log-likelihood:

$$L_{style}(\theta) = -\mathbb{E}_{\tilde{x} \sim D_{mask}}\left[\log p_\phi(l = 1 \mid \hat{s}, f_\theta(\hat{s}, \tilde{x}))\right],$$

where $l = 1$ denotes that the discriminator judges the pair to be true. Using the multiclass discriminator, we can calculate the probability that $d_\phi(f_\theta(\hat{s}, \tilde{x}))$ has the style $\hat{s}$. The goal of the generator is to minimize the negative log-likelihood of the class probability:

$$L_{style}(\theta) = -\mathbb{E}_{\tilde{x} \sim D_{mask}}\left[\log p_\phi(l = \hat{s} \mid f_\theta(\hat{s}, \tilde{x}))\right].$$

Combining the loss functions described above, we can conduct training for the generator. Algorithm 2 shows the details of the training process.
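The three generator objectives reduce to negative log-likelihoods of probabilities produced by the generator or the discriminator. The sketch below computes them from already-available probabilities; the function names and the plain-list inputs are assumptions for illustration, and how the individual losses are weighted when combined is not specified here.

```python
import math


def reconstruction_loss(token_probs):
    """Negative log-likelihood of the original sentence.

    token_probs[t] is the probability the generator assigns to the
    t-th word of the original sentence x.
    """
    return -sum(math.log(p) for p in token_probs)


def conditional_style_loss(p_true):
    """-log of the probability that the conditional discriminator says "true"."""
    return -math.log(p_true)


def multiclass_style_loss(class_probs, target_style):
    """-log of the probability the multiclass discriminator gives the target style."""
    return -math.log(class_probs[target_style])
```

Each loss approaches zero as the corresponding probability approaches one, so minimizing them pushes the generator toward faithful reconstruction and toward transfer sentences that the discriminator accepts as the target style.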

Summarization
By combining the training algorithms of the above modules, we can obtain the whole training process of the model. In the mask part, first, the self-attention model of the Masker is trained. After the model convergence is stable, the Masker masks the sentences in the original dataset D and obtains the dataset D mask . In the style generation part, using the alternate training method of the generator and discriminator in GANs [14], we also alternately train the Style Generator and Discriminator. In each iterative round, we first train the Discriminator n d steps to obtain an optimized discrimination module. Under the updated Discriminator, the Style Generator will be trained n g steps to optimize the results of the generation. The iterative training continues until the model converges and stabilizes. The specific algorithm is given in Algorithm 3.
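The alternating schedule can be sketched as a simple loop. The callbacks `train_discriminator_step`, `train_generator_step`, and `has_converged`, as well as the `max_rounds` safeguard, are hypothetical placeholders for the real update and convergence logic; the default step counts follow the values of n_d and n_g reported in the training details.

```python
def alternate_training(train_discriminator_step, train_generator_step,
                       has_converged, n_d=10, n_g=5, max_rounds=1000):
    """Alternate n_d discriminator steps and n_g generator steps per round.

    Returns the number of completed rounds. The callbacks are supplied
    by the caller; this function only fixes the order of the updates.
    """
    rounds = 0
    while rounds < max_rounds and not has_converged():
        for _ in range(n_d):
            train_discriminator_step()
        for _ in range(n_g):
            train_generator_step()
        rounds += 1
    return rounds
```

Training the discriminator first in every round means that the generator is always optimized against an up-to-date discriminator.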
In addition, one training issue needs to be briefly explained. Due to the discrete nature of natural language, when we obtain the transfer sentence and input it into the discrimination module, the gradient calculated by the discrimination module cannot be propagated back to the generation module. To solve this problem, it is common to use the Gumbel-Softmax strategy or a reinforcement learning method to estimate the gradient from the discriminator. However, both methods suffer from high variance, which makes it difficult for the model to converge and stabilize. Therefore, we handle discrete samples in the same way as [18]. Instead of directly using the generated words as the input, we use the Softmax distribution generated by $f_\theta$ as the input. Similarly, for the decoder of the generator, the decoding method is changed from greedy decoding to continuous decoding. Specifically, at each step, instead of using the word with the highest probability predicted in the previous step, we use the probability distribution as the input. For an input in the form of a probability distribution, the decoder calculates the weighted average of the word embeddings under that distribution through the embedding matrix.
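The weighted-average embedding used for continuous decoding can be illustrated with a short sketch. The function name `soft_embedding` and the list-of-lists embedding matrix (one row per vocabulary word) are assumptions for this example.

```python
def soft_embedding(probs, embedding_matrix):
    """Expected embedding under a probability distribution over the vocabulary.

    probs[v] is the probability of word v, and embedding_matrix[v] is its
    embedding. The result is sum_v probs[v] * embedding_matrix[v], which
    is differentiable with respect to probs, unlike an argmax lookup.
    """
    dim = len(embedding_matrix[0])
    return [sum(p * row[j] for p, row in zip(probs, embedding_matrix)) for j in range(dim)]
```

A one-hot distribution recovers the ordinary embedding lookup, so greedy decoding is the limiting case of this continuous form.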

Evaluation
An ideal transfer sentence should be prominent, content-complete, and fluent. After referring to the evaluations in previous work, we mainly focus on the following three aspects of sentence generation: (1) the degree of style transfer, (2) content preservation, and (3) fluency. In terms of the current evaluation methods, we include two parts: automatic evaluation and manual evaluation.

Automatic Evaluation
Style transfer: To evaluate the style transfer, we use the accuracy of the generated sentences in the style classification as an automatic evaluation metric. Specifically, we refer to [9] and use fastText to train a style classifier based on the Yelp and IMDb datasets, respectively. For the transfer sentence f θ (ŝ, x), we use the classifier to evaluate the style accuracy.
Content preservation: To measure the content preservation, we adopt the BLEU (Bilingual Evaluation Understudy) score [23] as the evaluation metric. Specifically, we use the calculation tool provided by NLTK (Natural Language Toolkit) to calculate the BLEU score between the transfer sentence and the original sentence. A high BLEU score indicates a high degree of word-level similarity between the transferred sentence and the original sentence, and thus good content preservation. In addition, if the dataset provides human-written reference sentences, we also calculate the BLEU score between the transfer sentence and the reference sentence. The two BLEU metrics are referred to as self-BLEU and ref-BLEU, respectively.
Fluency: A common way to evaluate fluency is to calculate the perplexity of the transfer sentence. Specifically, we use KenLM to train a 5-gram language model on the Yelp and IMDb datasets, and we use this model to calculate the perplexity of the sentence. The lower the perplexity is, the higher the generation probability and fluency of the sentence.
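For reference, perplexity can be computed from per-token log-probabilities as in the sketch below. It assumes base-10 log-probabilities (the convention that n-gram toolkits such as KenLM are generally understood to report) and leaves the handling of sentence-boundary tokens to the caller; the function name `perplexity` is an assumption.

```python
def perplexity(log10_probs):
    """Perplexity from per-token base-10 log-probabilities.

    Computes 10 ** (-(1 / N) * sum(log10 p_t)), so a sentence whose
    tokens are more probable under the language model scores lower.
    """
    if not log10_probs:
        raise ValueError("at least one token log-probability is required")
    return 10 ** (-sum(log10_probs) / len(log10_probs))
```

A model that assigns every token probability 0.1 yields a perplexity of 10, which can be read as the model being as uncertain as a uniform choice among ten words.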

Manual Evaluation
For the manual evaluation, we chose a scoring method and hired three reviewers to score the output of our model and the best models of [18] (Style Transformer) and [16] (Delete and Retrieve). As in the automatic evaluation, we evaluate three aspects of the output: the style transfer, the content preservation, and the fluency. Our scoring scale ranges from 1 (very poor) to 5 (very good). For each dataset, we sample 100 sentences for each target style in the evaluation.

Training Details
Mask part: For the self-attention classification model, the word embedding size is 256, the bidirectional LSTM hidden size is 256, and the number of hidden layers is two. For training, we use the Adam optimizer with a learning rate of 0.001 and train for five epochs. For the auxiliary style dictionary method, we use the statistics of 1-grams and 2-grams, set the smoothing parameter $\lambda$ to 1, and set the threshold $\gamma_{score}$ to 0.75.
Generation part: In the generator and the discriminator, we use four transformer layers, and each layer uses multi-head attention with eight heads. The word embedding, position encoding, and style embedding are all 256 in size. In the encoder, the style embedding is added to the head of the sentence as a mark, and the mark does not use position encoding information. For the discriminator, similar to BERT [24], we add the cls mark to the head of the input sentence, and the output corresponding to this mark is passed to a Softmax layer to obtain the output of the discriminator. In terms of the training parameters, $n_d$ is set to 10 and $n_g$ is set to five. We use the Adam optimizer with a learning rate of 0.0001.
In the experiments, to improve the robustness of the model and help it converge to a more reasonable result, we follow the practice of [18]. When the model calculates the reconstruction loss function (6), we conduct random word dropout on the input, which makes the model more robust. The experiments show that random word dropout improves the transfer results in some cases.
Table 2 shows the automatic evaluation results of the models on the two datasets. On the Yelp dataset, RetrieveOnly achieved the highest accuracy of 92.6% and the lowest perplexity of 7. In terms of semantic retention, however, the self-BLEU of RetrieveOnly was only 0.7 and the ref-BLEU was only 0.4, so almost no information about the original sentence was retained. This is because RetrieveOnly uses a retrieval method to find the most suitable template in the target style dataset and directly uses the template as the generated sentence. Therefore, the degree of style conversion of this method is very high, and the template sentence is almost as fluent as natural language. However, since the content of the generated sentence is completely changed compared to the original sentence, this method has no practical value. On IMDb, the CycleRL method suffers from the same problem. Although it achieves an accuracy of 97.8% in style conversion, it only has a self-BLEU of 4.9 and a perplexity of 177 in terms of semantic retention and language fluency. We can see that our model has achieved the best results in style accuracy and content preservation for both datasets, and it also has better perplexity.
For the manual evaluation, we select two representative models: Delete and Retrieve [16] and the Style Transformer [18]. For our method, we chose the model based on the multiclass discriminator. We randomly sampled 100 sentences for each style in the dataset, and the transfer sentences were shuffled and then scored by three reviewers. The results are shown in Table 3.
It can be seen that our model performs significantly better than the other methods in style transfer accuracy as well as in content preservation and fluency of the target sentence. To better understand the characteristics of our model, we sampled some sentences from the Yelp dataset, as shown in Table 4.

Table 4. Sample outputs on the Yelp dataset.
ST: the burgers are while cooked to the point the meat loved crunchy!
Ours: the burgers were cooked to the point and the meat was delicious.
Reference: the burgers were cooked perfectly and the meat was juicy.

Conclusions
In this paper, we focus on the style transfer problem for nonparallel corpora and propose a style generation method based on a "Mask and Generation" structure, which can be trained using nonparallel corpora and has good interpretability. The experimental results on two review datasets show that our method outperforms previous approaches in terms of style conversion and content preservation. The masked sentences keep the semantic meaning of the original sentence and also help us to understand the transfer process. In the future, we want to further explore the introduction of prior knowledge in the Masker module and improve the model for more fine-grained tasks (e.g., multiple emotions).