Fine-Grained Sentiment-Controlled Text Generation Approach Based on Pre-Trained Language Model

Abstract: Sentiment-controlled text generation aims to generate texts according to a given sentiment. However, most existing studies focus only on document- or sentence-level sentiment control, leaving a gap for finer-grained control over the content of generated results. Fine-grained control allows a generated review to express different opinions toward multiple aspects. Some previous works attempted to generate reviews conditioned on aspect-level sentiments, but they usually suffer from low adaptability and the lack of an annotated dataset. To alleviate these problems, we propose a novel generative model that extends a pre-trained language model and can dynamically refer to the prompt sentiment, together with an auxiliary classifier that extracts fine-grained sentiments from unannotated sentences, which allows us to train on both annotated and unannotated datasets. We also propose a query-hint mechanism to further guide the generation process toward the aspect-level sentiments at every time step. Experimental results on real-world datasets demonstrate that our model adapts well to generating aspect-level sentiment-controllable review texts with high sentiment coverage and stable quality: on both datasets, our model consistently outperforms the baseline models on metrics including BLEU-4, METEOR, and ROUGE-L. A limitation of this work is that we only focus on fine-grained sentiments that are explicitly expressed. Controllable text generation for implicitly expressed fine-grained sentiments remains an important problem for future work.


Introduction
In recent years, Transformer-based pre-trained language models (LMs) have greatly advanced the state of the art in natural language processing tasks, including natural language generation (NLG). Large-scale autoregressive Transformer models [1] that leverage large amounts of unannotated data and a simple log-likelihood training objective have achieved remarkable results in many text-generation tasks, such as machine translation, text summarization, and text style transfer. Meanwhile, for other real-world text-generation applications, such as review generation and essay writing, users prefer the generated text to be more controllable. However, since the LMs are trained on unannotated data, controlling attributes of generated text becomes difficult without modifying the model architecture to allow for extra input attributes or fine-tuning with attribute-specific data [2,3]. Therefore, some approaches, such as Plug-and-Play Language Models (PPLM) [4], control generated text through attribute models without changing the architecture or weights of pre-trained LMs. These models usually regard controllable text generation as a generation task conditioned on attributes, such as topic and sentiment, at the sentence or document level, leaving a gap for finer-grained (e.g., aspect-level) control over the content of generated texts.
The fine-grained sentiment-conditioned text-generation task aims to automatically generate a highly relevant statement when given a series of fine-grained sentiments (e.g., aspect-opinion, aspect-sentiment) as input. Zang and Wan [5] first introduced aspect-sentiment information to perform aspect-level sentiment-controllable review generation. They conducted conditional training by adopting a supervised method requiring a large dataset annotated with sentence-level aspect-sentiment labels. However, very few datasets provide sufficient fine-grained labels of this kind, and annotating all data instances is labor-intensive and time-consuming. Chen et al. [6] proposed a mutual learning framework that leverages large unlabeled data through interactive learning between the generator and the classifier. Besides aspect-sentiment pairs, aspect-opinion pairs also express aspect-level sentiment information. Therefore, inspired by these works, we introduce aspect-opinion information into fine-grained sentiment-controllable text generation.
The aspect-opinion pairs represent the fine-grained sentiments that could be expressed within a review sentence, where the aspect term refers to the target of an opinion, and the opinion term refers to the sentiment words that describe the aspect term. For example, in the sentence of Figure 1, ("hotdog", "better") is an aspect-opinion pair, where "hotdog" is an aspect term and "better" is an opinion term; together they form the backbone of the fine-grained sentiment in the review text. Therefore, the aspect-opinion conditioned generation task aims to generate a review text X that correctly contains the sentiment information from n non-repeated aspect-opinion pairs (a, o)_{1:n}. Most previous works [5,7,8] used aspect-polarity pairs rather than aspect-opinion pairs, and they used a straightforward data-to-text modeling approach, which is much more difficult because the input data are discrete and sparse. To tackle this problem, relying on the fact that aspect-opinion pairs are directly presented in sentences, our approach uses a query-hint mechanism as a dynamic prompt strategy to guide the generation direction. Furthermore, to guarantee the quality of the generated results, the generator incorporates a GPT-2 345M model [9] as the "super generator," and we extend this state-of-the-art model with our proposed query-hint mechanism and sentiment control loss function to guide the generation process toward the given controlling information. Moreover, to further enhance the generator's performance, we leverage a large unlabeled dataset to train the generator, with the assistance of a classifier that extracts the fine-grained sentiments. The experimental results demonstrate the effectiveness of these components.

Sentence: Their hotdog is better compared with tasteless bread.
Aspect-Opinion Pairs: {(hotdog, better), (bread, tasteless)}
Figure 1. An illustrative example of how the aspect-opinion pairs are expressed in a review sentence. The terms highlighted in red and blue are aspect terms and opinion terms, respectively.

Our Contributions:
• We propose a conditional generative model that extends a pre-trained state-of-the-art Transformer-based generative model with our query-hint mechanism and sentiment control loss function to further guide the text generation at a finer-grained level.
• To better model a text-to-text schema, we introduce the aspect-opinion pair as the fine-grained sentiment unit to control the constrained text generation.
• By employing an auxiliary classifier, we leverage a large unannotated dataset to re-train and fine-tune an end-to-end conditioned text generative model.
The remainder of this paper is organized as follows. Section 2 discusses the related works in controlled text generation, including the review generation and the aspect-level sentiment-controlled generation, which is less studied. Section 3 introduces our proposed approach that achieved finer-grained sentiment control in generation. In Section 4, the experimental settings are detailed, and evaluation metrics and results are also discussed to demonstrate the validity of our approach. Finally, we conclude this work in Section 5 while discussing future work.

Controlled Text Generation
Recently, there have been many studies that aim to generate text conditioned on input attributes with neural networks. Some of the earlier efforts studied controlled text generation by training a conditional generative model [10,11], while fine-tuning pre-trained models with Reinforcement Learning (RL) [3] and training a Generative Adversarial Network [12] have also shown inspiring results. The Conditional-Transformer-Language (CTRL) model [2] is a recent approach that trains a language model conditioned on a variety of control codes (e.g., "Reviews" and "Legal" control the model to generate reviews and legal texts, respectively), which are prepended as meta-data to the text during generation. Although it uses a GPT-2-like architecture to generate high-quality text, this comes at the cost of fixing the control codes and training a very large model. PPLM [4] composes a pre-trained LM with attribute controllers that guide text generation toward the desired attribute. Its flexible design allows it to control the generation process through relatively small "pluggable" attribute models while keeping the parameters of the LM fixed. Chan et al. [13] incorporated a pre-trained GPT-2 model with a Content-Conditioner (CoCon) to control the generated text under the guidance of target text content. Yu et al. [14] proposed a simple and flexible method that infuses attribute representations into a pre-trained unconditional LM without changing the LM parameters to achieve sentiment- and topic-controlled generation. Unlike our fine-grained sentiment-controlled text-generation (FSCTG) task, these works focus on sentence-level sentiment and topic control in text generation. In the FSCTG task, the text-generation process is controlled by a series of fine-grained sentiments (e.g., aspect-opinion or aspect-sentiment).

Review Generation
Review generation [7,15], a task that aims to automatically generate review text, is a related area that generates reviews conditioned on the given information. Most of the previous approaches [7,8] have framed review generation as an A2T (Attribute-to-Text) problem, leaving a gap between attributes (e.g., user, product, and rating) and linguistic data. To tackle this problem, Kim et al. [16] proposed AT2T (Attribute-matched-Text-to-Text), which augments the inductive biases of attributes with matching reference reviews to learn rich representations of attributes.

Aspect-Level Sentiment Control
Nevertheless, most of these works only focus on sentence-level sentiments and ignore aspect-level sentiment control, and very few researchers have studied generating reviews from fine-grained sentiments due to the lack of annotated data. Zang and Wan [5] made the first attempt to generate reviews from aspect-sentiment scores, which requires reviews with sentence-level aspect-sentiment score annotations. This makes it impractical in real-world applications due to the lack of labeled data. To tackle this problem, Chen et al. [6] proposed a semi-supervised aspect-level sentiment-controllable review generation method. Under their mutual learning framework, and with the assistance of a classifier, it can take advantage of large-scale unlabeled data to achieve aspect-level sentiment control in review generation with few labeled data. Fei et al. [17] combined fine-grained sentiment classification and generation tasks as a joint dual learning system, strengthening the mutual connection of both tasks. To overcome the sparse and discrete nature of the input data in the data-to-text scheme, Yuan et al. [18] proposed a hierarchical template-transformer (HTT); they split the generation task into two corresponding pipeline subtasks, i.e., opinion phrase generation and review composition, which were jointly trained on the HTT. Although in different ways, they all trained an efficient end-to-end generative model. However, they did not attempt to dynamically adjust the attention weights during the model's generation process, even though some content (e.g., the completion of sentiment word generation) is informative to the global generation and should be signaled to the model.

Method
In this section, we introduce our fine-grained sentiment-controllable text-generation task together with a conditional generative model named Aspect-level Sentiment Conditioner (AlSeCond), which was trained with both labeled and unlabeled data to learn a fine-grained sentiment review generator with the assistance of a classifier.
First, we give the formalization of our fine-grained sentiment-controllable text-generation task. Specifically, given the fine-grained sentiment units (i.e., aspect-polarity or aspect-opinion pairs) as the input s, the model generates a target text X that covers the input sentiments. Data-to-text modeling, the straightforward approach used in other studies [5,7,8], can be much more difficult than text-to-text modeling because the input data are discrete and sparse [17]. Therefore, in this work, we translate this task into a text-to-text formulation. Conveniently, given an aspect and a polarity, it is effortless to retrieve opinion phrases from aspect sentiment triplets (AST [19], i.e., the triplet of aspect, opinion, and sentiment polarity) extracted from the review text. This work, therefore, sets s = {(a_1, o_1), (a_2, o_2), ..., (a_n, o_n)} and aims to generate a review text X comprising m words (X = {x_1, x_2, ..., x_m}), which properly presents each aspect phrase a_i and its corresponding opinion phrase o_i (i ∈ {1, 2, ..., n}).
In this task, we have a labeled dataset L and an unlabeled dataset U. In the labeled dataset L, each labeled datum ℓ ∈ L comprises a review text and a list of aspect-opinion phrase pairs s, i.e., ℓ = ⟨X, s⟩, while in the unlabeled dataset U, each u ∈ U only contains a review text, i.e., u = ⟨X⟩.
In the following subsections, we first introduce our main framework for how to train a generator on both labeled and unlabeled datasets. Then, we explain our generator and classifier in detail.

Main Framework
To make full use of both the limited labeled dataset and the large unlabeled dataset, inspired by Chen et al. [6], in the case of a text generator G, our proposed method additionally employs a sentiment classifier C, which is incorporated to extract all aspect sentiment triplets (aspect, opinion, polarity) in each sentence through a sequence-labeling schema, thus yielding pseudo labels for the unlabeled dataset. We assume that the generator can enhance itself by leveraging a large dataset with pseudo labels predicted by the classifier.
In order to benefit from both the data size of the unlabeled dataset and the correctness of the labeled dataset, we train our model sequentially using these two datasets. Specifically, as shown in Figure 2, following Chen et al. [6], we adopt three steps to make full use of the large unlabeled dataset:
• Step 1: We train both our generator and classifier on a limited labeled dataset to get G0 and C0, respectively.
• Step 2: C0 is then used to extract the fine-grained sentiments in the large unlabeled dataset, thus yielding the pseudo labels for the next step's training.
• Step 3: The generator is trained again on the unlabeled dataset with the attached pseudo labels. Finally, the generator is fine-tuned with the labeled dataset (used in Step 1) to obtain the final generator G1.

Figure 2. Illustration of the training steps (Step 1 to Step 3) for the generator and classifier on the limited annotated dataset and the large unannotated dataset. Note that "X", "s", "G", and "C" represent the review text, fine-grained sentiment, generator, and classifier, respectively.
As a result, we obtain an enhanced generator G1 trained on both the limited labeled dataset and the large unlabeled dataset.
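The three training steps can be sketched as a small orchestration function. This is only an illustrative sketch: `train_generator` and `train_classifier` are hypothetical callables standing in for the actual training routines, and dropping unlabeled sentences for which the classifier extracts no pair is an assumption made here, not something the text specifies.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]             # (aspect phrase, opinion phrase)
Labeled = Tuple[str, List[Pair]]   # (review text X, sentiment pairs s)


def semi_supervised_training(
    labeled: List[Labeled],
    unlabeled: List[str],
    train_generator: Callable[[object, List[Labeled]], object],   # hypothetical
    train_classifier: Callable[[List[Labeled]], Callable[[str], List[Pair]]],  # hypothetical
    init_generator: object,
) -> object:
    # Step 1: train generator and classifier on the limited labeled data.
    g0 = train_generator(init_generator, labeled)
    c0 = train_classifier(labeled)

    # Step 2: use C0 to produce pseudo labels for the unlabeled sentences.
    pseudo = [(x, c0(x)) for x in unlabeled]
    # Assumption: sentences with no extracted pair carry no control signal.
    pseudo = [(x, s) for x, s in pseudo if s]

    # Step 3: train on pseudo-labeled data, then fine-tune on the labeled data.
    g_pseudo = train_generator(g0, pseudo)
    g1 = train_generator(g_pseudo, labeled)
    return g1
```

Keeping the final fine-tuning pass on the labeled data last means the noisier pseudo labels never have the final say over the generator's parameters.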

Generator
Unconditional language models (LMs) are trained on huge amounts of unlabeled text data to optimize the probability p(x_i | x_1:x_{i−1}) in an auto-regressive manner [20,21], where x_i is the next token and x_1:x_{i−1} are the previous tokens. In controlled text generation, the conditional distribution p(x_i | a, x_1:x_{i−1}) is optimized instead, where a is the attribute with which the model controls the generation.
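For reference, the two distributions correspond to the standard chain-rule factorization of a sequence X = {x_1, ..., x_m}. The block below only restates that well-known identity in LaTeX, using the symbols of this section:

```latex
% Unconditional auto-regressive LM
p(X) = \prod_{i=1}^{m} p\left(x_i \mid x_{1:i-1}\right)

% Attribute-conditioned LM, with control attribute a
p(X \mid a) = \prod_{i=1}^{m} p\left(x_i \mid a,\, x_{1:i-1}\right)
```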
To make use of an LM pre-trained on large unlabeled datasets, we need to infuse the attribute a into the unconditional distribution p(x_i | x_1:x_{i−1}). Moreover, the pre-trained Transformer-based language model GPT-2 [9] has demonstrated remarkable natural text generation in an auto-regressive manner in recent years. Therefore, to improve the quality of the generated texts, our generative model incorporates a pre-trained GPT-2 model as the "super generator," and we extend the decoder blocks of this pre-trained state-of-the-art language model with fine-grained sentiment infusion blocks, which are stacked in the AlSeCond.
Essentially, the GPT-2 model is a stack of Transformer-Decoder blocks, each consisting of layer normalization [22], multi-head self-attention [1], and position-wise feed-forward operations. Our AlSeCond blocks extend this kind of decoder block and incorporate a sentiment infusion operation together with our proposed query-hint mechanism to conditionally infuse the fine-grained sentiments into the next-token prediction process.
The sentiment infusion operation is performed inside the AlSeCond blocks. Figure 3 briefly illustrates how our AlSeCond model works. Specifically, the target fine-grained sentiment pairs s0 are prepended sequentially as a prompt to the regular sequence s1 to form S. This combined sequence S is then encoded to h (h = [h_0; h_1], where h_0 and h_1 are the hidden representations of s0 and s1, respectively) through the stacked AlSeCond blocks, so that h_1 at time step t self-attends to the hidden states of the regular sequence h_1 for the previous t time steps and, further, to all time steps of the fine-grained sentiment pairs h_0. Therefore, the sentiment representation h_0 is infused into the intermediate representation h_1 to control the next-token logits (o) and hence the generation process.
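The prompt construction described above can be sketched with a few lines of NumPy. The helper names `build_sequence` and `causal_mask` are hypothetical, whitespace tokenization is a simplifying assumption (the model itself uses BPE), and the mask is the ordinary causal mask of a decoder, shown only to make visible that every sentence position can attend to the whole prompt.

```python
import numpy as np


def build_sequence(pairs, sentence_tokens):
    """Prepend the sentiment pairs s0 as a prompt to the regular sequence s1."""
    # Assumption: phrases are split on whitespace instead of BPE.
    prompt = [tok for a, o in pairs for phrase in (a, o) for tok in phrase.split()]
    return prompt, prompt + list(sentence_tokens)


def causal_mask(total_len):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((total_len, total_len), dtype=bool))


pairs = [("food", "good"), ("server", "rude")]
sentence = "They had good food but the server was rude".split()
prompt, full = build_sequence(pairs, sentence)
mask = causal_mask(len(full))

# The first sentence token sits right after the prompt; it sees all prompt
# tokens (h_0) but none of the later sentence tokens.
first = len(prompt)
print(mask[first, :first].all(), mask[first, first + 1:].any())
```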
Figure 3. Example from the illustration: the prompt "food good server rude" is prepended to the sentence "They had good food but the server was rude", and the model predicts the next tokens "had good food but the server was rude".
Our AlSeCond block (illustrated in the pink block in Figure 4) is a special Transformer-Decoder block that incorporates our proposed query-hint mechanism to guide the controlled generation process. Specifically, for the fine-grained sentiment-appended hidden states h = [h_0; h_1] (h_0 and h_1 are the hidden states of the sentiment and regular sequences, respectively), the key, value, and a special hinted query matrix (K, V, Q′ ∈ R^{(l_s+t)×d}, where l_s and t are the lengths of the appended sentiments and the regular sequence, respectively) are computed to perform a query-hinted self-attention. Furthermore, during the computation of the hinted query matrix Q′, we infuse K_0 ∈ R^{l_s×d}, the sentiments' part of K, into Q_1 ∈ R^{t×d} at the corresponding time steps as the query-hint. This is done by our proposed function f_hint(·), which strategically allocates the sentiments' representation to Q_1 as the query-hint information. Here, M_h ∈ R^{t×n} is an adjacency matrix representing which sentiment pair should be hinted at each time step in Q_1, n is the number of sentiment pairs, and l_a (a ∈ {1, 2, ..., n}) is the end index of the a-th sentiment pair in S. As a result, we guide the text generation by infusing the sentiment information into the generation process through the query-hinted self-attention operation.
Figure 4. Architecture of the generator. The model stacks 24 AlSeCond blocks with the same structure. The dashed lines in the block represent the general attention, while the red solid lines represent the attention that is hinted with prompt key values.
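To make the data flow of a query-hinted self-attention concrete, here is a single-head NumPy sketch. The exact form of f_hint is not reproduced here; the sketch assumes, purely for illustration, that the hint for a time step is the key vector at the end index l_a of the hinted pair, added to the corresponding row of Q_1 (that is, Q_1 + M_h K_0[l_a]). The function name `query_hinted_attention` and this additive form are assumptions, not the paper's definition.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_hinted_attention(Q, K, V, M_h, pair_end_idx):
    """Single-head causal attention with an additive query hint (assumed form).

    Q, K, V      : arrays of shape (l_s + t, d)
    M_h          : (t, n) 0/1 matrix, M_h[step, a] = 1 if pair a is hinted at step
    pair_end_idx : length-n list, end index l_a of each pair inside the prompt
    """
    Q = np.asarray(Q, dtype=float).copy()
    t = M_h.shape[0]
    l_s = Q.shape[0] - t
    d = Q.shape[1]

    K0_pairs = K[pair_end_idx]        # (n, d): one key vector per sentiment pair
    Q[l_s:] += M_h @ K0_pairs         # assumption: additive hint on Q_1 only

    scores = Q @ K.T / np.sqrt(d)
    causal = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores, axis=-1) @ V
```

With an all-zero M_h the function reduces to ordinary causal self-attention, which makes it easy to check that the hint only changes the rows of the hinted time steps.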

Query-Hint Mechanism
Since the distance between the prompt and the next-token prediction correlates negatively with the prompt's influence [23], it is difficult to use a prompt to guide a non-adjacent piece of text, especially when the generation time step is far away from the prompt. In other words, the prompt and the regular sentence share equal importance, which is inadequate for prompt-based generative models because the prompt tokens propagate less dominant information to the next-token prediction as the sequence grows. Our idea is similar to that of Xia et al. [24]: the actual importance of information from different sentiment units differs for each token in a sentence, so the units need to be attended to differently. Therefore, as mentioned in Section 3.2, we introduce a query-hint mechanism to further remind each generation time step about the content that should follow. The main idea of this mechanism is to let the generation process know what text to generate in order to reach the next sentiment text.
Specifically, for each general sentiment pair, its aspect and opinion phrases have their own corresponding subsequence to provide query-hints. As shown in Figure 5 (e.g., 1 to 1), a sentiment pair's member starts its query-hint at the beginning of the sentence or at the end step of the previous sentiment pair and stops before it is fully presented. The hinted steps form a "hint-unit" (framed in the red dotted line in Figure 5).
Figure 5. Strategy of the query-hint mechanism. This illustration demonstrates two different instances of the query-hint strategy, i.e., "1 to 1" and "1 to n," which correspond to the one-to-one and one-to-many situations for aspect-opinion pairs, respectively.
In the source sentences, however, there are also some sentiment pairs that share the same phrase either in aspect or opinion (e.g., (food, great), (drinks, great)). Therefore, in order to make query-hint consistent in the training and generation process, given n sentiment pairs that share the same aspect/opinion phrase, their query-hints are merged into one "hint-unit". As shown in Figure 5 (e.g., 1 to n), within the "hint-unit", each aspect/opinion phrase gives the query-hint sequentially.
Although our query-hint strategy in the training process is almost identical to that in the generation process, there is a slight difference between them. During training, the corresponding time steps in the sentence are provided with query-hints according to the position at which each piece of sentiment information is presented in the sentence. During generation, since the part of the sentence that has not yet been generated is unknown, the query-hint has to be allocated according to the already generated part of the sentence.
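The difference between the training-time and generation-time allocation can be sketched as two small helpers. Both are simplified illustrations: they hint at the level of whole pairs rather than separately for aspect and opinion phrases, `build_hint_matrix` assumes the pairs are ordered by where they are completed in the sentence, and `next_pair_to_hint` uses naive substring matching. All names are hypothetical.

```python
import numpy as np


def build_hint_matrix(sent_len, pair_end_positions):
    """Training: the positions are known, so M_h can be built in one pass.

    pair_end_positions[a] is the token index at which pair a becomes fully
    presented. Pair a is hinted from the end of the previous pair (or the
    sentence start) up to, but not including, that index.
    """
    M_h = np.zeros((sent_len, len(pair_end_positions)))
    start = 0
    for a, end in enumerate(pair_end_positions):
        M_h[start:end, a] = 1.0
        start = end
    return M_h


def next_pair_to_hint(generated_tokens, pairs):
    """Generation: hint the first pair not yet fully present in the prefix."""
    text = " ".join(generated_tokens)
    for i, (aspect, opinion) in enumerate(pairs):
        if aspect not in text or opinion not in text:
            return i
    return None  # every pair has been expressed; no hint needed


tokens = "They had good food but the server was rude".split()
pairs = [("food", "good"), ("server", "rude")]
# "food" completes pair 0 at index 3, "rude" completes pair 1 at index 8.
M_h = build_hint_matrix(len(tokens), [3, 8])
```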

Loss Functions
Generation loss function: Through an LM training objective, we train our conditional generative model with the general generation loss term, conditioned on the previous tokens x_{:t−1} and the input sentiment information s, where x_t is the predicted token at time step t and I_x(·) is the index function of a vector.
Sentiment control loss function: To encourage the generator to output texts that incorporate the input sentiment information (phrases), we additionally train the generator with our proposed sentiment-control loss function. The main idea of this loss function is, for each given aspect/opinion word, to take the highest probability assigned to that word among all next-word predictions of a sentence and to maximize it. Specifically, for every aspect phrase a and opinion phrase o presented in the source text, the training loss comprises L_a and L_o, the losses for aspect and opinion term inclusion, respectively. Mask_{a,t} / Mask_{o,t} is a one-hot vector of size V (the vocabulary size) in which only the element at the index of a_t / o_t is 1. φ_mean is a hyper-parameter controlling how much the prediction of aspect/opinion terms should be enhanced. p_max(·) is a max-pooling operation with a kernel size of l_t × 1 (l_t is the length of the target text). ⊙ and ⊕ represent the element-wise product and XOR, respectively.
As a result, our final loss function combines the loss of generation quality and the loss of sentiment control, weighted by λ hyper-parameters that control how much each loss term dominates the training.
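The interplay of the loss terms can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the exact formulation: the generation loss is the usual mean negative log-likelihood, the sentiment-control term keeps only the max-pooling-over-time idea (it omits the masks, the XOR, and φ_mean), and the combination L = λ_G · L_G + λ_senti · (L_a + L_o) is an assumed form. The default weights 1 and 8 are the values reported in the experimental settings.

```python
import numpy as np

EPS = 1e-12  # numerical floor to avoid log(0)


def generation_loss(probs, target_ids):
    """Mean negative log-likelihood of the target tokens.

    probs      : (T, V) next-token distributions, one row per time step
    target_ids : length-T sequence of gold token ids
    """
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(np.clip(picked, EPS, 1.0)).mean())


def sentiment_control_loss(probs, term_ids):
    """Max-pool each vocabulary entry over time, then penalize low peaks
    for the given aspect or opinion token ids (simplified assumption)."""
    pooled = probs.max(axis=0)  # (V,): best probability each token ever gets
    return float(-np.log(np.clip(pooled[term_ids], EPS, 1.0)).mean())


def total_loss(probs, target_ids, aspect_ids, opinion_ids,
               lambda_g=1.0, lambda_senti=8.0):
    """Assumed combination: lambda_g * L_G + lambda_senti * (L_a + L_o)."""
    l_g = generation_loss(probs, target_ids)
    l_a = sentiment_control_loss(probs, aspect_ids)
    l_o = sentiment_control_loss(probs, opinion_ids)
    return lambda_g * l_g + lambda_senti * (l_a + l_o)
```

Pooling over time means the sentiment term does not care where in the sentence an aspect or opinion word is predicted, only that it receives high probability at some step.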

Classifier
In this section, we first give the task definition of Aspect Opinion Pair Extraction (AOPE) and then briefly introduce the model architecture of our sentiment classifier C.
The task of AOPE aims to extract aspect terms and their corresponding opinion terms as pairs [25–27]. This task can be defined as follows: given a sentence with m words X = {x_1, x_2, ..., x_m}, the goal is to extract all aspect-opinion pairs τ = {(a, o)_n}_{n=1}^{|τ|} from X, where (a, o)_n is an aspect-opinion pair presented in X and the notations a and o denote an aspect term and an opinion term, respectively.
The overall architecture of our classifier, the two-dimensional interaction-based multi-task learning framework (2D-IMLF), is shown in Figure 6. Given an input sentence, two highly related extraction subtasks (aspect term extraction and opinion term extraction) are adopted to learn aspect-related and opinion-related features, respectively. Then, to capture different interactive features of aspect terms and opinion terms, a 2D interactive representation is obtained by tensor composition. Finally, the classifier model treats the AOPE task as a grid tagging problem and obtains the final results by applying a decoding algorithm [28].
As shown in Figure 6, we first use a group of CNN layers to encode the input sentence and obtain its hidden states, where k ∈ {1, 2, 3, ...} represents the kernel size of a 1D-CNN. Then, a Bi-LSTM layer together with multi-head self-attention is incorporated to extract the context information from the sentence. Afterward, we concatenate the hidden state H_c with its transposed state H_c^T to get a grid-formed feature. We then obtain the prediction probabilities P_a^c and P_o^c for aspect and opinion terms, respectively, from the final logits P. Finally, by using a grid-formed tagging schema [28], we can easily obtain a series of aspect-opinion pairs.
Figure 6. Architecture of the classifier. This model incorporates 2D interaction representation and grid-formed tagging schema [28] to extract all aspect and opinion phrases in a sentence.
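The final decoding step over a tagged grid can be illustrated with a simplified sketch. It assumes a tag set in the spirit of grid tagging, where diagonal cells carry "A" (aspect word) or "O" (opinion word), an off-diagonal cell carries "P" when the two words belong to a paired aspect and opinion, and "N" means none. The function names and the exact linking rule are assumptions for illustration, not the decoding algorithm of [28] verbatim.

```python
def spans(tags, label):
    """Return inclusive (start, end) runs of consecutive `label` tags."""
    out, start = [], None
    for i, tag in enumerate(list(tags) + [None]):  # sentinel closes a final run
        if tag == label and start is None:
            start = i
        elif tag != label and start is not None:
            out.append((start, i - 1))
            start = None
    return out


def decode_pairs(words, grid):
    """Decode aspect-opinion pairs from an n x n grid of tags.

    Only the diagonal and the upper triangle of `grid` are read.
    """
    n = len(words)
    diagonal = [grid[i][i] for i in range(n)]
    aspects = spans(diagonal, "A")
    opinions = spans(diagonal, "O")

    pairs = []
    for a0, a1 in aspects:
        for o0, o1 in opinions:
            # Assumed rule: one "P" cell between the two spans links them.
            linked = any(
                grid[min(i, j)][max(i, j)] == "P"
                for i in range(a0, a1 + 1)
                for j in range(o0, o1 + 1)
            )
            if linked:
                pairs.append((" ".join(words[a0:a1 + 1]),
                              " ".join(words[o0:o1 + 1])))
    return pairs
```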

Experiments
In this section, we first introduce datasets and settings in our experiment and then report the evaluation metrics and results.

Dataset and Settings
We conduct experiments on three real-world datasets, two labeled and one unlabeled; the statistics of the datasets are reported in Table 1. The experimental settings are also listed in this subsection.
Table 1. Statistics of the labeled and unlabeled datasets. Note that "Val" is short for "Validation"; ASTE-Data-V2-Rest is labeled with aspect, opinion, and polarity, while MAMS-ASTA is labeled with only aspect and polarity.
We conduct experiments on aspect-opinion and aspect-polarity conditioned controllable text generation on English restaurant reviews with ASTE-Data-V2 from Xu et al. [29] and MAMS-ASTA from Jiang et al. [30], respectively.
MAMS-ASTA: MAMS (Multi-Aspect Multi-Sentiment; https://github.com/siat-nlp/MAMS-for-ABSA, accessed on 14 May 2022) [30] is an aspect-level sentiment-labeled dataset. Each data instance in MAMS-ASTA is labeled with at least two aspects with different sentiment polarities, while no opinion term is labeled. Therefore, by using our classifier to retrieve opinion phrases according to the original aspect-polarity pairs, we also conduct aspect-level sentiment-controllable text generation on MAMS-ASTA.

Unlabeled Dataset
To ensure that the training data are in the relevant review domain, we use Yelp's review dataset (https://www.kaggle.com/yelp-dataset/yelp-dataset, accessed on 18 May 2022) as the unlabeled dataset and filter out the sentences with a length greater than 150. Unlike the labeled datasets, the Yelp dataset does not contain fine-grained sentiment labels. Therefore, we only use the sentences in the unlabeled data and discard other items, including user information.

Experimental Settings
Generator: In the experiment, we train our AlSeCond model, which extends a pre-trained GPT-2 medium 345M model [9]. The AlSeCond blocks clone the parameters and settings of the GPT-2 Transformer blocks. To ensure the generator can compute the probability of (and also generate) any string, we apply Byte Pair Encoding (BPE) [34] to the inputs. The maximum generation length is set to 32. We set λ_G and λ_senti to 1 and 8, respectively. Adam [35] is used for optimization, the batch size is set to 16, and the learning rate is set to 5 × 10^−5. In the G0 stage, the generator is trained with the labeled and pseudo-labeled datasets for 4 and 2 epochs, respectively. In the subsequent G1 stage, the generator is fine-tuned with the labeled dataset for 24 epochs. We apply the above steps to train our model on an RTX A4000 GPU for 20 h. The same steps are also applied to train the baseline models. We run our model and all baselines five times and average the scores.
Classifier: Following GTS [28], we combine a 300-dimension domain-general embedding from pre-trained GloVe [36] and a 100-dimension domain-specific embedding trained with fastText [37] to initialize double word embeddings. We use Adam as the optimizer, and the learning rate is 5 × 10^−4. The batch size and dropout rate are set to 32 and 0.5, respectively. The number of hidden units in the Bi-LSTM is set to 128.

Baselines
We compare with six baselines. PPLM [4] incorporates a BoW (bag of words) attribute model to steer a pre-trained GPT-2 model toward increasing the generation probability of the target words. In this baseline, the BoW is formed from the words contained in the target sentiment pairs. For HTT [18], we omit the process of opinion phrase generation and only use its results (i.e., sentiment pairs) to compose the review. T5 [38], a state-of-the-art text-to-text model, is pre-trained with a multi-task objective by prepending a task description to the input text. Following this schema, we insert the sentiment pairs into the prompt, forming "generate a sentence with a_1 is o_1, ..., a_n is o_n.", and fine-tune the model with the target sentence. Among the baselines, its coverage of the input sentiment pairs serves as an upper bound. Moreover, we also fine-tune UniLM [39], UniLM-v2 [40], and BERT-Gen [40] in a similar sequence-to-sequence fashion with both the large unlabeled dataset and the limited labeled dataset.
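The prompt template used for the T5 baseline can be written as a small helper. The function name `build_t5_prompt` is hypothetical; the template string itself follows the format quoted above.

```python
def build_t5_prompt(pairs):
    """Format aspect-opinion pairs as the task-description prompt for T5."""
    body = ", ".join(f"{aspect} is {opinion}" for aspect, opinion in pairs)
    return f"generate a sentence with {body}."


prompt_text = build_t5_prompt([("hotdog", "better"), ("bread", "tasteless")])
print(prompt_text)
```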

Generated Quality Evaluation
To study the performance of these models in a diversified manner, we conduct evaluations on both the quality and sentiment coverage of the generated text.

Fluency and Diversity Evaluation
We conduct a fluency evaluation on the generated texts with automatic metrics: BLEU [41], ROUGE [42], and METEOR [43], which measure the similarity between the generated text and the ground truth based on n-gram matching. The diversity of the generations is also an important indicator. We measure diversity for the generated results with Dist-1, Dist-2, and Dist-3 [44] scores and Self-BLEU [45]. Table 2 shows the fluency and diversity results from the automatic evaluations. From the results, we can observe that: (1) Compared with the baseline models, our AlSeCond model extended from GPT-2 achieves better performance in the fluency evaluations. (2) In the diversity metrics, our AlSeCond model performs much better than the baselines on the MAMS-ASTA dataset, which means the results generated by our model are less like template-generated text than those generated by other models.
Table 2. Results for the fluency and diversity evaluation. Note that "↑" means the higher the better, "↓" means the lower the better, and "w/o" means "without".
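As an illustration of the diversity metrics, Dist-n is commonly computed as the ratio of distinct n-grams to all n-grams in the generated texts. The sketch below pools n-grams over the whole set of generations and tokenizes on whitespace; both choices are assumptions, since the exact variant used in the evaluation is not spelled out here.

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over all sentences."""
    total = 0
    unique = set()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


samples = ["the food is good", "the food is bad"]
dist1 = distinct_n(samples, 1)  # 5 unique unigrams out of 8
dist2 = distinct_n(samples, 2)  # 4 unique bigrams out of 6
```

Higher values indicate less repetition across generations, which is why template-like outputs score low on this metric.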

Sentiment Evaluation
To measure how well the generated sentence contains the input sentiments and whether they are correctly expressed, we employ two metrics. Coverage (Cov.), as in Lin et al. [46], is the average rate of input sentiment pairs presented in the generated texts. This metric includes Cov-a, Cov-o, and Cov-ao, representing the presence rates of aspects, opinions, and aspect-opinion pairs, respectively. Accuracy (Acc.) is a rate indicating how many fine-grained sentiments are accurately expressed in the sentence, and it is evaluated by the external sentiment classifier [30] trained on MAMS-ASTA. Table 3 shows the results of sentiment coverage and accuracy for the generated texts. It is worth noting that the aspect-level sentiments of a linguistically complicated sentence are more difficult for the external classifier to predict correctly than those of a relatively simple sentence, so the measured sentiment accuracy may be lower than the actual one. Moreover, T5's original seq2seq architecture allows it to generate texts that correspond closely to the input sequences. Hence, its coverage and accuracy scores serve as an upper bound, although the syntax of its generated results is relatively simple and repetitive.
Table 3. Results for the sentiment evaluation. Note that Accuracy (Acc.) is a rate indicating how many fine-grained sentiments are accurately expressed in the sentence, and it is automatically evaluated by an external classifier.
Comparing the above metric results for all models on the different datasets, we can observe that our model has stable advantages on both ASTE-Data-V2 and MAMS-ASTA, which indicates that our AlSeCond model has stronger adaptability. Additionally, Figure 7 presents the learning curves for fine-tuning all models with the labeled dataset, which also demonstrates the strong capabilities of our model compared to the baselines.
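The three coverage scores can be sketched as follows. The helper name `coverage` is hypothetical, presence is tested with case-insensitive substring matching, and the rates are micro-averaged over all input pairs; these are illustrative assumptions rather than the exact evaluation script.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (aspect phrase, opinion phrase)


def coverage(texts: List[str], pairs_per_text: List[List[Pair]]) -> Dict[str, float]:
    """Share of input aspects, opinions, and full pairs found in the outputs."""
    hit_a = hit_o = hit_ao = total = 0
    for text, pairs in zip(texts, pairs_per_text):
        lowered = text.lower()
        for aspect, opinion in pairs:
            has_a = aspect.lower() in lowered
            has_o = opinion.lower() in lowered
            hit_a += int(has_a)
            hit_o += int(has_o)
            hit_ao += int(has_a and has_o)  # both members must appear
            total += 1
    if total == 0:
        return {"Cov-a": 0.0, "Cov-o": 0.0, "Cov-ao": 0.0}
    return {
        "Cov-a": hit_a / total,
        "Cov-o": hit_o / total,
        "Cov-ao": hit_ao / total,
    }


scores = coverage(
    ["Their hotdog is better than the bland bread."],
    [[("hotdog", "better"), ("bread", "tasteless")]],
)
```

In this example both aspects appear but only one opinion does, so Cov-a is 1.0 while Cov-o and Cov-ao are 0.5.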
Figure 8 presents some generated cases from AlSeCond, HTT, T5, UniLM, BERT-Gen, and UniLM-v2. From the cases, we found that: AlSeCond tends to generate more linguistically complicated sentences, while the other baselines are more likely to focus on generating review texts that simply express the input information and less on the complexity of the expressions and the syntaxes.

Conclusions and Future Work
In this paper, we propose a fine-grained sentiment-controllable text-generation method based on a pre-trained language model and an auxiliary sentiment classifier, which utilizes both labeled and unlabeled datasets to achieve aspect-level sentiment control in text generation. Our proposed query-hint mechanism and fine-grained sentiment control loss function have greatly enhanced the generator's ability to control the sentiment during the text-generation process. Experiments on real-world datasets have demonstrated our generator's ability to generate aspect-level sentiment-controllable review statements with high quality and diverse syntax.
For future work, we will explore the controllable text generation for implicitly expressed fine-grained sentiments (e.g., in this sentence: "We had to constantly ask the waiter to top up water glasses.", the reviewer had a negative opinion of the waiter although there is no related opinion phrase in the sentence.), since the query-hint mechanism proposed in this paper is only effective for explicitly expressed fine-grained sentiments.

Conflicts of Interest:
The authors declare no conflict of interest.
