Sentiment Analysis Methods for HPV Vaccines Related Tweets Based on Transfer Learning

The widespread use of social media provides a large amount of data for public sentiment analysis. Based on social media data, researchers can study public opinions on human papillomavirus (HPV) vaccines on social media using machine learning-based approaches that will help us understand the reasons behind the low vaccine coverage. However, social media data is usually unannotated, and data annotation is costly. The lack of an abundant annotated dataset limits the application of deep learning methods in effectively training models. To tackle this problem, we propose three transfer learning approaches to analyze the public sentiment on HPV vaccines on Twitter. One was transferring static embeddings and embeddings from language models (ELMo) and then processing by bidirectional gated recurrent unit with attention (BiGRU-Att), called DWE-BiGRU-Att. The others were fine-tuning pre-trained models with limited annotated data, called fine-tuning generative pre-training (GPT) and fine-tuning bidirectional encoder representations from transformers (BERT). The fine-tuned GPT model was built on the pre-trained generative pre-training (GPT) model. The fine-tuned BERT model was constructed with BERT model. The experimental results on the HPV dataset demonstrated the efficacy of the three methods in the sentiment analysis of the HPV vaccination task. The experimental results on the HPV dataset demonstrated the efficacy of the methods in the sentiment analysis of the HPV vaccination task. The fine-tuned BERT model outperforms all other methods. It can help to find strategies to improve vaccine uptake.

Human papillomavirus (HPV) is the most widespread sexually transmitted infection (STI) around the world. It has been established that approximately 4% of all cancers are associated with HPV [1]. HPV vaccines can prevent most cancers and diseases caused by HPV infections [8]. Despite the recommendation about the vaccine's safety and effect, HPV vaccination rates in many countries are still far lower than the goal set by Healthy People 2020 of 80% series completion for both adolescent males and females [9]. We need to explore the public sentiments towards HPV vaccination and then take corresponding measures to improve the vaccination rate further. Du et al. [10] collected and manually annotated 6000 tweets related to the HPV vaccine. Then, they constructed a hierarchical SVMs (support vector machines) and evaluated different feature combinations. Finally, they optimized the model parameters to maximize the model performance in analyzing public attitudes. Zhou et al. [11] used the connection information on social networks to improve the recognition of the negative emotions towards HPV vaccination.
However, most of these works were based on machine learning methods. These conventional methods cost significant time and labor on task-specific feature engineering [12]. Differently, deep learning methods can automatically extract features by unsupervised or semi-supervised learning algorithms [13]. Moreover, it can generate high-quality vector representations that differ from the low-quality vector representations generated by feature engineering [14]. However, the application of deep learning methods needs a large amount of annotated data. In some domains, such as public health, it is challenging to construct a large-scale annotated dataset because of the costly expense of data acquisition and annotation.
Transfer learning can solve the problem by leveraging knowledge obtained from a large-scare source domain to improve the classification performance in the target domain [15]. At its simplest, migrating pre-trained word vectors initializes the input of the deep learning model. The pre-trained word vectors obtained based on massive text data are an essential part of the learned semantic knowledge that can significantly improve natural language processing tasks based on deep learning. In natural language processing (NLP) tasks, there are several ways to employ transfer learning strategies. Generally, we can initialize input words by transferring pre-trained word embedding. The pre-trained word embeddings on large-scare corpus contain abundant syntactic and semantic knowledge, which significantly promotes the NLP tasks based on deep learning methods [16]. However, static word vectors such as Word2Vec only produce a fixed vector representation. They cannot solve the problem that the same word may have different meanings when it appears in different positions in the text. The emergence of deep neural networks allows language models to dynamically generate word vectors to solve the ambiguity of words in different situations.
With the emergence of pre-trained language models such as bidirectional encoder representations from transformers (BERT) [17], the model can generate dynamic word embeddings to tackle the polysemy. Recently, fine-tuning the pre-trained language model with limited annotated domain-specific data has achieved excellent performance in a series of NLP tasks [18]. Adhikari et al. [19] established stated-of-the-art results for four accessible datasets (Reuters, AAPD, IMDB, Yelp 2014) by fine-tuning BERT for document classification.
To find a transfer learning system that is able to extract comprehensive public sentiment on HPV vaccines on Twitter with satisfying performance, three transfer learning approaches were proposed to tackle the limitation of annotated data in the public health area. (i) One was separately transferring diverse word embeddings and then processing by bidirection gated recurrent unit with attention mechanism (BiGRU-Att), called DWE-BiGRU-Att (Diverse Word Embeddings Processed by BiGRU-Att). In this way, we could exploit the syntax and semantics in the pre-trained word embeddings to improve the deep learning model's performance. (ii) As the static word embeddings could not solve the polysemy, we proposed the other two transferring learning methods. These two were fine-tuning pre-trained models with limited annotated data, called fine-tuning generative pre-training (GPT) and fine-tuning BERT.

Related Work
Nowadays, anyone with access to the Internet can express their opinions on various social media. Especially on public health issues such as infectious disease prevention, drug safety supervision, and vaccination, the public tends to post their medical experience or search for professional medical information online. Because of the public's open participation, the information related to public health issues can be spread on the Internet in a fast way.
Many studies have analyzed public opinions based on social media data. Salathe et al. collected publicly available tweets during the outbreak of H1N1 influenza [20]. They manually annotated part of the collected tweets. Each tweet was annotated with negative, positive, or neutral sentiment towards influenza vaccination. Then, they trained a machine learning model with the labeled data. The model was used to classify the sentiment of the remaining unlabeled tweets automatically. Finally, they used the fully classified dataset to study the sentiment distribution of influenza vaccination.
Myslín et al. [5] used support vector machines, Naive Bayes, and k-Nearest Neighbors to analyze the public opinions towards tobacco and tobacco-related products based on Twitter data. Ginn et al. [21] manually annotated 10,822 tweets and then trained two machine learning models to monitor adverse drug reactions.
However, these works were mostly based on machine learning methods. These methods need sophisticated feature engineering. Moreover, the sparse vectors generated by feature engineering are inferior to the dense vectors generated by deep learning methods. However, high-quality, dense vectors need to be trained on a large corpus. In this way, we transferred the dense vectors pre-trained on the large-scare corpus to improve the deep learning model's performance. Pre-trained dense vectors, containing learned syntax and semantics, can offer significant improvements over deep learning NLP tasks. Kim [22] initialized embeddings to pre-trained word vectors pre-trained on 100 billion words of Google News. Zhang et al. [23] treated multiple pre-trained word embeddings (Word2Vec, GloVe, and Syntactic embedding) as distinct groups and then applied convolutional neural networks (CNNs) independently to each group. The corresponding feature vectors (one per embedding) were then concatenated to form the final feature vector.
Transferring the learned semantics and syntax knowledge from the other missions has aroused a great interest in natural language processing (NLP) [24]. As an essential component of learned semantic knowledge, pre-trained word embeddings can offer significant improvements over deep learning NLP tasks. The generalization of word embeddings, sentence embeddings, or paragraph embeddings was also used as features in downstream missions like sentiment analysis, text classification, clustering, and translation [10]. Even though pre-trained word embeddings can improve the performance, the static word embeddings, such as Word2Vec, GloVe, and FastText [25], only produce fixed embedding and cannot solve the polysemy. With the emergence of deep neural networks, language models can yield dynamic word embedding to tackle the polysemy. McCann et al. [26] proposed contextualized word vectors (CoVe) by computing contextualized representations with neural machine translation encoder. Embeddings from language models (ELMo) [27] generated dynamic word embeddings by the concatenation of independently trained left-to-right and right-to-left long short-term memory networks (LSTM).
Bidirectional encoder representations from transformers (BERT) is a technique for NLP (natural language processing) pre-training developed by Jacob Devlin and his colleagues from Google [17]. The BERT model has achieved better performance in many sentiments analysis tasks of social media [28][29][30][31][32][33]. For example, in the work of Wang et al. [29], the BERT model was used to identify public negative sentiment categories in China regarding COVID-19 on Sina Weibo. In the work of Müller et al. [30], the COVID-Twitter-BERT model was a transformer-based model that pre-trained on a large corpus of Twitter messages on the topic of coronavirus disease 2019 . It outperformed the BERT-Large model on five different classification datasets. A Framework for twitter sentiment analysis based on BERT has been proposed in the work of Azzouza et al. [31]. The framework achieved high performance on the SemEval 2017 dataset. A knowledge enhanced BERT Model was proposed for depression and anorexia detection on social media in the work of [33].
In addition, the method of fine-tuning pre-trained language models has made a breakthrough in a series of NLP tasks. It can tackle the polysemy and only need a little annotated data to train the model. Howard et al. [18] proposed ULMFiT, the first universal method for text classification by the fine-tuning pre-trained language model. In the work of Biseda et al. [34], BERT models were fine-tuned for three pharmacovigilance of adverse drug reactions (ADRs) tasks and achieved high performance.
Myagmar et al. [32] fine-tuned pre-trained language models of BERT and XLNet for the cross-domain sentiment classification. The experimental results showed that fine-tuning methods outperformed previous state-of-the-arts methods while exploiting up to 120 times fewer data.

Methods
In this section, we described in detail our three transfer learning approaches. One is transferring diverse word embeddings passed through BiGRU-Att (Section 3.1). The others are fine-tuning pre-trained models processed by a fully connected softmax layer (Section 3.2).

Diverse Word Embeddings Processed by BiGRU-Att
We proposed the diverse word embeddings processed by BiGRU-Att (DWE-BiGRU-Att). Our four transfer learning methods are ELMo-BiGRU-Att, GloVe-BiGRU-Att, FastText-BiGRU-Att, and Word2Vec-BiGRU-Att. As shown in Figure 1, the architecture of our DWE-BiGRU-Att contains four components: embedding Layer, BiGRU Layer, attention Layer, and output Layer. For example, we took the sentence "I think the vaccine has side effects" as our method's input.

Methods
In this section, we described in detail our three transfer learning approaches. One is transferring diverse word embeddings passed through BiGRU-Att (Section 3.1). The others are fine-tuning pretrained models processed by a fully connected softmax layer (Section 3.2).

Diverse Word Embeddings Processed by BiGRU-Att
We proposed the diverse word embeddings processed by BiGRU-Att (DWE-BiGRU-Att). Our four transfer learning methods are ELMo-BiGRU-Att, GloVe-BiGRU-Att, FastText-BiGRU-Att, and Word2Vec-BiGRU-Att. As shown in Figure 1, the architecture of our DWE-BiGRU-Att contains four components: embedding Layer, BiGRU Layer, attention Layer, and output Layer. For example, we took the sentence "I think the vaccine has side effects" as our method's input.

Embedding Layer
This layer maps each word into a dense dimension vector through transferring pre-trained word embedding. In this paper, we compared the results of static word embedding Word2Vec, GloVe, FastText, and contextualized word embedding ELMo.
The static word embeddings are separately 3 million 300-dimension Word2Vec word embedding trained on GoogleNews, 1 million 300-dimension FastText word embedding trained on Wikipedia, and 1.2 million 200-dimension GloVe word embedding trained on Twitter. If the word is concluded in the pre-trained embedding, we can get the word vector directly. If not, we generate the word vector randomly.
Deep contextualized word embeddings supposed by language model ELMo improve word representation quality and handle the polysemy problem to a certain extent. Different from the static word embeddings, it represents a word according to its context.

Embedding Layer
This layer maps each word into a dense dimension vector through transferring pre-trained word embedding. In this paper, we compared the results of static word embedding Word2Vec, GloVe, FastText, and contextualized word embedding ELMo.
The static word embeddings are separately 3 million 300-dimension Word2Vec word embedding trained on GoogleNews, 1 million 300-dimension FastText word embedding trained on Wikipedia, and 1.2 million 200-dimension GloVe word embedding trained on Twitter. If the word is concluded in the pre-trained embedding, we can get the word vector directly. If not, we generate the word vector randomly.
Deep contextualized word embeddings supposed by language model ELMo improve word representation quality and handle the polysemy problem to a certain extent. Different from the static word embeddings, it represents a word according to its context. ELMo embedding is a combination of multiple layer representations in the bidirectional language model (biLM). Language model (LM) is the maximum likelihood of multiple sequences of K tokens, (t 1 , t 2 , . . . , t K ). The forward LM computes the probability of the next word t n given the history (t 1 , t 2 , . . . , t n−1 ).
Similarly, a backward LM predicts the before token based on the future context.
A biLM combines the forward LM and backward LM and then maximizes the log-likelihood of the forward and backward LM. Θ x and Θ s are respectively the token representation and softmax parameters, which are shared in the forward and backward directions.
For each token t n , a L-layer biLM computes a set of 2L + 1 representations: h LM n,j is calculated by h LM n,j = → h LM n,j ; ← h LM n,j for each biLSTM layer. ELMo integrates the output R n of multilayer biLM into a single vector, ELMo n = E(R n , Θ e ). The simplest case is that ELMo uses only the topmost output, E(R n ) = h LM n,j . Here, our ELMo adds the output of all biLM layer multiplied by the softmax-normalized weights s task . γ task is a hyperparameter for optimization and scaling the ELMo vector.

BiGRU Layer
This layer is built to aggregate the word representations containing the bidirectional information. The BiGRU layer takes the dense word embeddings V ∈ R t×d as input. t is the number of words in the input context and d is the dimension of the word vector. The BiGRU layer consists of two GRU layers that process the information from both forward GRU neuron and backward GRU neuron. Figure 2 shows the structure of the cell unit in the GRU. Two new gates r i and z i are added to the cell unit to solve the gradient disappearance problem of standard RNN. r i determines how much of the past information needs to be retained, and z i helps the model determine how much of the past information needs to be passed to the candidate hidden state. The calculation process of the reset gate r i is as follows: where σ is the activation function, x i is the input, h i−1 is the hidden state of the previous cell unit, and W r and U r are the weight matrix. neuron. Figure 2 shows the structure of the cell unit in the GRU. Two new gates and are added to the cell unit to solve the gradient disappearance problem of standard RNN. determines how much of the past information needs to be retained, and helps the model determine how much of the past information needs to be passed to the candidate hidden state. The calculation process of the reset gate is as follows: Similarly, the update gate z i is calculated as follows: Formally, the formula of current hidden state h i can be formalized as The formula for calculating h t is as follows: The forward GRU extracts the word feature as . h i concatenates the bidirectional information to summarize the information of the whole context centered around the word.

Attention Layer
Because not all words make the same contribution in understanding the sentence's meaning, we employed the attention mechanism to implement the contribution of important words.
The resulted concatenation of the representations of the forward and backward GRU, is then converted to Formula (10) through a fully connected layer.
Then, the probability distribution α i , representing the importance of each sentence in the context, is obtained by calculating the similarity between u i and the context vector u w and softmax operation.
At last, the document representation s i is the weighted sum of α i and h i .

Output Layer
The vector representation of the input text generated by the attention layer represents the probability distribution that s i gets the public's opinion labels on public health issues through the fully connected Softmax layer. Figure 3 shows the multi-class fully connected and Softmax layers corresponding to the output layer. The function of the fully connected and Softmax layer is to map the n dimension vector composed of n real numbers between negative infinity to positive infinity into the K dimension vector composed of K real numbers between 0 and 1. Moreover, the sum of K real numbers is equal to 1. The calculation process is shown in Formula (13).
Healthcare 2020, 8, x 8 of 19 The Softmax is calculated as follows: The specific probability of each category is calculated as follows, where represents the weight vector composed of the same color in the Figure 3.
The representation generated from the attention layer is fed into a fully connected softmax layer to obtain the distribution of class probability. We minimized categorical cross-entropy loss function in which loss increases as the predicted probability deviates from the actual label . the loss function is calculated as follows:

Fine-Tuning Pre-trained Models
Although transferring word embeddings can offer significant improvements in many NLP tasks, it is more efficient to fine-tune pre-trained language models with a little labeled target-domain data. In this section, we respectively described our fine-tuned GPT and fine-tuned BERT public sentiment analysis classifier.

Fine-Tuning GPT
In Figure 4, GPT uses multi-layer transformer decoders as a feature extractor. The transformer decoders are more powerful than the LSTM in handling long-term dependency. Our fine-tuned GPT public sentiment analysis classifier must apply the same structure as GPT pre-training. We also need to process the input context differently. The Softmax is calculated as follows: The specific probability of each category is calculated as follows, where w j represents the weight vector composed of the same color in the Figure 3.
The representation s i generated from the attention layer is fed into a fully connected softmax layer to obtain the distribution of class probability. We minimized categorical cross-entropy loss function J in which loss increases as the i th predicted probability p i deviates from the actual label y i . the loss function J is calculated as follows:

Fine-Tuning Pre-trained Models
Although transferring word embeddings can offer significant improvements in many NLP tasks, it is more efficient to fine-tune pre-trained language models with a little labeled target-domain data. In this section, we respectively described our fine-tuned GPT and fine-tuned BERT public sentiment analysis classifier.

Fine-Tuning GPT
In Figure 4, GPT uses multi-layer transformer decoders as a feature extractor. The transformer decoders are more powerful than the LSTM in handling long-term dependency. Our fine-tuned GPT public sentiment analysis classifier must apply the same structure as GPT pre-training. We also need to process the input context differently. We assumed a labeled dataset in which each case contains a sequence of words, ( , … , ), along with a label . For our classification task, our inputs need to add randomly initialized start and end tokens (<s>, <e>). The pre-trained GPT model processes the recombined inputs. Then, we obtained ℎ , which was the output of the final transformer block. The ℎ is then fed into a fully connected softmax layer with matrix to predict .
Lastly, we got the optimization objective to maximize:

Fine-Tuning BERT
Unlike GPT employing a left-to-right transformer, BERT utilizes a bidirectional transformer. In this paper, we fine-tuned BERTbase. It contains 12 transformer blocks, 12 self-attention heads, and 768 hidden units. As seen in Figure 5, BERT base takes a sequence of no more than 512 tokens as input and outputs the representation of the sequence. We assumed a labeled dataset C in which each case contains a sequence of words, (x 1 , . . . , x n ), along with a label y. For our classification task, our inputs need to add randomly initialized start and end tokens (<s>, <e>). The pre-trained GPT model processes the recombined inputs. Then, we obtained h n l , which was the output of the final transformer block. The h n l is then fed into a fully connected softmax layer with matrix W y to predict y. P y|x 1 , . . . , x n = so f tmax h n l W y Lastly, we got the optimization objective to maximize:

Fine-Tuning BERT
Unlike GPT employing a left-to-right transformer, BERT utilizes a bidirectional transformer. In this paper, we fine-tuned BERTbase. It contains 12 transformer blocks, 12 self-attention heads, and 768 hidden units. As seen in Figure 5, BERT base takes a sequence of no more than 512 tokens as input and outputs the representation of the sequence. In our classification task, BERT base takes the final hidden state C ∈ R H of the first token [CLS] as the representation of the whole sequence. We introduced a fully connected softmax layer over the final hidden state C. The softmax classifier parameter matrix is W ∈ R K×H , where H is the dimension of the hidden state vectors and K is the number of classes. P = so f tmax CW T (19) We minimized the categorical cross-entropy loss and fine-tune all the parameters from BERT as well as W to maximize the probability of the correct label.

Data Source
Experiments were conducted on 6000 annotated HPV-related tweets [10]. The combinations of keywords (HPV, human papillomavirus, Gardasil, and Cervarix) are used to collect public tweets using the official Twitter application programming interface (API) [35]. 33,228 English tweets containing HPV vaccines related keywords in total were collected from 15 July 2015 to 17 August 2015. Then, the URLs and duplicate tweets were removed. 6000 tweets were selected for annotation randomly. In Table 1, we divided the dataset into eight categories through the hierarchical structure. Then, each tweet had one category label. The hierarchical structure was based on the subdivision of unbalanced data. First, according to whether the tweet was related to HPV or not, we divided the tweet into a related class or unrelated class. Next, the tweets that belong to the related class were divided into positive class, neutral class, and negative class. Last, the negative tweets were classified into NegSafety class, NegEfficacy class, NegResistant class, NegCost class, and NegOthers class based on some most common worries about the vaccination like side effects, efficacy, cost, and culture-related issues. The detailed proportion of each category is shown in Table 1.

Data Processing
There were some essential data cleaning and pre-processing work we had done, including lowercase letter replacement, deleting punctuation, excluding hashtags, user names (e.g., @user), and replacing all URLs (e.g., 'http://xx.com') with 'URL'. Table 2 showed two processed sentences.

Experimental Setup
We applied 10-fold cross-validation to make full use of the small dataset and ensure the same evaluation indicators with the work of [10]. So, leave one out cross-validation is not applied in this paper. For each category, we treated it as a binary classification and assessed consequence with the F 1 -score. The F 1 -score is defined as the harmonic mean of the precision and recall of a binary decision rule [36]. For overall performance, we used micro-F 1 as multiclass classification assessment indexes. The Formula (20) showed the specific calculation process: Micro_F 1 calculates the proportion of instances predicted correctly in the predicted samples (regardless of the category) with Formula (20) where Micro_P is micro-average of precision, Micro_R is micro-average of recall, TP i is the true positive sample, FP i is the false positive sample, and FN i is false negative sample.
The optimal parameter settings are given in Table 3.

Baselines
We compared our transfer learning methods with traditional machine learning models, including plain support vector machines (SVMs) and hierarchical SVMs, and general deep learning models (i.e., attention-based BiGRU model [37]). The plain SVM classification used word-ngrams as features and chose default SVMs parameters. The hierarchical SVMs used three SVMs models trained independently and chose word-ngrams as features. The results of these models came from [10].

Average of Micro Index
The micro-average can be a useful measure when your dataset varies in small size. In Table 4, the 10-fold cross-validation performance of the average of micro index on the baseline models (plain SVM and hierarchical SVMs and BOW-BiGRU-Att) and our transfer learning models are shown. The plain SVM classification results used word-ngrams as the feature and chose default SVMs parameters and the hierarchical SVMs that used three SVMs models trained independently and chose word-ngrams as the feature are the official numbers from [10]. The columns of BOW-BiGRU-Att, Word2Vec-BiGRU-Att, FastText-BiGRU-Att, GloVe-BiGRU-Att, and ELMo-BiGRU-Att are our experiment results of bidirectional long short-term memory combined with bag-of-word, Word2Vec, pre-trained FastTest embedding, GloVe embedding, and ELMo embedding respectively. The columns of FT-GPT-FC and FT-BERT-FC are the results of fine-tuning GPT and fine-tuning BERT models, respectively. FT-GPT-FC and FT-BERT-FC respectively represent the fine-tuned model with a fully connected neural network. The row of Average/Method represents the average of micro-F 1 score of 10-fold cross-validation on each method. The column of Average/Fold represents the average of micro-F 1 score of each fold on all methods. The FT-BERT-FC gets the best performance with the bold number. The result of FT-BERT-FC is 0.769. It makes 14.8% and 6.95% increase than the plain SVM and the hierarchical SVMs score (0.670 and 0.719), respectively. The ELMo-BiGRU-Att and FT-GPT-FC also increase by 2.68% and 1.53% more than hierarchical SVMs on micro-F 1 average, respectively. The better performance of FT-BERT-FC can be attributed to the fact that the left-to-right and right-to-left transformers of BERT is more powerful than the left-to-right transformer of GPT. The bidirectional transformer concentrates on the left and right context of the word, but the left-to-right transformer can only focus on the left context of the word. Thus, pre-trained BERT can extract more high-quality feature vectors. The other results are 0.654, 0.697, 0.708, and 0.702 are all lower than hierarchical SVMs. However, the performance of all except the BOW-BiGRU-Att is better than plain SVM. Among all DWE-BiGRU-Att models (ELMo-BiGRU-Att, GloVe-BiGRU-Att, FastText-BiGRU-Att, and Word2Vec-BiGRU-Att), ELMo-BiGRU-Att obtain the highest micro-F 1 average. The results indicate that dynamic word embedding (ELMo) is more efficient than static word embeddings (GloVe, FastText, and Word2Vec). Meantime, ELMo can solve the polysemy that cannot be handled by static word embeddings. However, the Micro−F 1 average of FT-GPT-FC are increased by 0.82% than ELMo-BiGRU-Att.
Compared with BOW-BiGRU-Att, the micro-F 1 average of Word2Vec-BiGRU-Att is still increased by 6.57%. The significant improvement means that transferring pre-trained word embedding is efficient in promoting the classification performance of deep learning methods.

Standard Deviation and Root Mean Square Error
The standard deviation (SD) of the micro-F 1 score for all methods is given in the row of SD to measure the variance of a model's performance. Root mean square error (RMSE) is applied as an error analysis. The RMSE is calculated as follows where m is the sample size, y The SD and RMSE are shown in Table 5. The SD and RMSE of FT-BERT-FC are lower than the values of the plain SVM and hierarchical SVMs in average of micro index. The performance of the plain SVM and hierarchical SVMs depend on the feature extracting from the training data set, so the performance of each fold is more different. The SD and RMSE of the dynamic embeddings model (such as FT-BERT-FC and FT-GPT-FC) are lower than that of the static embeddings model (such as FastText-BiGRU-Att and GloVe-BiGRU-Att). Generally, the static embeddings model requires much larger amounts of data. They can get major improvements when trained on millions or more annotated training examples. However, the BERT model trained general-purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training). The FT-BERT-FC method is not needed to extract high-quality language features from the text data, we fine-tuned the model with BERT on the HPV vaccination task to produce state-of-the-art predictions.  Table 4 shows the specific micro-F 1 scores of different models in each fold (F-1, F-2, . . . , F-10). We sum up the true positives (TP), false positives (FP), and false negatives (FN) of the system for different sets and apply them to get the statistics. The bold number denotes the largest number in that row. In all folds, FT-BERT-FC gains the highest scores all, which indicates the robustness of the model. The micro-F 1 score of FT-GPT-FC and ELMo-BiGRU-Att are relatively close in each fold. The difference between their scores of each fold does not exceed 0.02. Furthermore, the performance of Word2Vec-BiGRU-Att, FastText-BiGRU-Att, and GloVe-BiGRU-Att is similar in each fold. It indicates that the Word2Vec, FastText, and GloVe.embedding mechanisms have similar effects on the HPV dataset.

Micro-F 1 Scores in Each Fold
The worst overall performance of all methods emerges in the third fold F-3, which means the overall micro index performance of all models is the worst. The average micro-F 1 score of the third ford is 0.686. Correspondingly, the highest average micro-F 1 score of each ford is 0.732 in the fourth fold F-4. That means the overall micro index performance of all models is the best in this fold.

Statistical Test
We chose the Friedman test with the Nemenyi post hoc test based on [38]. The Friedman test is a non-parametric statistical test developed by Milton Friedman [39]. It can be used to detect differences in multiple methods across multiple test data sets. The steps of the Friedman test and the Nemenyi test for this paper are given as follows.
(1) Define Null and Alternative Hypotheses H 0 : There is no difference between the nine methods; H 1 : There is a difference between the nine methods.
(2) Calculate Test Statistic First, from Table 5, We ranked the methods for each fold (F-1, F-2, . . . , F-10) separately on micro-F 1 score. Second, we replaced our original values with the rankings as shown in Table 6. Let r j i be the rank of the j − th of k methods on the i − th of N fold. The Friedman test compares the average ranks (mean ranks) of methods, R j = 1 N N i=1 r j i . The Friedman statistic is distributed according to χ 2 F with k − 1 degrees of freedom. Friedman's χ 2 F is undesirably conservative and derived a better statistic was proposed by Iman and Davenport [40].
F F is distributed according to the F-distribution with k − 1 and (k − 1) (N − 1) degrees of freedom. The table of critical values can be found in any statistical book. In this paper, N = 10, k = 9. With nine algorithms and 10-fold cross-validation data sets, F F is distributed according to the F distribution with 9 − 1 = 8 and (9 − 1) × (10 − 1) = 72 degrees of freedom. The critical value of F (8,72) for α = 0.05 is 2.07. We got χ 2 F = 73.57, F F = 103.03 with Equations (23) and (24). F F > F 0.05 (8,72) where α = 0.05. So, we reject the null hypothesis. We proceed with a post hoc test using the Nemenyi test [41].
(3) Nemenyi test The performance of different methods is significantly different if the corresponding average ranks differ by at least the critical difference (CD).
At p-value = 0.05, q 0.05 = 3.102 were obtained from a F-distribution table in any statistical book where α = 0.05. Then, CD is 3.80 calculated with Equation (25).
All the R j ± CD 2 were got and shown in Table 6. The critical difference (CD) diagrams are shown in Figure 6. We can identify the performance of FT-BERT-FC is significantly better than that of plain SVM [10], hierarchical SVMs [10], BOW-BiGRU-Att, Word2Vec-BiGRU-Att, FastText-BiGRU-Att, and GloVe-BiGRU-Att. We cannot tell that there is a significant difference between FT-BERT-FC, ELMo-BiGRU-Att, and FT-GPT-FC. We can conclude that the post hoc test is not powerful enough to detect any significant differences between the ELMo-BiGRU-Att, FT-GPT-FC, and hierarchical SVMs at p-value is equal to 0.05. where = 0.05. So, we reject the null hypothesis. We proceed with a post hoc test using the Nemenyi test [41].
(3) Nemenyi test The performance of different methods is significantly different if the corresponding average ranks differ by at least the critical difference (CD).
All the ± were got and shown in Table 6. The critical difference (CD) diagrams are shown in Figure 6. We can identify the performance of FT-BERT-FC is significantly better than that of plain SVM [10], hierarchical SVMs [10], BOW-BiGRU-Att, Word2Vec-BiGRU-Att, FastText-BiGRU-Att, and GloVe-BiGRU-Att. We cannot tell that there is a significant difference between FT-BERT-FC, ELMo-BiGRU-Att, and FT-GPT-FC. We can conclude that the post hoc test is not powerful enough to detect any significant differences between the ELMo-BiGRU-Att, FT-GPT-FC, and hierarchical SVMs at p-value is equal to 0.05.

2.00
GloVe-BiGRU-Att  Figure 6. Comparison of all methods against each other with Nemenyi test.

Limitations and Future Researches
We have demonstrated how the methods of the sentiment analysis of the HPV vaccination task. However, only one dataset with 6000 tweets is verified. One of the next steps is to study the performance of these methods working on different sizes and multi-domain.
Some tweets are not be processed by FT-BERT-FC shown in Table 7. The current methods usually neglect to consider commonsense knowledge for public opinions on public health issues.

Limitations and Future Researches
We have demonstrated how the methods of the sentiment analysis of the HPV vaccination task. However, only one dataset with 6000 tweets is verified. One of the next steps is to study the performance of these methods working on different sizes and multi-domain.
Some tweets are not be processed by FT-BERT-FC shown in Table 7. The current methods usually neglect to consider commonsense knowledge for public opinions on public health issues. Knowledge enhanced ensemble learning models on social media should be tried to address this problem. Furthermore, uneven data distribution is an excellent challenge for the current models. There are only 6 NegResistant and 6 NegCost tweets in the dataset. Some deep learning approaches for processing imbalanced data should be studied as [7].

Conclusions
We try to find a transfer learning system that can extract comprehensive public sentiment on HPV vaccines on Twitter with satisfying performance. We proposed three transfer learning approaches to analyze public sentiments towards public health issues for the goal. To exploit syntax and semantics pre-trained on a large corpus, a method of transferring diverse word embeddings was combined with BiGRU-Att layer. As the static word embeddings could not solve the polysemy, we proposed the other two methods of fine-tuning GPT and fine-tuning BERT. In this way, we could take advantage of the strong feature extraction capability of large neural networks by using a little annotated target-domain data to fine-tune the language model. The experimental results showed the superiority of FT-BERT-FC for the HPV vaccination issue. With the success of this work, our transfer learning approaches were expected to be further applied to other public sentiments tasks towards public health issues.
Author Contributions: Conceptualization, L.Z. and G.R.; methodology, C.P. and G.R.; supervision, G.R.; writing-original draft, C.P., Q.C., and H.F.; writing-review and editing, Q.C. and H.F. All authors have read and agreed to the published version of the manuscript.