Learning Improved Semantic Representations with Tree-Structured LSTM for Hashtag Recommendation: An Experimental Study

: A hashtag is a type of metadata tag used on social networks, such as Twitter and other microblogging services. Hashtags indicate the core idea of a microblog post and can help people to search for speciﬁc themes or content. However, not everyone tags their posts themselves. Therefore, the task of hashtag recommendation has received signiﬁcant attention in recent years. To solve the task, a key problem is how to effectively represent the text of a microblog post in a way that its representation can be utilized for hashtag recommendation. We study two major kinds of text representation methods for hashtag recommendation, including shallow textual features and deep textual features learned by deep neural models. Most existing work tries to use deep neural networks to learn microblog post representation based on the semantic combination of words. In this paper, we propose to adopt Tree-LSTM to improve the representation by combining the syntactic structure and the semantic information of words. We conduct extensive experiments on two real world datasets. The experimental results show that deep neural models generally perform better than traditional methods. Specially, Tree-LSTM achieves signiﬁcantly better results on hashtag recommendation than standard LSTM, with a 30% increase in F1-score, which indicates that it is promising to utilize syntactic structure in the task of hashtag recommendation.


Introduction
Social networking services (SNSs), such as Twitter, Facebook and Google+, have attracted millions of users to publish and share the most up-to-date information, emergent social events, and personal opinions [1].As a typical representative of SNSs, microblog services have become increasingly popular.For example, Twitter [2] has more than 240 million active users, and users generate approximately 500 million tweets every day.The massive amounts of information generated every day makes it difficult to discover the hidden information in that data.To organize this information more accurately and effectively, many microblogging services allow users to create and use hashtags by placing the pound sign #, usually in front of a word or unspaced phrase in a post.The hashtag may contain letters, digits, and underscores.Searching for that hashtag yields each message that has been tagged with it.A hashtag archive is consequently collected into a single stream under the same hashtag.For example, the hashtag #HarryPotter allows users to find all the posts that have been tagged using that hashtag.It has also been proven that hashtags are important for many applications in microblogs, including microblog retrieval [3], query expansion [4], and sentiment analysis [5][6][7].However, only about 11% of tweets were annotated with one or more hashtags by their authors [8].Therefore, hashtag recommendation is a key challenge and has attracted much attention of researchers.
Existing approaches to hashtag recommendation range from classification and collaborative filtering to probabilistic models, such as naive Bayes and topic models.We formally formulate the hashtag recommendation task as a multi-class classification problem only using the text of posts.A typical approach is to learn the representation of a microblog post and then performing text classification based on this representation.Hence, it is of practical values to explore how to effectively represent the text of a microblog post for hashtag recommendation.We perform an experimental study of the short text representation methods in hashtag recommendation task.To be more specific, we study two major kinds of text representation methods for hashtag recommendation.For the first kind, we mainly consider shallow textual features, including bag-of-word (BoW) models and exquisitely designed patterns, which have been used in previous work [9,10].However, feature engineering is labor-intensive and the sparse and discrete features cannot effectively encode semantic and syntactic information of words.Given neural network models can effectively learn feature representation, their advantages in recommendation research have gradually emerged.We further propose to learn deep textual features using multiple deep neural network models in hashtag recommendation task.In addition to the widely used Convolutional neural network (CNN) [11] and Long Short-Term Memory network (LSTM) [12] in text classification, we propose using Tree-LSTM [13] to introduce syntactic structure while learning the representation of microblog post.Experiments were designed on two real-world datasets to compare our method with many competitive baselines.It was found that Tree-LSTM outperformed all the baselines by incorporating the syntactic information.Specifically, Tree-LSTM achieved significantly better results in hashtag recommendation than standard LSTM, with a 30% increase in F1-score.In order to foster reproducibility of our experiments we also made our code [14] publicly available.
Our contributions are summarized as follows: • We take the initiative to study how the syntactic structure can be utilized for hashtag recommendation, and our experimental results have indicated that the syntactic structure is indeed useful in the recommendation task.To the best of our knowledge, few studies address this task in a systematic and comprehensive way.

•
We propose and compare various approaches to represent the text for hashtag recommendation, and extensive experiments demonstrate the powerful predictive ability of our deep neural network models on two real-world datasets.

Methodology for Hashtag Recommendation
We formulate the task of hashtag recommendation as a multi-class classification problem.We consider two major approaches: one extracts textual features using traditional text representation models, and the other learns effective text representations using deep neural network models.Specially, we propose to incorporate syntactic structure to enhance the representation learning of the text.In what follows, we introduce the two approaches: a traditional classification approach with shallow textual features and a neural network approach with deep and syntactic textual features respectively.

Naive Bayes (NB)
To solve the above recommendation problem, traditional machine learning algorithms such as naive Bayes (NB), was applied to model the posterior probability of each hashtag given a microblog in [15].Similarly, we use a naive Bayes model to determine the relevance of hashtags to an individual tweet.Given an input microblog s with N words w 1 , ..., w N , the features of a tweet can be represented as a one-hot vector x where x i = 0 or 1 to indicate the presence or absence of the ith dictionary word.
Using Bayes' rule, we can determine the posterior probability of C i given the features of a tweet x 1 , ..., x N : In the simplest case, the prior probability p(C i ) may be assumed the same for all C i to give a maximum likelihood approach.The term p(x 1 ...x N ) essentially serves as a normalizing constant and can be ignored when selecting the most probable hashtags because it does not depend on the hashtag C i .Finally, using the naive Bayes assumption that the features x i are conditionally independent given C i , we can write the likelihood p(x 1 , ..., x N |C i ) as the product p(x 1 |C i )...p(x N |C i ).Each of these conditional probabilities is based on the frequency with which each word appears in tweets with hashtag C i .

Latent Dirichlet Allocation (LDA)
Following the same idea from [16], we use the standard LDA model [17] for the task.In LDA, each post s has a probability on a particular topic z i , while each hashtag C i is also assigned a probability belonging to each topic z i .Therefore, we can obtain the probability of each hashtag C i assigned to a post s by the combination of these two probabilities as follows: 2.1.3.Term Frequency-Inverse Document Frequency (TF-IDF) The vector space model is an algebraic model to represent text as vectors of identifiers.It is widely used in information retrieval and relevancy ranking.We use the text of all posts in the dataset to generate the vocabulary V which contains all words.A microblog post feature vector x ∈ R |V| can be considered as a representation of all words in the post over the dictionary V.The weighting of a word in w i is determined by term frequency-inverse document frequency (i.e., tf-idf) model [18]; then, based on the TF-IDF feature vectors, we build a multi-class SVM classification model [19] for hashtag recommendation.

Neural Network Approach with Deep Textual Features
It may be not effective enough if the above shallow textual features are directly used for classification task [20,21], since these features cannot capture the script beyond the surface form of the language.Inspired by the recent success of deep learning techniques in many NLP tasks, we further propose to learn deep textual features using multiple deep neural network models including FastText, CNN, LSTM and its variants in hashtag recommendation task.Specially, we propose to incorporate syntactic structure to enhance the representation learning of the text using Tree-LSTM model.
We typically use the neural models to learn the representation of a microblog post.Then, text classification is performed based on the representation of the microblog post.These deep neural network models are based on the fact that each word is represented as a low dimensional, continuous, and real-valued vector, which is also known as a word embedding [22,23].All the word vectors are stacked in a word embedding matrix L w ∈ R d emb ×|V| , where d emb is the dimension of word vector and |V| is vocabulary size.To make better use of the semantic and grammatical associations of words, the values of the word vectors are pre-trained from the text corpus with the word2vec embedding learning algorithm [23].Given an input microblog s, we take the embeddings x t ∈ R d emb ×1 for each word in the microblog to obtain the first layer.Hence, a microblog post of length N is represented with a sequence of word vectors X = (x 1 , x 2 , ..., x N ).
The output of deep neural network models is the representation vector of a microblog post vec.The final output vector vec is then fed to a linear layer whose output length is the number of hashtags.A softmax layer is finally added to output the probability distributions of all candidate hashtags.The softmax function is calculated as shown below, where C is the number of hashtag categories: FastText is a simple and efficient text classification method [24], which treats the average of word/n-gram embeddings as document embeddings, and then feeds document embeddings into a classifier.In Figure 1, the text representation is an hidden variable which can be potentially be reused.This architecture is similar to the CBOW (Continuous Bag-of-Words) model of [23], where the middle word is replaced by a label.We use the softmax function to compute the probability distribution over the predefined hashtags.

Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN) model [11] has subsequently achieved excellent results in many NLP tasks.The architecture of the CNN model is shown in Figure 2. In the training process, a convolution operation uses a filter w which has a window of h words to produce a new feature.For example, a feature f i generated from a window h of words sequence x i:i+h−1 is defined as follows: where b is a bias term and g is a non-linear function such as hyperbolic tangent.This filter is applied to each possible window of words in the sentence x 1:h , x 2:h+1 , ..., x n−h+1:n to produce a feature map Then a max-pooling operation is applied on the feature map to capture the most important feature by taking the maximum value f = max{f}.Finally, we use a softmax function to generate the probability distribution over labels.

Standard LSTM
Long Short-Term Memory [12] is a special RNN that addresses the shortcomings of RNN in long-term dependencies.The transition equations for LSTM are as follows: where h t represents the hidden layer, x t represents the current word, and the last hidden layer h t−1 forms a new cell state c t with current new word x t .To determine the importance of the currently entered word, the input gate i t uses the last hidden layer h t−1 and the current word x t .The forget gate f t is opposite to the input gate i t , and it uses the last hidden layer h t−1 and the current word x t to determine the importance of the old memory.The final cell state c t is obtained by adding the new cell state c t under the condition of the input gate i t , and the previous cell state c t−1 under the condition of forget gate f t .The output gate o t is used to distinguish the current final cell state c t from the current hidden layer h t , denotes element wise multiplication, and b is the bias.As shown from the above-stated analysis, the hidden layer h t summarizes the information of the whole sequence centered around x t .
The last hidden vector from the standard LSTM is used as the microblog representation to recommend hashtags.Then, it is fed to a linear layer whose output length is the number of hashtags.Finally, a softmax layer is added to output the probability distributions of all candidate hashtags.The structure of this model is shown in Figure 3.

LSTM with Average Pooling (AVG-LSTM)
However, a potential issue with standard LSTM approach is that all the necessary information of the input post has to be compressed into a fixed-length vector.This may make it difficult to cope with long sentences.Each annotation h t contains information about the whole input microblog with a strong focus on the parts surrounding the tth word of the input microblog.One possible solution is to perform an average pooling operation over the hidden vectors of LSTM [12].We call it AVG-LSTM for short.

Bidirectional LSTM (Bi-LSTM)
Bidirectional LSTM is an extension of traditional LSTM that enables the hidden states to capture both historical and future context information.In problems where all time steps of the input sequence are available, Bidirectional LSTM models text semantics both from forward and backward.For a microblog X = (x 1 , x 2 , ..., x N ), the forward LSTM reads sequence from x 1 to x N and the backward LSTM reads sequence from x N to x 1 , and similarly processes the sequence according to Equation (5).Then we concatenate the forward hidden state − → h t and backward hidden state where the [•; •] denotes concatenation operation.Finally, the h t summarizes the information of the whole sequence centred around x t .

Tree-LSTM
All the above deep neural models can learn deep textual features, however, they ignore the syntactic structures.We take the initiative to study how the syntactic structure can be utilized for hashtag recommendation with Tree structured LSTM model.It has been demonstrated that Tree-LSTM is able to improve the performance of many tasks, such as sentiment analysis [13], natural language inference [25], etc.In our task, the text is spoken language, which is quite different from the texts in previous research tasks.It is worth exploring that whether the syntactic information is effective short informal text like tweet.
The difference between the standard LSTM units and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units.Unlike the standard LSTM, in which each node of LSTM only has one son, Tree-LSTM has a forget gate for each child node, which selectively combines the semantic information of each child node.For example, if there are three words in a sentence, c 2 and c 3 are child nodes of c 1 in the syntactic tree, the structure of LSTM units should be constructed as Figure 4, while the Tree-LSTM units should be constructed as Figure 5. Furthermore, if we use C t to represent the set of all the children of current node t, then the transition equation of Child-Sum Tree-LSTM is as follows: As above, each Tree-LSTM node includes an input gate i t , multiple forget gates f tk corresponding to multiple child nodes, an output gate o t , a memory cell c t , and a hidden layer h t .Additionally, x t is the input at the current step, σ denotes the logistic sigmoid function, denotes element wise multiplication, U represents the weight matrix of the hidden layer of the children, W represents the weight matrix of the current step, and b is the threshold vector.The values of W, U, b, and h are initially set to random numbers between (−1.0, 1.0).
Syntactic analysis is one of the key underlying technologies in natural language processing (NLP).Its basic task is to determine the syntactic structure of a sentence or the dependency between words in a sentence.Dependency parsing is a typical approach of syntactic parsing in which syntactic units are arranged according to the dependency relation, as opposed to the constituency relation of phrase structure grammars.Therefore, in this task, we use the dependency tree structure as the syntactic information.
Through an example, this section introduces how to incorporate syntactic information into the Tree-LSTM architecture for the task of hashtag recommendation.
Example tweet : Am I the only one in the world that does not watch glee?Given the tweet above, the Stanford Parser [26] is first used to obtain the dependency tree structure of the tweet, which is shown in Figure 6.However, there are some alternatives, such as the Berkeley parser [27], LTP [28], and FudanNLP [29].A tree structure for the example tweet in Figure 7 can then be obtained, corresponding to Figure 6.The Tree-LSTM model in Figure 8 can be constructed based on the results of the tree structure.In this task, each x j is a vector representation of a word in a sentence and h j is the hidden representation of each word.As shown in Figure 7, "one" is the head word of "the", "only", "in" and "watch".Respectively, in Figure 8, h 5 is the parent node of h 3 , h 4 , h 6 and h 12 , and so on.Each node from the upper layer in the tree takes the vector corresponding to the head word from the lower layer as input recursively.Finally, the hidden representation of the root in the tree is used as the representation of the tweet.

Model Training
We train all our deep neural models in a supervised manner by minimizing the cross-entropy error of the hashtag classification.For tweets with more than one hashtag, each hashtag and the post are used as a training instance.The loss function is given below: where S stands for all training instances and tags(s) is the hashtag collection for microblog s.

Experiment
We apply the proposed methods to the task of hashtag recommendation and evaluate the performance on two real-world datasets.This section describes the experiments that were designed to answer the research question that whether syntactic structure is useful for this task, and how much does syntactic structure help to improve the recommendation performance.

Twitter Dataset
The data used in this paper was constructed from a larger Twitter dataset ranging from June to December of 2009, with a total of 476,553,560 tweets [30].The original data set was larger than 1 TB, and a sub dataset with 185,391,742 tweets from October to December were collected.Among them, there were 16,744,189 tweets including hashtags annotated by users.Due to the computational performance, we randomly select 50,000 tweets as training set, along with 5000 tweets as the validation and test set, which is similar to previous work [20,21,31].We repeat this process for five times, and then take the average results on the test data as the final experimental results.
The statistics of our dataset is shown in Table 1.Figures 9 and 10 show the distribution of the number of tweets for each hashtag and the distribution of the number of hashtags for each tweet, respectively, which follow a power-law distribution.From the two figures, the following can be learned: (1) more than 95% of the hashtags appeared in fewer than 20 tweets; and (2) about 10% of the tweets had more than one hashtag, while 98% of the tweets had less than five hashtags.

Zhihu Dataset
The second dataset was crawled from Zhihu, a Chinese question-and-answer website.We collected our dataset from the Zhihu topic square [32].We randomly crawled 150,000 hot questions and their hashtags from 100 subtopics under the 33 main topics.Then, the hashtags that appeared less than three times were removed as well as questions that had more than five hashtags.Finally, 278 42,060 questions and their 3487 hahstags remained.We randomly se80% questions as the training data, while 10% were used as the validation and 10% for test data.We repeat this process for five times, and then take the average results on the test data as the final experimental results.The statistics of the dataset are shown in Table 2 and Figures 11 and 12.

Experimental Settings
We recommend hashtags in the following ways: we first use the training dataset to train our model and save the model, which has the best performance on the validation dataset.For the unlabelled data, we use the trained model to perform the hashtag classification.We repeat this process five times, and then take the average results on the test data as the final experimental results.
All the neural models were trained with sentences of lengths up to 50 words.For each of the above models, the dimension of all the hidden states in the LSTMs was set to 200 and the dimension of word embeddings was 300, unless otherwise noted.We use a minibatch stochastic gradient descent (SGD) algorithm together with the Adam method to train each model [33].The hyperparameter β1 was set to 0.9 and β2 was set to 0.999 for optimization.The learning rate was set to 0.001.The batch size was set to 50.The network was trained for 20 epochs with early stopping.For both our models and the baseline methods, we use the validation data to tune the hyperparameters, and report the results of the test data in the same setting of hyperparameters.Furthermore, the word embeddings used in all methods are pre-trained from the original twitter data released by [30] with the word2vec toolkit [23].
We use hashtags annotated by users as the golden set.To evaluate the performance, we use precision (P), recall (R), and F1-score (F) as the evaluation metrics.Precision means the percentage of "tags truly assigned" among "tags assigned by system".Recall denotes that "tags truly assigned" among "tags manually assigned".F1-score is the average of Precision and Recall.The same settings are adopted by previous work [20,21,31].

Experimental Results
Tables 3 and 4 show a comparison between the results of our method and the state-of-the-art discriminative and generative methods on the two datasets respectively.It is worth noting that we only take the hashtag with highest probability as the result predicted by the model.In other words, we only evaluate the top-1 recommendation result.To summarize, we find the following conclusions.
First, considering the comparison between the NB and TF-IDF methods, it can be observed that TF-IDF performed much better than NB.This indicates that the TF-IDF features were more effective than bag-of-words (BoW).
Secondly, for the general topic model LDA, due to the word-gap problem, sometimes the topic of the sentence was not captured well, and it could only obtain sparse features of sentences.LDA was able to achieve a little improvement in results over the NB and TF-IDF methods in Zhihu dataset.However, in Twitter dataset, the results of LDA is worse than TF-IDF, we suppose this is because the topics of twitter are more diverse.
Thirdly, the neural models had better experimental results than the traditional baseline methods, such as NB, TF-IDF and LDA.This indicates that neural networks are effective in learning the semantic information of microblog posts and can improve the performance considerably.Additionally, LSTM, AVG-LSTM and Bi-LSTM have produced comparable results.
Finally, Tree-LSTM, which incorporates the syntactic information, obtained the best results.Tree-LSTM achieved a more than 30% of relative improvement in the F1-score compared with LSTM and CNN.It is clear that the syntactic structure was effective in this task.
Tables 5 and 6 show the influence of the number of training data elements in two datasets.We randomly select a part of the training set according to a certain proportion.As shown, the performance of Tree-LSTM increased when larger training data sets were used.The results also demonstrate that our proposed method could achieve significantly better performance than the traditional baseline methods even with only 20% of the training data.Many microblog posts have more than one hashtag; therefore, we also evaluated the top-k recommendation results of different methods.show the precision, recall, and F1 curves of NB, LDA, CNN, LSTM, AVG-LSTM, Bi-LSTM and Tree-LSTM on the test data respectively.Each point of a curve represents the extraction of a different number of hashtags, ranging from 1 to 5. It can be observed that although the precision and F1-score of Tree-LSTM decreased when the number of hashtags was larger, Tree-LSTM still outperformed the other methods.Moreover, the relative improvement on extracting only one hashtag was higher than that on more than one hashtag, which shows that it is more difficult to recommend hashtags for a microblog post with more than one hashtag.Similarly, we also evaluated the top-k recommendation results of different methods of Zhihu dataset in Figures 16-18.It can be observed that Tree-LSTM still outperformed the other methods substantially in all cases.However, for all methods, the performance of the Zhihu dataset is worse than that of the twitter dataset.We suppose this is because the average number of hashtags in the Zhihu dataset is large.

Parameter Sensitive Analysis
We further investigate the effect of hyperparameters to the performance in Twitter dataset.We vary the dimension of hidden layers in Tree-LSTM from 50 to 300, with a gap of 50.As shown in Figure 19, we find the performance improved when the dimension of hidden layer increased.

Qualitative Analysis
A qualitative analysis of our results was also conducted through case studies in Twitter data.Example 1: Worldcup football fans have their say on the draw-socceroos.#worldcup Example 2: Gon na watch some flashforward now, while waiting for parcelforce to arrive hopefully!#flashforward In both examples, the hashtag #worldcup and #flashforward were correctly recommended by Tree-LSTM; however, they were not recommended by LSTM.As shown in Figure 20, the hashtag "worldcup" acts as a noun phrase in the sentence.In Figure 21, the hashtag "flashforward" is an object in a verb phrase.Through the analysis of the recommendation results, it was found that Tree-LSTM has a strong advantage over the standard LSTM model in the above two cases.These indicate that the syntactic structure is useful in this task.

Related Work
There has been much research in the field of hashtag recommendation.The practical approaches can be roughly divided into two categories: content-based methods and collaborative filtering methods.Content-based methods use different techniques to measure the content of the tweets to find the relevant hashtags from historical tweets [9,10].Zangerle et al. [9] presented an approach based on the similar microblogs and the hashtags in these microblogs.Mazzia and Juett [15] used the naive Bayesian model to calculate the maximum posterior probability of a hashtag under the condition that all words in the microblog are known.Krestel et al. [16] used a topic model based on the Latent Dirichlet Allocation (LDA) [17] to train the tags and the words in the text together to obtain the topic distribution of each document and the topic distribution of the words.Ding et al. [34] treated the text and its labels as different descriptions of the same resource, and used the topic model and single-word alignment techniques to learn the probability of each word being aligned with the label under a particular topic.When performing the prediction task, they calculated the ranking score of a certain hashtag relative to each microblog according to the probability distribution of the topic and the word alignment probability.
Collaborative filtering is a popular technique for recommendation systems.It considers user preferences by building a model from a user's past behaviours as well as similar decisions made by other users.Kywe et al. [35] proposed a collaborative filtering model based on a user group similar to the target user or based on a microblog similar to the target microblog to obtain the hashtags.
Wang et al. [36] proposed a joint model based on topic modelling and collaborative filtering to take advantages of both local (the current microblog content and the user) and global (hashtag-related content and like-minded users' usage preferences) information.Zhao et al. [37] used a hashtag-LDA recommendation approach combines user profile-based collaborative and LDA-based collaborative filtering.They jointly model the relations between users, hashtags, and words through latent topics.
In addition to these works, some researchers have attempted to incorporate external information to improve the recommendation.Li et al. [38] incorporated the topic enhanced word embeddings, tweet entity data, tag frequency, tag temporal data and tweet URL domain information to recommend the hashtags for microblogs.Ma et al. [39] explored the latent relationship between tweet content, user interest, time, and tag.Motivated by the observation that the average hashtag experiences a life cycle of increasing and decreasing popularity, Lu et al. [40] proposed a model captures the temporal clustering effect of latent topics in tweets.Kowald et al. [41] proposed a cognitive-inspired hashtag recommendation algorithm that is based on temporal usage patterns of hashtags derived from empirical evidence (individual hashtag reuse patterns and social hashtag reuse patterns).
Recently, there have been some attempts to use deep neural models for hashtag recommendation [21,31,42].Gong et al. [31] proposed a method incorporating textual and visual information.After observing that hashtags indicate the primary topics of microblog posts, Li et al. [20] proposed an attention-based LSTM model that incorporates topic modelling into the LSTM architecture through an attention mechanism.In this paper, we take the initiative to study how to effectively represent the text of a microblog post for hashtag recommendation.We study two major kinds of text representation methods for hashtag recommendation, including shallow textual features and deep textual features learned by deep neural models.Most existing work tries to use deep neural networks to learn microblog post representation based on the semantic combination of words.Specially, we propose to adopt Tree-LSTM to improve the representation by combining the syntactic structure and the semantic information of words.To the best of our knowledge, this is the first attempt to introduce syntactic information into a neural network for this task.

Conclusions
This paper performs an experimental study of how to effectively represent the text of a microblog post for hashtag recommendation.We study two major kinds of text representation methods including traditional approach with shallow textual features and neural network approach with deep textual features.In addition to the widely used CNN, LSTM models in text classification, we propose using Tree-LSTM to introduce syntactic structure while learning the representation of microblog post.The experimental results on two real-world datasets demonstrated that Tree-LSTM had the best performance, which indicates that it is promising to utilize syntactic structure in the task of hashtag recommendation.
For future work, we will try to incorporate more external information into the task, such as time, author occupation and user relationships.

Figure 1 .
Figure 1.Model architecture of FastText for a post with N n-gram features x 1 , ..., x N .The features are embedded and averaged to form the hidden variable.

Figure 4 .Figure 5 .
Figure 4.The structure of LSTM units with three nodes.

Figure 6 .
Figure 6.The dependency relations of the example tweet.

Figure 7 .
Figure 7.The syntactic tree structure of the example tweet.

Figure 9 .Figure 10 .
Figure 9.The distribution of the number of tweets for each hashtag.

Figure 11 .
Figure 11.The distribution of the number of questions for each hashtag.

Figure 12 .
Figure 12.The distribution of the number of hashtags for each question.

Figure 13 .
Figure 13.Precision with recommended hashtags range from 1 to 5 in Twitter dataset.

Figure 19 .
Figure 19.Results of Tree-LSTM with dimension of hidden layers range from 50 to 300.

Table 1 .
Statistics of the dataset, Nt(avg) is the average number of hashtags in the Twitter dataset.

Table 2 .
Statistics of the dataset, Nt(avg) is the average number of hashtags in the Zhihu dataset.

Table 3 .
Evaluation results of different methods for top-1 hashtag recommendation in Twitter dataset.

Table 4 .
Evaluation results of different methods for top-1 hashtag recommendation in Zhihu dataset.

Table 5 .
The influence of number of training data in Twitter dataset.

Table 6 .
The influence of number of training data in Zhihu dataset.