Feature Weighting Based on Inter-Category and Intra-Category Strength for Twitter Sentiment Analysis

Abstract: The rapid growth in social networking services has led to the generation of a massive volume of opinionated information in the form of electronic text. As a result, research on text sentiment analysis has drawn a great deal of interest. In this paper, a novel feature weighting approach is proposed for the sentiment analysis of Twitter data. It properly measures the relative significance of each feature with respect to both inter-category and intra-category distribution. A new statistical model called Category Discriminative Strength is introduced to characterize the discriminability of the features among the categories, and a modified Chi-square (χ²)-based measure is employed to measure the intra-category dependency of the features. Moreover, a fine-grained feature clustering strategy is proposed to maximize the accuracy of the analysis. Extensive experiments demonstrate that the proposed approach significantly outperforms four state-of-the-art sentiment analysis techniques in terms of accuracy, precision, recall, and F1 measure with various sizes and patterns of training and test datasets.


Introduction
Recently, the exponential growth of the Internet has spurred the use of information in diverse formats [1]. In particular, social networking services (SNS) such as Twitter provide users with a platform for presenting their preferences and opinions on various topics. Such opinionated textual information has become a gold mine for institutions and companies seeking to understand users' sentiments regarding their products and services [2,3]. Sentiment analysis refers to a research field of automatic language processing that extracts and analyzes opinions, sentiments, and subjectivity from text [4]. It has drawn increasing interest and become an active research topic due to its importance in satisfying these emerging demands [5]. Text categorization (TC) is the task of effectively organizing a massive amount of data, and it is commonly employed for sentiment analysis [6].
TC is the process of automatically classifying each piece of text, in the form of a document, into a predefined category [6]. Generally, it consists of four stages: preprocessing, document representation, feature selection and weighting, and finally classification. In the preprocessing stage, basic language processes such as tokenization and stop-word removal are performed to normalize the text. For document representation, the Vector Space Model (VSM) is the most commonly used method, wherein the documents are represented as vectors in the feature space. In feature selection and weighting, a filtering technique based on Chi-square statistics, mutual information, or information gain is employed to extract a set of useful features from the feature space [7]. Also, a weight is assigned to each term to reflect its significance. The main contributions of this paper are as follows.
• A novel mathematical model called Category Discriminative Strength (CDS) is proposed, which measures the strength of the terms in differentiating the categories. Also, the Intra-Distribution Measure (IDM) is introduced to characterize the partial significance of the terms inside a category.
• A modified Chi-square statistics model is introduced to measure the intra-category dependency of the features.
• A fine-grained feature clustering strategy is presented, which properly defines the margin of discriminative features and aggregates the features of similar distributions for efficient feature weighting.
• An adaptive weighting strategy is proposed to properly decide the weight of each feature, and the inter-category and intra-category relevance of the features is included to maximize the efficiency of the weighting.
Extensive experiments demonstrate that the proposed approach is superior to the existing state-of-the-art schemes in terms of classification accuracy, precision, recall, and F1 measure. The rest of the paper is organized as follows. Section 2 discusses the background of sentiment analysis. In Section 3, the proposed scheme is presented, and Section 4 shows the results of the experiments with the proposed scheme. Finally, the paper is concluded in Section 5.

Twitter Sentiment Analysis and Opinion Mining
Twitter is the most popular microblogging platform, allowing users to create 140-character text messages called "tweets". A tweet is a kind of opinionated message expressing the feelings or opinions of the users on a subject. Several million active users send and receive over 500 million messages daily [11], which are mixed with various web services such as instant messaging [12]. Twitter has thus become a gold mine, and collections of tweets extracted from the Twitter API have become source corpora for sentiment analysis [13]. The objective of sentiment analysis is to predict the polarity (positive/negative) of an opinionated piece of text in order to evaluate the mood of people concerning a particular topic [14]. The lexicon-based approach and the machine learning-based approach are the two main approaches that have been adopted for sentiment analysis. The lexicon-based approach uses a sentiment lexicon to analyze the sentiment of tested instances, while the machine learning-based approach employs a machine learning algorithm to train a text classifier for the classification.
Opinion mining is a domain of natural language processing, computational linguistics, and text mining associated with the computational study of opinions, sentiments, and emotions in text [12]. Emotion-based views, thoughts, or attitudes toward particular entities are often known as sentiment; hence, opinion mining is also regarded as sentiment analysis. Opinion mining has various applications in the fields of education, research, marketing, etc. For example, it can be used to evaluate the feedback from customers on a product advertisement, and it helps marketers find out which products or services are popular. Previously, the research on opinion mining was not extensive, owing to the limited amount of information people generated. However, massive volumes of opinionated information have become available these days with the rapid growth of various SNS.

Lexicon-Based Approach
The lexicon-based approach is based on an unsupervised learning technique where a sentiment lexicon is constructed to determine the polarity of the given text via a predefined scoring function over positive and negative indicator words [15]. Popular lexicons include the Opinion Lexicon [16], the AFINN Lexicon [17], and the NRC Emotion Lexicon [18]. The average of the polarities derived from each word of the text found in the lexicon serves as the sentiment score, which reflects the orientation of the polarity of the text [19]. Existing studies generally utilize the polarity of the previous sentence as a tie-breaker when the classification of a tested sentence cannot be derived from the scoring function. Therefore, the generation of a meaningful lexicon based on an unsupervised labeling corpus is crucial with the lexicon-based approach [15]. As the creation of a lexicon is a time-consuming process, and a predesigned lexicon is not necessarily directly applicable to different domains and languages, conventional lexicon-based methods are dedicated to extracting domain-specific words [20].
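The scoring and tie-breaking steps described above can be sketched as follows; the tiny lexicon and function names here are illustrative only, not part of any of the cited resources.

```python
# Minimal sketch of lexicon-based polarity scoring (hypothetical lexicon;
# real systems use resources such as the Opinion Lexicon or AFINN).
LEXICON = {"good": 1.0, "great": 1.0, "love": 1.0,
           "bad": -1.0, "terrible": -1.0, "hate": -1.0}

def sentiment_score(text, prev_polarity=0.0):
    """Average the polarities of the lexicon words found in `text`.

    Falls back to the polarity of the previous sentence when the
    scoring function is inconclusive (score == 0), as described above.
    """
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    score = sum(hits) / len(hits) if hits else 0.0
    return score if score != 0.0 else prev_polarity
```

A score above zero indicates positive orientation, below zero negative, and the previous sentence's polarity resolves ties.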
Extensive research has been conducted with the lexicon-based approach. In [21], a visual framework, EmoWeb, was proposed to dynamically analyze the sentiment of textual content relying on a well-constructed lexicon. Experiments were conducted by collecting data from websites to show the applicability of the proposed scheme. A Neural Network (NN) technique was used to overcome the shortcoming of the lexicon-based approach that some words are ignored by the lexicon [22]. A Portuguese context-sensitive lexicon [23] was constructed for sentiment polarity identification. It employed the LexReLi methodology to build the proposed lexicon, and the efficiency of different combinations of the techniques was also investigated. In [24], an Arabic senti-lexicon was created for the sentiment analysis of the Arabic language, and a Multi-domain Arabic Sentiment Corpus (MASC) was built.

Machine Learning-Based Approach
The machine learning-based approach employs a supervised learning technique to train a text classifier using a manually labeled training dataset and extracted features. More specifically, the process can be described as follows. A training dataset consisting of labeled text documents, D = {(d_1, l_1), . . . , (d_m, l_m)}, is given, where d_i represents a document belonging to the initial training corpus, D, and l_i is the manual label reflecting a specific preset class among the set of classes, L. Here, the main task is to classify unlabeled text data by generating an effective classification model from the labeled training dataset [19]. Figure 1 depicts the structure of Twitter sentiment analysis based on the machine learning approach. Various formats of sources such as blogs, reviews, etc., are utilized as input data for Twitter sentiment analysis. Here, useful features are selected by a feature extractor for subsequent processing, which is critical to sentiment analysis. This is because the combination of different features plays a crucial role in the classification. The classifier is then trained by a machine learning algorithm to evaluate the degree of polarity and assign a corresponding label to the tested instance [25][26][27]. Various learning algorithms have been adopted for the machine learning-based approach, including Support Vector Machine (SVM), Decision Tree (DT), Naïve-Bayes (NB), and NN. SVM has proved to be a powerful classification algorithm, especially in processing a massive number of features [28][29][30]. However, its weaknesses of low interpretability and computational overhead limit its usage. DT is another popular algorithm where the features and corresponding classes are reflected in the form of a tree structure. It is well understood, and can be easily employed to classify textual data.
However, exploring the features is restricted owing to its rigid tree structure, and feature overfitting may degrade the performance [31][32][33]. NB is a commonly used method for sentiment analysis due to its effectiveness and simplicity. It selects the category of the highest probability for the unlabeled text based on probability theory [19,34][35][36].
The NN classifier employs a network structure composed of a group of connected units, where each input represents a term and an output is generated from a function to represent the corresponding category. The dependence relations are represented by weighted edges between the units. Assume that a tested document_d is evaluated by the NN classifier. The term weight, w, is viewed as an input to the units, and the output of the units is the decision of the category, which is obtained by activating the units of the network [37]. The perceptron is a simple linear NN classifier, also called a single-layer perceptron, which applies the Heaviside step function to activate the units of the network [38]. Another type of linear NN classifier incorporating logistic regression achieves good classification performance [39]. A non-linear NN classifier is implemented by constructing one or more layers of units. In [40], a non-linear NN was applied to the task of topic spotting, which models higher-order interactions between the terms. Due to the advantage of the NN classifier in processing data with multiple attributes, much recent research has focused on NN techniques. Novel NN models (ConvGRNN and LSTM-GRNN) were proposed for document-level sentiment analysis, where the semantics of sentences and their correlations are encoded by the document representation [41]. In [42], a Mixed Neural Network (MNN) was introduced to solve the classification problem, which combines a rectifier Neural Network and Long Short-Term Memory (LSTM). A neural architecture and two extensions of LSTM [43] were presented to handle the task of targeted aspect-based sentiment analysis.
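As an illustration of the single-layer perceptron with Heaviside activation mentioned above, the following minimal sketch treats each feature as a term weight; the training data and function names are illustrative assumptions, not taken from the cited works.

```python
# Sketch of a single-layer perceptron with the Heaviside step activation.
def heaviside(x):
    return 1 if x >= 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron update: adjust weights only on misclassification."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - pred  # 0 when correct; +1 or -1 otherwise
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

On linearly separable term-weight vectors, the update rule converges to a separating hyperplane; non-linear NN classifiers stack one or more hidden layers on top of such units.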

Feature Weighting
Feature weighting is a process of assigning an appropriate weight to individual features in accordance with their relevance to the given domains. It is generally thought of as a generalization of feature selection, where the presence of a feature serves as the criterion for its extraction [44]. Each feature is represented as a binary vector, wherein "0" denotes the absence of the feature (evict the feature), and "1" denotes the presence (keep the feature) [19]. A variety of feature weighting methods have been introduced. DF is the simplest one, where a "single appearance" of a word is equivalent to "multiple appearances" [45]. It counts the number of documents in which a word occurs, and uses the value to represent the corresponding document. TF is another criterion that explores feature weighting in a different direction. It is based on the intuition that a term of multiple appearances is more important than one of a single appearance [46]. Here, the frequency with which a word occurs in a document is counted and used to represent the document, and it is expressed as:

TF(w_i, c_k) = ∑_{j=1}^{N} tf_{i,j} (1)

Here, tf_{i,j} represents the frequency of word_i in document_j, and N is the number of documents in category_c_k. TF has proved to be effective, but its performance is degraded with multi-domain datasets because of its simple measure of words. Term Frequency and Inverse Document Frequency (TF-IDF) is a state-of-the-art method of feature weighting [47], which comprehensively measures the appearance frequency and distribution of words in assigning the weights, and it is obtained as:

tf-idf(i, j, D) = tf_{i,j} × idf(i, D) (2)

where idf(i,D) measures how much information word_i provides to represent document_j in document set_D, which is formulated as:

idf(i, D) = log(|D| / |{j ∈ D: i ∈ j}|) (3)

Here, |{j ∈ D: i ∈ j}| represents the number of documents containing word_i in D. TF-IDF is effective in selecting important words in a document while eliminating common words, yielding good performance in terms of classification accuracy.
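The TF and IDF components above can be sketched directly; documents are modeled as word lists, and the toy corpus is illustrative.

```python
import math

# Sketch of TF-IDF weighting: tf counts occurrences of a word in a
# document; idf discounts words that appear in many documents of D.
def tf(word, doc):
    return doc.count(word)

def idf(word, corpus):
    df = sum(1 for doc in corpus if word in doc)  # document frequency
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)
```

A word occurring in every document gets idf = log(1) = 0, so common words are eliminated, matching the behavior described above.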
Part of Speech-based Weighting (PSW) is another novel feature weighting approach improving the accuracy of Twitter sentiment analysis. It clusters the unique words of the categories into different subsets based on each word's Part Of Speech (POS) feature, which is shown in Table 1. The words in the subsets are weighted considering the intra-category relevance of the words, which is measured as:

wf_{i,j} = x · tf_i, if tf_i > E[F_j]; tf_i, otherwise (4)

Here, wf_{i,j} represents the assigned weight for word_i in subset_j, and x is a weight with the value of 2, 1.5, or 1 for each of the three subsets, which is given to leverage the role of the words in the different subsets in affecting the orientation of the sentiment. E[F_j] refers to the expected frequency of the words in subset_j, which is computed as:

E[F_j] = (1/|S_j|) ∑_{w ∈ S_j} tf_w (5)

Note from Equation (4) that the weight is adjusted only when the TF value of a word is larger than the expected frequency of the subset. This indicates that the word is more informative than the average level of the subset. The PSW method incorporates a clustering technique to classify the words and assign weights based on the semantic meaning and hierarchical relevance of the words, allowing effective feature weighting.
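The adjustment described above can be sketched as follows; this is a paraphrase of the PSW rule (boost by the subset factor x only when TF exceeds the subset's expected frequency), with illustrative data, not the paper's exact implementation.

```python
# Hedged sketch of PSW-style weighting: a word's TF is multiplied by the
# subset factor x only when it exceeds the subset's mean frequency.
def expected_frequency(subset_tf):
    """E[F_j]: mean TF over the words of subset_j."""
    return sum(subset_tf.values()) / len(subset_tf)

def psw_weight(word, subset_tf, x):
    ef = expected_frequency(subset_tf)
    tf = subset_tf[word]
    return x * tf if tf > ef else tf
```

Words at or below the subset's average frequency keep their plain TF, so only above-average (more informative) words are boosted.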

The Proposed Scheme
In this section, a novel adaptive feature weighting scheme called CDS-C (Category Discriminative Strength with Chi-square model) is proposed, which properly measures the inter-category and intra-category significance of a feature. A new mathematical model called Category Discriminative Strength (CDS) is presented to measure the inter-category discriminative strength of the features, while a modified Chi-square model is introduced to measure the dependency of the features within a category. Also, a fine-grained clustering strategy is proposed to precisely define the margin of discriminative features and aggregate the features of similar distributions for efficient feature weighting. Furthermore, an adaptive weighting strategy is proposed to properly decide the weight of each feature.

Motivation
The main objective of feature weighting in sentiment analysis is to assign an appropriate weight to every single word in order to reflect its relative significance in the feature space, which in turn allows accurate prediction of the sentiment. Generally, many supervised feature weighting approaches assume that the inter-category distribution of a feature plays the major role in feature weighting, while ignoring its intra-category distribution. Nevertheless, researchers have questioned feature weighting based only on the inter-category distribution of the features. It has been shown that incorporating the two distributions is superior to using only one with respect to the measurement of the significance of a feature, and as a result yields better classification performance. The inter-category and intra-category distributions of the features are equally important in feature weighting.
Intuitively, the more evenly a word is distributed in the feature space, the more weakly it distinguishes a category. For example, even though conjunctions appear widely in a document, they hardly represent any category. It is generally believed that words of large CDS mostly occur in only one or a few categories. Note that adjectives are more likely to appear in a given category to express the sentiment. Therefore, it is highly likely that words distributed non-uniformly across the categories represent a specific category, and thus a larger weight needs to be assigned. While the inter-category distribution is important in characterizing the significance of a word for the decision of the weight, the intra-category distribution of the word is also crucial in measuring its significance. Specifically, the distribution of a word within a certain category reflects its dependency on that category. A more accurate weight can be obtained by incorporating this dependency. Thus, in the proposed scheme, CDS is used to measure the inter-category distribution of the word, while the property of the intra-category distribution is formulated using the Intra-Distribution Measure (IDM). The feature weight is then finally computed based on the two measures.

Category Discriminative Strength
In order to characterize the inter-category distribution, a specific VSM is constructed to represent the feature weight of the words across the whole set of categories. Here, every single category, c_i (∈ C, C = {c_1, . . . , c_n}, n ≥ 1), is viewed as a vector in the feature space, and it is represented as v_{c_i} (∈ V, V = {v_{c_1}, . . . , v_{c_n}}). n is the number of categories, and v_{c_i} denotes the feature value of word_w based on target category_c_i. In TF, the number of occurrences of a word in a certain document is counted to represent the feature weight of the document. Typically, the greatest feature value among the whole set of categories is utilized as the global feature weight of the word, which is denoted as:

v(w) = max{v_{c_1}(w), . . . , v_{c_n}(w)}

This approach effectively analyzes the relative weight of the words based on the frequency. However, the discriminative strength of the words is not properly considered. Note that appropriately reflecting the discriminative strength of a word is crucial for the measure of the inter-category distribution. Therefore, the frequency with which a word appears in all the documents of a category, cf, is employed instead of TF, and the VSM is then expressed as:

V(w) = (cf_{c_1}(w), . . . , cf_{c_n}(w))

Assume that word_w occurs only in category_c_1. Then, the VSM becomes a one-element vector, which indicates that word_w is solely relevant to class_c_1. Note that the more categories a word belongs to, the less discriminative information the word retains to distinguish a certain category. Especially, when cf_{c_1} = cf_{c_2} = . . . = cf_{c_n}, word_w is distributed uniformly in the class space, which indicates that word_w is meaningless in representing any category.
In order to effectively measure the inter-class distribution, a novel model called CDS is introduced to measure how much discriminative information a word contains in representing a category, which is expressed as:

CDS(w_µ, c_ν) = ∑_{k=1}^{n} (cf_{w_µ,c_ν} − cf_{w_µ,c_k})

Here, n is the number of categories containing word_w_µ in the training dataset, and cf_{w_µ,c_ν} is the frequency of word_w_µ appearing in category_c_ν, which is obtained by summing up the total number of occurrences of word_w_µ within all the documents of that category. CDS(w_µ,c_ν) denotes the total frequency distance of word_w_µ on the pre-measured category_c_ν in the feature space. If CDS(w_µ,c_ν) > 0, word_w_µ is viewed as having positive relevance to category_c_ν, and owns discriminative information with respect to that category. Otherwise, it is regarded as meaningless.
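As a concrete sketch of the CDS computation, treating CDS as the total frequency distance of a word from the other categories (an assumption consistent with the description above: CDS is positive exactly when the word occurs in the target category more often than its average across categories):

```python
# Hedged sketch of CDS; `cf` maps each category to {word: category frequency}.
def cds(word, category, cf):
    """Total frequency distance of `word` from the other categories
    that contain it. Positive: discriminative for `category`;
    zero (uniform distribution) or negative: meaningless."""
    cats = [c for c in cf if word in cf[c]]
    here = cf[category].get(word, 0)
    return sum(here - cf[c][word] for c in cats)
```

A uniformly distributed word (such as a conjunction) yields CDS = 0 for every category, while a word concentrated in one category yields a large positive CDS there.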
Assume that word_w_µ appears in n categories. Then, a queue, Q_p, is built with the values of CDS with respect to the n individual categories sorted in descending order, Q_p = CDS(w_µ,c_α) ≥ CDS(w_µ,c_β) ≥ . . . Here, CDS(w_µ,c_α) represents the highest CDS value of the n elements, which indicates that category_c_α is the first choice for word_w_µ. The degree of choice for word_w_µ considering multiple categories, D_inter(w_µ,c_ν), is then formulated from Q_p. Here, word_w_µ is supposed to appear in category_c_j (1 ≤ j ≤ m), and D_inter(w_µ,c_ν) represents the degree of choice of category_c_ν for word_w_µ, measuring how much contribution word_w_µ makes to category_c_ν in predicting the polarity of the tested document. Using only one category for a word may not be enough, because every presence of the word in the other categories is also meaningful. Therefore, the intra-category measure of the words in one category also needs to be analyzed in order to precisely characterize the words.
Similar to Q_p, another queue, Q_q = CDS(w_α,c_ν) ≥ . . . ≥ CDS(w_i,c_ν), is built with the values of CDS(w_i,c_ν) of the words word_w_i (1 ≤ i ≤ m) belonging to category_c_ν sorted in descending order. The degree of intra-category significance of a word, D_intra(w_µ,c_ν), is then measured from Q_q. With word_w_i (1 ≤ i ≤ m) in category_c_ν, D_intra(w_µ,c_ν) measures the intra-category significance of word_w_µ. Incorporating D_inter(w_µ,c_ν) and D_intra(w_µ,c_ν), the discriminability of word_w_µ for category_c_ν, Φ(w_µ,c_ν), is measured. Based on the relative frequency and discriminability, word_w_µ is represented as a vector, DisVector, as depicted in Figure 2. Here, the word is located at the origin, and the pointer originating from it represents the position of the discriminability of the word. Obviously, Length_i and θ_i need to be computed to construct the DisVector of a word. The relative frequency, cf(w_µ,c_ν), and the discriminability, Φ(w_µ,c_ν), are utilized for obtaining them. Here, λ is a linear factor that maps the range of Φ(w_x,c_y), [Φ_min, Φ_max], into [0, π/2]. The values of Φ_min and Φ_max are computed based on the training dataset. Refer to Figure 2. The degree of discriminability of word_i, D_i, is then obtained from Length_i and θ_i. The discriminability of a word is thus characterized by DisVector in terms of the orientation of the discriminability and the strength of the word.
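The linear mapping performed by λ can be sketched as follows; the function name is illustrative, and only the mapping of [Φ_min, Φ_max] onto [0, π/2] described above is assumed.

```python
import math

# Sketch of the DisVector angle: theta maps the discriminability Phi
# linearly from [Phi_min, Phi_max] onto [0, pi/2].
def theta(phi, phi_min, phi_max):
    lam = (math.pi / 2) / (phi_max - phi_min)  # the linear factor lambda
    return lam * (phi - phi_min)
```

A word of minimum discriminability points along angle 0, and one of maximum discriminability along π/2, so the angle encodes the orientation of the discriminability while the vector length encodes the relative frequency cf.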

Intra-Distribution Measure (IDM)
The intra-distribution of a word is important in reflecting its dependency on a category. Inverse Document Frequency (IDF) is commonly adopted, which measures how much information a word offers in representing the category. It measures the degree of occurrence of a word across all the documents, and is expressed as:

idf(w, D) = log(N / |{d ∈ D: w ∈ d}|)

Here, N is the number of documents in the training dataset, and |{d ∈ D: w ∈ d}| denotes the number of documents containing word_w. IDF has been incorporated into various schemes, including TF-IDF, to decrease the weight of frequently appearing words and increase the weight of rare words. This is a coarse-grained approach because it supposes that rare words are more important than common ones. In this paper, a fine-grained Intra-Distribution Measure called IDM is introduced to analyze the words, which incorporates the dependency derived from Chi-square statistics and the occurrence of the words in the documents. Here, χ²(w_i,c_j) is the Chi-square value measuring the dependency of word_w_i on category_c_j. m is the number of documents in which word_w_i appears in category_c_j, and n is the corresponding number for the categories except class_c_j. M denotes the sum of the cf values of all the words in category_c_j, and N is computed by summing the cf values of the words of the categories except class_c_j, as summarized in Table 2. Note that, unlike the conventional Chi-square method, cf_i = ∑_{j=1}^{|d_c|} tf_{i,j} is employed as the input of the Chi-square calculation to avoid the weakness of incorrectly overestimating the role of low-frequency words. The proposed IDM is based on the intuition that words of high document frequency are relatively more important. By incorporating the dependency of the words on the category, a more accurate measurement of the intra-distribution significance of the words is obtained.

Table 2. Different cases of the sums of the frequencies.

Feature Selection    ∈ c_j    ∉ c_j    Sum
word_w_i             m        n        m + n
all words            M        N        M + N
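The Chi-square dependency over a 2×2 contingency table of this shape can be sketched as follows; feeding the table with cf values rather than raw document counts follows the modification described above, while the exact cell layout is an assumption based on the variable descriptions.

```python
# Hedged sketch of the Chi-square dependency of a word on category c_j,
# computed over a 2x2 contingency table fed with cf values.
def chi_square(m, n, M, N):
    """m: cf of the word in c_j; n: cf of the word outside c_j;
    M: total cf of all words in c_j; N: total cf outside c_j."""
    a, b = m, n          # cells for the word itself
    c, d = M - m, N - n  # cells for all other words
    total = M + N
    num = total * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * M * N
    return num / den if den else 0.0
```

A word distributed evenly inside and outside the category yields χ² = 0 (no dependency), while a word concentrated in c_j yields a large value.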

Feature Clustering and Weighting
In order to efficiently analyze the large number of features extracted from the training dataset, a feature clustering and weighting strategy is proposed. The clustering is performed to partition the features into representative groups, generating a smaller classification model and allowing a higher accuracy of sentiment prediction. In this paper, two representative groups, G_D and G_N, are constructed to accommodate the distinctive and non-distinctive terms of a category [48].
Intuitively, an emotive word is representative and possesses enough distinctive information to reflect the sentiment orientation of a document. In order to accurately extract emotive words, firstly, all the documents belonging to a category are parsed by the Part of Speech (POS) tagger, and a POS tag is assigned to each word.
The unique terms with the tag of Verb, Adjective, or Adverb are regarded as emotive indicators to be clustered into G_D, and the words with other tags are put in G_N. However, as some words may co-occur in multiple categories, not all of the emotive words in a category are distinctive for that category. Since the words in the same cluster are expected to display similar inter-category distributions, the distribution of the target word over the whole feature space is examined. Then, the words of similar distribution are clustered [49]. The relative distance of similarity, D_{i,j}, between word_w_i and cluster_G_θ (θ = D ∨ N) is obtained as:

D_{i,j} = ||Φ(w_i,c_j) − exp(G_θ,c_j)||

Here, the distance is measured by the Euclidean norm of the Φ value of word_w_i for category_c_j from the expectation. Recall that the Φ value measures the discriminability of a word for the given category. Therefore, it is utilized in measuring the status of the word in that category. exp(G_θ,c_j) is the expectation of the Φ values of the words of cluster_G_θ for category_c_j, and it is computed as:

exp(G_θ,c_j) = (1/p) ∑_{i=1}^{p} Φ(w_i,c_j)

Here, p is the number of unique words in cluster_G_θ. The average distance, D_θ (θ = D ∨ N), of the words from cluster_G_θ is obtained as:

D_θ = (1/p) ∑_{i=1}^{p} D_{i,j}

Then, all the words of cluster_G_D are compared with the average distance, D_D, to extract the relevant emotive words. If D_{i,j} > D_D, word_w_i is not regarded as relevant to G_D, and it is moved to G_N. Note that cluster_G_D thus accommodates highly relevant words of large discriminative information. Then, the weight of word_w_i, W, is calculated to reflect its significance in terms of the inter-class and intra-class distributions. Here, the weights assigned to word_w_i of cluster_G_D and cluster_G_N of class_c_k are represented as W(w_i,G_D,c_k) and W(w_i,G_N,c_k), respectively, and p and q are the numbers of unique words of G_D and G_N, respectively.
Note that cluster G_D retains the emotive words obtained through fine-grained selection, and Σ_{i=1}^{p} D_i / Σ_{i=1}^{q} D_i reflects the ratio of the inter-distribution measures of G_D and G_N, capturing the intuition that emotive words strongly represent the sentiment orientation of the document. IDM(w_i, c_k) is employed as an intra-category distribution metric to accurately reflect the degree of significance of word w_i in class c_k.
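The POS-based split and the distance-based refinement described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the POS tags and the per-category Φ (discriminability) values are fabricated assumptions rather than output of the Stanford tagger or the trained model.

```python
# Toy POS lookup standing in for the Stanford POS tagger (illustrative only).
POS = {"great": "ADJ", "slowly": "ADV", "love": "VERB",
       "phone": "NOUN", "battery": "NOUN"}
EMOTIVE_TAGS = {"VERB", "ADJ", "ADV"}

# Hypothetical Phi values per category: index 0 = negative, 1 = positive.
PHI = {"great": [0.1, 0.9], "slowly": [0.7, 0.3], "love": [0.2, 0.8],
       "phone": [0.5, 0.5], "battery": [0.4, 0.6]}

# Coarse split by POS tag: emotive indicators go to G_D, the rest to G_N.
G_D = [w for w, t in POS.items() if t in EMOTIVE_TAGS]
G_N = [w for w, t in POS.items() if t not in EMOTIVE_TAGS]

def expectation(cluster, j):
    """exp(G_theta, c_j): mean Phi value of the cluster for category c_j."""
    return sum(PHI[w][j] for w in cluster) / len(cluster)

def dist(w, cluster, j):
    """D_ij: deviation of Phi(w, c_j) from the cluster expectation."""
    return abs(PHI[w][j] - expectation(cluster, j))

# Refinement for category j = 1: words of G_D farther from the expectation
# than the average distance D_D are demoted to G_N.
j = 1
dists = {w: dist(w, G_D, j) for w in G_D}
D_D = sum(dists.values()) / len(dists)
for w, d in dists.items():
    if d > D_D:                       # weakly discriminative emotive word
        G_D.remove(w)
        G_N.append(w)
```

Precomputing the distances before mutating G_D keeps the comparison consistent: every word is measured against the same cluster expectation.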

Sentiment Estimation
Bayes' theorem, as the basis of a supervised learning technique, is commonly applied to sentiment analysis. In this paper the Multinomial Naïve Bayes (MNB) classifier is employed to predict the polarity of the tested documents. Assuming that tested document d is represented as a vector <w_1, ..., w_|W|>, the MNB estimates the label of the Maximum A Posteriori (MAP) category and the probability of the category membership of d for category c using Equations (23) and (24). Here |W| is the number of unique words in document d, w_i represents the i-th word of d, and C is the set of all possible labels. P(w_i|c) is the conditional probability that word w_i (1 ≤ i ≤ |W|) of document d belongs to category label c. Observe from Equation (24) that multiplying |W| conditional probabilities may cause floating-point underflow, which results in incorrectly predicting the sentiment. Hence, the sum of the logarithms of the probabilities is used instead of the product of the probabilities, as in Equation (25).
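The log-sum form of MAP prediction can be sketched as follows. The priors and conditional probabilities below are illustrative assumptions, not trained values from the paper's corpus.

```python
import math

# Priors and smoothed conditional probabilities P(w_i | c); toy numbers only.
prior = {"pos": 0.5, "neg": 0.5}
cond = {
    "pos": {"great": 0.03, "battery": 0.01, "bad": 0.002},
    "neg": {"great": 0.004, "battery": 0.01, "bad": 0.05},
}

def predict(words):
    """MAP label via log P(c) + sum_i log P(w_i | c) instead of the raw
    product P(c) * prod_i P(w_i | c), avoiding floating-point underflow."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(math.log(cond[c][w]) for w in words)
    return max(scores, key=scores.get)
```

Since log is monotonic, the argmax over the log scores equals the argmax over the original products, but the sums stay in a numerically safe range even for long documents.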
The prior probability, P(c), is computed as P(c) = N_c/N, where N_c is the number of documents in category c and N is the total number of documents in the training data. Note that incorporating the feature weights into both the classification formula and the conditional probability model improves the performance of the Naïve Bayes classifier [51]. Therefore, the conditional probability of Equation (25) is modified accordingly. Here W(w_i, G_θ, c) is the importance degree of the word in terms of inter-class and intra-class distribution, computed based on Equations (21) and (22). λ(c_j, c) is a binary function judging whether its two input parameters are equal: λ(c_j, c) = 1 if c_j = c, and 0 otherwise. As the feature weights are incorporated into the MNB model for calculating the probability of the target document belonging to category c, the class label with the highest probability is selected as the most suitable one for that document. Note that each synonym is separately assigned its own feature weight reflecting its significance, so that each has its respective effect on the sentiment analysis.
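One common way to fold per-word feature weights into the MNB conditional probability is to scale the word counts by their weights under Laplace smoothing; the sketch below follows that generic weighted-NB formulation, not necessarily the paper's exact modified Equation, and all counts and weights are illustrative assumptions.

```python
def lam(cj, c):
    """lambda(c_j, c): 1 if the two labels are equal, 0 otherwise."""
    return 1 if cj == c else 0

# count[c][w]: frequency of word w in category c (toy numbers).
count = {"pos": {"great": 30, "bad": 2}, "neg": {"great": 4, "bad": 50}}
# weight[c][w]: W(w, G_theta, c), the inter/intra-class importance degree.
weight = {"pos": {"great": 1.8, "bad": 0.6}, "neg": {"great": 0.5, "bad": 2.1}}
vocab = {"great", "bad"}

def cond_prob(w, c):
    """Weighted, Laplace-smoothed P(w | c): counts are scaled by the feature
    weights, and lam() selects the contributions of the matching category."""
    num = 1 + sum(lam(cj, c) * weight[cj][w] * count[cj][w] for cj in count)
    den = len(vocab) + sum(lam(cj, c) * weight[cj][v] * count[cj][v]
                           for cj in count for v in vocab)
    return num / den
```

Because the denominator sums the same weighted counts over the whole vocabulary, the conditional probabilities of each category still sum to one, so the weighted model remains a proper distribution.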

Performance Evaluation
In order to demonstrate the effectiveness of the proposed feature weighting scheme for Twitter sentiment analysis, extensive experiments are carried out, and the results are compared with those of the existing schemes, DF, TF, TF-IDF, and PSW.

Experiment Setting
The proposed CDS-C scheme is evaluated using MATLAB, and the implementation consists of three components: the preprocessor, the POS tagging API, and the Bayes-based text classifier. The preprocessor removes stop-words and punctuation, and normalizes the training dataset into a customized form accessible to the POS tagger and the feature extractor. A MATLAB-based function is utilized to invoke the Stanford POS tagger [52,53], which offers the API used to obtain the POS tag of each word in the training dataset. The Bayes-based text classifier applies Bayes' theorem to analyze the sentiment orientation and predict the polarity (positive/negative) of the tested document. In this paper the MNB classifier is utilized as the text classifier under the assumption of feature independence. The workload in the experiments is extracted from Sentiment 140, a benchmark dataset derived from the Twitter API and widely adopted in sentiment analysis [54]. The workload is composed of 1,600,000 labeled tweets, one half labeled as positive and the other half as negative. The data of Sentiment 140 covers a wide range of domains and data characteristics [55], and two instances are shown in Figure 3. Each record consists of six attributes: the polarity indicator (0 = negative, 4 = positive), user id, posting time, query condition, username, and text of the tweet. Four metrics are adopted to evaluate the performance of feature weighting in sentiment analysis: accuracy, precision, recall, and F1-score [19].
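The record layout and the four evaluation metrics can be sketched as follows. The sample row is fabricated for illustration in the Sentiment 140 six-field style, and the helper is a plain statement of the standard metric definitions rather than the paper's evaluation code.

```python
import csv
import io

# One Sentiment 140-style record: six quoted, comma-separated fields.
row = ('"0","1467810369","Mon Apr 06 22:19:45 PDT 2009",'
       '"NO_QUERY","someuser","my phone battery died again"')
polarity, user_id, posted_at, query, username, text = next(csv.reader(io.StringIO(row)))
label = {"0": "negative", "4": "positive"}[polarity]

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Using the csv module rather than a naive split keeps quoted tweet text containing commas intact.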

Experimental Results
Three groups of cross-validated experiments are carried out to compare the proposed scheme with the existing ones. As the workload of Sentiment 140 is built from 800,000 negative and 800,000 positive labeled tweets, the data is reconstructed as pairs of one negative and one positive tweet. For each experiment, 10,000 distinct random numbers (half odd and half even) ranging from one to 1,600,000 are generated using a MATLAB function. A training dataset, composed of 5000 negative and 5000 positive randomly selected documents, is then constructed by extracting the tweets of the corresponding indexes from Sentiment 140. The tested dataset is selected from the remaining 1,590,000 tweets.
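The balanced index sampling described above can be sketched as follows. Assumption: after the corpus is rearranged into (negative, positive) pairs, odd indexes address one polarity and even indexes the other, so drawing 5000 distinct odd plus 5000 distinct even numbers in [1, 1,600,000] yields a 5000/5000 balanced training sample. The seed is an arbitrary choice for reproducibility, not a value from the paper.

```python
import random

rng = random.Random(42)                              # arbitrary seed
odd_idx = rng.sample(range(1, 1_600_001, 2), 5000)   # distinct odd indexes
even_idx = rng.sample(range(2, 1_600_001, 2), 5000)  # distinct even indexes
train_idx = odd_idx + even_idx                       # 10,000 distinct indexes
```

Sampling the odd and even ranges separately guarantees the half-odd, half-even split without rejection sampling, and `random.sample` guarantees distinctness within each half.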
The first group of experiments investigates the performance in processing tested data with a balanced class distribution. Here, tested datasets ranging from 500 to 6000 elements are built, each consisting of equal numbers of randomly selected positive and negative documents. Second, two experiments compare the performance of the schemes in handling biased tested data consisting of solely positive or solely negative instances. Finally, the schemes are evaluated with tested documents of random polarity.

Figure 4 shows the classification performance of the five schemes when the size of the balanced tested dataset varies from 500 to 6000. Observe from the figures that the proposed CDS-C scheme consistently outperforms the other schemes for all four metrics regardless of the size of the tested data. Evaluating tweets with equal numbers of negative and positive instances is a common setting. The proposed scheme presents the highest accuracy, precision, recall, and F1-score, which indicates that it is very sensitive to the sentiment orientation and able to classify the tested data into the correct category. This is achieved by the weights computed according to both inter-category and intra-category feature relevance, together with the discriminative features. Moreover, the proposed CDS and IDM models properly measure the discriminability of the features of the training dataset, which contributes to quantifying the usefulness of the features. In general, the classification performance tends to degrade as the amount of tested data increases, because a training corpus of limited size cannot sufficiently cover an increased amount of tested data. As depicted in Figure 4a, the accuracy slightly decreases as the number of tested documents increases from 500 to 6000.
Also, it can be seen from Figure 4a,d that the accuracy and F1-score of DF are superior to those of TF, because considering the presence or absence of features allows higher classification performance than considering only their frequencies [56].
Table 3 shows two tweets extracted from Sentiment 140, and Table 4 compares the polarity labels assigned by the five approaches when evaluating the tested tweets of Table 3. Observe from Table 3 that the indicator is the predefined polarity label, "0" for negative and "4" for positive. As Table 4 shows, P(neg|test) and P(pos|test) are the final sentiment values generated by the MNB classifier for the tested tweets in the negative and positive categories, respectively; the final polarity label is obtained by comparing the two values.

Figure 5 presents the accuracy of the five schemes with the balanced training dataset but biased tested data. As observed from Figure 5a,b, the proposed approach always produces the highest accuracy compared to the other four schemes regardless of the polarity of the tested documents. It is challenging to analyze biased tweets using a training dataset of limited size: since a specific training dataset has an inclination toward tested instances of a certain polarity, a classifier trained with it may not be robust enough. Note that the proposed fine-grained clustering strategy accurately defines the boundaries of the features and groups them into different clusters.
Also, the discriminative features are effectively aggregated considering their significance and distribution. As a result, the proposed scheme achieves high classification accuracy in processing biased tested data. Observe from Figure 5a that the DF scheme achieves the second-highest accuracy while TF shows the lowest, whereas they display the opposite performance for the documents of negative polarity, as shown in Figure 5b. This is because DF and TF weight the features based on presence and frequency, respectively. Thus, each scheme suits specific tested data, and its performance is significantly affected by the polarity of the tested data.

In order to validate the robustness and efficiency of the schemes, the experiments are also conducted with a tested dataset formed of documents of random polarity. This set of experiments is complementary to the second group, which evaluated biased tested tweets. Observe from Figure 6 that the proposed scheme consistently shows the highest performance in terms of all four metrics. An unreasonable assignment of weights to the features causes inaccurate decisions on the polarity. Since the proposed adaptive weighting strategy computes the weights considering the inter-category and intra-category relevance, dependency, and distribution of the features, it exhibits robust performance. This confirms that the proposed scheme is effective regardless of the size and polarity of the tested dataset.

Conclusions
In this paper a novel feature weighting scheme has been presented for the sentiment analysis of Twitter data. It comprehensively measures the significance of the features in terms of both inter-category and intra-category distribution. An analytical model characterizing the discriminability of the features was introduced to precisely measure how much discriminative information a feature possesses, and a modified Chi-square statistical model was proposed to measure the relative relevance of a feature to the corresponding category. A new fine-grained clustering strategy was adopted to properly group the features of similar inter-category distribution, which allows accurate text classification. Extensive experiments were conducted using benchmark workloads, demonstrating that the proposed scheme is consistently superior to the existing approaches in terms of the four metrics including accuracy.
Since the proposed scheme is based on the dependence of each feature on the corresponding category, the length of the sentence does not directly affect the performance of the proposed method. A more precise model of the relation between the measures will be investigated in the future. As each word has a different feature weight reflecting its importance degree for different categories, the polarity of a sentence containing two different opinions is determined by the weights through the MNB text classifier. A new approach including the NN technique will also be developed for handling different or opposite opinions. In the proposed scheme the emoticons are detected and removed using the MATLAB tokenization function. Since emoticons usually convey people's sentiments, a methodology effectively dealing with them will be included in future work.
The proposed scheme is currently run in the MATLAB environment, manually evaluating the labeled tweets. Its performance will be tested by processing real-time information from the Twitter platform via the Twitter API, and more advanced text preprocessing techniques will be incorporated to improve the quality of sentiment classification. Further enhancements include portability to other platforms such as mobile devices and incorporation of the Latent Dirichlet Allocation (LDA) technique to automatically identify the number of attributes of the corpus [57-60].