Sentiment-Target Word Pair Extraction Model Using Statistical Analysis of Sentence Structures

Abstract: Product information is propagated online via forums and social media. Much merchandise is recommended by expert-system methods and considered for purchase on the basis of online comments or product reviews. Studying people's opinions on products by extracting information from documents is referred to as sentiment analysis, and finding sentiment-target word pairs is an important research issue in sentiment mining. In Korean, where the predicate appears at the very end of the sentence, it is difficult to find the exact word pairs without first identifying the syntactic structure of the sentence. In this study, we propose a model that parses sentence structures and extracts sentiment-target word pairs from the parse tree using parsing and statistical methods. For this purpose, the model uses a sentiment word extractor and a target word extractor. Tested on 4000 movie reviews, the proposed model showed high performance in both accuracy, 93.25 (+14.45), and F1-score, 82.29 (+3.31), compared with other models. However, the recall rate (−0.35) needs improvement, and the computational cost must be reduced.


Introduction
Traditionally, some information has been spread orally, but recently it has also been propagated online via forums such as blogs, Twitter, and Facebook. Recent research on the consumption patterns of more than 2000 adults produced the following results: 81% of Internet users conduct their consumption activities on the Internet; 20% conduct consumption activities on the Internet daily; reviews written by opinion leaders influence the consumer activities of 73% of others; 80% of consumers prefer goods rated five stars to goods rated four stars; and 32% of the overall grades of merchandise are decided online via an expert-system method, while about 30% are determined by online comments or product reviews [1,2]. Studying other people's thoughts by extracting information from online documents is referred to as opinion mining, an operation that deciphers and extracts subjective information or opinions from the source material. Research on opinion mining is concerned with two important tasks: building dictionaries of words tagged with opinions and finding the target words to which those opinions refer. For example, the sentence "This cellphone's LCD is bright" can be analyzed through opinion mining as having the sentiment word "bright" and the target word "LCD". Building a dictionary of words tagged with sentiments permits the determination of whether the sentiment word "bright" is used in a positive or negative sense, and distinguishing that the word pointed to by "bright" is in fact "LCD" solves the problem of finding the target word. In opinion mining for target words, the predicate and the object that represent attributes and sentiments carry distinctly significant meanings. Since the predicate takes on different meanings depending on the attribute part of the sentence, it needs to be handled together with that attribute.
For instance, comparing the sentence "This cellphone's size is large" with the sentence "This car has a trunk that's large", the predicate "is large" can be read as negative in the former sentence but as positive in the latter. Here, "is large" has a dependency on the words "size" and "trunk". Therefore, in order to determine whether "is large" is positive or negative, the part on which it depends must be considered together with it.
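The point above can be illustrated with a minimal sketch (this is not the paper's model): if polarity is keyed only by the sentiment word, "large" cannot be disambiguated, so the lexicon must be keyed by the (target, sentiment) pair. All entries below are hypothetical examples.

```python
# Hypothetical pair-keyed polarity lexicon: the same sentiment word
# receives a different polarity depending on its target.
PAIR_POLARITY = {
    ("size", "large"): "negative",   # a large cellphone is undesirable
    ("trunk", "large"): "positive",  # a large car trunk is desirable
    ("LCD", "bright"): "positive",
}

def polarity(target: str, sentiment: str) -> str:
    """Look up polarity for a (target, sentiment) pair; 'unknown' if unseen."""
    return PAIR_POLARITY.get((target, sentiment), "unknown")

print(polarity("size", "large"))   # -> negative
print(polarity("trunk", "large"))  # -> positive
```

This is why extracting the pair, rather than the sentiment word alone, is treated as the core problem in the rest of the paper.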
Since the word order in the Korean language typically places the predicate in the last part of the sentence, a particular approach is needed in order to accurately find the target word to which the predicate points. To accurately find sentiment-target pairs, we propose in this article a model that reflects the characteristics of the syntactic structure of the Korean language. The proposed model finds, in structurally analyzed sentences, the words that may be sentiment words and the words that may be target words using statistical data.
The remainder of this paper is organized as follows. Section 2 outlines research related to sentiment-target pair extraction reflecting the characteristics of the syntactic structure of the Korean language and explains previously developed technologies. Section 3 explains the methodology of the proposed model. Section 4 describes the experimental methods and results. Section 5 presents the conclusions of this study.

Related Works
Sentiment analysis in the Korean language and sentiment analysis in the English language are clearly different. The first difference is word order. Differing word orders compound the difficulty of an opinion mining study, in which the parsing of the subject, predicate, and object of an arbitrary sentence must be performed. Plain English text typically follows a "Subject + Predicate + Object" structure, with the predicate in the center, the subject in front, and the object in the rear. Represented in phrase units, this structure can be expressed as "NP + VP + NP", where NP is the noun phrase and VP is the verb phrase. A noun phrase comprises two or more words and acts as a noun; a verb phrase comprises two or more words and serves as the verb. In such a sentence structure, the subject part and the object part can be separated effectively on the basis of the verb phrase, using only morphological analysis. In the case of the Korean language, however, the structure does not allow the subject part and the object part to be distinguished on the basis of the verb phrase: the typical word order is "Subject phrase + Object phrase + Verb phrase", which in phrase units is expressed as "NP + NP + VP". In a structure where noun phrases appear consecutively, it is difficult to distinguish the subject phrase from the object phrase with morphological analysis alone. This implies that morphological analysis of Korean may produce sequences of consecutive nouns, as shown in Equation (1).
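Equation (1) is referenced but not displayed; a presumed form consistent with the definitions that follow (N for noun, V for verb, X for the attribute-selection index) would be:

```latex
% Presumed form of Equation (1): a run of consecutive nouns preceding the
% verb, where deciding which noun N_X to treat as the attribute is the
% selection problem described in the text.
\underbrace{N_1 \; N_2 \; \cdots \; N_X \; \cdots \; N_n}_{\text{sequential nouns}} \; V \qquad (1)
```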
N is the noun, V is the verb, and X indexes the selection problem of which noun to use as the attribute. As Equation (1) shows, finding the attribute-sentiment pair in a sentence raises the problem of selecting "up to which part" of the sequential nouns to use. In typical prior sentiment mining studies [3,4], the target word is found by the PMI method or by applying rules after part-of-speech tagging [5], whether the target word is predetermined or not [6,7]. In order to accurately find the parts of a sentence that can be the target word and the sentiment word, we propose a statistical model that analyzes the sentence structure and effectively extracts the target-sentiment word pair from the analyzed structure. To find target words, B. Liu used the patterns in which commas, periods, semicolons, hyphens, "&", "and", "but", etc. appeared in review sentences summarized by users [8,9]. An example review sentence, with the pros and cons of the item, is shown in Table 1:

My SLR is on the shelf
by camerafun4, Aug 09 '04
Pros: Great photos, easy to use, very small
Cons: Battery usage; included memory is stingy
I had never used a digital camera prior to purchasing this Canon A70. I have . . . Read the full review

Table 2 shows how the review sentences were analyzed; as the table illustrates, the pros in Table 1 can be separated into three segments.
great photos -> 〈photo〉; easy to use -> 〈use〉; very small -> 〈small〉 ⇒ 〈size〉

Popescu and Etzioni attempted to find target words by using the Web-PMI method. The typical PMI method is given in Equation (2). P(w), calculated as in Equation (3), counts the number of documents containing the word w. When Equation (3) is substituted into Equation (2), the result is Equation (4), which is called the Web-PMI [10].
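Since the PMI equations are referenced rather than reproduced, the following sketch shows the standard formulation they describe, PMI(w1, w2) = log2(P(w1, w2) / (P(w1)·P(w2))), with probabilities estimated from document counts; Web-PMI replaces these counts with search-engine hit counts. The counts used here are illustrative, not from a real corpus.

```python
import math

def pmi(count_w1: int, count_w2: int, count_both: int, total_docs: int) -> float:
    """Pointwise mutual information from document counts:
    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    In Web-PMI, the counts are search-engine hit counts instead."""
    p_w1 = count_w1 / total_docs
    p_w2 = count_w2 / total_docs
    p_both = count_both / total_docs
    return math.log2(p_both / (p_w1 * p_w2))

# Illustrative counts: w1 in 100 docs, w2 in 50, both in 40, out of 1000.
# Co-occurrence far above chance yields a high positive PMI.
print(pmi(100, 50, 40, 1000))
```

A candidate w1 whose PMI with a known identifier w2 exceeds a threshold is then accepted as a target word, which is the selection criterion the surrounding text attributes to this family of methods.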
Here, w1 and w2 are words; w1 is used as a candidate element for identification and w2 is used as an identifier. By confirming the co-occurrence information between w1 and w2, an attempt was made to determine whether or not w1 was a target word. The sentence elements used as identifiers are patterns between structured morphemes and elements in WordNet [11,12]. Wen and Wan tried to extract target words by using label sequential rules [13]. Label sequential rules describe the combinations of sentence elements likely to be seen when a sentence is analyzed morphologically; the pertinent rules are applied to find the target words. Kang Han-hoon established a review-pattern database for product reviews and used it to extract product attribute-specific positives/negatives. By comparing the positive/negative information extracted from the data (i.e., monitors, laptops, digital cameras, and MP3 players), the accuracy rate was calculated; the approach was tested with a method that applied rules to morphologically analyzed sentences [6]. Yang Jung-Yeon obtained product feature words (e.g., battery) and product sentiment words (e.g., short) from product reviews by using data containing both reviews and product scores, via PMI, and determined whether the sentiment words appearing with the product features were positive or negative by using the review scores [14]. Long Jiang classified Twitter data according to sentiment words: nouns were extracted through the PMI method, and words scoring above a threshold were deemed one chunk of data. In addition, to overcome the analysis difficulties caused by frequent short sentences, the Tweets deemed to have been posted by one person were considered together in the form of a graph. The words processed in this way were classified into positive, negative, or neutral sentiments [7].
All of these models are subject to issues with either the PMI method or a rule-based method applied after morphological analysis: the PMI method struggles with words of very high or very low frequency, while rule-based methods fail on sentences that fall outside the catalogued rules [15,16]. These issues stem from relying on simple probability information before the syntactic structure of the sentence has been identified, or from trying to express everything with manually written rules that resolve the syntactic structure. In this paper, to solve such problems, we use a method that finds target words by identifying the sentence structure [17].

Materials and Methods
The model extracts the sentiment-target word pairs that appear in a sentence by using parsing and statistical methods. It comprises two parts: one that parses the sentences in the input documents and one that extracts word pairs. The sentence-structure parsing part consists of a morphological analyzer, a part-of-speech tagger, and a syntactic structure analyzer; the word-pair extraction part consists of a sentiment word extractor and a target word extractor.

Sentence Structure Analysis
The sentence structure analysis part of the proposed model comprises a part-of-speech tagger and a parser. First, the part-of-speech tagging model uses a general probability model, as in Equation (5), where T is the part-of-speech tagging function of W, M represents the morpheme candidates, T the part-of-speech candidates, and W the words of the sentence. Using this model, ambiguous words are assigned the proper parts of speech [18,19]. For example, in the sentence "The sailor dogs the barmaid", the word "dogs" is not used in its frequent noun form; the model determines that it is a verb and attaches the appropriate part of speech [20-22]. Like the part-of-speech tagger, the parser also commonly uses a Probabilistic Context-Free Grammar (PCFG) model [23-25], which can be expressed as shown in Equation (6). T_best is a function that selects the syntactic structure with the highest generation probability from the candidate syntax trees, T represents the words that comprise the parse tree, G is the grammar rules, t is the sentence, rule_i is the i-th grammar rule in the parse tree, and h_i is the history of the appearance of the i-th grammar rule.
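As Equations (5) and (6) are referenced but not shown, the standard formulations consistent with the definitions above would be the usual HMM-style tagging and PCFG equations (these are the conventional forms, not necessarily the paper's exact notation):

```latex
% Eq. (5), standard probabilistic tagging: choose the tag sequence T that
% maximizes P(T | W), decomposed over morpheme/tag candidates (m_i, t_i).
T(W) = \arg\max_{T} P(T \mid W)
     \approx \arg\max_{T} \prod_{i} P(w_i \mid m_i, t_i)\, P(t_i \mid t_{i-1})

% Eq. (6), PCFG: the best tree is the one whose rules have the highest
% joint generation probability given the grammar G and the sentence t.
T_{best}(t) = \arg\max_{T} P(T \mid G, t)
            = \arg\max_{T} \prod_{i} P(rule_i \mid h_i)
```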

Extraction of Word Pairs
For input sentences given as a parse tree [26], the sentiment word and the target word are extracted in two processes. The sentiment word is extracted by finding the verb or adjective belonging to the top-level verb-phrase node of the tree produced at the sentence structure analysis stage; the target word is extracted by finding the noun of the noun phrase that is most dependent on the found sentiment word. This can be represented as Equation (7), where S represents the sentiment words, A the target words, and T the parse tree. The extraction of the word pairs then amounts to finding the S and A with the highest probability values among the sentiment words and target words seen in the phrase-analyzed parse tree T. Equation (7) can be expressed as Equation (8), in which each of the needed elements is calculated using Equations (9) and (10):

P(A | s_i, T) = argmax_i P(ds_i | s_i, w^d_{1,n}, w^co_{1,n}, p^co_{1,n}).
Here, w^d is the distance information from the selected sentiment word; w^co is the probability information for words that can appear together with the sentiment word; and p^co is the probability information for the parts of speech that can appear together with the sentiment word. ds_i is the dependency strength calculated from w^d, w^co, and p^co [27,28]. The detailed calculation process of Equation (7) is as follows.
Extraction rules from the parse tree:

• Step 1: Select the node with the highest dependence in the parse tree (generally, the root node).
• Step 2: Register as target word candidates the two nodes closest to the root node.
• Step 3: For the two candidates, calculate the distance information from the selected node word, the word co-occurrence information, and the part-of-speech co-occurrence information.
• Step 4: Select the candidate with the higher calculated dependency strength.
• Step 5: Extract the selected two nodes as the sentiment word and the target word.
• Step 6: If the parse tree can be reduced to a sub-tree, reduce it and repeat the above steps.
The distance between words is measured by how close the root-node word and the candidate words are in the input sentence. The distances between the selected root node and the candidate words w1 and w2 are calculated, and by interchanging the calculated values, the closeness of each word to the root node is quantified. For example, if the first candidate word is one word away from the root node and the second candidate word is three words away, then the first candidate receives a value of 3 and the second a value of 1. The corpus-dependent pattern comprises the word co-occurrence frequency and the POS co-occurrence frequency [29]. Word co-occurrence frequency measures how often the root-node word and a candidate word appear together, whereas POS co-occurrence frequency measures how often the POS of the root-node word and the POS of a candidate word appear together. Finally, the measurement is made via "Dependency Strength = (word position) × (word co-occurrence) × (POS co-occurrence)", and by this measured value, the target word is selected from the candidates. Table 3 shows the extraction of the sentiment word and the target word.

Table 3. Extraction of the sentiment-target word.
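The dependency-strength computation described above can be sketched as follows. The candidate statistics are the illustrative figures used in the worked "Mina"/"pretty" example later in this section, not values from a real corpus:

```python
def dependency_strength(distance_value: int, word_cooc: int, pos_cooc: int) -> int:
    """Dependency Strength = (word position) x (word co-occurrence) x (POS co-occurrence).
    distance_value is the interchanged distance: the closer a candidate
    is to the root node, the larger this value becomes."""
    return distance_value * word_cooc * pos_cooc

def select_target(candidates: dict) -> str:
    """Step 4: pick the candidate with the highest dependency strength."""
    return max(candidates, key=lambda w: dependency_strength(*candidates[w]))

# Illustrative statistics: (interchanged distance, word co-occurrence,
# POS co-occurrence) for each target-word candidate.
candidates = {
    "Mina":   (3, 2000, 100_000),  # 3 * 2000 * 100000 = 600,000,000
    "pretty": (4, 4000, 5_000),    # 4 * 4000 * 5000  =  80,000,000
}
print(select_target(candidates))  # -> Mina
```

Even though "pretty" is closer to the root node, "Mina" wins on the corpus statistics, which is the behavior the worked example below relies on.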

Korean        POS gloss (English)
Mina-ga       Mina: personal pronoun; -ga: subjective case marker
Ye-peun       pretty: adjective
In-hyeong-ui  In-hyeong: doll; -ui: noun-modifier (genitive) marker
Jip-eul       Jip: house; -eul: objective case marker
Sat-da        Sat: buy; -da: finishing final ending

Figure 1 shows the parsing results of the input sentence: a parse tree expressing the example sentence with verb-phrase (VP) and noun-phrase (NP) nodes.
The target-sentiment word pairs that can be extracted from Figure 1 are shown in Table 4. The node with the highest dependency in the parse tree in Figure 1 is the node for the word "buy". The words "Mina" and "pretty", located below the selected node, are chosen as candidates for the target word. The dependency strengths of the selected candidate words are calculated as follows:

• Dependency strength ("Mina") = 3 × 2000 × 100,000;
• Dependency strength ("pretty") = 4 × 4000 × 5000.
For the distance calculation, "Mina" receives a value of 4 and "pretty" a value of 3; by interchanging these distance values, the closeness of each word to the selected root node is measured. The word co-occurrence frequency is calculated from statistics in the corpus: the pronoun "Mina" occurs relatively rarely compared with the word "pretty", which is used as an adjective. The POS co-occurrence frequency likewise uses corpus statistics, and noun-verb combinations appear far more often than verb-verb combinations. Through these calculations, the word "bought" is extracted as the sentiment word and the word "Mina" as the target word. Applying the same method to the sub-tree in which "pretty" is the root node yields the word pair "pretty" and "house".

Evaluation Metric
To evaluate the effectiveness of our proposed method, we use the F1-score. The F1-score is the harmonic mean of precision (P) and recall (R), a standard indicator that compensates for the shortcomings of using accuracy alone. It is calculated by Equation (11), where E_p represents the set of predicted answers, E_r denotes the ground-truth answer set, and C = E_p ∩ E_r is the set of correct answers:

Experimental Results
The data used in the experiment consisted of 4000 movie reviews written in Korean. For comparison with other studies, which were tested on English, functional words that appear in the Korean language were removed before the comparative experiments were conducted. Performance was compared by measuring accuracy and recall rates.
To conduct a comparative experiment on the same data, the method proposed in Long Jiang's model was implemented [30]. Table 5 shows the accuracy and recall rates of Long Jiang's model and of the model proposed in this study. Significant improvements were seen in the proposed model compared with the existing model, while the recall rates were about the same. The large increase in accuracy over the existing model appears attributable to the fact that the analysis takes the structure of the sentence into consideration. However, due to the large number of calculations, the execution speed falls slightly below that of the existing models. Examples of specific misanalyses of the actual data are shown in Table 6; in the analysis results, the part in front of the symbol "-" is the property and the part behind it is the sentiment. The first misanalysis is a case in which only the most representative pair in the sentence should be found, but the system extracts other properties and sentiments in addition to that pair. The second misanalysis is a case in which no results are extracted because the noun in the verb part shows no dependency relationship in the syntactic analysis results, where the verb phrase comes first and the noun follows.

Conclusions
Since the word order in the Korean language typically places the predicate in the last part of the sentence, a particular approach is needed to accurately find the target word to which the predicate points. To accurately find sentiment-target pairs, we proposed in this article a model that reflects the characteristics of the syntactic structure of the Korean language. The proposed model found, in structurally analyzed sentences, the words that may be sentiment words and the words that may be target words using statistical data. On the test set, the experimental results show an accuracy of 93.25 (+14.45) and an F1-score of 82.29 (+3.31) compared with the existing model.
However, due to the large number of calculations, the parts of the model that tax computational resources need to be improved. In addition, the recall rate was not improved over that achieved by other models. These two shortcomings suggest a need for further study. We also chose corpora with very different structural styles (such as the Korean language) for the experimental setting, and more extensive evaluations will be required to confirm that the presented results apply across domains. We are therefore conducting research to demonstrate generalized processing by adding linguistic rules that consider Korean characteristics and by using deep learning technology.
Author Contributions: Conceptualization, formal analysis, methodology, software, visualization, writing-original draft preparation, and writing-review and editing, J.J.; investigation, data curation, resources, and writing-review and editing, G.K.; validation, supervision, project administration, and funding acquisition, K.P. All authors have read and agreed to the published version of the manuscript.