1. Introduction
The growing use of social media on the web and on mobile devices has reshaped business, marketing, and other day-to-day activities. A huge volume of data is posted on the web every moment and is used in a wide range of applications. In particular, social media data is used by customers, businesses, manufacturers, and administrative bodies for decision-making and e-governance. Thus, opinion mining from customer reviews has become an active research topic. Data science techniques and data-driven predictive models are commonly employed to automate this process [1]. Opinion extraction can be carried out in two ways: one is to obtain an overall summary of the sentiment about a product, and the other is to produce a feature-based opinion summary of the product, referred to as fine-grained or aspect-based opinion analysis. Fine-grained analysis is the more accurate and popular approach, as users can obtain opinions about every aspect of the product [2]. Fine-grained opinion mining requires the identification of opinion targets and opinion words. The primary focus of the current work is to improve the recognition of opinion targets, opinion words, and the relationships between the two in unstructured text.
To demonstrate the efficiency of the proposed model, we tested it on real datasets of online reviews from different domains and compared it with state-of-the-art opinion mining (OM) techniques. The experimental results confirm that the proposed approach improves performance over existing methods.
The paper is organized as follows:
Section 2 details the research questions that surround aspect-based opinion analysis,
Section 3 presents related work,
Section 4 describes the methodology,
Section 5 presents the results, and
Section 6 concludes the paper.
2. Domain Independent Opinion Target Identification
Most commonly, the opinion target is defined as the expression of the product about which an opinion is stated, and it normally consists of nouns or noun phrases. However, not all nouns and noun phrases are opinion targets. A single sentence may carry multiple nouns, of which some are opinion targets and some are not. Similarly, a single sentence can carry multiple opinion targets with different opinions. Hence, the identification of opinion targets is an important element of the opinion mining problem. The second subtask of opinion mining is the identification of opinion words or expressions, i.e., the words that express an opinion. It is also important to detect the relation between aspect and opinion, which links opinion words to opinion targets.
Existing approaches employ a co-occurrence-based mechanism, which assumes that opinion words and opinion targets normally co-occur within a certain distance in a sentence [3]. Various methods have been adopted to extract the two opinion components. A recently reported technique is a word alignment model that employs the co-ranking of opinion targets and opinion words [4].
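The co-occurrence idea can be illustrated with a minimal sketch. It assumes that the sentence is already POS-tagged with Penn Treebank style tags; the function name `cooccurring_pairs` and the window size are hypothetical choices made for illustration and are not taken from the cited methods.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, POS tag)


def cooccurring_pairs(tokens: List[Tagged], window: int = 3) -> List[Tuple[str, str]]:
    """Pair each adjective with every noun at most `window` tokens away."""
    pairs = []
    for i, (word_i, tag_i) in enumerate(tokens):
        if not tag_i.startswith("JJ"):
            continue  # only adjectives are treated as opinion words here
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            word_j, tag_j = tokens[j]
            if tag_j.startswith("NN"):
                pairs.append((word_j, word_i))  # (candidate target, opinion word)
    return pairs


sentence = [("apex", "NN"), ("is", "VBZ"), ("the", "DT"),
            ("best", "JJS"), ("DVD", "NN"), ("player", "NN")]
print(cooccurring_pairs(sentence, window=3))
print(cooccurring_pairs(sentence, window=1))
```

With a window of three tokens, every noun in the sentence is paired with "best"; with a window of one, only "DVD" is. Distance alone therefore cannot decide which noun is the intended target.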
To further explain the opinion mining (OM) problem, consider a review of a DVD player taken from an Amazon dataset (see Figure 1).
In the first sentence, the reviewer writes that apex is the best DVD player. In this sentence, the evaluative expression is the phrase “apex is the best DVD player”, which contains the target as “apex” and opinion as “best”.
The next sentence is more complex. It states that other DVD players have the problems that they get stuck and “suck”. In other words, the reviewer implies that the apex does not have these problems. In this sentence, the phrase “there are other well-branded DVD players, but they keep complaining that some time it stuck and sucks” is an evaluative expression, with the target “other branded DVD player” and the opinion “complaining of stuck and sucks”. The sentence implicitly compares the features of other players with those of the apex player. Hence, the explicit target is the DVD players other than apex, and the implicit target is the apex DVD player. Further, the sentence states that most of the reviewer’s friends have “switched to apex player and are happy”. This phrase is again an evaluative expression, with the target “apex player” and the opinion “happy”.
The other two sentences, “If you are skeptical about trying apex take my words and give it a try. Hopefully you won’t regret”, state that if you are doubtful about trying apex, you should accept the reviewer’s assurance that you will not be unhappy with it. Here, again, the target is “apex DVD player” and the opinion is “will not regret”.
The sentence “I bought this to replace an expensive ($300+) Onkyo DVD player that quit after only 3 years” has again both an implicit and explicit nature. Implicitly, it shows a good opinion about the apex DVD player and explicitly, it shows that the Onkyo DVD player quit after only three years. It again compares apex with the Onkyo DVD player.
These examples show that extracting the relation between opinion terms and features is not simple. The OM process needs to consider the semantics and context of the sentences. Although there have been many approaches to opinion target identification, we have observed that some limitations remain, as explained below.
Adjectives and nouns have been widely exploited as opinions and targets, respectively, while the relationship between them is detected by collocation [5,6] or syntactic patterns [7,8] in a limited window. However, a sentence may have several nouns and adjectives, and there can be long-span modification relations and diverse evaluative expressions [4,9]. Similarly, reviewers may express opinions about different products or topics in a single review. Hence, relying only on individual nouns or adjectives and their proximity may not yield reliable results. To address this problem, several methods have exploited lexico-sequential relations between opinion words and targets, which can also be described as part-of-speech dependency relations. To this effect, several regular expressions have been designed to identify mutual relationships between targets and opinions [6]. However, online reviews are usually written in an informal manner and contain typographical, punctuation, and grammatical errors. Hence, machine learning tools trained on standard text corpora may produce deficient results.
Another important issue in iterative procedures is error propagation: an erroneous extraction in one iteration is difficult to eliminate in subsequent iterations. For example, selecting all base noun phrases (BNPs) and adjectives may introduce numerous irrelevant entities and lead to the extraction of irrelevant opinion targets. Therefore, it is better to filter out redundant features in the first step.
Furthermore, supervised learning algorithms are domain dependent, and it is difficult to prepare training sets for multi-domain operation.
To address these challenges, we propose the best-fit examples (BFE) algorithm. BFE extracts the relation between opinion targets and opinion words based on grammatical and contextual information. Furthermore, BFE considers only those sentences that have no grammatical errors. It provides a list of seed target features that are then used to collectively mine opinion targets and opinion words from a candidate list. Since BFE extracts only exactly relevant targets in the first phase, error propagation is avoided. BFE is also domain independent.
3. Related Work
OM has been an active area of research, and significant efforts have been made [1,2,5,6,8]. As a result, the research community has contributed a number of lexical dictionaries, corpora, and tools for opinion mining. The rise of social media on the internet has attracted further attention to this field due to the availability of a huge collection of online and real-time data. Machine learning algorithms are employed for opinion mining in two different modes, i.e., the extraction of opinions from sentences and from documents. The sentence mode extracts opinion targets and opinions at the sentence level, while the document mode considers opinion targets and opinion words at the level of the full document text.
Earlier works concentrated on the document-level extraction of opinion words and opinion targets in pipelines [10,11,12,13], without regarding the relation between them, while recent works report sentence-level techniques that are based on the relationship between opinion words and opinion targets [1,2].
In this work, we focus on the sentence-level, domain-independent extraction of opinion words and opinion targets based on the grammatical structure of the sentences. The contextual and semantic relations among WordNet synsets provide strong clues for the identification of opinions and targets [14]. A number of unsupervised learning approaches have exploited syntactic and contextual patterns [12,13,15,16,17,18,19,20]. We use syntactic patterns for the identification of relationships between opinion targets and opinion words.
We employ an opinion lexicon for the detection of opinion words, while opinion targets are extracted through their syntactic relations with opinion words. A number of studies have exploited basic opinion words to identify opinion aspects. Some common words, for example, good, well, right, bad, badly, awful, like, dislike, jolly, beautiful, pretty, wonderful, fantastic, impressive, amazing, and excellent, have been used as seed words to generate an opinion lexicon for opinion detection through machine learning techniques [14,21,22,23,24,25,26,27,28]. Kobayashi et al. [19] presented an unsupervised approach for the extraction of evaluative expressions that contain target and opinion pairs. Popescu and Etzioni [29] introduced a machine learning tool referred to as OPINE. This system is built on the unsupervised machine learning paradigm and extracts the most relevant product features from the given reviews. OPINE applies syntactic patterns and the semantic preferences of words to recognize phrases containing opinion words and their polarity. Carenini, Ng et al. [25] presented an improved unsupervised approach to feature extraction that exploits a taxonomy of the product features. Holzinger et al. [30] proposed domain ontologies that build on higher-level knowledge for the identification of product features. L. Zhuang, Jing, & Zhu [31] proposed a multi-knowledge-based approach exploiting WordNet (a lexical dictionary), syntax rules, and a keyword list for mining feature-opinion pairs. Bloom et al. [32] effectively used adjectival assessment aspects for target recognition. Ben-David et al. [33] presented a structural correspondence learning (SCL) algorithm for domain classification to predict new domain features based on training domain features. Ferreira, Jakob et al. [13] presented a modified Log Likelihood Ratio Test (LRT) for opinion target identification. Blitzer et al. [34] presented a novel structural correspondence learning algorithm, which takes advantage of unlabeled data from the target domain to extract relevant features that can reduce the difference between domains. Kessler, Eckert et al. [35] presented an annotated corpus with different linguistic features, such as mentions, co-reference, meronymy, sentiment expressions, and modifiers of evaluative expressions, i.e., neutralizers, negators, and intensifiers. The authors of [15] presented a semi-supervised approach to cluster features through a bootstrap process using the lexical characteristics of terms. Lin & Chao [36] presented a feature-based opinion mining model specifically for hotel reviews. Their model is based on supervised learning, and its main objective is to train a classifier for tourism-related opinion mining. Goujon [10], Qu et al. [11], and Nigam [37] exploited linguistic patterns and bag-of-words models for target identification and sentiment extraction.
4. Proposed Framework
Although there are many OM techniques, only a few recent methods are closely related to our proposed mechanism [1,2,33]. Ben-David et al. [33] presented a structural correspondence learning (SCL) algorithm for domain classification to predict new domain features based on training domain features. However, their approach needs prior knowledge of the domain, whereas our approach exploits syntax rules for the identification of BFE. The authors of [1,2] presented an automatic extraction technique for opinion words and opinion targets based on word alignment relations. Their technique uses semi-supervised topic modeling for generalized cross-domain opinion target extraction. In contrast, our proposed model is fully heuristic and does not need well-written expert training examples. Similar to their technique, our proposed model aligns the best-fit example patterns to sentences for the prediction of relations between opinion targets and opinion words. Furthermore, we follow existing work [14,21,22,23,24] in exploiting an opinion lexicon and the nearest neighbor rule for candidate selection from the sentences where the proposed BFE rules are not applicable.
This section describes the steps of the proposed method for identifying opinion targets, opinion words, and the relations between them in sentence-based opinion extraction.
Overview of the Proposed Framework
As explained above, we have to extract opinion targets, opinion words, and the target-opinion relations from each sentence of the document. Existing work has commonly adopted nouns/noun phrases as opinion targets and adjectives as opinion words, while diverse approaches have been adopted for relation extraction, as explained in related work. The most crucial and challenging issue is relation identification. In this work, we consider nouns/noun phrases for target identification and adjectives for opinion words. For relation identification, we consider three key factors that can be employed to predict target-opinion relations: contextual dependency, location information, and the opinion lexicon. Considering all three factors gives better results, as shown through experiments. The proposed framework employs the following heuristic for the prediction of the three opinion components.
If an adjective in a sentence is likely to be an opinion word, then a noun/noun phrase modified by the adjective is likely to be an opinion target, and vice versa. As explained in related work [10,11,12,13], adjectives are used for opinions and nouns for opinion targets in opinionated sentences. Our proposed approach therefore considers adjectives and noun phrases for this task.
Therefore, our approach first detects opinion words in sentences and then searches for opinion targets, as explained in Algorithm 1.
Algorithm 1. Extract BFE and Candidate Opinion-Target.
Input: Reviews
Output: BFE-LIST, COT-LIST
Begin
    Initialize: OP, BFE, BFE-LIST, COT-LIST
    For Each Review Ri
        For Each Sentence Si in Ri
            Find Opinion Word: OP in Si
            If Si Contains OP then
                Find BFE Pattern
                If BFE Exists then
                    Add BFE to BFE-LIST
                else
                    Search Candidate Pattern: COT
                    If found COT then
                        Add COT to COT-LIST
                    End if
                End if
            End if
        Next For
    Next For
End
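The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration only: the pattern matchers `toy_bfe` and `toy_cot` are hypothetical stand-ins for the BFE and candidate patterns, and the opinion lexicon is a toy set.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (opinion target, opinion word)
Matcher = Callable[[str, str], Optional[Pair]]


def tokenize(sentence: str) -> List[str]:
    return [w.strip(".,!?").lower() for w in sentence.split()]


def extract_bfe_and_cot(reviews: Iterable[List[str]], lexicon: Set[str],
                        find_bfe: Matcher, find_cot: Matcher):
    bfe_list, cot_list = [], []
    for review in reviews:                      # For Each Review Ri
        for sentence in review:                 # For Each Sentence Si in Ri
            op = next((w for w in tokenize(sentence) if w in lexicon), None)
            if op is None:
                continue                        # Si contains no opinion word
            bfe = find_bfe(sentence, op)
            if bfe is not None:
                bfe_list.append(bfe)            # Add BFE to BFE-LIST
            else:
                cot = find_cot(sentence, op)
                if cot is not None:
                    cot_list.append(cot)        # Add COT to COT-LIST
    return bfe_list, cot_list


def toy_bfe(sentence: str, op: str) -> Optional[Pair]:
    # Hypothetical best-fit pattern: "<target> is the <opinion> ..."
    words = tokenize(sentence)
    if len(words) >= 4 and words[1:3] == ["is", "the"] and words[3] == op:
        return (words[0], op)
    return None


def toy_cot(sentence: str, op: str) -> Optional[Pair]:
    # Hypothetical candidate pattern: the word right after the opinion word.
    words = tokenize(sentence)
    i = words.index(op)
    return (words[i + 1], op) if i + 1 < len(words) else None


reviews = [["Apex is the best DVD player.",
            "It has good color settings.",
            "I bought it in May."]]
print(extract_bfe_and_cot(reviews, {"best", "good"}, toy_bfe, toy_cot))
```

Passing the matchers as arguments keeps the loop independent of any particular pattern set, so the same skeleton could be reused with real syntactic patterns.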
The steps of the proposed framework’s process are given below.
Pre-Processing
The initial step is to pre-process the input reviews for subsequent processing. This step comprises noise removal, sentence breaking, and part-of-speech (POS) tagging. In POS tagging, a single syntactic category (POS tag) is assigned to each word of a sentence, which is required for pattern generation, e.g., the extraction of noun phrases and subjective expressions. Noise removal discards incomplete sentences and unidentified words.
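The pre-processing step can be sketched as follows. This is a simplified illustration under stated assumptions: a sentence is treated as incomplete if it lacks terminal punctuation or has fewer than `min_tokens` known words, and POS tagging is replaced by a lookup in a toy dictionary `tag_lexicon`; a real system would use a trained tagger instead.

```python
import re
from typing import Dict, List, Tuple


def split_sentences(review: str) -> List[str]:
    # Break after '.', '!' or '?' when followed by whitespace.
    return [p for p in re.split(r"(?<=[.!?])\s+", review.strip()) if p]


def preprocess(review: str, tag_lexicon: Dict[str, str],
               min_tokens: int = 3) -> List[List[Tuple[str, str]]]:
    """Return the kept sentences as lists of (word, tag) pairs."""
    tagged = []
    for sentence in split_sentences(review):
        if not sentence.endswith((".", "!", "?")):
            continue  # assumed incomplete: no terminal punctuation
        tokens = re.findall(r"[A-Za-z']+", sentence.lower())
        # Words missing from the lexicon are treated as unidentified and dropped.
        known = [(t, tag_lexicon[t]) for t in tokens if t in tag_lexicon]
        if len(known) >= min_tokens:
            tagged.append(known)
    return tagged


TOY_TAGS = {"the": "DT", "picture": "NN", "is": "VBZ", "great": "JJ",
            "pic": "NN", "works": "VBZ", "well": "RB"}
print(preprocess("The picture is great. Gr8 pic!! works well", TOY_TAGS))
```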
Extract Candidate Targets
In this step, we extract candidate target features and opinion words, based on location and semantic context, from the sentences not covered by BFE. As discussed in related work, noun phrases and adjectives have been widely used for target and opinion extraction. However, simply selecting all BNPs and adjectives does not yield reliable results and has a high false positive rate.
Hence, it is important to select BNPs with semantic and contextual dependency. In this regard, various studies have presented different sequences or patterns to identify the target and opinion. We derived a combination of dependency patterns (cBNP) from [38] and employed a random walk algorithm to extract candidate opinion-targets.
The proposed patterns that are employed for candidate extraction have the following description:
Entity-to-Entity and Entity-to-Feature are interlinked through the preposition “of” or “in”. For example, in the sentence “physical appearance of the apex compared to one previous (ad1100w) is most appreciated.”, the opinion target is the “physical appearance of the Apex”.
This pattern considers noun phrases with subjective adjectives, for example, “good/JJ color/NN setting/NN, funny/JJ pictures/NN”.
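The two patterns above can be approximated with regular expressions over a POS-tagged string in `word/TAG` form. The following is a minimal sketch, assuming Penn Treebank style tags; the expressions `ADJ_NP` and `NP_PREP_NP` and the helper names are illustrative and are not the exact patterns used in this work.

```python
import re
from typing import List, Tuple

# Adjective followed by one or more nouns, e.g. "good/JJ color/NN setting/NN".
ADJ_NP = re.compile(r"(\w+)/JJ[RS]?((?:\s+\w+/NN[PS]*)+)")
# Noun phrase linked to another noun phrase by "of" or "in".
NP_PREP_NP = re.compile(
    r"((?:\w+/NN[PS]*\s+)+)(?:of|in)/IN\s+(?:\w+/DT\s+)?((?:\w+/NN[PS]*\s*)+)"
)


def strip_tags(chunk: str) -> str:
    return " ".join(tok.split("/")[0] for tok in chunk.split())


def adjective_noun_pairs(tagged: str) -> List[Tuple[str, str]]:
    """Return (noun phrase, adjective) pairs."""
    return [(strip_tags(m.group(2)), m.group(1)) for m in ADJ_NP.finditer(tagged)]


def feature_entity_pairs(tagged: str) -> List[Tuple[str, str]]:
    """Return (feature, entity) pairs linked by 'of' or 'in'."""
    return [(strip_tags(m.group(1)), strip_tags(m.group(2)))
            for m in NP_PREP_NP.finditer(tagged)]


print(adjective_noun_pairs("good/JJ color/NN setting/NN, funny/JJ pictures/NN"))
print(feature_entity_pairs("physical/JJ appearance/NN of/IN the/DT apex/NN"))
```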
We experimentally evaluated the patterns, as shown in the scatter plot in Figure 4 and in Table 1, where the specificity and sensitivity of the patterns are presented in terms of precision and recall. The results show that the precision and recall of cBNP are better balanced than those of the other patterns.
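For reference, precision and recall for a set of extracted targets can be computed as in the sketch below. The term sets are hypothetical and are not taken from Table 1; the F1 score is included only as one common way to summarize how balanced the two measures are.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    tp = len(extracted & gold)  # correctly extracted targets
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical extraction result and gold annotation.
extracted = {"picture", "remote", "amazon", "price"}
gold = {"picture", "remote", "price", "sound", "menu"}
print(precision_recall_f1(extracted, gold))
```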
Targets Features Pruning
In order to remove impurities from the target features, we propose two pruning steps: BFE pruning and CF pruning. BFE pruning refines the target features extracted through the best-fit examples algorithm, while CF pruning refines the candidate features (CF). Both pruning steps depend on semantic relevance scoring, which is calculated using WordNet-based semantic similarity. The WordNet dictionary offers conceptual-semantic and lexical relations. We use the IS-A relation for semantic similarity in this work for the following reason. Since this relation does not cross part-of-speech boundaries, the similarity measures are limited to making judgments between noun pairs (e.g., cat and dog) and verb pairs (e.g., run and walk) [
39]. As we only consider noun phrases as opinion targets, the IS-A relation is appropriate. Hence, for pruning, we use the WordNet-based path length semantic similarity, as given in Equation (1):
$$Sim(t_1, t_2) = \frac{1}{len(t_1, t_2)} \quad (1)$$
where $Sim(t_1, t_2)$ is the semantic similarity, $len(t_1, t_2)$ is the path length from $t_1$ to $t_2$, and $t_1$ and $t_2$ are terms.
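A path length similarity over an IS-A hierarchy can be sketched as follows. The sketch assumes the common inverse path length form, in which the similarity is one divided by the number of nodes on the shortest IS-A path between the two terms. The hierarchy `HYPERNYM` is a hypothetical toy taxonomy and is not taken from WordNet.

```python
from typing import Dict, List, Optional

# Toy IS-A hierarchy (child -> parent); hypothetical, not actual WordNet data.
HYPERNYM: Dict[str, Optional[str]] = {
    "entity": None,
    "device": "entity",
    "player": "device",
    "remote": "device",
    "service": "entity",
    "delivery": "service",
}


def ancestors(term: str) -> List[str]:
    """The term itself followed by its hypernyms up to the root."""
    chain = []
    current: Optional[str] = term
    while current is not None:
        chain.append(current)
        current = HYPERNYM[current]
    return chain


def path_length(t1: str, t2: str) -> int:
    """Number of nodes on the shortest IS-A path between t1 and t2."""
    depth2 = {node: i for i, node in enumerate(ancestors(t2))}
    for i, node in enumerate(ancestors(t1)):
        if node in depth2:
            return i + depth2[node] + 1  # edges via the common ancestor, plus one
    raise ValueError("terms share no common ancestor")


def path_similarity(t1: str, t2: str) -> float:
    return 1.0 / path_length(t1, t2)


print(path_similarity("player", "remote"))    # siblings under "device"
print(path_similarity("player", "delivery"))  # related only through the root
```

On the real WordNet graph, a comparable score is available in NLTK through the `path_similarity` method of synsets.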
BFE Pruning
Although BFE provides compact results, impurities may still occur because a reviewer may discuss different topics. For example, reviews on the Amazon website sometimes discuss both product features and the services of Amazon, whereas we are concerned only with product features. Thus, to obtain exactly the targets and opinions about the product, the BFE results need to be refined before relevance scoring is applied to the candidate features, so that error propagation is avoided. In this pruning step, we divide targets into groups based on WordNet-based path length similarity. A target group is generated by finding semantically similar features whose similarity score is greater than a threshold value. Mathematically, the group function is presented in Equation (2).
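$$G(bt_i) = \bigcup_{j} \{\, bt_j \mid Sim(bt_i, bt_j) > \alpha \,\} \quad (2)$$
where $bt_i$ is a base term, $\bigcup$ represents the union of base terms, $Sim(bt_i, bt_j)$ represents the similarity between base terms $bt_i$ and $bt_j$, and $\alpha$ is the threshold value.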
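The grouping step can be sketched as a simple threshold-based clustering. This is an illustrative sketch only: the similarity scores in `SIM` and the threshold are hypothetical, and the greedy assignment of each term to the first matching group is one of several possible ways to realize the group function.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical pairwise similarity scores; unlisted pairs default to 0.1.
SIM: Dict[FrozenSet[str], float] = {
    frozenset(("picture", "image")): 0.5,
    frozenset(("picture", "screen")): 0.33,
    frozenset(("image", "screen")): 0.33,
    frozenset(("delivery", "shipping")): 0.5,
}


def toy_sim(a: str, b: str) -> float:
    return 1.0 if a == b else SIM.get(frozenset((a, b)), 0.1)


def group_targets(terms: List[str], sim: Callable[[str, str], float],
                  threshold: float) -> List[List[str]]:
    """Put each term into the first group containing a sufficiently similar member."""
    groups: List[List[str]] = []
    for term in terms:
        for group in groups:
            if any(sim(term, member) > threshold for member in group):
                group.append(term)
                break
        else:
            groups.append([term])  # no similar group found: start a new one
    return groups


terms = ["picture", "image", "delivery", "shipping", "screen"]
print(group_targets(terms, toy_sim, threshold=0.3))
```

In this toy run, the product-related terms and the service-related terms end up in separate groups, which is the separation that BFE pruning relies on.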
CF Pruning
Candidate features are greater in number and contain impurities. In free text, reviewers are not bound to a single topic and can touch upon different topics in a single review. However, our proposed patterns for candidate feature selection only predict the subjectivity of the phrase or sentence. Hence, the candidate features algorithm extracts all those BNPs that are modified by adjectives, which increases the chance of including irrelevant target features. In order to remove the irrelevant features, we propose CF pruning. This pruning step employs the base terms (BT) obtained in the previous step. Thus, in this step, we retain all the features from CF that have a semantic relation with the BFE targets.
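$$CT = \bigcup_{k} \{\, ct_k \mid Sim(bt_i, ct_k) > \alpha \,\}$$
where $ct_k$ represents candidate terms, $\bigcup$ is the union of candidate terms, $Sim(bt_i, ct_k)$ is the similarity between the base term and candidate term, and $\alpha$ is a threshold value.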
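CF pruning can be sketched as a filter that keeps a candidate term only if it is sufficiently similar to at least one base term. The similarity scores and the threshold below are hypothetical and serve only to illustrate the filtering logic.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical similarity scores between base and candidate terms.
SCORES: Dict[FrozenSet[str], float] = {
    frozenset(("picture", "image")): 0.5,
    frozenset(("remote", "button")): 0.33,
}


def toy_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else SCORES.get(frozenset((a, b)), 0.1)


def prune_candidates(candidates: List[str], base_terms: List[str],
                     sim: Callable[[str, str], float],
                     threshold: float) -> List[str]:
    """Keep candidates whose similarity to some base term exceeds the threshold."""
    return [c for c in candidates
            if any(sim(b, c) > threshold for b in base_terms)]


base_terms = ["picture", "remote"]
candidates = ["image", "shipping", "button"]
print(prune_candidates(candidates, base_terms, toy_similarity, threshold=0.3))
```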