The related work is organized based on the SA levels and the current trends in SA. SA has three levels: document, sentence, and aspect level.
2.1. Document Level
In this process, SA is carried out on the document or paragraph as a whole. Whenever a document is about a single subject, it is best to carry out document-level SA. Examples of document-level SA datasets are speeches of world leaders, movie reviews, mobile reviews, etc.
SentiWordNet (SWN) is an opinion-based lexicon derived from WordNet. WordNet is a lexical database consisting of words with short definitions and examples. SWN consists of dictionary words and a numeric positive and negative sentiment score for each word. WordNet and SWN are common choices of researchers carrying out SA at the document level. Pundlik et al. [6,7] worked on a multi-domain Hindi-language dataset. The architecture implemented in [6,7] contained two steps: domain classification, performed first using an ontology-based approach, and sentiment classification, performed using HSWN (Hindi SentiWordNet) and a Language Model (LM) classifier. A comparative study was done on the results of the HSWN and HSWN + LM classifiers; the combination of HSWN and the LM classifier gave better classification results than HSWN alone [6,7].
The work by Yadav et al. and Shah et al. [7,8] showed that SA for code-mixed Hindi could be performed using three approaches. The first approach performed neural-network classification over predefined words. The second approach used the IIT Bombay HSWN. The third approach performed classification using neural networks on predefined Hindi sentences. The approaches in [7,8] are explained in detail as follows. The first approach maintained manually created positive and negative word lists: the mixed-Hindi words were converted into pure Hindi words and searched in the lists; if a word was found in the positive list, the positive word count was incremented, and if it was found in the negative list, the negative word counter was incremented. The second approach used the HSWN instead of the positive and negative word lists, with all the remaining steps the same as in the first approach. In the third approach, seven features were created and applied to the sentences: the frequencies of words, adjectives, nouns, verbs, and adverbs, and the total positive and negative polarity of the sentence. These features were sent to a neural network for testing and the polarity of the words was detected. A comparison of all the approaches found that the second approach had the best accuracy, 71.5%.
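The first approach above amounts to simple lexicon lookup with counters. A minimal Python sketch of the idea is given below; the word lists and the mixed-Hindi normalization stub are hypothetical placeholders, not the authors' resources.

```python
# Minimal sketch of the word-list approach: count hits against manually
# curated positive/negative word lists (placeholder entries shown here).
POSITIVE_WORDS = {"अच्छा", "सुंदर"}   # hypothetical entries
NEGATIVE_WORDS = {"बुरा", "खराब"}    # hypothetical entries

def normalize_to_hindi(word: str) -> str:
    """Stand-in for the mixed-Hindi -> pure Hindi conversion step."""
    return word  # a real system would transliterate/map the word here

def classify(sentence: str) -> str:
    pos = neg = 0
    for token in sentence.split():
        word = normalize_to_hindi(token)
        if word in POSITIVE_WORDS:
            pos += 1
        elif word in NEGATIVE_WORDS:
            neg += 1
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```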
Ansari et al. [9] introduced an architecture for two code-mixed languages, Hindi and Marathi. The architecture included language identification, feature generation, and sentiment classification as its major steps. The Hindi and English WordNets and SWNs were used, as there was no SWN for Marathi: Marathi words were first translated into English, and the sentiment scores of the English words were found and assigned to them. Classification algorithms such as Random Forest, Naïve Bayes, and Support Vector Machine (SVM) were used to find the polarity in the final step. Slang identification and emoticon handling were also crucial steps in the study; slang is a group of words used informally in a particular language, and emoticons are representations of different facial expressions. SVM performed best among all the algorithms, with accuracies of 90% and 70% for Marathi and Hindi, respectively.
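Since Marathi lacked an SWN, scores were projected through English. The sketch below illustrates this projection idea, assuming NLTK's English SentiWordNet as the score source and a toy translation table; the actual system's translation and scoring details are not reproduced here.

```python
# Sketch of lexicon projection for a language without its own SWN:
# score a Marathi word via an English translation. The translate stub
# is hypothetical; a real system would call an MT tool or dictionary.
# Requires: nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def translate_to_english(marathi_word: str) -> str:
    return {"changle": "good"}.get(marathi_word, marathi_word)  # toy mapping

def sentiment_score(marathi_word: str) -> float:
    english = translate_to_english(marathi_word)
    synsets = list(swn.senti_synsets(english))
    if not synsets:
        return 0.0          # unknown word: treat as neutral
    s = synsets[0]          # first sense; one could also average senses
    return s.pos_score() - s.neg_score()

print(sentiment_score("changle"))  # positive score borrowed from "good"
```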
In their paper, Jha et al. [10] explain that a lot of SA research has been done in English but little in Hindi. The system developed by the authors carried out SA in Hindi using two approaches. In the first approach, Naïve Bayes was used for document classification; in the second, parts-of-speech (POS) tagging was done using the TnT POS tagger and a rule-based approach was used to classify the opinionated words. 200 positive and 200 negative movie review documents were web scraped for testing the system, which achieved an accuracy of 80%.
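A minimal sketch of Naïve Bayes document classification in the style of the first approach, using scikit-learn; the two toy reviews stand in for the authors' 400-document corpus.

```python
# Toy Naive Bayes document classifier: bag-of-words counts + MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["फिल्म बहुत अच्छी थी", "फिल्म बहुत खराब थी"]  # illustrative Hindi reviews
labels = ["positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["अच्छी फिल्म"]))  # -> ['positive']
```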
2.2. Sentence Level
Sentence-level SA identifies the opinions in a sentence and classifies the sentence as positive, negative, or neutral. Two types of sentences, subjective and objective, need to be identified while performing sentence-level SA. Subjective sentences carry opinions, expressions, and emotions; objective sentences carry factual information. Sentence-level SA can be carried out only on subjective sentences, hence it is important to first filter out the objective ones.
SWN is the most common lexicon-based approach used by researchers. Haithem et al. [7,11] developed the Irish SWN, whose accuracy was 6% greater than that obtained by transliterating the Irish tweets into English. The lexicon was created manually. The accuracy difference between the systems was due to the translation into English [11]. Naidu et al. [7,12] carried out SA on Telugu e-newspapers. Their system was divided into two steps: subjectivity classification and sentiment classification. In the first step, the sentences were divided into subjective and objective sentences. In the second step, only the subjective sentences were further classified as positive, negative, or neutral. Both steps were performed using the SWN, which gave accuracies of 74% and 81%, respectively [7,12].
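The two-step pipeline (subjectivity filtering, then polarity) can be sketched with a tiny lexicon of (positive, negative, objective) scores; the transliterated Telugu entries and thresholds below are illustrative assumptions, not values from [12].

```python
# Two-step sketch: subjectivity classification first, then sentiment,
# both driven by SWN-style (pos, neg, obj) scores. Entries are hypothetical.
LEXICON = {
    "manchi": (0.75, 0.00, 0.25),  # hypothetical entry for "good"
    "chedu":  (0.00, 0.75, 0.25),  # hypothetical entry for "bad"
}

def avg_scores(sentence):
    hits = [LEXICON[w] for w in sentence.split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0, 1.0  # no lexicon hits: treat as fully objective
    n = len(hits)
    return (sum(h[0] for h in hits) / n,
            sum(h[1] for h in hits) / n,
            sum(h[2] for h in hits) / n)

def classify_sentence(sentence):
    pos, neg, obj = avg_scores(sentence)
    if obj > 0.5:               # step 1: filter out objective sentences
        return "objective"
    if pos > neg:               # step 2: polarity of subjective sentences
        return "positive"
    return "negative" if neg > pos else "neutral"

print(classify_sentence("sinima chala manchi"))  # -> positive
```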
Nanda et al. [7,13] used the SWN to automatically annotate a movie review dataset. The machine-learning algorithms Random Forest and SVM were used to carry out the sentiment classification. Random Forest performed better than SVM, giving an accuracy of 91%. The performance metrics used to evaluate the algorithms were accuracy, precision, recall, and F1-score [7,13].
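Automatic annotation with a lexicon followed by supervised training is a form of weak supervision. The sketch below illustrates the idea with a placeholder labeling rule and a TF-IDF + Random Forest pipeline; it is not Nanda et al.'s implementation.

```python
# Weak-supervision sketch: label unannotated reviews with a lexicon rule,
# then train a supervised classifier on those automatic labels.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def weak_label(review: str) -> str:
    # placeholder for the SWN scoring step that produces the labels
    return "positive" if "good" in review else "negative"

reviews = ["good movie", "bad movie", "good plot", "bad acting"]
labels = [weak_label(r) for r in reviews]

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
model.fit(reviews, labels)
print(model.predict(["good acting"]))
```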
Pandey et al. [7,14] defined a framework for carrying out the SA task on Hindi movie reviews. They observed that lower accuracy was obtained when using the SWN alone as the classification technique, and hence suggested using a synset replacement algorithm along with the SWN. The synset replacement algorithm groups synonymous words with the same concept together. It helped increase the accuracy of the system because if a word was not present in the Hindi SWN, the algorithm found the closest word and assigned that word's score [7,14].
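The fallback logic of synset replacement can be sketched as follows; the synonym table and scores are hypothetical stand-ins for the Hindi SWN and its synsets.

```python
# Synset-replacement sketch: if a word is missing from the lexicon, fall
# back to a synonym that is present and reuse its score.
HSWN_SCORES = {"kathor": -0.5}                 # hypothetical HSWN entry
SYNSETS = {"nirdayi": ["krur", "kathor"]}      # hypothetical synonym groups

def polarity(word: str) -> float:
    if word in HSWN_SCORES:
        return HSWN_SCORES[word]
    for synonym in SYNSETS.get(word, []):      # closest-word fallback
        if synonym in HSWN_SCORES:
            return HSWN_SCORES[synonym]
    return 0.0                                 # still unknown: neutral

print(polarity("nirdayi"))  # -> -0.5, borrowed from "kathor"
```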
In their study, Bhargava et al. [7,15] completed the SA task on the FIRE 2015 dataset, which consisted of code-mixed sentences in English along with four Indian languages (Hindi, Bengali, Tamil, and Telugu). The architecture consisted of two main steps: language identification and sentiment classification. Punctuation and hashtags were identified and handled by the CMU Ark tagger. Machine-learning techniques such as logistic regression and SVM were used for language identification, and the SWN of each language was used for sentiment classification. The results of the implemented system were compared with a previous language-translation technique and 8% better precision was observed [7,15].
Kaur, Mangat and Krail [7,16] carried out their SA task on the Hinglish language, popularly used for social media communication. The authors [16] created a Hinglish corpus containing movie-review domain-specific Hindi words. Stopword removal and tokenization were the pre-processing techniques used in the system, along with TF-IDF as the vectorization technique. Classification algorithms such as SVM and Naïve Bayes were used to carry out the classification task. As future work, the authors [7,16] are trying to find the best feature and classifier combination.
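A compact sketch of the TF-IDF + SVM/Naïve Bayes setup described above, with toy Hinglish comments; the pre-processing steps are omitted for brevity.

```python
# TF-IDF vectorization with two alternative classifiers, as a sketch of
# the feature/classifier combinations being compared.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

comments = ["movie bahut acchi thi", "movie bakwas thi"]  # toy Hinglish
labels = ["positive", "negative"]

for clf in (LinearSVC(), MultinomialNB()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(comments, labels)
    print(type(clf).__name__, model.predict(["acchi movie"]))
```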
SVM is a machine-learning algorithm that is among the top choices of researchers, who have even compared the results of different deep-learning models with SVM [17]. In [7,17], the SA task was performed on Tibetan microblogs, where Word2vec was used as the vectorization technique; it converts words into numeric vectors. After the vectorization step, classification of the data was carried out by different machine-learning and deep-learning algorithms: SVM, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and CNN-LSTM. A CNN is a type of neural network with four layers: an input layer, a convolution layer, a global max pooling layer, and an output layer. The convolution layer is the main layer, as feature extraction is done in this layer. LSTM is a variant of the Recurrent Neural Network (RNN) capable of learning long-term dependencies and detecting patterns in the data. The comparative study of the different algorithms showed the CNN-LSTM model to be the best, with an accuracy of 86.21% [17].
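A minimal CNN-LSTM text classifier in the spirit of the best-performing model; the hyperparameters, vocabulary size, and random training batch below are illustrative assumptions, and the embedding layer is learned here rather than initialized from Word2vec.

```python
# CNN-LSTM sketch: convolution extracts local n-gram features, the LSTM
# models longer-range dependencies over the convolved sequence.
import numpy as np
from tensorflow.keras import layers, models

VOCAB, MAXLEN, CLASSES = 5000, 50, 3
model = models.Sequential([
    layers.Embedding(VOCAB, 128),               # word vectors (learned here)
    layers.Conv1D(64, 5, activation="relu"),    # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                            # long-range dependencies
    layers.Dense(CLASSES, activation="softmax"),  # pos / neg / neutral
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.randint(0, VOCAB, size=(8, MAXLEN))  # dummy token-id batch
y = np.random.randint(0, CLASSES, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```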
Joshi et al. [7,18] carried out SA on Gujarati tweets. Stopword removal and stemming were the pre-processing techniques used in the implemented model. Parts-of-speech (POS) tagging was used as the feature-extraction technique and SVM as the classification algorithm. SVM performed very well, giving an accuracy of 92%. Sharma et al. [7,19] tried to predict the Indian election results by extracting Hindi tweets in the political domain. The tweets were mainly about five major political parties. Three approaches were implemented to predict the winner of the election. The first approach was dictionary-based, in which n-grams were used as the pre-processing technique and TF-IDF as the vectorization technique; the SWN was used to classify the data and assign polarity scores to the words. Naïve Bayes and SVM were the remaining two approaches. Both SVM and Naïve Bayes predicted the BJP (Bhartiya Janta Party) as the winner. SVM had the highest accuracy of the three implemented approaches, 78.4%.
Phani et al. [20] carried out SA in three different languages: Hindi, Tamil, and Bengali. The feature-extraction techniques n-grams and surface features were explored in detail because they are language independent, simple, and robust. Twelve surface features were considered in the study, some of them being the number of words in the tweet, the number of hashtags in the tweet, the number of characters in the tweet, etc. A comparative study was carried out to find out which feature-extraction and sentiment-classifier combination worked best. Classifiers such as Multinomial Naïve Bayes, Logistic Regression (LR), Decision Trees, Random Forest, SVM SVC, and SVM LinearSVC were applied to the dataset. Most of the languages worked best with word unigrams and the LR algorithm, which resulted in the highest accuracy of 81.57% for Hindi [7,20].
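Surface features of the kind listed above are straightforward to compute; a small sketch follows (the exact twelve features of [20] are not reproduced here).

```python
# Language-independent surface features computed from the raw tweet text.
def surface_features(tweet: str) -> dict:
    tokens = tweet.split()
    return {
        "n_words": len(tokens),
        "n_chars": len(tweet),
        "n_hashtags": sum(t.startswith("#") for t in tokens),
        "n_mentions": sum(t.startswith("@") for t in tokens),
        "n_exclamations": tweet.count("!"),
    }

print(surface_features("#happy match jeet gaye!"))
```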
Research by Sahu et al. [7,21] was carried out on movie reviews in the Odia language. Naïve Bayes, Logistic Regression, and SVM were used for classification. The results of the different algorithms were compared using performance metrics such as accuracy, precision, and recall. Logistic Regression performed the best with an accuracy of 88%, followed by Naïve Bayes at 81% and SVM at 60% [7,21].
Guthier et al. [7,22] proposed a language-independent approach for SA. An emoticon dictionary was created and scores were assigned to the emoticons. When a tweet contained a combination of hashtags and emoticons, the hashtags were also added to the dictionary. A graph-based approach was implemented in the study. It worked on the principle that if multiple hashtags are present in a sentence, then all the hashtags have the same sentiment score, and all the hashtags in the same sentence can be linked with each other. The work was tested on five different languages and the accuracy obtained was above 75%, with an average accuracy of 79.8%. The approach worked fairly well with single-word hashtags and with hashtags that formed sentences, with accuracies of 98.3% and 84.5%, respectively.
Kaur et al. [7,23] worked on a Hinglish dataset. YouTube comments from two popular cookery channels were extracted and analyzed. Pre-processing techniques such as stopword removal, null-value removal, spelling-error removal, tokenization, and stemming were performed. DBSCAN, an unsupervised clustering algorithm, was used, and 7 clusters were formed for the entire dataset; the dataset was then manually annotated with the labels of these 7 classes. Eight machine-learning algorithms were used to perform sentiment classification. Logistic regression along with term-frequency vectorization outperformed the other classification techniques, with accuracies of 74.01% on one dataset and 75.37% on the other. Statistical testing was also carried out to confirm the accuracy of the classifiers.
Both document-level and sentence-level SA extract the sentiment of a given text, but the feature towards which the sentiment is expressed cannot be found. This shortcoming is addressed by aspect-level SA.
2.3. Aspect Level
Aspect-level SA is carried out in two steps. The first step is to find the features or components mentioned in the text and the second is to find the polarity of the sentiment attached to each feature. For example, if mobile reviews are given as a series of tweets, a company first finds out which part or feature of the mobile the users are talking about and then finds out the emotions related to that feature.
In the paper by Ekbal et al. [7,24], aspect-level SA was carried out on product reviews. The dataset was obtained by web scraping different websites. The multi-domain product reviews obtained were analyzed in a two-step process: the first step was aspect extraction, i.e., the aspects (features) in the reviews were extracted using the Conditional Random Field (CRF) algorithm, and the second step was sentiment classification of the extracted aspects. Performance evaluation metrics such as F-measure and accuracy were used. SVM gave an accuracy of 54.05% for sentiment classification.
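Aspect extraction with a CRF is typically cast as BIO sequence labeling. The sketch below uses the sklearn-crfsuite package with toy English sentences and minimal features; it illustrates the technique rather than the authors' feature set.

```python
# CRF-based aspect extraction sketch: tag each token as B-ASPECT,
# I-ASPECT, or O, training on toy labeled sentences.
import sklearn_crfsuite

def features(tokens, i):
    return {"word": tokens[i].lower(), "is_first": i == 0,
            "prev": tokens[i - 1].lower() if i else "<s>"}

sentences = [["The", "battery", "drains", "fast"],
             ["Great", "screen", "quality"]]
tags = [["O", "B-ASPECT", "O", "O"],
        ["O", "B-ASPECT", "I-ASPECT"]]

X = [[features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))
```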
The work proposed by Ray et al. [7,25] is SA of Twitter data. POS tagging was used as the feature-extraction technique and word embedding as the vectorization technique; word embedding is a method in which the words of a sentence are converted into vectors of real numbers. Aspects were not directly labeled; instead, they were tagged to a predefined list of categories. Classification was done using three approaches: CNN, a rule-based approach, and CNN + rule-based. The hybrid CNN + rule-based model gave an accuracy of 87%.
Table 1 summarizes the work done by different researchers on indigenous languages [7].