A Sentiment-Aware Contextual Model for Real-Time Disaster Prediction Using Twitter Data

Abstract: The massive amount of data generated by social media presents a unique opportunity for disaster analysis. As a leading social platform, Twitter generates over 500 million Tweets each day. Due to its real-time nature, more agencies employ Twitter to track disaster events and make speedy rescue plans. However, it is challenging to build an accurate predictive model to identify disaster Tweets, which may lack sufficient context due to the length limit. In addition, disaster Tweets and regular ones can be hard to distinguish because of word ambiguity. In this paper, we propose a sentiment-aware contextual model named SentiBERT-BiLSTM-CNN for disaster detection using Tweets. The proposed learning pipeline consists of SentiBERT, which generates sentiment-aware contextual embeddings from a Tweet; a bidirectional long short-term memory (BiLSTM) layer with attention; and a 1D convolutional layer for local feature extraction. We conduct extensive experiments to validate certain design choices of the model and compare our model with its peers. Results show that the proposed SentiBERT-BiLSTM-CNN demonstrates superior performance in the F1 score, making it a competitive model for Tweet-based disaster prediction.


Introduction
Social media has become increasingly popular for people to share instant feelings, emotions, opinions, and stories. As a leading social platform, Twitter has gained tremendous popularity since its inception. The latest statistics show that over 500 million Tweets are sent each day, generating a massive amount of social data that is used by numerous upper-level analytical applications to create additional value. Meanwhile, numerous studies have adopted Twitter data to build natural language processing (NLP) applications such as named entity recognition (NER) [1], relation extraction [2], question answering (Q&A) [3], sentiment analysis [4], and topic modeling [5].
In addition to its social function, Twitter is also becoming a real-time platform to track events, including accidents, disasters, and emergencies, especially in the era of mobile Internet and 5G communication, where smartphones allow people to post an emergency Tweet instantly. Timing is the most critical factor in making a rescue plan, and the rise of social media brings a unique opportunity to expedite this process. Due to this convenience, more agencies such as disaster relief organizations and news agencies are deploying resources to programmatically monitor Twitter, so that first responders can be dispatched and rescue plans made at the earliest time. However, processing social media data and retrieving valuable information for disaster prediction requires a series of operations: (1) perform text classification on each Tweet to predict disasters and emergencies; (2) determine the location of people who need help; (3) calculate priorities to schedule rescues. Disaster prediction is the first and most important step, because a misclassification may waste precious resources that could have been dispatched to real needs [6].
To automate this process, an accurate and robust classifier is needed to distinguish real disaster Tweets from regular ones. Disaster prediction based on Tweets is challenging because words indicative of a disaster, such as "fire", "flood", and "collapse", can be used metaphorically to describe something else. For example, the Tweet "On plus side look at the sky last night it was ABLAZE" explicitly uses the word "ABLAZE" but means it metaphorically. The length limit of Tweets brings both pros and cons for training a classifier: users are forced to tell a story concisely, but the lack of clear context may prevent a classifier from understanding and interpreting the real meaning of a Tweet. Therefore, it is crucial to build an advanced model that can understand the subtle sentiment embedded in Tweets along with their given contexts to make better predictions.
Recent advances in deep learning have explored approaches to address these challenges, which are commonly seen in other NLP tasks. Convolutional neural networks (CNNs), widely used in computer vision tasks, have also been successfully applied in NLP systems due to their ability for feature extraction and representation. Recurrent neural networks (RNNs) and their popular variants, Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), are not only suitable for general sequential modeling tasks but can also capture long-range dependencies between words in a sentence. In addition, LSTM and GRU mitigate the gradient explosion and vanishing issues and allow a training algorithm to converge. Another breakthrough architecture is Bidirectional Encoder Representations from Transformers (BERT), which stacks layers of Transformer encoders with a multi-headed attention mechanism to enhance a model's ability to capture contextual information.
Inspired by these prior efforts, we propose a learning pipeline named SentiBERT-BiLSTM-CNN for disaster prediction based on Tweets. As shown in Figure 1, the pipeline consists of three consecutive modules: (1) a SentiBERT-based encoder that transforms input tokens into sentiment-aware contextual embeddings; (2) a bidirectional LSTM (BiLSTM) layer with attention that produces attentive hidden states; and (3) a single-layer CNN as a feature extractor. A standard detection head takes a concatenation of the generated features and feeds it into a fully connected layer followed by a softmax layer to output the prediction result, i.e., disaster Tweet or not. The design is validated through extensive experiments, including hyper-parameter tuning to decide certain design choices and an ablation study to justify the necessity of each building block. Results show that the proposed system achieves superior performance in the F1 score, making it a competitive model for Tweet-based disaster prediction. The rest of this paper is organized as follows: Section 2 reviews relevant studies; Section 3 covers the dataset description and the technical details of the proposed learning model; Section 4 provides experimental validation with result analysis; Section 5 summarizes our work and points out future directions.

Social Media Learning Tasks
Data collected from social media hold great potential to explore. Social texts have been extensively studied and mined to build a wide range of NLP applications such as NER [1], Q&A [3], sentiment analysis [4,7-10], and topic modeling [5,11,12]. In addition, social data have been utilized for emergency, disease, and disaster analysis [13-15]. In [16], the authors develop predictive models to detect Tweets that present situational awareness. The models are evaluated on four real-world datasets, including the Red River floods of 2009 and 2010, the Haiti earthquake of 2010, and the Oklahoma fires of 2009. This paper focuses on exploring the contextual information in Tweets to build a robust disaster classifier.

RNN/CNN-Based Models in Text Mining
Active development in deep learning in recent years has produced fruitful achievements in social text learning. As two representative learning models, RNNs and CNNs appear in numerous studies, either individually or in hybrid fashion.
Huang et al. [17] combined BiLSTM and Conditional Random Fields (CRF) to build a sequential tagging framework that can be applied to part-of-speech (POS) tagging, chunking, and NER tasks. In [18], Liu et al. propose a Stochastic Answer Network (SAN) that stacks various layer types, including GRU, BiLSTM, and self-attention; along with a stochastic prediction dropout trick, the SAN model shows superior performance in reading comprehension.
Kalchbrenner et al. [19] designed one of the earliest CNN-based methods for sentence modeling, featuring a dynamic CNN (DCNN) that uses dynamic k-max pooling subsampling and achieves superior performance in sentiment classification. Due to CNN's ability in feature extraction, the DCNN-based system does not require hand-crafted features, a property appreciated and widely adopted by numerous subsequent studies. Kim [4] proposed a simple but effective CNN architecture that utilizes word embeddings pre-trained by word2vec. Kim's work was modified by Liu et al. [20], who propose to learn word embeddings rather than use pre-trained ones directly. Mou et al. designed a tree-based CNN [21] that can capture the general semantics of sentences. In [22], Pang et al. proposed to transform the text matching problem into an image recognition task that can be solved by a CNN-based model. In addition to open-domain datasets, CNNs have also been extensively used in domain-specific tasks, especially biomedical text classification [22-27].
Chen et al. [18] proposed a two-stage method that combines BiLSTM and CNN for sentiment classification. First, the BiLSTM model is used for sentence type classification. Once assigned a type, a sentence then goes through a 1D CNN layer for sentiment detection. In [28], the authors designed a hybrid network that combines RNN, MLP, and CNN to explore semantic information at each hierarchical level of a document.

Transformer-Based Models for Social Text Learning
BERT [29] and its variants [30-36] have been extensively used as building blocks for numerous applications, owing to their ability to capture contextual word embeddings. FakeBERT [37] combines BERT and 1D CNN layers to detect fake news on social media. A similar work [38] adopts BERT to detect auto-generated Tweets. Mozafari et al. [39] designed a BERT-based transfer learning method to detect hate speech on Twitter. Eke et al. [40] employed BERT to build a sarcasm detector that can classify sarcastic utterances, which is crucial for downstream tasks such as sentiment analysis and opinion mining.

Learning-Based Disaster Tweets Detection
One of the early efforts to identify and classify disaster Tweets is by Stowe et al. [6], who focused on the Tweets generated when Hurricane Sandy hit New York in 2012. In [6], six fine-grained categories of Tweets, including Reporting, Sentiment, Information, Action, Preparation, and Movement, are annotated. With a series of hand-crafted features, such as key terms, bigrams, time, and URLs, the dataset is used to train three feature-based models: SVM, maximum entropy, and Naive Bayes. Palshikar et al. [41] developed a weakly supervised model based on a bag of words, combined with an online algorithm that learns the weights of words to boost detection performance. Algur et al. [42] first transformed Tweets into vectors using count vectorization and Term Frequency-Inverse Document Frequency (TF-IDF), based on a set of pre-identified disaster keywords; the vectorized Tweets are then used to train Naive Bayes, Logistic Regression, J48, Random Forest, and SVM classifiers. Singh et al. [43] investigated a Markov-based model to predict the priority and location of Tweets during a disaster. Madichetty et al. [44] designed a neural architecture that consists of a CNN to extract features from Tweets and a multilayer perceptron (MLP) to perform classification. Joao [45] developed a BERT-based hybrid model that uses both hand-crafted and learned features for informative Tweet identification. Li et al. [46] investigate a domain-adapted learning task that uses a Naive Bayes classifier, combined with an iterative self-training algorithm, to incorporate annotated data from a source disaster dataset and unannotated data from the target disaster dataset into a classifier for the target disaster. More broadly, prior efforts on event Tweet detection are also of interest. Ansah et al. [47] proposed a model named SensorTree that detects protest events by tracking information propagated through Twitter user communities and monitoring sudden changes in the growth of these communities as bursts. Saeed et al. [48] developed a Dynamic Heartbeat Graph (DHG) model to detect trending topics from the Twitter stream. A survey of recent efforts [49] in disaster Tweet detection reveals that deep learning-based methods, despite their superiority in numerous other NLP applications as mentioned in Section 2.1, remain underused in this sub-field. In addition, the idea of integrating sentiment information into a disaster detector remains unexplored, and our study is an attempt to fill this gap.
Inspired by these prior efforts, we design a learning pipeline that includes a BERT variant named SentiBERT [50] to obtain sentiment-aware contextual embeddings, a BiLSTM layer for sequential modeling, and a CNN for feature extraction. The pipeline aggregates the strengths of the individual blocks to enhance predictive power and realize an accurate disaster detector.

Dataset
The dataset was created by Figure Eight Inc. (an Appen company) from Twitter data and used in a Kaggle competition hosted at https://www.kaggle.com/c/nlp-getting-started/data (accessed on 21 June 2021). There are 10,876 samples in the dataset, including 4692 positive samples (disaster) and 6184 negative samples (not a disaster). Table 1 shows four positive and four negative samples. It can be seen that disaster and non-disaster Tweets may use similar keywords in different contexts, resulting in different interpretations. For example, "pileup" in sample 1, "airplane's accident" in sample 2, "Horno blaze" in sample 3, and the phrase "a sign of the apocalypse" in sample 4 are more indicative of a disaster. However, the words "bleeding", "blaze", "ambulance", and "Apocalypse" in samples 5 through 8 do not indicate a disaster, given their contexts. Figure 2 displays the histograms of three variables per Tweet: the number of characters, the number of words, and the average word length. Specifically, the mean number of characters per Tweet for disaster and non-disaster Tweets is 108.11 and 95, respectively; the mean number of words per Tweet is 15.16 and 14.7, respectively; and the mean average word length is 5.92 and 5.14, respectively. These statistics show that disaster Tweets are relatively longer than non-disaster ones.

Data Pre-Processing
The raw data obtained from Twitter contain noise that needs to be cleaned. Thus, we apply a pre-processing step to remove hashtags, emoticons, and punctuation marks. For example, the message "# it's cool. :)" becomes "it's cool." after filtering. We then apply some basic transformations, such as changing "We've" to "We have", to create a better word separation within a sentence. Finally, we tokenize each message to generate a word sequence as the input of the learning pipeline.
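The cleaning steps above can be sketched as follows. This is a minimal illustrative version: the paper does not specify the exact filtering patterns or contraction list, so the regular expressions and the `CONTRACTIONS` table here are assumptions.

```python
import re

# Illustrative contraction table; the paper's actual list is not given.
CONTRACTIONS = {"'ve": " have", "'re": " are", "n't": " not", "'ll": " will"}

def preprocess(tweet: str) -> list:
    text = tweet
    # Expand common contractions, e.g. "We've" -> "We have".
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"#", "", text)                 # strip hashtag symbols
    text = re.sub(r"[:;][-']?[)(DPp]", "", text)  # simple emoticon pattern
    text = re.sub(r"[^\w\s]", "", text)           # remaining punctuation
    # Tokenize into a word sequence for the learning pipeline.
    return text.split()

preprocess("# We've escaped the fire. :)")
# → ['We', 'have', 'escaped', 'the', 'fire']
```

In practice, a Tweet-aware tokenizer (or the SentiBERT tokenizer itself) would handle many of these cases more robustly.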

Overview of the Proposed Learning Pipeline
Figure 3 shows the proposed SentiBERT-BiLSTM-CNN learning pipeline, which consists of three sequential modules:
1. SentiBERT is utilized to transform word tokens from the raw Tweet messages into contextual word embeddings. Compared to BERT, SentiBERT is better at understanding and encoding sentiment information.
2. BiLSTM is adopted to capture the order information as well as the long-dependency relations in a word sequence.
3. CNN acts as a feature extractor that strives to mine textual patterns from the embeddings generated by the BiLSTM module.
The output of the CNN is fed to a detection layer to generate the final prediction result, i.e., disaster or not.

SentiBERT
BERT [29] is an attention-based language model that utilizes a stack of Transformer encoders to learn textual information. It uses a multi-headed attention mechanism to extract useful features for the task. The bidirectional Transformer network, as the encoder of BERT, converts each word token into a numeric vector to form a word embedding, so that semantically related words are translated to embeddings that are numerically close. BERT also employs a masked language model (MLM) technique and a next sentence prediction (NSP) task in training to capture word-level and sentence-level contextual information. BERT and its variants have been applied to numerous NLP tasks, such as named entity recognition, relation extraction, machine translation, and question answering, and have achieved state-of-the-art performance. In this study, we choose a BERT variant, SentiBERT, a transferable Transformer-based architecture dedicated to the understanding of sentiment semantics. As shown in Figure 4, SentiBERT modifies BERT by adding a semantic composition unit and a phrase node prediction unit. Specifically, the semantic composition unit aims to obtain phrase representations guided by contextual word embeddings and an attentive constituency parsing tree. Phrase-level sentiment labels are used for phrase node prediction. Due to the addition of phrase-level sentiment detection, a sentence can be broken down and analyzed at a finer granularity to capture more sentiment semantics. Let s = {w_i | i = 1, ..., n} denote a Tweet message with n word tokens, which are the input of SentiBERT. Our goal is to leverage the power of SentiBERT to generate sentiment-enhanced word embeddings, denoted by e = {e_i = SentiBERT(w_i) | i = 1, ..., n}. In this study, each Tweet is limited to 64 tokens; Tweets with fewer than 64 tokens are padded, so that n = 64. Reference [29] experimentally showed that the outputs of the last four hidden layers of BERT encode more contextual information than those of the earlier layers. To this end, we also chose a concatenation of the outputs of the last four hidden layers as the word embedding representation.
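The last-four-layer concatenation can be sketched as below. To keep the example self-contained (no model download), the encoder's hidden states are faked with random tensors; in practice they would come from SentiBERT with a HuggingFace-style `output_hidden_states=True` flag, which is an assumption about the implementation, not something the paper states.

```python
import torch

# Simulated encoder outputs: an embedding layer plus 12 encoder layers,
# each producing a (batch, seq_len, hidden) tensor, as in BERT-base.
num_layers, batch, seq_len, hidden = 12, 2, 64, 768
hidden_states = tuple(
    torch.randn(batch, seq_len, hidden) for _ in range(num_layers + 1)
)

# Concatenate the outputs of the last four layers along the feature axis.
embeddings = torch.cat(hidden_states[-4:], dim=-1)
print(embeddings.shape)  # (batch, seq_len, 4 * hidden) = (2, 64, 3072)
```

The resulting 3072-dimensional vectors would then be (optionally projected and) fed to module II.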

BiLSTM with Attention
A regular LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The cell can memorize values over arbitrary time periods, and the three gates regulate information flow into and out of the cell to keep what matters and forget what does not. A BiLSTM consists of a forward and a backward LSTM that process an input token sequence from both directions. By looking at both past and future words, a BiLSTM network can capture more of the semantic meaning of a sentence. In our study, the word embeddings e produced by module I are fed into a standard BiLSTM layer to generate a list of hidden states h = {h_i | i = 1, ..., n}, where h_i is given by Equation set (1).

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(e_i), \quad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(e_i), \quad h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \tag{1}$$
where [;] is a concatenation operation. The structure is shown in Figure 5. In a Tweet, each word influences the disaster polarity differently. An attention mechanism can help the model learn to assign different weights to different words, so that more influential words are given higher weights. For a hidden state h_i, its attention weight a_i is given in Equation set (2).

$$u_i = \tanh(W h_i + b), \quad a_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{j=1}^{n} \exp(u_j^{\top} u_w)} \tag{2}$$
where W denotes a weight matrix, b denotes the bias, and u_w is a global context vector; all three are learned during training. The output of module II is a concatenation of attentive hidden states H = [a_1 h_1; ...; a_n h_n].
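Module II can be sketched in PyTorch as follows. The class name and dimensions (768-dimensional embeddings, hence a hidden size of 384 per direction) are illustrative assumptions chosen to match the settings reported later, not the authors' code.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Sketch of module II: a BiLSTM whose hidden states are re-weighted
    by the word-level attention of Equation set (2)."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 384):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.W = nn.Linear(2 * hidden_dim, 2 * hidden_dim)    # W h_i + b
        self.u_w = nn.Parameter(torch.randn(2 * hidden_dim))  # global context vector

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(e)                   # (batch, n, 2*hidden_dim)
        u = torch.tanh(self.W(h))               # u_i = tanh(W h_i + b)
        a = torch.softmax(u @ self.u_w, dim=1)  # a_i over the n tokens
        return a.unsqueeze(-1) * h              # attentive hidden states a_i h_i

e = torch.randn(2, 64, 768)   # a batch of 2 Tweets, 64 tokens each
H = AttentiveBiLSTM()(e)
print(H.shape)                # (2, 64, 768)
```

Note that the attention here rescales each hidden state rather than pooling them, matching the description of H as a concatenation of attentive hidden states.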

CNN
Module III is a CNN that extracts local features, as shown in Figure 6. We adopt a 1D convolutional layer with four differently sized filters. Each filter scans the input matrix H and performs a 1D convolution along the way to generate a feature map. The extracted features are then fed into a max-pooling layer and concatenated to form a feature matrix F. Lastly, we send a concatenation of H and F to the dense layer.
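A minimal sketch of module III is shown below, using the four filter sizes (2, 3, 4, 5) reported in the training settings; the number of output channels per filter is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sketch of module III: four 1D convolutional filters of different
    sizes followed by max-pooling over the time axis."""
    def __init__(self, in_dim: int = 768, n_filters: int = 128,
                 sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_filters, kernel_size=k) for k in sizes
        )

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        x = H.transpose(1, 2)  # (batch, in_dim, seq_len) for Conv1d
        # Each filter yields a feature map; max-pool it over time.
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)  # feature matrix F

H = torch.randn(2, 64, 768)   # attentive hidden states from module II
F = TextCNN()(H)
print(F.shape)                # (2, 4 * 128) = (2, 512)
```

The concatenation [H; F] mentioned above would then be flattened and passed to the fully connected detection head.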

A Fusion of Loss Functions
In this subsection, we explore the options for loss functions. We considered two individual loss functions: the binary cross-entropy (BCE) loss and the Focal Loss (FL). In addition, we employed a fusion strategy, as suggested in [51], to combine the two losses, which resulted in a performance improvement.
Since disaster Tweet detection is a typical binary classification problem, it is intuitive to utilize the BCE loss shown in Equation (3):

$$L_{BCE} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right] \tag{3}$$

in which m is the training set size, and y^(i) and ŷ^(i) denote the ground truth and the predicted probability for the ith sample in the dataset, respectively. Meanwhile, considering the imbalanced sample distribution, this study also employs the Focal Loss, defined in Equation (4):

$$L_{FL} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \alpha \left(1 - \hat{y}^{(i)}\right)^{\gamma} y^{(i)} \log \hat{y}^{(i)} + (1 - \alpha) \left(\hat{y}^{(i)}\right)^{\gamma} \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right] \tag{4}$$

where γ is a coefficient that controls the curve shape of the focal loss function. Using Focal Loss with γ > 1 reduces the loss for well-classified examples (i.e., with a prediction probability larger than 0.5) and increases the loss for hard-to-classify examples (i.e., with a prediction probability less than 0.5), turning the model's attention towards the rare class in case of class imbalance. A lower α value gives a smaller weight to the dominating class and a higher weight to the rare class. By fusing the Focal Loss and the BCE loss in a certain ratio, we obtain Equation (5), in which β_1 and β_2 specify the fusion weights:

$$L = \beta_1 L_{BCE} + \beta_2 L_{FL} \tag{5}$$
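A minimal sketch of fusing BCE and Focal Loss is given below. The α, γ, β_1, and β_2 values are illustrative defaults, not the paper's tuned settings (the experiments report a 1:1 fusion ratio).

```python
import torch

def fused_loss(y_hat: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0,
               beta1: float = 0.5, beta2: float = 0.5) -> torch.Tensor:
    """Weighted sum of BCE and Focal Loss over predicted probabilities y_hat
    and binary labels y. Coefficient values are illustrative."""
    eps = 1e-7
    y_hat = y_hat.clamp(eps, 1 - eps)  # avoid log(0)
    bce = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()
    focal = -(alpha * (1 - y_hat) ** gamma * y * y_hat.log()
              + (1 - alpha) * y_hat ** gamma * (1 - y) * (1 - y_hat).log()).mean()
    return beta1 * bce + beta2 * focal

y = torch.tensor([1.0, 0.0, 1.0])       # ground-truth labels
y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted probabilities
loss = fused_loss(y_hat, y)
```

Note the focal term collapses to (a class-weighted) BCE when γ = 0, so the fusion smoothly interpolates between the two objectives.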

Experiments
We utilize the disaster Tweet dataset discussed in Section 3.1 for performance evaluation.We first present the performance metrics and then report the experimental results.

Evaluation Metrics
We use precision (Pre), recall (Rec), and the F1 score to evaluate model performance. Given that the positive/negative samples are not balanced, F1 is a better metric than accuracy. Precision and recall are also important. The former reflects the number of false alarms: the higher the precision, the fewer false alarms. The latter reflects the number of positive samples that are missed: the higher the recall, the fewer disaster Tweets missed. A large precision-recall gap should be avoided, since it indicates that a model over-optimizes a single metric; a model should instead optimize F1, the harmonic mean of precision and recall.
Let TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively; we can then calculate precision, recall, and F1 as follows:

$$Pre = \frac{TP}{TP + FP}, \quad Rec = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}$$
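These definitions translate directly into code; the counts below are made-up numbers for illustration.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    return pre, rec, f1

# e.g., 75 disaster Tweets correctly flagged, 25 false alarms, 25 missed:
pre, rec, f1 = prf1(tp=75, fp=25, fn=25)
# → (0.75, 0.75, 0.75)
```

When precision and recall are equal, F1 coincides with both; the gap between them is what the harmonic mean penalizes.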

Training Setting
The dataset was divided into training and validation sets at a ratio of 7:3, generating 7613 training and 3263 validation samples. For SentiBERT, the embedding dimension was 768, the max sequence length was 128, and the number of layers was 12; for the BiLSTM module, the number of layers was 1 and the feature number was 768; for the CNN module, the sizes of the four filters were set to 2, 3, 4, and 5. For the overall architecture, we used a learning rate of 1 × 10^-4 and the Adam optimizer, and experimented with different batch sizes (16 and 32) and training epochs (6, 8, 10, 12, and 14). All experiments were implemented using Python 3.9.4 and PyTorch 1.8.0 on Google Colab with an NVIDIA Tesla K80.

Baseline Model
The baseline model we chose was a BERT-based hybrid model developed by Joao [45], denoted as BERT_hyb. We regard BERT_hyb as a credible baseline because it presented state-of-the-art (SOTA) performance compared to a variety of models on four datasets. BERT_hyb works by combining a series of hand-crafted Tweet features with the BERT word embeddings and sending the feature concatenation to an MLP for classification.

Effect of Hyper-Parameter Choices
We conducted experiments to evaluate the performance of our SentiBERT-BiLSTM-CNN model under different hyper-parameter settings. Specifically, the model was trained with combinations of five epoch values (6, 8, 10, 12, and 14) and two batch sizes (16 and 32), creating ten experiments, as shown in Table 2. When trained for 10 epochs with a batch size of 32, the model achieved the best performance, with an F1 of 0.8956. We also observe a consistent performance improvement as the number of epochs increases from 6 to 10; beyond 10 epochs, the gain is not apparent. Training was efficient because SentiBERT had been pre-trained and was only fine-tuned on our dataset. Note that for this set of experiments, we applied a basic cross-entropy loss function. The effect of the fused loss function is reported in the next subsection.

The Effect of a Hybrid Loss Function
We conducted experiments to evaluate the model's performance under different loss function settings. We first evaluated the performance of BCE and FL individually and then fused the two loss functions at a ratio of 1:1. The results are reported in Table 3. We observe that the model with FL outperformed the model with BCE, validating the efficacy of FL in the case of an imbalanced data distribution. In addition, the model with the hybrid loss function performed best, with an F1 of 0.9275, demonstrating the effectiveness of the fusion strategy. We also evaluated a set of related models and present a performance comparison of all evaluated models in Table 4, using the best hyper-parameter settings and the fused loss function reported in the previous two subsections. We give the result analysis as follows.


Error Analysis
Table 5 shows ten samples, five positive and five negative, that are misclassified by the proposed SentiBERT-BiLSTM-CNN model. In this subsection, we provide an analysis of these mistakes, which may shed light on further improvements of our model.

• For the five samples marked as disaster Tweets (i.e., samples 1 through 5), none of them describes a common-sense disaster: sample 1 seems to state a personal accident; sample 2 talks about a US dollar crisis, which may indicate inflation given its context; in sample 3, the phrase "batting collapse" refers to a significant failure of the batting team in a sports game; sample 4 is the closest to a real disaster, but the word "simulate" simply reverses the semantic meaning; sample 5 does mention a disaster, "Catastrophic Man-Made Global Warming", but the user simply expresses his/her opinion against it. Our observation is that the process of manual annotation could introduce noise that affects model training. From another perspective, the noise helps build more robust classifiers and potentially reduces overfitting.

• For the five negative samples (6-10), we also observe possible cases of mislabeling: sample 6 clearly reports a fire accident with the phrase "burning buildings" but was not labeled as a disaster Tweet; sample 7 states a serious traffic accident; sample 8 mentions a bio-disaster with the phrase "infectious diseases and bioterrorism"; sample 9 has only three words, and it is hard to tell its class without more context, although the word "bombed" appears in the Tweet; sample 10 reflects a person's suicidal intent, which could have been marked as a positive case.

We need to clarify that the misclassified samples presented in the table were randomly selected from all erroneous predictions. It can be seen that the length limit of Tweets presents pros and cons for training a classifier. The bright side is that users are forced to use short and direct words to express an opinion; the downside is that some short Tweets are hard to interpret due to the lack of context, which is the main challenge for training an accurate model.

Conclusions
Disaster analysis is highly relevant to people's daily lives, and recent years have seen more research efforts dedicated to this field. Research on disaster prediction helps raise people's awareness, improve government rescue mechanisms, and schedule the work of charitable institutions. This paper investigates a novel model for disaster detection using Tweets. Our model, SentiBERT-BiLSTM-CNN, leverages a sentiment-aware BERT encoder, an attentive BiLSTM, and a 1D convolutional layer to extract high-quality linguistic features for disaster prediction. The model is validated through extensive experiments against its peers, making it a competitive model for building a real-time disaster detector.
Although the proposed model is trained and validated on an English dataset, it can be applied to datasets in other languages. Specifically, in a different language environment, the following adjustments need to be made: first, we should find a BERT model pre-trained in the target language or in a multilingual setting, which is readily available online (https://huggingface.co/transformers/pretrained_models.html, accessed on 12 March 2021); second, we need to retrain SentiBERT on a sentiment analysis dataset in the target language; lastly, a new disaster Tweet dataset in the target language is needed to train and validate the model. In this new language environment, SentiBERT can then generate sentiment-aware word embeddings to be consumed by the subsequent BiLSTM and CNN modules, which are language independent. This work has the following limitations, which also point out future directions. First, it remains interesting to uncover the role keywords play in disaster detection. Given that keywords like "blaze" and "apocalypse" can appear in both disaster and non-disaster Tweets, it is challenging to effectively utilize keywords as extra knowledge to boost detection accuracy. One potential solution is to fine-tune BERT through pair-wise training, taking a pair of Tweets containing the same keywords but with opposite training labels; this way, BERT is forced to better understand the contextual difference between the two Tweets. Second, it remains unknown how well the model trained on our dataset performs on other disaster datasets, such as HumAID [54] and CrisisMMD [55]; in addition, we expect to obtain a more robust model trained across multiple disaster/crisis Tweet datasets. Third, we are interested in creating a multilingual disaster detector that can understand and process Tweets in different languages; it is worth conducting a performance comparison between multilingual and monolingual models.

Figure 2 .
Figure 2. Stats of Tweets in the dataset. Histograms of (a) the number of characters per Tweet, (b) the number of words per Tweet, and (c) the average word length per Tweet, plotted for disaster Tweets (left) and non-disaster Tweets (right).

Figure 6 .
Figure 6. Module III: feature extraction via a CNN layer.

Table 1 .
Disaster Tweets dataset samples.A + sign indicates a positive sample, and a − sign indicates a negative sample.

Table 3 .
Performance of SentiBERT-BiLSTM-CNN under different loss function settings.
• The set of models CNN, BiLSTM, SentiBERT, BiLSTM-CNN, and SentiBERT-BiLSTM-CNN forms an ablation study, from which we can evaluate the performance of each individual module and the combined versions. It can be seen that the pure CNN model performs the worst, since a single-layer CNN cannot learn any contextual information. Both BiLSTM (with attention) and SentiBERT present an obvious improvement. SentiBERT is on a par with BiLSTM-CNN in precision but outperforms it in recall. Our final model, SentiBERT-BiLSTM-CNN, tops every other model, showing its power to combine the strengths of each individual building block.
• The set of models fastText-BiLSTM-CNN, word2vec-BiLSTM-CNN, BERT-BiLSTM-CNN, and SentiBERT-BiLSTM-CNN is evaluated to compare the effect of word embeddings. FastText [52], word2vec [53], BERT, and SentiBERT are used for the same purpose, i.e., to generate word embeddings. A model's ability to preserve contextual information determines its performance. From the results, we observe that by adding contextual embeddings, the models gain improvements to varying degrees.

Table 4 .
A performance comparison of models.

Table 5 .
Examples of misclassified samples.A "+" sign indicates a positive sample, and a "−" sign indicates a negative sample.