Transfer Learning for Sentiment Analysis Using BERT Based Supervised Fine-Tuning

The growth of the Internet has expanded the amount of data users express across multiple platforms. The availability of these diverse worldviews and individual emotions empowers sentiment analysis. However, sentiment analysis becomes even more challenging due to the scarcity of standardized labeled data in the Bangla NLP domain. Most existing Bangla research has relied on deep learning models built on context-independent word embeddings, such as Word2Vec, GloVe, and fastText, in which each word has a fixed representation irrespective of its context. Meanwhile, context-based pre-trained language models such as BERT have recently revolutionized the state of natural language processing. In this work, we leveraged BERT's transfer learning ability in a deep integrated CNN-BiLSTM model to enhance decision-making performance in sentiment analysis. In addition, we applied transfer learning to classical machine learning algorithms for comparison against CNN-BiLSTM. We also explore various word embedding techniques, such as Word2Vec, GloVe, and fastText, and compare their performance against the BERT transfer learning strategy. As a result, we demonstrate state-of-the-art binary classification performance for Bangla sentiment analysis that significantly outperforms all other embeddings and algorithms.


Introduction
Sentiment classification is the process of examining a piece of text to forecast the orientation of an individual's attitude toward an event or perspective. Sentiment is usually analyzed based on text polarity: typically, a sentiment classifier categorizes text as positive, negative, or neutral [1]. Sentiment extraction is the backbone of sentiment categorization, and considerable research has been conducted on it. The next crucial step is sentiment mining, which has grown tremendously in recent years in line with the growth of textual data worldwide. People now share their ideas electronically on various topics, including online product reviews, book or film critiques, and political commentary. As a result, evaluating diverse viewpoints becomes essential for interpreting people's intentions. In general, sentiment refers to two distinct sorts of thought, either positive or negative, across the many platforms where mass opinion has worth.
For example, internet merchants and food suppliers constantly enhance their services in response to customer feedback. Uber and Pathao, Bangladesh's most popular ride-sharing services, leverage consumer feedback to improve their offerings. However, the difficulty here is traversing the feedback manually, which takes far too much time and effort. Automatic Sentiment Detection (ASD) can resolve this issue by categorizing the sentiment polarity associated with an individual's perspective. This enables more informed decision-making in the context of one's input. Additionally, it may be utilized in various natural language processing applications, such as chatbots [2].
As a result of numerous revolutionary inventions and the persistent efforts of researchers, the area of NLP has arisen. Deep Learning (DL) approaches have been increasingly popular in recent years as processing power has increased and the quantity of freely accessible data on the Web has increased. As word embedding improves the efficiency of neural networks and the performance of deep learning models, it has been used as a foundation layer in a variety of deep learning methods.
Earlier attempts to implement sentiment analysis in Bangla have relied on non-contextualized word embeddings (Word2Vec and fastText), which assign a single static vector to each word without considering the many contexts in which it can occur. However, the recent advent of Bidirectional Encoder Representations from Transformers (BERT) has greatly amplified the contextualization strategy [3]. As the trend shifted toward transformer-based architectures built on attention heads, BERT established itself as the most impressive NLP model, capable of performing superbly on any NLP operation with proper fine-tuning for specific downstream tasks. BERT is a pre-trained state-of-the-art (SOTA) language model that is deeply bidirectional and was trained on a large English Wikipedia corpus [4]. For 104 languages, there is a generic mBERT model [5]. Since mBERT does not perform well on many non-English tasks, researchers have developed language-specific BERT models that perform quite similarly to the original BERT model. Consequently, we employ a dedicated BERT model for Bangla sentiment analysis.
Bangla is spoken by around 250 million people and is the world's fifth most widely spoken language. However, due to a scarcity of resources, pre-trained models such as transformer-based BERT were previously unavailable for Bangla tasks. This issue was addressed by developing a monolingual Bangla BERT model for the Bangla language. To obtain the best possible result for this sentiment analysis dataset, we fine-tuned the Bangla-BERT (https://huggingface.co/Kowsher/bangla-bert (accessed on 1 February 2022)) model, which had been trained on the largest BanglaLM dataset (https://www.kaggle.com/datasets/gakowsher/bangla-language-model-dataset (accessed on 1 February 2022)) [6], and then connected it to a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network.
This research examined an extensive sentiment dataset, including reviews gathered from various internet and social networking sources, covering politics, sports, products, and entertainment. To do this, we first fine-tuned BERT, then used the aggregating layer as the text embedding; finally, we developed a deeply integrated CNN-BiLSTM model for decision-making. We present two kinds of comparison for the proposed work: the first compares word embedding techniques such as Word2Vec, GloVe, and fastText against BERT, and the second compares various machine learning and deep learning algorithms to confirm the superior performance of the hybrid integrated CNN-BiLSTM model. This work will assist merchants in rapidly integrating a classification model into their own systems for tracking customer feedback. The main contributions of this paper are as follows:
• This work builds the hybrid integrated CNN-BiLSTM model and uses it in combination with monolingual BERT to address the issue of sentiment analysis in Bangla;
• We compared Word2Vec, GloVe, fastText, and BERT, demonstrating how the transformer architecture exceeds all prior state-of-the-art approaches and becomes the new state-of-the-art model with proper fine-tuning;
• To do this, we developed a Bangla pre-trained BERT model for transfer learning (Hugging Face: Kowsher/bangla-bert).
The following Section 2 discusses related work. Section 3 presents our proposed methodology, while Sections 4 and 5 discuss the embedding and classification algorithms. We reported the results in Section 6 and concluded with a discussion and recommendations for further study in Section 7.

Related Work
Sentiment Analysis is a well-known problem that involves assessing the polarity of an individual's viewpoint. The SA procedure entails extracting features from a text corpus, building a classifier model, and assessing its performance [7]. This usual procedure has been applied to a variety of sentiment classification tasks, including the categorization of movie reviews [8], online product reviews [9], and Twitter tweets [10]. Akshay et al. [11] developed a model for detecting positive and negative sentiments in restaurant reviews, with a maximum accuracy of 94.5%. Another analysis reported an accuracy of 81.77% for smartphone reviews when the researchers employed SVM as a classifier [12]. Sentiment analysis on Twitter data for the Portuguese language has been described in [13]. Ombabi et al. demonstrated a sentiment classifier for the Arabic language with an accuracy of 90.75%, outperforming the state of the art [14]. The blending of two languages to create a new one is a regular occurrence in NLP; such work has been conducted on vernacular Singaporean English, a product of the coalescence of the Chinese and Malay languages [15]. However, the majority of efforts in sentiment categorization focus on English and other widely spoken languages. The biggest constraint on Bengali sentiment analysis research is a lack of appropriate resources and datasets. Numerous deep learning techniques have been developed for a variety of domains, including microblogs, product reviews, and movie reviews. To classify sentiment polarity in those domains, SVM with maximum entropy [16] and Multinomial Naive Bayes (MNB) [17] have been utilized. Hossain et al. [18] created a Bangla book review dataset, applied machine learning approaches, and found that MNB achieved an accuracy of 88%. A similar study using SVM on a Bangladesh cricket dataset achieved 64.60% accuracy [19]. Sarker et al. suggested a sentiment classifier for Bengali tweets that outperforms n-gram and SentiWordNet feature baselines by 45%.
The sentiment categorization of Bengali film reviews demonstrates a range of performance values across various machine learning techniques; among them, the SVM and LSTM models achieve 88.89% and 82.41% accuracy, respectively [20]. Pre-trained language models have notably become pivotal in a variety of NLP applications since they can leverage massive amounts of unlabeled data to obtain general language representations; ELMo [21], GPT [22], and BERT [4] are a few of the best examples. Among them, the BERT model receives the most attention due to its unmatched bidirectionality and attention mechanism, so researchers are tracking its effect on downstream NLP tasks. Since BERT is trained exclusively in English, researchers create language-specific BERT models to obtain higher precision on their tasks, as it has been demonstrated that language-specific BERT models outperform the generic mBERT model. Recent research has also shown outstanding task performance in sentiment analysis [23,24], attempting to uncover aspects and their related opinions. Numerous researchers from various countries have developed BERT models for their respective languages to evaluate the sentiment analysis task. The Arabic BERT model AraBERT scored 99.44 on its sentiment analysis experiment [25], while the Persian (PersBERT) [26], Dutch (BERTje) [27], and Romanian (RobBERT) [28] models scored 88.12, 93.00, and 80.44 on their corresponding sentiment analysis experiments. Russian (Ru-BERT) [29], Chinese [30], and several other language-specific BERT models have been developed to obtain greater accuracy across all NLP domains, including sentiment analysis. Their authors compare these models' accuracy against the mBERT model and find that their results are significantly higher. This demonstrates that, for sentiment analysis, a monolingual BERT produces the state-of-the-art (SOTA) outcome, surpassing all previous attempts and methods.

Methodology
Though BERT can be used as a feature extraction model, we chose the fine-tuning technique. We have expanded the Bangla-BERT model using two distinct end-to-end deep network layers in this technique: CNN and LSTM.
BERT generates contextualized embedding vectors for each word, which are then fed through two deep network layers, CNN and LSTM, as described in Figure 1. The feature vector is constructed by concatenating the output neurons for each word from the intermediary layer. Each vector is then processed through a densely connected neural network to reduce its dimension, and softmax is used to classify the final reduced vector. Additionally, three further learning algorithms with pre-trained word embeddings were incorporated: Word2Vec, GloVe, and fastText. Word2Vec has proved very effective for sentiment analysis in a variety of languages, especially Bengali [31]. Meanwhile, fastText has gained widespread interest in Bengali text analysis, owing to its use of subword n-grams [32].
Figure 1. Representative mechanism of Bangla-BERT to CNN-BiLSTM in sentiment analysis: BERT accepts tokens for embedding and passes them through the CNN layer to extract information; the LSTM then builds a sequence from the extracted information, after which the FNN makes a decision by calculating loss.
Data gathering and labeling were the initial steps in this classification work. The data acquired from social media were carefully labeled, and a relevant domain expert validated the manual labeling. The data were then subjected to a pre-processing approach that included missing-value removal, noise removal, and spelling correction, as well as feature extraction and dimension reduction. Following that, the dataset was partitioned into training and test segments at a 7:3 ratio. We trained and evaluated the model using supervised learning: the trained model was fed the testing data, and the prediction accuracy was compared against the ground truth. The whole methodology of this work is depicted in Figure 2.
Figure 2. Whole workflow of sentiment analysis: the first phase is data collection and labeling, the second is pre-processing, and the last is decision-making by modeling.

Data Source
We gathered data for the corpus from a range of sources, including internet sites and social networking sites where individuals share their valued opinions. A substantial portion of the data was gathered via Facebook, Twitter, and YouTube comments. Apart from that, online stores have grown to be a significant part of digital marketing. As a result, we also gathered data from online retailer product reviews. Additionally, certain film and book reviews have been included in this corpus. Table 1 presents an overview of the dataset.

Data Collection
We collected a total of 8952 samples from the referenced sources, of which 4325 are positive and the rest are negative. For labeling, ten native speakers annotated the samples using the web annotation tool doccano. Each participant individually annotated 30% of the dataset by assigning positive or negative labels.
We then applied kappa statistics to the annotators' labels and took the majority vote of the native speaker group's labeling. The annotation tool is depicted in Figure 3.
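The agreement step above can be sketched in plain Python. This is a minimal two-annotator form of Cohen's kappa together with a majority vote; the label lists below are invented for illustration and are not the paper's actual annotations.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by most annotators (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same samples."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of samples where both annotators agree.
    po = sum(1 for x, y in zip(a, b) if x == y) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    pe = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in set(a) | set(b))
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

Identical label sequences yield a kappa of 1.0, while chance-level agreement yields 0.0.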

Data Preprocessing
Data preparation is essential in machine learning-based classification, as the model's accuracy is heavily dependent on the quality of the input data [33]. We employ this procedure to prepare data for machine utilization. The next subsection describes the many procedures involved in data preprocessing.

Missing Value Check
We began the data processing phase by addressing the dataset's missing values. We encountered two distinct sorts of missing values: some samples omit data entirely, while others provide less information than required. If all information was absent, we eliminated the sample by erasing the entire row. If there was insufficient information, the value was manually adjusted using a similar observation's value.

Noise Removal
After correcting for missing values, we enhanced the dataset by removing noise from the samples. Non-Bangla letters or characters, meaningless special symbols, and emoticons are all considered noise. Although emoticons can express a wide variety of emotions, we observed that only a small percentage of the data contains emoticons; as a result, the cleaning operation includes their removal. Table 2 illustrates the processing steps with an example.
Table 2. Data pre-processing methods, step by step.
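A minimal sketch of this cleaning step, assuming noise is defined as any character outside the Bangla Unicode block (U+0980 to U+09FF) and whitespace; the sample string and the exact character classes kept are illustrative assumptions, not the paper's exact rules.

```python
import re

# Anything that is not a Bangla character or whitespace is treated as noise
# (this also drops Latin letters, digits, symbols, and emoticons).
NOISE = re.compile(r"[^\u0980-\u09FF\s]")

def clean(text):
    """Strip non-Bangla noise and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", NOISE.sub("", text)).strip()
```

For example, a comment mixing Bangla words with Latin text and an emoji keeps only the Bangla words after cleaning.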


Spelling Correction
Since the data were gathered from many different people, some words may have been mistyped or misspelled. We used the Bangla Academy's available dictionary (AD) database [34] to determine the most suitable form of each word. Given the sentiment corpus SC = {d_1, d_2, d_3, ..., d_n}, where each d_i is a text sample, every word that does not appear in AD is deemed misspelled. The correct word was then obtained from AD and substituted for the incorrect one. Table 2 details the workflow used to process the sample data.
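The dictionary-lookup replacement can be sketched as follows. The toy tokens and vocabulary are hypothetical ASCII stand-ins (the paper uses the Bangla Academy dictionary), and `difflib` string similarity is one plausible way to pick the closest dictionary entry, not necessarily the authors' method.

```python
import difflib

def correct_tokens(tokens, dictionary):
    """Replace any token absent from the dictionary (AD) with its closest
    dictionary entry, when a sufficiently similar entry exists."""
    out = []
    for tok in tokens:
        if tok in dictionary:
            out.append(tok)  # word exists in AD: keep as-is
        else:
            # Otherwise substitute the best fuzzy match, if any.
            match = difflib.get_close_matches(tok, dictionary, n=1, cutoff=0.5)
            out.append(match[0] if match else tok)
    return out
```

Here the misspelled token "gud" is mapped to the dictionary word "good", while in-vocabulary tokens pass through unchanged.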

Feature Extraction
Feature extraction, alternatively referred to as word embedding, represents words in such a way that related terms are translated appropriately [35]. We employed four distinct word extraction approaches in this analysis to examine which word extraction technique performs the best on Bangla language sentiment.
We explored the most commonly used methods for word embedding, including Word2Vec, GloVe, and fastText, as well as the state-of-the-art model BERT. We trained Word2Vec, fastText, and GloVe using the skip-gram model rather than the CBOW model, as skip-gram better represents infrequent words. Section 4 describes the feature extraction techniques in detail.

Encoding Algorithm
We used the preprocessed data to train the word embedding models [36]. We examined the performance of each model independently using a variety of window sizes, vector sizes, and iteration counts over the dataset. The models were built with the Gensim toolkit, an efficient library for typical natural language processing tasks that includes implementations of the Word2Vec, fastText, and GloVe models; to train BERT, we used the Hugging Face open-source tools.

Word2Vec
Word2Vec is an extensively used word embedding method. It uses a neural network to ascertain the semantic similarity of words from their context [37]. Word2Vec implements two inversely related architectures: continuous bag of words (CBOW) and Skip-Gram. Skip-Gram is an unsupervised learning architecture used to discover semantic concepts depending on their context [38]. Given the training words $w_1, w_2, w_3, \ldots, w_N$, Skip-Gram maximizes the average logarithmic probability of Equation (1):

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le m \le c,\, m \ne 0} \log p(w_{n+m} \mid w_n) \quad (1)$$

Here, $c$ denotes the context size, also known as the window size, and $E$ is the embedding size. The probability $p(w_{n+m} \mid w_n)$ is calculated using the softmax of Equation (2):

$$p(w_o \mid w_i) = \frac{\exp({u'}_{w_o}^{\top} u_{w_i})}{\sum_{w=1}^{|V|} \exp({u'}_{w}^{\top} u_{w_i})} \quad (2)$$

Here, $V$ represents the vocabulary, and $u$ and $u'$ signify the 'input' and 'output' vector representations of the words $w_i$ and $w_o$, respectively. CBOW forecasts the target word using the semantic information available in a window of the given text [39]. It makes use of distributed continuous contextual representations: CBOW constructs a static window from a word sequence, and then, using a log-linear classifier trained on upcoming and prior words, the model predicts the window's middle word. The greater the value of Equation (3), the more likely the word $w_t$ will be inferred:

$$\frac{1}{N} \sum_{t=1}^{N} \log p(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) \quad (3)$$

Here, $V$ and $c$ are equivalent to the Skip-Gram model parameters. Figure 4 illustrates both models.
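The softmax of Equation (2) can be illustrated with toy embeddings; the two-dimensional vectors and the word identifiers below are made up purely for demonstration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def skipgram_prob(center, context, in_vecs, out_vecs):
    """p(context | center) = exp(u'_o . u_c) / sum_w exp(u'_w . u_c), Eq. (2)."""
    u_c = in_vecs[center]
    # Unnormalized scores for every word in the output vocabulary.
    scores = {w: math.exp(dot(out_vecs[w], u_c)) for w in out_vecs}
    return scores[context] / sum(scores.values())
```

With an input vector aligned to one output vector and orthogonal to the other, the aligned context word receives the larger probability, and probabilities over the vocabulary sum to 1.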

GloVe
GloVe, or Global Vectors, derives word embeddings from word co-occurrence [40]. The co-occurrence matrix indicates the frequency with which a specific pair of words occurs together. In this matrix, designated $C$, the rows and columns correspond to the vocabulary of words, and each element $C_{ij}$ indicates the frequency with which word $j$ occurs in the context of word $i$. A higher co-occurrence weight results in greater vector similarity.

FastText
FastText is a highly robust algorithm for word embedding that takes advantage of subword information [41]. This model learns the embeddings from the training words' character n-grams. As a result, during the training period, a non-existent word in the vocabulary can be created from its constituent n-grams. This resolves the constraint of Word2Vec and GloVe, which require training to obtain a non-vocab word.
The first weight matrix, $A$, is a look-up table over the words. The word representations are averaged into a text representation, which is then fed to a linear classifier. The text representation is a hidden variable that can potentially be reused. This architecture is identical to the CBOW model of Mikolov et al. [42], except that the middle word is replaced by a label. The softmax activation function $f$ is used to compute the probability distribution over the predefined set of classes, which amounts to minimizing the negative log-likelihood over the classes for a set of $N$ documents:

$$-\frac{1}{N} \sum_{n=1}^{N} y_n \log(f(B A x_n))$$

where $x_n$ is the $n$th document's normalized bag of features, $y_n$ is its label, and $A$ and $B$ are the weight matrices. The model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.
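The subword mechanism described above is easy to sketch: fastText extracts each word's character n-grams after adding the boundary markers `<` and `>`, so even an out-of-vocabulary word decomposes into known pieces.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with fastText-style boundary markers."""
    w = "<" + word + ">"
    return {w[i:i + n] for i in range(len(w) - n + 1)}
```

For example, the word "where" with n = 3 yields the trigrams `<wh`, `whe`, `her`, `ere`, and `re>`, which is the decomposition given in the original fastText paper.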

BERT
BERT is the world's first pre-trained bidirectional and entirely unsupervised language representation approach, having been trained on a massive English Wikipedia corpus [4]. It is an open-source language representation model developed by Google AI. BERT reads texts (or series of words) in both directions, which is superior to a single-direction technique. With fine-tuning, BERT surpasses all other word embedding algorithms, attaining state-of-the-art (SOTA) results in multiple NLP applications. BERT employs the Transformer, an attention mechanism that discovers semantic aspects of words (or sub-words) in a text.
The attention mechanism of the transformer is the core component of BERT. The attention mechanism helps extract the semantic meaning of a term in a sentence that is frequently tied to its surroundings. The context information of a word serves to strengthen its semantic representation [43]. Simultaneously, other terms in the context frequently play multiple roles in expanding semantic representation. An attention mechanism can enhance the semantic representation of the target sentence by evaluating contextual information.
In contrast to prior word embedding approaches, BERT employs two distinct strategies: masked language modeling (MLM) and next sentence prediction (NSP).
The Masked Language Model (MLM) task predicts randomly masked tokens. For this purpose, 15% of the tokens are picked at random; of the selected tokens, 80% are replaced with the special [MASK] token, 10% with a random token, and 10% remain unmodified.
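The 15%/80%/10%/10% masking recipe can be sketched as follows; the token list, replacement vocabulary, and seed are arbitrary illustrations rather than BERT's actual WordPiece machinery.

```python
import random

def mlm_mask(tokens, vocab, seed=0):
    """Select ~15% of tokens; replace 80% of the selected with [MASK],
    10% with a random vocab token, and leave 10% unchanged."""
    rng = random.Random(seed)
    tokens = list(tokens)
    k = max(1, round(0.15 * len(tokens)))      # ~15% of positions
    for i in rng.sample(range(len(tokens)), k):
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"               # 80%: mask token
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)      # 10%: random token
        # else: 10% left unchanged
    return tokens
```

The sequence length is preserved, and at most ~15% of positions differ from the input.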
In the case of the Next Sentence Prediction (NSP) task, the model is fed pairs of sentences and trained to predict whether the second sentence in the pair corresponds to the succeeding sentence in the original text. According to the original BERT research, excluding NSP from pre-training can result in a decrease in the model's performance on specific tasks.
Some research explores the possibilities of leveraging BERT intermediate layers but the most typical is to utilize the last output layer of BERT to boost the efficiency of fine-tuning of BERT.
We compute this sentiment analysis research using a pretrained Bangla-BERT model. This BERT model is comparable to Devlin's [4] suggested BERT model in terms of performance because it was trained on the largest Bangla dataset yet created. This model demonstrates that state-of-the-art results outperform all preceding results.
The key component of this transformer architecture is the BERT encoder. It is based on a feed-forward neural network and an attention mechanism. Multiple encoder blocks are layered on top of one another to form the Encoder. Each encoder block consists of two feed-forward layers and a self-attention layer that operates in both directions [44].
Three phases of processing are performed on the input: tokenization, numericalization, and embedding. Each token is mapped to a unique number in the corpus vocabulary during the tokenization process, which is known as numericalization. Padding is essential to ensure that the lengths of the input sequences in a batch are similar. When data travel through encoder blocks, a matrix of dimensions (Input length) × (Embedding dimension) for a specific input sequence is provided, providing positional information via positional encoding. The Encoder's total N blocks are primarily connected to obtain the output. A specific block is in charge of building relationships between input representations and encoding them in the output.
The structure of the Encoder is based on multi-head attention. It performs $h$ attention computations in parallel, using different weight matrices, and then combines the outcomes [43]. Each of these parallel attention computations produces a head; the subscript $i$ denotes a given head and its associated weight matrices. Once all heads have been computed, they are concatenated, forming a matrix with the dimensions Input_Length × $(h \cdot d_v)$. Finally, a linear layer composed of the weight matrix $W^O$ with dimensions $(h \cdot d_v)$ × Embedding_dimension is applied, resulting in an ultimate output with dimensions Input_Length × Embedding_dimension. In mathematical terms:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where $\mathrm{head}_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$, and $Q$, $K$, and $V$ are placeholders for the input matrices. Each head is defined by three unique projections (matrix multiplications) determined by the weight matrices of the scaled dot-product mechanism.
The per-head weight matrices are $W^Q_i$ and $W^K_i$, each with dimensions $d_{emb} \times d_k$, and $W^V_i$ with dimensions $d_{emb} \times d_v$. The input matrix $X$ is projected individually through these weight matrices to compute each head:

$$Q_i = X W^Q_i, \quad K_i = X W^K_i, \quad V_i = X W^V_i$$

each with input_length rows. We use these $K_i$, $Q_i$, and $V_i$ to calculate the scaled dot-product attention:

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

The dot product of the $Q_i$ and $K_i$ projections assesses the similarity of token projections. Considering $m_i$ and $n_j$ as the $i$th and $j$th tokens' projections via $Q_i$ and $K_i$, their dot product $m_i \cdot n_j$ (Equation (6)) reflects the relationship between the two tokens. Next, for scaling purposes, the resulting matrix is divided elementwise by $\sqrt{d_k}$. Softmax is then applied row by row, so that each row's values lie between 0 and 1 and sum to 1. Finally, this matrix is multiplied by $V_i$ to obtain the head [4].
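The scaled dot-product computation above can be sketched in pure Python for tiny matrices (no batching and no learned projections; the 2×2 inputs in the example are arbitrary).

```python
import math

def softmax(row):
    """Numerically stable softmax over one score row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    # Q K^T scaled by sqrt(d_k): one score per (query, key) pair.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Row-wise softmax turns scores into attention weights summing to 1.
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value rows.
    return [[sum(w * V[j][c] for j, w in enumerate(w_row))
             for c in range(len(V[0]))] for w_row in weights]
```

With identity-like queries, keys, and values, each query attends most strongly to its matching key, and each output row is a convex combination of the value rows.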

Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNNs) are a type of deep feed-forward artificial neural network extensively employed in computer vision problems such as image classification [45]. The CNN was introduced by LeCun in the early 1990s [46]. A CNN is a multilayer network similar to the multilayer perceptron (MLP); because of its unique structure, the architecture allows a CNN to demonstrate translational and rotational invariance [47]. In general, a CNN is made up of one or more convolutional layers with associated weights, pooling layers, and a fully connected layer. The convolutional layer exploits the local correlation of the information to extract features.

Convolution Layer
The convolution layer uses a kernel to compute a dot product (convolution) over each segment of the input data, then adds a bias and passes the result through an activation function to build a feature map for the next layer [48,49]. Suppose the input vector is $x = (x_1, x_2, \ldots, x_n)$, where $n$ is the number of samples. The output values are then calculated using Equation (7):

$$y^{l}_{j} = h\!\left(\sum_{m=1}^{M} w^{l}_{jm}\, x^{l-1}_{j+m-1} + b^{l}_{j}\right) \quad (7)$$

In this case, $l$ is the layer index, $h$ is the activation function used to append nonlinearity to the layer, and $b^l_j$ is the bias term for the $j$th feature map. $M$ specifies the kernel/filter size, while $w^l_{jm}$ specifies the weight for the $j$th feature map and $m$th filter index.

Batch Normalization
The training data are collected batch by batch. As a result, the batch distributions remain nonuniform and unstable, and therefore must be fitted using network parameters in each training cycle, severely delaying model convergence. To solve this issue, a convolutional layer is followed by batch normalization, an adaptive reparameterization approach. The batch normalization approach calculates the mean µ D and variance σ 2 D of each batch of training data before adjusting and scaling the original data to zero-mean and unity-variance.
Additionally, a scale weight $\gamma$ and shift bias $\beta$ are applied to the normalized data $\hat{x}_i$ to restore its expressive capacity. The calculations are given by Equations (8)-(11):

$$\mu_D = \frac{1}{|D|} \sum_{i=1}^{|D|} x_i \quad (8)$$

$$\sigma^2_D = \frac{1}{|D|} \sum_{i=1}^{|D|} (x_i - \mu_D)^2 \quad (9)$$

$$\hat{x}_i = \frac{x_i - \mu_D}{\sqrt{\sigma^2_D + \epsilon}} \quad (10)$$

$$y_i = \gamma \hat{x}_i + \beta \quad (11)$$

This reparameterization of the batch normalization approach substantially simplifies the coordination of updates across layers in the neural network.
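Equations (8)-(11) can be sketched for a single batch of scalars; the γ, β, and ε values below are illustrative defaults, not the trained parameters.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean / unit variance, then scale and shift
    (Equations (8)-(11))."""
    n = len(batch)
    mu = sum(batch) / n                               # Eq. (8): batch mean
    var = sum((x - mu) ** 2 for x in batch) / n       # Eq. (9): batch variance
    # Eq. (10) normalize, Eq. (11) scale and shift.
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in batch]
```

With γ = 1 and β = 0, the output batch has mean approximately 0 and variance approximately 1 (exactly 1 up to the ε smoothing term).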

Max Pooling Layer
The sub-sampling layer is another name for the pooling layer. The proposed method employs a 1D max-pooling layer following the 1D convolutional layer and batch normalization layer, which performs a downsampling operation on the features to reduce their size [48]. It collects small chunks of the data and produces a single output for each chunk. This can be performed in a variety of ways; in this study, the max-pooling approach is used, which selects the largest value in a set of neighboring inputs. The pooling of a feature map within a layer is defined by Equation (12) [49]:

$$p_j = \max_{1 \le r \le R} a_{(j-1)T + r} \quad (12)$$

where the pooling window size is denoted by $R$ and the pooling stride by $T$. Following several convolutional and max-pooling layers, the obtained features are flattened into a single one-dimensional vector for classification. The classification layers are fully connected, with each classification label corresponding to a single output node. CNNs need fewer experimental parameter values and fewer preprocessing and pre-training steps than other approaches, such as deep feed-forward neural networks [50]. As a result, the CNN is a very appealing framework for deep learning.
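Equation (12) amounts to taking the maximum over each pooling window; the window size, stride, and input values below are illustrative.

```python
def max_pool1d(x, size=2, stride=2):
    """1D max pooling (Equation (12)): the largest value in each window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]
```

With a window of 2 and stride 2, the sequence `[1, 3, 2, 5, 4, 6]` is downsampled to `[3, 5, 6]`, halving the feature length.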

Bidirectional Long Short-Term Memory Model
Since deep learning is the most advanced sort of machine learning accessible today, there is an increasing range of neural network models available for use in real-world settings. A successful deep learning method was used in this study to illustrate its unique and exciting problem-solving capabilities. Because of its memory-oriented characteristics, it is known as long short-term memory.
The Bi-LSTM is a deep learning algorithm that analyzes data quickly and extracts the critical characteristics required for prediction. This method is an extension of the Recurrent Neural Network (RNN) methodology. To tackle the "vanishing gradient" problem of the older RNN structure, researchers devised the LSTM network structure [51]. The LSTM cell has an input gate, an output gate, a forget gate, and a memory unit [52]. Figure 5 shows the architecture of the gates.
These gates are denoted as the forget gate ($f_t$), input gate ($i_t$), and output gate ($o_t$) at time $t$. For a given input $x_t$ and hidden state $h_{t-1}$, the LSTM computations are as follows. The forget gate in the memory block structure is controlled by a one-layer neural network, and its activation is determined by Equation (13):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (13)$$

where $x_t$ represents the input sequence, $h_{t-1}$ the previous block output, $C_{t-1}$ the previous LSTM block memory, and $b_f$ the bias vector; $\sigma$ indicates the logistic sigmoid function, and $W$ signifies a separate weight matrix for each input. The input gate generates fresh memory using a basic neural network with the tanh activation function and the prior memory block's influence; these operations are computed using Equations (14) and (15) [53]:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (14)$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (15)$$

Long-term dependencies can be captured by deliberately constructing and remembering long-term information, which is the default behavior of LSTM in practice. The one-way LSTM relies only on previous data, which is not always sufficient. The Bi-LSTM analyzes data in two directions: its hidden layer holds two values [54], one used in the forward computation and the other in the backward computation. These two values jointly define the Bi-LSTM's final output, which tends to improve prediction performance [55].
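The gate equations above can be sketched for a scalar LSTM cell; the weight and bias dictionaries below are toy values, not trained parameters, and a real implementation operates on vectors and matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step (Equations (13)-(15) plus the output gate), scalar
    version for clarity. W and b hold per-gate parameters keyed 'f','i','o','c'."""
    z = lambda g: W[g][0] * x + W[g][1] * h_prev + b[g]
    f = sigmoid(z('f'))            # forget gate, Eq. (13)
    i = sigmoid(z('i'))            # input gate, Eq. (14)
    o = sigmoid(z('o'))            # output gate
    c_tilde = math.tanh(z('c'))    # candidate memory
    c = f * c_prev + i * c_tilde   # new cell state, Eq. (15)
    h = o * math.tanh(c)           # new hidden state
    return h, c
```

With all weights and biases zero, every gate evaluates to 0.5 and the candidate memory to 0, so the cell state simply halves at each step.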

One-Dimensional CNN-BiLSTM Proposed Method
The one-dimensional CNN (1D CNN) is the same as the classic 2D CNN, except that the convolution operation is conducted along only one dimension, resulting in the deep architecture shown in Figure 1. Hence, it can easily be trained on an ordinary CPU or even on embedded development boards [56]. The convolution operation builds meaningful hierarchical features for classification from a dataset. The dimension of the output features after a 1D convolution is given by

$$x = \frac{w - f + 2p}{s} + 1$$

where $x$ represents the output dimension, $w$ the size of the input features, $f$ the size of the filter used for convolution, $p$ the padding (values added to the border before conducting convolution), and $s$ the stride (the distance the filter moves between convolution operations).
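The output-size formula above can be checked with a tiny helper. This is a hypothetical illustration (the function name is ours), using integer division as frameworks typically do:

```python
def conv1d_output_size(w, f, p=0, s=1):
    """Output length of a 1D convolution: x = (w - f + 2p) / s + 1."""
    return (w - f + 2 * p) // s + 1

# For example, a sequence of length 100 with a filter of size 5,
# padding 2, and stride 1 keeps its length ("same" padding):
# conv1d_output_size(100, 5, 2, 1) -> 100
```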
Because one-dimensional convolution is a linear operation, it cannot by itself classify nonlinear data. Most real-world datasets are nonlinear, so a nonlinear operation, an activation function, is required after convolution. The most commonly used activation functions are the sigmoid, the hyperbolic tangent, the rectified linear unit (ReLU), and the exponential linear unit (ELU).
The suggested CNN architecture uses the ELU activation function, which is easy to implement and allows faster processing. Furthermore, ELU addresses some of the issues with ReLU while retaining its favorable aspects, and it does not suffer from vanishing or exploding gradients. Finally, the whole method is integrated with BERT: BERT passes its embedding layer to the CNN-BiLSTM for decision-making, as described in Figure 1.
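For reference, the ELU used in the proposed CNN can be written down directly; the sketch below is a hypothetical NumPy definition (not framework code), shown next to ReLU for comparison:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha*(exp(x) - 1) for x <= 0.
    # The smooth, bounded negative branch keeps mean activations
    # closer to zero than ReLU's hard cutoff at 0.
    return np.where(x > 0, x, alpha * np.expm1(x))
```

Unlike ReLU, ELU has nonzero gradient for negative inputs, which is why it avoids the "dying ReLU" problem mentioned above.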

Experiment
According to some studies, fine-tuning mBERT with a plain text classification head yields lower results than a comparable proposed architecture, since aggregated weights and restricted data limit performance [57]. Another study reports that, when a classification algorithm is fine-tuned together with BERT, the results improve over the original BERT fine-tuning approach [58].
The proposed model is a hybrid of BERT and CNN-LSTM. We use the BERT output as the LSTM's input, and the LSTM layer then extracts features from the representations BERT has produced. CNN is connected as the following layer. Because BERT serves as the sentence encoder, this state-of-the-art model can acquire precise semantic information, which the integrated CNN then uses for text classification. The architecture of Bangla-BERT for sentiment analysis is depicted in Figure 1.
To examine various characteristics of the suggested method, we combined four embedding techniques with thirteen classification methods. The next section includes tables summarizing the classification models' performance. The model with the highest scores has been made publicly available.

Prediction and Performance Analysis
We developed the complete project in Python 3.7. We used Google Colaboratory for GPU support because the dataset was substantially larger than average and required the implementation of various deep learning architectures. scikit-learn and Keras (with the TensorFlow backend) were used as the machine learning and DNN frameworks, respectively. Additionally, we included another machine learning framework, Impact Learning.
After training, we tuned critical hyperparameters to obtain a fine-tuned model with comparable assessment metrics, and we evaluated the models on the test set. For each estimator we assessed accuracy, precision, recall, F1 score, Cohen's kappa, and ROC AUC. The findings are summarized in Tables 3–6.
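The six metrics reported in Tables 3–6 can all be computed with scikit-learn, which the text names as the machine learning framework used. The helper below is a hypothetical sketch (the function name and threshold are our assumptions, not the authors' code) for one binary classifier's predicted probabilities:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

def evaluate_binary(y_true, y_prob, threshold=0.5):
    """Compute the six reported metrics from labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        # ROC AUC is threshold-free, so it takes the raw probabilities.
        "roc_auc": roc_auc_score(y_true, y_prob),
    }
```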
Table 3 presents the outcome of Word2Vec embedding with each classification technique. As shown there, the CNN-BiLSTM algorithm achieves the highest accuracy of 84.93%, while the SVM method achieves the second-highest accuracy of 83.83%. ANN performed the worst, with an accuracy of 76.84%. Because CNN-BiLSTM also leads in F1 score, it is the best choice when Word2Vec embedding is used.
In Table 4, we used fastText for embedding with all the algorithms used earlier for classification. As seen in the table, CNN-BiLSTM has the highest accuracy of 88.35%, while LDA has the second-highest accuracy of 86.38%. The Naive Bayes classifier performed the worst, with an accuracy of 78.16%. With an F1 score of 85.97%, Impact Learning is the best match when fastText embedding is used.
In Table 5, we used GloVe for embedding with all previous classification methods. As can be seen from the table, the CNN-BiLSTM method is once again the winner with an accuracy of 84.53%, followed by Decision Trees with an accuracy of 82.93%. With an accuracy of 74.93%, Logistic Regression produced the lowest performance.

As illustrated in Table 6, we implemented Bangla-BERT for embedding with all previous algorithms for sentiment classification. As can be seen from the table, the CNN-BiLSTM approach wins with an accuracy of 94.15%, which exceeds all previous scores for other embedding methods and places this model above all previous models. SVM comes in second with an accuracy of 92.83%. Naïve Bayes performed the worst, with an accuracy of 87.81%.
Word2Vec and GloVe with LSTM classification perform almost identically in the task; fastText, however, improves on them by 4% with an 88.35% score using the Impact Learning method, and fine-tuned Bangla-BERT with the LSTM classification model outperforms them all. These experimental findings indicate that fine-tuning Bangla-BERT with LSTM and CNN yields a substantial improvement over the other embedding approaches, and the size of the margin reveals possibilities for subsequent advancement.

Table 8 reports the classification score for each dataset. Among the traditional techniques, SVM beats RF except on the ABSA cricket dataset, where RF outperforms SVM by 2.6%. Deep learning models provide consistent improvements: Skip-Gram (Word2Vec) and GloVe embeddings with CNN perform comparably well, while fastText beats CNN on three datasets and outperforms it overall. Transformer-based models such as multilingual BERT gradually surpass fastText, although with a 0.6% reduction on the ABSA cricket dataset. The proposed Bangla-BERT outperforms all prior results except on the BengFastText dataset. Bangla-BERT achieves the highest average F1 score of all techniques, establishing this transformer-based model as the state-of-the-art approach compared to all other methods.

Conclusions and Future Work
This paper compares and contrasts various machine learning and deep learning algorithms for classifying Bangla text by sentiment. We have demonstrated how transfer learning, the new revolution in natural language processing, can surpass all previous architectures, and that transformer models such as BERT, with proper fine-tuning, can play a crucial role in sentiment analysis. A CNN architecture was also developed for this classification task, and a reliable pre-trained model was prepared for ease of use and made accessible as a Python open-source package. We discovered that combining Bangla-BERT and LSTM leads to an accuracy of 94.15%, and that the LSTM-based classifier gives the most significant overall result across all four word embedding systems.

Because deep learning requires a rather large amount of data, we will continue to expand the dataset. We worked with an unbalanced dataset, and a well-balanced dataset improves efficiency significantly, so we plan to apply the proposed deep learning approach to a more enriched and balanced dataset in the future. Additionally, we want to provide a Python API compatible with any web framework, and to use a dedicated word extraction algorithm to analyze the truth-word extraction system for Bangla topic classification. Finally, we offer an approach for assessing the performance of the proposed model in real-world applications.