Roman Urdu Sentiment Analysis Using Transfer Learning

Abstract: Numerous studies have been conducted to meet the growing need for analytic tools capable of processing the increasing amounts of textual data available online, and sentiment analysis (SA) has emerged as a frontrunner in this field. Current studies focus on the English language, while minority languages, such as Roman Urdu, are ignored because of their complex syntax and lexical varieties. In recent years, deep neural networks have become the standard in this field. Despite their early success, the full potential of DL models for text SA has not yet been explored. For sentiment analysis, CNNs have surpassed other models in accuracy, although they still have some imperfections. First, CNNs need a significant amount of data to train. Second, a CNN presumes that all words have the same impact on the polarity of a statement. To fill these voids, this study proposes a CNN with an attention mechanism and transfer learning to improve SA performance. Compared to state-of-the-art methods, our proposed model appears to achieve greater classification accuracy in experiments.


Introduction
In the subject of natural language processing (NLP), sentiment analysis is one of the most well-known subfields due to its prominence in text classification [1]. User reviews and complaints have become increasingly commonplace since the advent of social media and other user-centric platforms [2]. Automatically sorting text documents into those with the desired polarity is another consideration in the field of sentiment classification [3]. In terms of polarity, it can be either positive or negative, or even neutral. In addition to its more general meaning, sentiment analysis can be thought of as a subfield of text classification that focuses on the examination of how readers feel about and react to various things and their characteristics through the words they use to describe them. Businesses, governments, and scholars have all discovered that sentiment analysis is a useful tool for understanding public opinion, improving decision-making, and gaining corporate insight [4,5].
Academics have focused much of their research on SA. Compared to English, several languages, including Hindi/Roman Urdu, are regarded as resource-poor and receive less academic attention because fewer resources are available to conduct SA on them [6]. In addition to being Pakistan's national language, Urdu is spoken by around 200 million people and written by another 100 million [3]. The Urdu language is challenging to learn and master because of its complicated alphabet, convoluted morphological structure, hazy word boundaries, and tricky stemming [7]. Moreover, unlike other languages, Urdu lacks resources for language processing, such as stemming and lemmatization tools and stop word lists. Roman Urdu is created by writing Urdu words in the Latin (Roman) script.

• We propose a CNN with an attention mechanism for Roman Urdu SA.

• We conducted a series of experiments to investigate the model's sensitivity and find the hyper-parameter settings that maximize classification accuracy.

• Transfer learning is applied to improve performance on small datasets.
The rest of the paper is divided into five sections. In Section 2, we will discuss the literature. The proposed methodology is laid out in great depth in Section 3. Section 4 explains the complete setup of the dataset, experimentations, and comparison of the results. In Section 5, the results and discussions are presented. Section 6 finally concludes the findings.

Literature Review
Various ML models have been widely employed in the past for performing sentiment analysis. In this article, we will analyze the research on DL methods such as DNN and attention-based mechanisms. Then, we will provide a quick summary of the research undertaken so far on Roman Urdu sentiment analysis.

Deep Learning-Based Sentiment Analysis
CNN, RNNs, RAEs, and DBNs are only a few of the DNNs that have been frequently employed for sentiment categorization [26] as a response to the limitations of standard machine learning methods. After applying the GRU to the recursive model and the tree structure of the LSTM, Kuta et al. [27] proposed a tree structure gated RNN. In addition, Tai et al. [28] performed sentiment analysis using an LSTM network with several complex units. They conducted more tests on a bidirectional LSTM with two layers and achieved encouraging results. Kim et al. [29] pursued a related line of questioning by evaluating a single-layer CNN extensively. They used Word2Vec vectors to train the models.
Multi-channel representation and variable-sized filters both produced the same outcomes. Zhang et al. [30] introduced an improved character-level CNN for text classification. Using word embeddings, Yin and Schutze [31] presented a multi-channel variable-size convolutional neural network. Kalchbrenner et al. [32] developed a dynamic CNN with k-max pooling; their model correctly processed input phrases of varying lengths and accurately captured both short-term and long-term dependencies. Words and phrases can be represented by a matrix and a vector in a tree structure, as proposed by Socher et al. [33]. Socher et al. [34] proposed the Recursive Neural Tensor Network (RNTN), which employs tensors as compositional matrices for all tree nodes. Using convolutional and recursive neural networks, Sadr et al. [35] proposed a method for sentiment classification; they employed multi-view learning to categorize the intermediate features generated by the two networks [36]. Chen et al. [37] combined Long Short-Term Memory with a conditional random field layer (BiLSTM-CRF) to improve sentence-level sentiment analysis; simulation results on benchmark datasets showed that their system improved phrase-level sentiment analysis. Hassan et al. [38] recommended using a CNN and an LSTM to infer polarity from IMDB movie reviews; this method's CNN classifier had two convolutional and two pooling layers. Shen et al. [25] used a CNN and a bidirectional Long Short-Term Memory (BLSTM) to determine movie review polarity; the CNN-LSTM model beat the standalone CNN and LSTM classifiers. Kamyab et al. [39] presented a CNN-LSTM attention-based model (named ACL-SA), in which the CNN's max-pooling extracts contextual features and reduces feature dimensionality while the LSTM captures long-term dependencies. Dashtipour et al. [40] used two deep learning algorithms, CNN and LSTM, with an SVM for Persian movie reviews. Kastrati et al.
[41] collected COVID-19-related comments in the Albanian language. They incorporated the attention mechanism with the DL model to capture the semantic and global context of words for SA.
DNN has produced significant results in SA; however, there is always room for improvement.

Attention-Based Network for Sentiment Analysis
Despite considerable breakthroughs in sentiment analysis, deep neural networks' performance is inadequate [26,42]. One of their typical problems is that they treat all words in phrases identically and cannot focus on critical sections [43]. The attention mechanism, with its proven ability to comprehend text effectively, has lately been utilized in a wide range of NLP applications, most notably sentiment analysis, to address this deficiency. Humans have a visual attention mechanism that helps them prioritize the information in a sentence that is most important to remember rather than encoding the entire thing. With text classification in mind, Yang et al. [19] tweaked RNN by introducing a weight to fulfill the attention function. In addition, Wang et al. [44] presented an attention-based LSTM Model that could concentrate on different sentence segments. Yuan et al. [45] suggested a domain attention model for multi-domain sentiment classification. To pick the best features relevant to each domain, their suggested model prioritized representations of those domains.
A unique approach for sentiment analysis was developed by Long et al. [24], which uses eye-tracking data to learn about the viewer's mental state. To identify the positive and negative emotions in texts, Deng et al. [46] used a sparse self-attention method to record the significance of each word. It was proposed by Gan et al. [27] to use a sparse attention-based separable dilated convolutional neural network, wherein the properties of a particular target entity are used to identify sentiment-oriented components. An organizational framework for classifying emotions and extracting topics was proposed by Pergola et al. [6]. Without using aspect-level annotation, their model extracts aspect-sentiment clusters. The authors Liang et al. [8] presented a sequence-to-sequence selective attention-based approach for the summarization of social text. They immediately optimized the ROUGE score using reinforcement learning and cross-entropy, and they employed a special gate to filter out the irrelevant data.
Despite the apparent success of applying attention mechanisms to DNNs, relatively few studies have investigated their effects on CNNs for use in sentiment analysis.

Transfer Learning
A lack of training data is another issue for DNNs. To correctly train the model, a DNN requires a huge amount of training data; the more data that are added, the better a model will perform [45]. Transfer learning emerged due to a shortage of labeled training data. Transfer learning is utilized when the training data are too small [47]. The model is initially trained on a large dataset in the source domain before being transferred to the target domain to boost its overall performance. Transfer learning is widely used for image processing; however, sentiment analysis in NLP is limited [36]. Krizhevsky and Lee [48] studied the transferability of low-level neuronal layers across tasks. A previous study [49] examined the results of migrating deep neural network hidden layers from a larger training dataset to a smaller test dataset. Few research studies have been conducted on the possible benefits of transfer learning for sentiment analysis [19].
Tu and Zhao (2019) [50] created a model to collect source and target domain information simultaneously; it consists of a cloze task network and a convolutional hierarchical attention network to classify sentiment. The authors of [28,51] introduced a method for cross-domain sentiment classification based on HAGANs: training a generator and a discriminator alternately produces a document representation that is both sentiment- and domain-indistinguishable. An approach for cross-domain text categorization was presented by Wang et al. (2020) [52], who combined two non-negative matrix tri-factorizations into a joint optimization framework with approximation constraints on the word cluster matrices and cluster association matrices. Unsupervised domain adaptation (UDA) uses labeled source domain knowledge to understand unlabeled target domains [53]. By learning transferable representations from a well-labeled source domain to a comparable but unlabeled target domain, deep domain adaptation approaches have attained attractive performance [54].

Roman Urdu Sentiment Analysis
In a recent lexical normalization study for Roman Urdu [55], a phonetic technique named TERUN (Transliteration-based Encoding for Roman Urdu text Normalization) was proposed. To standardize lexically flexible terminology, TERUN utilizes Roman Hindi/Urdu linguistics. A transliteration-based encoder, a filter module, and a hash code ranker are its three interconnected components. The encoder generates every possible hash code for a single Romanized Hindi/Urdu word, the filter dismisses irrelevant codes, and the ranker prioritizes the remaining hash codes by relevance. The purpose of that study is twofold: to standardize and to categorize text.
For Roman Urdu sentiment analysis, a new term-weighting method termed discriminative feature spamming strategy (DFST) was recently presented [56]. DFST works well for sentiment analysis because it uses a term utility criterion (TUC) to pick out different words and then spams them to make them more distinctive. Experiments were run on 11,000 Roman Urdu reviews, and a specialized tokenizer was built to improve accuracy. Ayesha et al. [57] examined Roman-Urdu product reviews for SA. They experimented with Naive Bayes, Support Vector Machines, and Logistic Regression, all of which are machine learning methods. It was clear from the outcomes that SVM is a better classifier than the others. Noor et al. [58] have suggested SVM for this service. They experimented with several SVM kernels after collecting 20,000 customer reviews of products available on "Daraz.pk". The bag-of-word features and linear SVM kernel are relied upon to arrive at the claimed 60% accuracy.
A small selection of Roman Urdu reviews, including only 150 positive and 150 negative ratings, was used by Bilal et al. [59] to test their theories in 2016. The KNN, DT, and NB classifiers for machine learning were all implemented in WEKA. For all four metrics (accuracy, precision, recall, and F-measure), NB was deemed to be the clear winner over DT and KNN. In an extra effort to perform Roman Urdu SA, reviews from various websites were collected for a binary classification assignment. Ten alternative classifiers (SVM, KNN, Decision Tree, Passive Aggressive, Ensemble classifier, Perceptron, SSGD, Naive Bayes, Ridge classifier, and nearest centroid) were compared in an empirical study by Arif et al. [60]. They conducted experiments using Chi-squared, IG, and MI feature selection algorithms and three different feature representation systems (TF, TF-IDF, and Hashing vectorizer). Support vector machines fared best in an empirical comparison of several ML classifiers undertaken by the authors. Naqvi et al. [61] looked at the issue of categorizing Roman Urdu news using various methods, including Naive Bayes, Logistic Regression, Long Short-Term Memory, and Convolutional Neural Networks. The purpose of the research was to divide the news into five separate types (health, business, technology, sports, and international). When standardizing the vocabulary and cleaning up the corpus, they used a phonetic algorithm. According to the findings, Naive Bayes is the most effective classifier. Recently, Chandio et al. [62] have compared the effectiveness of their proposed SVM to that of other ML algorithms for evaluating Roman Urdu sentiment.
Researchers have used deep neural networks to classify sentiments in Roman Urdu texts. With the support of other scholars, Mehmood et al. [63] have constructed a new Roman Urdu corpus RUSA-19 that includes 10,012 reviews. The ratings come from different social media sites, and they span six categories: (i) media (television, film, and radio); (ii) food; (iii) politics and policy; (iv) mobile technology; and (v) sports. They evaluated a new dataset and proposed an RCNN-based deep network for sentiment analysis, and the proposed model was constructed using both recurrent and convolutional neural network (CNN) layers. However, numerous studies have pointed out that CNN does better with nonsequential data than with sequential data. Ghulam et al. [64] examined two deep-learning approaches for sentiment analysis in Roman Urdu to learn more about the language's performance in this area. In practice, the LSTM model and several alternative ML classifiers were explored. In a surprising turn of events, it was determined that the deep learning model outperforms its machine learning counterparts. Some recent studies [64] used a multi-channel hybrid method, proposing a strategy based on BiGRU and word embeddings. Their dataset was also insufficient, totaling only 3241 sentiment evaluations, and it is not yet available to the public. Rizwan et al. [65] studied the problem of identifying anti-Christian bigotry in Roman Urdu by employing static CNN and other word embedding methods.
Chandio et al. [62] suggested a Roman Urdu stemmer-powered SVM: each user review is stemmed after preprocessing, a bag-of-words model transforms the input text into a feature vector, and the SVM classifies the user sentiment. Chandio [66] proposed a deep recurrent architecture, RU-BiLSTM, for sentiment analysis in Roman Urdu; tested on the RUECD and RUSA-19 datasets, it beat baseline models in many respects, improving by 6% to 8%. Azhar and Seemab [67] found that Huggingface's transformer models DistilBERT and XLNet outperform Logistic Regression and Naïve Bayes; DistilBERT obtained 100% accuracy on their Roman Urdu dataset with only two epochs of fine-tuning, while XLNet reached 86% accuracy with four epochs. Khan et al. [68] analyzed word embeddings for Roman Urdu and English using a CNN-LSTM with typical machine learning classifiers, where the LSTM preserves long-term dependencies and a one-layer CNN extracts local features; the CNN and LSTM feature maps are supplied to machine learning classifiers for final classification, supported by word embedding models, among which Word2Vec performed best on the RUSA-19 and UCL datasets. The authors of [73] suggested an ensemble framework to provide more clarity on sentiment and emotion in text. Table 1 provides a summary of the state-of-the-art literature on Roman Urdu sentiment analysis. After reviewing the existing research and examining the complications encountered, we settled on proposing a CNN that combines the benefits of transfer learning with the attention mechanism via a layer of dedicated attention neurons. In contrast to earlier research, the proposed model uses an attention mechanism following the convolutional layer to extract meaningful words from phrases. To improve performance, the proposed model makes use of transfer learning, which involves applying what is learned in one domain to other, unexplored domains.

Methodology
In this paper, we build a DL model for Roman Urdu/Hindi sentiment analysis that integrates the Word2Vec, GloVe, TF-IDF, and FastText word embedding algorithms into a CNN architecture. CNNs have a reputation as promising tools for NLP. Additionally, they allow nearby input elements to interact at lower layers while remote elements interact at higher layers, enabling the network to regulate the length of the dependencies it captures. Multiple convolutional layers can thus be used to generate an abstract, hierarchical representation of the input text. A pooling layer is then employed to distill the most salient information; however, local features that hold useful information are lost in pooling. To this end, the proposed model adds an attention layer prior to the pooling layer, not only to identify the significant characteristics but also to dampen the impact of less relevant aspects and aid the pooling layer in discovering the important features in context. While it is true that more training data usually results in better CNN performance, we also investigate how transfer learning affects the proposed approach.

This section briefly explains the steps of the proposed methodology.

CNN Stocked with Attention Layer and Transfer Learning
In the proposed model, we positioned the attention layer before the pooling layer in the four-layer model. Our proposed model begins with an input matrix assembled by the collaboration of word vectors derived from the input texts. After the input matrix is processed through the convolutional filters, it is possible to derive feature maps. The training concludes by merging feature maps with similar filter widths. After the CNN retrieves its features, new phrase vectors are created by extracting informative words and assigning them a higher weight. Word vectors are used as input to a fully connected network. The following is a more mathematically detailed deduction of each layer.

Representation Layer
When providing data to a CNN, the input must be in the form of a sentence matrix, where each row is a word vector. If the word vector has d dimensions and the phrase has s words, the sentence matrix, with zero padding before the first and after the last word, has dimensions s × d. Because the padding is zero, the number of times each word appears in the receptive field during convolution is constant regardless of where the word appears in the sentence. The sentence matrix can thus be denoted A ∈ R^(s×d). The input matrix was created using the Word2Vec [78], GloVe [79], and FastText [80] word embedding methods.
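As a minimal illustration of this layer, the sketch below builds a padded sentence matrix from toy word vectors. The vocabulary and random vectors are invented for illustration only; in the paper the vectors come from Word2Vec/GloVe/FastText with d = 200.

```python
import numpy as np

# Hypothetical toy vocabulary; vectors are random stand-ins for
# pre-trained Word2Vec / GloVe / FastText embeddings.
rng = np.random.default_rng(0)
d = 4                                   # word-vector dimension (200 in the paper)
vocab = {"yeh": 0, "film": 1, "acha": 2, "hai": 3}
embeddings = rng.normal(size=(len(vocab), d))

def sentence_matrix(tokens, pad=1):
    """Stack word vectors into a sentence matrix with zero padding
    before the first and after the last word."""
    rows = [embeddings[vocab[t]] for t in tokens]
    zero = np.zeros((pad, d))
    return np.vstack([zero, np.array(rows), zero])

A = sentence_matrix(["yeh", "film", "acha", "hai"])
print(A.shape)  # (6, 4): 4 words plus one zero-padding row on each side
```

The zero rows ensure that border words enter the convolutional receptive field as often as interior words.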

Convolutional Operation
To generate these novel features, a convolutional operation must be applied to the text matrix. Since the order of words in a sentence has a significant impact on the message it conveys, picking a filter width equal to the word-vector dimension (d) makes the most sense, so that each filter spans whole words. The only remaining filter parameter is the height (h), also known as the region size.
When a convolution filter H ∈ R^(h×d) is applied to the sentence matrix A ∈ R^(s×d), each h-row sub-matrix K[i : i+h−1] yields a new feature. Repeatedly applying the convolution over the sentence gives the output sequence O ∈ R^(s−h+1) (Equation (1)):

o_i = H · K[i : i+h−1],  i = 1, …, s − h + 1 (1)

Here, · is the dot product of the convolution filter's matrix and the input sub-matrix. Each o_i is then given an activation function f and a bias term b ∈ R, producing the feature map F ∈ R^(s−h+1) (Equation (2)):

F_i = f(o_i + b) (2)
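A minimal NumPy sketch of this operation, with toy dimensions and random matrices standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
s, d, h = 7, 4, 3                      # sentence length, embedding dim, region size
A = rng.normal(size=(s, d))            # sentence matrix
H = rng.normal(size=(h, d))            # one convolution filter of height h
b = 0.1                                # bias term

def relu(x):
    return np.maximum(x, 0.0)

# Slide the filter over all h-word windows: o_i = H . A[i : i+h-1]
O = np.array([np.sum(H * A[i:i + h]) for i in range(s - h + 1)])
F = relu(O + b)                        # feature map of length s - h + 1
print(F.shape)  # (5,)
```

Each filter thus maps the s × d sentence matrix to a feature map of length s − h + 1.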

Attention Mechanism
Not all words in a sentence contribute equally to its meaning, and because the pooling layer of a CNN loses local features, there must be a way to emphasize words that have a greater effect on the meaning of a sentence once context and word interactions are taken into account.
The attention mechanism operates on the feature maps produced by filters of the same size in the convolutional layer. Suppose M distinct region widths are considered and m distinct filters are used for each width. A filter H_ij ∈ R^(h_i×d), where i = 1, 2, …, M and j = 1, 2, …, m, is applied to the sentence matrix to produce m feature maps per region width. Concatenating the feature maps from filters of the same size yields a new sentence matrix X_i ∈ R^(n×m) (Equation (3)):

X_i = [F_i1, F_i2, …, F_im] (3)

Each cell in this matrix represents a feature extracted from the input using filters of that size; n represents the number of words.
It is the function of the attention mechanism to assign each row a relative importance so that the relevant information can be retrieved from the text. To accomplish this, we first feed the new matrix X_i into a single-layer perceptron with parameters W ∈ R^(m×d), which yields the hidden representation U_i ∈ R^((n−h_i+1)×d) of the matrix (Equation (4)):

U_i = tanh(X_i W) (4)
To find the normalized importance weight a_i ∈ R^((n−h_i+1)×1), we compare each word's importance to its similarity to the context vector u ∈ R^(d×1) (Equation (5)):

a_i = softmax(U_i u) (5)

As in memory networks [23,33], the context vector u can be seen as a high-level representation of the informative words.
In particular, u is initialized at 0 and learned during training, so that initially each row in X_i is assigned the same weight. We then obtain a different representation X̃_i of X_i by multiplying each element of a_i by the corresponding row of the X_i matrix, where ⊙ denotes the element-wise product (Equation (6)):

X̃_i = a_i ⊙ X_i (6)
When the attention mechanism is applied to X_i, a new representation is created in which the informative words are precisely identified. To summarize: the X_i matrix is created by concatenating feature maps with identical filter sizes; a single-layer perceptron generates the hidden representation U_i; the correlation between U_i and the context vector u, which is adjusted during training, determines the normalized importance weight a_i that indicates the significance of each word; and multiplying each element of a_i by the corresponding row of X_i yields the new representation. Using the attention mechanism typically results in the extraction of more informative words.
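The attention computation described above can be sketched in NumPy as follows. Dimensions are toy values, and the random W and u stand in for learned parameters (the paper's exact shapes and any bias terms may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d = 5, 3, 4          # rows (words), filters of one size, attention dim
X = rng.normal(size=(n, m))            # feature matrix from same-width filters
W = rng.normal(size=(m, d))            # single-layer perceptron weights
u = rng.normal(size=(d,))              # context vector (learned; 0-initialized in the paper)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

U = np.tanh(X @ W)                     # hidden representation, Eq. (4)-style
a = softmax(U @ u)                     # normalized importance weights, Eq. (5)-style
X_att = a[:, None] * X                 # element-wise reweighting of rows, Eq. (6)-style
```

With u = 0 at initialization, U @ u is a zero vector, so the softmax assigns every row the same weight, matching the description in the text.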

Pooling and Classification
Though distinct feature maps are generated with different filter sizes, the classifier requires fixed-size vectors, so a pooling function is applied. Several methods can be employed, including minimum, maximum, and average pooling. The function of the pooling layer is to reduce the number of dimensions by encapsulating the most salient aspect of each feature map. The pooled output features of all filters are concatenated to form a feature vector O_i. Next, a fully connected Softmax layer receives the feature vector as input for the final classification; that is, Softmax is used to calculate the probability distribution across all sentiment categories (Equation (7)).
Using cross-entropy as the loss function, we measure how the model's predicted distribution P̂_i(C) differs from the true sentiment distribution P_i(C) (Equation (8)):

L = −∑_{i=1}^{T} ∑_{C} P_i(C) log P̂_i(C) (8)

where T is the number of samples and C ranges over the sentiment categories. The model is trained end-to-end using Stochastic Gradient Descent (SGD).
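The pooling-and-classification step can be sketched as below: one scalar is pooled per feature map, the pooled vector passes through a softmax layer, and the cross-entropy loss is computed for a single example. The random weight matrix and two-class setup are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three feature maps of different lengths (from filters of sizes 3, 4, 5, say)
feature_maps = [rng.normal(size=(5,)), rng.normal(size=(4,)), rng.normal(size=(3,))]

# Max pooling: one scalar per feature map -> fixed-size vector regardless of s
O = np.array([f.max() for f in feature_maps])

W_o = rng.normal(size=(len(O), 2))     # softmax-layer weights, 2 classes (pos/neg)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

P = softmax(O @ W_o)                   # predicted class distribution, Eq. (7)-style
true = np.array([1.0, 0.0])            # one-hot ground-truth label
loss = -np.sum(true * np.log(P))       # cross-entropy for this sample, Eq. (8)-style
```

Max pooling is shown here because the experiments later reference max-pooling behavior; swapping in `f.mean()` or `f.min()` gives the other pooling variants mentioned above.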

Transfer Learning
Transfer learning lays an emphasis on encoding and retaining knowledge in order to successfully apply what has been learned in one domain to a new, albeit related, domain. Let D_s and D_t represent the source and target domains, and T_s and T_t the source and target tasks, respectively. Furthermore, x_si ∈ X_s and x_ti ∈ X_t designate the i-th data point in the source and target domains, whereas y_si ∈ Y_s and y_ti ∈ Y_t are the corresponding i-th labels. Knowledge from D_s is used to improve the target prediction function f_t on D_t via transfer learning. Figure 2 shows the transfer learning procedure used to train the proposed model. Following the process flow depicted in Figure 1, the proposed model is trained on the source domain before being exported for use in the target domain. There are two distinct steps involved in transferring the trained model. In the first scenario, the proposed model receives no training in the target domain.
This means that in the first phase (Figure 2), the model is trained on a generic domain and then tested on the target domain, whereas in the second phase (Figure 3), the model is additionally trained on the target domain to receive specific, relevant, and timely information. In the second process, the model is thus trained on both the source and target domains and can leverage this cross-training to improve classification accuracy. At last, this newly trained model is tested on the target domain.
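The two-phase procedure can be sketched with a tiny logistic-regression "model" standing in for the full CNN (all data here are random placeholders; the point is only the order of training, not the classifier):

```python
import numpy as np

rng = np.random.default_rng(4)

def sgd_step(w, X, y, lr=0.1):
    """One gradient step of logistic regression (a stand-in for the CNN)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return w - lr * X.T @ (p - y) / len(y)

# Source domain: large labeled set; target domain: small labeled set.
Xs, ys = rng.normal(size=(200, 8)), rng.integers(0, 2, 200).astype(float)
Xt, yt = rng.normal(size=(20, 8)), rng.integers(0, 2, 20).astype(float)

w = np.zeros(8)
for _ in range(50):                    # phase 1: train on the source domain
    w = sgd_step(w, Xs, ys)
w_transferred = w.copy()               # exported model (phase-1 weights)
for _ in range(20):                    # phase 2: fine-tune on the target domain
    w = sgd_step(w, Xt, yt)
```

The phase-1 weights initialize phase 2 instead of random or zero weights, which is what lets the small target dataset benefit from the larger source dataset.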


Experiments
This section includes the description of the experiments carried out to examine the effectiveness of the transfer learning model for SA. The experiments also include the measure of the accuracy of feature maps on various datasets. The sections that follow elaborate on the dataset, procedures, evaluation methods, and evaluation metrics.

Data Set
To correctly assess the performance of the proposed model, standard Roman Urdu sentiment analysis datasets were preferred. The available datasets are RUSA-19 (https://github. We divided the datasets into two categories, i.e., the source and the target domain. This division is based on the size of the dataset: small datasets are not sufficient to train a model, so they are chosen as the target domain for transfer learning. The details of the datasets, along with the number of samples, are presented in Table 2.

Evaluation Measures
To evaluate the efficacy of our classifier, we compute accuracy, precision, recall, and F1-score.
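These four measures follow directly from the confusion-matrix counts, as in the short sketch below (the counts are made-up example values):

```python
def prf(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = prf(tp=40, fp=10, fn=5, tn=45)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))  # 0.85 0.8 0.889 0.842
```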

Model Configuration
The Python TensorFlow Keras DL library was used in our experiments. In addition, scikit-learn was used to randomly divide the data into training and testing sets. The training data required pre-processing, for which several state-of-the-art tools were utilized: Stanford CoreNLP for tokenization, and the Word2Vec [78], GloVe [79], and FastText [80] models to form word embeddings. In the training of the word embeddings, the vector dimension was set to 200 and the window size to 3. The word embedding vectors were updated using ADADELTA with a learning rate of 0.01. The filter size was treated as a hyper-parameter; filter sizes of 3, 4, and 5 with 128 filters provided encouraging results. The dropout rate after the convolution layer was set to 0.5 for regularization. Softmax was used as the output activation function because it converges faster than other activation functions in vogue. The batch size was set to 32, and the model was trained for up to 50 epochs. After each epoch, the validation accuracy was calculated; if it was found to be dropping, training was stopped early.
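The stopping rule and hyper-parameter choices described above can be summarized as follows. The dictionary keys and function name are illustrative, not the study's actual code:

```python
def should_stop(val_accuracy_history):
    """Early-stopping rule sketched from the text: stop as soon as
    validation accuracy drops relative to the previous epoch."""
    return (len(val_accuracy_history) >= 2
            and val_accuracy_history[-1] < val_accuracy_history[-2])

# Hyper-parameters as reported in the paper (key names are illustrative):
config = {
    "embedding_dim": 200, "window_size": 3,
    "filter_sizes": (3, 4, 5), "num_filters": 128,
    "dropout": 0.5, "batch_size": 32, "max_epochs": 50,
    "optimizer": "ADADELTA", "learning_rate": 0.01,
}

assert not should_stop([0.70, 0.75])       # accuracy still rising: continue
assert should_stop([0.70, 0.75, 0.74])     # accuracy dropped: stop
```

In practice this corresponds to Keras-style early stopping monitored on validation accuracy with zero patience.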

Results and Discussions
The proposed model was tested in five discrete configurations to achieve the best possible outcome.
CNN-A with Random Input Vector (RI-CNNA): As input, it employs randomly initialized vectors.
CNN-A with Static Input Vector (SI-CNNA): As input, it takes pretrained word vectors obtained from Word2Vec; the weights are not updated iteratively during training. To create a foundation for fair comparison between the proposed model and other ML and DL techniques, the proposed model was only trained on the target datasets. The output was measured in terms of accuracy, precision, recall, and F1-score metrics. A detailed comparison of the outputs for each test is presented in Tables 3 and 4.
It is evident that the proposed model, in its various forms, outperforms the state-of-the-art by a small margin and is thus a viable option for transfer learning. The use of randomly initialized vectors as input may be why RI-CNNA has the worst classification performance among the presented model variants on both target datasets. The use of pre-trained vectors that address the semantic sparsity problem is linked to improved performance across a number of variants. Further, as SI-CNNA has the second-lowest classification accuracy after RI-CNNA, it is evident that updating word vectors during training can yield greater performance regardless of whether the word vectors were previously trained. In conclusion, 4C-CNNA achieves a classification accuracy of 0.853 on the RUSA-19 dataset and 0.903 on the RUSAD dataset. As seen in the results, the proposed model, along with its distinct configurations, performs better than the existing techniques and therefore appears to provide better performance in the case of transfer learning.

Effect of Transfer Learning
One of the disadvantages of CNNs is their variable number of hyper-parameters, which requires practitioners to define the precise model architecture. Considering that the values of the hyper-parameters have a substantial impact on the performance of a DNN, we chose to tune the proposed model's hyper-parameters on the source domains and then apply the optimal parameters to the target domains while employing transfer learning. The previously mentioned foundation for fair comparison was utilized in the transfer learning experiments. Then, 10% of the training data, along with ten-fold cross-validation, was utilized as the test set. Each experiment was repeated five times with identical parameters to achieve stable and accurate results. As can be seen in the figure, as the number of epochs is raised, classification accuracy increases while the loss value decreases. Moreover, there was a point of stability in the validation loss plot, and the validation loss was only slightly lower than the training loss. In light of this, we can say that the model has been appropriately adjusted and provides a decent fit.

All of the parameters were kept constant except the filter size to emphasize the effects of filter size in the experimentation process. Consequently, filter sizes (4, 5, 6) demonstrated a successful outcome for all datasets. Similarly, all of the parameters were kept constant except for the number of filters, to emphasize the effects of the number of filters in some experiments. During experimentation, the number of filters was set to 300 on DRU and RUDS, along with a dropout rate of 0.6, to achieve the highest classification accuracy. However, the optimal number of filters to achieve maximal accuracy on DRU was 128, along with a dropout rate of 0.5. A number of activation functions were utilized to achieve the finest outcomes, including SoftMax, ReLU, SoftPlus, and Tanh, but ReLU surpassed the performance of all other activation functions due to its faster convergence rate. Figure 5a shows the accuracy of the source dataset with different activation functions. Figure 5b shows the change in accuracy with different dropout rates.
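Holding all other parameters fixed while varying one at a time is a standard grid search. The sketch below illustrates the procedure; the `train_eval` callback is a hypothetical stand-in for a full training-and-validation run, and the candidate values mirror those reported in the text:

```python
from itertools import product

def grid_search(train_eval, filter_size_sets, dropout_rates):
    """Try every combination and keep the configuration with the best
    validation accuracy. `train_eval(filter_sizes, dropout) -> accuracy`."""
    best_acc, best_cfg = -1.0, None
    for fs, dr in product(filter_size_sets, dropout_rates):
        acc = train_eval(fs, dr)
        if acc > best_acc:
            best_acc, best_cfg = acc, (fs, dr)
    return best_acc, best_cfg

# Candidate values as reported in the text
FILTER_SETS = [(3, 4, 5), (4, 5, 6)]
DROPOUTS = [0.5, 0.6]
```

The same loop extends naturally to the number of filters and the choice of activation function by adding further axes to the `product` call.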
Tables 5-8 show the performance of the proposed model in comparison to the baseline ML/DL models.
To compare the results of the learning process (Figure 2), the proposed model was first employed for sentiment analysis without prior training on the target domain. In the second phase, using the above-mentioned tuned configurations, the proposed model was incrementally trained on the target domain. The comparison of the two phases of the transfer learning process, along with the accuracy, is shown in Table 8.
As is clear from the results, the accuracy of the proposed model is lower when it is used directly on the target domain without training. This signifies that source domain knowledge alone is not enough to achieve high accuracy on the target domain. In the second phase, the performance and accuracy of the proposed model increased significantly with the use of incremental training on the target domain.
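The two-phase procedure (pretrain on the source domain, then incrementally train on the target domain) can be sketched with any Keras model. The toy dense network below is only a stand-in for the attention-based CNN, and the epoch counts are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Sequential

def pretrain_then_finetune(source, target, num_classes=2,
                           source_epochs=5, target_epochs=5):
    x_src, y_src = source
    x_tgt, y_tgt = target
    model = Sequential([
        layers.Input(shape=(x_src.shape[1],)),
        layers.Dense(16, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Phase 1: learn general sentiment features on the large source domain.
    model.fit(x_src, y_src, epochs=source_epochs, verbose=0)
    # Phase 2: incremental training on the small target domain reuses
    # (and adapts) the weights learned in phase 1.
    model.fit(x_tgt, y_tgt, epochs=target_epochs, verbose=0)
    return model
```

Skipping the second `fit` call corresponds to the first experimental phase (direct application without target-domain training), which the results show yields lower accuracy.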
The sizable range of the source domains, along with fine word embeddings, also supported learning more contextual information. All of these factors combined resulted in the proposed model remarkably improving classification accuracy. The proposed model shows different classification accuracies across source domains, which indicates that, in the transfer learning process, choosing a source domain corresponding to the target domain is an essential task. Some datasets have highly unmatched classification variables, which results in lower accuracy. However, the size and vocabulary metrics of either the source or target dataset can also affect performance. We also concatenated the source datasets to train the model and evaluated it on the target datasets. Table 9 presents the accuracy obtained on the target datasets using transfer learning from the concatenated source dataset.
Table 9. Accuracy of target datasets with transfer learning from training the model on concatenated source datasets.

Target Dataset    Accuracy
RUSA-19           0.972
RUSAD             0.963

As is obvious, the proposed model achieved higher accuracy on the target datasets when trained on the merged source-domain datasets. The reason for this higher accuracy could be that merging different datasets increases the similarity between the source and target domains.
We compared the proposed model's performance on the target datasets. 4C-CNNAT achieved higher accuracy on the target datasets than the state-of-the-art models. We trained the proposed model on the source datasets, transferred the learning to each target dataset, and compared the accuracies. The major finding is that, when employed directly on the target domain without training, the proposed model's accuracy is reduced; source domain knowledge alone is insufficient to achieve high accuracy on the target domain. With the application of incremental training on the target domain in the second phase, the performance and accuracy of the model improved dramatically.

Conclusions
The goal of SA is to learn about the public's sentiments by analyzing data collected from various social sources. In Roman Urdu, SA is challenging because of semantic and syntactic restrictions and the interdependence of terms within the input sentence. Consequently, this research was conducted to establish a mechanism for determining people's feelings towards a particular matter based on their Roman Urdu reviews. This study's intended audience is the inhabitants of the subcontinent, and Urdu written in Roman script was the target language. This study contributes in two ways. First, with sentiment categorization in mind, a CNN with an attention mechanism before the pooling layer was proposed. This neural network assigns weight to the most crucial parts of sentences and uses context to determine the polarity of words. Second, since a shortage of training data is widely recognized as one of deep learning's biggest obstacles, we investigated the impact of transfer learning and the sensitivity of the proposed model's parameters. The empirical results showed that the proposed attention-based CNN significantly outperformed state-of-the-art alternatives, and classification accuracy was further improved using transfer learning. This work paves the way for additional research to further enhance sentiment categorization.