Empowering Propaganda Detection in Resource-Restraint Languages: A Transformer-Based Framework for Classifying Hindi News Articles

Abstract: Misinformation, fake news, and various propaganda techniques are increasingly used in digital media. Propaganda is challenging to uncover because it works systematically to influence other individuals toward predetermined ends. While significant research has been reported on propaganda identification and classification in resource-rich languages such as English, much less effort has been made in resource-deprived languages like Hindi. The spread of propaganda in the Hindi news media has motivated our attempt to devise an approach for the propaganda categorization of Hindi news articles. The unavailability of the necessary language tools makes propaganda classification in Hindi more challenging. This study proposes the effective use of deep learning and transformer-based approaches for Hindi computational propaganda classification. To address the lack of pretrained word embeddings in Hindi, Hindi Word2vec embeddings were created using the H-Prop-News corpus for feature extraction. Subsequently, three deep learning models, i.e., CNN (convolutional neural network), LSTM (long short-term memory), and Bi-LSTM (bidirectional long short-term memory), and four transformer-based models, i.e., multi-lingual BERT, Distil-BERT, Hindi-BERT, and Hindi-TPU-Electra, were experimented with. The experimental outcomes indicate that the multi-lingual BERT and Hindi-BERT models provide the best performance, with the highest F1 score of 84% on the test data. These results support the efficacy of the proposed solution and indicate its appropriateness for propaganda classification.


Introduction
The dictionary defines propaganda as "information, ideas, opinions, or images that give one part of an argument, which are broadcast, published to influence people's opinions". Propaganda is a well-studied sociological field. It is deliberately devised to influence the opinions and actions of people toward predefined goals. In 1937, social scientists, opinion leaders, historians, educators, and journalists founded an organization called the Institute for Propaganda Analysis (IPA). This organization was established to spread awareness among American citizens about political propaganda. The IPA defined propaganda as "an expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends". One of the foundational classifications of propaganda techniques originated in the 1930s. The 1936 United States presidential campaign, contested by the incumbent President Franklin D. Roosevelt and Alf Landon, garnered scholarly interest in the linguistic strategies employed by both candidates. In 1937, Clyde R. Miller put forward a significant classification of propaganda. The set of seven techniques, initially introduced by the Institute for Propaganda Analysis in 1938, continues to be widely acknowledged and embraced in current discourse. The seven techniques identified by the Institute for Propaganda Analysis as indicating the use of propaganda are name-calling, glittering generalities, transfer, testimonials, plain folks, card stacking, and bandwagon [1].
In recent times, propaganda has been used by terrorist organizations for recruitment, by political parties during elections, and even by marketing agencies. Propaganda spreads through various techniques and is loaded with emotional appeals, falsification, and misinformation. Propaganda techniques are seen to be influencing the political discourse [2,3] and the spread of extremism [4–7]. During the COVID-19 pandemic, many misinformation and propaganda articles were also observed. According to [8], any biased message, intentional or unintentional, can be termed propaganda. The social ramifications of propaganda are manifold, ranging from the manipulation of elections [9,10] to health [11]. It is also observed that the propaganda phenomenon is not limited to specific demographic regions or languages.
In recent years, digital news media has grown tremendously and public opinion manipulation through propaganda has reached new levels. News articles in digital media have a propaganda continuum ranging from neutral to biased [1]. Every reader of a news piece should be conscious that it invariably reflects, at least partially, the bias of the news organization that published it as well as that of the author. It is challenging to determine the precise nature of this bias, though. The author may be unaware of his/her prejudice. Alternatively, the article could be an instrument in the author's toolkit to convince readers of a particular viewpoint. Such a situation is an example of propaganda. The efficacy of propaganda is maximized when it operates covertly. When an individual engages with a journalistic piece, whether it is presented through a formal news channel or an informal one such as a blog or social media platform, they are unlikely to readily discern it as propagandistic. Under such circumstances, the reader becomes unwittingly exposed to propagandistic content, potentially leading to a modification of the reader's viewpoints. In light of the diverse range of news sources available, comprising tabloids, print media, and digital platforms, as well as varying degrees of objectivity and bias, this study proposes that the development of an automated tool capable of identifying propagandistic fragments could prove advantageous for both news readers and news agencies. Propaganda spreads through various techniques and is loaded with emotional appeals, falsification, and misinformation. Such hidden techniques require extensive and in-depth analysis. To address this infodemic of disinformation, it is necessary to develop techniques for propaganda detection and analysis. Although the propaganda detection and analysis problem has caught the attention of researchers in recent years, most of the work has been reported for the English language. Less work in this area has been reported for low-resource languages [12,13].
In our effort to tackle propaganda detection, a framework for propaganda classification in Hindi was proposed. Hindi is India's primary language and the fourth most spoken language worldwide. Propaganda detection research in English has produced excellent results, whereas research on Hindi propaganda is still at an early stage. Hindi is a language short of various linguistic resources; hence, it is termed a resource-restraint language (RRL). Many Hindi news websites have come up in the last decade and have adopted Hindi as their content language, like amarujala.com, bhaskar.com, jagran.com, zeenews.india.com/hindi.com, lokmatnews.in, etc. Hindi has become a noteworthy web content language. It has also become vital to analyze Hindi news articles for propaganda and gain valuable insights and relevant information. The technological advances and abundant availability of online news articles in the Hindi Devanagari script make it a stimulating research area.

Background
The term "low-resource text processing" pertains to the utilization of natural language processing (NLP) methodologies, including text categorization, automated translators, sentiment detection, and information extraction, within contexts characterized by limited resources, such as insufficient training data, computational capacity, or domain-specific expertise. In low-resource situations, the availability of computational resources is often constrained, hence affecting the efficiency of running complex NLP models. The utilization of resource-restraint languages presents unique challenges in domain-specific situations, such as propaganda news, characterized by highly specialized language. The field of low-resource text processing is currently a subject of extensive research due to its profound implications for enhancing the accessibility and utility of NLP technology across various languages and domains.
The research in text mining for Asian low-resource languages has gained attention over the last decade. The availability of native language keyboards and one's affinity to indigenous languages has increased the use of various languages on the Web. With the advent of natural language processing techniques, researchers are increasingly addressing problems related to indigenous languages. However, Hindi is a resource-restraint language for which significantly less research has been conducted. The nuances of the Hindi language pose significant challenges to researchers dealing with Hindi text processing. Some of the challenges are as follows:
1. Lack of Hindi text corpora: Text corpora for the Hindi language are scarce and lack gold standards. The currently available resources are either in the developmental stage or lack authentication. Hence, obtaining a large, high-quality annotated corpus is challenging.
2. Lack of language resources: Linguistic resources such as parsers, morphological analyzers, part-of-speech taggers, lexicons, and WordNets for Hindi are inadequate and need improvement.

3. Code-mixing: Multilingual users often mix two languages, a practice known as code-mixing. Hinglish is the combination of Hindi and English and is prevalent in writing on Hindi digital platforms, for example, "स्टेट बैंक ऑफ इंडिया देश का सबसे बड़ा बैंक है। जो समय-समय पर अपने 40 करोड़ से ज्यादा अकाउंट होल्डर्स के लिए स्कीम लेकर आता रहता है। इस बार एसबीआई ने अपने योनो ऐप के थ्रू फ्री में इनकम टैक्स फाइलिंग करने की सुविधा दी है।" (State Bank of India is the largest bank in the country. This keeps coming up with schemes for its more-than-40-crore account holders from time to time. This time, the SBI has given the facility of filing income tax for free through its YONO app.) The above example was taken from the H-Prop-News dataset used in this study and shows the use of Hinglish in Hindi news articles. Generating a pure Hindi dataset is time-consuming and tedious, which remains a significant challenge.

4. Spelling and morphological variations: In Hindi, a lot of morphological information, such as gender, tense, and person, is encoded within words. Dealing with these morphological variations is a major challenge. In addition, the same word can be spelled differently in Hindi, making it very complex to integrate these terms into lexical resources.
This research aims to address the historical lack of NLP technologies for the Hindi language, and particularly the subject of propaganda classification, which has traditionally received limited attention.

Motivation
The propagation of propaganda techniques in mainstream media has inspired us to devise an approach to propaganda detection in Hindi news articles. From a technical perspective, tools that take advantage of deep learning and transfer learning techniques are accessible to varying degrees. However, there are substantial limitations owing to the language of the analysis or the available datasets. It is often difficult to obtain satisfactory performance across languages, particularly for low-resource languages such as Hindi. Much of the earlier work on propaganda classification is centered on resource-rich languages such as English and, recently, Arabic. To our knowledge, minimal work on propaganda detection in Hindi has been reported, and hence this problem is worthy of research. The two substantial research gaps found in the literature survey are as follows:

• Limited analysis of techniques effective in striving against propaganda;
• Constrained research on identifying propaganda in Hindi news articles.
This research studies, analyzes, and proposes a deep learning and transformer-based approach to classify Hindi news articles with opposing polarities, such as propaganda and non-propaganda.

Contributions
To create a propaganda classification tool in Hindi, the Hindi computational propaganda dataset H-Prop-News [14] was used. The main objective of this study was to present a simple, yet compelling, solution for propaganda classification using deep learning and a transformer-based approach. The significant contributions of this study are as follows:
• Experiments with three deep learning models using Word2vec embeddings trained on our corpus;

Related Work
The research community has recently shown an interest in exploring textual propaganda detection and classification. The rise of fake news has drawn interest in its use for propaganda. Prominent areas in which propaganda has been studied include terrorism and politics [1]. Various systems have been developed for propaganda detection in news texts, such as Proppy [15], Prta [16], and PROpaganda Text dEteCTion (PROTECT) [17]. In another study, true news was identified from a corpus of trusted, satire, hoax, and propaganda (TSHP) news [18] using n-gram features and a maximum-entropy classifier.
In a recent work, the authors of [19] addressed the issue of propaganda detection in code-switched social media text. They created a corpus of English and Roman Urdu code-switched text and obtained the best results using a fine-tuned XLM-RoBERTa model. The authors of [20] compared two approaches, i.e., BERT and SVM, to detect pro-Kremlin propaganda in social media. They concluded that both approaches are insufficient against automatically generated news. The authors of [21] created the HQP dataset by manually annotating 30,000 English tweets related to the Russia-Ukraine war. They indicate that pretrained models may introduce biases in downstream tasks and should be deployed with caution in practice. The authors of [22] worked on a bi-lingual dataset of tweets related to Smart City in Urdu and English and reported the best results using a fine-tuned RoBERTa model.

Deep Learning Models
Deep learning models have been effectively used for text classification tasks in the last few years. For the task of propaganda classification, acquiring significant features and representations of textual data is crucial. Deep learning models have proficiency in acquiring such intricate hierarchical representations of data. Deep learning methods can make use of pre-trained representations, such as word embeddings (e.g., Word2Vec, GloVe), to extract generic linguistic knowledge from massive text corpora. This pre-training considerably improves model performance by providing a strong starting point for the propaganda text classification task. Context is vital in the propaganda classification task for resolving ambiguity, understanding nuances in language, and recalling prior words or sentences. Deep learning models can efficiently gather and leverage such context-sensitive data. Hence, deep learning models have shown promising results in propaganda classification. Convolutional networks have been widely used in various studies [3,23–25]. LSTM and Bi-LSTM models were used in [6,23,24,26]. In their work, Gavrilenko et al. [3] used CNN, LSTM, and Hierarchical-LSTM (H-LSTM) to identify propaganda in the text files from the corpus of the Internet Research Agency (IRA). They achieved 88.2% accuracy using the CNN model. Nizzoli et al. [6] worked on ISIS propaganda tweets using CNN and an RCNN, and reported an F1 score of 0.9. The authors of [23,24,26] performed sentence-level propaganda detection on a Proppy corpus using an ensemble of deep learning models. Hashemi and Hall [25] studied visual propaganda in images shared by violent extremist organizations (VEOs) on Twitter. They used an eight-layered deep CNN architecture, known as AlexNet, and reported an accuracy of 97.02%.

Transformer-Based Models
In recent years, transformer-based models have shown encouraging results in text classification and natural language processing. In NLP4IF'19, the best-performing approach was reported using the BERT model with hyperparameter tuning [27]. Other studies have used variations of BERT-based models for propaganda classification tasks, such as the context-dependent BERT model [28] and cost-sensitive BERT [29]. Researchers have also utilized several kinds of features for propaganda classification, such as EmoFeat emotion word features [25]; linguistic, layout, and topical features [24]; linguistic inquiry and word count (LIWC) features [28]; and linguistic style and complexity features [26]. The details of various deep learning and transfer learning models used for propaganda classification by various authors are listed in Table 1.

Propaganda Classification in Low-Resource Languages
Propaganda is a global phenomenon; however, most propaganda classification systems are centered on English, and little advancement has been seen in low-resource languages. The details of different propaganda-related datasets used previously are shown in Table 2. In the recent work by Kausar et al. [2], ProSOUL, a framework for propaganda detection in online Urdu content, was proposed. The authors used an Urdu news article dataset and generated a linguistic inquiry and word count (LIWC) dictionary in the Urdu language. The BERT model, along with NELA, word n-gram, and character n-gram features, outperformed the other approaches with an accuracy of 0.91. Recently, a significant contribution toward detecting propaganda techniques in Arabic came from the shared task in the WANLP workshop in 2022. The dataset developed by the authors [30] contained 3200 Arabic tweets from various news sources and was labeled with 20 propaganda techniques. Sameer et al. [31] obtained the best results using the AraBERT and Marefa-NER models. Mittal and Nakov [32] used XLM-R and a multigranularity network with an mBERT encoder, whereas [33] relied on AraBERT to show the best results. Other studies on the shared task [34–37] have also shown promising results. Some researchers translated English news text datasets into low-resource languages to address resource scarcity. For example, Mapes et al. [27] translated English news into Bulgarian to create a news toxicity detector. In their work, Alam et al. [30] developed a fake news dataset by translating an English news text. Other studies have used Google Translate to generate the LIWC dictionary in Dutch [38] and Filipino [39]. A similar translation approach was used by Chaudhari et al. [14] to generate the H-Prop dataset in Hindi by translating the Proppy corpus. However, translation introduces errors that can significantly affect classification tasks.

Propaganda-Related Datasets
Low-resource text processing is a major hurdle owing to the lack of standard and public datasets. TSHP-17 [18] is the earliest dataset seen to address propaganda classification tasks. In [18], propaganda was considered a class of fake news analysis. The QProp corpus in [15] contains 51,246 articles from news sources labeled using the distant supervision technique. The Proppy dataset [16] was collected from 13 propaganda news outlets and 36 non-propaganda news feeds. This dataset was labeled manually to obtain the text fragments from news articles using 18 propagandistic techniques. The dataset was further utilized for the MNLP shared task for fine-grained propaganda classification. In their work, S. Kausar et al. [13] created three datasets and a framework called ProSOUL for propaganda detection in Urdu. The first dataset was created by translating the QProp corpus into Urdu, and two other corpora, Humkinar-Web and Humkinar-News, were obtained from Urdu webpages and Urdu news articles. The Humkinar-Web and Humkinar-News datasets were labeled using the ProSOUL framework. An Arabic propaganda dataset was developed by [30] and released as part of the WANLP shared task for fine-grained propaganda classification in Arabic. This dataset contains 3200 tweets from leading Arabic news sources. Chaudhari et al. [14] created the H-Prop and H-Prop-News datasets in their work. The H-Prop dataset was created by translating QProp.
We can conclude from the discussion of previous work that significant work has been carried out on propaganda detection and classification in English news texts and, recently, Arabic texts. However, little work has been conducted on propaganda classification tasks in low-resource languages. Propaganda classification research in the Hindi language is limited compared to English and Arabic. The peculiarities of Hindi and the limited linguistic resources may be the main reasons. Notably, no significant results have been reported for propaganda classification in Hindi other than in [14]. Building on previous work [14], this research proposes classifying propaganda news text in Hindi using deep learning and transformer-based models. In addition, Word2vec embeddings developed using Hindi content were utilized.

Problem Statement
Given a paragraph of a news article, the system performs binary classification to label the news article as propagandist or not.For simplicity, the propaganda classification can be considered a text classification problem where we map the input sequence of tokens T in news articles to one of the n labels.First, a composition function f was applied to the sequence of tokens to create word embeddings vw such that w ∈ T. The output of this function f is the vector v that serves as input to deep learning models.The sigmoid activation function in the output layer generates estimated probabilities for the output label y.
The transformer-based models have an encoder with multi-head attention capable of learning multiple representation features from the given text. Based on the pretrained parameters, the proposed models were fine-tuned on Hindi news text. The output probability of the models can be calculated as follows:

$$P_x = \mathrm{softmax}(W V + b)$$

where V is the classification vector output by the transformer model, and W and b are the weights and biases, respectively. P_x is the output probability. Furthermore, the classification loss can be calculated using the cross-entropy function as defined below.

$$\mathrm{CrossEntropy}(y, P) = -\sum_{i=1}^{N} y_i \log(P_i) = -\log(P_x)$$

The proposed approach of this study encompasses three main steps: preprocessing, word embedding generation, and classification using deep learning and transformer-based models. Figure 1 shows an overview of the steps involved in the methodology. The details of the dataset, preprocessing, and model architectures are elaborated in the following sections.
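The classification head and loss described above can be illustrated with a minimal sketch. It assumes the output probability is a softmax over a linear layer applied to the classification vector V, which is a common choice for fine-tuned BERT-style classifiers but is an assumption here; the toy values of V, W, and b and the helper names are hypothetical.

```python
import math

def linear(W, V, b):
    """Compute W·V + b for a weight matrix W (one row per class)."""
    return [sum(w * v for w, v in zip(row, V)) + bias for row, bias in zip(W, b)]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, probs):
    """-sum_i y_i * log(P_i); with a one-hot y this reduces to -log(P_x)."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)

# Toy classification vector and head parameters (illustrative values only).
V = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3],   # weights for class 0
     [-0.3, 0.1, 0.0]]  # weights for class 1
b = [0.0, 0.1]

P = softmax(linear(W, V, b))    # class probabilities
loss = cross_entropy([1, 0], P)  # true label is class 0
print(P, loss)
```

Because the label vector is one-hot, only the probability of the true class contributes to the sum, which is why the loss collapses to the negative log of that single probability.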

Dataset
For this study, our previously introduced Hindi news article dataset H-Prop-News was used. This dataset was generated by collecting news articles from notable Hindi news websites. The dataset was released for the binary propaganda classification task. Each news article was labeled propaganda (1) or non-propaganda (−1). The dataset contains 5500 Hindi news articles, along with the news website source, URL, and headline of each article. The H-Prop-News dataset is available in three partitions: development, training, and testing. The details of the H-Prop-News dataset are shown in Table 3.
The H-Prop-News dataset was analyzed to understand the distribution of news article sources. Figure 2 shows the news source-wise article distribution in the H-Prop-News dataset. The dataset contained a maximum of 604 articles from Amar Ujala (amarujala.com) and the fewest from India TV. The category-wise article distribution shown in Figure 3 indicates that most propaganda articles were sourced from patrika.com, and most non-propaganda articles came from amarujala.com. A word cloud of the most frequently appearing keywords, as shown in Figure 4, was created to find the top words appearing in the dataset.

Data Preprocessing
The news articles in the corpus need the following preprocessing steps before further utilization.

1. URL and mentions removal: Some news articles contain URLs and tweet mentions. These were identified using regular expressions and replaced by a space. The URLs were removed using regular expression patterns like "http://" or "https://". A simple regular expression pattern "@\w+" was used to find the mentions. This expression identifies tweet mentions by finding the strings starting with "@" followed by one or more word characters.
2. Stop word and non-Hindi word removal: Stop words are frequently occurring words in the text that carry little meaning on their own. Some examples of Hindi stop words are "और" (and), "की" (of), "है" (is), and so on. A predefined list of Hindi stop words was used to remove the stop words from the text. Removing these words helps in reducing the noise and focusing on the content-carrying words. To remove non-Hindi words, a language detection library, langdetect, was used and the words belonging to English were removed. Additionally, all punctuation marks were removed.
3. Tokenization: Tokenization was performed using the indicnlp tokenizer, which is a specialized tokenizer for Indian languages. It is tailored to handle the linguistic characteristics and requirements of languages spoken in India, which may differ from tokenizers used for languages with different linguistic structures. The indicnlp tokenizer tokenizes text based on punctuation boundaries. This means it breaks text into tokens wherever it finds punctuation marks like periods, commas, question marks, exclamation marks, and so on. Punctuation marks act as usual boundaries between words or subword units in Hindi. Tokenization based on punctuation boundaries is a direct approach and works well for Hindi.
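The preprocessing steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline used in the study: a Devanagari Unicode range check stands in for langdetect, whitespace splitting stands in for the indicnlp tokenizer, and the three-word stop list is only the examples quoted in the text. The function name `preprocess` and the sample sentence are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # URLs starting with http:// or https://
MENTION_RE = re.compile(r"@\w+")       # tweet mentions such as @user
DANDA_RE = re.compile(r"[\u0964\u0965]")  # Hindi danda and double danda
# Assumption: dropping every character outside the Devanagari block
# (U+0900 to U+097F) approximates English-word and punctuation removal.
NON_DEVANAGARI_RE = re.compile(r"[^\u0900-\u097F\s]")

# Illustrative subset only; the study used a predefined stop-word list.
STOP_WORDS = {"और", "की", "है"}

def preprocess(text):
    """Return content-bearing Hindi tokens from a raw news text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DANDA_RE.sub(" ", text)
    text = NON_DEVANAGARI_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization as a stand-in
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "@user देखें https://example.com भारत और SBI की योजना है।"
tokens = preprocess(sample)
print(tokens)
```

URLs and mentions are stripped first so that their Latin characters and symbols are not left behind as stray fragments when the non-Devanagari filter runs.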

Hindi Word Embedding Creation
Natural language data are typically unstructured, and deep learning models cannot handle them in their raw format. Hence, natural language data must be converted into internal vector representations. To create word embeddings, Word2vec, an unsupervised algorithm for learning vector representations of words, was used. Word2vec transforms words into vectors such that words sharing a context lie close together in the vector space [40]. Taking a large corpus as input, the algorithm creates a vector space and assigns a distinct vector to each word.
Most readily available Word2vec embeddings are for English. Custom Word2vec embeddings were therefore trained on our corpus of news articles. After preprocessing, the entire dataset was divided into sentences, and vector representations of Hindi words were obtained using Gensim [37]. The model was created with a dimension of 300 and a skip-gram window of 10. The corpus had 43,423 raw words with 10,310 word types. After retaining unique words and down-sampling the most common words, a corpus of 29,933 words was obtained. The Word2vec model trained on this corpus produced an embedding matrix of dimensions 10,310 × 300. The vocabulary of all unique words was built, and the maximum length of the news text was found. Each news text was padded to this maximum length of 1725 words. Using the trained Word2vec model, an embedding matrix was created in which the row index of each word equals its index in the vocabulary, and the row itself holds the word's Word2vec representation. The generated word embeddings served as the input to all deep learning models to predict the news text labels.
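The vocabulary, padding, and embedding-matrix steps can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the helper names, the `<pad>` token at index 0, right-padding, the 4-dimensional vectors, and the random values that stand in for trained Word2vec vectors. With Gensim 4.x, the settings reported above would roughly correspond to `vector_size=300`, `window=10`, and `sg=1`.

```python
import random

EMBED_DIM = 4   # the paper uses 300; kept small here for readability
PAD = "<pad>"   # hypothetical padding token, reserved at index 0


def build_vocab(docs):
    """Map each unique word to an integer index (0 is the padding index)."""
    vocab = {PAD: 0}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def pad_docs(docs, vocab, max_len):
    """Convert tokens to indices and right-pad each document to max_len."""
    return [
        [vocab[w] for w in doc] + [0] * (max_len - len(doc))
        for doc in docs
    ]


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the word whose vocabulary index is i."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


docs = [["चुनाव", "सरकार", "नीति"], ["सरकार", "घोषणा"]]
vocab = build_vocab(docs)
max_len = max(len(doc) for doc in docs)

# Stand-in for trained Word2vec vectors (random values, illustration only).
rng = random.Random(0)
vectors = {
    w: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for w in vocab if w != PAD
}

padded = pad_docs(docs, vocab, max_len)
matrix = build_embedding_matrix(vocab, vectors, EMBED_DIM)
```

Because the row index equals the vocabulary index, an embedding layer can look up a padded document simply by indexing rows of `matrix`.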

Deep Learning Models
As seen in related work, deep learning models have shown good results in the propaganda classification task in other languages. Hence, this research proposes three deep learning-based models for the computational propaganda classification task for the Hindi language: convolutional neural networks (CNNs), long short-term memory (LSTM), and bidirectional LSTM (Bi-LSTM). These models were selected based on the following points:

• Convolutional neural networks (CNNs) perform well at identifying local characteristics in text, which is useful for tasks such as classifying propaganda. CNNs can also detect patterns in limited text segments and offer computational advantages for smaller datasets and less intricate natural language processing (NLP) tasks.

• Long short-term memory (LSTM) networks are built to process sequential data, enabling them to capture and represent long-range dependencies within text. These models can handle input sequences of varying lengths, a characteristic frequently seen in news articles. LSTM models have also demonstrated a notable ability to preserve context and conversation history, which is required for the propaganda classification task.

• Bidirectional long short-term memory (Bi-LSTM) models capture contextual information from both the preceding and succeeding tokens within a sequence. This enables a comprehensive analysis of the complete context surrounding a given word or phrase, which can be advantageous when classifying propaganda. Bi-LSTM models can also construct sentence embeddings by encoding information from both the forward and backward directions.

Convolutional Neural Network (CNN)
The adopted CNN architecture is shown in Figure 5. This architecture is inspired by the work described in [41]. In our case, the embedding layer serves as the input to the model. As explained in [41], classification quality can be affected by the filter size configuration. Four filter sizes, 2, 3, 4, and 5, were used, which allowed us to focus on both smaller and larger sections of the news text. After filtering, a max-pooling layer was applied with a dropout probability of 0.2. The maximum values obtained from each convolutional layer were concatenated to obtain a single vector. This vector was then processed using a fully connected layer of size 30 with a dropout probability of 0.2. The dropout layers help achieve better convergence and avoid overfitting.
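The idea of applying several filter widths, pooling each feature map to its maximum, and concatenating the results can be shown with a small pure-Python sketch. It is a toy illustration under stated assumptions rather than the trained model: the function names, the 2-dimensional embeddings, and the all-ones kernels are hypothetical, and the dropout and fully connected layers are left out.

```python
def conv1d_valid(seq, kernel):
    """Slide a kernel of width k over a sequence of embedding vectors.

    seq is a list of L vectors and kernel is a list of k vectors of the
    same size. Returns the L - k + 1 activations after ReLU.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = 0.0
        for offset in range(k):
            token_vec = seq[start + offset]
            total += sum(x * w for x, w in zip(token_vec, kernel[offset]))
        out.append(max(0.0, total))  # ReLU
    return out


def multi_width_features(seq, kernels_by_width):
    """Max-over-time pooling per kernel, concatenated into one vector."""
    features = []
    for width in sorted(kernels_by_width):
        for kernel in kernels_by_width[width]:
            features.append(max(conv1d_valid(seq, kernel)))
    return features


# Toy input: 6 tokens with 2-dimensional embeddings.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]

# One all-ones kernel per filter width (2, 3, 4, and 5, as in the paper).
kernels = {w: [[[1.0, 1.0]] * w] for w in (2, 3, 4, 5)}

features = multi_width_features(seq, kernels)
```

Each width sees a different span of tokens, so the concatenated vector mixes evidence from short and long phrases; with many kernels per width, its length is the total number of kernels.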

Long Short-Term Memory (LSTM)
The context of a word can be determined from the words preceding it. LSTMs have been shown to capture the relevant context of a word by using memory cells in the network, which record the meaning of previously occurring words. To model this contextual information, an LSTM classifier was implemented that used the Word2vec embeddings described earlier. In the model architecture, the embedding layer is followed by the LSTM layer and then the fully connected layer. The last layer, with two neurons, is responsible for the news text classification. A dropout layer with a probability of 0.5 was used to counteract overfitting. In the output layer, a sigmoid activation function was used because the classification problem was binary.

Bidirectional Long Short-Term Memory (Bi-LSTM)
Unlike a unidirectional LSTM, which relies only on past words in the sequence, a bidirectional LSTM leverages the context from both the past and future parts of the sequence. In a bidirectional LSTM, memory cells are present in both directions to preserve the information of the words surrounding a particular word. The token embeddings were fed as inputs to the input layer of our bidirectional LSTM. A rectified linear unit (ReLU) was used as the activation function in the hidden layers. Average pooling and max pooling were used in the pooling layers to reduce the data dimensions. A probability of 0.1 was used in the dropout layer. The fully connected layer was the last layer of the model and provided the output for the two classes.
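The bidirectional reading and the two pooling operations can be sketched as follows. This is a deliberately simplified illustration: a scalar tanh recurrence stands in for the LSTM cell, and the function names and weights are hypothetical. It only shows how a backward pass gives each position access to the tokens that follow it, and how average and max pooling reduce the sequence of states to a fixed-size vector.

```python
import math


def run_rnn(seq, w_in, w_rec):
    """Simple tanh recurrence (stand-in for an LSTM cell); one state per step."""
    h = 0.0
    states = []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(seq, fwd=(1.0, 1.0), bwd=(1.0, 1.0)):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(seq, *fwd)
    backward = run_rnn(seq[::-1], *bwd)[::-1]  # read right-to-left, realign
    return list(zip(forward, backward))


def pool(states):
    """Concatenate average pooling and max pooling over the time axis."""
    columns = list(zip(*states))
    avg = [sum(col) / len(col) for col in columns]
    mx = [max(col) for col in columns]
    return avg + mx


# A single non-zero input at the first position.
states = bidirectional_states([1.0, 0.0, 0.0])
pooled = pool(states)
```

At the second position, the forward state is non-zero because it has seen the first input, whereas the backward state is zero because only zeros follow; this asymmetry is what the bidirectional model exploits.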

Transformer-Based Models
Although convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and bidirectional LSTMs have their respective advantages, transformer-based models such as BERT and its variants have brought notable improvements across multiple natural language processing (NLP) tasks. Transformers capture intricate interdependencies within textual data particularly well and have become the prevailing choice for numerous NLP applications. Owing to this performance, this research work focused on fine-tuning mono-lingual and multi-lingual transformer-based models. The general architecture of transformer-based models is shown in Figure 6.
The news article input text was tokenized into a list of tokens using a model-specific tokenizer. For example, the sentence "आप कैसे है" (How are you) results in six tokens using the mBERT tokenizer: ['आ', '##प', 'क', '##ै', '##से', 'है']. The pre-trained language models use fixed-length inputs. Hence, padding was added or extra tokens were removed, as required, to match the sequence length. Figure 7 shows the distribution of the token counts in the dataset. Padding tokens should not contribute during training; hence, they were masked. After padding and masking, the tokens were fed into the transformer models and passed through multiple self-attention layers. For each token, a final hidden embedding was generated. The embedding of the first token, which marks the beginning of the sequence, was used to generate the probability distribution over the two classes. Our study used four models: multi-lingual BERT, distil-multi-lingual BERT, Hindi-BERT, and Hindi-TPU-Electra.
The multi-lingual BERT (mBERT) is a multi-lingual variant of BERT [42] trained on 104 languages using a Wikipedia corpus. The number of Wikipedia entries differs across languages. To ensure a sufficient number of words in the vocabulary, high-resource languages such as English were under-sampled, and low-resource languages were over-sampled, using an exponentially smoothed weighting factor. The DistilBERT base multi-lingual model [43] is the distilled version of the BERT base multi-lingual model. The distil-multi-lingual BERT model has only 6 layers and 768 dimensions, resulting in 133 M parameters, whereas the mBERT model has 177 M parameters. This makes the model twice as fast as the mBERT base model.
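The fixed-length input and masking step described above can be sketched as a small helper. It is an illustrative assumption, not the tokenizer's internal code: the function name, the token ids, and the padding id of 0 are hypothetical, and real tokenizers also keep the final separator token when truncating. In Hugging Face tokenizers, comparable behaviour is requested with `padding="max_length"`, `truncation=True`, and `max_length=...`, which return `input_ids` and an `attention_mask`.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask.

    The mask is 1 for real tokens and 0 for padding, so that self-attention
    can ignore the padded positions.
    """
    ids = list(token_ids[:max_len])           # drop extra tokens
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical ids for a short sequence padded to length 6.
short_ids, short_mask = pad_and_mask([101, 7, 8, 102], max_len=6)

# A long sequence truncated to length 4.
long_ids, long_mask = pad_and_mask(list(range(1, 11)), max_len=4)
```

The choice of `max_len` is a trade-off that a token-count distribution such as the one in Figure 7 helps to make: a larger value loses less text but increases memory use and training time.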

Hindi-BERT and Hindi-TPU-Electra
Hindi-BERT was trained for the Hindi language using Google Research's ELECTRA [44]. The model was trained using the Hindi CommonCrawl and Hindi Wikipedia corpora. Hindi-TPU-Electra is a pre-trained Hindi language model trained with the ELECTRA base size. These models were trained using ktrain and TensorFlow to achieve better accuracy. Both models were fine-tuned for our propaganda classification task.

Experimental Settings
All the experiments were performed on an Nvidia DGX server with four Nvidia Tesla V100 GPUs and 32 GB of memory. The details of the environment setup for experimentation are listed in Table 4. For the deep learning models, we split the dataset into training and testing sets containing 80% and 20% of the examples, respectively. As explained in Section 4, we generated the word embeddings for our dataset using the Word2vec skip-gram approach. For all deep learning models, we experimented with loss functions, activation functions, and optimizers to obtain the best results. The ReLU activation function was employed for all layers except the output layer. The binary cross-entropy loss function and the Adam optimizer were used. The models were trained for up to 50 epochs, with early stopping at a patience of 10. For the transformer-based models, this research used Hugging Face Transformers, a Python library that provides pre-trained models for researchers and end-users. PyTorch, TensorFlow, and ktrain were used to train the models described in Section 4. The pre-trained models BERT-base-multi-lingual-cased, distil-mBERT, Hindi-BERT, and Hindi-TPU-Electra were fine-tuned on our dataset for the propaganda classification task. The AdamW optimizer, the cross-entropy loss function, a learning rate of 2 × 10⁻⁵, a batch size of 16, and 15 epochs were used.
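The early-stopping rule (at most 50 epochs, patience of 10) can be made concrete with a short sketch. The function name and the synthetic validation losses are hypothetical; the logic follows the common convention in which training stops once the validation loss has not improved for `patience` consecutive epochs.

```python
def early_stopping_epochs(val_losses, max_epochs=50, patience=10):
    """Return (best_epoch, stopped_epoch) for a list of validation losses.

    Epochs are numbered from 1. Training stops when the loss has not
    improved for `patience` epochs, or when max_epochs is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    stopped_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stopped_epoch


# Loss improves for three epochs and then plateaus at a worse value.
plateau = early_stopping_epochs([1.0, 0.8, 0.7] + [0.9] * 20)

# Loss keeps improving, so training runs for all 50 epochs.
improving = early_stopping_epochs([1.0 / (i + 1) for i in range(60)])
```

Under this rule, a model whose best epoch is the last allowed one (epoch 50) never triggered the patience criterion, whereas a model whose best epoch is early was stopped roughly ten epochs later.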

Results and Analysis
This section summarizes and compares the results obtained using the deep learning and transformer-based models. The evaluation metrics and the analysis of the obtained results are explained in the following subsections.

Evaluation Metrics
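The models were evaluated using precision, recall, F1 score, and accuracy. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, respectively, the evaluation metrics can be stated as follows:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

\[ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

All the evaluation metrics can be macro-averaged or weight-averaged to calculate the values across different classes. As our classification problem is binary, the macro-averaged scores were employed for precision, recall, and the F1 score so that both classes are treated equally.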
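Macro-averaging can be illustrated with a small pure-Python sketch that computes precision, recall, and F1 per class and then takes their unweighted means. The function name and the toy labels are hypothetical; the macro F1 here is the mean of the per-class F1 scores, which is one common convention.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: 1 = propaganda, 0 = non-propaganda.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = macro_scores(y_true, y_pred)
```

Because each class contributes equally to the mean, a model cannot obtain a high macro score by performing well only on the majority class.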

Model Performance
The performance of the deep learning architectures is shown in Table 5. To avoid overfitting, early stopping with a patience of 10 was used. The best results were recorded after 11, 12, and 50 epochs for the CNN, LSTM, and BiLSTM models, respectively. The validation loss plots are shown in Figure 8. Loss plots play a vital role in the training and evaluation of deep learning models; here, the loss curves were used to monitor and analyze the training process and to evaluate the performance of the models. The loss curves observed throughout the epochs show a favorable convergence of the models, indicating an absence of overfitting. The BiLSTM loss plot indicates the best model performance without overfitting. Considering the macro-averaged metrics for all the models, the best accuracy, precision, recall, and F1 score were obtained using the BiLSTM model.
Table 6 presents the results for the transformer-based models. This work explored multi-lingual and mono-lingual BERT models. It was observed that the transfer learning approach yielded better results in fewer epochs. However, more time was required to train these models. The results show that the multi-lingual BERT outperformed the other models during testing. The distil-BERT model is a distilled version of mBERT, whereas Hindi-TPU-Electra is a larger version of the Hindi-BERT model. The multi-lingual BERT model achieved an 84% score on our main evaluation measure (macro-averaged F1 score), which is just 1 percentage point higher than the second-best score, obtained with Hindi-BERT. Both models were trained on large Hindi texts and hence showed the best performance for our task of propaganda classification on Hindi texts.
Figure 9 shows the comparative analysis of the deep learning and transformer-based models. From the comparative analysis across all evaluation metrics, it is evident that the Bi-LSTM model performed the best among the deep learning models, and the multi-lingual BERT showed the best results among the transformer-based models. After analyzing the obtained results, it can be concluded that the classification performances of the multi-lingual BERT and Hindi-BERT models were almost on par with each other. The Bi-LSTM model, which achieved 84% accuracy, can capture contextual information and therefore showed superior results to the CNN. Multi-lingual BERT performed slightly more effectively than mono-lingual Hindi-BERT.

Conclusions
The propagation of propaganda techniques in mainstream media has inspired us to devise an approach to propaganda detection in Hindi news articles. From a technical perspective, tools for applying deep learning and transfer learning techniques are accessible to varying degrees. However, there are substantial limitations owing to the language of the analysis or the available datasets. It is often difficult to obtain satisfactory performance for different languages, particularly low-resource languages such as Hindi. Much of the earlier work on propaganda classification is centered on resource-rich languages such as English and, recently, Arabic. To the best of our knowledge, minimal work on propaganda detection in Hindi has been reported; hence, this problem is worthy of research. In this study, we investigated deep learning and transformer-based models for the Hindi computational propaganda classification task. The H-Prop-News dataset of Hindi propaganda and non-propaganda news text was used for the classification task. Owing to the lack of pre-trained word embeddings for Hindi, we trained a Word2vec model for word representations. These word embeddings were used as inputs to all the deep learning models. The loss curves for the deep learning models indicate that the models converge without overfitting. The comparative analysis of all the models indicates that the Bi-LSTM and multi-lingual BERT models perform best across all performance metrics. After analyzing the obtained results, we concluded that the classification performances of the Bi-LSTM and multi-lingual BERT models were almost on par with each other, achieving approximately 84% accuracy. The multi-lingual BERT model achieved a macro-averaged F1 score of 84%, which is just 1 percentage point higher than the second-best score, obtained with Hindi-BERT. The Bi-LSTM model can capture contextual information and thus showed superior results to the CNN. Multi-lingual BERT performed slightly more effectively than mono-lingual Hindi-BERT did. This study highlights the significance of contextual factors in the classification of propaganda in the Hindi language. Little work on propaganda detection has been reported for low-resource languages such as Hindi, and this research contributes to the field by proposing an approach to address this gap.

Limitations and Future Work
As with other works, ours is not free of limitations. One limitation of this research is the absence of linguistic features, which will be addressed in future work. This work also does not consider data augmentation, which can be explored in future work. We intend to extend our work to the detection of fine-grained propaganda. We also aim to further fine-tune the BERT model and improve it using the semantic features of Hindi texts.

Figure 1. Overview of the methodology.

Figure 2. News source-wise article distribution in the H-Prop-News dataset.

Big Data Cogn. Comput. 2023, 7, x FOR PEER REVIEW

Figure 3. Category-wise news article distribution in the H-Prop-News dataset.

Figure 4. Word cloud of most frequent words appearing in the dataset.

Figure 6. General architecture of transformer-based models.

Figure 9. Performance analysis of deep learning and transformer-based models. (a) Performance of deep learning models. (b) Performance of transformer-based models.

Table 1. Deep learning and transfer learning models used for propaganda classification.

Table 3. Details of the H-Prop-News dataset.

Table 4. Environment setup for experimentation.

Table 5. Results of the deep learning models.

Table 6. Results of the transformer-based models.