H-Prop and H-Prop-News: Computational Propaganda Datasets in Hindi

In this digital era, people rely on the internet for their news consumption. As people are free to express their opinions on social media, much of the information shared on the internet is loaded with propaganda. Propagandist content is intended to influence public opinion. In the mainstream media and prominent news agencies, the bias of the authors and of the agencies themselves may affect news content. Hence, propaganda spread through news articles needs to be detected. Detection and classification of propagandist text require standard, high-quality, annotated datasets. A few datasets are available for propaganda classification, but they are mostly in English. Hindi is the most spoken language in India, and efforts are needed to detect propagandist content in it. This research work introduces two new datasets, H-Prop and H-Prop-News, which consist of news articles in Hindi annotated as propaganda or non-propaganda. The H-Prop dataset is generated by translating 28,630 news articles from the QProp dataset. The H-Prop-News dataset contains 5500 news articles collected from 32 prominent Hindi news websites. We experiment with the proposed datasets using four supervised machine learning models combined with different feature vectors and word embeddings. Our experiments achieve 87% accuracy using Logistic Regression with TF-IDF feature vectors. The datasets provide high-quality labeled news articles in Hindi and open new avenues for researchers to explore techniques for analyzing and classifying propaganda in Hindi text.


Summary
According to [1], modern propaganda operates with many kinds of truth, such as half-truth, limited reality, and truth out of context. In recent times, propaganda has been used by terrorist organizations for recruitment [2][3][4][5] and by political parties during elections [6][7][8][9], among many others. Today, many online news outlets have appeared, some with the intent of spreading propaganda. A news article can fall anywhere on a spectrum from neutral to biased [10]. Even though every news outlet or agency claims to be fair and unbiased, the personal stand of the article's author and of the news outlet may influence the reporting style and intent to some extent [11]. An author may use psychological and linguistic techniques to influence readers on a specific topic. This malicious way of promoting an agenda is generally referred to as propaganda.
Most of the work on automatic approaches to propaganda identification targets texts in English. However, most news articles are regional, tied to a specific country's context and political landscape. In India, the number of internet users has surged in recent years. Hindi is the predominantly spoken language in India and the fourth most spoken language globally. Still, very little work has been done to explore propaganda detection in regional languages such as Hindi.
To allow the creation of models that identify propaganda spread in Hindi, we introduce two datasets: H-Prop and H-Prop-News. H-Prop is produced by machine translation from QProp, an existing English dataset of news articles labeled as propagandist or non-propagandist [11,12]. A subset of the instances in QProp is translated into Hindi using IBM's Watson language translator [13]. H-Prop-News has been curated and annotated from scratch from a set of news articles originally written in Hindi, collected from prominent Indian news websites. The H-Prop corpus contains 28,630 news articles, whereas H-Prop-News contains 5500 news articles.
This research focuses on digital, or computational, propaganda. Because no significant prior work has been reported on propaganda detection in the Hindi language, it contributes to the field of computational propaganda detection. Our contributions are as follows.

•
We produce and release a new dataset of news articles in Hindi annotated for propaganda obtained from prominent news websites.

•
We produce and release a derived dataset of news articles (originally in English) translated into Hindi and annotated for propaganda.

•
We experiment with different machine learning models with the H-Prop-News dataset and show their effectiveness for propaganda classification.
Researchers can further utilize these datasets to train supervised models for the classification and detection of propaganda. The datasets can also be used for other research projects, such as Hindi news article classification and topic modeling.
The TSHP-17 dataset [14] consists of news articles from 11 sources organized in four classes: trusted, satire, hoax, and propaganda. The dataset consists of 22,580 articles, out of which 5330 are flagged as propaganda. The authors created the dataset using distant supervision, considering the source of the news articles.
The authors of [23] released a corpus to identify fine-grained propaganda. The corpus contains 451 articles and was manually annotated by 6 people to identify 18 propaganda techniques at a fine-grained level. The authors identified 7485 propaganda technique instances from 21,230 sentences. The PTC-SemEval20 Corpus was presented by [24] as part of SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. This corpus consists of news articles gathered from 13 propaganda and 36 non-propaganda news websites identified by Media Bias/Fact Check. The corpus consists of 536 news articles with 8981 identified propaganda snippets. The annotation was performed manually by 6 professional annotators considering 18 propaganda techniques.
QCRI's propaganda corpus, known as QProp, comprises news articles in two classes: propaganda and non-propaganda [11]. Like the TSHP-17 dataset, the corpus was built by distant supervision, using the news source information published by Media Bias/Fact Check (MBFC) (https://mediabiasfactcheck.com, accessed on 6 January 2022). QProp contains 51,246 articles: 5714 from propagandist sources and 45,532 from non-propagandist ones. Table 1 shows statistics of the QProp dataset. In the TSHP-17 and QProp datasets, the articles were labeled using the distant supervision technique, which relies entirely on the source of an article to label it as propaganda. This approach does not consider the actual contents of the article when identifying propaganda in the text. Also, the number of propagandist instances in both datasets is small compared to the other classes.
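As a minimal sketch of the distant supervision idea described above, every article simply inherits the label of its source and its text is never inspected. The outlet names and the `SOURCE_LABELS` mapping are hypothetical placeholders, not the actual MBFC list or the QProp authors' code.

```python
from urllib.parse import urlparse

# Hypothetical source-level judgments; a real list would come from MBFC.
SOURCE_LABELS = {
    "propagandist-outlet.example": "propaganda",
    "neutral-outlet.example": "non-propaganda",
}

def label_by_source(article_url):
    """Return the label of the article's source, ignoring the article text."""
    host = urlparse(article_url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return SOURCE_LABELS.get(host)  # None when the source is unknown

articles = [
    "https://www.propagandist-outlet.example/politics/story-1",
    "https://neutral-outlet.example/national/story-2",
]
labels = [label_by_source(url) for url in articles]
```

Because the function never reads the article body, a neutral report published by a source marked as propagandist would still be labeled as propaganda, which is the limitation noted above.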
All the prominent datasets proposed for propaganda detection are in English. To the best of our knowledge, no such dataset is available for Hindi. In this work, we generate two Hindi propaganda datasets using two different approaches. In the first, a subset of QProp is translated into Hindi. In the second, an original dataset of Hindi news is created. Because the articles in this second dataset are labeled by examining their actual contents, the annotation is more reliable than source-based labeling.

Data Description
This section provides a detailed description of the H-Prop and H-Prop-News datasets. Tables 2 and 3 show statistics of the two datasets.

H-Prop Dataset
The original QProp dataset consists of 51,246 news articles. The H-Prop dataset is derived from QProp and includes only 28,630 of these articles. The data is split into development, training, and testing partitions. The dataset files are in tab-separated format with UTF-8 encoding. This subsample of the corpus was translated into Hindi using IBM Watson Language Translator [13] (https://www.ibm.com/cloud/watson-language-translator, accessed on 13 October 2021). The translation was carried out over several months in 2021. Table 4 shows the details of the H-Prop dataset per partition.
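Since the files are tab-separated and UTF-8 encoded, they can be read with Python's standard `csv` module. The sketch below is illustrative only: the two-column layout (article text, then label) and the sample rows are assumptions, not the documented layout of the released files.

```python
import csv
import io

def read_articles(handle):
    """Yield (text, label) pairs from a tab-separated file handle.

    Assumes two columns per row: article text, then label (hypothetical layout).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield row[0], row[1]

# In-memory stand-in for a file opened with
# open(path, encoding="utf-8", newline="")
sample = io.StringIO(
    "यह एक समाचार लेख है।\tnon-propaganda\n"
    "यह दूसरा लेख है।\tpropaganda\n"
)
rows = list(read_articles(sample))
```

Opening the real files with an explicit `encoding="utf-8"` avoids platform-dependent defaults that could corrupt the Devanagari text.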

H-Prop-News Dataset
The H-Prop-News dataset is built by extracting news articles from 32 prominent mainstream news portals in India. The articles were collected between September 2021 and December 2021. The focus is national and political news in the Indian context. Table 5 shows statistics about the H-Prop-News dataset. A total of 5500 articles were scraped from these websites using the ParseHub web scraping tool (https://www.parsehub.com/, accessed on 6 January 2022). These articles are annotated as propaganda or non-propaganda by considering their contents and identifying the propaganda techniques observed in them. Table 6 shows the class-wise article distribution per medium. Most propagandist articles come from the news website Patrika News (available online: www.patrika.com, accessed on 6 January 2022), whereas most non-propaganda articles come from Amar Ujala (available online: www.amarujala.com, accessed on 6 January 2022).

Methods
This section elaborates on the methods and techniques used for data collection and dataset generation of H-Prop and H-Prop-News datasets.

H-Prop Dataset Generation
A portion of the QProp dataset is used to prepare the H-Prop dataset, as explained in Section 2.1. IBM Watson Language Translator is used for translation. Translating the English text introduces several special characters due to encoding conversion. These special characters are then removed to clean the data. Figure 1 shows the methodology used to generate the H-Prop dataset.
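The text does not list which special characters were removed. As one possible sketch of such a cleaning step, the function below keeps Devanagari characters, ASCII letters and digits, whitespace, and common punctuation, and replaces everything else (for example the Unicode replacement character U+FFFD or mis-decoded symbols) with a space. The whitelist itself is an assumption, not the authors' rule.

```python
import re

# Assumed whitelist: Devanagari block, ASCII alphanumerics, whitespace,
# and basic punctuation. Everything else is treated as encoding debris.
_DISALLOWED = re.compile(r"[^\u0900-\u097F0-9A-Za-z\s.,;:!?()%'\"-]")
_MULTISPACE = re.compile(r"\s+")

def clean_translated_text(text):
    """Remove characters outside the whitelist and normalise whitespace."""
    text = _DISALLOWED.sub(" ", text)
    return _MULTISPACE.sub(" ", text).strip()

# Made-up example containing a replacement character and mojibake.
cleaned = clean_translated_text("सरकार ने\ufffd नई नीति â€ की घोषणा की।")
```

A whitelist is used rather than a blacklist because the set of stray characters produced by an encoding mismatch is hard to enumerate in advance.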

H-Prop-News Dataset Generation
First, 32 prominent Hindi news websites reporting national and political news were selected. Collecting data from different websites is challenging because each website follows a different page layout. ParseHub is a cloud-based, free web-scraping tool that extracts data from a website in a few steps. We extracted news headlines, news URLs, and article texts from the websites. Figure 2 shows the process of H-Prop-News dataset creation.

Data Annotation
The news articles in the QProp corpus were labeled using distant supervision. The authors [11] rely on the news outlet information provided by Media Bias/Fact Check (MBFC) (https://mediabiasfactcheck.com/, accessed on 6 January 2022). News coming from propagandist news outlets was labeled as propagandist, and news coming from non-propagandist news outlets was labeled as non-propagandist. We retain the annotations as provided in the original QProp dataset.
The annotation task for the H-Prop-News dataset involved identifying the propaganda methods used and labeling the articles as propaganda or non-propaganda. The definitions of the 14 propaganda techniques listed in Table 7 are followed. The annotation task was done in two phases: (i) two annotators independently labeled the articles as propaganda or non-propaganda, and (ii) the annotations were then reviewed for conflicts. We used the LightTag text annotation tool [25] for the annotation and analysis. With reference to the annotation guidelines provided by the authors of [24], we present a flowchart for the document-level label decision process. As shown in Figure 3, the propaganda techniques are grouped by specific indications. For example, articles that add irrelevant data along with problem simplification may contain propaganda techniques such as causal oversimplification, appeal to authority, black-and-white fallacy, or thought-terminating cliché. The annotators further referred to the more detailed definitions of these techniques in Table 7 for technique identification. If more than one technique is spotted in the article, the annotator labeled the article as propaganda.
To evaluate annotation quality in terms of inter-annotator agreement, Cohen's Kappa [26] is used. Cohen's Kappa measures the agreement between two annotators classifying articles into n mutually exclusive categories. The inter-annotator agreement (K) observed is on average 0.81.
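For reference, Cohen's Kappa compares the observed agreement p_o with the agreement p_e expected by chance: K = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that computation on made-up labels; it is not the authors' code and does not use the H-Prop-News annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy example: P = propaganda, N = non-propaganda (made-up labels).
ann_1 = ["P", "P", "N", "N", "P", "N", "N", "N", "P", "N"]
ann_2 = ["P", "P", "N", "N", "N", "N", "N", "N", "P", "N"]
kappa = cohens_kappa(ann_1, ann_2)
```

In this toy case the annotators agree on 9 of 10 items (p_o = 0.9) and the chance agreement is p_e = 0.54, so the kappa is noticeably lower than the raw agreement rate.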

Table 7. Propaganda techniques and their definitions.

| No. | Propaganda Technique | Definition |
|---|---|---|
| 1 | Loaded language | Use of strong emotional words and phrases [27] |
| 2 | Name calling/labeling | Labeling the object of the propaganda with something the audience fears, hates, finds undesirable, or loves or praises [28] |
| 3 | Repetition | Repeating the same message repeatedly [28,29] |
| 4 | Exaggeration/minimization | Representing something excessively or making something seem less important or smaller than it actually is [30] |
| 5 | Doubt | Questioning the credibility of someone or something |
| 6 | Appeal to fear/prejudice | Infusing anxiety and/or panic towards an alternative, possibly based on prejudiced conclusions |
| 7 | Flag-waving | Playing on strong national feeling to justify or promote an action or idea [31] |
| 8 | Causal oversimplification | Transfer of the blame to one person or group of people without investigating the complexities of an issue |
| 9 | Slogans | A concise and dramatic phrase that may include labeling and stereotyping |
| 10 | Appeal to authority | Stating that a claim is true simply because a valid authority/expert on the issue supports it, without any other supporting evidence [32] |
| 11 | Black-and-white fallacy | Presenting two alternative options as the only possibilities, when in fact, more possibilities exist [29] |
| 12 | Thought-terminating cliché | Words or phrases that discourage critical thought and meaningful discussion on a topic [33] |
| 13 | Whataboutism | Discredit an opponent's position by charging them with hypocrisy without directly disproving their argument [34] |
| 14 | Bandwagon | Attempting to persuade the target audience to join in and take the course of action because "everyone else is taking the same action" [31] |

Sample news articles and the respective labels are shown in Table 8. The English translation is provided here for the understanding of our international readers. The first article does not contain any propaganda technique. In the second news article, propaganda techniques such as loaded language, exaggeration, and causal oversimplification can be observed.

Table 8. Sample news articles text and labels.

| Sample News Article Text | English Translation of Article Text | Article Label |
|---|---|---|
| [original Hindi text not available] | The report said that the city gas distribution companies will have to increase the prices by 10-11 percent. From October 2022 to March 2023, it will be $7.65 per unit. This means that the prices of CNG and PNG will increase by 22-23 percent in April 2022. In October 2022, the price will increase by another 11 to 12 percent. MGL and IGL will have to increase prices by 49 to 53 percent between October 2021 and October 2022 due to the hike in APM gas prices, the report said. | Non-propaganda |
| [original Hindi text not available] | Rahul Gandhi, furious at the Center over privatization, said-PM sold the country's 70 years of capital! By Lokmat News Desk, Published: 24 August 2021, 09:59 p.m. Congress leader Rahul Gandhi has attacked the Modi government. Rahul Gandhi targeted the government, terming the announcement of the National Monetization Plan (NMP) as an "attack on the future of the youth". He alleged that Prime Minister Narendra Modi sold the country's capital built in 70 years to some of his industrialist friends. Rahul Gandhi also claimed that giving this "gift" to some companies will make them a monopoly, due to which the youth of the country will not be able to get employment. | Propaganda |

Experimental Setup
This section provides an overview of the experiments performed for the propaganda classification task using the H-Prop-News dataset. We trained four machine learning models: Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost. Figure 4 shows the propaganda classification framework. After preprocessing the data by removing URLs, we remove the Hindi stopwords from the article text. The text is tokenized using the nlp-indic library. For representation, we use four different feature vectors and word embeddings: Bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), word2vec, and doc2vec. Each machine learning model is trained with each of these representations. The entire dataset of 5500 articles is used in the experimental setup. The dataset is split into training, testing, and validation sets using a 70:20:10 ratio. The resulting training set contains 3850 articles, the testing set contains 1100 articles, and the validation set contains 550 articles.
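The preprocessing and splitting steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: whitespace splitting stands in for the tokenizer, the stopword set is a tiny hypothetical sample, and the random seed and shuffling strategy are arbitrary choices.

```python
import random
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Tiny illustrative stopword list; a real run would use a full Hindi list.
HINDI_STOPWORDS = {"और", "का", "की", "के", "है", "में", "से", "को"}

def preprocess(text):
    """Strip URLs, split on whitespace, and drop stopwords."""
    text = URL_PATTERN.sub(" ", text)
    return [tok for tok in text.split() if tok not in HINDI_STOPWORDS]

def split_70_20_10(items, seed=42):
    """Shuffle and split into train/test/validation partitions (70:20:10)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(items) * 70 // 100
    n_test = len(items) * 20 // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation

# Article indices stand in for the 5500 articles.
train, test, validation = split_70_20_10(range(5500))
tokens = preprocess("सरकार की नई नीति https://example.com/news में बदलाव")
```

With 5500 items, the three partitions contain 3850, 1100, and 550 items, matching the sizes reported above.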

Table 9 shows the performance of all the machine learning models using different features and word embeddings on the training, testing, and validation sets. Logistic Regression with TF-IDF feature vectors gives the best results on both the testing and validation sets. The F1 score and accuracy obtained on the validation set are 87.46% and 87.45%, respectively. All the classifiers perform worst with doc2vec embeddings.
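For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and F1 from a pair of label lists. The text does not state which F1 averaging was used, so the positive-class (binary) F1 shown here is an assumption, and the labels are made up rather than taken from Table 9.

```python
def accuracy_and_f1(y_true, y_pred, positive="propaganda"):
    """Return (accuracy, F1 for the positive class) for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy example with made-up labels.
P, N = "propaganda", "non-propaganda"
y_true = [P, P, P, N, N, N, N, N]
y_pred = [P, P, N, N, N, N, N, P]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

Here 6 of 8 predictions are correct, while the positive class has 2 true positives, 1 false positive, and 1 false negative, so the two metrics differ.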
The main aim of this work was to develop a propaganda dataset in the Hindi language and a machine learning model for the classification of propaganda text. The annotation process required rigorous and time-consuming inspection of the news articles by the annotators. Annotation reliability is established using Cohen's Kappa as the measure. The most frequent propaganda techniques observed during the annotation process were loaded language and labeling or name-calling. Our observations are similar to the findings of [23].

Results and Discussion
Propaganda detection remains a challenging task, particularly when it requires fine-grained analysis of the text. This work provided an opportunity to develop machine learning models that detect propaganda at the document level.

Use Cases of the H-Prop and H-Prop News Dataset
The proposed datasets have the following practical implications.

•
These datasets can be used for propaganda classification tasks at the article level.

•
The datasets can be further enriched for fine-grained propaganda labeling to identify various propaganda techniques.

•
The H-Prop-News dataset can be further utilized to explore various topics and events related to propaganda, such as the target of propaganda, source of propaganda, etc.

Conclusions and Future Work
This research presents two propaganda datasets. H-Prop consists of news articles translated from the English propaganda dataset QProp. H-Prop-News contains original Hindi news articles gathered from Hindi mainstream news websites. The H-Prop dataset contains 28,630 news articles, and the H-Prop-News dataset contains 5500 news articles. The annotations of the H-Prop articles are retained from the original QProp corpus. In contrast, the H-Prop-News dataset is manually annotated, considering the definitions of propaganda techniques. To the best of our knowledge, no significant work has been reported in the area of propaganda detection in Hindi text. Hence, these newly created datasets are the first publicly available datasets of their kind. This work also explains the process of dataset creation and provides statistical details. In addition, propaganda classification using machine learning techniques is explored, obtaining an accuracy of 87%. As computational propaganda detection and analysis is an emerging field of research, this work will help researchers explore natural language processing and machine learning techniques in this area.
As future work, we aim to increase the size of the H-Prop-News dataset by covering more news websites. Currently, the news articles are collected under the national and political categories. The dataset can also be extended to evaluate the use of propaganda in opinion and editorial articles. As the dataset is manually annotated, it may reflect the annotators' bias. More annotators can be employed to reduce it. It is also observed that even though the news articles are collected from Hindi news media, the text is not purely in Hindi. Some code-mixing and use of English words is observed.