KazNewsDataset: Single Country Overall Digital Mass Media Publication Corpus

Mass media are among the most important elements influencing the information environment of society. The mass media are not only a source of information about what is happening but often the authority that shapes the information agenda and the boundaries and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each with at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of the publication activity of the two countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods that form the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating socially significant news, identifying texts with propagandistic content, and evaluating the sentiment of publications using the topic model of the text corpus, since area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73 and 0.93 were achieved on the abovementioned tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, and other analyses of the considered corpus of texts. The corpus will be of interest to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.


Summary
The mass media are a source of documents that allow us to stay up to date, form our own judgments and opinions on certain events or decisions, and develop certain media consumption habits. However, this also increases the possibility of spreading distorted, biased or erroneous information and, ultimately, of manipulating information consumers.
Evaluating the impact of mass media requires rapid processing of large amounts of textual information, which can be achieved using natural language processing (NLP) and machine learning (ML) techniques. These technologies allow users to extract information from large amounts of textual data [1,2], provide content analysis [3,4], personalized access to news [5][6][7], and even support its production and distribution [8,9].
Natural language processing (NLP) as a field of research includes a wide range of application areas: • Automatic translation [10]; • Automatic summarization; • Generating responses to user requests (question answering) [11]; • Information extraction (IE) [12]; • Information retrieval [13,14]; • Sentiment analysis [15]; • Other areas related in some way to the processing of spoken and written natural language.
NLP as a research field is changing very dynamically. Since [16], qualitatively new results have been obtained in the development of statistical language models. The large volumes of texts available in social networks and the use of deep neural networks [17] have led to the formulation of tasks of extracting patterns from vast amounts of unstructured information based on modern methods of distributional semantics and so-called supervised learning.
Another modern approach is the use of concept-level knowledge bases, such as WordNet, ConceptNet and SenticNet [18]. This approach proves to be highly efficient and interpretable; however, it requires a knowledge base to be developed for a given language. A solution to this problem was proposed in [19]. In addition, a survey on multilingual sentiment analysis, including proposed solutions for languages with scarce resources, was published in [20].
The key aspects that have led to impressive results in automatic natural language processing are, according to [21], advances in the development of machine-learning methods, especially deep learning [22,23], the growth of computing power, the availability of large amounts of linguistic data, and a developing understanding of natural language structure as applied to the social context. However, the volumes of textual information collected are often narrowly focused [24][25][26][27][28]. In contrast, the Kazakh mass media corpus described below is broadly applicable. Below, we describe the dataset's content, some models, and problem-solving methods using this largely unlabeled corpus.

Data Description
The dataset contains news publications from publicly available news websites as well as from social networks, including VK.com, YouTube, Instagram and Telegram. The corpus is presented in two forms. The first contains only basic meta-information about each publication, such as title, source, and publication date and time, and consists of Kazakhstani and Russian news. It can be used, for example, for comparative analysis of the news of the two countries or for any other text-related tasks (topic modeling, information retrieval, etc.).
The second form consists of news only from Kazakhstani sources but contains three additional column groups: (1) weights of correspondence to handpicked topic groups; (2) weights of correspondence to 200 topics from the BigARTM model; (3) type: either news publication or governmental program document.
It should also be noted that the dataset was not manually verified or edited, due to technical difficulties and limitations. Hence, the dataset can contain: • HTML and JavaScript code fragments; • Incorrect publication dates and times due to format issues; • In rare cases, text from different publications merged into a single text field.
However, such cases are not frequent and, according to verification on a small subset of data, only occur in less than 5% of publications.

Form 1 of Corpus Representation
Corpus of news from Russian and Kazakhstani news sources from 2000 to 2020, covering 50 major sources, including social networks (VK.com, YouTube, Instagram and Telegram) and news websites. It includes 4,233,990 documents from Kazakhstani sources and 2,027,963 documents from Russian sources (Figure 1).
It is available at [29]. For each document, the following fields are included:

Form 2 of Corpus Representation
Corpus of news from Kazakhstani sources from 2018 to 2020. It is available at [30]. It includes 1,142,735 documents from news websites and social networks with the same data as in the corpus described above, with the addition of: • Sixty-seven columns with handpicked topic group weights with semantic names (group economy, group politics, etc.), normalized to range from 0 to 1; • Two hundred columns with topic weights obtained through topic modeling; these columns represent the theta matrix of the topic model; • Type: either "news" or "governmental program".
The corpus also includes about 4000 documents, which represent fragments of governmental development program documents. Twenty-five governmental development programs were used, including 18 programs for separate regions and big cities, two long-term state development programs and five thematic programs (digitalization, countryside development, education, healthcare and welfare programs). They were manually divided into 4000 independent fragments by experts. The reason for such preprocessing is that governmental programs are generally very lengthy and can be very thematically diverse, which complicates usage of the whole documents in topic modeling.
Comparison of representation of these governmental documents in topics is one of the approaches used to evaluate the social significance of news in [31].
There are also two additional files for this corpus: • The topic-words.json file represents words with weights for the 200 topics obtained through topic modeling; it is a compressed representation of the phi matrix; • The topic-expert-labeling-sentiment.json file contains expert labeling of topic sentiment; it was used to obtain the results described in [31].
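As a hedged sketch, the topic-words file could be consumed as follows; note that the exact on-disk layout (topic id mapped to word-weight pairs) is an assumption here, not specified in the text, so a real script would need to match the published file.

```python
import json

# Assumed layout of topic-words.json: {topic_id: {word: weight, ...}, ...}.
# This is an illustrative reader, not the authors' tooling.
def top_words(path, topic, n=3):
    """Return the n highest-weight words of one topic from topic-words.json."""
    with open(path, encoding="utf-8") as f:
        topics = json.load(f)
    weights = topics[topic]
    return sorted(weights, key=weights.get, reverse=True)[:n]
```

Such a helper makes it easy to inspect and label topics, which is how the sentiment labels in topic-expert-labeling-sentiment.json could have been produced.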

Methods
The data gathering method is web-scraping of news and media websites with open free access, as well as the scraping of social networks either by a list of social network accounts (users, groups, channels, etc.) or by a list of search queries.
The scraping algorithms were implemented in the form of Apache Airflow operators. Apache Airflow is an ETL (extract-transform-load) tool, which allows tasks to be programmatically scheduled, managed and monitored, and errors and exceptions to be handled. It was used as the core ETL solution for the mass media monitoring system "Media Analytics" [32,33].
The scraping algorithms were implemented using the Python library Scrapy 1.7.3. For websites with dynamic content (for example, websites based on React.JS, Angular and other modern frontend frameworks), the scrapy-splash library was used along with the official Docker image scrapinghub/splash:3.3.1. This setup simulates running the JavaScript code inside a client's web browser to obtain the website's contents (the list of news publications and the texts of news publications along with other meta-data).
A custom configurable Scrapy Spider was implemented, which accepts a starting URL and a list of scraping rules. This approach minimizes the amount of software development needed to introduce a new scraping source, since it only requires a list of scraping rules and a starting URL. It is certainly possible to create a universal parser able to scrape any source of information. However, such universal parsers tend to yield much lower-quality meta-data, and their scraping results are generally unstructured or weakly structured. That is the reason for the proposed approach, which requires a set of scraping rules for each website but gives much higher quality and precision of meta-data, including publication date and time, number of views, author, tags, comments, likes, shares, etc.
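The configurable-spider idea can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the source domain, field names and CSS selectors are hypothetical, and a real implementation would subclass scrapy.Spider and receive a genuine Scrapy response object.

```python
from dataclasses import dataclass, field

# Each source is described by a starting URL plus a mapping from meta-data
# field to CSS selector; one generic spider applies whatever rules it is given.
@dataclass
class SourceConfig:
    start_url: str
    rules: dict = field(default_factory=dict)  # field name -> CSS selector

# Hypothetical source definition (domain and selectors are made up).
KAZ_SOURCE = SourceConfig(
    start_url="https://example-news.kz/latest",
    rules={
        "title": "h1.article-title::text",
        "published_at": "time.article-date::attr(datetime)",
        "text": "div.article-body p::text",
    },
)

def make_spider(config):
    """Build a Scrapy-style spider class from a rule set (sketch only)."""
    class ConfigurableSpider:  # a real version would subclass scrapy.Spider
        start_urls = [config.start_url]

        def parse(self, response):
            # With a real Scrapy response, response.css(sel).get() returns
            # the first match for each configured selector.
            return {name: response.css(sel).get()
                    for name, sel in config.rules.items()}

    return ConfigurableSpider
```

The design choice here mirrors the text: adding a new website means writing only a new SourceConfig, while the spider logic stays shared.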
Scraping rules are a set of CSS selectors, one for each of the accessible meta-data elements of news publications on a given website. In principle, it is also possible to use regular expressions instead of CSS selectors; however, CSS selectors seem to be the more fitting choice, since the vast majority of HTML pages are built according to a certain CSS methodology, which makes them easy to navigate with CSS selectors. Nevertheless, in our experience, there are rare cases in which either regular expressions or some workarounds in the Spider are required. Another technical issue with scraping is that news websites usually implement measures against automatic scraping and DDoS (distributed denial of service) attacks, so random proxy servers were used, along with randomly picked user agents, to reduce the chances of identification and banning. In addition, Scrapy allows configuring the period between requests, concurrency and other parameters, the tuning of which minimizes the chance of being banned by a website.
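The anti-ban measures above could look roughly like this. The setting names are real Scrapy options, but the values, user-agent strings and proxy addresses are invented examples, not the configuration actually used for the corpus.

```python
import random

# Illustrative politeness/anti-ban settings in Scrapy's configuration style.
SCRAPY_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,                # base pause between requests (seconds)
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # jitter the pause to look less robotic
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # limit parallelism per site
}

# Made-up pools; a real deployment would maintain larger, rotating lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

def pick_identity(rng=random):
    """Randomly pick a (user-agent, proxy) pair for the next request."""
    return rng.choice(USER_AGENTS), rng.choice(PROXIES)
```

In Scrapy these settings would go into the project's settings module, while the user-agent and proxy choice would typically be applied in a downloader middleware.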
Social network scraping is a more complicated problem, since social networks tend to implement technical restrictions against scraping. In most cases, scraping is only possible either through an official API (access to which may be hard to obtain and which can be very limited) or through thorough simulation of user actions via Selenium or similar software. The second option is very costly to implement, so the first approach was used wherever possible.
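API-based collection usually reduces to paging through an endpoint until it is exhausted. The sketch below shows only that generic paging pattern; the fetch function is injected, since the actual endpoints, credentials and rate limits of each social network are not described in the text.

```python
# Generic pagination over a hypothetical social-network API.
# fetch_page(offset, count) is expected to return a list of posts
# (empty when there is no more data); injecting it keeps the paging
# logic independent of any particular network client.
def collect_posts(fetch_page, page_size=100, max_pages=1000):
    posts, offset = [], 0
    for _ in range(max_pages):  # hard cap as a safety net against loops
        batch = fetch_page(offset=offset, count=page_size)
        if not batch:           # empty page -> collection finished
            break
        posts.extend(batch)
        offset += len(batch)
    return posts
```

A real collector would add retry/backoff around fetch_page to respect the API's rate limits.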
Topic modeling. One of the methods productively applied in the field of NLP is topic analysis, or topic modeling (TM). TM is a method based on the statistical characteristics of document collections, used in tasks of automatic summarization, information retrieval and clustering [34]. TM formalizes the intuitive understanding that documents in a collection form groups within which the frequency of occurrence of words or word combinations differs.
The basis of TM is a statistical model of natural language. Probabilistic TM describes each document from the collection (M) by a discrete distribution over a set of topics (T), and each topic by a discrete distribution over a set of terms [35]. In other words, TM determines which topics each document relates to and which words form each topic. The clusters of terms and phrases formed in the process of topic modeling, in particular, help to resolve the synonymy and polysemy of terms [36].
To build a topic model of the corpus of documents, the very popular latent Dirichlet allocation (LDA) [37,38] is used. LDA can be expressed by the following equality:

p(w | m) = Σ_{t ∈ T} p(w | t) p(t | m),

which represents a mixture of conditional distributions over all topics of the set T, where p(w | t) is the conditional distribution of words in topics and p(t | m) is the conditional distribution of topics in the news. The transition from the conditional distribution p(w | t, m) to p(w | t) relies on the hypothesis of conditional independence, according to which the appearance of words on topic t does not depend on the particular news item m but is common to all news. This holds under the assumption that neither the order of documents (news) in the corpus nor the order of words within a news item needs to be preserved. In addition, the LDA method assumes that the components ϕ_wt = p(w | t) and θ_tm = p(t | m) are generated by the Dirichlet continuous multidimensional probability distribution. The purpose of the algorithm is to find the parameters ϕ_wt and θ_tm by maximizing the likelihood function with appropriate regularization:

Σ_{m ∈ M} Σ_{w ∈ m} n_mw ln Σ_{t ∈ T} ϕ_wt θ_tm + R(Φ, Θ) → max,

where n_mw is the number of occurrences of word w in news m and R(Φ, Θ) is the regularization term. The method of maximizing the coherence value based on the UMass metric is often used to determine the optimal number of topics [39]. A generalization of LDA is the additive regularization of topic models (ARTM), implemented in the BigARTM library [40]. The LDA method is described in more detail in Appendix A.
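The mixture equality p(w | m) = Σ_t p(w | t) p(t | m) can be checked numerically on a toy model; the topics, words and probabilities below are invented purely for illustration.

```python
# Toy phi and theta matrices: 2 topics, 3 words, 1 document.
# phi[t][w] = p(w | t); each topic's word distribution sums to 1.
PHI = {
    "t1": {"oil": 0.7, "gas": 0.2, "vote": 0.1},
    "t2": {"oil": 0.1, "gas": 0.1, "vote": 0.8},
}
# theta[m][t] = p(t | m); each document's topic distribution sums to 1.
THETA = {"doc1": {"t1": 0.25, "t2": 0.75}}

def p_word_given_doc(word, doc):
    """Mixture p(w | m) = sum over topics of p(w | t) * p(t | m)."""
    return sum(PHI[t][word] * THETA[doc][t] for t in PHI)
```

Since the rows of ϕ and θ are proper distributions, the resulting p(w | m) also sums to one over the vocabulary, which is the sanity check LDA implementations rely on.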
The described models, together with some methods of multiple-criteria decision-making (MCDM) such as the analytic hierarchy process (AHP) [41] and Bayesian networks [42,43], are used to classify socially significant news [31], identify propaganda [44], assess the sentiment of publications [45,46] and perform comparative analysis of publication activity in the field of renewable energy in Russia and Kazakhstan [47]. The details of the method are described in [48,49].
Another possible application is the assessment of the dynamics of publication activity on certain topics or regarding certain persons, organizations and events. It can serve as a numerical evaluation of the popularity of certain topics and entities, for example, in humanitarian research and in practical applications such as public-relations department effectiveness estimation (KPIs), reputation management, competitor analysis, etc.
However, it should be noted that the absolute number of publications is not a representative estimate, since publication activity in online media is growing rapidly; as illustrated in Figure 2, the number of publications has grown tenfold during the last ten years. Hence, a normalization is required, which should take into account the overall number of publications in a given period.
Such normalization yields much more representative results. For example, Figure 3 illustrates normalized weekly publication activity on topics related to viruses and infections over the last ten years. Publication activity on this topic was very stable over the years and almost doubled during the COVID-19 outbreak in early 2020.
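The normalization described above amounts to dividing a topic's weekly publication count by the total number of publications that week; a minimal sketch with invented numbers:

```python
# Toy illustration of normalized publication activity: the absolute topic
# count grows tenfold between the two weeks, but its share of all
# publications is unchanged. All numbers are made up.
weekly_total = {"2010-W05": 1_000, "2020-W05": 10_000}
weekly_topic = {"2010-W05": 20, "2020-W05": 200}

def normalized_activity(week):
    """Topic publications as a fraction of all publications in that week."""
    return weekly_topic[week] / weekly_total[week]
```

Here the raw count grows from 20 to 200, yet the normalized activity stays at 2%, which is exactly why Figure 3 reports shares rather than absolute counts.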

Limitations of the Study
The presented corpus has the following limitations: • The results of model verification are based on datasets, each of which was labeled by a single expert; no thorough validation of the experts' assessments was performed, only visual validation; • The volume of the labeled subset is small compared to the volume of the corpus.

Conclusions
The paper described a text corpus, which contains over 4 million publications of Kazakhstani media, more than 2 million texts of Russian media and about 4000 sections of state development program documents. The corpus was used in several research cases, such as identification of propaganda, assessment of the sentiment of publications, calculation of the level of socially significant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news using the topic model of the text corpus, since an area under the receiver operating characteristic curve (ROC AUC) score of 0.81 was achieved in the classification task, which is comparable with results obtained for the same task by applying the bidirectional encoder representations from transformers (BERT) model. The proposed method of identifying texts with propagandistic content was cross-validated on a labeled subsample of 1000 news items and showed high predictive power (ROC AUC of 0.73). In the task of sentiment analysis, the proposed method showed a 0.93 ROC AUC score.
Despite the noted limitations, the corpus will be of interest to researchers analyzing media, including comparative analysis and identification of common patterns inherent in the media of different countries.
One of the directions of further research of the corpus is the analysis of publication activity related to individual organizations, topics and events, for example, healthcare and the COVID-19 pandemic.
