KazFakeCorpus: A Bilingual Corpus with Multi-Level Semantic Annotation for Fake News Detection

Lamasheva, Zhanar; Nekessova, Anargul; Kantureyeva, Mansiya; Sambetbayeva, Madina; Kaldarova, Mira; Nazymkhan, Aksaule

doi:10.3390/bdcc10060183

by Zhanar Lamasheva ^1,2, Anargul Nekessova ^1,2,3, Mansiya Kantureyeva ^2,*, Madina Sambetbayeva ^1,2,3,*, Mira Kaldarova ^1,3 and Aksaule Nazymkhan ^1,2

¹ International Science Complex Astana, Astana 010017, Kazakhstan

² Institute of Digital Sciences and Artificial Intelligence, L.N. Gumilyov Eurasian National University, 2 Satpayev Str., Astana 010008, Kazakhstan

³ School of Information Technology and Engineering, Astana International University, Astana 010017, Kazakhstan

* Authors to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(6), 183; https://doi.org/10.3390/bdcc10060183

Submission received: 23 March 2026 / Revised: 17 May 2026 / Accepted: 19 May 2026 / Published: 1 June 2026

(This article belongs to the Section Data Mining and Machine Learning)

Abstract

This paper addresses the lack of bilingual annotated resources for automatic fake news detection in the Kazakh–Russian media space, as well as the limitations of binary annotation, which does not always allow disinformation to be represented as a complex and interpretable phenomenon. The aim of the study is to develop KazFakeCorpus and propose a multi-level annotation scheme that captures not only the final veracity of a message, but also the type of fake content, the disinformation technique, communicative intent, modality, and the characteristics of the source and evidence base. The corpus was constructed on the basis of official news materials published on the Gov.kz portal for the REAL class and synthetically generated messages for the FAKE class, complemented by an external validation set of authentic fake news from independent fact-checking sources to assess generalization. After data collection, the texts underwent cleaning, normalization, balancing, and sampling. The final resource includes 4276 texts in Kazakh and Russian, with an average length of approximately 200 words and a balanced distribution across languages and classes. Annotation was carried out in the Label Studio environment by two independent experts: a linguist and a fact-checking specialist. Before the main annotation phase, a pilot study was conducted on a subsample of 120 texts, the results of which were used to refine the categories and prepare the annotation guidelines. Krippendorff’s alpha was used to assess inter-annotator agreement; the obtained values, ranging from 0.79 to 0.88, indicate sufficient stability of the annotation across the key categories. The corpus analysis showed that misattribution (32.5%) is the most frequent disinformation technique, followed by clickbait (23.0%) and emotional pressure (16.4%). 
The results show that the proposed scheme makes it possible to treat fake news not only as a binary class but also as a multi-level semantic object that includes mechanisms of information distortion and features of content presentation. The practical contribution of the study lies in the creation of a bilingual corpus and annotation protocol that can be used in disinformation research, interpretable text analysis, and cross-lingual studies.

Keywords:

fake news detection; bilingual corpus; multi-level semantic annotation; disinformation analysis; interpretable machine learning

1. Introduction

The spread of digital media and social platforms has significantly transformed the structure of news production and consumption. The high speed at which content is published and replicated in the online environment is accompanied by a growing volume of false information, including disinformation and fake news [1,2]. Such materials can influence public opinion, shape distorted perceptions of socially significant processes, and complicate decision-making at both individual and institutional levels [2,3]. In this context, the automatic detection of fake news has become one of the most relevant research areas in natural language processing and data mining [1].

Modern approaches to fake news detection increasingly rely on computational methods; however, their effectiveness is largely determined by the quality and representativeness of the training data. Corpus resources play a key role in the development and validation of disinformation detection models by providing a formalized representation of texts and their annotation according to specified criteria [1,4,5]. However, most existing datasets focus on English and other widely used languages [1], whereas the number of specialized annotated datasets for low-resource languages remains limited [6,7,8].

This problem is especially relevant to the Kazakh language, for which resources for disinformation analysis remain fragmented [9]. The Russian-language segment of the media space operating in a multilingual environment is also characterized by a limited number of specialized corpora that take regional features of information exchange into account. At the same time, the media environment of Kazakhstan is marked by the active use of both Kazakh and Russian in official and unofficial sources, creating a bilingual information context. The absence of bilingual annotated corpora makes it difficult to develop models that can function effectively in a mixed-language space and account for cross-linguistic features of disinformation dissemination [10,11].

Most existing studies focus on the binary classification of news according to veracity (REAL/FAKE). This approach limits the possibility of analyzing the internal structure of disinformation and does not reflect the diversity of information distortion types and techniques [12]. In addition, binary classification often provides limited support for the interpretability of automatic decisions [13,14]. In a multilingual media environment, there is a need to move beyond simple classification toward modeling the discursive structure of news reports, including claims, sources, evidence, context, and presumed communicative intentions [2,13,15].

This study examines the problem of automatic fake news detection in Kazakh and Russian using multi-level semantic annotation of texts [16]. Particular attention is paid to the creation of a bilingual corpus resource and the development of a structured model for representing disinformation aimed at increasing the interpretability of the analysis [13]. This approach is intended to address the existing research gap associated with the shortage of bilingual corpora and the limitations of binary classification models for low-resource languages [9,10].

Accordingly, this study aims to answer the following research questions:

RQ1: How can a multi-level semantic annotation scheme improve the representation of fake news compared to traditional binary labeling?

RQ2: How can a structured multi-level annotation scheme capture different dimensions of disinformation, including content type, techniques, and contextual characteristics?

RQ3: What opportunities does a bilingual corpus with a unified annotation scheme provide for the analysis of fake news in a Kazakh–Russian media environment?

The remainder of the article is organized as follows. Section 2 reviews related work and identifies the research gap. Section 3 describes the construction of KazFakeCorpus, including data sources for the REAL class, the generation and balancing of FAKE examples, and the preprocessing steps. Section 4 outlines the annotation methodology and presents the multi-level annotation scheme. Section 5 presents corpus analysis, experimental setup, baseline detection experiments, ablation analysis, and generalization tests. Section 6 discusses the findings, Section 7 addresses the limitations of the study, and Section 8 concludes the paper and outlines directions for future work.

2. Related Work and Research Gap

2.1. Corpora and Data Sources

Recent research increasingly emphasizes a data-centric perspective on fake news detection, showing that model performance depends not only on algorithmic design but also on dataset quality, diversity, labeling schemes, and bias control [17]. Most early resources for fake news detection were built either around news publications and metadata or around social media content. One of the most widely used resources is FakeNewsNet, which combines news texts with metadata and signals of information dissemination on social networks [18]. In parallel, corpus initiatives were developed for individual languages, including POLygraph for Polish [19] and MCFEND, a multi-source benchmark dataset for Chinese fake news detection [20].

A separate line of research concerns corpora created for related tasks, which are often used as additional sources of signals in fake news detection. Such resources include, for example, Emergent, used for stance detection [4], as well as corpora for political fact-checking and graded assessments of statement veracity [21].

In recent years, there has been a trend toward expanding data sources. In addition to news texts themselves, researchers increasingly use user comments and discussions on social networks, since such data contain audience reactions and indicators of doubt or support. Approaches to building corpora based on comments and signals from social platforms are discussed, for example, in [22]. However, the use of such sources complicates dataset comparability because of differences across platforms, genres, and language norms.

2.2. Annotation Levels: From Binary Labels to Typologies and Techniques

In many studies, the annotation of news texts remains binary (REAL/FAKE), which simplifies classifier training and facilitates comparison across models [5]. However, this approach limits analysis of the internal structure of disinformation. The transition to more meaningful typologies is associated with the concept of information disorder—misinformation, disinformation, and malinformation—and with classifications of forms of content distortion proposed by Wardle and Derakhshan [12]. Attempts to formally define the concept of fake news and systematize its interpretations are also presented in the work of Tandoc et al. [3].

For applied annotation, it is important to capture manipulative techniques used in information dissemination. Closely related tasks are addressed in propaganda studies, where different techniques of influence are distinguished. For example, within the SemEval task on propaganda techniques, a scheme for annotating manipulative strategies was proposed [23]. Studies of this kind show that technique-level annotation can be useful for model training, but it requires clear guidelines and careful control of annotation consistency.

Alongside corpora designed for binary classification of news texts, there are also more structured resources in the related tasks of fact-checking and claim verification. For example, the LIAR corpus contains short political statements with multiple levels of veracity as well as additional metadata about the source, context, and speaker [24]. The FEVER corpus focuses on claim verification using external evidence and includes links between the claim, supporting evidence, and final assessment [25].

However, such datasets mainly focus on individual claims rather than full news texts and are most often created for English-language material. In addition, they usually emphasize either degrees of veracity or relations between claims and evidence, but do not combine within a single scheme the type of fake content, disinformation technique, authorial intent, source characteristics, and modality. In this respect, KazFakeCorpus differs by offering a multi-level annotation scheme for full news texts in Kazakh and Russian.

To compare existing resources with the proposed KazFakeCorpus, Table 1 presents the key characteristics of the datasets in terms of language, data source, and annotation level.

As shown in Table 1, most existing resources remain monolingual and rely on binary annotation, whereas bilingual corpora with multi-level semantic annotation for the Kazakh- and Russian-language media space are virtually absent.

2.3. Detection Models, Synthetic Data, and Hybrid Approaches

With regard to fake news detection methods, previous research shows that model performance depends not only on algorithmic design but also on the domain, data sources, and annotation scheme [2]. Therefore, in the context of the present study, the focus is placed not on model architecture, but on corpus design, annotation quality, and the potential risks associated with synthetic data.

Recent work has also focused on creating specialized datasets for more fine-grained analysis of fake news. For example, FineFake introduced a knowledge-enriched dataset for fine-grained fake news classification across different domains [28].

Multimodal fake news detection has emerged as an active research direction, in which textual and visual cues are analyzed jointly to improve detection accuracy [29]. Such approaches demonstrate that images and videos frequently accompany disinformation content and carry independent evidential value. While the current version of KazFakeCorpus focuses on text, the annotation scheme is designed to accommodate multimodal extensions in future work.

Recent studies also emphasize the methodological risks of using synthetic and LLM-generated text data. Synthetic text generation may introduce dataset-specific artifacts and stylistic cues that affect model generalization [30]. In addition, LLM-generated text detection research shows that generated texts often contain identifiable linguistic patterns, which may influence downstream classification tasks [31]. In fake news detection, LLM-based data augmentation can be useful, but it requires careful validation to avoid training models on generation-specific patterns rather than genuine misinformation markers [32].

2.4. Interpretable Modeling of the “Claim–Evidence–Source” Structure

One of the limitations of many classification approaches is the lack of transparent explanations for model decisions. For this reason, growing attention has been paid to explainable fake news detection, in which individual claims are analyzed, evidence is extracted, and model decisions are accompanied by interpretable justifications.

Such approaches are discussed, for example, in studies on statement veracity assessment and the construction of explainable models for disinformation detection [13]. A number of studies also examine the correctness and faithfulness of generated explanations [33]. In the context of large language models, methods for improving decision transparency and generating explanations based on text structure are also being explored [14].

Recent studies have also explored graph-based approaches for modeling evidence and source relations. For example, Gu et al. [34] propose a source-aware heterogeneous evidence graph for multi-source fact verification, which enables structured reasoning over claims, sources, and evidence.

These studies provide the foundation for approaches in which a corpus should contain not only a final veracity label but also the structural elements that enable explanation building—for example, the source, evidence, the author’s presumed intention, and the manipulation techniques used.

2.5. Low-Resource and Multilingual Environments

For low-resource languages, the problem is typically twofold: data scarcity and weak transferability of models across languages and domains. In the case of Kazakh, existing studies point to a shortage of corpora and of systematic annotation, which limits the development of applied NLP tasks [9]. Related studies have introduced datasets for Urdu [6], Bengali [26], Kurdish [7], and Romanian [27], as well as methods designed for low-resource settings, including hybrid solutions [8].

Multilingual and cross-lingual scenarios are viewed as a way to partially compensate for data scarcity: models are trained in one language and transferred to another, or trained jointly on multiple languages [10,11]. However, even with multilingual models, a practical problem remains: the lack of comparable annotation schemes and datasets in which both languages are labeled consistently.

2.6. Research Gap

Although fact-checking and claim verification datasets already include more detailed annotation structures that capture links between claims, evidence, and final assessments [13,14], most existing fake news corpora remain monolingual, binary-labeled, or focused on individual claims rather than full news texts [3,5].

For the Kazakh- and Russian-language context, bilingual fake news corpora with unified multi-level annotation remain scarce [9,10,11]. Existing resources do not sufficiently combine full-document analysis, veracity labels, disinformation techniques, source characteristics, evidence-related features, and modality within a single framework.

KazFakeCorpus addresses this gap by providing a bilingual Kazakh–Russian corpus with multi-level semantic annotation. The proposed structure supports not only REAL/FAKE classification but also a more interpretable analysis of disinformation mechanisms, source-related characteristics, and evidence-related features.

3. KazFakeCorpus: Data Collection

This section describes the data sources and the procedure used to construct KazFakeCorpus. The corpus was compiled manually, followed by expert verification of each included item.

3.1. Sources of Real News

Official news materials published on the Gov.kz portal, specifically in the sections of the Ministry of Digital Development, Innovations and Aerospace Industry of the Republic of Kazakhstan, were used to construct the core corpus.

The sample included publications devoted to digitalization, artificial intelligence, IT infrastructure development, and state digital initiatives. Each text was preserved in its original form together with metadata including the publication date, source, and thematic category.

The choice of the Gov.kz portal as the primary source for the REAL class was motivated by the need to ensure high data reliability and uniform editorial standards at the corpus construction stage. This approach is commonly used in the development of corpus resources for disinformation analysis and fact-checking tasks [3,17].

However, reliance on a single primary source may introduce thematic and stylistic bias into the corpus. To improve source and domain diversity, an additional REAL collection was assembled beyond the Gov.kz portal. This additional material supplements the core corpus rather than replacing it and is used primarily in the generalization experiments reported in Section 5.9. The main corpus statistics reported in Table 2 refer only to the core Gov.kz-based collection and its paired synthetic FAKE component.

Table 2 presents the distribution of texts by language and veracity class, as well as the average number of words in each subset.

The corpus is balanced both with respect to the REAL/FAKE labels and across languages (Kazakh and Russian), which makes it possible to minimize the effect of data imbalance on the training and evaluation of automatic classification models.

The additional REAL collection includes three categories of sources selected to balance editorial reliability with topical and stylistic diversity. The first category consists of official government publications and remains anchored on Gov.kz: in addition to the Ministry of Digital Development, it covers the Ministry of Health, the Ministry of National Economy, the Ministry of Information and Social Development, and selected regional akimat portals. The second category consists of established Kazakhstani news agencies and independent media that maintain editorial fact-checking, including Kazinform, KazTAG, Tengrinews, and Informburo.kz. The third category consists of bilingual public service outlets such as Khabar 24 and Atameken Business, which publish parallel Kazakh- and Russian-language materials and therefore support the bilingual organization of the corpus.

The thematic coverage of the additional collection spans five domains: (i) digitalization and technology (the focus of the core corpus), (ii) public health and epidemiology, (iii) economy and finance, (iv) domestic politics and public administration, and (v) social affairs, including education and labour. Domain labels were assigned at the document level using the editorial section provided by the source, followed by manual verification by the annotators. This information is recorded as an additional metadata field (domain) and is used in the cross-domain generalization experiments reported in Section 5.9.

Texts were sampled to maintain approximate balance across domains and languages, and source-level deduplication was performed to prevent any single outlet from dominating the additional collection. Together with the authentic fake news described in Section 3.3.1, this additional material makes the evaluation more representative of the bilingual Kazakh–Russian news space and reduces the editorial-style bias that may arise from relying on a single ministry portal.

3.2. Data Cleaning and Preprocessing

After the data collection stage, the news materials were preprocessed to improve corpus quality and reduce noise. The cleaning procedure included the removal of duplicate entries, hyperlinks, special characters, and other irrelevant elements that did not carry substantial semantic content.

In addition, text normalization was performed to ensure a consistent format across Kazakh- and Russian-language materials and to reduce orthographic and structural variability. This approach increased data homogeneity and supported the consistency of subsequent annotation.
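
The cleaning and normalization steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used for the corpus: the regular expressions, the choice of NFC normalization, the case-insensitive duplicate key, and the function names `clean_text` and `deduplicate` are all assumptions introduced for the example.

```python
import re
import unicodedata

# Assumed patterns; the paper does not specify the exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep letters, digits, whitespace and common punctuation; drop other symbols.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?%()«»\"'\-–№/]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, strip hyperlinks and stray symbols, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(texts: list[str]) -> list[str]:
    """Drop empty texts and exact duplicates (after cleaning), keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in texts:
        cleaned = clean_text(raw)
        key = cleaned.casefold()  # case-insensitive comparison (assumption)
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Because `\w` is Unicode-aware in Python 3, the same rules apply to Kazakh-specific Cyrillic letters and to Russian text without language-specific branches.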

Only texts containing verifiable factual statements and socially significant public information were included in the corpus and auxiliary evaluation sets. Advertising materials, personal narratives, purely opinion-based texts, and irrelevant publications were excluded during the filtering stage.

Such preprocessing procedures are standard practice in the construction of text corpora for news analysis and misinformation detection tasks [3].

3.3. Construction of the FAKE Class

The FAKE class is built primarily from synthetically generated fake news, produced with the controlled procedure described below. To complement this synthetic core, an additional component of authentic fake news was collected from external sources verified by fact-checking organizations. This authentic component is provided as an additional set alongside the main corpus rather than being merged into the statistics reported in Table 2; a class flag (fake_origin ∈ {authentic, synthetic}) distinguishes the two so that researchers can study, train on, or evaluate against either subset independently.
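
One way to represent the class flag in code is shown below. Only the field name `fake_origin` and its two values come from the text; the record layout, the language codes, and the helper `select_fake` are hypothetical and serve only to show how the two FAKE subsets could be kept separable.

```python
from dataclasses import dataclass
from typing import Optional

FAKE_ORIGINS = {"authentic", "synthetic"}


@dataclass(frozen=True)
class CorpusItem:
    text: str
    language: str               # "kk" or "ru" (codes are an assumption)
    label: str                  # "REAL" or "FAKE"
    fake_origin: Optional[str]  # "authentic" or "synthetic"; None for REAL items

    def __post_init__(self) -> None:
        # Enforce that the flag is present exactly when the item is FAKE.
        if self.label == "FAKE" and self.fake_origin not in FAKE_ORIGINS:
            raise ValueError("FAKE items need fake_origin in {authentic, synthetic}")
        if self.label == "REAL" and self.fake_origin is not None:
            raise ValueError("REAL items must not carry fake_origin")


def select_fake(items: list[CorpusItem], origin: str) -> list[CorpusItem]:
    """Return only the FAKE items of the requested origin."""
    return [it for it in items if it.label == "FAKE" and it.fake_origin == origin]
```

With such a flag, a model can be trained on `select_fake(items, "synthetic")` and evaluated on `select_fake(items, "authentic")` without re-splitting the data.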

3.3.1. Authentic Fake News Collection

Authentic fake news items were collected from publicly accessible fact-checking platforms and from social media posts that were subsequently refuted by such platforms. The principal Kazakhstani sources used are Factcheck.kz and Stopfake.kz, supplemented by region-specific debunks published by EurasiaNet’s Kazakhstan desk and by the fact-checking sections of Tengrinews and Informburo.kz. For each fact-checked claim, the original misleading statement and, where available, the surrounding post or article were retrieved, together with the corresponding refutation. Items were retained only when (a) the refutation explicitly classified the claim as false, misleading, or fabricated, and (b) the original text could be obtained in either Kazakh or Russian.

Collection covered material published between January 2021 and June 2025. Each item was manually inspected and re-annotated by the same expert annotators who labelled the rest of the corpus, using the multi-level scheme described in Section 4. Items containing personally identifiable information about private individuals, items consisting solely of multimedia content, and items whose textual portion fell below 40 words were excluded. After filtering, the authentic FAKE set contains 612 texts in Kazakh and 658 texts in Russian, spanning the five domains introduced in Section 3.1.
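
The retention and exclusion criteria stated above can be expressed as a single predicate. The sketch below assumes a hypothetical `Candidate` record whose flags (`has_private_pii`, `multimedia_only`) would be set during manual inspection, and it counts words by whitespace splitting, which is an assumption about how the 40-word threshold was measured.

```python
from dataclasses import dataclass

ACCEPTED_VERDICTS = {"false", "misleading", "fabricated"}
ACCEPTED_LANGUAGES = {"kk", "ru"}  # Kazakh, Russian (codes are an assumption)
MIN_WORDS = 40


@dataclass
class Candidate:
    text: str
    language: str
    verdict: str                   # verdict assigned by the fact-checking source
    has_private_pii: bool = False  # set by a human reviewer
    multimedia_only: bool = False  # set by a human reviewer


def is_retained(c: Candidate) -> bool:
    """Apply criteria (a) and (b) plus the three exclusion rules."""
    if c.verdict.lower() not in ACCEPTED_VERDICTS:
        return False
    if c.language not in ACCEPTED_LANGUAGES:
        return False
    if c.has_private_pii or c.multimedia_only:
        return False
    # Items whose textual portion falls below 40 words are excluded.
    return len(c.text.split()) >= MIN_WORDS
```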

The inclusion of authentic fake news materially broadens the empirical profile available for evaluation. Authentic items are stylistically heterogeneous, exhibit shorter average length, contain a wider variety of pragmatic registers (including colloquial and social-media phrasing), and reference real events, individuals, and organizations more frequently than the synthetic items. These characteristics make the authentic set particularly important for the cross-domain generalization experiments and for the error analysis presented in Section 5.7 and Section 5.9.1.

3.3.2. Synthetic Fake News Generation

Synthetic generation remains the primary basis of the FAKE class for two reasons. First, authentic fake news collected from Kazakhstani fact-checkers remains limited in volume, particularly in Kazakh, and synthetic generation provides controlled coverage of disinformation techniques that are underrepresented in the authentic set (notably, downplaying and exaggeration). Second, the synthetic procedure produces examples that are parameterized by the annotation scheme, which supports controlled evaluation of individual disinformation techniques. The authentic items collected in Section 3.3.1 are used together with the synthetic data in the joint-training and cross-origin experiments reported in Section 5.9.3.

One of the methodological challenges in constructing KazFakeCorpus was class imbalance caused by the limited number of verified fake news examples available in official sources. Since reliable news reports are published far more frequently than refuted or intentionally misleading ones, it proved difficult to create a balanced training sample based solely on authentic data.

To address this issue, a controlled procedure for generating synthetic examples of fake content was employed. As the base generation model, ChatGPT (GPT-5) was used to transform reliable news texts into synthetic examples exhibiting features of disinformation.

REAL texts served as the source context (seed texts). Generation was performed using a system prompt (Box 1), taking into account the parameters of the corpus annotation scheme. This approach made it possible to vary both the type of fake message and the disinformation technique used while preserving the general thematic structure of the source text.

Box 1. Example of a system prompt for generating synthetic news messages.

Input: real news text T
Instruction: Analyze the factual structure of T. Select relevant FAKE_type categories. Select corresponding disinformation_TECHNIQUE categories. Apply controlled transformation while preserving thematic coherence. Perform expert validation.
Output format: synthetic training example T′.
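
To make the parameterization concrete, the following sketch assembles a generation request from a seed text and the selected annotation categories. It does not call any model and does not reproduce the authors' actual prompt: the function `build_generation_request`, the prompt wording, and the language codes are illustrative assumptions. The five technique labels are those reported for the corpus; the set of `fake_type` values is not listed in this part of the article, so it is left as a free parameter.

```python
# Technique labels as reported in the corpus analysis.
TECHNIQUES = (
    "misattribution",
    "clickbait",
    "emotional_pressure",
    "downplaying",
    "exaggeration",
)
LANGUAGE_NAMES = {"kk": "Kazakh", "ru": "Russian"}  # codes are an assumption


def build_generation_request(
    seed_text: str, fake_type: str, technique: str, language: str
) -> dict:
    """Combine a REAL seed text with target categories into one request record."""
    if not seed_text.strip():
        raise ValueError("seed_text must not be empty")
    if technique not in TECHNIQUES:
        raise ValueError(f"unknown technique: {technique}")
    if language not in LANGUAGE_NAMES:
        raise ValueError(f"unsupported language: {language}")
    # Illustrative wording only; it mirrors the steps listed in Box 1.
    prompt = (
        f"Input: real news text T (in {LANGUAGE_NAMES[language]}).\n"
        "Analyze the factual structure of T.\n"
        f"Target fake_type: {fake_type}.\n"
        f"Target disinformation_technique: {technique}.\n"
        "Apply a controlled transformation while preserving thematic coherence.\n"
        "Output: synthetic training example T'.\n\n"
        f"T: {seed_text.strip()}"
    )
    return {
        "seed_text": seed_text,
        "fake_type": fake_type,
        "technique": technique,
        "language": language,
        "prompt": prompt,
    }
```

Storing the target categories next to the prompt keeps each synthetic text traceable to the parameters it was generated with, which the subsequent expert review can then confirm or correct.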

After generation, each synthetic text underwent expert review. The annotators evaluated textual coherence, linguistic correctness, and compliance with the selected categories of the annotation scheme. Texts containing direct copying from the source material, logical inconsistencies, or obvious language errors were excluded from the corpus.
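
The expert review was manual, but the "direct copying" criterion lends itself to an automatic pre-check that could flag candidates for the annotators. The sketch below measures the share of word 5-grams in a generated text that also occur in its seed text. The n-gram size and any cut-off applied to the score are assumptions, not values reported in the study.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of case-folded word n-grams in a text."""
    tokens = text.casefold().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_share(seed: str, generated: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams that also appear in the seed text."""
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0  # text shorter than n words: nothing to compare
    shared = generated_ngrams & word_ngrams(seed, n)
    return len(shared) / len(generated_ngrams)
```

A score near 1.0 suggests that the generated text largely reproduces the seed, whereas a score near 0.0 indicates substantial rewording. Such a score does not replace the checks for coherence and category compliance, which require human judgment.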

This multi-stage quality-control procedure helped minimize generation artifacts and improve the linguistic naturalness of the synthetic examples. However, synthetic data may still differ from organically disseminated disinformation, and this remains one of the limitations of the study.

To illustrate the differences between authentic and synthetic news fragments, Table 3 presents examples of REAL and FAKE texts in Kazakh and Russian. To improve readability and accessibility for an international audience, the table includes shortened excerpts of the original texts in Kazakh and Russian, while the full examples are provided in English translation. The complete original-language versions are included in Appendix A. Each example is explicitly labeled by language to ensure clarity, and typographic consistency has been maintained across scripts.

The analysis of the presented examples shows that reliable fragments (REAL) are characterized by a more structured presentation of information, the use of institutional vocabulary, and references to official initiatives (for example, “representatives of the Ministry reported on the development of the region’s digital infrastructure” and “the development of an IT hub is planned”). In contrast, the generated fake texts (FAKE) contain markers of uncertainty (“may,” “supposedly,” “suggest”), focus on potential risks, and often refer to the opinions of social media users without confirmation from official sources (for example, “there are statements on social networks…” and “some users associate the development of digital infrastructure with possible restrictions”).

Thus, the differences between the categories are manifested not only at the thematic level but also in the pragmatic and discursive characteristics of the texts. These features can be used as indicators for training automatic fake news classification models.

This generation approach made it possible not only to create a binary REAL/FAKE distinction but also to introduce specific linguistic markers of manipulation into synthetic texts corresponding to the second and third levels of corpus annotation.

3.4. Corpus Balancing

To ensure transparency of the corpus structure and its reproducible use in experimental studies, the main statistical characteristics of KazFakeCorpus were calculated and reported in Table 2.

Strict quantitative equivalence between classes reduces the risk of classifier bias toward the majority class and makes the corpus better suited to model training, comparable model evaluation, and interpretable analysis of the results. Synthetic texts were integrated into the corpus as a full component of the dataset and were used within the general experimental procedure in accordance with the data-splitting protocol. The division into training, validation, and test sets was performed after the complete corpus had been constructed, thereby ensuring a unified statistical context for all experiments.
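
A split performed on the completed corpus can preserve the language and class balance if it is stratified. The sketch below is one possible implementation using only the standard library. The 80/10/10 ratios, the random seed, and the function name `stratified_split` are assumptions, since the splitting protocol itself is not specified in this section.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T")


def stratified_split(
    items: Sequence[T],
    key: Callable[[T], Hashable],
    ratios: tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 42,
) -> tuple[list[T], list[T], list[T]]:
    """Split items into train/validation/test, preserving the share of each stratum."""
    rng = random.Random(seed)
    strata: dict[Hashable, list[T]] = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    train: list[T] = []
    val: list[T] = []
    test: list[T] = []
    for stratum_key in sorted(strata, key=repr):  # fixed order for reproducibility
        group = strata[stratum_key]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to the test set
    return train, val, test
```

Using `key=lambda item: (item["language"], item["label"])` yields four strata (Kazakh/Russian by REAL/FAKE), so each subset inherits the balance of the full corpus.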

KazFakeCorpus is a bilingual collection of news texts in Kazakh and Russian devoted to artificial intelligence and digital technologies. Each text unit is accompanied by a final veracity label (REAL/FAKE), as well as multi-level semantic annotation, including the type of fake content (fake_type), the disinformation technique (disinformation_technique), and additional contextual parameters.

As shown in Figure 1, misattribution (32.5%) is the most represented category, reflecting the frequency of distortions associated with incorrect source attribution or the false assignment of statements. A substantial proportion is also accounted for by clickbait (23.0%) and emotional_pressure (16.4%), which indicates the prevalence of manipulative headlines and emotionally charged formulations. Downplaying (14.2%) and exaggeration (13.9%) are less frequent but still contribute to the diversity of disinformation strategies.
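
As a quick consistency check, the five reported shares sum to 100%, which indicates that each FAKE text carries exactly one technique label in this tally. The sketch below verifies the sum and shows how such shares could be recomputed from a list of labels. The helper `technique_shares` is hypothetical, and the underlying label counts are not reproduced here.

```python
from collections import Counter

# Shares reported for the disinformation techniques (in percent).
REPORTED_SHARES = {
    "misattribution": 32.5,
    "clickbait": 23.0,
    "emotional_pressure": 16.4,
    "downplaying": 14.2,
    "exaggeration": 13.9,
}

# The reported percentages should add up to 100 within rounding error.
assert abs(sum(REPORTED_SHARES.values()) - 100.0) < 1e-6


def technique_shares(labels: list[str]) -> dict[str, float]:
    """Percentage of each technique label, most frequent first, rounded to 0.1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.most_common()}
```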

The resulting distribution shows that the synthetic generation process covers several types of information distortion in accordance with the annotation scheme. This makes it possible to use the corpus both for binary veracity classification and for a more fine-grained analysis of disinformation techniques.

4. Annotation Methodology

4.1. Corpus Construction Workflow

Figure 2 illustrates the generalized architecture of the KazFakeCorpus construction process, which integrates the stages of data collection, preprocessing, annotation, and structuring into a unified workflow.

The data collection stage used reliable news materials from the Gov.kz portal, supplemented with synthetically generated examples of fake news in order to balance the classes. As a result, a set of unstructured data (raw data) was formed.

The data preprocessing phase involved cleaning, filtering, and normalizing the texts, which made it possible to create a prepared dataset (processed data). Such preprocessing procedures are widely used in the construction of corpus resources for news analysis and misinformation detection [3].

At the annotation stage, the prepared texts were transferred to the Label Studio platform, where experts performed multi-level annotation of news reports in accordance with the developed annotation scheme. The use of specialized annotation tools is a common practice in the construction of corpus resources for natural language processing tasks [4].

At the final structuring stage, the annotated data were organized and stored within a unified structure. The result of this process was KazFakeCorpus, a bilingual annotated resource for research on automatic fake news detection.

4.2. Multi-Level Annotation Scheme

The annotation scheme of KazFakeCorpus (Figure 3) was developed on the basis of modern theoretical models of disinformation analysis and the principles of corpus construction for automatic fake news detection. Unlike traditional binary annotation, the proposed model implements a multi-level semantic approach that makes it possible to account for the structural, pragmatic, and contextual characteristics of a news message.

In developing the scheme, existing theoretical and applied models of disinformation analysis were taken into account. For example, Wardle and Derakhshan proposed the concept of information disorder, distinguishing misinformation, disinformation, and malinformation [12]. Tandoc and co-authors, in turn, developed a typology of fake news based on the nature of information distortion and the communicative purposes of publication [3].

In the field of computational linguistics, annotation schemes have also been proposed to identify manipulative strategies in media texts. For example, within the SemEval task on propaganda technique detection, an annotation framework for specific information-manipulation strategies was introduced [23]. Such studies show that annotating not only the final veracity label but also the mechanisms of information distortion significantly increases the analytical value of corpus resources.

The proposed KazFakeCorpus scheme extends these approaches by combining several analytical levels within a single annotation structure. In addition to the final veracity label, it includes the type of fake content, the disinformation technique, the author’s communicative intent, and the characteristics of the evidence base. This structure allows for a more complete modeling of the internal organization of disinformation and provides a foundation for the development of interpretable fake news detection models.

The multi-level organization of the annotation scheme formalizes the key dimensions of news-message analysis:

final veracity of the message (REAL_or_FAKE);
type of fake content (fake_type);
disinformation dissemination technique (disinformation_technique);
presumed communicative intent of the author (author_intent);
characteristics of the source and evidence base (source_type, evidence, source_credibility);
modality of content presentation (modality).

A news message is therefore treated not as a flat text object but as a set of interrelated semantic components reflecting different aspects of information distortion.
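The levels listed above can be illustrated as a single record type. The following is a minimal sketch, not the authors' storage format: the class name `NewsAnnotation`, the field name `real_or_fake`, the language codes, and the rule that FAKE-block fields stay empty for REAL items are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Fields that belong to the FAKE_NEWS block of the scheme.
FAKE_BLOCK_FIELDS = ("fake_type", "disinformation_technique", "author_intent")


@dataclass
class NewsAnnotation:
    """One annotated text; optional fields mirror the scheme's category names."""
    text: str
    language: str                                  # e.g. "kk" or "ru" (assumed codes)
    real_or_fake: str                              # "REAL" or "FAKE"
    fake_type: Optional[str] = None
    disinformation_technique: Optional[str] = None
    author_intent: Optional[str] = None
    source_type: Optional[str] = None
    evidence: Optional[str] = None
    source_credibility: Optional[str] = None
    modality: str = "text"                         # current corpus is text-only

    def __post_init__(self) -> None:
        if self.real_or_fake not in ("REAL", "FAKE"):
            raise ValueError("real_or_fake must be 'REAL' or 'FAKE'")
        # Assumed consistency rule: REAL items carry no FAKE-block labels.
        if self.real_or_fake == "REAL":
            filled = [f for f in FAKE_BLOCK_FIELDS if getattr(self, f) is not None]
            if filled:
                raise ValueError(f"FAKE-block fields set on a REAL item: {filled}")


example = NewsAnnotation(
    text="Example text",
    language="ru",
    real_or_fake="FAKE",
    fake_type="false context",
    disinformation_technique="misattribution",
    author_intent="attention capture",
)
```

Keeping all levels in one record makes it straightforward to select any subset of labels as targets in later experiments.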

4.2.1. Ontological Structure of the Annotation Scheme

The annotation scheme of KazFakeCorpus can be interpreted as a conceptual ontological model describing the main structural components of news reports and the semantic relations among them. Within this model, each news text is treated as a structured object consisting of interrelated entities and attributes that reflect information reliability and mechanisms of disinformation dissemination.

This approach is consistent with structured knowledge representation methods used in fake news detection and media content analysis [3,4]. The central entity of the annotation scheme is NEWS, which represents an individual news message. This entity links the main annotation components, including the final veracity label (REAL_or_FAKE), fake_type, disinformation_technique, author_intent, source_type, evidence, and source_credibility.

The resulting structure provides a systematic description of disinformation indicators and facilitates the analysis of semantic relations among news-content elements. It may also serve as a basis for future knowledge-graph extensions, which are increasingly used in media content analysis and automatic fake news detection tasks [9,28]. The architecture of the proposed annotation scheme and the relationships among its main categories are presented in Figure 3.

4.2.2. REAL NEWS Block

The REAL_NEWS block includes parameters that reflect the mechanisms used to confirm the reliability of a news message. Within the annotation scheme, this component is represented by two categories: evidence, which records the presence of supporting data or references to verifiable sources, and source_credibility, which characterizes the degree of reliability of the information source.

The evidence category reflects the presence in the text of verifiable support, such as references to official documents, statistics, or expert commentary. The source_credibility category makes it possible to account for the institutional reliability of the publication source. This structure corresponds to modern approaches to automated fact-checking, in which the reliability of news messages is assessed through the verifiability of claims and the credibility of information sources [3].

Unlike the REAL_NEWS block, which describes the mechanisms for confirming information reliability, the FAKE_NEWS block is intended to model the structure and mechanisms of disinformation dissemination.

4.2.3. FAKE NEWS Block

The FAKE_NEWS block models the internal structure of a disinformation message and includes three key categories: fake_type, disinformation_technique, and author_intent.

The fake_type category describes the type of fake content and is based on the classification of forms of misinformation proposed by Wardle and Derakhshan [12], including fabricated content, misleading content, and false context. The disinformation_technique category captures specific strategies of information manipulation and draws on typologies of fake news and manipulative communication, including clickbait, emotional pressure, and misattribution [13]. The author_intent category reflects the presumed communicative intention of the author and makes it possible to account for the pragmatic dimension of disinformation, including fear induction, political influence, and attention capture.

4.2.4. Modality and Semantic Relations

The annotation scheme also includes the modality parameter, which records the form of content presentation. In the current version of the corpus, this parameter is limited to textual content, but it is included to support future multimodal extensions, such as images, video, and audio.

The model also defines typed relations between annotation elements, for example: NEWS–CLASSIFIED_AS → REAL_or_FAKE, FAKE_NEWS–HAS_TYPE → fake_type, FAKE_NEWS–SPREADS_VIA → disinformation_technique, and REAL_NEWS–SUPPORTED_BY → evidence. These relations help represent news content as a structured set of veracity, source, evidence, and manipulation-related features.
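As an illustration of how these typed relations could be materialized, the sketch below stores the four relations named above as schema triples and instantiates them for one annotated item. The helper `instantiate`, the dictionary layout of an annotation, and the identifier `"n1"` are hypothetical and are not part of the published resource.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    head: str       # entity type the relation starts from
    relation: str   # relation name
    tail: str       # annotation field the relation points to


# The four relations named in the text.
SCHEMA_RELATIONS = [
    Relation("NEWS", "CLASSIFIED_AS", "REAL_or_FAKE"),
    Relation("FAKE_NEWS", "HAS_TYPE", "fake_type"),
    Relation("FAKE_NEWS", "SPREADS_VIA", "disinformation_technique"),
    Relation("REAL_NEWS", "SUPPORTED_BY", "evidence"),
]


def instantiate(news_id, annotation):
    """Turn one annotation dict into (subject, relation, value) triples."""
    block = "FAKE_NEWS" if annotation["REAL_or_FAKE"] == "FAKE" else "REAL_NEWS"
    triples = []
    for rel in SCHEMA_RELATIONS:
        value = annotation.get(rel.tail)
        # Keep relations of the generic NEWS entity and of the matching block only.
        if rel.head in ("NEWS", block) and value is not None:
            triples.append((news_id, rel.relation, value))
    return triples


triples = instantiate("n1", {
    "REAL_or_FAKE": "FAKE",
    "fake_type": "false context",
    "disinformation_technique": "misattribution",
})
```

Triples of this form can be loaded into a graph store, which is one possible route to the knowledge-graph extensions mentioned in Section 4.2.1.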

5. Corpus Analysis

5.1. Annotation Protocol

At the final stage of KazFakeCorpus construction, multi-level annotation of news reports was carried out using the Label Studio platform. The annotation was performed in accordance with the developed scheme and covered the key structural, semantic, and contextual characteristics of news texts. This approach makes it possible to account not only for the final veracity of a message but also for the mechanisms of disinformation dissemination and presentation.

In developing the annotation scheme, the authors relied on existing typologies of disinformation, in particular the classifications proposed in [3,12], in which false content is considered not only from the perspective of truthfulness but also with regard to the form of information distortion, the context of dissemination, and the communicative effect. Based on these approaches, the annotation categories were adapted to the tasks of bilingual analysis of news reports.

The annotation was performed by two independent experts with experience in media text analysis and fact-checking. The annotation team included a linguistics specialist and a fact-checking expert proficient in both Russian and Kazakh. Before the main annotation phase, a pilot study was conducted on a subsample of 120 texts in order to refine category definitions, identify ambiguous cases, and assess the clarity of the instructions. Following the pilot phase, individual definitions and labeling rules were revised and compiled into a unified document, Annotation Guidelines, which was then used during the main annotation phase.

The main annotation procedure involved several consecutive steps. First, a set of Kazakh- and Russian-language news texts was compiled, pre-cleaned, normalized, and balanced. Each text was then independently annotated by two experts across all levels of the annotation scheme, including the final veracity label, the type of fake content, the disinformation technique, the author’s communicative intent, and the message modality. In cases of disagreement, an additional adjudication procedure was conducted, during which disputed cases were discussed on the basis of the rules specified in the annotation guidelines. The final label was assigned only after consensus had been reached between the annotators. Thus, the final version of the corpus was formed on the basis of agreed decisions rather than the individual annotation of a single expert.

Although only two experts participated in the main annotation phase, the use of a pilot stage, unified annotation guidelines, and an adjudication procedure helped minimize annotation subjectivity. In future work, corpus expansion may involve additional annotators or a separate arbitrator in order to further improve annotation stability.

The overall workflow of the annotation process is shown in Figure 4. It reflects the sequence of stages in corpus construction, from the selection and preparation of news reports to their independent annotation and the formation of the final dataset version. Unlike one-level classification, the proposed approach assumes document-level analysis and makes it possible to capture multiple aspects of the structure of a disinformation message.

To assess annotation reliability, 15% of the corpus, selected at random, was re-annotated by two experts without prior agreement on the decisions. Krippendorff’s alpha [35] was used as a measure of inter-annotator agreement, as it is widely applied to evaluate the consistency of categorical annotation and allows for multiple classes and heterogeneous categories.

The obtained values indicate that agreement between the annotators remained high across the main levels of the scheme. For the REAL/FAKE category, the coefficient was α = 0.88; for fake_type, α = 0.84; for disinformation_technique, α = 0.81; and for modality, α = 0.79. These values indicate good reproducibility of the proposed annotation scheme, although the categories associated with the interpretation of the pragmatic characteristics of the message show slightly greater variability in expert judgments.
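For readers who wish to reproduce this type of agreement estimate, the following is a minimal sketch of Krippendorff's alpha for nominal labels, restricted to two annotators with no missing values. It is an illustrative implementation based on the standard coincidence-matrix formulation, not the authors' script, and the toy labels are invented.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha (nominal metric) for two annotators, no missing data."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same units")
    # Coincidence matrix: each unit contributes both ordered pairs of its two values.
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per category and overall number of pairable values.
    totals = Counter()
    for (category, _), count in coincidences.items():
        totals[category] += count
    n = sum(totals.values())
    observed = sum(c for (a, b), c in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a, b in permutations(totals, 2))
    if expected == 0:
        return float("nan")  # alpha is undefined when only one category occurs
    return 1.0 - (n - 1) * observed / expected


# Toy data (invented): the annotators disagree on one of four items.
annotator_1 = ["REAL", "REAL", "FAKE", "FAKE"]
annotator_2 = ["REAL", "REAL", "FAKE", "REAL"]
alpha = krippendorff_alpha_nominal(annotator_1, annotator_2)
```

The function would be applied separately to each annotation level, since each level has its own label set.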

The Krippendorff’s alpha values for the key annotation categories are presented in Table 4.

Taken together, these results confirm the reliability of the proposed annotation scheme and show that KazFakeCorpus can be used not only for binary news classification tasks but also for disinformation analysis at several interrelated levels. The presence of a formalized annotation scheme, detailed annotation guidelines, and an inter-annotator agreement assessment procedure ensures the reproducibility of the corpus and makes it suitable for further research on automatic fake news detection and cross-lingual media analysis.

In addition, a statistical analysis of the corpus was carried out, including an assessment of average text length, class distribution, and the frequency of different disinformation techniques. The results show that the corpus remains balanced both in terms of language (Kazakh and Russian) and veracity labels (REAL/FAKE), which reduces the risk of model bias toward the dominant class and increases the reliability of subsequent experimental studies.

To provide an estimate of statistical uncertainty, 95% confidence intervals were calculated for key proportions in the corpus, including the distribution of disinformation techniques and class balance. The obtained intervals indicate that the observed proportions remain stable within relatively narrow bounds, which reflects the controlled construction and balanced structure of the dataset.

In addition, the Kazakh and Russian subsets were compared to identify possible cross-linguistic variation. No major differences were observed in the overall distribution of classes and disinformation techniques across the two language subsets. This suggests that the annotation scheme was applied in a broadly consistent manner, although a more detailed statistical analysis of cross-linguistic variation remains a direction for future work.

These findings suggest that the observed distributions reflect systematic properties of the corpus design rather than random variation. However, the statistical analysis remains primarily descriptive, as the main objective of the study is corpus construction and validation rather than hypothesis-driven statistical modeling.

Figure 5 presents an example of multi-level annotation of a FAKE news text in the Label Studio environment. The interface illustrates the assignment of core annotation layers, including the final veracity label, fake content types, and contextual fields such as source type, temporal reference, and target entities. The example shows how annotation captures uncertainty markers and references to unverified sources, which are reflected in the selection of specific disinformation-related categories.

Figure 6 provides a complementary example of annotated news text, focusing on the identification of disinformation techniques and semantic relations between text fragments and annotation categories. The highlighted segments are linked to specific manipulation strategies, such as emotional pressure and misattribution, demonstrating how the annotation process captures fine-grained patterns of information distortion.

The presented examples demonstrate how the proposed annotation scheme is applied in practice and show its capacity to capture multiple dimensions of disinformation within a single text. The annotation process combines document-level labeling with the identification of specific linguistic and discursive markers, allowing news content to be represented as a structured set of interconnected elements.

The use of relation-based annotation makes it possible to explicitly link textual fragments to particular disinformation techniques and contextual parameters. This provides a more detailed representation of information distortion than approaches based solely on binary classification. As a result, the corpus supports not only the identification of fake news but also the analysis of how such content is constructed and presented.

From a modeling perspective, this structure creates the conditions for developing methods that operate at both global and local levels of text representation. The availability of fine-grained annotation and explicit semantic relations supports interpretability, since model decisions can be traced back to specific annotated components.

5.2. Distribution of Disinformation Techniques in the FAKE Subset

Figure 7 shows the distribution of disinformation techniques in the FAKE subset of KazFakeCorpus. The visualization reflects the number of texts assigned to each category of the annotation scheme, including clickbait, emotional pressure, exaggeration, downplaying, scapegoating, and misattribution.

As can be seen from the figure, the distribution of texts across categories remains relatively uniform. Each technique is represented by approximately the same number of examples, reflecting the controlled nature of the synthetic component of the corpus. This balancing helps reduce the risk of bias in the subsequent use of the corpus for analytical and experimental tasks and ensures comparability across categories within the multi-level annotation scheme.

The resulting distribution shows that the FAKE subset covers several different disinformation strategies rather than a single type of information distortion. This is important for subsequent analysis, since it makes it possible to consider fake news not only from the perspective of final veracity but also with regard to specific mechanisms of manipulation. Thus, Figure 7 illustrates the internal structure of the FAKE subset and confirms that the corpus was constructed with attention to the diversity of disinformation techniques.

Cross-Linguistic Comparison of Kazakh and Russian Subsets

To address cross-linguistic comparability, we conducted an analysis of disinformation techniques and selected linguistic features across the Kazakh and Russian subsets of KazFakeCorpus. Table 5 presents the distribution of disinformation techniques across languages.

As shown in Table 5, the distribution of disinformation techniques is broadly consistent across both language subsets. Misattribution remains the dominant technique in both Kazakh (33.1%) and Russian (31.9%), followed by clickbait and emotional pressure. The close alignment of technique frequencies across languages suggests that the synthetic generation process produced structurally comparable FAKE examples, supporting the cross-lingual coherence of the corpus. Table 6 summarizes key linguistic characteristics of REAL and FAKE texts in both languages.

At the linguistic level, FAKE texts are consistently longer than REAL texts in both languages, with a more pronounced difference in the Russian subset. In addition, FAKE texts contain substantially higher proportions of hedging markers (e.g., “may,” “allegedly,” “reportedly”) and uncertainty expressions (e.g., “some users claim,” “unverified reports suggest”) compared to REAL texts.

This pattern is consistent across both Kazakh and Russian subsets, indicating that linguistic markers of uncertainty function as a language-independent signal of disinformation within the corpus. Overall, these results support the suitability of KazFakeCorpus for bilingual and cross-lingual fake news detection research.
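One possible way to operationalize such a comparison is sketched below: the share of texts that contain at least one marker from a fixed list. This is an assumption about the measurement, since the text does not specify whether proportions were computed per text or per token. The marker list uses the English glosses quoted above, whereas a real analysis would need Kazakh and Russian marker lists.

```python
import re

# English glosses taken from the examples in the text; illustrative only.
HEDGING_MARKERS = ["may", "allegedly", "reportedly"]


def hedging_share(texts, markers=HEDGING_MARKERS):
    """Share of texts containing at least one marker as a whole word."""
    if not texts:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


sample = [
    "The ministry reportedly approved the plan.",
    "The ministry approved the plan.",
    "Officials may allegedly resign.",
]
share = hedging_share(sample)
```

Computing the same quantity for the REAL and FAKE subsets of each language would give the four values needed for a comparison of the kind summarized in Table 6.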

5.3. Statistical Characteristics and Structural Balance of the Corpus

At this stage, the main statistical characteristics of KazFakeCorpus were calculated in order to describe its quantitative and structural organization. The analysis showed that the corpus remains balanced both in terms of language and veracity labels: the Kazakh- and Russian-language parts are represented by a comparable number of texts, and the distribution of the REAL and FAKE classes remains quantitatively equivalent.

In addition, text length, measured in words, was analyzed. As shown in Figure 8, most texts are concentrated in the medium-length range, while the overall distribution remains sufficiently broad to include both shorter and longer materials. Such variability corresponds to the characteristics of news discourse and makes it possible to regard the corpus as relatively close to real-world media text conditions.

At the same time, the distribution of text length does not indicate a pronounced structural bias toward either very short or very long materials. This is important for the subsequent use of the resource, as it reduces the likelihood that differences between texts can be explained solely by formal length-related parameters. Thus, the obtained statistical characteristics confirm that KazFakeCorpus is a quantitatively balanced and structurally coherent resource suitable for further analysis of misinformation in the bilingual media space.

To assess statistical uncertainty, 95% bootstrap confidence intervals were computed for the key proportions in the corpus, including the distribution of disinformation techniques and class balance. The obtained intervals indicate that the observed proportions remain stable within relatively narrow bounds.
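A percentile bootstrap of this kind can be sketched as follows. The function name, the number of resamples, and the subset size of 1000 items are assumptions made for illustration; only the 32.5% share of misattribution is taken from the corpus description.

```python
import random


def bootstrap_proportion_ci(flags, n_resamples=2000, confidence=0.95, seed=42):
    """Percentile bootstrap interval for the mean of a 0/1 indicator list."""
    rng = random.Random(seed)
    n = len(flags)
    # Resample with replacement and record the proportion in each resample.
    estimates = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical subset of 1000 FAKE items, 32.5% of them labeled misattribution.
flags = [1] * 325 + [0] * 675
lower, upper = bootstrap_proportion_ci(flags)
```

Narrow intervals are expected here because the proportion is estimated from a large number of items; the interval width shrinks roughly with the square root of the sample size.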

In addition, a chi-square test of homogeneity was performed to compare the distribution of disinformation techniques across the Kazakh and Russian subsets. The test did not reveal statistically significant differences (p > 0.05), so there is no evidence that the technique distributions differ between the two languages. A similar test for the distribution of veracity classes across the Kazakh and Russian subsets also showed no statistically significant differences.
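The test can be illustrated with a short standard-library sketch. The counts below are hypothetical (1000 items per language, chosen only so that the misattribution shares match the 33.1% and 31.9% given above), and the p-value uses the closed-form survival function that holds for an even number of degrees of freedom, which applies to a 2 × 5 table.

```python
import math


def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi2_sf_even_dof(statistic, dof):
    """P(X > statistic) for a chi-square variable with even dof (closed form)."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive even dof")
    half = statistic / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))


# Hypothetical counts per technique: misattribution, clickbait,
# emotional pressure, downplaying, exaggeration.
kazakh = [331, 228, 165, 140, 136]
russian = [319, 232, 163, 144, 142]
statistic, dof = chi_square_homogeneity([kazakh, russian])
p_value = chi2_sf_even_dof(statistic, dof)
```

For tables with an odd number of degrees of freedom, a statistics library would be needed to obtain the p-value.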

5.4. Experimental Setup and Reproducibility

All experiments were implemented in Python 3.11 using PyTorch 2.1 and the HuggingFace Transformers library (version 4.38). Additional preprocessing and TF–IDF baselines were implemented with scikit-learn (version 1.4). The experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), AMD Ryzen 9 processor, and 64 GB RAM.

The main corpus was divided into training, validation, and test subsets using stratified sampling with a ratio of 70:15:15 while preserving language balance and REAL/FAKE class proportions across all splits. The authentic fake news set described in Section 3.3.1 is held out entirely as an external validation set and is not included in the 70:15:15 split, except in the cross-origin experiment (Section 5.9.3) where it is used as training data under controlled conditions.
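The splitting protocol can be sketched as follows, stratifying on the combination of language and veracity label so that both are preserved in every split. This is an illustrative standard-library version with hypothetical record fields (`lang`, `label`); the authors' own splitting code is not given in the text.

```python
import random
from collections import defaultdict


def stratified_split(items, key, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split items into train/validation/test while preserving each stratum's share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]   # remainder goes to the test split
    return train, val, test


# Hypothetical records: 100 items for each language/label combination.
records = [
    {"id": f"{lang}-{label}-{i}", "lang": lang, "label": label}
    for lang in ("kk", "ru")
    for label in ("REAL", "FAKE")
    for i in range(100)
]
train, val, test = stratified_split(records, key=lambda r: (r["lang"], r["label"]))
```

Fixing the seed makes the split reproducible, which matters when results from different models are compared on the same test items.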

The following HuggingFace model checkpoints were used for the neural baselines: bert-base-multilingual-cased for mBERT, xlm-roberta-base for XLM-RoBERTa, and microsoft/mdeberta-v3-base for mDeBERTa-v3. Transformer-based models were fine-tuned using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, batch size of 16, weight decay of 0.01, maximum sequence length of 256 tokens, linear warmup over the first 10% of training steps, and training for up to 5 epochs. Early stopping based on validation loss with a patience of 2 epochs was applied to reduce overfitting.
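The reported settings can be collected in a configuration dictionary, and the warmup behavior can be made explicit with a small schedule function. In the sketch below the dictionary keys and the function name are hypothetical, and the linear decay to zero after warmup is an assumption: the text specifies only the linear warmup over the first 10% of steps.

```python
# Hyperparameters as reported in the text; key names are illustrative.
FINETUNE_CONFIG = {
    "checkpoints": {
        "mBERT": "bert-base-multilingual-cased",
        "XLM-RoBERTa": "xlm-roberta-base",
        "mDeBERTa-v3": "microsoft/mdeberta-v3-base",
    },
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 16,
    "weight_decay": 0.01,
    "max_seq_length": 256,
    "warmup_ratio": 0.10,
    "max_epochs": 5,
    "early_stopping_patience": 2,
}


def learning_rate_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then linear decay to zero (decay is assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [learning_rate_at(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With early stopping, a run may end before the final step, so the later part of the schedule is not always reached.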

To reduce the influence of random initialization, each experiment was repeated across five random seeds (42, 52, 62, 72, and 82). The results reported in Tables 7–13 correspond to mean values averaged across the five runs, and standard deviations are reported alongside the mean values.
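The aggregation over seeds amounts to the following computation. The scores in the example are invented, and the use of the sample standard deviation (rather than the population form) is an assumption, as the text does not state which was used.

```python
from statistics import mean, stdev

SEEDS = (42, 52, 62, 72, 82)


def summarize_runs(score_by_seed):
    """Mean and sample standard deviation of one metric across the five seeds."""
    scores = [score_by_seed[seed] for seed in SEEDS]
    return mean(scores), stdev(scores)


# Invented macro F1 values, one per seed, for illustration only.
macro_f1_by_seed = {42: 0.84, 52: 0.85, 62: 0.86, 72: 0.85, 82: 0.85}
avg, sd = summarize_runs(macro_f1_by_seed)
```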

Model evaluation was performed using Accuracy, Precision, Recall, and Macro F1-score, complemented by confusion matrix analysis and cross-domain generalization experiments described in Section 5.9.

5.5. Baseline Detection Experiments

To assess the usability of KazFakeCorpus in a bilingual fake news detection setting, baseline classification experiments were conducted for the REAL/FAKE task.

Baseline experiments are reported for four models: a TF–IDF baseline with logistic regression, multilingual BERT, XLM-RoBERTa, and multilingual DeBERTa-v3 (mDeBERTa-v3). For each model, accuracy, precision, recall, and F1 are reported per class (REAL, FAKE), together with macro-averaged scores; all values are computed on the test split. As shown in Table 7, per-class precision and recall remain close for all neural models, indicating that errors are not concentrated on a single class.

The TF–IDF baseline shows slightly lower recall on the FAKE class and greater sensitivity to surface lexical cues, suggesting that traditional lexical representations are more vulnerable to stylistic imbalance than transformer-based multilingual encoders. To examine the error structure of the strongest model, Table 8 presents the confusion matrix for mDeBERTa-v3 on the bilingual test set (n = 1192 examples; 50% REAL, 50% FAKE).

The off-diagonal cells show that both types of errors are present, with 84 REAL items incorrectly classified as FAKE and 97 FAKE items incorrectly classified as REAL. This indicates a slight tendency of the model to classify ambiguous FAKE texts as REAL, although the overall prediction distribution remains relatively balanced. A language-level inspection showed similar error patterns, with minor variation across the Kazakh and Russian subsets.
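The relation between these confusion-matrix cells and the reported metrics can be checked with a short calculation. The off-diagonal counts (84 and 97) and the test-set size (n = 1192) are taken from the text; the diagonal counts are derived under the assumption that the 50/50 split is exact, that is, 596 items per class.

```python
def binary_metrics(real_as_real, real_as_fake, fake_as_real, fake_as_fake):
    """Accuracy, per-class precision/recall/F1, and macro F1 from four counts."""
    total = real_as_real + real_as_fake + fake_as_real + fake_as_fake

    def prf(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    # For each class, false positives are items of the other class assigned to it.
    real = prf(tp=real_as_real, fp=fake_as_real, fn=real_as_fake)
    fake = prf(tp=fake_as_fake, fp=real_as_fake, fn=fake_as_real)
    return {
        "accuracy": (real_as_real + fake_as_fake) / total,
        "REAL": real,
        "FAKE": fake,
        "macro_f1": (real["f1"] + fake["f1"]) / 2,
    }


PER_CLASS = 1192 // 2            # assumes an exact 50/50 split
metrics = binary_metrics(
    real_as_real=PER_CLASS - 84,  # 512 REAL items classified correctly
    real_as_fake=84,
    fake_as_real=97,
    fake_as_fake=PER_CLASS - 97,  # 499 FAKE items classified correctly
)
```

Under this assumption the accuracy evaluates to roughly 0.848, and recall on the FAKE class is slightly lower than on the REAL class, which matches the tendency described above.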

The results show consistent performance across the Kazakh and Russian subsets, confirming the structural balance and cross-lingual coherence of the corpus. The small performance gap between languages suggests that the unified annotation scheme effectively supports bilingual fake news detection.

In addition to within-language baselines, we conducted cross-lingual evaluation using XLM-RoBERTa. The model was trained in three settings: monolingual, joint bilingual, and cross-lingual transfer (KZ → RU and RU → KZ). The results are presented in Table 9.

The results show that joint bilingual training slightly improves performance compared to monolingual models, while cross-lingual transfer leads to a moderate decrease in performance. Nevertheless, the results confirm that the unified annotation scheme enables cross-lingual generalization and supports bilingual model development.

5.6. External Validation on Authentic Data

To assess whether models trained on KazFakeCorpus generalize beyond the controlled synthetic setting, an auxiliary external validation set of authentic news items was compiled from independent sources. Authentic FAKE items were drawn from independent fact-checking resources, including Factcheck.kz and Stopfake.kz, using the disputed or refuted claim as the text unit. Authentic REAL items were collected from independent Kazakh media outlets, including Tengrinews.kz, Kazinform.kz and Informburo.kz. The set deliberately covers domains that are underrepresented in the main corpus, such as health, economy, social policy, and international affairs. All items were reviewed by the annotation team and assigned a REAL/FAKE label following the same protocol used for the main corpus. This material is held out entirely from training and is not included in the statistics reported in Tables 3–9. The composition of the external validation set is summarized in Table 10.

The best-performing baseline model, mDeBERTa-v3 trained on KazFakeCorpus (Section 5.5), was evaluated on the external validation set without any additional fine-tuning. Table 11 reports its performance on the external set alongside the corresponding results on the in-domain KazFakeCorpus test set, provided here for reference.

On the external validation set, the model reached an accuracy of 0.724 and a macro F1 of 0.720, compared with 0.848 and 0.845 on the in-domain test set. The gap of approximately 12 percentage points indicates that part of the signal learned on the synthetic corpus reflects generation-specific stylistic cues rather than fully transferable misinformation markers. At the same time, performance well above chance on authentic, out-of-domain texts in both languages (Kazakh F1 = 0.702, Russian F1 = 0.738) shows that models trained on KazFakeCorpus retain useful and transferable signal. The relatively balanced results across the two languages further indicate that the unified annotation scheme supports consistent bilingual generalization beyond the controlled corpus.

5.7. Error Analysis

To characterize the model’s failure modes, the 178 misclassified test items produced by mDeBERTa-v3 were inspected manually by the annotation team. Each error was assigned to one of six categories, summarized in Table 12. The categories were defined inductively from the first 60 errors and applied to the remaining items; inter-annotator agreement on category assignment was α = 0.83.

Two patterns stand out. First, the largest error category (26.4%) corresponds to authentic FAKE items written in a register that closely resembles official news. These are precisely the cases that controlled synthetic generation tends to under-represent, and their presence in the test set demonstrates why authentic fake news is important for evaluation. Second, hedging-cue false positives (18.0%) suggest that the model partially relies on uncertainty markers as proxy signals. This pattern is consistent with the synthetic generation procedure described in Section 3.3.2 and with the methodological limitations discussed in Section 7, since uncertainty markers may be over-represented in the synthetic FAKE class.

Examples of each error type are provided in Appendix B. The error categories are also used as auxiliary features in the ablation study below, since they correspond directly to dimensions captured by the multi-level annotation scheme.

5.8. Ablation Study on Multi-Level Annotation

A central claim of this paper is that multi-level annotation provides analytical value beyond the binary REAL/FAKE label. To test this claim in the context of automatic detection, an ablation study was conducted in which auxiliary tasks corresponding to the additional annotation levels were progressively added to the mDeBERTa-v3 backbone within a multi-task learning framework. The primary objective is REAL/FAKE classification; the auxiliary objectives are the joint prediction of fake_type, disinformation_technique, source_credibility, and modality.

Each auxiliary task is trained with cross-entropy loss against the corresponding annotation level. Loss weights are uniform and fixed across conditions; only the set of active auxiliary objectives differs across rows of Table 13. All other hyperparameters are identical to those reported in Section 5.4.
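The combined objective can be sketched in plain Python as a sum of cross-entropy terms, one for the primary task and one for each active auxiliary task. The function names, the task keys, and the weight value of 1.0 are assumptions for illustration; the text states only that the weights are uniform and fixed.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_index]


def multitask_loss(logits_by_task, targets_by_task, active_aux, aux_weight=1.0):
    """Primary REAL/FAKE loss plus uniformly weighted auxiliary losses."""
    loss = cross_entropy(logits_by_task["REAL_or_FAKE"], targets_by_task["REAL_or_FAKE"])
    for task in active_aux:
        loss += aux_weight * cross_entropy(logits_by_task[task], targets_by_task[task])
    return loss


# Toy logits for one example (invented): two veracity classes, three fake types.
logits = {"REAL_or_FAKE": [0.0, 0.0], "fake_type": [0.0, 0.0, 0.0]}
targets = {"REAL_or_FAKE": 1, "fake_type": 2}
binary_only = multitask_loss(logits, targets, active_aux=[])
with_fake_type = multitask_loss(logits, targets, active_aux=["fake_type"])
```

Each row of the ablation then corresponds to a different `active_aux` list, with all other settings unchanged.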

Each additional annotation level produces a small but consistent improvement, with the largest gain observed after adding disinformation technique as an auxiliary objective. The full multi-level configuration improves Macro F1 by 2.2 points compared with the binary-only baseline. These results suggest that the multi-level scheme is not merely descriptive but provides additional supervision that can be exploited by automatic models for the primary REAL/FAKE detection task.

Examined by error category (Table 12), the improvement is mainly associated with fewer subtle-manipulation and stylistic-confusion errors, which are directly related to auxiliary supervision on disinformation_technique and source_credibility. The gain for hedging-cue false positives is smaller, suggesting that auxiliary supervision corrects content-level errors more readily than discourse-level over-reliance on uncertainty markers.

5.9. Cross-Domain, Cross-Topic, and Cross-Origin Generalization

Three generalization tests assess whether models trained on KazFakeCorpus transfer beyond the conditions of their training distribution. All experiments use the mDeBERTa-v3 backbone with binary REAL/FAKE supervision; multi-task variants follow the same pattern with an average gain of approximately +1.5 macro F1 (range +1.2 to +1.9) and are reported in Appendix C.

5.9.1. Cross-Domain Generalization

For each of the five domains introduced in Section 3.1, a leave-one-domain-out evaluation was performed: the model is trained on four domains and tested on the held-out domain. Table 14 reports macro F1 in this setting and, for reference, the in-domain F1 (model trained on a stratified mix that includes the target domain).
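The leave-one-domain-out protocol can be expressed as a small generator of train/test folds. In the sketch below the record layout is hypothetical, and two of the five domain names are placeholders, since only some domains are named in this section.

```python
def leave_one_group_out(items, group_key):
    """Yield (held_out_group, train_items, test_items) for every group."""
    groups = sorted({group_key(item) for item in items})
    for held_out in groups:
        train = [item for item in items if group_key(item) != held_out]
        test = [item for item in items if group_key(item) == held_out]
        yield held_out, train, test


# Placeholder domain names; "domain_4" and "domain_5" are not from the paper.
DOMAINS = ["digitalization and AI", "domestic politics", "public health",
           "domain_4", "domain_5"]
records = [{"id": f"{d}-{i}", "domain": d} for d in DOMAINS for i in range(10)]

folds = list(leave_one_group_out(records, group_key=lambda r: r["domain"]))
```

The same helper applies to the cross-topic setting of Section 5.9.2 by grouping on a topic field instead of the domain.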

Performance drops by approximately 6.6 macro F1 points on average when an entire domain is excluded from training. The largest drop is observed for domestic politics, where disinformation strategies rely most heavily on real-world context and entity knowledge, and the smallest drop is for digitalization and AI, which remains the most represented domain in the corpus. These results indicate that the corpus, while broadened, still rewards domain coverage and that practitioners deploying detectors should include in-domain data where possible.

5.9.2. Cross-Topic Generalization Within a Domain

To further probe topical robustness, a within-domain cross-topic split was constructed for the public-health domain, the domain with the most heterogeneous topical structure. Topics were assigned manually (vaccination, epidemic surveillance, healthcare reform, drug regulation, mental health). Training was performed on four topics and testing on the fifth, repeated for each topic. The average macro F1 across topics is 81.3%, against an in-topic reference of 86.0%, a drop of 4.7 points. This is smaller than the average cross-domain drop of approximately 6.6 points, indicating that within-domain topic shift accounts for a substantial part of the cross-domain degradation, while some of the residual drop in Section 5.9.1 reflects genuine cross-domain transfer difficulty rather than topical novelty alone.

5.9.3. Cross-Origin Generalization (Authentic ↔ Synthetic)

The concern that synthetic generation may not adequately represent organic disinformation is examined directly here. Two cross-origin evaluations were performed: (a) training only on synthetic FAKE items together with REAL items, and testing on authentic FAKE items; (b) the reverse direction. Results are reported in Table 15.

Two findings are notable. First, models trained only on synthetic data lose approximately 15 macro F1 points when tested on authentic data, confirming the concern that synthetic generation does not fully cover the variability of organic disinformation. Second, the reverse direction is more forgiving: models trained on authentic data lose only about 3 macro F1 points when tested on synthetic data. Joint training with both origins yields the most robust performance and is recommended as the default training configuration for downstream studies that use KazFakeCorpus.
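
The two cross-origin conditions can be assembled with a simple filter on the origin of each FAKE item. The sketch below assumes a record with a `label` field and the `fake_origin` flag mentioned in Section 7; the values "synthetic" and "authentic", the use of `None` for REAL items, and the toy partitions are assumptions made for illustration only.

```python
from typing import Dict, List, Optional

Item = Dict[str, Optional[str]]  # hypothetical record layout


def select_by_origin(items: List[Item], fake_origin: str) -> List[Item]:
    """Keep every REAL item plus the FAKE items with the requested origin."""
    return [
        item for item in items
        if item["label"] == "REAL" or item["fake_origin"] == fake_origin
    ]


# Toy, already separated train/test partitions (hypothetical, not corpus data).
train_part = [
    {"label": "REAL", "fake_origin": None},
    {"label": "FAKE", "fake_origin": "synthetic"},
    {"label": "FAKE", "fake_origin": "authentic"},
]
test_part = [
    {"label": "REAL", "fake_origin": None},
    {"label": "FAKE", "fake_origin": "synthetic"},
    {"label": "FAKE", "fake_origin": "authentic"},
]

# Condition (a): train on synthetic FAKE + REAL, test on authentic FAKE + REAL.
train_a = select_by_origin(train_part, "synthetic")
test_a = select_by_origin(test_part, "authentic")
# Condition (b): the reverse direction.
train_b = select_by_origin(train_part, "authentic")
test_b = select_by_origin(test_part, "synthetic")
```

Filtering is applied separately to the train and test partitions so that REAL items stay in both, while the FAKE class comes from exactly one origin on each side.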

Taken together, the three generalization tests show that (i) the corpus supports useful cross-domain transfer when domains overlap, (ii) topical shift within a domain is a non-trivial but smaller source of degradation, and (iii) the inclusion of authentic fake news is empirically necessary for models to behave reliably on naturally occurring disinformation.

6. Discussion

This study addressed three research questions concerning the role of multi-level semantic annotation and bilingual corpus design in fake news analysis. The results make it possible to provide answers at the level of corpus construction, annotation design, and baseline experimental evaluation.

In response to RQ1, the study shows that the proposed multi-level semantic annotation scheme provides a more informative representation of fake news than traditional binary labeling. By extending annotation beyond a simple REAL/FAKE distinction, the proposed scheme captures additional dimensions of information distortion, including fake content type, disinformation technique, source-related characteristics, and evidence-related information. This makes it possible to represent fake news not as a single categorical label, but as a structured phenomenon with several interrelated components [12,16].

In response to RQ2, the annotation structure demonstrates the capacity to capture multiple dimensions of disinformation within a single framework. The inclusion of categories such as fake_type, disinformation_technique, author_intent, source_type, and evidence enables a more detailed description of how misleading information is constructed and presented. The ablation results further indicate that these additional annotation levels provide useful auxiliary supervision for REAL/FAKE detection, especially when disinformation techniques and source-related features are included.

In response to RQ3, the bilingual structure of KazFakeCorpus, combined with a unified annotation scheme, creates opportunities for the analysis of fake news in the Kazakh–Russian media environment. The cross-lingual experiments reported in Section 5 show that joint bilingual training slightly improves performance compared with monolingual training, while direct cross-lingual transfer remains more challenging. These findings suggest that the unified annotation scheme supports bilingual model development, although additional data and broader domain coverage are still needed to improve transfer robustness.

Taken together, these findings show that KazFakeCorpus is not only a collection of labeled texts but also a structured resource for representing and modeling disinformation in a bilingual setting. The corpus can therefore support further research on fake news detection, corpus-based analysis of disinformation strategies, and cross-lingual modeling in low-resource media environments [10,11,13,16]. In addition, performance on the external authentic set (Section 5.9) demonstrates that models trained on the controlled corpus retain useful signal beyond synthetic stylistic cues, while the observed performance gap motivates future hybrid corpus design that combines synthetic and authentic material.

7. Limitations

Several limitations should be acknowledged. First, despite the inclusion of authentic fake news, a substantial part of the FAKE class remains synthetically generated. Second, the core corpus is still centered on digitalization-related content, although additional domains were introduced for external evaluation. Third, the authentic external validation set remains relatively small compared with the synthetic component. Finally, the experiments reported in this study should be interpreted as baseline evaluations rather than exhaustive benchmarking.

Despite the controlled generation procedure and subsequent expert verification, the use of synthetic fake texts introduces a methodological limitation. Generated messages may differ from organically disseminated disinformation in stylistic, pragmatic, and discursive characteristics, as well as in emotional tone and dissemination patterns typical of social media environments.

Since FAKE texts were generated from REAL texts using controlled prompts, the dataset may contain prompt-engineered stylistic contrasts between REAL and FAKE classes. As a result, models trained on the corpus may learn generation artifacts or stylistic cues rather than robust misinformation markers. This limits the direct generalizability of the current dataset to naturally occurring misinformation.

An additional limitation is the use of a single main source of reliable news (the Gov.kz portal), which may lead to thematic and stylistic bias in the corpus. Official government sources tend to have a more formal structure and a more limited thematic range. Moreover, the corpus is primarily focused on digitalization and artificial intelligence topics, which may further restrict its domain representativeness.

Although the experimental results demonstrate relatively consistent improvements across the evaluated settings, the reported values should be interpreted with caution. Due to the controlled nature of the corpus construction process and the balanced structure of the dataset, some experimental trends may appear smoother than those typically observed in large-scale organically collected misinformation benchmarks. While repeated experiments and cross-domain evaluations were conducted to reduce this effect, additional large-scale validation on naturally occurring multilingual misinformation streams remains necessary.

In addition, the current version of KazFakeCorpus does not include multimodal content such as images, video, or audio, although such elements increasingly accompany modern disinformation messages. While the annotation scheme incorporates a modality parameter to support future extensions, all analyses in the present work are based exclusively on textual data. The inclusion of multimodal components remains a direction for future corpus development.

While the main corpus relies on controlled synthetic generation for the FAKE class, authentic fake news from independent fact-checking sources is provided as an additional set alongside the corpus, with a fake_origin flag distinguishing the two subsets. This additional material is used in the cross-origin and generalization analyses reported in Section 5.9.

In future work, the corpus will be expanded through the inclusion of authentic fake news materials from independent media, social networks, and fact-checking resources. Special attention will be given to incorporating diverse topics and heterogeneous sources to improve corpus representativeness. It is also planned to enrich the corpus with references to verified news sources and related evidence materials in order to support comparison between fake and real news at the level of content, source, and factual grounding.

More specifically, although an additional set of authentic fake news collected from Kazakhstani fact-checking platforms now complements the FAKE class, the volume of authentic items remains smaller than that of the synthetic data, particularly in Kazakh. The cross-origin experiment (Section 5.9.3) shows that models trained on a joint pool generalize better than those trained on either origin alone; however, future releases should aim for a larger authentic share in order to reduce remaining stylistic biases.

The cross-domain experiment (Section 5.9.1) shows an average drop of approximately 6.6 macro F1 points when an entire domain is held out of training. While this is consistent with the literature on domain transfer in fake news detection, it indicates that detectors trained on the present corpus cannot be deployed without retraining for new domains such as election-specific or wartime disinformation, which are not covered by the current sampling protocol.

The error analysis (Section 5.7) shows that approximately one in five errors corresponds to legitimate hedging being misread as a disinformation marker. This reflects a known limitation of binary supervision and is partially mitigated by the multi-level annotation scheme (Section 5.8), but it remains an open challenge for interpretable disinformation detection. Future work will explore explicit discourse-level supervision for hedging and modality.

8. Conclusions

This paper presents the construction of KazFakeCorpus, a bilingual corpus designed for research on automatic fake news detection in Kazakh and Russian. The corpus is based on a multi-level annotation scheme that provides a formalized representation of message veracity, types of fake content, disinformation techniques, and contextual characteristics of news texts.

The proposed annotation structure is oriented toward modeling semantic relations among the components of a news message and provides a foundation for the development of interpretable methods for disinformation analysis. The bilingual organization of the data and the unified annotation scheme make it possible to use the corpus in studies related to cross-lingual model transfer and the analysis of disinformation dissemination patterns in a bilingual media environment.

KazFakeCorpus is intended as a resource for further research on automatic fake news detection and the development of methods for disinformation analysis. Future work will focus on expanding the corpus through the inclusion of authentic examples of fake news from diverse sources, further developing the annotation model, and applying modern neural architectures, including large language models.

A further direction of development is the creation of an LLM-based analysis component for classifying input information as REAL or FAKE and providing an explanation of the decision. Such a component may support explanation at the level of disinformation techniques, source-related characteristics, and evidence-related features. Another planned extension is the integration of links to verified news reports and other supporting materials, which would make it possible to connect fake content with corresponding real news and support evidence-based analysis.

To ensure the reproducibility of results and facilitate further use of the resource in scientific research, KazFakeCorpus will be published in open access upon acceptance of the article. The dataset will be made available in the GitHub and Hugging Face Datasets repositories together with annotation guidelines, preprocessing scripts, and implementations of baseline models. To ensure long-term availability, the resource will also be registered in the Zenodo repository with a permanent DOI.

Future work will also focus on incorporating explicit claim–evidence–source chains and multi-source evidence modeling, inspired by recent graph-based approaches to fact verification.

Author Contributions

Conceptualization, Z.L. and M.S.; methodology, A.N. (Anargul Nekessova) and M.K. (Mansiya Kantureyeva); software, A.N. (Aksaule Nazymkhan); validation, M.K. (Mira Kaldarova) and A.N. (Anargul Nekessova); formal analysis, M.K. (Mansiya Kantureyeva); investigation, A.N. (Aksaule Nazymkhan); resources, Z.L. and M.K. (Mira Kaldarova); data curation, M.S.; writing – original draft preparation, Z.L. and A.N. (Aksaule Nazymkhan); writing – review and editing, Z.L. and M.K. (Mira Kaldarova); visualization, A.N. (Anargul Nekessova); supervision, M.S.; project administration, M.K. (Mansiya Kantureyeva); funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan Grant AP26195591.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

KazFakeCorpus, including the core bilingual corpus and the auxiliary authentic fake news evaluation set used in the generalization experiments, is publicly available in the GitHub repository: https://github.com/Anargul-Aimuratovna/news-veracity-corpus (accessed on 22 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix provides the full original-language versions of the examples presented in Table 2. The texts are given in Kazakh and Russian and correspond to both REAL (verified) and FAKE (generated) categories.

Kazakh Example (REAL)

Цифрлық код цифрлық құқықтарды қорғауға бағытталған жүйелі құжат ретінде ұсынылды. Құжат деректерді қорғау, электрондық сервистерді дамыту және азаматтардың цифрлық құқықтарын қамтамасыз ету мәселелерін қамтиды. Codex "ұмытылу құқығы", цифрлық инфрақұрылымды дамыту және мемлекеттік қызметтерді қағазсыз форматқа көшіру сияқты нормаларды қарастырады. Құжат цифрландыруды қолдау және құқықтық базаны жаңғырту құралы ретінде сипатталады.

Kazakh Example (FAKE)

Цифрлық код азаматтардың цифрлық кеңістіктегі еркіндігін шектеуі мүмкін деген пікірлер тарауда. Әлеуметтік желілерде кейбір қолданушылар Цифрлық код жаңа шектеулер енгізуге құқықтық негіз болуы мүмкін деген болжамдар айтуда. Сонымен қатар, "ұмытылу құқығы" мен деректерді қорғау нормалары нақты қалай іске асатыны жөнінде күмән білдірілуде. Алайда бұл пікірлер ресми түрде расталмаған және қоғамдық талқылау деңгейінде ғана қарастырылуда.

Russian Example (REAL)

В Жамбылской области обсудили развитие цифровых проектов. В ходе рабочей поездки представители министерства сообщили о развитии цифровой инфраструктуры региона, включая подключение населённых пунктов к интернету и внедрение элементов концепции Smart City. Также обсуждалось расширение сети телекоммуникаций и модернизация базовых станций. В рамках инициативы планируется развитие IT-хаба и повышение доступности государственных услуг. Представители министерства отметили, что данные меры направлены на повышение качества связи и цифровых сервисов для жителей региона.

Russian Example (FAKE)

В социальных сетях распространяются заявления о скрытой цифровой модели управления. В ряде публикаций утверждается, что проекты Smart City якобы предполагают усиленный мониторинг граждан через цифровые системы и камеры наблюдения. Некоторые пользователи связывают развитие цифровой инфраструктуры с возможным ограничением традиционных форм оказания услуг. Также высказываются предположения о переходе региона на "экспериментальную" модель цифрового управления, однако официальных подтверждений подобным заявлениям нет. Представленные утверждения основаны преимущественно на интерпретациях пользователей социальных сетей.

Appendix B. Representative Error Categories

This appendix provides representative examples of the main error categories identified in Section 5.7 for the mDeBERTa-v3 model on the bilingual REAL/FAKE test set. Each example includes the original fragment, a short English translation, the gold label, the model prediction, and a brief explanation of the error type. Texts are shortened to the most relevant segments, and identifying details have been anonymized where necessary.

Appendix B.1. Stylistic Confusion (FAKE → Predicted REAL)

Language: Russian

“В Министерстве здравоохранения сообщили о расширении программы скрининга. По данным ведомства, в текущем году дополнительное обследование пройдут более 1.2 миллиона граждан […]”.

English translation: “The Ministry of Health reported an expansion of the screening programme. More than 1.2 million citizens are expected to undergo additional examination this year […].”

Gold label: FAKE; Predicted label: REAL.

Note: The item imitates the institutional style of an official press release and contains no obvious uncertainty markers. The model relies on register-level cues and incorrectly classifies the text as REAL.

Appendix B.2. Topic-Shift Effect (REAL → Predicted FAKE)

Language: Kazakh

“Маңғыстау облысында мал шаруашылығын дамыту бағдарламасы аясында жайылым алқаптары ұлғайтылды […]”.

English translation: “In the Mangystau region, pasture areas were expanded as part of a livestock development programme […].”

Gold label: REAL; Predicted label: FAKE.

Note: The agricultural and regional-administration vocabulary is underrepresented in the training data, leading the model to misinterpret topical novelty as a FAKE signal.

Appendix B.3. Hedging-Cue False Positives (REAL → Predicted FAKE)

Language: Russian

“По прогнозу Национального банка, инфляция в текущем году может составить от 6.5 до 7.5 процента […]”.

English translation: “According to the National Bank’s forecast, inflation this year may range between 6.5 and 7.5 per cent […].”

Gold label: REAL; Predicted label: FAKE.

Note: Legitimate forecasting language (“may range”, “according to analysts”) is incorrectly treated by the model as a marker of disinformation.

Appendix B.4. Short-Text Errors

Language: Russian

“В Алматы открыт новый сервисный центр для приёма заявлений на цифровые услуги”.

English translation: “A new service centre for accepting applications for digital services has opened in Almaty.”

Gold label: REAL; Predicted label: FAKE.

Note: The text is too short to provide sufficient lexical and contextual information for reliable classification.

Appendix B.5. Subtle Manipulation (FAKE → Predicted REAL)

Language: Kazakh

“Үкімет өкілінің мәліметінше, келесі айдан бастап барлық мемлекеттік қызметтер тек электронды форматта көрсетіледі […]”.

English translation: “According to a government representative, all public services will be provided exclusively in electronic form starting next month […].”

Gold label: FAKE; Predicted label: REAL.

Note: The misleading element lies in source misattribution rather than in overt sensationalism, making the item difficult for a text-only classifier to detect.

Appendix B.6. Annotation-Boundary Cases

Language: Russian

“По мнению ряда экспертов, проводимая цифровая реформа фактически усиливает контроль над повседневной жизнью граждан […]”.

English translation: “According to several experts, the ongoing digital reform in effect increases control over citizens’ daily lives […].”

Gold label: FAKE; Predicted label: REAL.

Note: The item combines analytical commentary with misleading framing and lies close to the boundary between opinion journalism and disinformation.

Table A1 provides a concise summary of the representative error categories illustrated in Appendix B.1–B.6.

Table A1. Summary of error categories in mDeBERTa-v3 predictions.

Category	Share (%)	Typical Direction
Stylistic confusion	26.4	FAKE → REAL
Topic-shift effect	21.9	REAL → FAKE
Hedging-cue false positives	18.0	REAL → FAKE
Short-text errors	13.5	Both directions
Subtle manipulation	12.4	FAKE → REAL
Annotation-boundary cases	7.9	Both directions

Together, these examples illustrate that many classification errors are associated not only with lexical patterns, but also with discourse structure, topical distribution, source attribution, and contextual ambiguity. The results further support the motivation for the proposed multi-level annotation scheme, which aims to capture multiple dimensions of disinformation beyond binary REAL/FAKE classification.

Appendix C. Multi-Task Generalization Results

This appendix reports the multi-task variants of the generalization experiments presented in Section 5.9. The multi-task configuration corresponds to the full multi-level setting from Table 13 and jointly optimizes the primary REAL/FAKE objective together with the auxiliary objectives fake_type, disinformation_technique, source_credibility, and modality. All other hyperparameters, splits, and evaluation protocols are identical to those used for the binary models in Section 5.9.
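
One common way to optimize a primary objective jointly with auxiliary ones is a weighted sum of per-head cross-entropy losses. The sketch below illustrates that idea only: the source does not state how the objectives are combined, so the weighted-sum form, the `aux_weight` value, the head name `veracity`, and the class counts per head are all assumptions.

```python
import math
from typing import Dict, Sequence

# Auxiliary heads named after the annotation levels listed above.
AUXILIARY_HEADS = ("fake_type", "disinformation_technique", "source_credibility", "modality")


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Negative log-likelihood of the gold class under a probability vector."""
    return -math.log(probs[gold_index])


def joint_loss(probs: Dict[str, Sequence[float]], gold: Dict[str, int],
               aux_weight: float = 0.5) -> float:
    """Primary REAL/FAKE loss plus a down-weighted sum of auxiliary losses (assumed form)."""
    primary = cross_entropy(probs["veracity"], gold["veracity"])
    auxiliary = sum(cross_entropy(probs[head], gold[head]) for head in AUXILIARY_HEADS)
    return primary + aux_weight * auxiliary


# Toy predicted distributions and gold indices (hypothetical).
toy_probs = {
    "veracity": [0.2, 0.8],
    "fake_type": [0.7, 0.2, 0.1],
    "disinformation_technique": [0.1, 0.6, 0.3],
    "source_credibility": [0.5, 0.5],
    "modality": [0.9, 0.1],
}
toy_gold = {"veracity": 1, "fake_type": 0, "disinformation_technique": 1,
            "source_credibility": 0, "modality": 0}

binary_only = joint_loss(toy_probs, toy_gold, aux_weight=0.0)  # reduces to the primary loss
multi_task = joint_loss(toy_probs, toy_gold, aux_weight=0.5)
```

Setting `aux_weight` to zero recovers binary-only supervision, which makes the binary baseline a special case of the same formulation.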

Appendix C.1. Cross-Domain Generalization (Multi-Task)

Table A2 reports leave-one-domain-out macro F1 for the multi-task variant, alongside the binary baseline from Table 14.

Table A2. Cross-domain generalization (leave-one-domain-out), multi-task variant vs. binary baseline.

Held-Out Domain	Binary (Table 14)	Multi-Task	Δ vs. Binary
Digitalization and AI	82.7%	84.3% ± 0.4	+1.6
Public health and epidemiology	78.3%	79.7% ± 0.5	+1.4
Economy and finance	80.1%	81.6% ± 0.4	+1.5
Domestic politics	76.9%	78.4% ± 0.6	+1.5
Social affairs	79.4%	80.9% ± 0.5	+1.5
Macro-average	79.5%	81.0% ± 0.2	+1.5

The multi-task variant preserves the same qualitative pattern observed in Table 14. The largest residual performance drop remains in the domestic-politics domain, while the smallest drop is observed for digitalization and AI, which is the most represented domain in the training data. Overall, the multi-task configuration provides a small but consistent improvement across all held-out domains.
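
As a consistency check, the per-domain gains and the macro-average row of Table A2 can be recomputed from the reported scores. The snippet below only transcribes the table values; the variable names are illustrative.

```python
# Values transcribed from Table A2: (binary macro F1, multi-task macro F1), in percent.
TABLE_A2 = {
    "Digitalization and AI": (82.7, 84.3),
    "Public health and epidemiology": (78.3, 79.7),
    "Economy and finance": (80.1, 81.6),
    "Domestic politics": (76.9, 78.4),
    "Social affairs": (79.4, 80.9),
}

# Per-domain gain of the multi-task variant over the binary baseline.
deltas = {domain: round(mt - binary, 1) for domain, (binary, mt) in TABLE_A2.items()}

# Unweighted averages over the five held-out domains.
binary_avg = round(sum(b for b, _ in TABLE_A2.values()) / len(TABLE_A2), 1)
multi_task_avg = round(sum(m for _, m in TABLE_A2.values()) / len(TABLE_A2), 1)
avg_gain = round(multi_task_avg - binary_avg, 1)
```

The recomputed values agree with the Δ column and with the macro-average row of the table (79.5, 81.0, and +1.5).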

Appendix C.2. Cross-Origin Generalization (Synthetic ↔ Authentic, Multi-Task)

Table A3 reports the cross-origin evaluation between synthetic and authentic FAKE items, corresponding to the experiments presented in Section 5.9.3 and Table 15.

Table A3. Cross-origin generalization between synthetic and authentic FAKE items, multi-task variant vs. binary baseline.

Training FAKE Origin	Test FAKE Origin	Binary Macro F1	MT Macro F1	Δ vs. Binary
Synthetic only	Synthetic	88.1%	89.3% ± 0.3	+1.2
Synthetic only	Authentic	73.0%	74.7% ± 0.9	+1.7
Authentic only	Authentic	84.4%	85.8% ± 0.6	+1.4
Authentic only	Synthetic	81.6%	83.5% ± 0.5	+1.9
Joint (auth. + synth.)	Authentic	85.5%	86.8% ± 0.7	+1.3
Joint (auth. + synth.)	Synthetic	87.3%	88.9% ± 0.4	+1.6

The multi-task configuration improves performance consistently across all origin conditions. However, the gap between synthetic-trained and authentic-tested settings remains substantial, indicating that auxiliary supervision alone cannot compensate for the distributional differences between synthetic and authentic disinformation. The strongest overall performance is obtained under joint training on both synthetic and authentic FAKE items.
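
The size of the remaining gap can be made explicit by comparing, for each training origin, the same-origin score with the cross-origin score in Table A3. The sketch below transcribes the single-origin rows of the table; the function and variable names are illustrative.

```python
from typing import Dict, Tuple

Scores = Dict[Tuple[str, str], float]  # keyed by (training FAKE origin, test FAKE origin)

# Macro F1 values (percent) transcribed from the single-origin rows of Table A3.
BINARY: Scores = {
    ("synthetic", "synthetic"): 88.1,
    ("synthetic", "authentic"): 73.0,
    ("authentic", "authentic"): 84.4,
    ("authentic", "synthetic"): 81.6,
}
MULTI_TASK: Scores = {
    ("synthetic", "synthetic"): 89.3,
    ("synthetic", "authentic"): 74.7,
    ("authentic", "authentic"): 85.8,
    ("authentic", "synthetic"): 83.5,
}


def transfer_drop(scores: Scores, train_origin: str) -> float:
    """Same-origin score minus cross-origin score for one training origin."""
    other = "authentic" if train_origin == "synthetic" else "synthetic"
    return round(scores[(train_origin, train_origin)] - scores[(train_origin, other)], 1)


drops = {
    name: {origin: transfer_drop(scores, origin) for origin in ("synthetic", "authentic")}
    for name, scores in (("binary", BINARY), ("multi_task", MULTI_TASK))
}
```

With these values the synthetic-to-authentic drop is 15.1 points for the binary model and 14.6 points for the multi-task model, against 2.8 and 2.3 points in the reverse direction, which matches the asymmetry described in Section 5.9.3.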

Appendix C.3. Summary

The multi-task configuration provides a stable improvement over the binary baseline in both cross-domain and cross-origin settings, with an average gain of approximately +1.5 macro F1 (range +1.2 to +1.9 across settings). These results support the interpretation that the proposed multi-level annotation scheme supplies transferable supervision beyond binary REAL/FAKE classification. At the same time, the remaining performance gap between synthetic and authentic conditions confirms the methodological importance of including authentic fake news in evaluation.

References

1. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36.
2. Zhou, X.; Zafarani, R. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 2020, 53, 1–40.
3. Tandoc, E.C.; Lim, Z.W.; Ling, R. Defining “fake news”: A typology of scholarly definitions. Digit. J. 2018, 6, 137–153.
4. Ferreira, W.; Vlachos, A. Emergent: A novel data set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), San Diego, CA, USA, 12–17 June 2016; pp. 1163–1168.
5. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM, USA, 20–26 August 2018; pp. 3391–3401.
6. Amjad, M.; Sidorov, G.; Zhila, A.; Gómez-Adorno, H.; Voronkov, I.; Gelbukh, A. “Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation. J. Intell. Fuzzy Syst. 2020, 39, 2457–2469.
7. Azad, R.; Mohammed, B.; Mahmud, R.; Zrar, L.; Sdiqa, S. Fake news detection in low-resourced languages: Kurdish language using machine learning algorithms. Turk. J. Comput. Math. Educ. 2021, 12, 4219–4225.
8. Alghamdi, J.; Lin, Y.; Luo, S. Fake news detection in low-resource languages: A novel hybrid summarization approach. Knowl.-Based Syst. 2024, 296, 111884.
9. Makhambetov, O.; Makazhanov, A.; Yessenbayev, Z.; Matkarimov, B.; Sabyrgaliyev, I.; Sharafudinov, A. Assembling the Kazakh language corpus. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, USA, 18–21 October 2013; pp. 1022–1031.
10. Faustini, P.H.A.; Covões, T.F. Fake news detection in multiple platforms and languages. Expert Syst. Appl. 2020, 158, 113503.
11. Chu, S.K.W.; Xie, R.; Wang, Y. Cross-language fake news detection. Data Inf. Manag. 2021, 5, 100–109.
12. Wardle, C.; Derakhshan, H. Information Disorder: Toward an Interdisciplinary Framework for Research and Policy Making; Council of Europe: Strasbourg, France, 2017.
13. Irnawan, B.R.; Xu, S.; Tomuro, N.; Fukumoto, F.; Suzuki, Y. Claim veracity assessment for explainable fake news detection. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 4011–4029.
14. Wang, B.; Ma, J.; Lin, H.; Yang, Z.; Yang, R.; Tian, Y.; Chang, Y. Explainable fake news detection with large language model via defense among competing wisdom. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 2452–2463.
15. Dementieva, D.; Panchenko, A. Fake news detection using multilingual evidence. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 775–776.
16. Sambetbayeva, M.; Nekessova, A.; Yerimbetova, A.; Bayangali, A.; Kaldarova, M.; Telman, D.; Smailov, N. A multi-level annotation model for fake news detection: Implementing a Kazakh-Russian corpus via Label Studio. Big Data Cogn. Comput. 2025, 9, 215.
17. Kuntur, S.; Wróblewska, A.; Ganzha, M.; Paprzycki, M.; Sachdeva, S. Fake news detection: It’s all in the data! Appl. Sci. 2026, 16, 1585.
18. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 2020, 8, 171–188.
19. Dzienisiewicz, D.; Graliński, F.; Jabłoński, P.; Kubis, M.; Skórzewski, P.; Wierzchoń, P. POLygraph: Polish fake news dataset. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA 2024), Mexico City, Mexico, 20 June 2024; pp. 250–263.
20. Li, Y.; He, H.; Bai, J.; Wen, D. MCFEND: A multi-source benchmark dataset for Chinese fake news detection. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 4018–4027.
21. Rashkin, H.; Choi, E.; Jang, J.Y.; Volkova, S.; Choi, Y. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7–11 September 2017; pp. 2931–2937.
22. Wen, Y.F.; Chang, W.H.; Wang, C.C.; Yang, K. Fake news detection and corpus establishment from comment data for social network posts. Soc. Netw. Anal. Min. 2024, 14, 222.
23. Da San Martino, G.; Barrón-Cedeño, A.; Wachsmuth, H.; Petrov, R.; Nakov, P. SemEval-2020 Task 11: Detection of propaganda techniques in news articles. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, 12 December 2020; pp. 1377–1414.
24. Wang, W.Y. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 422–426.
25. Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 809–819.
26. Hossain, M.Z.; Rahman, M.A.; Islam, M.S.; Kar, S. BanFakeNews: A dataset for detecting fake news in Bangla. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2862–2871.
27. Buzea, M.C.; Trausan-Matu, S.; Rebedea, T. Automatic fake news detection for Romanian online news. Information 2022, 13, 151.
28. Zhou, X.; Zhang, L.; Zhang, J.; Liu, E.; Cambria, E.; Li, C. FineFake: A knowledge-enriched dataset for fine-grained multi-domain fake news detection. Inf. Fusion 2026, 132, 104253.
29. Lv, J.; Gao, Y.; Li, L.; Shi, L.; Li, S. Multi-modal fake news detection: A comprehensive survey on deep learning technology, advances, and challenges. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 306.
30. Chim, J.; Ive, J.; Liakata, M. Evaluating Synthetic Data Generation from User Generated Text. Comput. Linguist. 2025, 51, 191–233.
31. Wu, J.; Yang, S.; Zhan, R.; Yuan, Y.; Chao, L.S.; Wong, D.F. A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions. Comput. Linguist. 2025, 51, 275–338.
32. Arık, A.O.; Parlayandemir, G.; Çelik, S. LLM-Based Data Augmentation for Text Classification on Imbalanced Datasets: A Case Study on Fake News Detection. Egypt. Inform. J. 2026, 33, 100886.
33. Yoshida, K.; Zhang, J. Faithful fake news detection based on natural language explanation generation. In Proceedings of the 2024 8th International Conference on Natural Language Processing and Information Retrieval, Seoul, Republic of Korea, 13–15 December 2024; pp. 69–74.
34. Gu, J.; Li, W.; Liu, F.; Liu, W.; Wang, H. HeterMV: Multi-view reasoning over source-aware heterogeneous evidence graph for multi-source fact verification. Inf. Process. Manag. 2026, 63, 104709.
35. Artstein, R.; Poesio, M. Inter-coder agreement for computational linguistics. Comput. Linguist. 2008, 34, 555–596.

Figure 1. Distribution of disinformation techniques in KazFakeCorpus.

Figure 2. Architecture of the KazFakeCorpus construction and multi-level annotation process.

Figure 3. Conceptual scheme of the KazFakeCorpus annotation model and the relationships among its main categories.

Figure 4. Multi-level annotation workflow for news texts by two independent experts during the construction of KazFakeCorpus.

Figure 5. Example of multi-level annotation of a FAKE news text in Label Studio.

Figure 6. Example of relation-based annotation of a FAKE news text in Label Studio.

Figure 7. Distribution of disinformation techniques in the FAKE subset of KazFakeCorpus.

Figure 8. Distribution of text length (words).

Table 1. Comparison of existing corpora for fake news detection.

Corpus	Language	Size	Data Source	Multilingual	Annotation Level	Reference
LIAR	English	~12.8 K	Short political claims + metadata	No	Fine-grained veracity labels + speaker/context metadata	[24]
FEVER	English	~185 K claims	Claims + evidence from Wikipedia	No	Claim veracity + evidence links	[25]
POLygraph	Polish	~5 K	News + comments	No	Binary (REAL/FAKE)	[19]
MCFEND	Chinese	~23.8 K	Multi-source: social platforms, messaging apps, and online news	No	Binary veracity (fact-check verified)	[20]
FakeNewsNet	English	~23 K (varies by subset)	News content + social context (social media signals) + spatiotemporal/dynamic information	No	Veracity labels + social context/metadata (promotion/engagement)	[18]
BanFakeNews	Bangla	~50 K	Online news/mixed	No	Binary (REAL/FAKE)	[26]
Urdu “Bend the Truth” dataset	Urdu	900	News/online content	No	Binary	[6]
Kurdish fake news dataset	Kurdish	~15 K unique	Online sources	No	Binary	[7]
Romanian Online News	Romanian	38,905	Online news	No	Binary	[27]
Comment-based fake news corpus (resource type)	Various	−	Social media + comments	Varies	Binary/corpus construction	[22]
Kazakh Language Corpus (general-purpose)	Kazakh	−	Mixed texts	No	Linguistic annotation (not fake-news-specific)	[9]
KazFakeCorpus	Kazakh + Russian	4276	Core corpus: official Gov.kz news + controlled synthetic fakes; auxiliary authentic and external evaluation sets	Yes	Multi-level semantic annotation (REAL/FAKE, fake_type, technique, intent, source, evidence, modality)	This work

Table 2. Statistical characteristics of KazFakeCorpus.

Language	Class	Number of Texts	Average Number of Words	Share of Corpus (%)
Kazakh	REAL	1085	≈278	25.4%
Kazakh	FAKE	1085	≈330	25.4%
Russian	REAL	1053	≈301	24.6%
Russian	FAKE	1053	≈348	24.6%
Total	-	4276	≈314	100%

Table 3. Examples of authentic (REAL) and generated (FAKE) news fragments (simplified presentation; see Appendix A for full original texts).

Type	Language	Original (Excerpt)	English Translation (Full Version)
REAL	Kazakh	Цифрлық код цифрлық құқықтарды қорғауға бағытталған жүйелі құжат ретінде ұсынылды…	The Digital Code as a framework for digital rights regulation. The Digital Code is presented as a regulatory framework intended to govern the digital environment. The document outlines provisions related to data protection, the development of electronic services, and the protection of citizens’ digital rights. It includes measures such as the “right to be forgotten,” initiatives aimed at developing digital infrastructure, and the transition of public services toward paperless formats. In this context, the Code is described as part of broader efforts to support digital transformation and update the existing legal framework.
FAKE	Kazakh	Цифрлық код азаматтардың цифрлық кеңістіктегі еркіндігін шектеуі мүмкін деген пікірлер тарауда…	Public concerns regarding the potential strengthening of digital control. Public discussions on social media have raised concerns that the Digital Code may potentially restrict citizens’ freedoms within the digital environment. Some publications hypothesize that the document could provide a legal basis for introducing additional regulatory measures. Questions have also been raised regarding the practical implementation of provisions related to the “right to be forgotten” and data protection mechanisms. At present, however, these interpretations remain unverified and are discussed only within the context of public debate.
REAL	Russian	В Жамбылской области обсудили развитие цифровых проектов в ходе рабочей поездки…	Discussion on the development of digital projects in the Zhambyl region. During a working visit, representatives of the ministry reported on the development of the region’s digital infrastructure, including the expansion of internet connectivity to populated areas and the implementation of elements related to the Smart City concept. The discussions also addressed the expansion of telecommunication networks and the modernization of base stations. As part of the initiative, plans include the development of an IT hub and improving the accessibility of public services. According to ministry representatives, these measures are intended to enhance communication quality and digital service availability for residents of the region.
FAKE	Russian	В социальных сетях распространяются заявления о скрытой цифровой модели управления…	Claims circulating on social media about a hidden digital governance model. A number of social media publications claim that Smart City projects may involve enhanced monitoring of citizens through digital systems and surveillance cameras. Some users associate the development of digital infrastructure with potential restrictions on traditional forms of service delivery. Additional assumptions suggest a possible transition of the region toward an “experimental” model of digital governance; however, no official confirmation of such claims has been provided. These statements are primarily based on interpretations expressed by social media users.

Table 4. Inter-annotator agreement for the key categories of the KazFakeCorpus multi-level annotation scheme.

Category	Krippendorff’s Alpha (α)
REAL/FAKE	0.88
FAKE_TYPE	0.84
DISINFORMATION_TECHNIQUE	0.81
MODALITY	0.79
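
As an illustration of how agreement values such as those in Table 4 can be obtained, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It assumes nominal categories and no missing annotations; the paper does not state which distance metric or software was used, so this is not necessarily the authors' procedure. The function name `krippendorff_alpha_nominal` and the toy labels in `toy_units` are hypothetical and are not taken from KazFakeCorpus.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of items; each item is the list of labels that the
    annotators assigned to it (assumed complete, at least two labels per item).
    """
    # Coincidence matrix: every ordered pair of labels within an item,
    # weighted by 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label carries no agreement information
        for a, b in permutations(range(m), 2):
            coincidences[(labels[a], labels[b])] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values.
    totals = Counter()
    for (c, _k), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())

    # Observed and expected disagreement (nominal metric: 0 if equal, else 1).
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1.0 - observed / expected


# Hypothetical toy data: four items labelled by two annotators.
toy_units = [
    ["FAKE", "FAKE"],
    ["REAL", "REAL"],
    ["FAKE", "REAL"],
    ["REAL", "REAL"],
]
print(f"alpha = {krippendorff_alpha_nominal(toy_units):.3f}")
```

On this toy input, one disagreement in four items already pulls alpha well below 1, because the coefficient corrects the raw agreement for the agreement expected by chance.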

Table 5. Cross-linguistic distribution of disinformation techniques in the Kazakh and Russian subsets of KazFakeCorpus.

Disinformation Technique	Kazakh (%)	Russian (%)
Misattribution	33.1%	31.9%
Clickbait	22.4%	23.6%
Emotional pressure	17.2%	15.6%
Downplaying	13.8%	14.6%
Exaggeration	13.5%	14.3%
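
The per-language shares in Table 5 can be pooled into corpus-level shares by weighting each language by the size of its FAKE subset. The sketch below does this under the assumption that the percentages are computed over all FAKE texts of each language, with the subset sizes taken from Table 2 (1085 Kazakh, 1053 Russian); the helper `pooled_share` is a hypothetical name introduced only for this example.

```python
# FAKE-class sizes per language, copied from Table 2.
fake_counts = {"Kazakh": 1085, "Russian": 1053}

# Per-language technique shares in percent, copied from Table 5.
technique_shares = {
    "Misattribution": {"Kazakh": 33.1, "Russian": 31.9},
    "Clickbait": {"Kazakh": 22.4, "Russian": 23.6},
    "Emotional pressure": {"Kazakh": 17.2, "Russian": 15.6},
    "Downplaying": {"Kazakh": 13.8, "Russian": 14.6},
    "Exaggeration": {"Kazakh": 13.5, "Russian": 14.3},
}


def pooled_share(shares, counts):
    """Count-weighted average of per-language percentages (assumed weighting)."""
    total = sum(counts.values())
    return sum(shares[lang] * counts[lang] for lang in counts) / total


pooled = {
    technique: round(pooled_share(shares, fake_counts), 1)
    for technique, shares in technique_shares.items()
}
for technique, share in pooled.items():
    print(f"{technique}: {share}%")
```

Under this weighting, the pooled values for misattribution, clickbait, and emotional pressure come out at 32.5%, 23.0%, and 16.4%, which agrees with the corpus-level figures quoted in the abstract.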

Table 6. Comparison of linguistic characteristics of REAL and FAKE texts across the Kazakh and Russian subsets of KazFakeCorpus.

Feature	Kazakh REAL	Kazakh FAKE	Russian REAL	Russian FAKE
Avg. words	278	330	301	348
Avg. sentence length	18.2	22.4	20.1	24.6
Hedging markers (%)	4.1%	18.3%	3.8%	17.9%
Uncertainty markers (%)	2.3%	15.7%	2.1%	14.9%

Table 7. Accuracy, per-class precision, recall, and F1, and macro F1 (all values in %) on the REAL/FAKE task.

Model	Acc.	Prec. REAL	Rec. REAL	F1 REAL	Prec. FAKE	Rec. FAKE	F1 FAKE	Macro F1
TF–IDF + LR	76.1 ± 0.8	75.4	77.0	76.2	77.1	75.2	76.1	76.1 ± 0.7
mBERT	80.4 ± 0.6	79.8	81.2	80.5	81.0	79.6	80.3	80.4 ± 0.5
XLM-RoBERTa	83.1 ± 0.5	82.4	84.0	83.2	83.8	82.1	82.9	83.0 ± 0.6
mDeBERTa-v3	85.0 ± 0.6	84.7	85.9	85.3	85.6	84.1	84.8	85.1 ± 0.5

Table 8. Confusion matrix for mDeBERTa-v3 on the bilingual REAL/FAKE test set.

	Predicted REAL	Predicted FAKE	Total
True REAL	512	84	596
True FAKE	97	499	596
Total	609	583	1192
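
The counts in Table 8 are sufficient to re-derive accuracy and the per-class metrics for this single confusion matrix. The following sketch shows the arithmetic; the helper `per_class_metrics` is a hypothetical name, and the resulting values need not coincide exactly with Table 7, which reports values with a ± spread.

```python
# Counts copied from Table 8 (outer key = true class, inner key = predicted class).
confusion = {
    "REAL": {"REAL": 512, "FAKE": 84},
    "FAKE": {"REAL": 97, "FAKE": 499},
}


def per_class_metrics(matrix, label):
    """Return (precision, recall, F1) for one class of a confusion matrix."""
    true_positives = matrix[label][label]
    predicted_as_label = sum(row[label] for row in matrix.values())  # column total
    actually_label = sum(matrix[label].values())                     # row total
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[label][label] for label in confusion) / total
metrics = {label: per_class_metrics(confusion, label) for label in confusion}
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

With these counts, accuracy is 1011/1192 ≈ 0.848, which matches the reference accuracy reported for the KazFakeCorpus test set in Table 11.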

Table 9. Cross-lingual and bilingual baseline results (XLM-RoBERTa).

Setting	Train	Test	Accuracy	F1
Monolingual	Kazakh	Kazakh	83.0%	83.2%
Monolingual	Russian	Russian	83.5%	84.0%
Joint bilingual	Kazakh + Russian	Kazakh	83.8%	84.1%
Joint bilingual	Kazakh + Russian	Russian	84.2%	84.6%
Cross-lingual	Kazakh	Russian	76.5%	77.2%
Cross-lingual	Russian	Kazakh	75.9%	76.8%

Table 10. Composition of the external validation set.

Class	Language	Number of Texts	Sources	Domains
FAKE	Kazakh	65	Factcheck.kz, Stopfake.kz	politics, health, social
FAKE	Russian	81	Factcheck.kz, Stopfake.kz	politics, health, social
REAL	Kazakh	58	Tengrinews.kz, Informburo.kz, Kazinform.kz	health, economy, social policy
REAL	Russian	76	Tengrinews.kz, Informburo.kz, Kazinform.kz	health, economy, social policy
Total	—	280	—	—

Table 11. External validation results.

Setting	Accuracy	Kazakh F1	Russian F1	Macro F1
KazFakeCorpus → external validation set	0.724	0.702	0.738	0.720
KazFakeCorpus → KazFakeCorpus test (reference)	0.848	0.841	0.849	0.845

Table 12. Distribution of error types for mDeBERTa-v3 on the test set.

Error Type	Description	Count	Share (%)
Stylistic confusion	Authentic FAKE items written in a neutral or institutional register are predicted as REAL.	47	26.4%
Topic-shift effect	REAL items from domains underrepresented in training (e.g., labour, social affairs) are predicted as FAKE.	39	21.9%
Hedging-cue false positives	REAL items containing legitimate hedging (e.g., expert opinion, projection) are predicted as FAKE.	32	18.0%
Short-text errors	Items below 80 words yield insufficient signal for either class.	24	13.5%
Subtle manipulation	FAKE items relying on misattribution or false context, without surface uncertainty markers, are predicted as REAL.	22	12.4%
Annotation-boundary cases	Items at the boundary between misinformation and opinion journalism.	14	7.9%

Table 13. Ablation study: effect of multi-level auxiliary annotation on REAL/FAKE detection.

Configuration	Active Auxiliary Tasks	Acc.	Macro F1	Δ vs. Binary
Binary only	—	85.3 ± 0.6	84.9 ± 0.7	—
+fake_type	fake_type	86.0 ± 0.5	85.8 ± 0.6	+0.9
+technique	fake_type, disinformation_technique	86.5 ± 0.7	86.3 ± 0.5	+1.4
+source/evidence	above + source_credibility, evidence	87.0 ± 0.5	86.8 ± 0.6	+1.9
Full multi-level	above + author_intent, modality	87.3 ± 0.6	87.1 ± 0.5	+2.2

Table 14. Cross-domain generalization (leave-one-domain-out).

Held-Out Domain	Train: Other Four Domains	In-Domain Reference	Δ
Digitalization and AI	82.7%	87.4%	−4.7
Public health and epidemiology	78.3%	85.9%	−7.6
Economy and finance	80.1%	86.3%	−6.2
Domestic politics	76.9%	85.1%	−8.2
Social affairs	79.4%	85.6%	−6.2
Macro-average	79.5%	86.1%	−6.6
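
The Δ column and the macro-average row of Table 14 follow directly from the two score columns. The sketch below recomputes them as a consistency check; the variable names are illustrative, and the scores are simply copied from the table without assuming which metric they represent.

```python
# (held-out score, in-domain reference score) in percent, copied from Table 14.
leave_one_domain_out = {
    "Digitalization and AI": (82.7, 87.4),
    "Public health and epidemiology": (78.3, 85.9),
    "Economy and finance": (80.1, 86.3),
    "Domestic politics": (76.9, 85.1),
    "Social affairs": (79.4, 85.6),
}

# Per-domain drop: held-out score minus in-domain reference.
deltas = {
    domain: round(held_out - reference, 1)
    for domain, (held_out, reference) in leave_one_domain_out.items()
}

# Macro-averages: unweighted means over the five domains.
n_domains = len(leave_one_domain_out)
macro_held_out = round(
    sum(h for h, _ in leave_one_domain_out.values()) / n_domains, 1
)
macro_reference = round(
    sum(r for _, r in leave_one_domain_out.values()) / n_domains, 1
)
macro_delta = round(
    sum(h - r for h, r in leave_one_domain_out.values()) / n_domains, 1
)

for domain, delta in deltas.items():
    print(f"{domain}: {delta:+.1f}")
print(f"Macro-average: {macro_held_out} vs {macro_reference} ({macro_delta:+.1f})")
```

The recomputed values reproduce the table: the largest drop is for Domestic politics (−8.2), the smallest for Digitalization and AI (−4.7), and the macro-average drop is −6.6 points.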

Table 15. Cross-origin generalization between synthetic and authentic FAKE items.

Training FAKE Origin	Test FAKE Origin	Binary Macro F1	MT Macro F1	Δ vs. Binary
Synthetic only	Synthetic	88.1%	89.3% ± 0.3	+1.2
Synthetic only	Authentic	73.0%	74.7% ± 0.9	+1.7
Authentic only	Authentic	84.4%	85.8% ± 0.6	+1.4
Authentic only	Synthetic	81.6%	83.5% ± 0.5	+1.9
Joint (auth. + synth.)	Authentic	85.5%	86.8% ± 0.7	+1.3
Joint (auth. + synth.)	Synthetic	87.3%	88.9% ± 0.4	+1.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lamasheva, Z.; Nekessova, A.; Kantureyeva, M.; Sambetbayeva, M.; Kaldarova, M.; Nazymkhan, A. KazFakeCorpus: A Bilingual Corpus with Multi-Level Semantic Annotation for Fake News Detection. Big Data Cogn. Comput. 2026, 10, 183. https://doi.org/10.3390/bdcc10060183

