5.1. Annotation Protocol
At the final stage of KazFakeCorpus construction, multi-level annotation of news reports was carried out using the Label Studio platform. The annotation was performed in accordance with the developed scheme and covered the key structural, semantic, and contextual characteristics of news texts. This approach makes it possible to account not only for the final veracity of a message but also for the mechanisms of disinformation dissemination and presentation.
In developing the annotation scheme, the authors relied on existing typologies of disinformation, in particular the classifications proposed in [3, 12], in which false content is considered not only from the perspective of truthfulness but also with regard to the form of information distortion, the context of dissemination, and the communicative effect. Based on these approaches, the annotation categories were adapted to the tasks of bilingual analysis of news reports.
The annotation was performed by two independent experts with experience in media text analysis and fact-checking. The annotation team included a linguistics specialist and a fact-checking expert proficient in both Russian and Kazakh. Before the main annotation phase, a pilot study was conducted on a subsample of 120 texts in order to refine category definitions, identify ambiguous cases, and assess the clarity of the instructions. Following the pilot phase, individual definitions and labeling rules were revised and compiled into a unified document, Annotation Guidelines, which was then used during the main annotation phase.
The main annotation procedure involved several consecutive steps. First, a set of Kazakh- and Russian-language news texts was compiled, pre-cleaned, normalized, and balanced. Each text was then independently annotated by two experts across all levels of the annotation scheme, including the final veracity label, the type of fake content, the disinformation technique, the author’s communicative intent, and the message modality. In cases of disagreement, an additional adjudication procedure was conducted, during which disputed cases were discussed on the basis of the rules specified in the annotation guidelines. The final label was assigned only after consensus had been reached between the annotators. Thus, the final version of the corpus was formed on the basis of agreed decisions rather than the individual annotation of a single expert.
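The disagreement-detection step that precedes adjudication can be sketched as follows. This is a minimal illustration, not the authors' tooling: the field names, the label values, and the function `find_disagreements` are hypothetical.

```python
# Annotation levels named in the text; the field names are illustrative assumptions.
LEVELS = ("veracity", "fake_type", "disinformation_technique", "intent", "modality")


def find_disagreements(ann_a, ann_b, levels=LEVELS):
    """Return {doc_id: [levels on which the two annotators differ]}."""
    disputed = {}
    for doc_id in sorted(ann_a.keys() & ann_b.keys()):
        differing = [lv for lv in levels
                     if ann_a[doc_id].get(lv) != ann_b[doc_id].get(lv)]
        if differing:
            disputed[doc_id] = differing
    return disputed


# Hypothetical toy annotations (label values are made up for illustration).
annotator_a = {
    "doc1": {"veracity": "FAKE", "modality": "assertive"},
    "doc2": {"veracity": "FAKE", "modality": "assertive"},
}
annotator_b = {
    "doc1": {"veracity": "FAKE", "modality": "assertive"},
    "doc2": {"veracity": "FAKE", "modality": "speculative"},
}

# Only the documents listed here would be sent to adjudication.
disputed = find_disagreements(annotator_a, annotator_b)
```

Documents that do not appear in `disputed` already carry an agreed label at every level; the remaining ones are discussed against the guidelines until consensus is reached.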
Although only two experts participated in the main annotation phase, the use of a pilot stage, unified annotation guidelines, and an adjudication procedure helped minimize annotation subjectivity. In future work, corpus expansion may involve additional annotators or a separate arbitrator in order to further improve annotation stability.
The overall workflow of the annotation process is shown in Figure 4. It reflects the sequence of stages in corpus construction, from the selection and preparation of news reports to their independent annotation and the formation of the final dataset version. Unlike one-level classification, the proposed approach assumes document-level analysis and makes it possible to capture multiple aspects of the structure of a disinformation message.
To assess annotation reliability, 15% of the corpus, selected at random, was re-annotated by two experts without prior agreement on the decisions. Krippendorff’s alpha [35] was used as a measure of inter-annotator agreement, as it is widely applied to evaluate the consistency of categorical annotation and allows for multiple classes and heterogeneous categories.
The obtained values indicate that agreement between the annotators remained high across the main levels of the scheme. For the REAL/FAKE category, the coefficient was α = 0.88; for fake_type, α = 0.84; for disinformation_technique, α = 0.81; and for modality, α = 0.79. These values indicate good reproducibility of the proposed annotation scheme, although the categories associated with the interpretation of the pragmatic characteristics of the message show slightly greater variability in expert judgments.
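For reference, Krippendorff's alpha for nominal labels can be computed as in the sketch below. It assumes exactly two annotators and no missing labels, which matches the setup described above but is not necessarily how the authors computed the reported values. The function name and the toy labels are illustrative.

```python
from collections import Counter


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing data."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same units")
    n = 2 * len(labels_a)  # total number of pairable values

    # Observed disagreement: each disagreeing unit contributes two
    # off-diagonal entries to the coincidence matrix.
    n_disagree = sum(a != b for a, b in zip(labels_a, labels_b))
    d_o = 2 * n_disagree / n

    # Expected disagreement from the pooled label frequencies n_c:
    # sum over c != k of n_c * n_k, divided by n * (n - 1).
    pooled = Counter(labels_a) + Counter(labels_b)
    d_e = (n * n - sum(c * c for c in pooled.values())) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - d_o / d_e


# Toy example (made-up labels): one disagreement out of four units.
alpha = krippendorff_alpha_nominal(
    ["FAKE", "FAKE", "REAL", "REAL"],
    ["FAKE", "FAKE", "REAL", "FAKE"],
)
```

Alpha equals 1 for perfect agreement and 0 when the observed disagreement matches what chance alone would produce.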
The Krippendorff’s alpha values for the key annotation categories are presented in Table 4.
Taken together, these results confirm the reliability of the proposed annotation scheme and show that KazFakeCorpus can be used not only for binary news classification tasks but also for disinformation analysis at several interrelated levels. The presence of a formalized annotation scheme, detailed annotation guidelines, and an inter-annotator agreement assessment procedure ensures the reproducibility of the corpus and makes it suitable for further research on automatic fake news detection and cross-lingual media analysis.
In addition, a statistical analysis of the corpus was carried out, including an assessment of average text length, class distribution, and the frequency of different disinformation techniques. The results show that the corpus remains balanced both in terms of language (Kazakh and Russian) and veracity labels (REAL/FAKE), which reduces the risk of model bias toward the dominant class and increases the reliability of subsequent experimental studies.
To provide an estimate of statistical uncertainty, 95% confidence intervals were calculated for key proportions in the corpus, including the distribution of disinformation techniques and class balance. The obtained intervals indicate that the observed proportions remain stable within relatively narrow bounds, which reflects the controlled construction and balanced structure of the dataset.
In addition, the Kazakh and Russian subsets were compared to identify possible cross-linguistic variation. No major differences were observed in the overall distribution of classes and disinformation techniques across the two language subsets. This suggests that the annotation scheme was applied in a broadly consistent manner, although a more detailed statistical analysis of cross-linguistic variation remains a direction for future work.
These findings suggest that the observed distributions reflect systematic properties of the corpus design rather than random variation. However, the statistical analysis remains primarily descriptive, as the main objective of the study is corpus construction and validation rather than hypothesis-driven statistical modeling.
Figure 5 presents an example of multi-level annotation of a FAKE news text in the Label Studio environment. The interface illustrates the assignment of core annotation layers, including the final veracity label, fake content types, and contextual fields such as source type, temporal reference, and target entities. The example shows how annotation captures uncertainty markers and references to unverified sources, which are reflected in the selection of specific disinformation-related categories.
Figure 6 provides a complementary example of annotated news text, focusing on the identification of disinformation techniques and semantic relations between text fragments and annotation categories. The highlighted segments are linked to specific manipulation strategies, such as emotional pressure and misattribution, demonstrating how the annotation process captures fine-grained patterns of information distortion.
The presented examples demonstrate how the proposed annotation scheme is applied in practice and show its capacity to capture multiple dimensions of disinformation within a single text. The annotation process combines document-level labeling with the identification of specific linguistic and discursive markers, allowing news content to be represented as a structured set of interconnected elements.
The use of relation-based annotation makes it possible to explicitly link textual fragments to particular disinformation techniques and contextual parameters. This provides a more detailed representation of information distortion than approaches based solely on binary classification. As a result, the corpus supports not only the identification of fake news but also the analysis of how such content is constructed and presented.
From a modeling perspective, this structure creates the conditions for developing methods that operate at both global and local levels of text representation. The availability of fine-grained annotation and explicit semantic relations supports interpretability, since model decisions can be traced back to specific annotated components.
5.3. Statistical Characteristics and Structural Balance of the Corpus
At this stage, the main statistical characteristics of KazFakeCorpus were calculated in order to describe its quantitative and structural organization. The analysis showed that the corpus remains balanced both in terms of language and veracity labels: the Kazakh- and Russian-language parts are represented by a comparable number of texts, and the distribution of the REAL and FAKE classes remains quantitatively equivalent.
In addition, text length analysis, measured in the number of words, was conducted. As shown in Figure 8, most texts are concentrated in the medium-length range, while the overall distribution remains sufficiently broad to include both shorter and longer materials. Such variability corresponds to the characteristics of news discourse and makes it possible to regard the corpus as relatively close to real-world media text conditions.
At the same time, the distribution of text length does not indicate a pronounced structural bias toward either very short or very long materials. This is important for the subsequent use of the resource, as it reduces the likelihood that differences between texts can be explained solely by formal length-related parameters. Thus, the obtained statistical characteristics confirm that KazFakeCorpus is a quantitatively balanced and structurally coherent resource suitable for further analysis of misinformation in the bilingual media space.
To assess statistical uncertainty, 95% bootstrap confidence intervals were computed for the key proportions in the corpus, including the distribution of disinformation techniques and class balance. The obtained intervals indicate that the observed proportions remain stable within relatively narrow bounds.
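A bootstrap interval for a single proportion can be obtained as in the following sketch. The text does not state which bootstrap variant or how many resamples were used, so the percentile method, `n_boot=2000`, and the function name are assumptions made for illustration.

```python
import random


def bootstrap_proportion_ci(labels, target, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the share of `target` labels."""
    rng = random.Random(seed)
    hits = [1 if label == target else 0 for label in labels]
    n = len(hits)

    # Resample the corpus with replacement and record the proportion each time.
    proportions = sorted(
        sum(rng.choices(hits, k=n)) / n for _ in range(n_boot)
    )

    # Take the empirical quantiles that bracket the central `level` mass.
    lo_idx = max(0, int((1 - level) / 2 * n_boot))
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot) - 1)
    return proportions[lo_idx], proportions[hi_idx]


# Synthetic, perfectly balanced label list (not the actual corpus).
toy_labels = ["FAKE"] * 500 + ["REAL"] * 500
ci_low, ci_high = bootstrap_proportion_ci(toy_labels, "FAKE")
```

The same function applies to technique frequencies by passing the list of technique labels and the technique of interest as `target`.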
In addition, a chi-square test of homogeneity was performed to compare the distribution of disinformation techniques across the Kazakh and Russian subsets. The test did not reveal statistically significant differences (p > 0.05), so there is no evidence that the distribution of techniques differs between the two language subsets. A similar test for the distribution of veracity classes across the Kazakh and Russian subsets also did not show statistically significant differences.
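The homogeneity test can be illustrated with a dependency-free sketch. The counts below are hypothetical and are not taken from the corpus; in practice a library routine such as `scipy.stats.chi2_contingency` would normally be used instead of the hand-written tail probability shown here.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    # Regularized lower incomplete gamma:
    # P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_n z^n / ((a+1)...(a+n))
    term = total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def chi2_homogeneity(table):
    """table[i][j]: count of category j in subset i. Returns (stat, dof, p)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, chi2_sf(stat, dof)


# Hypothetical technique counts: rows = Kazakh, Russian; columns = three techniques.
toy_table = [[120, 95, 85], [115, 100, 85]]
stat, dof, p_value = chi2_homogeneity(toy_table)
```

A p-value above 0.05 means the test gives no grounds to reject homogeneity; it does not by itself prove that the two distributions are identical.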
5.4. Experimental Setup and Reproducibility
All experiments were implemented in Python 3.11 using PyTorch 2.1 and the HuggingFace Transformers library (version 4.38). Additional preprocessing and TF–IDF baselines were implemented with scikit-learn (version 1.4). The experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), AMD Ryzen 9 processor, and 64 GB RAM.
The main corpus was divided into training, validation, and test subsets using stratified sampling with a ratio of 70:15:15 while preserving language balance and REAL/FAKE class proportions across all splits. The authentic fake news set described in Section 3.3.1 is held out entirely as an external validation set and is not included in the 70:15:15 split, except in the cross-origin experiment (Section 5.9.3) where it is used as training data under controlled conditions.
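One way to obtain a 70:15:15 split that preserves both language and class proportions is to split each (language, label) stratum separately, as sketched below. The record layout, the language codes, and the function `stratified_split` are assumptions; the authors' actual splitting code is not described in the text.

```python
import random
from collections import defaultdict


def stratified_split(items, key, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split `items` into train/val/test, applying `ratios` within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_val = round(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Synthetic records: 100 texts per (language, label) combination.
toy_corpus = [
    {"id": f"{lang}-{label}-{i}", "lang": lang, "label": label}
    for lang in ("kk", "ru")
    for label in ("REAL", "FAKE")
    for i in range(100)
]
train_set, val_set, test_set = stratified_split(
    toy_corpus, key=lambda r: (r["lang"], r["label"])
)
```

Because the ratios are applied inside every stratum, each split inherits the language and REAL/FAKE balance of the full corpus up to rounding.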
The following HuggingFace model checkpoints were used for the neural baselines: bert-base-multilingual-cased for mBERT, xlm-roberta-base for XLM-RoBERTa, and microsoft/mdeberta-v3-base for mDeBERTa-v3. Transformer-based models were fine-tuned using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, batch size of 16, weight decay of 0.01, maximum sequence length of 256 tokens, linear warmup over the first 10% of training steps, and training for up to 5 epochs. Early stopping based on validation loss with a patience of 2 epochs was applied to reduce overfitting.
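The learning-rate schedule and the stopping rule can be made concrete with a framework-independent sketch. The text specifies only the linear warmup; the linear decay to zero after warmup is an assumption, as are the function names and the example loss values.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which early stopping ends training."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # ran through all available epochs


# Made-up validation losses for a run capped at 5 epochs.
stop_at = stopping_epoch([0.60, 0.50, 0.52, 0.55, 0.40])
```

In this toy run the loss stops improving after epoch 2, so training ends at epoch 4 and the fifth epoch is never executed.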
To reduce the influence of random initialization, each experiment was repeated across five random seeds (42, 52, 62, 72, and 82). The results reported in Tables 7–13 correspond to mean values averaged across runs, and standard deviations across the five runs are reported alongside the mean values.
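Aggregation over seeds amounts to a mean and a standard deviation per metric, as in the short sketch below. The scores are placeholder values and not results from the paper, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

SEEDS = (42, 52, 62, 72, 82)  # seeds listed in the text


def summarize_runs(scores_by_seed):
    """Return (mean, sample standard deviation) of one metric across all seeds."""
    values = [scores_by_seed[seed] for seed in SEEDS]
    return mean(values), stdev(values)


# Placeholder macro-F1 values, one per seed (NOT results from the paper).
toy_scores = {42: 0.84, 52: 0.85, 62: 0.86, 72: 0.85, 82: 0.85}
f1_mean, f1_std = summarize_runs(toy_scores)
```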
Model evaluation was performed using Accuracy, Precision, Recall, and Macro F1-score, complemented by confusion matrix analysis and cross-domain generalization experiments described in Section 5.9.
5.5. Baseline Detection Experiments
To assess the usability of KazFakeCorpus in a bilingual fake news detection setting, baseline classification experiments were conducted for the REAL/FAKE task.
Baseline experiments are reported for four models: a TF–IDF baseline with logistic regression, multilingual BERT, XLM-RoBERTa, and multilingual DeBERTa-v3 (mDeBERTa-v3). For each model, accuracy, precision, recall, and F1 are reported per class (REAL, FAKE), together with macro-averaged scores; all values are computed on the test split. As shown in Table 7, per-class precision and recall remain close for all neural models, indicating that errors are not concentrated on a single class.
The TF–IDF baseline shows slightly lower recall on the FAKE class and greater sensitivity to surface lexical cues, suggesting that traditional lexical representations are more vulnerable to stylistic imbalance than transformer-based multilingual encoders. To examine the error structure of the strongest model, Table 8 presents the confusion matrix for mDeBERTa-v3 on the bilingual test set (n = 1192 examples; 50% REAL, 50% FAKE).
The off-diagonal cells show that both types of errors are present, with 84 REAL items incorrectly classified as FAKE and 97 FAKE items incorrectly classified as REAL. This indicates a slight tendency of the model to classify ambiguous FAKE texts as REAL, although the overall prediction distribution remains relatively balanced. A language-level inspection showed similar error patterns, with minor variation across the Kazakh and Russian subsets.
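The error counts quoted above are enough to recover the per-class and macro-averaged metrics. The sketch below does this under the assumption that the test set contains exactly 596 REAL and 596 FAKE items (50% of n = 1192 each); the function name is illustrative, and the resulting values may differ slightly from the tabulated ones if those are averaged over seeds.

```python
def binary_metrics(n_real, n_fake, real_as_fake, fake_as_real):
    """Accuracy, per-class precision/recall/F1 and macro F1 from error counts."""
    real_ok = n_real - real_as_fake  # REAL items predicted as REAL
    fake_ok = n_fake - fake_as_real  # FAKE items predicted as FAKE

    def prf(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    real = prf(tp=real_ok, fp=fake_as_real, fn=real_as_fake)
    fake = prf(tp=fake_ok, fp=real_as_fake, fn=fake_as_real)
    return {
        "accuracy": (real_ok + fake_ok) / (n_real + n_fake),
        "REAL": real,
        "FAKE": fake,
        "macro_f1": (real["f1"] + fake["f1"]) / 2,
    }


# Counts from the text: 84 REAL -> FAKE errors, 97 FAKE -> REAL errors.
metrics = binary_metrics(n_real=596, n_fake=596, real_as_fake=84, fake_as_real=97)
```

With these counts, 1011 of 1192 items are classified correctly, and FAKE recall (499/596) is slightly lower than REAL recall (512/596), which is the asymmetry noted above.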
The results show consistent performance across the Kazakh and Russian subsets, confirming the structural balance and cross-lingual coherence of the corpus. The small performance gap between languages suggests that the unified annotation scheme effectively supports bilingual fake news detection.
In addition to within-language baselines, we conducted cross-lingual evaluation using XLM-RoBERTa. The model was trained in three settings: monolingual, joint bilingual, and cross-lingual transfer (KZ → RU and RU → KZ). The results are presented in Table 9.
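The three training settings can be expressed as train/test pairings over the language subsets, as in the sketch below. The language codes (`kk`, `ru`), the record layout, and the choice to evaluate the joint model on the full bilingual test set are assumptions made for illustration.

```python
def by_lang(data, lang):
    """Select the records written in the given language."""
    return [record for record in data if record["lang"] == lang]


def build_settings(train, test):
    """Map each experimental setting to its (training set, test set) pair."""
    return {
        # Monolingual: train and evaluate within one language.
        "mono_kk": (by_lang(train, "kk"), by_lang(test, "kk")),
        "mono_ru": (by_lang(train, "ru"), by_lang(test, "ru")),
        # Joint bilingual: train and evaluate on both languages together.
        "joint": (list(train), list(test)),
        # Cross-lingual transfer: train on one language, evaluate on the other.
        "kk_to_ru": (by_lang(train, "kk"), by_lang(test, "ru")),
        "ru_to_kk": (by_lang(train, "ru"), by_lang(test, "kk")),
    }


# Tiny synthetic splits, only to show the resulting pairings.
toy_train = [{"lang": "kk", "id": 1}, {"lang": "ru", "id": 2}, {"lang": "kk", "id": 3}]
toy_test = [{"lang": "kk", "id": 4}, {"lang": "ru", "id": 5}]
settings = build_settings(toy_train, toy_test)
```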
The results show that joint bilingual training slightly improves performance compared to monolingual models, while cross-lingual transfer leads to a moderate decrease in performance. Nevertheless, the results confirm that the unified annotation scheme enables cross-lingual generalization and supports bilingual model development.