NLP Models for Military Terminology Analysis and Detection of Information Operations on Social Media
Abstract
1. Introduction
- Build a multilingual military-related corpus in Kazakh, Russian and English from social media and news sources.
- Develop a multi-level annotation scheme that jointly covers entities/context and IO pragmatics (IO_TYPE, TARGET_AUDIENCE, AUTHOR_INTENT, FAKE_CLAIM, EMO_EVAL, platform/time/engagement).
- Benchmark baseline models and propose Onto-IO-BERT in order to demonstrate the practical utility of the corpus for IO detection tasks.
- The study’s primary contributions are as follows:
- Multi_mil Corpus: A corpus of 1000 texts in Kazakh, Russian and English was collated from military-relevant social media and news feeds.
- Pragmatic IO framework: integrated layers link source, text, target audience/IO type, platform/time and engagement/emotion. The framework enables the exploration of manipulation strategies that extend beyond standard NER/event labels.
- Quality-controlled manual annotation: texts were tagged manually in Label Studio with high inter-annotator agreement (κ = 0.82) and exported in open formats (CSV/JSON/CoNLL, etc.).
- Baselines and Onto-IO-BERT: fundamental approaches (logistic regression, support vector machines, multilingual BERT and XLM-R) are compared with Onto-IO-BERT, in which features derived from the IO ontology are embedded in the transformer. Onto-IO-BERT obtains the best macro-F1 score on the IO tasks and supports the extraction of relations between IO entities.
- The resource provides support for real-time threat signals, social media analytics, and cross-lingual modelling for the low-resource Kazakh language. It complements the “event” and “propaganda” datasets, and it offers a reproducible infrastructure for extensions (new languages/platforms, more nuanced IO subtypes).
2. Related Works
- The first category comprises military-themed resources that focus on terminology or events, such as MilTAC, DiPLoMAT and CMNEE.
- The second category comprises propaganda and disinformation resources that map persuasion techniques, framing, and bias.
- The third category comprises general event extraction datasets that can serve as infrastructure.
| No. | Criterion | MilTAC | DiPLoMAT | CMNEE | Multi_mil |
|---|---|---|---|---|---|
| 1. | Objective | Study of tactical military lexicon and command structures | Analysis of rhetorical and pragmatic strategies in military discourse | Event extraction from Chinese military news at the document level | Automatic identification and classification of information operations in social media and news |
| 2. | Language | English | English | Chinese | Kazakh, Russian, English |
| 3. | Annotation Type | Entity-based (NER), syntactic | Discourse, rhetorical, pragmatic | Event-based | Entity + classification + interpretative (multi-level) |
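| 4. | Annotation Categories | MIL_ENTITY, UNIT, TIME, LOCATION | CLAIM, RHETORICAL DEVICE, POLITENESS STRATEGY | Conflict, Deploy, Exhibit, Support, Accident, Manoeuvre, Injure, Experiment | MIL_TERM, GEO_LOC, TIME_REF, SOURCE, TARGET_ENTITY, TARGET_AUDIENCE, AUTHOR_INTENT, IO_TYPE, EMO_EVAL, FAKE_CLAIM |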
| 5. | Corpus Size | ~30,000 messages | ~5000 documents | 17,000 documents | 1000 texts |
| 6. | Data Format | XML, CoNLL | JSON, XML | JSON | CSV, JSON, TCB, CoNLL |
| 7. | Annotation Tools | BRAT | Programmatic annotation + expert revision | Two-stage annotation | Label Studio, expert annotation |
| 8. | Focus | Tactical terminology | Diplomatic and military discourse | Event-level information in military texts | Information-psychological operations (IO), propaganda, disinformation |
| 9. | Notable Features | Formalized military lexicon | Analysis of persuasive and rhetorical strategies | Annotation complexity: overlapping events, long arguments, co-reference | Multi-level annotation combining entities, pragmatics, and emotional tone |
| 10. | Resources | Military reports, briefings | Negotiations, press releases | Military News (China) | Telegram, Instagram, news aggregators |
| 11. | Limitations of comparison | No IO pragmatics (IO_TYPE, AUTHOR_INTENT, EMO_EVAL, FAKE_CLAIM); monolingual; no clear source-audience-platform chains. | Focus on rhetoric; lack of emotion/fake indicators; poor applicability to social media; no end-to-end IO layers. | Event summary without IO pragmatics; monolingual; social media not covered. | Multi-layered IO scheme; high inter-annotator agreement; suitability for IO detection and relation extraction; baselines + Onto-IO-BERT. |
| 12. | Accessibility | Partially open | Partially open | Fully open | Partially open |
- propaganda and framing detection, where tagging is performed at the span level;
- event-centric military corpora (e.g., CMNEE) with document-level and argumentation roles;
- military-specific terminology resources (e.g., MilTAC).
- IO pragmatics (IO_TYPE, AUTHOR_INTENT, TARGET_AUDIENCE, FAKE_CLAIM, EMO_EVAL);
- multilingual coverage of KK-RU-EN with social media sources;
- explicit source → text → audience/platform/time → engagement/emotion links, which enables IO detection and relation extraction beyond standard NER/events.
3. Methods
- Thematic relevance to military and geopolitical topics;
- Uniqueness, with no duplicates or cross-posted content;
- Linguistic saturation with features of information influence.
- id—a unique identifier for each record;
- date—the publication timestamp in ISO format;
- sender_id—an identifier for the source of the publication (e.g., a Telegram channel);
- text_clean—a cleaned version of the message text that has undergone preprocessing, including the removal of noise, hyperlinks, emojis, and other irrelevant symbols.
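
The record fields listed above can be illustrated with a minimal sketch. The regular expressions, the emoji code-point ranges and the helper names `clean_text` and `make_record` are assumptions made for illustration; the cleaning rules actually used for the corpus are not specified in the text.

```python
import re
from datetime import datetime, timezone

# Assumed patterns: hyperlinks, a rough (non-exhaustive) set of emoji ranges,
# and runs of whitespace left behind after removal.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Remove hyperlinks and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", raw)
    text = EMOJI_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def make_record(rec_id: int, published: datetime, sender_id: str, raw_text: str) -> dict:
    """Build one record with the four fields: id, date, sender_id, text_clean."""
    return {
        "id": rec_id,
        # ISO 8601 timestamp, normalised to UTC (expects a timezone-aware datetime).
        "date": published.astimezone(timezone.utc).isoformat(),
        "sender_id": sender_id,
        "text_clean": clean_text(raw_text),
    }
```

A call such as `make_record(1, datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc), "channel_1", raw)` returns a dictionary ready for CSV or JSON export; `channel_1` is a placeholder identifier.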

- A sequence of BIO tags marks the entity spans.
- The type of influence is given by IO_TYPE.
- The emotional tone is rated on the EMO_EVAL scale.
- Indications of falsified information are flagged by FAKE_CLAIM.
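
As an illustration of how these four outputs can sit together in one annotated example, the sketch below converts token-level spans into BIO tags and attaches the document-level labels. The function name `spans_to_bio`, the span format (start token, exclusive end token, label) and the label values in the example record are assumptions, not the export format of the corpus.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into a BIO tag sequence."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


tokens = ["armored", "vehicle", "near", "Bakhmut"]
spans = [(0, 2, "MIL_TERM"), (3, 4, "GEO_LOC")]

# Hypothetical record: the document-level values are illustrative only.
example = {
    "tokens": tokens,
    "bio": spans_to_bio(tokens, spans),
    "IO_TYPE": "INTIMIDATION",
    "EMO_EVAL": "Negative",
    "FAKE_CLAIM": False,
}
```

Token-level tags and document-level labels are kept in separate fields so that the same record can feed both a sequence tagger and a document classifier.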
3.1. Baseline Methods
3.1.1. Evaluation of NER Tools
- The linkage of text fragments to the ontology
- exact match > lemmatized match > normalized edit distance (≤0.2);
- additional contextual similarity based on the cosine measure of the fastText vector in a window of ±20 tokens.
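
The matching priority described above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions: the `lemmatize` argument is a stand-in (lower-casing by default) for a real Kazakh/Russian/English lemmatizer, the edit distance is normalized by the longer string, and the fastText-based contextual similarity step is omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def link_fragment(fragment, ontology_terms, lemmatize=str.lower, max_dist=0.2):
    """Return (matched_term, match_type) following exact > lemma > fuzzy priority."""
    if fragment in ontology_terms:
        return fragment, "exact"
    lemma = lemmatize(fragment)
    for term in ontology_terms:
        if lemmatize(term) == lemma:
            return term, "lemma"
    best = min(ontology_terms,
               key=lambda t: normalized_distance(lemma, lemmatize(t)),
               default=None)
    if best is not None and normalized_distance(lemma, lemmatize(best)) <= max_dist:
        return best, "fuzzy"
    return None, "none"
```

For example, with the terms `["artillery", "counterattack", "UAV"]`, the misspelling `counteratack` is one edit away from `counterattack` (normalized distance 1/13), so it falls under the 0.2 threshold and is linked at the fuzzy stage.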
- Logical inference and document-level feature generation
- aimed_at → TARGET_AUDIENCE,
- mentions_term → MIL_TERM,
- expresses_intent → AUTHOR_INTENT,
- engaged_by/published_on → PLATFORM_TYPE.
- Schema constraints (domain/range) are applied in order to validate and refine relationships.
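
One way to picture the domain/range validation is as a filter over candidate triples. The relation names and their ranges follow the list above; the single `MESSAGE` domain and the function name `validate_triples` are assumptions introduced for this sketch, since the domains are not stated in the text.

```python
# Ranges taken from the relation list; the domain is an assumed placeholder.
RANGE = {
    "aimed_at": "TARGET_AUDIENCE",
    "mentions_term": "MIL_TERM",
    "expresses_intent": "AUTHOR_INTENT",
    "engaged_by": "PLATFORM_TYPE",
    "published_on": "PLATFORM_TYPE",
}
DOMAIN = {relation: "MESSAGE" for relation in RANGE}  # assumption


def validate_triples(triples, entity_types):
    """Keep only (subject, relation, object) triples that satisfy domain and range."""
    kept = []
    for subj, rel, obj in triples:
        if rel not in RANGE:
            continue  # unknown relation type
        if entity_types.get(subj) == DOMAIN[rel] and entity_types.get(obj) == RANGE[rel]:
            kept.append((subj, rel, obj))
    return kept


entity_types = {
    "msg1": "MESSAGE",
    "civilians": "TARGET_AUDIENCE",
    "UAV": "MIL_TERM",
    "Telegram": "PLATFORM_TYPE",
}
candidates = [
    ("msg1", "aimed_at", "civilians"),
    ("msg1", "aimed_at", "UAV"),          # wrong range, dropped
    ("msg1", "published_on", "Telegram"),
    ("UAV", "mentions_term", "msg1"),     # wrong domain and range, dropped
]
valid = validate_triples(candidates, entity_types)
```

In an OWL setting the same check would normally be delegated to a reasoner; the dictionary lookup here only mirrors the idea.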
- Token-Level Ontology Embeddings
- Incorporation of ontological features into BERT (fusion mechanism)
- Concatenation at the [CLS] level (baseline fusion). In the first variant, the latent representation of the [CLS] token is concatenated with the document projection obtained via a multilayer perceptron (MLP).
- Feature concatenation at the token level
- Adaptive fusion with attention-style gating (Gated Fusion, the core Onto-IO-BERT model). In the third variant, a weight vector g is calculated for the [CLS] token; it determines the degree to which the model relies on the textual representation or on the ontological features.
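
The gating idea can be sketched in a few lines. The exact formulation used in Onto-IO-BERT is not given here, so the following is one common form of gated fusion, stated as an assumption: g = σ(W_g [h; o] + b_g) and z = g ⊙ h + (1 − g) ⊙ o, where h is the [CLS] representation and o is an ontology vector already projected to the same dimension. All names (`gated_fusion`, `W_g`, `b_g`) are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(h_cls, o_doc, W_g, b_g):
    """Blend text and ontology vectors with an element-wise gate.

    h_cls, o_doc: vectors of equal length d; W_g: d x 2d matrix; b_g: length-d bias.
    Returns (fused_vector, gate).
    """
    concat = list(h_cls) + list(o_doc)
    gate = [sigmoid(s + b) for s, b in zip(matvec(W_g, concat), b_g)]
    fused = [g * h + (1.0 - g) * o for g, h, o in zip(gate, h_cls, o_doc)]
    return fused, gate


h = [1.0, 2.0]
o = [3.0, 0.0]
zero_W = [[0.0] * 4 for _ in range(2)]
fused_even, gate_even = gated_fusion(h, o, zero_W, [0.0, 0.0])
```

With zero weights and zero bias the gate equals 0.5 everywhere, so the fused vector is the plain average of the two inputs; a large positive bias pushes the gate towards 1 and the output towards the textual representation.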
- Training Parameters
- Quality Control and Ablation Experiments
3.1.2. Baseline Models for Information Operations Classification
- Logistic Regression (with TF-IDF features) is a simple linear model that serves as a starting point in the absence of contextual embeddings;
- SVM (Support Vector Machine) is a classical model with a linear kernel that is robust to sparsity and imbalance;
- Multilingual BERT (mBERT) is a multilingual transformer model that has been trained on several languages, including Russian and Kazakh;
- XLM-RoBERTa represents a significant advancement in multilingual modelling, characterised by an enhanced architecture and extensive language coverage;
- Onto-IO-BERT is the model proposed in this study. It uses a transformer architecture enriched with ontological features extracted from the OWL ontology of information operations.
- $\hat{y}_{c} = \sigma(z_{c})$ is the predicted probability of text $x$ belonging to category $c$,
- $z_{c}$ is the logit output from the model for label $c$,
- σ is the sigmoid activation function.
- The classical models of logistic regression and support vector machines have been demonstrated to exhibit limited accuracy. This limitation can be attributed to their inability to incorporate contextual and semantic features of the text. These models are capable of capturing superficial statistical patterns; however, they are unable to recognise subtle pragmatic differences between types of influence.
- The employment of multilingual transformer models (mBERT and XLM-RoBERTa) has been demonstrated to result in a substantial enhancement of the quality of classification, a phenomenon attributable to the capacity to establish context-dependent representations of words and sentences.
- The highest result is achieved by the Onto-IO-BERT model, which additionally uses ontological knowledge. Semantic features such as the author’s intent, goal type, source, and target audience were extracted by logical inference (reasoning) over the ontological schema and integrated into the model representations at the encoding or classification stage.
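
The sigmoid output layer shared by the classifiers above can be illustrated as follows. This is a sketch under assumptions: the label subset, the 0.5 decision threshold and the function name `predict_io_types` are illustrative, and the logits would in practice come from the classification head of the model.

```python
import math

# Illustrative subset of IO_TYPE categories.
IO_LABELS = ["DISINFORMATION", "DEMORALIZATION", "DISCREDITATION", "INTIMIDATION"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_io_types(logits, threshold=0.5):
    """Map one logit per label to a probability and keep labels above the threshold."""
    probs = {label: sigmoid(z) for label, z in zip(IO_LABELS, logits)}
    active = [label for label, p in probs.items() if p >= threshold]
    return probs, active


probs, active = predict_io_types([2.0, -1.0, 0.3, -3.0])
```

Because each label has its own sigmoid, the probabilities do not need to sum to one, which allows a message to carry more than one IO type.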

3.1.3. Relation Extraction Based on the Onto-IO-BERT Model
- $(e_i, e_j)$ is a pair of entities in the text T;
- R is the set of possible types of relations;
- γ is the predicted relation between entities.
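
Relation prediction for an entity pair can be read as choosing the relation type from R with the highest probability. The sketch below assumes a softmax over one score per relation type; the `no_relation` class, the score values and the function name `predict_relation` are assumptions made for illustration.

```python
import math

# Relation inventory: names from the ontology plus an assumed "no_relation" class.
RELATIONS = ["aimed_at", "mentions_term", "expresses_intent", "published_on", "no_relation"]


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_relation(scores):
    """Return the most probable relation type and its probability."""
    probs = softmax(scores)
    best = max(range(len(RELATIONS)), key=probs.__getitem__)
    return RELATIONS[best], probs[best]


# Hypothetical scores for one entity pair in a text.
relation, confidence = predict_relation([0.1, 2.5, -1.0, 0.3, 0.0])
```

The predicted relation can then be passed through the schema constraints of the ontology so that pairs with incompatible entity types are discarded.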
4. Results
- Partial Overlap: Allowed counts two annotations as matching when the selected fragments partially overlap.
- Partial Overlap: Excluded applies a strict requirement that the annotations coincide completely.
| Category | Partial Overlap: Allowed | Partial Overlap: Excluded |
|---|---|---|
| MIL_TERM (F1) | 0.91 | 0.84 |
| GEO_LOC (F1) | 0.88 | 0.79 |
| TIME_REF (F1) | 0.86 | 0.80 |
| AUTHOR_INTENT (κ) | 0.72 | 0.68 |
| TARGET_AUDIENCE (κ) | 0.76 | 0.73 |
| TARGET_ENTITY (κ) | 0.79 | 0.75 |
| IO_TYPE (κ) | 0.82 | 0.80 |
| EMO_EVAL (κ) | 0.85 | 0.83 |
| FAKE_CLAIM (κ) | 0.88 | 0.87 |
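
The two kinds of scores in the table can be illustrated with a small sketch: a pairwise span F1 under the two overlap settings, and Cohen's kappa for categorical labels. The span format (start, exclusive end, label), the choice of the first annotator as reference and the function names are assumptions; the computation actually used for the table is not specified.

```python
from collections import Counter


def spans_match(a, b, allow_partial):
    """Spans are (start, end_exclusive, label); labels must agree in both modes."""
    (s1, e1, l1), (s2, e2, l2) = a, b
    if l1 != l2:
        return False
    if allow_partial:
        return s1 < e2 and s2 < e1      # any character overlap
    return (s1, e1) == (s2, e2)         # exact boundaries


def span_f1(ann_a, ann_b, allow_partial):
    """Pairwise F1 between two annotators, treating ann_a as the reference."""
    hit_a = sum(any(spans_match(a, b, allow_partial) for b in ann_b) for a in ann_a)
    hit_b = sum(any(spans_match(b, a, allow_partial) for a in ann_a) for b in ann_b)
    recall = hit_a / len(ann_a) if ann_a else 0.0
    precision = hit_b / len(ann_b) if ann_b else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


ann_a = [(0, 2, "MIL_TERM"), (5, 6, "GEO_LOC")]
ann_b = [(1, 2, "MIL_TERM"), (5, 6, "GEO_LOC")]
f1_partial = span_f1(ann_a, ann_b, allow_partial=True)
f1_strict = span_f1(ann_a, ann_b, allow_partial=False)
kappa = cohen_kappa(["Neg", "Neg", "Neu", "Pos"], ["Neg", "Neu", "Neu", "Pos"])
```

In this toy case the boundary disagreement on the first span is forgiven in the partial setting and penalised in the strict one, which mirrors the direction of the difference between the two columns of the table.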


5. Use Case: Detecting Information Operations in Telegram News Channels
6. Discussion
- Disinformation monitoring and early warning. Multilingual models trained on the corpus can automatically label messages with FAKE_CLAIM / DISINFORMATION attributes, identify IO_TYPE (demoralization, intimidation, delegitimization, etc.), and prioritise cases for analysts by creating review queues and topical digests by platform/region/time.
- Support for multi-task multilingual models. Combining the entity → context/events → pragmatics layers (AUTHOR_INTENT, TARGET_AUDIENCE, EMO_EVAL, FAKE_CLAIM) enables the training of multi-task models (NER + document-level classification + RE), thereby increasing portability across languages (KK-RU-EN) and domains (social media, news).
- Impact and Tactics Analytics. The link between source, text, audience/platform/time, and emotion/falsehood enables the creation of a comprehensive overview of IO patterns. This includes the identification of techniques and emotions that are statistically more prevalent on specific platforms during particular periods, as well as the analysis of the relationships between EMO_EVAL and IO_TYPE.
- System Resilience Assessment. The corpus is utilised as a testbed for the evaluation of disinformation detectors, with specific focus on stress testing (domain shift, code switching, sarcasm/irony). Additionally, it facilitates the measurement of quality biases across languages and genres.
- Application Interfaces. Given the compatibility of annotations with Label Studio/JSON/CoNLL, the “data → model → dashboard” flow is advantageous in terms of convenience: detection, case grouping, explainability (keywords, attributions), and report export for operations centres.
- Manual annotation provides strong quality control, flexibility in the interpretation of ambiguous expressions, and the detection of hidden rhetorical patterns.
- The annotation schema is complex and combines semantic, syntactic and pragmatic layers, including IO_TYPE (the nature of the information operation) and EMO_EVAL (the emotional tone).
- The implementation is modular and runs in a collaborative interface (Label Studio), which allows clear visualisation and export to multiple formats.
- The pilot validation shows substantial inter-annotator agreement, indicating that the annotation guidelines are consistent and reliable.
- Language and domain balance. The initial version of the corpus is imbalanced across languages and platforms, with Russian-language social media predominating. This affects the portability of models and the interpretation of evaluation metrics.
- Subjectivity of pragmatics. The AUTHOR_INTENT and TARGET_AUDIENCE categories are inherently implicit; even with detailed guidelines, disagreements and dependence on context, irony and memes are inevitable.
- Rare and overlapping labels. Certain IO_TYPE classes and relationships between labels are uncommon, which impedes the training of complex models and the stability of evaluations.
- Source bias. The set of sources and time windows may have introduced platform-regional biases.
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
- Read the entire text in order to understand the context, tone and purpose of the message.
- Determine whether the text contains any signs of informational influence, emotional pressure, manipulation, exaggeration, discrediting, etc.
- Approach the task with care and objectivity, avoiding subjective interpretations that are not corroborated by the text (see Table A2).
| Tag | Description | Examples Retrieved from the Documents |
|---|---|---|
| MIL_TERM | Words or phrases denoting military objects, actions, ranks, equipment, operations, or tactics. | RU: наступление, кoнтратака, артиллерия, БПЛА, рoта, oкoпы, брoнетехника. EN: offensive, counterattack, artillery, UAV (drone), company, trenches, armored vehicle. |
| IO_TYPE | General classification of the text by type of information operation (see other categories). | RU: демoрализация, пoдрыв автoритета, запугивание EN: demoralization, discreditation, intimidation. |
| TARGET_AUDIENCE | Groups targeted by the message: civilians, military personnel, specific nations or political groups. | RU: граждане, вoеннoслужащие, казахстанцы, жители oккупирoванных территoрий, междунарoднoе сooбществo. EN: civilians, soldiers, Kazakhstani citizens, residents of occupied territories, international community. |
| EMO_EVAL | Overall emotional tone of the message —annotated at the message level. | RU: Пoзитивная, Нейтральная, Негативная. EN: Positive, Neutral, Negative. |
| SOURCE | Any mention of the information source: media, Telegram channels, official accounts or spokespersons. | RU: Минoбoрoны РФ, BBC News, @warjournalist, Reuters, украинские вoенные истoчники. EN: Russian MoD, BBC News, @warjournalist, Reuters, Ukrainian military sources. |
| GEO_LOC | Names of countries, cities, regions, territories, water, and other geographic toponyms. | RU: Бахмут, Дoнбасс, Крым, Харькoвская oбласть, Чёрнoе мoре. EN: Bakhmut, Donbas, Crimea, Kharkiv region, Black Sea. |
| TIME_REF | Mentions of dates, timeframes, historical periods, or moments in time. | RU: сегoдня утрoм, в феврале 2022 гoда, вo время oккупации, недавнo, в начале вoйны. EN: this morning, in February 2022, during the occupation, recently, at the start of the war. |
| AUTHOR_INTENT | Interpretation of the author’s presumed motive. What does he want to convey, evoke or suggest? Used as a classification label at the statement level. | RU: вызвать страх, пoдoрвать дoверие, сoздать сoчувствие, пoдтoлкнуть к действию, oбелить свoю стoрoну. EN: instill fear, undermine trust, evoke sympathy, prompt action, justify one’s side. |
| FAKE_CLAIM | Text containing suspicion of falsity or false statements. | RU: Уничтoженo 3 тысячи танкoв за день → FAKE_CLAIM: True Пo заявлению МО, пoтерь нет → FAKE_CLAIM: False EN: 3000 tanks destroyed in one day → FAKE_CLAIM: True MoD reports no losses → FAKE_CLAIM: False |
| TARGET_ENTITY | Persons, groups or institutions against whom the message or propaganda is directed. These may be governments, armies, organisations or nations. | RU: ВСУ, НАТО, администрация президента, вoлoнтёры, oппoзиция, кoмандoвание EN: Ukrainian Armed Forces (UAF), NATO, presidential administration, volunteers, opposition, command. |
- DISINFORMATION refers to lies or distortion of facts, fake figures and facts, distortion of official statements, manipulation of statistics, forged documents and fake sources.
- DEMORALIZATION refers to instilling hopelessness, decline, emphasising losses, propagating the futility of the struggle, demonstrating the weakness of the army/people, casting doubt on victory, isolation and alienation.
- DISCREDITATION refers to undermining authority, criticising the command, accusations of treason, undermining trust in institutions of power, exposing corruption and incompetence.
- PANIC_CREATION refers to creating panic: false reports of evacuation, information about shortages of resources (food, water), calls for flight, panic or mass action, and dramatisation of ordinary events as disasters.
- HATE_INCITEMENT refers to inciting hatred, dehumanising the enemy, ethnic or religious division, dividing society into ‘us’ and ‘them’, accusations of national treason.
- INTIMIDATION refers to threats of violence, prophecies of disasters and calamities, signs of imminent invasion, and descriptions of the ‘horrors of war’ intended to frighten.
- PROVOCATION refers to provocative actions, fake news about the cruelty of one’s own troops, manipulative calls for violence, incitement of anger and hatred towards specific actions, and false accusations of crimes against the civilian population.
- AUTHORITY_UNDERSCORE is emphasising authority (often for manipulative purposes).
- Identification of information operations: identify the propaganda techniques used and assess the emotional tone and the context (see Table A3).
- Identification of entities: mark key military terms and participants in information operations.
- Determination of falsity: verify statements against open sources.
- Regular meetings and discussion of controversial cases (see Table A4).
- Calculation of agreement metrics: Cohen’s Kappa, Precision/Recall, IAA.
- Maintain objectivity and refrain from subjective assessment (see Table A5).
- Maintain the confidentiality of data.
- Encourage annotators to express diverse perspectives.
- Document complex cases in a general log.
| Error | Example | Incorrect Label | Correct Label | Correction |
|---|---|---|---|---|
| Confusing discreditation with disinformation | RU: Кoмандoвание скрывает пoтери EN: Command is concealing casualties | DISCREDITATION | DISINFORMATION | Disinformation = lies, Discreditation = undermining authority |
| Using demoralization instead of intimidation | RU: Вас всех уничтoжат EN: You will all be destroyed | DEMORALIZATION | INTIMIDATION | Threat → INTIMIDATION, Losses/fatigue → DEMORALIZATION |
| Ignoring emotional manipulation | RU: Армия убила мирных жителей EN: The army killed civilians | - | PROVOCATION | Emotional fakes → PROVOCATION |

| Error | Example | Missing Label | Correction |
|---|---|---|---|
| Terms from the thesaurus are not annotated | RU: минoмёты oткрыли oгoнь EN: mortars opened fire | MIL_TERM | All entities from the military thesaurus must be annotated |
| Actions are not marked | RU: наступление, блoкирoвка, рейд EN: offensive, blockade, raid | - | Tactical actions are also annotated as MIL_TERM |
| Titles/roles are omitted | RU: генерал, кoмандир, сапёр EN: general, commander, sapper | - | Annotate as MIL_TERM if present in the thesaurus |

| Error | Example | What is Missing | Correction |
|---|---|---|---|
| Geography is not marked | RU: на границе с Казахстанoм EN: on the border with Kazakhstan | GEO_LOC | All place names and geographical references should be annotated |
| Time expressions such as ‘a week ago’ or ‘yesterday’ are not marked | RU: вчера вечерoм EN: last night | TIME_REF | All time references should be marked |
| Sources are not marked | RU: сooбщили в Telegram EN: reported on Telegram | SOURCE | Source → separate entity |

| Error | Example | Explanation | Correction |
|---|---|---|---|
| Different annotators choose different boundaries | RU: пoнёс бoльшие пoтери vs. бoльшие пoтери EN: suffered heavy losses vs. heavy losses | Unjustified shortening of the span | Always mark up the entire fragment that conveys meaning |
| Phrases that are too long | RU: вoеннoе наступление спецназа с применением минoмётoв EN: military offensive by special forces using mortars | Over-annotation | Mark up only key entities, not the entire fragment |
References
- Zhukabayeva, T.; Ahmad, Z.; Yerimbetova, A.; Sambetbayeva, M.; Telman, D.; Bayangali, A.; Daiyrbayeva, E. A Comprehensive Review of NLP Techniques for Military Terminologies and Information Operations on Social Media. IEEE Access 2025, 13, 154930–154947. [Google Scholar] [CrossRef]
- Sambetbayeva, M.; Nekessova, A.; Yerimbetova, A.; Bayangali, A.; Kaldarova, M.; Telman, D.; Smailov, N. A Multi-Level Annotation Model for Fake News Detection: Implementing Kazakh-Russian Corpus via Label Studio. Big Data Cogn. Comput. 2025, 9, 215. [Google Scholar] [CrossRef]
- Duzen, Z.; Riveni, M.; Aktas, M.S. Analyzing the spread of misinformation on social networks: A process and software architecture for detection and analysis. Computers 2023, 12, 232. [Google Scholar] [CrossRef]
- Starbird, S.; Arif, A.; Wilson, T. Disinformation as Collaborative Work: Surfacing the Participatory Nature of Strategic Information Operations. In Proceedings of the ACM on Human-Computer Interaction, Austin, TX, USA, 9–13 November 2019; Volume 3, pp. 1–26. [Google Scholar] [CrossRef]
- Alshuwaier, F.A.; Alsulaiman, F.A. Fake news detection using machine learning and Deep Learning Algorithms: A comprehensive review and future perspectives. Computers 2025, 14, 394. [Google Scholar] [CrossRef]
- Modzelewski, A.; Da San Martino, G.; Savov, P.; Wilczyńska, M.A.; Wierzbicki, A. MIPD: Exploring Manipulation and Intention in a Novel Corpus of Disinformation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 1234–1245. [Google Scholar] [CrossRef]
- Klein, I. Information Operations and Influence Campaigns; IEEE Phoenix Computer Society: Washington, DC, USA, 2024. [Google Scholar]
- Park, C.Y.; Mendelsohn, J.; Field, A.; Tsvetkov, Y. Challenges and Opportunities in Information Manipulation Detection: An Examination of Wartime Russian Media. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 5209–5235. [Google Scholar] [CrossRef]
- Lin, Y.; Wang, H.; Celikyilmaz, A. A Survey on Recent Advances in Named Entity Recognition from Deep Learning Models. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
- Derczynski, R.; Bontcheva, L.; Roberts, K. Broad Twitter Corpus: A Diverse Named Entity Recognition Dataset. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), Osaka, Japan, 11–16 December 2016; pp. 1169–1179. [Google Scholar]
- Barrón-Cedeño, A.; Rosso, P.; Yu, S. Fine-Grained Analysis of Propaganda in News Article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Hong Kong, China, 3–7 November 2019; pp. 564–573. [Google Scholar] [CrossRef]
- Rashkin, H.; Choi, E.; Jang, J.Y.; Volkova, S.; Choi, Y. Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Copenhagen, Denmark, 7–11 September 2017; pp. 2921–2927. [Google Scholar] [CrossRef]
- Horn, C.; Wiegand, M.; Klakow, D. Towards a Multidimensional Model of Media Bias in News Articles. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM, USA, 20–26 August 2018; pp. 498–509. [Google Scholar]
- Reisigl, M.; Wodak, R. The Discourse-Historical Approach. In Methods of Critical Discourse Studies, 3rd ed.; Sage: Thousand Oaks, CA, USA, 2015; pp. 23–61. [Google Scholar] [CrossRef]
- Neha, F.; Bansal, A. Understanding the architecture of vision transformer and its variants: A review. In Proceedings of the 2024 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR), Muscat, Oman, 14–15 May 2024; IEEE: New York, NY, USA, 2024. [Google Scholar] [CrossRef]
- van Dijk, T. Critical Discourse Studies: A Sociocognitive Approach. In Methods of CDA; Palgrave Macmillan: New York, NY, USA, 2011. [Google Scholar]
- Zeldes, A. The GUM Corpus: Creating Multilayer Resources in a University Setting. Lang. Resour. Eval. 2016, 51, 581–612. [Google Scholar] [CrossRef]
- Al-Rawi, A. Framing the Syrian Conflict on Twitter. Glob. Media Commun. 2014, 10, 153–170. [Google Scholar]
- Vego, E. Effects-Based Operations: A Critique. Jt. Forces Q. 2006, 51–57. [Google Scholar]
- Dandeker, C. Military Language and Strategic Doctrine. Sociol. Rev. 2006, 54, 581–598. [Google Scholar]
- Röttger, P.; Schröder, M.; Grotov, A.; Augenstein, I. Harmful but Legal: Cognitive Framing and Large Language Models in Online Information Warfare. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
- Curtis, M. Scaparrotti, Joint Chiefs of Staff, Joint Publication 3-13: Information Operations, U.S. Department of Defense. 2012. Available online: https://irp.fas.org/doddir/dod/jp3_13.pdf (accessed on 28 August 2025).
- U.S. Department of the Army. ATP 3-13.1: The Conduct of Information Operations. 2018. Available online: https://irp.fas.org/doddir/army/atp3-13-1.pdf (accessed on 30 August 2025).
- NATO StratCom Centre of Excellence. Hybrid Threats and Disinformation Toolkit. 2021. Available online: https://stratcomcoe.org (accessed on 28 August 2025).
- Liu, Z.; Sun, K.; Xu, C. Named Entity Recognition for Chinese Social Media Text with Weak Supervision. J. Data Inf. Sci. 2022, 7, 114–129. [Google Scholar]
- Zhong, H.; Xie, Y.; Li, Y.; Sun, M. WIKIBIAS: Detecting Biased Language via Contextualized Word Representations. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 1840–1851. [Google Scholar]
- Šeleng, M.; Konopík, M.; Holub, M. Named Entity Recognition for Slovak Fire Incident Reports Using SlovakBERT. Procedia Comput. Sci. 2025, 223, 33–42. [Google Scholar]
- Walker, C.; Strassel, S.; Medero, J.; Maeda, K. ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia. 2006. Available online: https://catalog.ldc.upenn.edu/LDC2006T06 (accessed on 28 August 2025).
- Wang, X.; Wang, Z.; Han, X.; Jiang, W.; Han, R.; Liu, Z.; Li, J.; Li, P.; Lin, Y.; Zhou, J. MAVEN: A Massive General Domain Event Detection Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1652–1671. [Google Scholar] [CrossRef]
- Li, X.; Li, F.; Pan, L.; Chen, Y.; Peng, W.; Wang, Q.; Lyu, Y.; Zhu, Y. DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios. In Natural Language Processing and Chinese Computing; Zhu, X., Zhang, M., Hong, Y., He, R., Eds.; Springer: Cham, Switzerland, 2020; pp. 534–545. [Google Scholar]
- Ebner, S.; Xia, P.; Culkin, R.; Rawlins, K.; Van Durme, B. Multi-Sentence Argument Linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8057–8077. [Google Scholar] [CrossRef]
- Li, S.; Ji, H.; Han, J. Document-Level Event Argument Extraction by Conditional Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 894–908. [Google Scholar]
- Tong, M.; Xu, B.; Wang, S.; Han, M.; Cao, Y.; Zhu, J.; Chen, S.; Hou, L.; Li, J. DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3970–3982. [Google Scholar] [CrossRef]
- Huang, H.; Sun, J.; Wei, H.; Xiao, K.; Wang, M.; Li, X. A Dataset of Domain Events Based on Open-Source Military News. China Sci. Data 2023, 8, 30. [Google Scholar] [CrossRef]
- Zhu, M.; Xu, Z.; Zeng, K.; Xiao, K.; Wang, M.; Ke, W.; Huang, H. CMNEE: A Large-Scale Document-Level Event Extraction Dataset Based on Open-Source Chinese Military News. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 3367–3379. [Google Scholar]
- Chen, L.-C.; Chang, K.-H.; Yang, S.-C. Integrating corpus-based and NLP approach to extract terminology and domain-oriented information: An example of US military corpus. Acta Scientiarum. Technol. 2022, 44, e60486. [Google Scholar] [CrossRef]
- Li, H.; Zhu, S.-C.; Zheng, Z. DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning. arXiv 2023, arXiv:2306.09030. Available online: https://arxiv.org/abs/2306.09030 (accessed on 10 September 2025). [CrossRef]
- Label Studio Documentation. Label Studio Annotator Guide. 2023. Available online: https://labelstud.io/guide (accessed on 10 September 2025).
- ISO/TC 37; Language and Terminology. International Organization for Standardization. (n.d.): Geneva, Switzerland, 1947. Available online: https://www.iso.org/committee/48104.html (accessed on 15 September 2025).
- ISO 24617-1:2012; Language Resource Management—Semantic Annotation Framework (SemAF)—Part 1: Time and Events (ISO-TimeML). International Organization for Standardization: Geneva, Switzerland, 2012. Available online: https://www.iso.org/standard/37331.html (accessed on 15 September 2025).
- ISO 24617-6:2016; Language Resource Management—Semantic Annotation Framework—Part 6: Principles of Semantic Annotation (SemAF Principles). International Organization for Standardization: Geneva, Switzerland, 2016. Available online: https://www.iso.org/standard/60581.html (accessed on 18 September 2025).
- Wardle, C.; Derakhshan, H. Information Disorder: Toward an Interdisciplinary Framework for Research and Policymaking. Council of Europe. 2017. Available online: https://shorensteincenter.org/information-disorder-framework-for-research-and-policymaking/ (accessed on 18 September 2025).
- President of the Republic of Kazakhstan. On the Approval of the Information Doctrine of the Republic of Kazakhstan. Decree No. 145. 2023. Available online: https://adilet.zan.kz/rus/docs/U2300000145 (accessed on 19 September 2025).
- Schmitt, M.N. (Ed.) Tallinn Manual 2.0 on the International Law Applicable to Cyber Operations; Cambridge University Press: Cambridge, UK, 2017.
- NATO Military Committee, MC 0422/6 NATO Military Policy on Information Operations—Draft. 2018. Available online: https://shape.nato.int/resources/3/images/2018/upcoming%20events/MC%20Draft_Info%20Ops.pdf (accessed on 19 September 2025).
- Artstein, R.; Poesio, M. Survey Article: Inter-Coder Agreement for Computational Linguistics. Comput. Linguist. 2008, 34, 555–596.
- Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; Zhao, J. Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism; ACL: Stroudsburg, PA, USA, 2018.
- Fu, Y.; He, Z.; Lin, Y.; Liu, Z.; Li, J. GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction; ACL: Stroudsburg, PA, USA, 2019.







| Category | Description | Example |
|---|---|---|
| DISINFORMATION | Intentional dissemination of false information | RU: все подразделения армии сдались без боя EN: All army units surrendered without a fight. |
| DEMORALIZATION | Undermining morale | RU: на фронте царит хаос, офицеры бегут первыми EN: Chaos reigns on the front line, officers are fleeing first. |
| DISCREDITATION | Undermining authority or public trust | RU: руководство армии не контролирует ситуацию EN: The army leadership has lost control of the situation. |
| INTIMIDATION | Creating fear or an atmosphere of threat | RU: наступление будет сопровождаться массовыми потерями EN: The offensive will result in massive casualties. |
| HATE_INCITEMENT | Inciting hatred or hostility | RU: они—враги нашей нации EN: They are the enemies of our nation. |
| PANIC_CREATION | Spreading panic and alarm | RU: все склады с боеприпасами взорваны EN: All ammunition depots have been blown up. |
| PROVOCATION | Provoking conflict or retaliatory aggression | RU: они первые нарушили перемирие EN: They were the first to break the ceasefire. |
| AUTHORITY_UNDERSCORE | Manipulatively emphasizing or undermining authority | RU: высшее командование молчит в такой критический момент EN: High command remains silent in such a critical moment. |

| Category | Description | Example |
|---|---|---|
| Positive | Statements evoking encouragement, support, or hope | RU: героическая оборона EN: Heroic defense |
| Negative | Expressions of fear, anxiety, aggression, or contempt | RU: позорная сдача позиций EN: Shameful surrender of positions |
| Neutral | Informational statements without emotional coloring | RU: в ходе операции были задействованы силы двух батальонов EN: Two battalions were deployed during the operation |

| Category | Description | Example |
|---|---|---|
| True | Assigned if the statement exhibits signs of falsification, manipulation, or contradicts verified facts | RU: президент отдал приказ уничтожить всех гражданских EN: The president gave the order to eliminate all civilians |
| False | Assigned in all other cases | RU: в ходе операции проводилась эвакуация населения EN: The operation included the evacuation of civilians |
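
The three label inventories above (eight IO categories, three emotional-evaluation values, and a binary fake-claim flag) can be held in a small typed record so that misspelled labels are rejected at load time. The following is a minimal sketch, not the authors' tooling: the field names `IO_TYPE`, `EMO_EVAL`, and `FAKE_CLAIM` follow the annotation layers named in the Introduction, while the flat one-row-per-text layout, the class names, and the helper `labels_from_row` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class IOType(str, Enum):
    """IO categories exactly as listed in the category table."""
    DISINFORMATION = "DISINFORMATION"
    DEMORALIZATION = "DEMORALIZATION"
    DISCREDITATION = "DISCREDITATION"
    INTIMIDATION = "INTIMIDATION"
    HATE_INCITEMENT = "HATE_INCITEMENT"
    PANIC_CREATION = "PANIC_CREATION"
    PROVOCATION = "PROVOCATION"
    AUTHORITY_UNDERSCORE = "AUTHORITY_UNDERSCORE"


class EmoEval(str, Enum):
    """Emotional-evaluation values from the second table."""
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"


@dataclass(frozen=True)
class TextLabels:
    io_type: IOType
    emo_eval: EmoEval
    fake_claim: bool


def _parse_flag(value: str) -> bool:
    # The fake-claim table uses the literal values "True" and "False".
    if value not in ("True", "False"):
        raise ValueError(f"FAKE_CLAIM must be 'True' or 'False', got {value!r}")
    return value == "True"


def labels_from_row(row: Mapping[str, str]) -> TextLabels:
    """Validate one exported row (hypothetical flat layout, one text per row)."""
    return TextLabels(
        io_type=IOType(row["IO_TYPE"]),      # raises ValueError on unknown label
        emo_eval=EmoEval(row["EMO_EVAL"]),
        fake_claim=_parse_flag(row["FAKE_CLAIM"]),
    )
```

For example, `labels_from_row({"IO_TYPE": "DEMORALIZATION", "EMO_EVAL": "Negative", "FAKE_CLAIM": "False"})` yields a validated record, whereas an unknown category raises `ValueError` instead of silently entering the training data.
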

| Entity | Example | Annotation |
|---|---|---|
| MIL_TERM | RU: массированный артиллерийский удар EN: massive artillery strike | [[массированный артиллерийский удар]] MIL_TERM |
| MIL_TERM | RU: казахстанские военные подразделения EN: Kazakhstani military units | [[казахстанские]] GEO_LOC [[военные подразделения]] MIL_TERM |
| AUTHOR_INTENT | RU: С целью дискредитации командования EN: With the aim of discrediting the command | С целью [[дискредитации]] AUTHOR_INTENT: DISCREDIT [[командования]] TARGET_ENTITY |
| AUTHOR_INTENT | RU: Направлено на дезориентацию населения EN: Aimed at disorienting the population | Направлено [[на дезориентацию]] AUTHOR_INTENT: DISINFORMATION [[населения]] TARGET_AUDIENCE |
| TARGET_AUDIENCE | RU: Обращение к русскоязычной аудитории EN: Appeal to the Russian-speaking audience | Обращение к [[русскоязычной аудитории]] TARGET_AUDIENCE |
| TARGET_AUDIENCE | RU: Предупреждение для семей военнослужащих EN: Warning for families of military personnel | Предупреждение для [[семей военнослужащих]] TARGET_AUDIENCE |
| TARGET_ENTITY | RU: Обвинения в адрес правительства EN: Accusations against the government | Обвинения в адрес [[правительства]] TARGET_ENTITY |
| TARGET_ENTITY | RU: Недовольство военным руководством EN: Discontent with military leadership | Недовольство [[военным руководством]] TARGET_ENTITY |
| GEO_LOC | RU: Жанаозенский регион EN: Zhanaozen region | [[Жанаозенский регион]] GEO_LOC |
| GEO_LOC | RU: У границ Казахстана EN: Near Kazakhstan’s borders | У границ [[Казахстана]] GEO_LOC |
| SOURCE | RU: По данным телеграм-канала «WarNews» EN: According to the Telegram channel “WarNews” | По данным [[телеграм-канала «WarNews»]] SOURCE |
| SOURCE | RU: Заявление Министерства обороны EN: Statement by the Ministry of Defense | Заявление [[Министерства обороны]] SOURCE |
| TIME_REF | RU: утром 24 февраля EN: In the morning of February 24 | [[утром]] TIME_REF [[24 февраля]] TIME_REF |
| TIME_REF | RU: во время оккупации Крыма EN: During the occupation of Crimea | Во время [[оккупации]] MIL_TERM [[Крыма]] GEO_LOC |
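
The bracket notation in the Annotation column above, `[[span]] LABEL` with an optional `: SUBTYPE` suffix, can be converted into character-offset spans of the kind sequence-labelling models consume. The sketch below is an illustrative parser written for this notation only; it is not the Label Studio export format, and the regular expression, the `Span` type, and the function name `parse_inline` are assumptions.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# [[text]] LABEL  or  [[text]] LABEL: SUBTYPE  (whitespace before LABEL is optional)
SPAN_RE = re.compile(r"\[\[(.+?)\]\]\s*([A-Z_]+)(?::\s*([A-Z_]+))?")


class Span(NamedTuple):
    start: int              # offset into the plain (unannotated) text
    end: int                # exclusive end offset
    text: str
    label: str
    subtype: Optional[str]


def parse_inline(annotated: str) -> Tuple[str, List[Span]]:
    """Strip the markup and return (plain_text, spans with plain-text offsets)."""
    parts: List[str] = []
    spans: List[Span] = []
    cursor = 0   # position in the annotated string
    offset = 0   # length of the plain text built so far
    for match in SPAN_RE.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        offset += len(before)
        text = match.group(1)
        spans.append(Span(offset, offset + len(text), text,
                          match.group(2), match.group(3)))
        parts.append(text)
        offset += len(text)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), spans


if __name__ == "__main__":
    plain, spans = parse_inline(
        "С целью [[дискредитации]] AUTHOR_INTENT: DISCREDIT "
        "[[командования]] TARGET_ENTITY"
    )
    print(plain)
    for span in spans:
        print(span)
```

Offsets refer to the text with markup removed, so `plain[span.start:span.end] == span.text` holds for every span; this makes the output straightforward to align with tokenizer offsets when building BIO tags.
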

| Model | Accuracy | Precision | Recall | F1-Measure |
|---|---|---|---|---|
| Logistic Regression | 0.65 | 0.63 | 0.64 | 0.63 |
| Support Vector Machine (SVM) | 0.69 | 0.67 | 0.67 | 0.66 |
| Multilingual BERT (mBERT) | 0.77 | 0.75 | 0.76 | 0.75 |
| XLM-RoBERTa | 0.79 | 0.78 | 0.77 | 0.77 |
| Onto-IO-BERT | 0.83 | 0.81 | 0.82 | 0.81 |
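
For readers who wish to reproduce scores of the kind reported above, the following sketch computes accuracy together with macro-averaged precision, recall, and F1 from gold and predicted labels. Macro averaging is assumed here because the Introduction refers to macro-F1; the function name `macro_scores` and the toy labels in the usage note are illustrative, and this is not the authors' evaluation script.

```python
from collections import Counter
from typing import Dict, Sequence


def macro_scores(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1 (single-label task)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[g] += 1   # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        denom_p = tp[label] + fp[label]
        denom_r = tp[label] + fn[label]
        p = tp[label] / denom_p if denom_p else 0.0
        r = tp[label] / denom_r if denom_r else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,   # mean of per-class F1, not F1 of the averaged P and R
    }
```

Calling `macro_scores(["A", "A", "B", "B"], ["A", "B", "B", "B"])` gives an accuracy of 0.75 and a macro-F1 of about 0.73. Because macro-F1 is the mean of per-class F1 values, it need not equal the harmonic mean of the macro precision and macro recall printed in the same row.
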

| Relationship Type | Precision | Recall | F1-Measure |
|---|---|---|---|
| SOURCE → IO_TYPE | 0.81 | 0.75 | 0.77 |
| AUTHOR_INTENT → TARGET_ENTITY | 0.78 | 0.76 | 0.77 |
| GEO_LOC → IO_TYPE | 0.72 | 0.69 | 0.70 |
| TIME_REF → IO_TYPE | 0.73 | 0.68 | 0.70 |
| FAKE_CLAIM → IO_TYPE | 0.76 | 0.71 | 0.73 |
| Average for all types | 0.76 | 0.72 | 0.74 |
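
The "Average for all types" row can be checked arithmetically from the five rows above it. The sketch below recomputes the column means and, separately, the harmonic mean of the averaged precision and recall. It is a check on the printed numbers only; how the authors derived the average row is not stated in this section, so both variants are shown, and the names `ROWS`, `harmonic_f1`, and `summarize` are illustrative.

```python
from statistics import mean
from typing import Dict, Tuple

# (precision, recall, F1) per relation type, copied from the table above.
ROWS: Dict[str, Tuple[float, float, float]] = {
    "SOURCE -> IO_TYPE": (0.81, 0.75, 0.77),
    "AUTHOR_INTENT -> TARGET_ENTITY": (0.78, 0.76, 0.77),
    "GEO_LOC -> IO_TYPE": (0.72, 0.69, 0.70),
    "TIME_REF -> IO_TYPE": (0.73, 0.68, 0.70),
    "FAKE_CLAIM -> IO_TYPE": (0.76, 0.71, 0.73),
}


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def summarize(rows: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    precisions, recalls, f1s = zip(*rows.values())
    avg_p, avg_r = mean(precisions), mean(recalls)
    return {
        "precision": avg_p,
        "recall": avg_r,
        "f1_mean_of_rows": mean(f1s),              # average of the F1 column
        "f1_from_avg_pr": harmonic_f1(avg_p, avg_r),  # F1 of averaged P and R
    }


if __name__ == "__main__":
    for name, value in summarize(ROWS).items():
        print(f"{name}: {value:.3f}")
```

With the tabulated values, the mean precision and recall round to 0.76 and 0.72, matching the table. The mean of the F1 column rounds to 0.73, whereas the harmonic mean of the averaged precision and recall rounds to 0.74, so the printed 0.74 is consistent with the latter computation.
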

| Category | Analysis Results | Examples |
|---|---|---|
| DEMORALIZATION | High frequency of lexemes | RU: бессмысленно, никто не поможет, всё кончено EN: meaningless, no one will help, it’s all over |
| DISCREDITATION | Repeated accusatory constructions | RU: предали, спрятали правду, не вывели людей EN: betrayed, hid the truth, did not evacuate people |
| DISINFORMATION | Exaggerated numbers | RU: сообщения о якобы уничтожении тысяч единиц техники EN: Reports of the alleged destruction of thousands of pieces of equipment |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abdygalym, B.; Sambetbayeva, M.; Yerimbetova, A.; Nekessova, A.; Tasbolatuly, N.; Smailov, N.; Nazymkhan, A. NLP Models for Military Terminology Analysis and Detection of Information Operations on Social Media. Computers 2025, 14, 485. https://doi.org/10.3390/computers14110485

