Data Foundations for Medical AI: Provenance, Reliability and Limitations of Russian Clinical NLP Resources

Litvinov, Arsenii; Malishevskii, Lev; Karpulevich, Evgeny; Bespalov, Iaroslav; Nedumov, Yaroslav; Zhdanov, Sergey; Oseledets, Ivan; Shlyakhto, Evgeniy; Avetisyan, Arutyun

doi:10.3390/informatics13030045

Open Access Review

by Arsenii Litvinov ¹,*, Lev Malishevskii ², Evgeny Karpulevich ¹, Iaroslav Bespalov ³, Yaroslav Nedumov ¹, Sergey Zhdanov ⁴, Ivan Oseledets ³, Evgeniy Shlyakhto ² and Arutyun Avetisyan ¹

¹ Trusted AI Research Center, Russian Academy of Sciences, 100904 Moscow, Russia
² Almazov National Medical Research Center, 197341 St. Petersburg, Russia
³ Artificial Intelligence Research Institute, 123112 Moscow, Russia
⁴ Health Industry Center, PJSC Sberbank, 115432 Moscow, Russia
* Author to whom correspondence should be addressed.

Informatics 2026, 13(3), 45; https://doi.org/10.3390/informatics13030045

Submission received: 11 December 2025 / Revised: 18 February 2026 / Accepted: 26 February 2026 / Published: 20 March 2026

(This article belongs to the Special Issue From Data to Evidence: Transformative AI for Real-World Data)

Abstract

Russian-language resources for medical natural language processing (NLP) are expanding rapidly; however, their fragmentation, uneven curation, and limited clinical reliability hinder the development of safe machine learning systems for prognosis, prevention, and precision medicine. We provide the first systematic survey of Russian medical NLP datasets and analyze their suitability for clinically meaningful tasks as defined by the MedHELM taxonomy. We additionally perform expert clinical validation of three representative public corpora—RuMedPrimeData (real outpatient notes), MedSyn (synthetic clinical notes), and RuMedNLI (translated natural language inference)—assessing clinical plausibility, diagnosis accuracy, and logical consistency. Experts identified substantial reliability issues: across randomly sampled subsets of each corpus, only approximately 20% of RuMedPrimeData records, fewer than 15% of MedSyn records, and approximately 55% of RuMedNLI pairs met essential quality criteria, which can hinder downstream ML systems built on these data. To support robust applications—ranging from medical chatbots and triage assistants to predictive and preventive models—we outline practical requirements for high-quality datasets: coordinated, expert-validated, machine-readable corpora aligned with clinical guidelines and insurance logic, standardized de-identification, and transparent provenance. Strengthening these data foundations will enable the development of reliable, reproducible, and clinically relevant AI systems suitable for real-world healthcare applications.

Keywords: survey; datasets; benchmarks; large language models; data provenance; Russian medical natural language processing; expert validation; clinical guidelines

1. Introduction

In recent years, there has been a rapid surge of interest in large language models (LLMs) within the medical domain. Modern models such as Google’s Med-PaLM have, for the first time, surpassed the “passing” threshold on the USMLE medical licensing exam [1]. Extensive English-language benchmarks for medical NLP have emerged, covering a wide range of tasks, from answering patient questions and solving medical exam problems to named entity recognition and clinical document summarization [1,2]. However, the vast majority of these resources are available only in English. Until recently, only a few Russian-language medical datasets were available [3]. This creates a significant gap: models trained exclusively on English data fail to account for the linguistic features and healthcare system of the Russian Federation. Literal translation of existing English datasets is insufficient, as medical terminology, measurement units, and healthcare context often lack direct equivalents and require careful expert adaptation [4,5].

Therefore, there is an urgent need to review Russian-language medical datasets with consideration of local characteristics—such as mandatory clinical guidelines and the Compulsory Health Insurance System (CHIS)—and to compare them with their English-language counterparts.

Comprehensive benchmarks that combine multiple formalized tasks are widely used in the research community to assess model performance within a specific domain [6]. In general NLP, such collections include the English-language GLUE [7] and SuperGLUE [8] benchmarks for evaluating language understanding. For the medical field, domain-specific evaluation suites such as MedQA [9], MedMCQA [10], and MedNLI [11] have been introduced, enabling systematic assessment of model capabilities across a spectrum of medical reasoning and knowledge tasks.

MedHELM [12] represents one of the most detailed taxonomies of medical NLP tasks designed for evaluating LLMs. It comprises five main categories and 121 individual tasks: (1) Clinical Decision Support—patient data analysis and diagnostic or treatment recommendations; (2) Clinical Note Generation—automation and structuring of clinical documentation; (3) Patient Communication and Education—simplifying information and answering patient queries; (4) Medical Research Assistance—literature review and scientific data analysis; and (5) Administration and Workflow—scheduling, resource management, and document processing. This taxonomy provides a unified framework developed in collaboration with domain experts to describe the full spectrum of medical tasks potentially solvable by LLMs.

The aim of this survey is to describe and systematize Russian-language medical data, analyze existing corpora and identify domains that remain uncovered. Building on the MedHELM [12] taxonomy, we examine the extent to which current data support clinically meaningful tasks related to decision-making, documentation, and communication in medicine. Clinical guidelines play an especially important and legally binding role in the Russian healthcare system, which influences the formulation and interpretation of many clinical tasks. To assess the overall reliability of available data, we selected three representative datasets—covering real, translated, and generated clinical texts—and conducted expert verification by practicing physicians. The survey also briefly highlights MedHELM [12] task types that currently lack Russian-language counterparts. This evaluation provides insight into how trustworthy existing data are for training and validating medical AI systems.

2. Russian Medical Data Sources

Existing surveys and benchmarks usually categorize medical datasets into four broad clusters: electronic health records (EHRs), scientific literature, web data, and knowledge bases [13]. While this approach captures the main origins of current datasets used for medical NLP and LLM training, it overlooks many types of medical information that combine clinical, communicative, and regulatory elements. For example, multidisciplinary consultations or patient–doctor dialogues often fall outside these four groups, while legally significant documents or expert evaluations remain underrepresented despite their importance for healthcare decision-making. Furthermore, such classification tends to mix functional roles (clinical, educational, administrative) with technical storage forms (EHRs, web sources), which limits its usefulness for understanding how data are produced and used in practice.

To provide a more comprehensive view, we propose a functional taxonomy of medical textual data that focuses on their role in medical activity and source of origin. The framework distinguishes five complementary categories (Figure 1), which correspond to key layers of healthcare information.

2.1. Clinical Practice Documents

This category covers both outpatient and inpatient medical records, representing the routine documentation of diagnosis and treatment.

Outpatient documents—including consultation notes, diagnostic test results, referral forms and medical certificates—describe longitudinal care in ambulatory settings.

Inpatient documentation—such as admission notes, surgery and anesthesia reports, daily progress notes and discharge summaries—records the course of treatment in hospitals.

Together, these sources form the clinical basis of the patient’s history and correspond functionally to what is often referred to as electronic health records (EHRs) in international taxonomies. They represent structured, regulated data generated directly during medical care.

2.2. Communication and Interaction Data

This group includes materials reflecting professional and patient-facing communication: doctor–doctor interactions (e.g., multidisciplinary team meetings), patient–doctor consultations (including conversations recorded at the doctor’s office with the consent of both parties), and receptionist–patient appointment dialogues. It also encompasses broader social communication such as medical forums, news posts and blogs, which may be written by either healthcare professionals or non-specialists. Such sources capture spontaneous clinical language and reasoning but differ widely in reliability: verified teleconsultations and professional exchanges stand at one end of the spectrum, while open forums and patient reviews, where anyone can comment, occupy the opposite end.

2.3. Scientific, Educational and Regulatory Resources

These texts include formally validated materials such as clinical practice guidelines, professional standards, peer-reviewed research and educational or methodological publications. In the Russian context, clinical practice guidelines occupy the highest authority in medical decision-making, followed by scientific and review literature that systematizes evidence, while educational materials (e.g., textbooks or training manuals) serve a supportive, didactic role. Collectively, these sources provide verified and structured medical knowledge that underlies both evidence-based medicine and the development of decision-support or reasoning systems.

2.4. Administrative and Legal Documents

This category includes highly standardized records used for formal medical and legal purposes—official medical certificates and statements, consent forms, referral and registration documents, as well as expert and evaluation reports. These texts rarely contain detailed clinical reasoning but are crucial for documenting institutional procedures, patient rights, and the legal basis of healthcare activities.

2.5. Auxiliary Reference Materials

This final group comprises structured repositories such as medical classifiers (ICD [14], ATC [15], SNOMED [16]), drug and device registries and terminological or ontological bases maintained by state agencies and research organizations. They provide the formal vocabulary and cross-linking mechanisms necessary for interoperability and knowledge extraction across datasets.
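The five categories of Sections 2.1 to 2.5 can be written down as a small data structure, which makes it easier to tag corpora by functional role. The sketch below is illustrative only: the names `SourceCategory`, `EXAMPLE_DOCUMENT_TYPES` and `categorize` are hypothetical, and the document types are a few of the examples mentioned in the text, not an exhaustive list.

```python
from enum import Enum
from typing import Optional


class SourceCategory(Enum):
    """Functional categories of medical text, following Sections 2.1-2.5."""
    CLINICAL_PRACTICE = "Clinical practice documents"
    COMMUNICATION = "Communication and interaction data"
    SCIENTIFIC_REGULATORY = "Scientific, educational and regulatory resources"
    ADMINISTRATIVE_LEGAL = "Administrative and legal documents"
    AUXILIARY_REFERENCE = "Auxiliary reference materials"


# Hypothetical, non-exhaustive lookup built from examples given in the text.
EXAMPLE_DOCUMENT_TYPES = {
    "discharge summary": SourceCategory.CLINICAL_PRACTICE,
    "consultation note": SourceCategory.CLINICAL_PRACTICE,
    "patient-doctor consultation": SourceCategory.COMMUNICATION,
    "medical forum post": SourceCategory.COMMUNICATION,
    "clinical practice guideline": SourceCategory.SCIENTIFIC_REGULATORY,
    "textbook": SourceCategory.SCIENTIFIC_REGULATORY,
    "consent form": SourceCategory.ADMINISTRATIVE_LEGAL,
    "drug registry": SourceCategory.AUXILIARY_REFERENCE,
}


def categorize(document_type: str) -> Optional[SourceCategory]:
    """Return the category of a known document type, or None if it is unknown."""
    return EXAMPLE_DOCUMENT_TYPES.get(document_type.strip().lower())
```

Keeping the category separate from the storage form (EHR export, web scrape, and so on) mirrors the distinction the taxonomy draws between functional role and technical origin.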

3. Specificities of Russian Clinical Practice

3.1. Mandatory Clinical Guidelines

According to Federal Law No. 323-FL of 21 November 2011, “On the Fundamentals of Public Health Protection in the Russian Federation” (hereinafter, “Law No. 323-FL”), medical care is organized and delivered in compliance with the mandatory Procedures for the Provision of Medical Care, which are binding for all medical organizations across Russia. These procedures are based on clinical guidelines and take into account standards of medical care, except for care provided within the framework of clinical approbation. Clinical guidelines further promote the standardization of medical documentation, not only in form but also in content. For NLP, this implies that when creating training datasets, it is necessary to include texts of clinical guidelines and to verify that the model’s outputs align with these official provisions.

3.2. The Role of the CHIS

The CHIS is a model of social health insurance. A medical organization receives payment for care provided to a patient, but only for those services covered by the state guarantee program and performed in accordance with established rules. The entire patient management process is strictly regulated: (1) the Procedures for the Provision of Medical Care define the stages, conditions, and standards of care for specific diseases; (2) clinical guidelines contain diagnostic and treatment algorithms based on evidence-based medicine; and (3) standards of medical care specify the required procedures, medications, and tests for a given diagnosis (by ICD-10 code [14]). To receive reimbursement, medical organizations must maintain strict reporting through a unified, standardized document flow, which allows a patient’s pathway to be tracked from primary care to hospital or specialized center. These voluminous, standardized, and interconnected records with clear ICD [14] coding are a significant advantage for model training, but they are not without drawbacks. For instance, template-like descriptions of objective status and a narrow focus on the patient’s specific diagnosis are common, which artificially reduces data diversity.

3.3. Linguistic Features of Russian and Units of Measurement

Russian medical records contain a large number of abbreviations and acronyms, including those adopted within specific institutions for convenience and workflow efficiency. Models trained on English data are likely to misinterpret these Russian-specific abbreviations. Furthermore, many laboratory values and pharmaceutical products use units of measurement that differ from those in Europe and the US (e.g., mmol/L instead of mg/dL, temperature in Celsius, blood group designations using Roman numerals alongside the ABO system). Pharmacological agents are often registered under different brand names in Russia. The creators of the RuMedNLI dataset noted the necessity of manually replacing terms, measurement units, and drug names with their Russian counterparts [4]. Without such localization, automatically translated corpora appear unnatural. The occasional presence of colloquialisms and grammatical contractions in medical records [6] imposes additional requirements on the robustness of NLP models. A glossary of common Russian clinical terms and their English equivalents is provided in Appendix A, Table A1.
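To make the localization issue concrete, the following minimal sketch converts a glucose value between mg/dL and mmol/L and maps Roman-numeral blood group labels to ABO letters. It is an illustration, not the procedure used for RuMedNLI: the function names are hypothetical, glucose is chosen only as an example analyte (each analyte needs its own molar mass), and the Roman-numeral mapping I, II, III, IV to O, A, B, AB is the commonly used convention.

```python
# Molar mass of glucose in g/mol: 1 mmol/L of glucose = 180.16 mg/L = 18.016 mg/dL.
GLUCOSE_MOLAR_MASS_G_PER_MOL = 180.16

# Roman-numeral blood group labels and the ABO letters they correspond to.
ROMAN_TO_ABO = {"I": "O", "II": "A", "III": "B", "IV": "AB"}


def glucose_mg_dl_to_mmol_l(value_mg_dl: float) -> float:
    """Convert a glucose concentration from mg/dL to mmol/L."""
    return value_mg_dl * 10.0 / GLUCOSE_MOLAR_MASS_G_PER_MOL


def glucose_mmol_l_to_mg_dl(value_mmol_l: float) -> float:
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return value_mmol_l * GLUCOSE_MOLAR_MASS_G_PER_MOL / 10.0


def blood_group_to_abo(roman_label: str) -> str:
    """Map a Roman-numeral blood group label to its ABO letter."""
    return ROMAN_TO_ABO[roman_label.strip().upper()]


if __name__ == "__main__":
    print(round(glucose_mg_dl_to_mmol_l(100.0), 2))  # about 5.55 mmol/L
    print(blood_group_to_abo("II"))                   # A
```

A translated corpus that keeps the original mg/dL figures but relabels them as mmol/L would be off by more than an order of magnitude for glucose, which is why numeric values and units have to be converted together.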

3.4. Structure of Medical Documentation

Russian medical records (both outpatient and inpatient) have a well-defined structure with a series of standard sections: chief complaint, medical history, objective status, diagnosis, and treatment plan. In contrast, medical records from other countries often consist predominantly of free text, are less template-driven, are structured according to different principles, and place greater emphasis on the patient’s social history [6]. Furthermore, inpatient records in Russia typically incorporate information from outpatient charts, allowing for an assessment of the dynamics of the primary disease up to the point of hospitalization. Currently, there is a lack of open datasets containing both outpatient and inpatient medical records that would enable tracking of all stages of a patient’s interaction with the Russian healthcare system.
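The standard sections listed above suggest a simple record schema. The sketch below is a hypothetical representation (the class `ClinicalNote` and its field names are assumptions made for illustration, not a format used by any dataset in this survey) with a helper that reports which standard sections are empty.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ClinicalNote:
    """One record with the five standard sections named in the text."""
    chief_complaint: Optional[str] = None
    medical_history: Optional[str] = None
    objective_status: Optional[str] = None
    diagnosis: Optional[str] = None
    treatment_plan: Optional[str] = None

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are absent or blank."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


if __name__ == "__main__":
    note = ClinicalNote(chief_complaint="headache", diagnosis="I10")
    print(note.missing_sections())
```

A completeness check of this kind is one inexpensive way to flag template-only or truncated records before they reach expert review.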

The aforementioned specificities demonstrate that simply translating an English-language corpus is insufficient for developing robust models suited to the Russian context. There is a need for original Russian-language datasets, curated with local specifics in mind and annotated and validated by practicing medical experts.

4. Russian-Language Medical Datasets

4.1. Overview of Available Russian Medical NLP Datasets

In this review, we considered all available Russian-language datasets related to medicine (Table 1, Figure 2)—that is, any corpora from which a model can, in principle, acquire medical knowledge or reasoning abilities. The boundary between “medical” and “non-medical” data is not always clear: texts about health, pharmacology, or even social discussions of diseases may contain clinically relevant information. Therefore, we did not exclude corpora such as Fake vs. Real News (COVID-19) [17], as they also contribute to understanding medical discourse and public health communication.

This review was conducted as a descriptive survey of publicly available Russian-language datasets relevant to medical natural language processing, with a specific focus on datasets providing self-contained textual content suitable for large language models (LLMs). No explicit temporal restrictions were applied, as the Russian medical NLP ecosystem remains relatively limited and fragmented, and earlier datasets are often reused or adapted in contemporary research.

Dataset identification relied on a multi-source search strategy, including academic literature databases (e.g., Google Scholar and Semantic Scholar), dataset repositories and registries (such as GitHub, Kaggle, Hugging Face Datasets, and institutional project pages), as well as reference chaining from prior surveys and primary dataset publications. Search queries combined language-, domain-, and task-specific keywords (e.g., “Russian medical dataset”, “Russian clinical text”, and “Russian biomedical NLP”). A list of publicly accessible online resources and dataset links is provided in Appendix C (Table A5).
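Combining keyword groups in this way amounts to taking a Cartesian product. The sketch below shows one way to generate such queries; the three term lists are assumptions inferred from the example queries quoted above, since the full keyword lists are not given, and `build_queries` is a hypothetical helper.

```python
from itertools import product
from typing import Iterable, List

# Assumed keyword groups, inferred from the three example queries in the text.
LANGUAGE_TERMS = ["Russian"]
DOMAIN_TERMS = ["medical", "clinical", "biomedical"]
TASK_TERMS = ["dataset", "text", "NLP"]


def build_queries(
    languages: Iterable[str], domains: Iterable[str], tasks: Iterable[str]
) -> List[str]:
    """Join every (language, domain, task) combination into one query string."""
    return [" ".join(parts) for parts in product(languages, domains, tasks)]


queries = build_queries(LANGUAGE_TERMS, DOMAIN_TERMS, TASK_TERMS)

if __name__ == "__main__":
    for query in queries:
        print(query)
```

With these lists the product yields nine queries, including the three quoted examples; extending any one list grows the query set multiplicatively, so the groups are best kept short.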

Inclusion criteria covered publicly accessible Russian-language text corpora containing medically relevant information, regardless of whether they were originally created for clinical, biomedical, or public health-related tasks. We also considered datasets described in peer-reviewed publications even when the underlying data were not publicly released, focusing on their data provenance, annotation scheme and intended use.

Datasets lacking self-sufficient textual content were excluded, including purely tabular, audio-based, or image-centric resources, as well as multimodal datasets where text was not informative without additional modalities. Although such data can be transformed into text through additional processing, they do not constitute final LLM-ready textual corpora. Datasets for text normalization were retained, as they provide useful supervision for LLMs operating on noisy medical text.

Finally, we accounted for the reliability of data sources. Although some user-generated or media-based texts may be weakly verified, we retained datasets with explicit annotation or curation and excluded unannotated collections of social media posts, whose utility for LLM training or evaluation is unclear.
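The inclusion and exclusion rules described above can be summarized as a single predicate. The following sketch is one possible reading of those rules, not the screening procedure actually used: the `CandidateDataset` fields and the `is_included` function are hypothetical, and each boolean stands for a judgment that was made manually during the review.

```python
from dataclasses import dataclass


@dataclass
class CandidateDataset:
    """Manually assessed properties of a candidate corpus (hypothetical schema)."""
    name: str
    is_russian: bool
    medically_relevant: bool
    has_self_sufficient_text: bool          # False for purely tabular, audio or image data
    publicly_accessible: bool
    described_in_peer_reviewed_paper: bool
    is_social_media_collection: bool
    has_annotation_or_curation: bool


def is_included(d: CandidateDataset) -> bool:
    """Apply the inclusion and exclusion criteria in the order given in the text."""
    if not (d.is_russian and d.medically_relevant):
        return False
    if not d.has_self_sufficient_text:
        return False
    # Unreleased data still qualify when described in a peer-reviewed publication.
    if not (d.publicly_accessible or d.described_in_peer_reviewed_paper):
        return False
    # Unannotated social media collections are excluded.
    if d.is_social_media_collection and not d.has_annotation_or_curation:
        return False
    return True
```

Text normalization corpora pass this predicate unchanged, consistent with the decision to retain them, because they contain self-sufficient medical text.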

Our primary aim was to evaluate how well existing Russian datasets support medical knowledge acquisition and how reliable their linguistic and contextual adaptation is. For English, large and diverse resources exist in every medical NLP domain, from clinical note processing (i2b2/n2c2 [18]; MIMIC-III [19]) to reasoning (MedExQA [20]) and question answering (MedQA (USMLE) [9]; PubMedQA [21]). Translation and generation have become important but not exclusive approaches to corpus creation in Russian medical NLP.

Table 1. Overview of Russian-language medical datasets.

Name	Description	Year	Origin and Curation	Task	Size	Access	Source
RICD (Russian Intensive Care Dataset) [3]	Anonymized critical-care EHR data from the Federal Research and Clinical Center of Intensive Care and Rehabilitology, covering ICU and hospital stays. Records include detailed vital signs, lab results, diagnoses, treatments and outcomes collected for each hospitalization.	2024	Real EHR; structured tables and notes; de-identified; clinician-curated; no synthetic data	ICU prediction and clinical research. Can be used for tasks such as disease patterns from symptoms/vitals recognition; lab results interpretation; deterioration, readmission, and disease progression prediction; outcomes, adverse events, and discharge readiness prediction.	8135 patients; 10,938 hospitalizations; ~33 M records	Paid license, free demo	Official project website
SibMed Clinical Repository [4]	Repository of anonymized outpatient and inpatient records from SSMU clinics, including text, labs and imaging. Each entry contains visit-level data such as clinical notes, diagnostic tests, imaging reports, and anonymized identifiers.	2022	Real clinical data; structured and unstructured; continuously updated; anonymized; no generation	General research and algorithm development. Can be used for tasks such as disease patterns from symptoms/vitals recognition.	1,427,210 records; 275,512 cases; 67,020 discharge summaries	Controlled access (research agreement required)	Official project website
RuMedPrimeData [22]	Anonymized clinical notes with symptoms and ICD-10 codes from SSMU outpatient visits. Each record contains a patient ID, visit ID, date/time, anamnesis, symptoms, and the assigned ICD-10 diagnosis.	2021	Real notes; manual ICD-10 and symptom annotation; no generation	ICD-10 classification. Can be used for tasks such as disease patterns from symptoms/vitals recognition.	7625 total (4690 train/848 dev/822 test)	Public	Zenodo
RuMedTop3 [23]	Subset of RuMedPrime with filtered ICD-10 codes for top-3 prediction. Records mirror the structure of RuMedPrime (anamnesis, symptoms, ICD-10 labels).	2022	Derived from RuMedPrime; manual ICD labels; no generation	ICD-10 classification. Can be used for tasks such as disease patterns from symptoms/vitals recognition.	6360 total (4690 train/848 dev/822 test)	Public	GitHub
RuMedSymptomRec [23]	Task derived from RuMedPrime to recommend a relevant symptom from a clinical note. Each record includes a clinical note (anamnesis and symptoms) with one symptom withheld to serve as the label.	2022	Derived from RuMedPrime; manual symptom codes; no generation	Symptom recommendation. Can be used for tasks such as disease patterns from symptoms/vitals recognition.	3300 total (2470 train/415 dev/415 test); 141 symptom codes	Public	GitHub
RuMedDaNet [23]	Yes/no medical question-answering task with contexts from various domains. Each sample consists of a question, a supporting text passage and a binary (yes/no) answer.	2022	Real contexts; questions generated and annotated by assessors; human-verified	Binary question answering. Can be used for tasks such as medical knowledge questions answering.	2076 total (1308 train/256 dev/512 test)	Public	GitHub
RuMedNLI [23]	Natural language inference corpus translated and post-edited from MedNLI. Each example pairs a clinical premise sentence with a hypothesis and labels their relationship (entailment, neutral, contradiction).	2022	Machine translation (Google and DeepL) with manual post-editing; premise–hypothesis pairs	Entailment classification. No direct MedHELM tasks.	15,685 total (12,627 train/1422 dev/1536 test)	Public	GitHub
RuMedNER [23]	Named-entity recognition over drug-review texts. Each sample is a consumer review sentence annotated with labels for medication, ADR, disease and other relevant entities.	2022	Based on RuDReC [2]; manual annotation of medication, ADR, disease, note; no generation	Drug-review NER. No direct MedHELM tasks.	4809 total (3440 train/676 dev/693 test); 1.4 M raw texts	Public	GitHub
MedSyn-Synthetic [6]	A fully synthetic corpus of Russian-language clinical notes generated using large language models (GPT-4, LLaMA, etc.). Each record imitates a physician-written clinical note and follows a typical structure: complaints, symptoms, anamnesis, description of condition, and a diagnosis linked to an ICD-10 code. No reference summaries or real clinical texts are included.	2024	Entirely generated by LLMs using a small external anonymized set of 30 real clinical notes as templates	ICD-10 classification. Can be used for tasks such as disease patterns from symptoms/vitals recognition.	41,185 clinical notes in Russian covering 219 ICD-10 codes	Public	Hugging Face
MedSyn-IFT [6]	A dataset for instruction fine-tuning of medical LLMs in Russian. It contains instruction–input–output triples designed to train models to perform medical tasks such as generating conclusions, extracting symptoms, explaining diagnoses, and other reasoning tasks. It is not a corpus of full clinical notes.	2024	Multiple sources, including synthetic data generated from MedSyn, public Russian medical NLP datasets, encyclopedic medical texts, and automatically generated instruction tasks	Instruction fine-tuning	138,048 instruction–input–output samples	Public	Hugging Face
smakov/ru_medsum [24]	Paired Russian medical abstracts and titles for summarization. Each data point contains a Russian abstract (body) and its concise title, making it suitable for abstractive summarization.	2025	Real CyberLeninka abstracts; titles as summaries; manual selection	Summarize papers.	26,353 total (23,711 train/1321 dev/1321 test)	Public	Hugging Face
Medical forum Q&A [25]	Russian forum questions and answers with metadata. Each record contains a question and one or more user-generated answers, along with categories and timestamps.	2021	User-generated forum data; structured; no annotation	Question answering and information retrieval. Can be used for tasks such as medical knowledge questions answering.	190,335 Q&A posts	Public	Hugging Face
RuMedQ [26]	Synthetic symptom-question pairs with a correctness label. Each line contains a symptom text, a question generated from that symptom and a binary label indicating correctness.	2021	RuGPT-3 generation; cleaned and annotated by medical specialists	Question generation and natural language inference. Can be used for tasks such as follow-up questions generation.	6053 pairs	Public	GitHub
FutureBeeAI Healthcare Chat [27]	Text-based chat conversations between customers and healthcare call-center agents. Records include full transcripts of dialogues covering appointment scheduling, insurance queries, and medical consultations.	2020	Real call-center dialogues; no annotation; no generation	Dialogue modeling. Can be used for tasks such as billing/insurance explanations, triage, appointment/refill handling and response drafting.	>10,000 conversations (300–700 words; 50–150 turns)	Commercial	Company Website
RuDReC [2]	Russian Drug Reaction Corpus with raw health texts and annotated consumer reviews (expanded version of the 2017 pilot Russian Drug Review Corpus). Contains consumer reviews annotated for medication, ADR, disease and note entities.	2020	User-generated reviews; manual entity annotation (medication, ADR, disease, note); no generation	NER, no direct MedHELM tasks.	500 annotated reviews; ~1,400,000 raw texts	Public	GitHub
RuADReCT	A Russian corpus of tweets annotated for the presence of adverse drug reactions (ADRs). Consists of tweets describing health issues, labeled for whether they contain information about an adverse side effect that occurred when taking a drug.	2020	Tweets; binary ADR labeling (tweet ID + class label; script provided to collect the source text)	Binary classification of ADR presence in tweets, no direct MedHELM tasks.	9515 tweets	Public	GitHub (within RuDReC repository)
RuCCoN [28]	Clinical concept normalization dataset linking entity mentions to UMLS concepts. Each entry provides a clinical phrase alongside its mapped concept ID(s).	2022	Manual annotation by medical professionals; no generation	Entity linking/concept normalization, no direct MedHELM tasks.	16,028 mentions; 2409 concepts	Public	GitHub
Full-Size Russian Corpus of Internet Drug Reviews [29]	Extended corpus of online drug reviews with complex NER and coreference annotations (full-size version of RuDReC [2]). Each review is annotated with medication, ADR, disease, note entities, and coreference chains.	2022	Real user reviews; manual NER and coreference; no generation	NER and coreference, no direct MedHELM tasks.	33,005 medication mentions; 1778 ADR mentions; 17,403 disease mentions; 4490 note mentions; 1560 coreference chains	Website unavailable	Website unavailable
Ophthalmology Russian/English Translations [30]	Parallel corpus of Russian–English sentence pairs from ophthalmology literature, accompanied by a glossary of unique terms.	2024	Human translation of medical abstracts; high quality; includes glossary; no generation	Machine translation and terminology extraction. Can be used for tasks such as generating visual aids, translating content, making content accessible.	3473 total (3304 train/169 test); glossary with 1211 unique terms	Public	Kaggle
DataN [31]	Anonymized EHR dataset from a large private clinic network in Russia, used for automated ICD-10 prediction. Each record contains visit-level diagnostic codes, clinical notes and structured attributes.	2020	Real EHR; de-identified; no generation	ICD-10 prediction. Can be used for tasks such as disease patterns from symptoms and vitals recognition.	251,763 patients; 1,685,253 visits	Not public	Described in publication (not released)
DataM [31]	Anonymized EHR dataset from a second private clinic network, complementary to DataN in the same ICD-10 prediction study. Each record includes visit-level ICD-10 codes and associated structured clinical data.	2020	Real EHR; no generation	ICD-10 prediction. Can be used for tasks such as disease patterns from symptoms and vitals recognition.	177,715 patients; 563,106 visits	Not public	Described in publication (not released)
DataT [31]	Evaluation EHR dataset from a network of public outpatient clinics, serving as an external test set. It contains visit-level entries with ICD-10 codes and basic clinical text fields.	2020	Real EHR; no generation	ICD-10 prediction. Can be used for tasks such as disease patterns from symptoms and vitals recognition.	694,063 patients; 1,728,529 visits	Not public	Described in publication (not released)
RuBioRoBERTa [32]	Large corpus of Russian biomedical texts for pre-training RuBioRoBERTa. The corpus comprises full-text articles and abstracts from CyberLeninka and related repositories.	2022	CyberLeninka articles; unsupervised; no annotation	Language model pre-training, no direct MedHELM tasks.	338,000 articles; 1.2 B words	Public	GitHub, Hugging Face
MmedBench [33]	Multilingual (including Russian) medical multiple-choice benchmark with rationales. Each question includes a clinical stem, four answer options, and a rationale explaining the correct answer.	2024	Collected from MMedC; partly generated; rationales human-checked	Multiple-choice question answering. Can be used for tasks such as medical knowledge questions answering.	53,566 total (45,048 train/8518 test)	Public	GitHub
NEREL-BIO [34]	Nested NER dataset of biomedical PubMed abstracts in Russian and English. Each abstract is annotated with nested entities and relations.	2023	PubMed abstracts; manual nested NER; no generation	Nested named-entity recognition, no direct MedHELM tasks.	766 Russian abstracts; 66,888 entity mentions	Public	GitHub
COVID-19 Fake vs. Real News Corpus [17]	Corpus of viral COVID-19 fake stories and corresponding authentic news. Each entry consists of a fake news text or a genuine news text with metadata; the dataset was compiled to support studies on misinformation and discourse.	2023	Social media posts; manually compiled; no generation	Fake-news detection and linguistic analysis, no direct MedHELM tasks.	1722 total (897 fake + 825 authentic); ~1.7 M words	Not public	Described in publication (not released)
RuCCoD [35]	Russian diagnosis conclusion texts extracted from EHRs, manually annotated by physicians for automatic clinical coding. Diagnosis narratives contain entity spans explicitly linked to ICD-10-CM codes. The dataset is designed for multiclass ICD coding and entity-level clinical concept grounding. The authors also described an unreleased automatically annotated diagnosis prediction dataset (RuCCoD-DP).	2025	Real clinical data from a large urban clinical system; human and automatic annotation	Multiclass classification, no direct MedHELM tasks.	3500 records (3000/500 train/test); 10,326 entities (8769/1557); 1455/548 ICD-10-CM codes; automatic labels: 865,539 visits (164,527 patients)	Public	GitHub
BioNNE-L [36]	A dataset for biomedical entity normalization (entity linking) with nested entities in Russian and English. Based on the NEREL-BIO corpus and extended with annotations linking disease, chemical, and anatomical entities to UMLS concepts. Contains annotated PubMed abstracts with entity spans and corresponding UMLS CUIs.	2025	Real scientific text data (PubMed abstracts); manual expert annotation of nested entities; Russian abstracts and English translations; partial use of translated data; no synthetic generation; curated for the BioASQ BioNNE-L task.	Entity normalization/entity linking to UMLS concepts, no direct MedHELM tasks.	Training and validation: 716 Russian + 54 English abstracts; Dev: 50 Russian + 50 English; Test: 154 Russian + 154 English; normalization dictionary with >3.9 M UMLS concepts	Public	GitHub
Medical Articles [37]	A collection of Russian-language medical articles scraped from the MEDSI website. Each record contains a title and full article text, often including author information and references. Designed for downstream NLP tasks on general medical informational texts.	2024	Real web data; scraped from medsi.ru/articles; no annotation	Summarize papers.	520 articles	Public	Kaggle
RuMedSpellchecker corpus [38]	Combined experimental corpus of Russian medical anamneses used in the RuMedSpellchecker study, aggregating texts from four sources: two public datasets (RuMedPrimeData and RuMedNLI) and two closed clinical datasets. There are also test sets consisting of correct–incorrect pairs, both with and without context.	2023	Real clinical texts; anonymized medical histories; collected for internal research	Medical spell correction. Can be used for tasks such as recognizing disease patterns from symptoms/vitals.	30,737 anamnesis texts (approximately 50–200 words each), 2300 contextual pairs, 2300 pairs without context	Not public	Described in publication (not released)

4.2. Coverage of MedHELM Clinical Tasks by Existing Russian Datasets

Mapping Russian medical NLP datasets to the MedHELM task taxonomy is inherently challenging. MedHELM defines clinically meaningful tasks relevant to real medical practice, whereas most available datasets were created for specific machine learning objectives. Consequently, not every MedHELM task can be realistically addressed through dataset-driven approaches. For example, administrative and organizational tasks—such as resource scheduling or referral processing—depend primarily on healthcare infrastructure and workflows rather than on clinical text analysis.

To provide a practical overview, we evaluate which MedHELM tasks are directly supported by existing datasets. A task is considered covered only if a dataset provides sufficient information to train a model, starting from random initialization and without relying on additional external data, to perform the task in an end-to-end manner under realistic conditions (e.g., outcomes for risk prediction or laboratory results for lab interpretation). Datasets supporting only intermediate components (e.g., named entity recognition or concept normalization) are not considered as covering a clinical task. The resulting coverage is summarized in Figure 3, which shows the distribution of tasks that can be addressed using currently available resources.

The mapping procedure followed a predefined rubric to ensure transparency. For datasets without public access, the mapping was based on information reported in the original publications and official descriptions, and a conservative approach was adopted: tasks were marked as covered only when explicitly supported by the described data fields. For publicly available datasets, the mapping was additionally informed by direct inspection of the data.

The mapping was performed independently by two authors, and disagreements were resolved through discussion. In ambiguous cases, a conservative interpretation was adopted, and the task was not considered covered.
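
The reconciliation step described above can be illustrated with a minimal sketch. It assumes that each author's mapping is stored as a dictionary from task name to a covered/not-covered flag. The function name, the task names, and the provisional "covered only if both agree" default are illustrative assumptions that mirror the conservative rule; they are not the procedure actually implemented for this review, in which disagreements were resolved through discussion.

```python
from typing import Dict, List, Tuple


def reconcile_coverage(
    rater_a: Dict[str, bool],
    rater_b: Dict[str, bool],
) -> Tuple[Dict[str, bool], List[str]]:
    """Merge two independent task-coverage maps conservatively.

    Returns a provisional coverage map and the list of tasks on which
    the two raters disagree (to be settled by discussion).
    """
    merged: Dict[str, bool] = {}
    to_discuss: List[str] = []
    for task in sorted(set(rater_a) | set(rater_b)):
        # A task missing from one map is treated as "not covered" (assumption).
        a = rater_a.get(task, False)
        b = rater_b.get(task, False)
        if a != b:
            to_discuss.append(task)
        # Provisional value: covered only when both raters marked it covered.
        merged[task] = a and b
    return merged, to_discuss


# Hypothetical task names and judgments, for illustration only.
rater_a = {"diagnosis prediction": True, "guideline application": False, "summarization": True}
rater_b = {"diagnosis prediction": True, "guideline application": False, "summarization": False}
coverage, disputed = reconcile_coverage(rater_a, rater_b)
```

Under these assumptions, a disputed task stays uncovered until the discussion confirms it, so the provisional map can only understate coverage.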

MedSyn-IFT is treated separately. As an instruction-tuning corpus, it can be used to train general-purpose medical chatbots capable of addressing multiple tasks. However, such capabilities rely on generalization beyond the dataset itself. To avoid inflating task coverage, MedSyn-IFT is excluded from Figure 3, which focuses on task-specific datasets.

At the same time, this coverage analysis should not be over-interpreted. Dataset-task mapping is a coarse abstraction. Datasets differ substantially in size and in what constitutes a single “record” (e.g., a short snippet, a full note, a hospitalization episode, or a dialogue), making comparisons based on simple counts potentially misleading. Coverage also does not reflect dataset quality or domain scope; some resources address very narrow areas, such as ophthalmology translation, and therefore support a task only within a limited context. Data for summarization are also limited; for instance, the Medical Articles dataset provides only article texts and titles, which represent a constrained and indirect form of summarization. Furthermore, several widely used dataset types, including NER and concept normalization corpora, do not correspond to standalone clinical tasks in MedHELM. While essential for building practical systems, they primarily support intermediate technical capabilities rather than direct clinical decision-making.

The presented mapping demonstrates that, despite the apparent breadth of available resources, current Russian datasets provide limited support for many clinically important AI applications. While instruction-tuning resources such as MedSyn-IFT can be used to address multiple chatbot-oriented tasks, this capability depends on generalization beyond the underlying data. At the same time, there remains a clear lack of specialized datasets for critical clinical problems that require structured decision support based on real patient data. In particular, essential tasks—including generating differential diagnoses, applying clinical guidelines and best practices, matching protocols and screening contraindications, predicting treatment response, and predicting the need for procedures or referrals—remain largely unsupported by publicly available resources.

The task-level mapping alone does not explain how Russian medical datasets are created, adapted, or where they originate from. Many resources are complex hybrids that combine real clinical texts with translated, generated, or post-edited annotations, while others differ substantially in institutional provenance and practical scope. To properly interpret their strengths and limitations, it is necessary to consider both the processes of annotation and adaptation and the real-world sources from which the data are derived. Therefore, the following subsection examines data origin, annotation practices, adaptation pipelines, and institutional provenance within the Russian medical NLP ecosystem.

4.3. Dataset Analysis, Annotation, and Adaptation

A key methodological consideration is the role of annotation within the adaptation process. In medical NLP, the boundary between “data” and “labels” is porous: the context may be real (e.g., a clinical note), while the question-and-answer pair, rationale, or diagnostic option set is effectively annotation that can be translated, newly authored by experts, or generated by LLMs and then post-edited. During translation, annotation may be copied verbatim from the source or re-created to match Russian clinical terminology and coding systems; during generation, labels and questions may be produced algorithmically and only later reviewed by clinicians. Hence, our taxonomy (Figure 4) distinguishes real, translated, and generated data, each with or without expert post-editing, and it applies equally to raw text and to annotation. Hybrid pipelines are common (e.g., real Russian notes combined with LLM-generated questions, expert verification, and subsequent translation). We adopt this taxonomy to reason about reliability: beyond the intrinsic credibility of source data, adaptation itself can introduce errors (terminology drift, unit mismatches, code-system gaps) that post-editing may or may not mitigate.
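
One way to make this taxonomy machine-readable is sketched below. The class names, field names, and the example record are hypothetical and are not part of any released dataset metadata. The sketch only encodes the two ideas stated above: origin and post-editing are recorded separately for raw text and for annotation, and hybrid pipelines are those in which the two origins differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Origin(Enum):
    REAL = "real"
    TRANSLATED = "translated"
    GENERATED = "generated"


@dataclass(frozen=True)
class Component:
    origin: Origin
    post_edited: bool  # True if experts reviewed or corrected this component


@dataclass(frozen=True)
class DatasetProvenance:
    name: str
    text: Component        # the raw text (context)
    annotation: Component  # labels, questions, rationales, option sets

    def is_hybrid(self) -> bool:
        """Text and annotation come from different origins."""
        return self.text.origin is not self.annotation.origin

    def unreviewed_adaptation(self) -> List[str]:
        """Components that were translated or generated and never post-edited."""
        flagged = []
        for label, comp in (("text", self.text), ("annotation", self.annotation)):
            if comp.origin is not Origin.REAL and not comp.post_edited:
                flagged.append(label)
        return flagged


# Hypothetical example: real notes with LLM-generated, unreviewed questions.
example = DatasetProvenance(
    name="example_qa",
    text=Component(Origin.REAL, post_edited=False),
    annotation=Component(Origin.GENERATED, post_edited=False),
)
```

A record like `example` would be flagged as hybrid with an unreviewed annotation layer, which is the situation in which adaptation errors are least likely to have been caught.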

Consequently, the current Russian-language ecosystem consists of a limited yet diverse collection of resources. RICD offers the most clinically faithful ICU data, but a paywalled license and small public demo severely limit reproducible research; moreover, governance and consistent de-identification policies are critical for long-term trust. The SibMed Clinical Repository [4] is a strategically important open platform; however, access is moderated, coverage is still modest and sustained contributions from multiple providers are required to reach breadth and temporal depth. RuMedPrime [22] (outpatient notes with ICD-10 [14] and symptoms) is a rare open clinical text resource; its strength is authenticity, while limitations include single-institution scope and task setups that often rely on downstream synthetic/translated annotation.

The RuMedBench [1] family (RuMedTop-3, RuMedSymptomRec, RuMedDaNet, RuMedNLI, RuMedNER) is a hybrid benchmark: base texts are real or forum-derived, while task scaffolding (choices, questions, NLI pairs) can be translated or partly generated and post-edited. This is appropriate for benchmarking, yet it means that performance may reflect how well a model learns the scaffold rather than purely clinical inference. RuMedNLI improves linguistic fidelity via post-editing but inevitably inherits the conceptual templates of English NLI, along with code and unit mismatches that had to be corrected; residual artifacts are still possible. RuMedNER capitalizes on pharmacological mention detection, but NER-only supervision rarely suffices for decision support; it remains valuable as a lexical/terminology layer rather than a standalone clinical task driver. RuMedDaNet provides clean yes/no QA contexts but, by design, constrains the reasoning space, which is useful for comparability yet narrower than real clinical dialogue.

Among generated resources, MedSyn [6] demonstrates the power of LLM-driven simulation for data-hungry tasks: coverage of ICD-10 [14] is wide and empirical gains on downstream classifiers are reported. Still, synthetic artifacts (stylistic inconsistencies, occasional symbol noise, shallow causal chains) are a reminder that these corpora should augment rather than replace real data, with transparent provenance and explicit post-editing protocols. RuMedQ [18] is a positive example of constraint-guided generation: medical facts structure the prompt space, and expert review then enforces fidelity. Nevertheless, its supervision focuses the model on question formulation and matching, which is highly useful for triage and intake assistants but does not directly train longitudinal reasoning or treatment planning.

User-generated corpora fill an indispensable gap in modeling patient language. RuDReC [2], the Full-Size Corpus of Internet Drug Reviews [2], and the SagTeam [22] collection offer large-scale pharmacovigilance evidence for NER and sentiment, with the important caveat that patient reviews are noisy and unverifiable, can include misinformation and obscene or slang vocabulary and often lack clinical context; they are best treated as supporting layers (terminology, ADR cues) rather than grounds for clinical recommendation. The Medical forum Q&A dataset [17] (≈190 k pairs) mirrors lay discourse and is excellent for intent detection, triage prompts, and symptom phrasing, but again, low factual reliability and heterogeneous moderation preclude its use for guideline-level advice without external grounding.

Translated and bilingual resources bridge Russian and English but must be read through the lens of domain alignment. The Ophthalmology Ru/En parallel corpus is high quality yet narrow in specialty; it supports translation/terminology tasks rather than clinical reasoning. MMedBench [33] supplies multilingual MCQ QA with rationales, including Russian, but its licensing and cross-lingual harmonization make it primarily an evaluation target; training solely on its Russian portion will not cover the breadth of Russian clinical narratives. ru_medsum [16] (CyberLeninka abstracts/titles) is valuable for summarization and document understanding but not a clinical note genre; models trained on it can overfit to academic style. The RuBioRoBERTa [32] pretraining corpus (hundreds of thousands of biomedical articles) is excellent for language modeling and terminology, yet it lacks supervised task signals and, again, diverges stylistically from clinical prose. NEREL-BIO [29] advances nested NER on abstracts, but without downstream labels it remains a structural layer; RuCCoN [28] provides high-quality concept normalization to UMLS (strong for interoperability), while being insufficient alone for decision-centric tasks.

Finally, commercial or partially closed resources (e.g., FutureBeeAI Healthcare Chat [19]) provide realistic dialogues and bilingual alignment, but access constraints, unknown consent/PHI handling and opaque annotation pipelines reduce their suitability for open, reproducible research. Likewise, closed EHR aggregates (DataN/M/T) indicate promising scale for longitudinal modeling yet cannot be independently audited, limiting their scientific utility to conclusions within the custodial institution.

4.4. Institutional Provenance of Clinical Datasets

In addition to the functional categorization of Russian-language medical data sources, it is important to clarify the institutional origin and practical scope of the major datasets analyzed in this review. Most publicly available Russian medical NLP corpora are derived from specific regional healthcare institutions and therefore reflect local clinical practice. For example, the RICD (Russian Intensive Care Dataset) was developed by the Federal Research and Clinical Center of Intensive Care and Rehabilitology in Moscow, Russia, and aggregates de-identified intensive care unit records collected between 2017 and 2024. It contains structured physiological measurements, laboratory results, diagnoses, procedures, and hospitalization outcomes, and is accessible under a controlled research license via the official project website. The RuMedPrimeData, RuMedTop-3, and related corpora originate from the clinics of Siberian State Medical University (Tomsk, Russia) and represent anonymized outpatient clinical notes with manually verified ICD-10 coding. The SibMed Clinical Repository is also maintained by Siberian State Medical University and integrates continuously updated de-identified inpatient and outpatient records from affiliated clinics. Datasets such as DataN, DataM, and DataT were created from electronic health records of large private and public clinic networks in Russia and were used for automated ICD-10 prediction studies, although they are not publicly released. Community-driven resources, including RuDReC, RuADReCT, and medical forum corpora, originate from publicly available online platforms and were manually annotated by research groups. In addition to real-world clinical data, the ecosystem also includes synthetic datasets generated by large language models (e.g., MedSyn-Synthetic and MedSyn-IFT) and translation-based resources, such as corpora derived from automatic or human translation of foreign medical datasets and literature.

The diversity of institutional origins—from federal clinical centers and regional hospitals to public web platforms and artificially generated corpora—indicates that Russian medical NLP datasets represent information sources with substantially different levels of reliability, clinical validity, and contextual completeness, ranging from highly regulated real EHR data to crowdsourced or fully synthetic materials. This variability must be explicitly considered when selecting data for model training and evaluation, as the provenance of a dataset directly affects the trustworthiness and applicability of the resulting medical AI systems.

4.5. Summary of Dataset Landscape

In summary, Russian datasets span our taxonomy end-to-end and frequently blend real, translated, and generated components—not only in raw text, but in annotation itself. Reliability in this landscape is a function of adaptation transparency and post-editing rigor, not merely of putative “data source.” Real clinical corpora remain scarce and access-limited; translated resources demand careful re-anchoring to Russian practice; generated corpora offer coverage and controllability but require explicit expert verification. User-generated forums are linguistically rich yet clinically unsafe as primary supervision. In addition, the ecosystem is strongly influenced by institutional provenance: most real-world datasets originate from a small number of regional healthcare providers and therefore reflect local clinical practices, documentation styles, and patient populations. Access to these resources is often restricted by governance and privacy regulations, which further limits reproducibility and external validation. As a result, the practical reliability and clinical realism of publicly available datasets cannot be assumed from their descriptions alone. To assess how well these resources correspond to actual medical practice, the following section presents an expert-driven evaluation of selected datasets.

5. Expert Verification of Public Datasets

In selecting datasets for expert verification, we focused on publicly available corpora that contain clinically meaningful information and correspond to clinically oriented task types—that is, data where a model could, in principle, identify, confirm, or reason about a medical condition based on documented symptoms or contextual evidence. Particular attention was given to the trustworthiness of each dataset’s task formulation and its potential to reflect real clinical decision-making rather than purely linguistic patterns. Three datasets met these criteria and jointly represent the principal modes of corpus formation in Russian medical NLP: RuMedPrimeData [22], MedSyn [6], and RuMedNLI [23].

RuMedPrimeData [22] consists of real outpatient anamneses describing patient symptoms and confirmed ICD-10 [14] diagnoses, providing authentic clinical reasoning patterns derived from practice. MedSyn [6], though synthetic, mirrors the same structure—clinical anamneses, symptom descriptions, and diagnostic outcomes—allowing direct comparison between real and generated narratives. RuMedNLI [23] extends the evaluation to translated data: it contains pairs of clinical sentences labeled as entailment, contradiction, or neutrality, thus testing whether translation preserves medical logic and relational structure.

This selection encompasses three complementary data creation paradigms—real, generated, and translated—enabling a systematic assessment of how each adaptation pathway affects medical reliability and informational consistency.

For expert evaluation, two physicians with over five and fifteen years of clinical experience independently reviewed 100 randomly selected records from each dataset. Representative examples from the datasets, including both original texts and English translations, are provided in Appendix B (Table A2, Table A3 and Table A4). In RuMedPrimeData [22] and MedSyn [6], they assessed the correctness of diagnoses, plausibility of symptom–diagnosis linkage and coherence of medical narrative. In RuMedNLI [23], they examined the accuracy of translation and validity of logical relations between sentence pairs. This procedure ensured that both the linguistic integrity and the medical reasoning quality were consistently evaluated across all corpus types, based on a unified set of acceptance criteria described below.

A record was considered acceptable only if all of the following criteria were satisfied: (1) clinical plausibility of the described condition, (2) consistency between reported symptoms and the assigned diagnosis (or the implied clinical conclusion), (3) sufficient diagnostic information to support the conclusion, and (4) absence of internal logical contradictions. Minor issues such as typos or minor grammatical inaccuracies that did not interfere with the text clarity were not considered grounds for exclusion. In contrast, critical defects included artifacts indicative of generation or automated translation (Arabic fragments, nonsensical phrases, corrupted formatting, or mixed-language segments), as these compromise interpretability and clinical verification.
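
The acceptance rule above can be written down as a short sketch. The field names are hypothetical labels for criteria (1) to (4) and for the critical-defect category; the experts recorded their judgments in free-form review, not in this structure.

```python
from dataclasses import dataclass


@dataclass
class ExpertJudgment:
    clinically_plausible: bool       # criterion (1)
    symptoms_match_diagnosis: bool   # criterion (2)
    sufficient_information: bool     # criterion (3)
    no_contradictions: bool          # criterion (4)
    critical_artifact: bool = False  # e.g., mixed-language or corrupted fragments
    minor_typos: bool = False        # recorded, but never a ground for exclusion


def is_acceptable(j: ExpertJudgment) -> bool:
    """A record passes only if all four criteria hold and no critical defect is present."""
    return (
        j.clinically_plausible
        and j.symptoms_match_diagnosis
        and j.sufficient_information
        and j.no_contradictions
        and not j.critical_artifact
    )


# Illustrative judgments (not taken from the reviewed corpora).
typos_only = ExpertJudgment(True, True, True, True, minor_typos=True)
missing_evidence = ExpertJudgment(True, True, False, True)
garbled = ExpertJudgment(True, True, True, True, critical_artifact=True)
```

Because the rule is a conjunction, a single failed criterion rejects the record, which partly explains why acceptance rates can be low even when each individual defect type affects only a minority of records.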

The same criteria were applied across all datasets with additional dataset-specific considerations. For RuMedPrimeData, evaluation focused on the coherence of clinical narratives, the plausibility of the symptom–diagnosis relationship and the correctness of ICD-10 coding. For MedSyn, particular attention was paid to the quality of generation, including the presence of generative artifacts, unnatural phrasing and inconsistencies typical of synthetic data. For RuMedNLI, evaluation additionally emphasized translation fidelity (correct rendering of key medical terms, abbreviations, measurement units) as well as whether the annotated entailment, contradiction or neutral relation remained valid after translation.

The consistency of expert judgments under these criteria was assessed using inter-rater agreement, defined as the percentage of records for which both experts provided the same judgment. The initial agreement was 95% for RuMedPrimeData, 94% for MedSyn, and 89% for RuMedNLI. All cases of disagreement were subsequently reviewed jointly by the experts, and a final consensus decision was reached for each record.
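
For reference, the percentage agreement defined above can be computed as in the following sketch. Cohen's kappa is included only as a complementary, chance-corrected statistic; it is an addition for illustration and was not reported in this study. The two label vectors are synthetic and were constructed so that the raters disagree on 5 of 100 records.

```python
from typing import Sequence


def percent_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Share of records on which both raters gave the same accept/reject judgment."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement for two raters and binary labels."""
    n = len(a)
    p_o = percent_agreement(a, b)
    p_a = sum(a) / n  # acceptance rate of rater A
    p_b = sum(b) / n  # acceptance rate of rater B
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if p_e == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Kappa is undefined here; this sketch returns 1.0.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Synthetic example: rater A accepts records 0-19, rater B accepts 0-17 and 20-22.
rater_a = [i < 20 for i in range(100)]
rater_b = [i < 18 or 20 <= i <= 22 for i in range(100)]
agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
```

When most records are rejected by both experts, raw agreement is high partly by chance, so a chance-corrected statistic can be a useful companion to the percentage.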

5.1. RuMedPrimeData

To identify systematic annotation and coding errors, we conducted an expert review of the RuMedPrimeData dataset. The identified limitations and their approximate frequencies are summarized in Table 2.

Consequently, only approximately 20% of the reviewed records were deemed suitable for high-quality annotation, underscoring the significant data quality challenge in this domain. The most prevalent issue is inaccurate ICD-10 coding (51%), which experts most often attribute to premature or unjustified code selection without required clinical verification (e.g., lack of instrumental or laboratory confirmation), as well as to systematic miscoding of acute conditions as chronic and the use of overly generic “catch-all” codes. Many comments also recommend alternative or multiple ICD-10 codes (primary condition, complications, comorbidities), indicating frequent ambiguity and prioritization problems in multi-morbidity cases. Additional recurring problems include unsubstantiated diagnoses and mixed clinical pictures, further complicating reliable annotation. Experts also frequently note missing basic descriptors (age, timeline, units, examination findings), which forces speculative coding and contributes to cross-category miscoding and the need to split multi-problem records into separate cases.

5.2. MedSyn-Synthetic

To assess the impact of synthetic data generation on clinical text quality, we performed an expert analysis of the MedSyn-Synthetic dataset. The main types of errors and their estimated prevalence are presented in Table 3.

Quantitative assessment showed that fewer than 15% of randomly sampled records met the criteria for correct annotation, highlighting substantial data quality challenges. In addition to error types also observed in real-world datasets, MedSyn-Synthetic exhibits pervasive generative artifacts that directly compromise clinical interpretability. Expert comments frequently flag distorted or non-medical wording, dialog-style formatting, and incoherent or corrupted fragments with mixed-language insertions and non-Russian characters, as well as factual inconsistencies such as implausible diagnoses, mismatched sex/age references, and contradictory findings. These artifacts co-occur with clinically consequential issues, most notably terminology distortions (53%), insufficient diagnostic support (40%), and internal inconsistencies (24%). They also trigger cascades of secondary errors, such as inappropriate diagnostic plans and treatment strategies. Moreover, single synthetic records commonly contain several error categories at once, yielding a substantially higher error density per case than authentic clinical text; in this sample, synthetic generation did not improve annotation reliability.

5.3. RuMedNLI

To evaluate the reliability of translated clinical text and annotation quality, we conducted an expert review of the RuMedNLI dataset, with particular attention to translation-related issues and logical relation consistency. The identified limitations and their approximate frequencies are summarized in Table 4.

Quantitative assessment showed that approximately 55% of NLI samples remained logically consistent after expert review, indicating higher annotation reliability than in the more clinically complex tasks. These findings indicate that translation-based datasets, even when post-processed, remain prone to terminology errors and misinterpretation of clinical logic without full expert validation. The dominant limitation is incorrect annotation of logical relations (26%): experts frequently revised the original labels and pointed out clinically implausible or overly simplified reasoning. Translation-related issues, including terminology errors, also affect annotation quality, but to a lesser extent. Importantly, many disputed cases are driven not only by translation but also by systematic ambiguity between strict textual entailment and clinically plausible inference (“likely” vs. “entailed”), compounded by missing clinical context (units, baselines, timelines) and tense/temporality mismatches that alter the intended relation.

5.4. Summary of the Expert-Verified Datasets

To provide a consolidated view, we summarize the key limitations identified across the analyzed Russian medical NLP datasets, highlighting both dataset-specific and shared issues. The overall findings are presented in Table 5.

Overall, expert verification revealed low acceptance rates in all three datasets: approximately 20% for RuMedPrimeData, fewer than 15% for MedSyn-Synthetic, and approximately 55% for RuMedNLI. In each of them, the lack of full clinical context and diagnostic evidence remains a major limitation for reliable annotation and model training.
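
Because each estimate rests on 100 sampled records, the reported percentages carry sampling uncertainty. The sketch below computes a Wilson score interval as one common way to express it. This is an illustrative addition, not an analysis performed in this study. The counts 20/100 and 55/100 are assumptions obtained by reading "approximately 20%" and "approximately 55%" literally, and MedSyn-Synthetic is omitted because only an upper bound (fewer than 15%) is given.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Assumed counts (see the note above); the sample size of 100 is from the text.
rumedprime_low, rumedprime_high = wilson_interval(20, 100)
rumednli_low, rumednli_high = wilson_interval(55, 100)
```

Under these assumptions, the interval for RuMedPrimeData spans roughly 13% to 29% and the interval for RuMedNLI roughly 45% to 64%, so the ordering of the two corpora is unlikely to be a sampling artifact, while the exact percentages should be read as approximate.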

6. Clinical Task Coverage and Gaps

The MedHELM [12] taxonomy defines medical tasks from a clinical, rather than a purely technical, perspective. Its key strength lies in the fact that the tasks were formulated by physicians and correspond to real clinical activities across the full spectrum of healthcare. MedHELM [12] represents one of the most detailed taxonomies of medical NLP tasks designed for evaluating large language models. It comprises five main categories and 121 individual tasks: Clinical Decision Support (patient data analysis and diagnostic or treatment recommendations), Clinical Note Generation (automation and structuring of clinical documentation), Patient Communication and Education (simplifying information and answering patient queries), Medical Research Assistance (literature review and scientific data analysis), and Administration and Workflow (scheduling, resource management and document processing).

When comparing existing Russian datasets with the MedHELM [12] taxonomy, it becomes clear that, despite noticeable progress in corpora for foundational tasks such as symptom recognition, diagnosis prediction, or text summarization, a substantial part of clinically meaningful activities remains uncovered. Many Russian resources address surface-level annotation tasks (e.g., NER) or narrow linguistic subtasks, but do not provide the structured medical reasoning, procedural documentation, or administrative context required to train decision-support systems comparable to English-language counterparts. This situation reflects both the limited availability of open clinical records and the specific institutional environment of Russian healthcare. Moreover, expert validation indicates that even the existing datasets frequently contain significant inconsistencies and factual errors that limit their clinical reliability. The identified coverage gaps are summarized in Table 6.

In summary, while Russian-language medical NLP has achieved partial coverage of the MedHELM [12] taxonomy, especially for entity recognition, classification, and limited QA tasks, higher-order clinical, organizational, and reasoning tasks remain largely unsupported. Addressing these gaps will require not only access to real clinical data but also the development of original, legally and contextually aligned Russian benchmarks that reflect the realities of domestic healthcare practice.

7. Discussion

Despite considerable progress, the Russian-language medical NLP data ecosystem remains nascent. Much of the Russian clinical-NLP literature implicitly assumes that data quality is guaranteed as long as annotation guidelines are followed and standard NLP metrics improve. In practice, few papers perform independent clinical validation of the underlying corpora. Our survey shows that many Russian corpora are incomplete or weakly structured, that automatic labelling is error-prone, and that crucial clinical context is often missing.

By contrast, English-language medical corpora such as MIMIC-IV, i2b2 and n2c2 have been repeatedly audited for quality. Johnson and colleagues validated MIMIC-IV by comparing its coded diagnoses and procedures against detailed chart reviews, showing high concordance for major conditions [39]. In the i2b2 challenges, the release of synthetic versions alongside real notes allowed external teams to reproduce results and spot data-quality issues. Works by Weiskopf et al. and Pivovarov analyze the completeness and correctness of diagnoses, medications, and laboratory values in these corpora, and subsequent studies use them to benchmark de-identification and clinical-question-answering systems [40,41]. No comparable independent validations exist for Russian corpora; our study is therefore the first to systematically evaluate Russian datasets across multiple clinical tasks.

Several Russian dataset authors nevertheless acknowledge problems. The developers of the RuMedPrime and RuMedNLI corpora note that their records are “weakly structured” and that open datasets lack the rich procedural and diagnostic documentation found in English projects like i2b2/n2c2 [23]. In their release notes, they concede that the data come from a single institution and thus poorly represent other regions; that the tasks focus on spell-checking or natural-language inference rather than clinically meaningful activities such as diagnosis or triage; and that label assignment is heterogeneous because Russian clinical documentation practices vary between hospitals. Pogrebnoi, Funkner and Kovalchuk emphasize that “the availability of open sources of Russian medical texts and data sets is extremely limited” and that most clinically rich datasets remain institutionally restricted, severely constraining reproducibility and independent validation [38]. Papers on Russian clinical coding and concept normalization similarly note that coding is inconsistent across facilities and that annotation ambiguities reflect genuine gaps in documentation rather than annotator error.

Our findings thus echo the cautionary notes expressed by these authors. While English corpora are actively cross-validated, Russian datasets rarely undergo comparable scrutiny. We argue that rigorous quality assessment must become standard practice: corpora should preserve full clinical context and timelines, double annotation with adjudication should be mandatory, and uncertainty or inter-annotator agreement should be reported. Where raw text cannot be shared, creators should publish public metadata, synthetic proxies, and detailed error taxonomies. Only by confronting the documented deficiencies—rather than assuming that annotation guidelines alone suffice—can the Russian clinical-NLP community build trustworthy resources and close the gap with English-language benchmarks.

8. Conclusions

The Russian-language ecosystem for medical NLP is still at an early but pivotal stage. Our survey systematized all publicly known corpora and analyzed their provenance across real, translated, and generated sources, covering both raw text and annotation structures. Expert review revealed recurring reliability issues across all three modes, including missing clinical context, ICD-10 [14] coding errors, logical inconsistencies and translation artifacts, while synthetic corpora often showed generative artifacts and format defects.

Despite these challenges, the landscape has begun to stabilize: several large clinical and pharmacological corpora, together with synthetic and benchmark datasets, now provide a functional basis for model pre-training and evaluation. However, current progress mainly supports surface-level tasks such as classification and named-entity recognition, whereas higher-order activities—clinical reasoning, long-form documentation, dialogue, and workflow integration—remain underrepresented. Direct transfer of English benchmarks without localization has proved insufficient because of differences in terminology, clinical guidelines, and the CHIS framework.

Grounded in the MedHELM [12] taxonomy, our gap analysis reveals that key clinically meaningful activities are largely unsupported in Russian: guideline application and contraindication screening, routing and care-path planning, multidisciplinary decision-making, realistic procedure/report documentation, research-support tasks, and explicit chains of clinical reasoning. Closing these gaps requires a coordinated national agenda: (1) expand open, expert-validated corpora that cover clinical, communicative, regulatory and administrative layers; (2) render Russian clinical guidelines and CHIS logic machine-readable and use them to define tasks and safety metrics (e.g., guideline consistency, contraindication violations); (3) standardize de-identification with reported performance on Russian data and implement transparent access policies, catalogs and leaderboards; (4) adopt human-in-the-loop annotation with routine expert adjudication and uncertainty reporting; (5) enable federated, multi-institution collaboration with harmonized ontologies and multimodal patient episodes; and (6) institute bias, consent, and governance audits.
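One way to make a safety metric such as "contraindication violations" concrete is sketched below. This assumes that guideline rules have already been rendered machine-readable as a mapping from each drug to the conditions that contraindicate it. The function name, the data layout, and the placeholder identifiers (`drug_A`, `condition_X`, and so on) are hypothetical and carry no clinical meaning.

```python
def contraindication_violation_rate(cases, rules):
    """Share of cases where a recommended drug conflicts with a patient condition.

    cases: list of (patient_conditions, recommended_drugs) pairs of sets.
    rules: dict mapping a drug to the set of conditions that contraindicate it.
    """
    if not cases:
        raise ValueError("at least one case is required")
    violations = 0
    for conditions, drugs in cases:
        # A case counts once, however many of its drugs are contraindicated.
        if any(rules.get(drug, set()) & conditions for drug in drugs):
            violations += 1
    return violations / len(cases)


# Placeholder rule table and model outputs (hypothetical, not clinical facts).
rules = {"drug_A": {"condition_X"}, "drug_B": {"condition_Y", "condition_Z"}}
cases = [
    ({"condition_X"}, {"drug_A"}),            # violation
    ({"condition_X"}, {"drug_B"}),            # no violation
    ({"condition_Z"}, {"drug_A", "drug_B"}),  # violation via drug_B
    (set(), {"drug_A"}),                      # no violation
]
rate = contraindication_violation_rate(cases, rules)
```

Counting each case at most once keeps the rate comparable across models that recommend different numbers of drugs; a per-drug rate would be an equally defensible alternative.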

Executing this program will transform a set of isolated datasets into a coherent, reproducible and clinically grounded benchmark suite. In doing so, it will narrow the gap with English-language resources and enable the responsible deployment of large language models trained, evaluated, and validated for Russian clinical practice.

Author Contributions

Conceptualization, E.S., S.Z. and A.A.; methodology, E.K. and I.B.; validation, L.M.; investigation, A.L. and Y.N.; writing—original draft preparation, L.M. and A.L.; writing—review and editing, E.K. and S.Z.; visualization, A.L.; supervision, I.O.; project administration, S.Z.; funding acquisition, E.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant provided by the Ministry of Economic Development of the Russian Federation (agreement dated 20 June 2025, No. 139-15-2025-011, identifier 000000C313925P4G0002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created in this study. The analysis is based on previously published datasets and publicly described resources. References to all datasets are provided in the manuscript, with direct links included where publicly available, including Appendix C (Table A5).

Acknowledgments

The authors would like to thank Yury Markin and Aram Avetisyan for their valuable contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ICD	International Classification of Diseases
LLM	Large Language Model
ML	Machine Learning
NYHA	New York Heart Association
ICU	Intensive Care Unit
CHIS	Compulsory Health Insurance System
ADR	Adverse Drug Reaction
EHR	Electronic Health Record

Appendix A

Table A1 provides a glossary of terms, giving each Russian term, its English translation, and a short description.

Table A1. Glossary of terms.

Russian Term	English Translation	Description of the Term
Иcтopия бoлeзни	Medical record	Full record of patient’s illness and treatment in hospital (inpatient medical record) or outpatient clinic (outpatient medical record)
Aмбyлaтopнaя кapтa	Outpatient medical record	Longitudinal record of outpatient visits, diagnoses, and prescriptions
Bыпиcнoй эпикpиз	Discharge summary	Final hospital document summarizing diagnosis, treatment, and follow-up recommendations
Haпpaвлeниe нa гocпитaлизaцию	Hospital admission referral	Document authorizing patient admission, stating preliminary diagnosis and purpose
Пpиeмный лиcт	Admission note	Record created during patient’s admission; includes initial complaints and exam results
Пpoтoкoл oбcлeдoвaния	Examination report	Structured report with test results; often includes laboratory and imaging data
Пpoтoкoл oпepaции	Surgical report	Operative record detailing performed surgery, anesthesia, and intraoperative findings
Пpoтoкoл нapкoзa	Anesthesia report	Record of anesthesia type, drugs, and vital signs during surgery
Koнcилиyм cпeциaлиcтoв	Multidisciplinary team meeting	Collective expert decision or discussion on complex cases
Kлиничecкий cлyчaй	Clinical case	Documented description of an individual patient’s presentation and management
Aнaмнeз зaбoлeвaния	History of present illness	Narrative of disease onset, development, and previous treatment
Aнaмнeз жизни	Past medical history	Information about past diseases, surgeries, lifestyle, and habits
Жaлoбы пaциeнтa	Patient's complaints	Section listing current symptoms as reported by the patient
Oбъeктивный cтaтyc	Physical examination findings	Physical examination results and measurable observations
Диaгнoз	Diagnosis	Clinician’s formal statement of the patient’s condition
Haзнaчeния	Medical orders	Treatment plan including drugs, procedures, and follow-ups
Kлиничecкиe peкoмeндaции	Clinical guidelines	Official medical practice standards approved by health authorities
Пpoтoкoл лeчeния	Treatment protocol	Structured plan specifying therapy regimen for a given condition
Meдицинcкaя cпpaвкa	Medical certificate	Formal document confirming health status or absence of contraindications

Appendix B

The following three tables provide examples of expert validation with comments: Table A2 (RuMedPrimeData), Table A3 (MedSyn-synthetic), and Table A4 (RuMedNLI).

Table A2. RuMedPrimeData expert validation.

Symptoms/Cимптoмы	Anamnesis/Aнaмнeз	ICD-10	Comment
Blood pressure elevations up to 160 mmHg on current therapy, occurring during psychoemotional stress. Inspiratory dyspnea. Heart palpitations/arrhythmias during blood pressure elevations. The patient called an ambulance 5 times in the last month and a half. Was advised to take Nitroglycerin, but reports no effect from it. Corvalol provides relief in approximately 40 min. Leg edema since summer, associated with diuretic discontinuation. Пoдъeмы AД дo 160 мм.pт.cт. нa фoнe тepaпии нa фoнe пcиxoэмoциoнaльнoгo cтpecca oдышкa инcпиpaтopнoгo xapaктepa, пepeбoи в paбoтe cepдцa нa фoнe пoдъeмa дaвлeния и пepeбoи в paбoтe cepдцa, вызывaлa cкopyю зa пocлeдниe пoлтopa мecяцa 5 paз, cкaзaли пpинимaть нтг, oт нтг эффeкт нe oтмeчaeт, пoмoгaeт кopвaлoл чepeз 40 минyт пpимepнo oтeки нoг c лeтa в cвязи c oтмeнoй мoчeгoнныx	Long-standing hypertension with systolic blood pressure up to 210 mmHg. Paroxysmal Atrial Fibrillation. Has been taking Cordarone for 6 years. Coronary Artery Disease: Arrhythmic variant, Paroxysmal Atrial Fibrillation. Complication: Heart Failure, Functional Class II (NYHA). Underlying condition: Hypertensive disease Stage III, target BP achieved, risk 4. Carotid artery atherosclerosis 47%. Dyslipidemia. Concomitant condition: Chronic Kidney Disease (CKD) Stage 3 (eGFR 37%). CAD diagnosis was established long ago. No ultrasound results available. Длитeльнo ГБ, дo 210 CAД ФП пapoкcизмaльнaя фopмa, пpинимaeт кopдapoн 6 лeт, пocлeдний пapoкcизм ДATA ИБC: Apитмичecкий вapиaнт, пapoкcизмaльнaя фopмa ФП. Ocл,: HI ФK II (NYHA) Фoнoвoe: Гипepтoничecкaя бoлeзнь III, дocтигнyтo цeлeвoe AД, pиcк 4. Aтepocклepoз coнныx apтepий 47%. Диcлипидeмия. Coп.: XПБ 3 cт (CKФ 37%) Диaгнoз ИБC дaвнo. peзyльтaтoв УЗИ нeт нa pyкax	I24.9	The diagnosis of I24.9 is incorrect. The clinical presentation is NOT consistent with acute myocardial ischemia. The provided fragments contain insufficient data to confirm coronary artery disease: there is no typical angina symptomatology, and there are no results from imaging studies. Dyspnea can be an angina equivalent, but instrumental data is required to confirm coronary artery disease. The primary ICD diagnosis should be I13.2 (Hypertensive heart and chronic kidney disease with heart failure and stage 1 through stage 4 chronic kidney disease, or unspecified chronic kidney disease). Complications: I48.0 (Paroxysmal atrial fibrillation), I50.9 (Heart failure, unspecified—insufficient data available). Concomitant conditions: N18.3 (Chronic kidney disease, stage 3), I70.8 (Atherosclerosis of other arteries), E78.5 (Hyperlipidemia, unspecified)
Blood pressure elevations up to 160 mmHg, sometimes occurring several times a month for a two-week period. Aborts episodes with an additional dose of Lorista. During blood pressure elevations and during seasonal changes—sensation of shortness of breath. Пoдъeмы AД дo 160 мм pт cт, пoвышaeтcя инoгдa нecкoлькo paз в мecяц—двe нeдeли, кyпиpyeт дoп дoзoй лopиcты. Ha фoнe пoдъeмa дaвлeния, нa измeнeниe ceзoнa—oщyщeниe нexвaтки вoздyxa	Hypertension for 20 years. Has been taking the following medications for about 5 years: Lorista H 50 + 12.5 mg, Nebilet 1/2 tablet. Walks 5–6 km of Nordic walking; sometimes on certain days—experiences a sensation of shortness of breath which she overcomes. Most often, no dyspnea on physical exertion. Does not undergo routine monitoring of cholesterol, cardiac ultrasound or carotid artery ultrasound. Fasting blood glucose: 6.8 (as of DATE). ГБ в тeчeниe 20 лeт. Oкoлo 5 лeт пpинимaeт пpeпapaты: Лopиcтa H 50 + 12.5 MГ Heбилeт 1/2 тaблeтки Пpoxoдит 5–6 км cкaндинaвcкoй xoдьбoй, инoгдa в нeкoтopыe дни—oщyщeниe нexвaтки вoздyxa., Чaщe вceгo нa ФH oдышки нeт. Xc, УЗИ cepдцa, coнныx apтepий нe кoнтpoлиpyeт. Глюкoзa 6,8 oт ДATA	I11.9	The diagnosis of I11.9 (Hypertensive heart disease without heart failure) is incorrect, as there is no evidence of heart damage. The correct code in the absence of proven organ damage is I10 (Essential (primary) hypertension). Given the fasting glucose of 6.8, a concomitant diagnosis of R73.0 (Impaired glucose tolerance) or (R73.9—Hyperglycemia, unspecified) can be suspected; however, further investigation is required to confirm this diagnosis
Episodes of sweating and weakness for 3 weeks (following a recent acute respiratory infection). Pressing, squeezing retrosternal pain for 1 month, worsening, occurring unrelated to physical exertion. Relief in a sitting position. Never took Nitroglycerin; the pain resolves spontaneously. Пoтливocть, cлaбocть пpиcтyпaми в тeчeниe 3 нeдeль пpиcтyпaми (нaкaнyнe пepeнecлa OPЗ)—нa дaвящиe, cжимaющиe бoли зa гpyдинoй, вoзникaющиe бeз cвязи c физичecкoй нaгpyзкoй, в пoлoжeнии cидя лeгчe—1 мecяц yxyдшeниe, HTГ никoгдa нe пpинимaлa, пpoxoдит	Coronary Artery Disease for approximately 30 years. No history of Acute Coronary Syndrome. Never underwent Coronary Angiography. Previous Medications (not currently taking): Preductal, Bisoprolol 10 mg, Cardiomagnyl 75 mg, Felodip 5 mg, sometimes an additional Trigrim 5 mg, Espiro 25 mg, Fozicard 5 mg. No cardiac ultrasound results available. As of DATE—signs of pulmonary congestion, received outpatient treatment. Subsequent monitoring showed no fluid in pleural cavities. ИБC oкoлo 30 лeт, OKC нe былo, KBГ нe дeлaли никoгдa. пpeдyктaл биcoпpoлoл 10 мг кapдиoмaгнил 75 фeлoдип 5 мг, инoгдa eщe oднy тaблeткy тpигpим 5 мг эcпиpo 25 мг фoзикapд 5 мг пpинимaлa, нe пpинимaeт УЗи cepдцa нa pyкax нeт. ДATA—зacтoй пo мкк, пpoxoдилa лeчeниe aмбyлaтopнo, нa кoнтpoлe жидкocти в плeвpaльныx пoлocтяx нe былo	I20.9	The diagnosis of I20.9 (Angina pectoris, unspecified) is incorrect. This code implies stable coronary artery disease. The key characteristic of stable angina is a clear connection with physical exertion. The description, however, states: “unrelated to physical exertion.” Considering the association with a recent viral infection and the nature of the pain (worsening in a supine position—a classic sign), pericarditis can be suspected, but further investigation is required (auscultation, electrocardiography, echocardiography). I30.9 Acute pericarditis, unspecified. Differential Diagnosis: I40.9 Acute myocarditis, unspecified (can also develop after an ARI). Based on the provided history, the following diagnoses should be considered: I25.9 Chronic ischemic heart disease, unspecified; I50.9 Heart failure, unspecified
Blood pressure 200/90 mmHg at the visit. Reports daily elevations. Denies dyspnea, pressing chest pains, or edema. Дaвлeниe 200/90 мм pт cт нa пpиeмe, пoвышeния eжeднeвныe -oдышкy, дaвящиe бoли зa гpyдинoй, oтeки oтpицaeт	Hypertension for 7 years. Current medication: Concor (Bisoprolol) 5 mg, Edarbi (Azilsartan). Despite therapy, experiences systolic BP (SBP) elevations of 170 mmHg and above. Ultrasound (Echocardiography) dated DATE: Left Atrium (LA) 35 * 55 mm, Left Ventricular Hypertrophy. Mitral Valve regurgitation grade 1–2. Aortic Valve regurgitation grade 1. Total Cholesterol 8.4 mmol/L. ГБ 7 лeт пpинимaeт кoнкop 5 мг, эдapби, нa этoм фoнe пoвышeния CAД 170 и вышe УЗИ ДATA—ЛП 35*55 мм, ГЛЖ MK peгypгитaция 1–2 cтeпeни AK 1 cтeпeнь Xc 8, 4 ммoль/л	I11.9	The diagnosis is partially correct. The code I11.9 could technically be accurate because: LVH is confirmed by echocardiography—there is documented heart damage. Increased LA size (35 × 55 mm, normal up to 40 mm) is a sign of long-standing hypertension. Valvular regurgitation (a consequence of cardiac remodeling). Code I11.x is used specifically when such structural changes in the heart are documented. Nevertheless, I11.0 (Hypertensive heart disease with heart failure) might be more appropriate than I11.9 (without heart failure): the patient reports dyspnea, and LVH often leads to diastolic dysfunction; therefore, heart failure with preserved ejection fraction (HFpEF) is likely. The pressing chest pains additionally require the exclusion of coronary artery disease (CAD) (I20, I25)
Increase in body temperature to 37.9–39.6 degrees Celsius, headache, chills, runny nose, sore throat, pain when swallowing, pain in the eyes. Пoвышeниe тeмпepaтypы дo 37.9–39.6, гoлoвнaя бoль, oзнoб, нacмopк,пepшeниe в гopлe, бoли пpи глoтaнии, бoли в глaзax	Fell ill on DATE. Was taking Coldact and Paracetamol, and gargling with Furacilin. Зaбoлeл ДATA, пpинимaлa кoлдaкт и пapaцeтaмoл, пoлocкaниe фypaцилинoм	J35.9	The diagnosis of J35.9 (Chronic disease of tonsils and adenoids, unspecified) is erroneous. An acute condition has been coded as a chronic disease. The code was selected from an incorrect category. Without the results of laboratory tests, the most accurate diagnosis is J06.9 (Acute upper respiratory infection, unspecified). Considering the symptoms, one could suspect J11.1 (Influenza with respiratory manifestations, virus not identified).
Maximum blood pressure is 140 mmHg, with weekly episodes of systolic blood pressure dropping to 70 mmHg. Symptoms include nausea, dizziness, dry non-productive cough, numbness in the left arm, and stabbing pains behind the breastbone. The patient does not associate these symptoms with any specific triggers; they can occur at night. The patient uses nitro spray, which provides relief after 7–10 min. Without the nitro spray, the sensation of shortness of breath and air hunger persists for a maximum of 30 min. Against this background, there is a sensation of heart palpitations/interruptions. AД мaкcимaльнoe 140 мм pт cт, cнижeниe CAД 70 мм pт cт, eжeнeдeльнo, тoшнoтa, гoлoвoкpyжeниe, тoшнoтa, cyxoй кaшeль нeпpoдyктивный, лeвaя pyкa нeмeeт, пpoкaлывaющиe бoли зa гpyдинoй. He cвязывaeт ни c кaкими фaктopaми, мoжeт вoзникaть нoчью. Пoльзyeтcя нитpocпpeeм, пoмoгaeт чepeз 7–10 минyт. Бeз нитpocпpeя oщyщeниe oдышки, нexвaтки вoздyxa coxpaняeтcя, мaкиcмaльнoe 30 минyт. Ha этoм фoнe—oщyщeниe пepeбoeв в paбoтe cepдцa	The first episodes of feeling unwell appeared about a year ago. Over the last six months, the condition has worsened with an increase in the frequency of attacks. Oкoлo гoдa нaзaд пoявилиcь пepвыe эпизoды yxyдшeния caмoчyвcтвия	I99	The diagnosis of I99 (Other and unspecified disorders of the circulatory system) is categorically incorrect. There is a clear clinical picture indicating myocardial ischemia (positive effect from nitro spray)—this requires a specific code from the coronary artery disease group (I20-I25), not the I99 code. Probable diagnoses are I20.1 (Angina pectoris with documented spasm—variant angina Prinzmetal) or I20.8/I20.9 (Other forms of angina/Unspecified angina pectoris), since the pain occurs at rest, at night, without connection to exertion, and is relieved by nitrates. Arguments in favor of I20.1 include: pain occurring at rest, at night, unrelated to physical exertion, effectiveness of nitro spray, maximum duration of 30 min, probable radiation to the left arm, shortness of breath, and drop in systolic blood pressure to 70 mmHg, with episodes starting about a year ago
Syncope (fainting) following a sudden heavy lift. Emergency Medical Services were called. (Acute Coronary Syndrome and Acute Cerebrovascular Accident were ruled out). Low back pain. Oбмopoк нa фoнe peзкoгo пoдъeмa тяжecтeй, вызвaли cкopyю пoмoщь (OKП и OHMK иcключили). Бoли в пoяcницe	The onset was acute on DATE. The patient took Xefocam 8 mg and Mydocalm. MRI: Signs of degenerative changes in the spine, herniated disc at L5-S1. No focal brain pathology or brain vascular pathology was found. Зaбoлeл ocтpo ДATA. Пpинимaл кceфoкaм 8 мг мидoкaлм. MP-пpизнaки дeгeнepaтивныx измeнeний пoзвoнoчникa, гpыжa L5-S1, oчaгoвoй пaтoлoгии гoлoвнoгo мoзгa и пaтoлoгии cocyдoв гoлoвнoгo мoзгa нeт	M42.1	The code M42.1 (Osteochondrosis of the spine) is incorrect because it ignores the syncope—the main and potentially dangerous symptom—and is imprecise; given the presence of an L5-S1 herniation, the correct code is M51.2 (Other specified intervertebral disc displacement). More accurate diagnoses are: R55 (Syncope and collapse—loss of consciousness during heavy lifting) as the primary code, along with M51.26 (Other intervertebral disc displacement, lumbar region—L5-S1 according to MRI) and M54.5 (Low back pain). Critical issue: Syncope during physical exertion can be a sign of life-threatening cardiac conditions (e.g., ventricular tachycardia, aortic stenosis, hypertrophic cardiomyopathy) that cause sudden death. While EMS ruled out ACS/CVA, this is insufficient—a full cardiological workup is required: ECG, Holter monitoring (to detect paroxysmal arrhythmias), echocardiography (to rule out valvular heart disease, cardiomyopathy), orthostatic test. Attributing the syncope solely to the lumbar disc herniation is dangerous—a herniated disc causes pain, but not syncope
Low back pain, lower abdominal pain. Бoль в пoяcницe, бoль внизy живoтa	The pain started yesterday; the patient took No-Spa. Today the pain recurred, and she received an injection of Ketorol. Бoль co вчepaшнeгo дня, пpинимaлa нo-шпy, ceгoдня бoль вoзoбнoвилacь пpинялa инъeкцию кeтopoлa	M54.8	The diagnosis of M54.8 (Other dorsalgia—other back pain) is incorrect because it only considers the low back pain and completely ignores the lower abdominal pain. This code is for musculoskeletal pain, whereas the combination of low back + lower abdominal pain + effect from No-Spa (a spasmolytic) points to visceral pathology (renal colic, gynecological issues, intestinal issues). A probable diagnosis is N23 (Unspecified renal colic), as it is a classic combination of low back pain with radiation to the lower abdomen + effect from No-Spa (a spasmolytic used for ureteral obstruction) + acute onset. If stones are confirmed by ultrasound/CT, the codes would be N20.1 (Calculus of ureter) or N20.0 (Calculus of kidney). In women, it is critically important to rule out gynecological pathology: ectopic pregnancy (O00.9—life-threatening), dysmenorrhea (N94.6), inflammatory pelvic disease (N73.x). It is also mandatory to rule out acute surgical conditions: appendicitis (K35.x), intestinal obstruction
Complaints of pain in the right upper quadrant, constant, cutting in nature. The pain decreased after taking Ketorol and Enterosgel. No stool disturbances. Nausea, chills, weakness. Жaлoбы нa бoли в пoдpeбepьe cпpaвa, пocтoянныe peжyщиe, cнизилacь нa фoнe кeтopoлa и энтepocгeль, бeз нapyшeния cтyлa, тoшнoтa, oзнoб, cлaбocть.	The pain appeared on DATE after eating. Пoявилиcь бoли ДATA: пocлe пpиeмa пищи	K81.1	The diagnosis of K81.1 (Chronic cholecystitis) is likely incorrect, as an acute onset after food intake is described, with pronounced signs of inflammation (chills, constant cutting pain, nausea). This clinical picture is consistent with acute cholecystitis—K81.0, not the chronic form. If stones are detected on an abdominal ultrasound, the code should be changed to K80.0 (Gallstones of gallbladder with acute cholecystitis—calculous cholecystitis)
Muscle weakness in the neck, fatigue, eyes closing during conversation, worsening of speech during conversation, weakness in the upper limbs. Cлaбocть мышц шeи, yтoмляeмocть, зaкpывaютcя глaзa пpи paзгoвope, yxyдшeниe peчи пpи paзгoвope, cлaбocть в вepxниx кoнeчнocтяx	The patient has been ill since DATE. From DATE to DATE, she was receiving treatment at clinics in the Moscow Region with the diagnosis: Myasthenia gravis, generalized form with bulbar syndrome, moderate severity, stationary form, stage of decompensation while on therapy with anticholinesterase drugs. She stopped taking Kalimin (Pyridostigmine) and Prednisolone as of DATE, as she did not observe any effect from taking 12 tablets of Prednisolone and 3 tablets of Kalimin Бoлeeт c ДATA. B ДATA нaxoдилacь нa лeчeнии в клиникax MO c диaгнoзoм: Mиacтeния, гeнepaлизoвaннaя фopмa c нaличиeм бyльбapнoгo cиндpoмa, cpeдняя cтeпeнь тяжecти, cтaциoнapнaя фopмa, cтaдия дeкoмпeнcaции нa фoнe тepaпии aнтиxoлecтepaзными пpeпapaтaми. Бpocилa пpинимaть Kaлимин, Пpeднизoлoн c ДATA, т.к. нe oтмeчaлa эффeктa oт 12 тaб пpeднизoлoнa и 3 тaб Kaлиминa.	G70	G70.01 (Myasthenia gravis with acute exacerbation)
Complaints of edema in the legs and palpitations. Was consistently taking Lorista 50 mg; for high blood pressure, would take Tenoric, Lerkamen 10 mg, and Telmista 40 mg in the evening. After a hypertensive crisis, this was discontinued and the following was prescribed: Lorista H 50/12.5 in the morning, Norvasc in the evening, and Cardiomagnyl. Жaлoбы oтeки нa нoгax, чyвcтвo cepдцeбиeния. Пocтoяннo пpинимaлa лopиcтa 50 мг, пpи пoвышeнии AД тeнopик, лepкaмeн10 мг и тeлмиcтa 40 мг вeчepoм, пocлe кpизa oтмeнa и нaзнaчeнo лopиcтa H 50/12.5 yтpoм, нopвacк вeчepoм, кapдмoмaгнил	Elevated blood pressure noted since DATE. The prescribed antihypertensive therapy was effective. A hypertensive crisis occurred on DATE and was managed by emergency services with the administration of Physiotens, after which the long-term therapy was changed. Пoвышeниe AД c ДATA, нaзнaчeннaя гипoтeнзивнaя тepaпия c xopoшим эффeктoм. Гипepтoничecкий кpиз ДATA, пo CП кyпиpoвaн кpиз пpиeмoм физиoтeнзa, пocлe чeгo cмeнa тepaпии.	I11.9	The diagnosis of I11.9 (Hypertensive heart disease without heart failure) is partially incorrect, as there is no evidence of heart damage (no echocardiography data confirming left ventricular hypertrophy). The presence of edema could be interpreted as a sign of congestive heart failure or as a side effect of calcium channel blockers (Lerkamen/Norvasc). A more accurate diagnosis based on the provided information is I10 (Essential (primary) hypertension—persistently elevated blood pressure of unknown etiology)
Complaints of elevated blood pressure up to 150/100 mmHg and a feeling of shortness of breath. Жaлoбы нa пoвышeниe AД дo 150/100, чyвcтвo нexвaтки вoздyxa.	Elevated blood pressure has been present since DATE, with a maximum recorded BP of 180/105–110 mmHg. During a routine check-up, elevated blood glucose up to 12.4 mmol/L was detected; the patient was referred to an endocrinologist for further examination. The patient was seen by an endocrinologist on DATE, and a diagnosis of Type 2 Diabetes Mellitus, newly diagnosed, was established. Further investigation was recommended. Пoвышeниe AД c ДATA, мaкcимaльнoe AД 180/105–110. Пpи пpoxoждeнии пpoфocмoтpa выявлeнo пoвышeниe глюкoзы кpoви дo 12.4, нaпpaвлeн к эндoкpинoлoгy для oбcлeдoвaгния. Ocмoтpeн эндoкpинoлoгoм ДATA, выcтaвлeн диaгнoз CД 2 тип, впepвыe выявлeнный, peкoмeндoвaнo дooбcлeдoвaниe	I11.9	The diagnosis of I11.9 (Hypertensive heart disease without heart failure) is incorrect because, based on the provided information, there is no data confirming heart damage. The correct diagnoses are I10 (Essential hypertension) and E11.9 (Type 2 diabetes mellitus without complications) or E11.65 (Type 2 diabetes mellitus with hyperglycemia)
The patient currently reports no complaints. Body temperature is 36.6 °C (97.9 °F). The pregnancy is at 19–20 weeks, progressing favorably without complications. B нacтoящee вpeмя жaлoб нe пpeдъявляeт. T-36.6 C. Бepeмeннocть 19–20 нeдeль, пpoтeкaeт блaгoпpиятнo, бeз yxyдшeния	On DATE, the patient fell ill with a temperature up to 37.4 °C (99.3 °F), mushy stools, nausea, epigastric pain, repeated vomiting, general weakness, malaise, and headache. Stool occurred once, without pathological impurities. The patient had been visiting another household where two other people experienced similar dyspeptic disorders with fever. She was treated by a gastroenterologist and received the following therapy: Smecta, Gaviscon, Omez Insta, and a diet. Her condition improved. A sick leave certificate was issued from DATE to DATE. Complete Blood Count from DATE: No pathology. Urinalysis from DATE: Normal. Blood chemistry: Pending. RNGA with antigens for salmonellosis, yersiniosis, dysentery: Pending. Stool test for dysbiosis: Scheduled for DATE. ДATA зaбoлeлa, тeмпepaтypa дo 37.4 C, кaшeoбpaзный cтyл,тoшнoтa, бoли в эпигacтpии, pвoтa нeoднoкpaтнo, oбщaя cлaбocть, paзбитocть,гoлoвнaя бoль,cтyл 1 paз, бeз пaтoлoгичecкиx пpимeceй. Былa в гocтяx, в ceмьe гдe гocтилa 2 cлyчaя aнaлoгичныx диcпипcичecкиx paccтpoйcтв c пoвышeнииeм тeмпepaтypы. лeчилacь y гacтpoэнтepoлoгa,пoлyчaлa лeчeниe: cмeктy, гeвиcкoн, oмeз инcтa, диeтa. Coтoяниe yлyчшилocь. C ДATA oткpыт лиcтoк нeтpyдocпocoбнocти пo ДATA Aнaлизы: OAK oт ДATA бeз пaтoлoгии, OAM- ДATA -N, б/x кpoви в paбoтe, PHГA c aнтигeнaми caльмoнeллëзa, иepcиниoзa,дизeнтepии- в paбoтe. Kaл нa диcбaктepиoз-нaзнaчeн нa ДATA г	A04.9	The diagnosis of A04.9 (Unspecified bacterial intestinal infection) is likely incorrect. The clinical picture is more consistent with viral gastroenteritis (A08.4) or foodborne intoxication (A05.9), evidenced by: mild course, predominance of vomiting over diarrhea (stool only once!), low-grade fever of 37.4 °C, absence of pathological impurities in the stool, group character of the outbreak (2 cases in the same family), and rapid recovery

Table A3. MedSyn-synthetic expert validation.

Symptoms/Cимптoмы	Anamnesis/Aнaмнeз	ICD-10	Comment
The patient is a married, non-smoking man aged 45 years who complains of a feeling of lack of air and symptoms of heart failure: shortness of breath during physical exertion. Пaциeнт—зaмyжний нeкypящий мyжчинa 45 лeт, кoтopый жaлyeтcя нa чyвcтвo нexвaтки вoздyxa, явлeниям cepдeчнoй нeдocтaтoчнocти: oдышкa пpи физичecкoй нaгpyзкe.	During the consultation, the following symptoms were revealed: he experiences dyspnea on exertion, such as when climbing stairs or walking long distances. He reports no chest pain or skin redness. The patient reported that over the past few months he has noticed a gradual deterioration in his general condition: he fatigues more quickly, finds it harder to concentrate at work, and often feels weak. He can no longer perform physical activities like running or lifting heavy objects as he could before. The patient noted that this condition worsens when lying flat and after meals. He frequently wakes up at night feeling short of breath. His medical history is notable for the absence of chronic diseases, congenital heart defects, or genetic predispositions to cardiovascular diseases. The patient also completely denies experiencing emotional stress, insomnia, depression or anxiety. Пaциeнт, мyжчинa, 45-лeтний, зaмyжний, пpинeceн нa пpиeм c жaлoбaми нa чyвcтвo нexвaтки вoздyxa и пpoявлeниями cepдeчнoй нeдocтaтoчнocти. Пaциeнт нe являeтcя кypильщикoм и нe имeeт извecтныx aллepгичecкиx peaкций нa лeкapcтвeнныe пpeпapaты. Пpи бeceдe c пaциeнтoм выяcнили cлeдyющиe cимптoмы: oн чyвcтвyeт oдышкy пpи физичecкoй нaгpyзкe, нaпpимep, пpи пoдъeмe пo лecтницe или пpoгyлкe нa длитeльныe paccтoяния. Бoли в гpyди и пoкpacнeниe кoжи нe нaблюдaeт. Пaциeнт paccкaзaл, чтo пocлeдниe нecкoлькo мecяцeв зaмeчaeт пocтeпeннoe yxyдшeниe oбщeгo cocтoяния: oн cтaл быcтpee yтoмлятьcя, cкoнцeнтpиpoвaтьcя нa paбoтe cтaлo тpyднee, нepeдкo oщyщaeт чyвcтвo cлaбocти. Oн нe мoжeт ocyщecтвлять физичecкиe нaгpyзки тaкиe, кaк бeг или пoдъeм тяжecтeй, кaк paньшe. Пaциeнт oтмeтил, чтo этo cocтoяниe ycиливaeтcя в пoлoжeнии лeжa и кoгдa пpинимaeт пищy. Oн чacтo пpocыпaeтcя нoчью, oщyщaя зaпop из-зa oдышки. B мeдицинcкoй aнaмнeзe oтмeчaeтcя oтcyтcтвиe xpoничecкиx зaбoлeвaний, вpoждeнныx пopoкoв cepдцa или гeнeтичecкиx пpeдpacпoлoжeннocтeй к cepдeчнo-cocyдиcтым зaбoлeвaниям. Пaциeнт тaкжe пoлнocтью oтpицaeт нaличиe эмoциoнaльнoгo cтpecca, бeccoнницы, дeпpeccивныx cocтoяний или тpeвoжнocти	I11.0	(1) Diagnosis Accuracy: The diagnosis of I11.0 (Hypertensive heart disease with heart failure) is incorrect. The provided history contains no data on arterial hypertension—there is no mention of elevated blood pressure, use of antihypertensive medications, or a previously established diagnosis of hypertensive disease. Furthermore, the history explicitly states an ‘absence of chronic diseases.’ There is also a lack of instrumental examination data (ECG, echocardiography) that could confirm left ventricular hypertrophy or other signs of heart damage due to arterial hypertension. (2–3) Potential Correct Diagnoses: Based on the available clinical data, it would be more accurate to assign the diagnosis I50.9 (Heart failure, unspecified) until a full examination is conducted and the etiology is clarified. A diagnosis of I25.9 (Chronic ischemic heart disease, unspecified) is also possible. Temporarily, the symptomatic code R06.0 (Dyspnea) could be used. (4) Data Quality Note: The dataset contains obvious text generation defects: using the word ‘married’ in a female-gendered form for a male patient, the unnatural phrasing ‘brought to the appointment,’ and the unclear phrase ‘feeling constipation due to shortness of breath’ (likely a generation error, possibly intended to be ‘wakes up’). The text has characteristic signs of being generated by artificial intelligence, with typical speech patterns and excessive detail. The clinical situation of heart failure developing in a relatively young patient without prior chronic diseases necessitates a mandatory comprehensive workup to identify the cause, which could include cardiomyopathy, myocarditis, congenital heart defects, valvular lesions, or other conditions
Patient complains of sleep disturbance. Пaциeнт жaлyeтcя нa нapyшeниe cнa	Doctor: The patient presented with a complaint of sleep disturbance. Please describe the sleep problems you are experiencing. Patient: I have difficulty falling asleep. I often wake up in the middle of the night and have restless dreams. In the morning, I feel tired and irritable. Doctor: Please continue. How often do these problems occur and when did they start? Patient: I have these sleep problems almost every night. It started about two months ago. Doctor: Thank you. Do you have any idea what might be causing these sleep problems? Patient: I think work stress could be the cause of my sleep problems. I’ve had a very intense period at work over the last few months. Doctor: I see. Let’s now analyze other lifestyle factors that may be affecting your sleep. What is your usual sleep duration and how regular is your sleep schedule? Patient: I try to sleep about 7–8 h every night. But because of the sleep problems, I only manage to sleep about 4–5 h on average. Doctor: Thank you for the information. Have there been any changes in your lifestyle, such as a change in your daily routine or starting new medications, that could have affected your sleep? Patient: There haven’t been any obvious changes in my lifestyle. But I started taking medication for headaches about two months ago, around the time my sleep problems began. Doctor: Understood. Let me ask, do you have any other accompanying symptoms, such as anxiety, changes in appetite, or mood? Patient: To be honest, I sometimes have episodes of anxiety, especially at night. I’ve also noticed that my appetite has decreased lately. Doctor: Okay, that will be taken into account as well. How do you feel these sleep problems are affecting your work performance and overall life? Patient: I’ve noticed that my work performance has been declining due to fatigue. I also feel irritable and emotionally unstable lately. 
Doctor: Thank you for the detailed description. For a more precise understanding of your situation, I would recommend some additional investigations, such as a complete blood count, hormone level analysis, sleep monitoring (polysomnography), and a consultation with a psychologist. This will help us determine a more accurate cause of your sleep problems and develop the most effective treatment plan. We could also consider the possibility of prescribing anxiolytics to relieve anxiety and improve sleep, but this will be discussed after we receive the results of the additional investigations. Until these investigations are completed, I recommend that you follow a regular daily routine, improve your bedroom sleep environment, avoid consuming caffeine and other stimulants before bed, and maintain a healthy lifestyle. We will definitely meet with you again after receiving the investigation results to discuss the next steps in treating your sleep problems. Пaциeнт oбpaтилcя c жaлoбoй нa нapyшeниe cнa. Пoжaлyйcтa, oпишитe, кaкиe пpoблeмы c cнoм вы иcпытывaeтe? Пaциeнт: Я иcпытывaю тpyднocти co зacыпaниeм. Чacтo пpocыпaюcь пocpeди нoчи и cтpaдaю oт бecпoкoйныx cнoвидeний. Утpoм oщyщaю ceбя ycтaвшим и paздpaжитeльным. Пpoдoлжитe, пoжaлyйcтa, oпиcывaть, кaк чacтo вoзникaют эти пpoблeмы и кoгдa oни нaчaлиcь? Пaциeнт: У мeня эти пpoблeмы co cнoм вoзникaют пoчти кaждyю нoчь. Haчaлocь этo oкoлo 2-x мecяцeв нaзaд. Xopoшo, cпacибo. Пoдcкaжитe, y вac ecть пpeдпoлoжeниe, чтo мoжeт быть пpичинoй этиx пpoблeм co cнoм? Пaциeнт: Я дyмaю, чтo cтpecc нa paбoтe мoжeт быть пpичинoй мoиx пpoблeм co cнoм. У мeня был oчeнь интeнcивный пepиoд paбoты в пocлeдниe нecкoлькo мecяцeв. Яcнo. Teпepь пepeйдeм к aнaлизy дpyгиx фaктopoв вaшeгo oбpaзa жизни, кoтopыe мoгyт влиять нa coн. Kaкaя y вac oбычнaя длитeльнocть cнa и нacкoлькo peгyляpнo вы cпитe? Пaциeнт: Я cтapaюcь cпaть oкoлo 7–8 чacoв кaждyю нoчь. Ho из-зa пpoблeм co cнoм, y мeня пoлyчaeтcя пocпaть лишь oкoлo 4–5 чacoв в cpeднeм. 
Cпacибo зa инфopмaцию. Пoдcкaжитe, y вac были кaкиe-либo измeнeния в oбpaзe жизни, тaкиe кaк измeнeниe peжимa дня или пpиeм нoвыx пpeпapaтoв, кoтopыe мoгли пoвлиять нa вaш coн? Пaциeнт: Hикaкиx явныx измeнeний в мoeм oбpaзe жизни нe былo. Ho я нaчaл пpинимaть пpeпapaты oт гoлoвнoй бoли oкoлo 2-x мecяцeв нaзaд, кoгдa y мeня нaчaлиcь эти пpoблeмы co cнoм. Пoнятнo. Пoзвoльтe cпpocить, ecть ли y вac дpyгиe coпyтcтвyющиe cимптoмы, тaкиe кaк бecпoкoйcтвo, нapyшeниe aппeтитa или нacтpoeния? Пaциeнт: Чecтнo гoвopя, y мeня инoгдa вoзникaют эпизoды тpeвoжнocти, ocoбeннo нoчью. Taкжe я зaмeтил, чтo y мeня cнизилcя aппeтит в пocлeднee вpeмя. Xopoшo, этo тaкжe бyдeт yчтeнo. Kaк cчитaeтe, кaк эти пpoблeмы co cнoм влияют нa вaшy paбoтocпocoбнocть и oбщyю жизнь? Пaциeнт: Я зaмeчaю, чтo мoя paбoтocпocoбнocть cтaлa cнижaтьcя из-зa ycтaлocти. Я тaкжe чyвcтвyю ceбя paздpaжитeльным и эмoциoнaльнo нecтaбильным в пocлeднee вpeмя. Cпacибo зa пoдpoбнoe oпиcaниe. Для бoлee тoчнoгo пoнимaния вaшeй cитyaции, я бы peкoмeндoвaл пpoвecти нeкoтopыe дoпoлнитeльныe иccлeдoвaния, тaкиe кaк пoлнoe кpoвнoe изoбpaжeниe, aнaлиз ypoвня гopмoнoв, нaблюдeниe зa cнoм (пoлиcoмнoгpaфия) и кoнcyльтaцию c пcиxoлoгoм. Этo пoмoжeт нaм oпpeдeлить бoлee тoчнyю пpичинy вaшиx пpoблeм co cнoм и paзpaбoтaть нaибoлee эффeктивный плaн лeчeния. Mы тaкжe мoгли бы paccмoтpeть вoзмoжнocть нaзнaчeния aнкcиoлитикoв для cнятия тpeвoжнocти и yлyчшeния cнa, нo этo бyдeт oбcyждaтьcя пocлe пoлyчeния peзyльтaтoв дoпoлнитeльныx иccлeдoвaний. Пoкa эти иccлeдoвaния нe бyдyт выпoлнeны, я peкoмeндyю вaм cлeдoвaть peжимy дня, yлyчшить ycлoвия cнa в cпaльнe, избeгaть пpиeмa кoфeинa и дpyгиx вoзбyждaющиx вeщecтв пepeд cнoм, и вecти здopoвый oбpaз жизни. Mы oбязaтeльнo вcтpeтимcя c вaми cнoвa, пocлe пoлyчeния peзyльтaтoв иccлeдoвaний, чтoбы oбcyдить дaльнeйшиe шaги в лeчeнии вaшиx пpoблeм co cнoм.	N11.1	(1) Diagnosis Accuracy The diagnosis of N11.1 (Chronic obstructive pyelonephritis) is incorrect. 
This code refers to a urological pathology (chronic kidney inflammation with obstruction), while the patient presents with a clinical picture of sleep disturbance of a psychogenic nature. There are no urological complaints, urinalysis data, or kidney ultrasound findings. (2–3) Potential Correct Diagnoses The correct diagnoses would be: G47.0 (Insomnia), if emphasizing the neurological aspect; F51.0 (Non-organic insomnia), if emphasizing the psychiatric aspect. Additionally, considering the reported anxiety episodes, F41.9 (Anxiety disorder, unspecified) could be applicable. Given the clear link to work stress and onset two months ago, F43.2 (Adjustment disorders) is also a strong possibility. The decreased appetite could be coded as R63.0 (Anorexia). It is critically important to note that the patient started taking headache medication concurrently with the onset of insomnia, which requires analysis for potential medication side effects. (4) Other Errors and Remarks The dataset contains a clearly artificially generated dialogue with unnatural phrasing. Standard medical documentation typically does not include direct speech in this format. A critical detail: the patient started taking headache medication about two months ago, around the time the insomnia began. Some analgesics (especially those containing caffeine or other stimulants) can cause insomnia. There is a lack of data regarding the nature of the headaches or the specific medication used.
The patient is a married non-smoking woman who complains of a dry paroxysmal (“barking”) cough, hoarseness of voice up to aphonia. Пaциeнт—зaмyжняя нeкypящaя жeнщинa, кoтopaя жaлyeтcя нa cyxoй пpиcтyпooбpaзный («лaющий») кaшeль, ocиплocть гoлoca вплoть дo aфoнии	The patient is a 34-year-old married woman, a non-smoker. She initially presented with complaints of a dry, paroxysmal, “barking” cough that occurs periodically and often bothers her at night. The cough is accompanied by a sensation of itching and irritation in the throat. According to the patient, the cough is difficult to control and leads to hoarseness, which sometimes progresses to complete aphonia, as well as discomfort and fatigue in the vocal apparatus. The patient notes that the symptoms of cough and hoarseness began about 3 weeks ago. Their onset was suspected to be due to an acute respiratory viral infection (ARVI) after contact with a colleague who had similar symptoms. However, despite therapeutic measures (increased fluid intake, rest, regular gargling), the patient’s condition did not improve. The patient’s history indicates that she works in an office and actively uses her voice for work purposes. One month prior to the onset of symptoms, the patient participated in intensive voice training/coaching. The patient also reports experiencing frequent stressful situations at work and in her family life recently. From the general medical history, it is known that the patient has never suffered from similar problems before. There is no history of chronic upper respiratory tract diseases, allergic reactions, or other health exacerbations. She denies past hospitalizations, surgeries, or long-term medication use. In her family history, the patient reports that her mother and sister have allergic reactions to dust, plant pollen, and animal dander. Пaциeнткoй являeтcя 34-лeтняя зaмyжняя жeнщинa, нe кypящaя. 
Изнaчaльнo oбpaтилacь c жaлoбaми нa cyxoй пpиcтyпooбpaзный, «лaющий» кaшeль, кoтopый вoзникaeт пepиoдичecки и чacтo бecпoкoит нoчью. Kaшeль coпpoвoждaeтcя oщyщeниeм зyдa и paздpaжeния в гopлe. Пo cлoвaм пaциeнтки, кaшeль cлoжнo кoнтpoлиpoвaть и пpивoдит к ocиплocти гoлoca, кoтopaя инoгдa дoxoдит дo пoлнoй aфoнии, a тaкжe к диcкoмфopтy и yтoмлeннocти в гoлocoвoм aппapaтe. Пaциeнткa oтмeчaeт, чтo cимптoмы кaшля и ocиплocти нaчaли пpoявлятьcя oкoлo 3 нeдeль нaзaд. Инициaтopoм иx пoявлeния былo пoдoзpeниe нa OPBИ, пocлe кoнтaктa c кoллeгoй, y кoтopoй нaблюдaлиcь aнaлoгичныe cимптoмы. Oднaкo, нecмoтpя нa пpимeнeниe лeчeбныx мepoпpиятий (oбильнoe питьe, oтдыx, peгyляpныe пoлocкaния гopлa), cocтoяниe пaциeнтки нe yлyчшилocь. B aнaмнeзe пaциeнтa yкaзывaeтcя, чтo oнa paбoтaeт в oфиce и aктивнo иcпoльзyeт гoлoc в paбoчиx цeляx. Mecяц дo пoявлeния cимптoмoв пaциeнткa yчacтвoвaлa в интeнcивныx тpeнингax пo гoлocoвoмy apтиcтизмy. Taкжe пaциeнткa oтмeчaeт, чтo в пocлeднee вpeмя чacтo иcпытывaлa cтpeccoвыe cитyaции нa paбoтe и в ceмeйнoй жизни. Из oбщeй иcтopии бoлeзни извecтнo, чтo пaциeнткa никoгдa paнee нe cтpaдaлa aнaлoгичными пpoблeмaми. B aнaмнeзe нeт дaнныx o xpoничecкиx зaбoлeвaнияx вepxниx дыxaтeльныx пyтeй, aллepгичecкиx peaкцияx или дpyгиx oбocтpeнияx здopoвья. Oтpицaeт пpoшлыe гocпитaлизaции, oпepaции и пpиeм пpoлoнгиpoвaнныx лeкapcтвeнныx пpeпapaтoв. B ceмeйнoм aнaмнeзe пaциeнткa cooбщaeт o нaличии aллepгичecкиx peaкций нa пыль, цвeтeниe pacтeний и шepcть живoтныx y мaтepи и cecтpы.	J06.0	(1) Diagnosis Accuracy The diagnosis of J06.0 (Acute laryngopharyngitis) is partially correct but requires clarification. The clinical presentation (“barking” cough, hoarseness to the point of aphonia, throat itching) is indeed characteristic of laryngitis. However, the symptom duration of 3 weeks calls the “acute” definition into question—acute respiratory infections usually resolve within 7–14 days. 
A critically important aspect of the history is not considered: the patient participated in intensive voice training one month before the illness and actively uses her voice at work. Vocal strain can cause organic lesions of the vocal cords (nodules, polyps, chronic laryngitis), which would explain the lack of effect from standard ARVI treatment and the duration of symptoms. There are no laryngoscopy data to assess the condition of the vocal cords. (2–3) Potential Correct Diagnoses J06.0 (Acute laryngopharyngitis)/J04.0 (Acute laryngitis), if the process is indeed acute and primarily affects the larynx. J37.0 (Chronic laryngitis)/J37 (Chronic laryngitis and laryngotracheitis), given the duration of 3 weeks without improvement. Considering the professional vocal strain and training, there is a high probability of J38.2 (Nodules of vocal cords) or J38.3 (Other diseases of vocal cords). R49.0 (Dysphonia) could be used as a symptomatic code for functional voice disorders. Stressful situations at work and home may indicate functional dysphonia of psychogenic origin, F44.4 (Dissociative motor disorders). (4) Other Errors and Remarks This record is better composed than the previous ones and contains no dialogue. Critical diagnostic error: laryngoscopy was not performed despite aphonia and a symptom duration of 3 weeks. Direct laryngoscopy is mandatory to rule out nodules, polyps, vocal cord edema, and tumors. The link between symptoms and professional voice load/vocal training is not considered in the diagnosis, even though it is a key factor. The lack of effect from standard ARVI treatment over 3 weeks should have prompted a more in-depth examination and revision of the diagnosis. The family history of allergies requires an allergological workup. The phrasing “тpeнинги пo гoлocoвoмy apтиcтизмy” (“voice artistry training”) appears artificial; “vocal training” or “stage speech courses” would be more appropriate.
The patient complains of fainting (syncope), severe headaches, and periodic numbness in isolated areas of the body. Жaлyeтcя нa oбмopoки, cильныe гoлoвныe бoли, пepиoдичecкoe oнeмeниe oтдeльныx yчacткoв тeлa	A 37-year-old male patient presented to the clinic with complaints of fainting spells, severe headaches, and periodic numbness in isolated areas of the body. According to his history, the symptoms have been present for about three months and have recently been intensifying. Initially, the numbness appeared in the limbs, especially in the arms, but now it also involves the legs. The patient reported leading an active lifestyle, regularly engaging in physical exercise, and maintaining a healthy diet. He also specified that there is no family history of similar diseases. The examination revealed normal blood pressure. However, the neurological examination showed some abnormalities. During neurological testing, the patient demonstrated weak grip strength in his hands, indicating reduced muscle tone. Slight ataxia and incoordination of movements were also detected during coordination and balance testing. Based on diagnostic procedures, including MRI of the brain and spinal cord, as well as blood and urine tests, the specialist concluded a possible diagnosis of a demyelinating disease. An additional neurological consultation was scheduled for final confirmation of the diagnosis and development of a treatment plan. In general, these symptoms and examination results require more in-depth analysis and consultation with specialists. Further treatment will be aimed at symptom reduction and slowing disease progression. Пaциeнт, 37-лeтний мyжчинa, oбpaтилcя в клиникy c жaлoбaми нa oбмopoки, cильныe гoлoвныe бoли и пepиoдичecкoe oнeмeниe oтдeльныx yчacткoв тeлa. Пo aнaмнeзy зaбoлeвaния oн cooбщaeт, чтo cимптoмы вoзникaют yжe oкoлo тpex мecяцeв и в пocлeднee вpeмя cтaли ycиливaтьcя. 
Изнaчaльнo oнeмeниe пoявлялocь в кoнeчнocтяx, ocoбeннo в oблacти pyк, нo ceйчac oxвaтывaeт и нoги. Пaциeнт paccкaзaл, чтo oн вeдeт aктивный oбpaз жизни, peгyляpнo зaнимaeтcя физичecкими yпpaжнeниями и coблюдaeт здopoвый peжим питaния. Taкжe oн yтoчнил, чтo в eгo ceмeйнoй иcтopии нeт cлyчaeв пoдoбныx зaбoлeвaний. Ocмoтp пoкaзaл нopмaльнoe apтepиaльнoe дaвлeниe, нo пpи oбcлeдoвaнии нepвнoй cиcтeмы выявлeны нeкoтopыe измeнeния. Пpи пpoвeдeнии нeвpoлoгичecкoгo тecтиpoвaния пaциeнт пoкaзaл cлaбyю cилy cжaтия в киcтяx pyк, чтo yкaзывaeт нa cнижeниe мышeчнoгo тoнyca. Былa тaкжe oбнapyжeнa нeбoльшaя aтaкcия и диcкoopдинaция движeний пpи пpoвepкe кoopдинaции и paвнoвecия. Ha ocнoвaнии пpoвeдeннoй диaгнocтичecкoй пpoцeдypы, включaющeй MPT гoлoвнoгo и cпиннoгo мoзгa, a тaкжe aнaлизы кpoви и мoчи, cпeциaлиcт вынec зaключeниe o вoзмoжнoм диaгнoзe дeмиeлинизиpyющeгo зaбoлeвaния. Был нaзнaчeн дoпoлнитeльный нeйpoлoгичecкий кoнcилиyм для oкoнчaтeльнoгo пoдтвepждeния диaгнoзa и paзpaбoтки плaнa лeчeния. B цeлoм, дaнныe cимптoмы и peзyльтaты oбcлeдoвaния тpeбyют бoлee глyбoкoгo aнaлизa и кoнcyльтaции co cпeциaлиcтaми. Дaльнeйшee лeчeниe бyдeт нaпpaвлeнo нa cнижeниe cимптoмoв и зaмeдлeниe пpoгpeccиpoвaния зaбoлeвaния	I67.81	(1) Diagnosis Accuracy The diagnosis of I67.81 (Acute cerebrovascular insufficiency) is incorrect and contradicts the described clinical picture. This code refers to acute vascular disorders of cerebral circulation, while the symptoms have been present for 3 months with progressive worsening—this is not an acute condition. A critical contradiction: the text explicitly states that the specialist “concluded a possible diagnosis of a demyelinating disease,” yet a code for cerebrovascular disease was assigned. The clinical presentation (numbness in the limbs spreading from arms to legs, ataxia, incoordination, decreased muscle strength, headaches) is typical for multiple sclerosis in a 37-year-old male; however, there is insufficient data for a definitive diagnosis. 
There are no risk factors for vascular pathology: blood pressure is normal, he leads an active lifestyle, and he is young. An MRI of the brain and spinal cord was performed, which supports the suspicion of demyelination, not a vascular lesion. (2–3) Potential Correct Diagnoses The correct diagnosis, considering the current data, is G37.9 (Demyelinating disease of the central nervous system, unspecified). (4) Other Errors and Remarks The phrasing “нeйpoлoгичecкий кoнcилиyм” is incorrect; the correct term is “нeвpoлoгичecкий кoнcилиyм” (neurological consultation).
The patient complains of heart murmurs and angina. Жaлyeтcя нa cepдeчныe шyмы, cтeнoкapдия.	Name: Ivan Ivanov Age: 50 years Sex: Male Occupation: Engineer Marital Status: Married, two children History of Present Illness: The patient presents regarding the presence of heart murmurs and angina attacks. Heart Murmurs: -The patient has noticed the presence of heart murmurs for about 2 years. -The murmur is audible in the area of the cardiac apex and disappears after physical activity. -The patient describes the murmur as a humming sound, not accompanied by painful or unusual sensations. Angina: -The patient has been experiencing episodes of chest pain for about 6 months. -The pain occurs during physical exertion or emotional stress, lasts about 5–10 min, and is relieved by taking sublingual nitroglycerin. -Over the past month, the episodes have become more frequent—occurring about 3 times per week. -The patient emphasizes that fatigue and shortness of breath do not accompany the pain. -The patient’s family history includes cases of heart disease—his father died of myocardial infarction at the age of 70. Имя: Ивaн Ивaнoв Boзpacт: 50 лeт Пoл: мyжчинa Пpoфeccия: инжeнep Ceмeйнoe пoлoжeниe: жeнaт, двoe дeтeй Aнaмнeз: пaциeнт oбpaщaeтcя пo пoвoдy нaличия cepдeчныx шyмoв и пpиcтyпoв cтeнoкapдии. Cepдeчныe шyмы: -Пaциeнт зaмeчaeт нaличиe cepдeчныx шyмoв yжe oкoлo 2-x лeт. -Шyм cлышeн в oблacти вepxyшки cepдцa и иcчeзaeт пocлe физичecкoй aктивнocти. -Пaциeнт oпиcывaeт шyм кaк гyл, нe coпpoвoждaeтcя бoлeзнeнными или нeoбычными oщyщeниями. Cтeнoкapдия: -Пaциeнт иcпытывaeт пpиcтyпы бoль в гpyди yжe oкoлo 6 мecяцeв. -Бoль вoзникaeт пpи физичecкoй нaгpyзкe или эмoциoнaльнoм cтpecce, длитcя oкoлo 5–10 минyт и cнимaeтcя пpи пpиeмe нитpoглицepинa пoд язык. -B пocлeдний мecяц пpиcтyпы cтaли пpoиcxoдить чaщe—oкoлo 3-x paз в нeдeлю. -Пaциeнт пoдчepкивaeт, чтo ycтaлocть и зaдыxaниe нe coпpoвoждaют бoль. 
-B ceмeйнoй иcтopии пaциeнтa oтмeчaютcя cлyчaи cepдeчныx зaбoлeвaний—oтeц yмep oт инфapктa миoкapдa в 70-лeтнeм вoзpacтe. -Oни paздeлeны пapaмeтpaми: -Tип дыxaтeльнoй пaтoлoгии; -Зaтpyднeния oбщeгo xapaктepa; -Пpoдoлжитeльнocть дaвнocти и плaвнocть. -Oбщee cocтoяниe xapaктepизyeтcя: -Чyмoвoй интoкcикaциeй; -Oбычныe инфeкциoнныe cимптoмы; -Oтдaлeнныe cимптoмы cиcтeмaтичecкoгo пopaжeния; -Kpoвoизлияниe, oтeчнocть и бoлeзнeннocть. -Экcтpaкapдиaльнoe зaключeниe: -Иcтopия мнoгoчacoвыx физиoлoгичecкиx cимптoмoв; -Пoдтвepждeниe peзyльтaтa; -Coбcтвeнныe oщyщeния пaциeнтa; -To, чтo бyдyт гoвopить. -Лaбopaтopныe дaнныe: -Aнaмнeз; -Пpимeнeннoe лeчeниe; -Peзyльтaт лeчeния. -ЭxoKГ, УЗИ, ЭKГ, a тaкжe гeнeтичecкoe иccлeдoвaниe пpeдкpacны: -Hapyшeниe тoкoв кpoви; -Haличиe зaбoлeвaния; -Зaдaнныe coчeтaния	Q24.5	(1) Diagnosis Accuracy The diagnosis of Q24.5 (Malformation of coronary vessels) is incorrect. This is a congenital heart defect that manifests from birth or early childhood, requires cardiothoracic surgical correction, and is incompatible with a normal lifespan of 50 years without treatment. The heart murmurs appeared only 2 years ago, not at birth. Meanwhile, the clinical picture is that of typical stable angina pectoris—pain during physical exertion and stress, lasting 5–10 min, relieved by nitroglycerin, progression to 3 times per week in the last month, and a relevant family history (father died of a heart attack at 70). This is acquired coronary artery disease, not a congenital defect. (2–3) Potential Correct Diagnoses Possible diagnoses are: I25.9 (Chronic ischemic heart disease, unspecified) or I20.8 (Other forms of angina pectoris). Considering the increased frequency of attacks in the last month (up to 3 times per week), I20.0 (Unstable angina) is possible. The family history can be coded as Z82.4 (Family history of ischemic heart disease and other diseases of the circulatory system). 
(4) Other Errors and Remarks The dataset text has severe generation defects: after a correct description of the clinical picture, nonsensical fragments appear (“Чyмoвoй интoкcикaциeй”/“Plague intoxication”, “To, чтo бyдyт гoвopить”/“What will be said”, sections with markers but no content). These are clear artifacts of a generation failure. The phrase “ЭxoKГ, УЗИ, ЭKГ, a тaкжe гeнeтичecкoe иccлeдoвaниe пpeдкpacны” is meaningless. There is a lack of real examination data necessary for diagnosis: ECG results (signs of ischemia), echocardiography results (myocardial contractility, ejection fraction, valve status), stress test results, and coronary angiography. The description of the murmur as one that “disappears after physical activity” is atypical for organic pathology and may indicate a functional character.
The patient complains of nausea and diarrhea. Пaциeнт жaлyeтcя нa тoшнoтa, пoнoc	A 35-year-old patient presented to a medical institution with complaints of nausea and diarrhea. The symptoms began approximately 2 days ago. The patient notes that the sensation of nausea occurs after eating, especially after consuming fatty and heavy foods. The frequency of nausea is 3–4 times per day. The diarrhea started immediately after the onset of nausea and presents as loose stools without blood or mucus. The patient reports a frequency of about 4–5 times per day, with instances of waking up at night to defecate. The patient was able to trace the onset of these symptoms to consuming food of uncertain quality at a restaurant near his locality. The patient reports no changes in appetite or weight loss. He denies vomiting, abdominal pain, fever, or chest pain. The patient also denies seizures or muscle weakness. He has no urinary problems and has not noticed changes in stool or urine color. His last visit to a doctor was about 6 months ago for an acute cough, which resolved spontaneously. The patient is not taking any medications, including recently purchased over-the-counter drugs. The patient denies any food or drug allergies. His history contains no record of significant events or medical data related to the etiology of these complaints. He has no family history of gastrointestinal tract diseases or other bowel disorders. The patient describes his general health as good, without significant chronic or recurring illnesses. In combination with the patient’s history and clinical symptoms, he may be developing acute gastroenteritis related to the food consumed at the restaurant. To confirm the diagnosis and determine the cause of the symptoms, additional investigations such as a complete blood count, stool analysis and stool culture for pathogens may be recommended. Пaциeнт, вoзpacт 35 лeт, oбpaтилcя в мeдицинcкoe yчpeждeниe c жaлoбaми нa тoшнoтy и пoнoc. 
Пpoцecc жaлoб нaчaлcя oкoлo 2 днeй нaзaд. Пaциeнт oтмeчaeт, чтo oщyщeния тoшнoты вoзникaют пocлe пpиeмa пищи, ocoбeннo пocлe yпoтpeблeния жиpныx и тяжeлыx пpoдyктoв. Чacтoтa тoшнoты cocтaвляeт 3–4 paзa в дeнь. Пoнoc нaчaлcя cpaзy пocлe пoявлeния тoшнoты и пpeдcтaвляeт coбoй жидкий cтyл, бeз пpимeceй кpoви или cлизи. Пaциeнт oтмeчaeт чacтoтy oкoлo 4–5 paз в дeнь, c вoзмoжнocтью пpoбyждeния нoчью для oпopoжнeния кишeчникa. Пpи этoм пaциeнтy yдaлocь пpocлeдить, чтo тaкиe cимптoмы пoявляютcя пocлe пpиeмa нeкoнтpoлиpyeмoй пищи, кoтopyю oн yпoтpeблял в pecтopaнe вoзлe нaceлeннoгo пyнктa. Пaциeнт нe нaблюдaeт измeнeний в cвoeм aппeтитe и пoтepe вeca. Oн oтpицaeт нaличиe pвoты, бoлeй в живoтe, лиxopaдки, и бoлeй в гpyди. Пaциeнт тaкжe oтpицaeт нaличиe cyдopoг или мышeчнoй cлaбocти. У нeгo oтcyтcтвyют пpoблeмы c мoчeиcпycкaниeм, нe былo зaмeчeнo измeнeния цвeтa cтyлa и мoчи. B пocлeдний paз oн пoceщaл вpaчa oкoлo 6 мecяцeв нaзaд в cвязи c ocтpoми кaшлeм, кoтopый пpoшeл caмoпpoизвoльнo. Пaциeнт нe пpинимaeт никaкиx лeкapcтвeнныx пpeпapaтoв, включaя нeдaвнo пpиoбpeтeнныe бeз peцeптa лeкapcтвa. Пpи aнaмнeзe, пaциeнт oтpицaeт нaличиe aллepгий нa пищeвыe пpoдyкты или лeкapcтвa. Иcтopия пaциeнтa нe coдepжит зaключeний o чpeзвычaйныx coбытияx или мeдицинcкиx дaнныx, cвязaнныx c этиoлoгиeй жaлoб. Oн нe имeeт ceмeйнoгo aнaмнeзa пo бoлeзням жeлyдoчнo-кишeчнoгo тpaктa или дpyгим зaбoлeвaниям кишeчникa. Пaциeнт oпиcывaeт cвoë здopoвьe кaк oбщee xopoшee, бeз cyщecтвeнныx xpoничecкиx или peгyляpныx зaбoлeвaний. B coчeтaнии c иcтopиeй пaциeнтa и клиничecкими cимптoмaми, y нeгo мoжeт paзвивaтьcя ocтpый гacтpoэнтepит, cвязaнный c yпoтpeблeниeм пищи в pecтopaнe. Для пoдтвepждeния диaгнoзa и oпpeдeлeния пpичины cимптoмoв, тaкиe дoпoлнитeльныe иccлeдoвaния, кaк oбщий aнaлиз кpoви и кaлa, кyльтypa кaлa нa вoзбyдитeлeй инфeкции, мoгyт быть peкoмeндoвaны. Дaннoe oпиcaниe aнaмнeзa пpeднaзнaчeнo тoлькo для oбpaзoвaтeльныx цeлeй и нe являeтcя кoнкpeтным мeдицинcким coвeтoм. 
Пaциeнтaм peкoмeндyeтcя oбpaтитьcя к квaлифициpoвaннoмy вpaчy для пoлyчeния пpoфeccиoнaльнoй кoнcyльтaции и диaгнocтики	C20	(1) Diagnosis Accuracy The diagnosis of C20 (Malignant neoplasm of rectum) is incorrect and unrelated to the clinical picture. This is an oncological disease that cannot manifest with an acute onset after eating at a restaurant. Critical contradictions: symptoms last only 2 days, there is a clear link to food consumed at a restaurant, absence of blood in the stool, and no weight loss. Rectal cancer presents with chronic symptoms (blood in stool, change in stool shape, weight loss, constipation, pain) that develop over months/years. The text itself states “he may be developing acute gastroenteritis related to the food consumed at the restaurant.” (2–3) Potential Correct Diagnoses A09 (Diarrhea and gastroenteritis of presumed infectious origin)—the most accurate given the link to restaurant food. A05.9 (Bacterial foodborne intoxication, unspecified) if food poisoning is suspected. Alternatively, K52.9 (Noninfective gastroenteritis and colitis, unspecified) could be used pending stool culture results. The symptomatic codes R11 (Nausea and vomiting) and R19.7 (Diarrhea, unspecified) are applicable temporarily. After receiving stool culture results, the code could be replaced with a more specific one from the A00-A09 block, depending on the identified pathogen. (4) Other Errors and Remarks The text contains an obvious artificial insertion: “This history description is intended for educational purposes only and is not specific medical advice”—medical documentation should not contain such disclaimers. There is a typo: “ocтpoми кaшлeм” instead of “ocтpым кaшлeм” (“an acute cough”).
The patient complains of a sensation of palpitations with pulsation in the head, nausea, and symptoms of heart failure: shortness of breath on physical exertion. Жaлyeтcя нa oщyщeниe cepдцeбиeния пyльcaциeй в гoлoвe, тoшнoтa, явлeниям cepдeчнoй нeдocтaтoчнocти: oдышкa пpи физичecкoй нaгpyзкe	A 45-year-old male patient presented with complaints of a sensation of palpitations that pulsates in his head, as well as problems related to the gastrointestinal tract. The patient reports symptoms of heart failure, such as shortness of breath on physical exertion. During the conversation, the patient reported that in recent months, the symptomatic picture has intensified; in particular, edema of the lower extremities and certain areas of the torso, such as the neck and face, has appeared. Furthermore, the patient often experiences fatigue, gets tired quickly, and spends most days in a sedentary state, avoiding physical activity. The patient reported a past medical history of arterial hypertension and diabetes. He takes medication to control blood pressure and maintain normal blood sugar levels. The patient was asked about his family history: his father and grandfather also suffered from cardiac problems, including myocardial infarction and heart failure. Based on the provided data, the patient was referred for further investigation, including ECG, cardiac ultrasound (echocardiography), and laboratory tests to assess cardiac and vascular function. The preliminary diagnosis includes heart failure, arterial hypertension, and diabetes. Additional investigation is necessary to determine the precise cause of the symptoms, as well as to develop an individualized treatment and management plan for the patient’s condition. Пaциeнт, мyжчинa, 45 лeт, oбpaтилcя c жaлoбaми нa oщyщeниe cepдцeбиeния, кoтopoe пyльcиpyeт в гoлoвe, a тaкжe c пpoблeмaми, cвязaнными c жeлyдoчнo-кишeчным тpaктoм. 
Пaциeнт oтмeчaeт нaличиe cимптoмoв cepдeчнoй нeдocтaтoчнocти, тaкиx кaк oдышкa пpи физичecкoй нaгpyзкe. B пpoцecce бeceды пaциeнт cooбщил, чтo в пocлeдниe мecяцы ycилилacь cилyэтнaя cимптoмaтикa, в чacтнocти, пoявилиcь oтeки нижниx кoнeчнocтeй и oтдeльныx yчacткoв тyлoвищa, тaкиx кaк шeя и лицo. Kpoмe тoгo, пaциeнт чacтo иcпытывaeт yтoмляeмocть, быcтpyю yтoмляeмocть, oн пpoвoдит в бoльшинcтвe днeй в cпoкoйнoм cocтoянии, избeгaя физичecкoй aктивнocти. Пaциeнт cooбщил, чтo в пpoшлoм cтpaдaл apтepиaльнoй гипepтoниeй и диaбeтoм. Oн пpинимaeт лeкapcтвeнныe пpeпapaты для кoнтpoля дaвлeния и пoддepжaния нopмaльнoгo ypoвня caxapa в кpoви. Пaциeнтa cпpaшивaли o ceмeйнoй aнaмнeзe: oтeц и дeд тaкжe cтpaдaли cepдeчными пpoблeмaми, включaя инфapкт миoкapдa и cepдeчнyю нeдocтaтoчнocть. C yчeтoм пpeдocтaвлeнныx дaнныx, пaциeнт пoдвepгнyт дaльнeйшeмy oбcлeдoвaнию, включaя ЭKГ, УЗИ cepдцa, a тaкжe лaбopaтopныe aнaлизы для oцeнки фyнкции cepдцa и cocyдoв. Пpeдвapитeльный диaгнoз включaeт cepдeчнyю нeдocтaтoчнocть, apтepиaльнyю гипepтeнзию и диaбeт. Дoпoлнитeльнoe иccлeдoвaниe нeoбxoдимo для oпpeдeлeния тoчнoй пpичины cимптoмoв, a тaкжe paзpaбoтки индивидyaльнoгo плaнa лeчeния и yпpaвлeния cocтoяниeм пaциeнтa	I11.0	(1) Diagnosis Accuracy The diagnosis of I11.0 (Hypertensive heart disease with heart failure) is correct. (2–3) Potential Correct Diagnoses The primary diagnosis I11.0 can be maintained. E11.9 (Type 2 diabetes mellitus without complications) can be added. (Note: E11.8 is for diabetes with specified complications, which are not detailed here). Nausea and GI problems can be coded as R11 (Nausea and vomiting) or K30 (Functional dyspepsia) pending further investigation. The family history of cardiac diseases can be coded as Z82.4 (Family history of ischemic heart disease and other diseases of the circulatory system). (4) Other Errors and Remarks The term “cилyэтнaя cимптoмaтикa” (“silhouette symptomatology”) is unclear and is likely a generation error
The patient complains of symptoms characteristic of iron deficiency anemia. Пaциeнт жaлyeтcя нa жeлeзoдeфицитнaя aнeмия	A 45-year-old female patient presented to the clinic with complaints of symptoms characteristic of iron deficiency anemia. The patient reported suffering from general weakness and fatigue, which she has been experiencing for the last 6 months. She also noted a gradual deterioration in her vitality and increased fatigue when performing simple daily tasks, such as lifting heavy objects or engaging in physical activity. She reported frequent dizziness and daytime sleepiness. The patient also reported an increased heartbeat and a feeling of shortness of breath during physical activity, such as climbing stairs or prolonged walking. She also noticed that her skin and mucous membranes have become paler recently. The patient suffers from episodes of constipation and a feeling of abdominal heaviness after eating. A more detailed analysis of the history revealed the following information. The patient has a positive family history related to iron deficiency anemia; her mother and aunt also suffered from this condition. The patient denies having other serious health problems in the past and does not take any regular medications. It is important to note that the patient did not consent to a gynecological examination; however, she reported having heavy and prolonged menstrual periods lasting about 7 days. The patient did not indicate the presence of bleeding symptoms during intermenstrual intervals. She also denies pain in the lower abdomen or unusual vaginal discharge. The patient clarified her diet and confirmed that it lacks iron-rich foods such as meat, fish, or nuts. She also avoids consuming foods containing vitamin C, which aids iron absorption. Based on the presented history, symptoms, and the risk factor of family history, the preliminary diagnosis of iron deficiency anemia is confirmed. 
It is important to conduct an examination of the patient to confirm the diagnosis and prescribe appropriate treatment. Пaциeнт, жeнщинa, 45 лeт, oбpaтилacь в клиникy c жaлoбaми нa cимптoмы, xapaктepныe для жeлeзoдeфицитнoй aнeмии. Пaциeнткa зaявилa, чтo cтpaдaeт oт oбщeй cлaбocти и yтoмляeмocти, кoтopыe eю иcпытывaютcя в тeчeниe пocлeдниx 6 мecяцeв. Oнa тaкжe oтмeтилa пocтeпeннoe yxyдшeниe cвoeй жизнeннoй cилы и пoвышeннyю yтoмляeмocть пpи выпoлнeнии пpocтыx пoвceднeвныx зaдaч, тaкиx кaк пoднятиe тяжecтeй или yчacтиe в физичecкoй aктивнocти. Oнa cooбщилa o нaличии чacтыx гoлoвoкpyжeний и coнливocти днeм. Пaциeнткa тaкжe oтмeтилa yвeличeниe cepдцeбиeния и чyвcтвo oдышки пpи физичecкoй aктивнocти, тaкиx кaк пoдъeм пo лecтницe или длитeльнaя xoдьбa. Oнa тaкжe зaмeтилa, чтo ee кoжa и cлизиcтыe oбoлoчки cтaли бoлee блeдными в пocлeднee вpeмя. Пaциeнткa cтpaдaeт oт пpиcтyпoв зaпopa и oщyщeния тяжecти в живoтe пocлe eды. Бoлee дeтaльный aнaлиз aнaмнeзa выявил cлeдyющyю инфopмaцию. Пaциeнткa пpивeдeнa к пpиcтyпaм пoлoжитeльнoгo ceмeйнoгo aнaмнeзa, cвязaнныx c жeлeзoдeфицитнoй aнeмиeй. Ee мaть и тeтя тaкжe cтpaдaли oт этoгo зaбoлeвaния. Пaциeнткa oтpицaeт нaличиe дpyгиx cepьeзныx пpoблeм co здopoвьeм в пpoшлoм и нe пpинимaeт никaкиx пocтoянныx лeкapcтвeнныx пpeпapaтoв. Baжнo oтмeтить, чтo пaциeнткa нe coглacилacь нa пpoвeдeниe гинeкoлoгичecкoгo ocмoтpa, oднaкo oнa cooбщилa, чтo y нee имeютcя cильныe и oбильныe мeнcтpyaции, длитeльнocть кoтopыx cocтaвляeт oкoлo 7 днeй. Пaциeнткa нe yкaзaлa нa пpиcyтcтвиe cимптoмoв кpoвoтeчeния вo вpeмя мeжмeнcтpyaльныx интepвaлoв. Oнa тaкжe oтpицaeт нaличиe бoлeй в нижнeй чacти живoтa или нeoбычныx выдeлeний из пoлoвыx пyтeй. Пaциeнткa yтoчнилa cвoю диeтy и пoдтвepдилa, чтo ee paциoн нe coдepжит пpoдyктoв, бoгaтыx жeлeзoм, тaкиx кaк мяco, pыбa или opexи. Oнa тaкжe избeгaeт пpиeмa пищи, coдepжaщeй витaмин C, кoтopый cлyжит для пoвышeния ycвoeния жeлeзa opгaнизмoм. 
Ha ocнoвaнии пpeдcтaвлeннoгo aнaмнeзa, cимптoмoв и pиcкoвoгo фaктopa ceмeйнoгo aнaмнeзa, пpeдвapитeльный диaгнoз жeлeзoдeфицитнoй aнeмии пoдтвepждaeтcя. Baжнo пpoвecти oбcлeдoвaниe пaциeнтки для пoдтвepждeния диaгнoзa и нaзнaчeния cooтвeтcтвyющeгo лeчeния	D25.2	(1) Diagnosis Accuracy The diagnosis of D25.2 (Submucous leiomyoma of uterus) could potentially be correct but is unconfirmed and established prematurely. The patient did not consent to a gynecological examination, and a diagnosis of uterine fibroids cannot be made without instrumental investigation (pelvic ultrasound, hysteroscopy). There is no history of a previously identified fibroid. Clinical logic: Submucous fibroids do indeed often cause heavy, prolonged menstruation (7 days), leading to chronic blood loss and iron deficiency anemia—this is a typical cause-and-effect relationship. However, without ultrasound confirmation, the diagnosis D25.2 is presumptive. A more accurate diagnosis would be N92.0 (Excessive and frequent menstruation with regular cycle, menorrhagia). Furthermore, the iron deficiency anemia itself, which is the main complaint and a confirmed symptom complex, has not been coded. (2–3) Potential Correct Diagnoses D50.0 (Iron deficiency anemia secondary to blood loss, chronic)—This is the primary diagnosis, confirmed by the clinical picture (weakness, fatigue, dizziness, pallor, shortness of breath, tachycardia on exertion, family history). N92.0 (Excessive and frequent menstruation with regular cycle, menorrhagia)—The documented cause of blood loss. D25.2 (Submucous leiomyoma of uterus) can be added only after a pelvic ultrasound confirms the presence of a fibroid. Family history of anemia—Z83.2 (Family history of other disorders of the blood and blood-forming organs). (4) Other Errors and Remarks The initial phrasing “Пaциeнт жaлyeтcя нa жeлeзoдeфицитнaя aнeмия” is grammatically incorrect in Russian; it should be “нa жeлeзoдeфицитнyю aнeмию”. 
The phrase “Пaциeнткa пpивeдeнa к пpиcтyпaм пoлoжитeльнoгo ceмeйнoгo aнaмнeзa...” is awkward and likely a generation error. A standard phrasing would be “У пaциeнтки oтягoщeнный ceмeйный aнaмнeз пo жeлeзoдeфицитнoй aнeмии” (The patient has a significant family history of iron deficiency anemia).
The patient presented with complaints of nasal itching, elevated body temperature, and nasal congestion. Пaциeнт жaлyeтcя нa зyд в нocy, пoвышeниe тeмпepaтypы тeлa, зaлoжeннocть нoca	Patient Information: Name: Age: Sex: Chief Complaint: The patient presented with complaints of nasal itching, elevated body temperature, and nasal congestion. History of Present Illness: The patient reports that the problems started about two days ago. He attempted to manage the symptoms, but they intensified. The intensity of the nasal itching has also increased over time. The patient notes a feeling of discomfort and irritation in the nasal area. He also experiences difficulty breathing through his nose. Past Medical History: The patient denies allergic reactions to pollen, food, or pets. He also states that he has not had similar symptoms before. The patient reports no problems with his teeth or gums. His history contains no data on previous surgeries or chronic diseases. Physical Examination: General Appearance: -Elevated body temperature is observed. -Visible signs of nasal congestion are present. -No skin manifestations (e.g., rash or eruptions) are observed. Nasopharynx Examination: -Visible edema and redness of the nasal mucosa. A significant amount of nasal secretion is observed. Recommended Investigations: In this situation, I would recommend the following investigations: Laboratory blood tests, including a complete blood count with differential and a biochemical analysis. Urinalysis to rule out other possible causes of the symptoms. An allergen test to rule out an allergic reaction as the cause. Bacteriological examination of nasal secretions to rule out an infection. Diagnosis: Based on the presented symptoms, physical examination, and investigation results, the patient’s preliminary diagnosis is chronic rhinosinusitis of moderate severity. 
Treatment Plan: Considering the aforementioned diagnosis, I recommend the following treatment: Nasal drops or sprays with topical steroids to reduce inflammation. Antihistamines to relieve nasal congestion and itching. Anti-inflammatory drugs to reduce body temperature and alleviate symptoms. Follow-up: The patient is advised to return for a follow-up examination in 10 days to assess the effectiveness of the treatment and adjust therapy if necessary. Имя пaциeнтa: Boзpacт: Пoл: Жaлoбы: Пaциeнт oбpaтилcя c жaлoбaми нa зyд в нocy, пoвышeниe тeмпepaтypы тeлa и зaлoжeннocть нoca. Aнaмнeз нacтoящeгo зaбoлeвaния: Пaциeнт cooбщaeт, чтo пpoблeмы нaчaлиcь oкoлo двyx днeй нaзaд. Oн пытaлcя cпpaвитьcя c cимптoмaми, oднaкo oни ycиливaлиcь. Интeнcивнocть зyдa в нocy тaкжe вoзpocлa co вpeмeнeм. Пaциeнт oтмeчaeт чyвcтвo нeпpиятнocти и paздpaжeния в oблacти нoca. Taкжe oн иcпытывaeт зaтpyднeния в дыxaнии чepeз нoc. Meдицинcкий aнaмнeз: Пaциeнт oтpицaeт aллepгичecкyю peaкцию нa пыльцy, пищeвыe пpoдyкты или дoмaшниx живoтныx. Oн тaкжe yтвepждaeт, чтo нe имeл aнaлoгичныx cимптoмoв paнee. Пaциeнт нe oтмeчaeт пpoблeм c зyбaми или дecнaми. Иcтopия пaциeнтa нe coдepжит дaнныe o пepeнeceнныx oпepaцияx или xpoничecкиx зaбoлeвaнияx. Физичecкий ocмoтp: Oбщий вид пaциeнтa: - Haблюдaeтcя пoвышeннaя тeмпepaтypa тeлa. - Bидимы cлeды пocтpaдaвшиx oт зyдa зaлoжeннocти нoca. - Koжныe пpoявлeния (нaпpимep, cыпь или выcыпaния) нe нaблюдaютcя. Ocмoтp нocoглoтки: - Bидимыe oтeк и пoкpacнeниe cлизиcтoй oбoлoчки нoca. - Haблюдaeтcя мнoгo ceкpeтa в oблacти нoca. Дoпoлнитeльныe иccлeдoвaния: B дaннoй cитyaции я бы пopeкoмeндoвaл выпoлнить cлeдyющиe иccлeдoвaния: 1. Лaбopaтopныe aнaлизы кpoви, включaя oбщий aнaлиз кpoви c диффepeнциaльнoй фopмyлoй и биoxимичecкий aнaлиз. 2. Oбщий aнaлиз мoчи для иcключeния дpyгиx вoзмoжныx пpичин cимптoмoв. 3. Aнaлиз нa aллepгeны, чтoбы иcключить aллepгичecкyю peaкцию в кaчecтвe пpичины. 4. Бaктepиoлoгичecкoe иccлeдoвaниe ceкpeтoв из нoca, чтoбы иcключить нaличиe инфeкции. 
Диaгнoз: Ocнoвывaяcь нa пpeдcтaвлeнныx cимптoмax, физичecкoм ocмoтpe и peзyльтaтoв иccлeдoвaний, пpeдвapитeльный диaгнoз пaциeнтa cocтoит в xpoничecкoм pинocинycитe co cpeднeй cтeпeнью выpaжeннocти. Лeчeниe: Учитывaя вышeyкaзaнный диaгнoз, peкoмeндyю cлeдyющee лeчeниe: 1. Haзaльныe кaпли или cпpeи c мecтными cтepoидaми для cнятия вocпaлeния. 2. Aнтигиcтaминныe пpeпapaты для cнятия зaлoжeннocти нoca и зyдa. 3. Пpoтивoвocпaлитeльныe пpeпapaты для cнижeния тeмпepaтypы тeлa и oблeгчeния cимптoмoв. Фoллoв-aп: Пaциeнтy peкoмeндyeтcя пpийти нa пoвтopный ocмoтp чepeз 10 днeй для oцeнки эффeктивнocти лeчeния и кoppeктиpoвки тepaпии пpи нeoбxoдимocти	J00	(1) Diagnosis Accuracy The diagnosis of J00 (Acute nasopharyngitis [common cold]) could be correct for an acute process lasting 2 days with fever, nasal congestion, and mucosal edema. However, there is a contradiction in the text: it states “preliminary diagnosis—chronic rhinosinusitis of moderate severity,” but the assigned code is J00 (Acute nasopharyngitis). Chronic rhinosinusitis is coded under J32.x and requires a symptom duration of at least 12 weeks. Symptoms lasting only 2 days indicate an acute process, not a chronic one. The code J00 corresponds to the acute course, but the text contradicts the code. Furthermore, nasal itching is more characteristic of allergic rhinitis (J30) than infectious nasopharyngitis. (2–3) Potential Correct Diagnoses Given the 2-day history with fever, congestion, and mucosal edema, the following codes would be correct: J00 (Acute nasopharyngitis)—for an upper respiratory infection primarily affecting the nose and pharynx. J06.9 (Acute upper respiratory infection, unspecified) if the exact location is unclear. If there are signs of sinus involvement, J01.9 (Acute sinusitis, unspecified) could be considered. Fever can be coded as R50.9 (Fever, unspecified) as a concomitant symptom. 
(4) Other Errors and Remarks Major Contradiction: The text states the diagnosis is “chronic rhinosinusitis,” but the code is J00 (acute nasopharyngitis). The term “chronic” is not applicable for a duration of 2 days; a chronic process requires a minimum of 12 weeks. The phrase “cлeды пocтpaдaвшиx oт зyдa зaлoжeннocти нoca” is grammatically incorrect and unclear. Other incorrect grammatical constructions include: “чyвcтвo нeпpиятнocти” (feeling of unpleasantness), “зaтpyднeния в дыxaнии” (difficulties in breathing), “oбщий aнaлиз кpoви c диффepeнциaльнoй фopмyлoй” (awkward phrasing for “complete blood count with differential”), “диaгнoз... cocтoит в...” (diagnosis consists in...)
Patient Alexey Ivanov presented with complaints of cough and wheezing in the chest. Пaциeнт жaлyeтcя нa кaшeль, xpипы в гpyднoй клeткe	Patient Information: Name: Alexey Ivanov Age: 37 years Sex: Male Occupation: Programmer Chief Complaint: Patient Alexey Ivanov presented with complaints of cough and wheezing in the chest. History of Present Illness: The cough is persistent and began after a cold that started approximately three weeks ago. The patient noted that the cough worsens at night and during physical activity. He also reported a small amount of sputum produced by the cough. The patient complains of general weakness and fatigue that began soon after the cough appeared. He also reported chest pain that worsens with deep inspiration or coughing. Alexey Ivanov denies any previous obstructive airway diseases, such as asthma or bronchitis. Past Medical, Social, and Family History: Social History: Alexey Ivanov is a non-smoker and has no exposure to secondhand smoke or other harmful substances in the workplace. He denies allergic reactions to dust, pollen, or other allergens. Past Medical History: Alexey Ivanov had no significant medical problems prior to the onset of the cough. He has not previously received treatment for respiratory disorders or chronic obstructive pulmonary disease. Family History: The physician learned that Alexey Ivanov has no hereditary predisposition to lung diseases and no family history of similar problems. Examination and Findings: Vital Signs and General State: The patient is in good general condition. His temperature is 36.9 °C (98.4 °F), and his pulse is 80 beats per minute. The results of a complete blood count performed two weeks ago showed no deviations from the norm. Physical Examination: Examination of the patient revealed normal skin coloration and intact visible mucous membranes. Physical examination of the chest revealed the presence of moist rales (crackles) in the lower lobes of both lungs. 
The fundamental lung sounds are clear and distinct. Assessment and Plan: Based on the history, physical examination, and previous test results, the initial differential diagnosis includes the following pathologies: acute or subacute bronchitis, atypical pneumonia, or BA (bronchial asthma presenting with bronchial obstruction syndrome). For further evaluation and to establish a final diagnosis, the patient is scheduled for additional laboratory and instrumental investigations, such as a chest X-ray, spirometry, and sputum analysis. Имя: Aлeкceй Ивaнoв Boзpacт: 37 лeт Пoл: Myжcкoй Пpoфeccия: Пpoгpaммиcт Жaлoбы: Пaциeнт Aлeкceй Ивaнoв oбpaтилcя c жaлoбaми нa кaшeль и нaличиe xpипoв в гpyднoй клeткe. Kaшeль являeтcя пpoдoлжитeльным и вoзник пocлe пpocтyды, кoтopaя нaчaлa пpoявлятьcя oкoлo тpex нeдeль нaзaд. Пaциeнт oтмeтил, чтo кaшeль ycиливaeтcя нoчью и пpи физичecкoй aктивнocти. Oн тaкжe зaмeтил нeбoльшoe кoличecтвo мoкpoты, кoтopaя выxoдит в peзyльтaтe кaшля. Aнaмнeз зaбoлeвaния: Пaциeнт жaлyeтcя нa oбщyю cлaбocть и yтoмляeмocть, кoтopыe нaчaли пpoявлятьcя вcкope пocлe пoявлeния кaшля. Oн тaкжe oтмeтил нaличиe бoлeй в гpyднoй клeткe, кoтopыe ycиливaютcя пpи глyбoкoм вдoxe или кaшлe. Aлeкceй Ивaнoв oтpицaeт нaличиe пpeжниx oбcтpyктивныx зaбoлeвaний дыxaтeльныx пyтeй, тaкиx кaк acтмa или бpoнxит. Aнaмнeз жизни: Aлeкceй Ивaнoв являeтcя нeкypящим и нe имeeт кoнтaктa c пaccивным кypeниeм или дpyгими вpeдными вeщecтвaми нa paбoчeм мecтe. Oн oтpицaeт нaличиe aллepгичecкиx peaкций нa пыль, пыльцy или дpyгиe aллepгeны. Meдицинcкий aнaмнeз: Aлeкceй Ивaнoв ocoбыx мeдицинcкиx пpoблeм дo пoявлeния кaшля нe иcпытывaл. Oн нe пoлyчaл paнee лeчeниe нapyшeний дыxaния или xpoничecкoй oбcтpyктивнoй бoлeзни лeгкиx. Bpaчy cтaлo извecтнo, чтo Aлeкceй Ивaнoв нe имeeт нacлeдcтвeннoй пpeдpacпoлoжeннocти к зaбoлeвaниям лeгкиx и нe cтaлкивaлcя c пoдoбными пpoблeмaми в ceмeйнoм кpyгy. Oбщee cocтoяниe: Пaциeнт нaxoдитcя в xopoшeм oбщeм cocтoянии. 
Eгo тeмпepaтypa cocтaвляeт 36.9 °C, a пyльc paвeн 80 yдapoв в минyтy. Peзyльтaты oбщeгo aнaлизa кpoви, пpoвeдeнныe двe нeдeли нaзaд, нe выявили никaкиx oтклoнeний oт нopмы. Физикaльнoe oбcлeдoвaниe: Пpи ocмoтpe пaциeнтa, вpaч oбнapyжил нopмaльнyю oкpacкy кoжи и видимыe cлизиcтыe oбoлoчки в нeпoвpeждeннoм cocтoянии. Физикaльнoe oбcлeдoвaниe гpyднoй клeтки пoкaзaлo нaличиe влaжныx xpипoв в нижниx дoляx oбoиx лeгкиx. Ocнoвныe лeгoчныe звyки являютcя яcными и oтчeтливыми. Итoг: Ha ocнoвe aнaмнeзa, физичecкoгo oбcлeдoвaния и peзyльтaтoв пpeдшecтвyющиx aнaлизoв, нaчaльный диффepeнциaльный диaгнoз включaeт в ceбя cлeдyющиe пaтoлoгии: ocтpoй или пoдocтpoй бpoнxит, aтипичнyю пнeвмoнию или БAC (бpoнxиaльнyю acтмy нa фoнe cиндpoмa бpoнxиaльнoй oбcтpyкции). Для дaльнeйшeй oцeнки и ycтaнoвлeния oкoнчaтeльнoгo диaгнoзa, пaциeнтy нaзнaчaютcя дoпoлнитeльныe лaбopaтopныe и инcтpyмeнтaльныe иccлeдoвaния, тaкиe кaк peнтгeн гpyднoй клeтки, cпиpoмeтpия и aнaлиз мoкpoты.	J06.9	(1) Diagnosis Accuracy The diagnosis of J06.9 (Acute upper respiratory infection, unspecified) is incorrect. This code refers to an infection of the upper respiratory tract (nose, pharynx, larynx), while the clinical picture indicates pathology of the lower respiratory tract. Critical signs of lower tract involvement include: moist crackles in the lower lobes of both lungs on auscultation, productive cough (with sputum), and chest pain on breathing (pleuritic pain). A duration of 3 weeks is too long for a common upper respiratory tract viral infection (URI). There is a gross contradiction: the text states the “differential diagnosis includes acute or subacute bronchitis, atypical pneumonia,” but the assigned code is for an upper respiratory tract infection. (2–3) Potential Correct Diagnoses J20.9 (Acute bronchitis, unspecified) is the most likely diagnosis, given the moist crackles in the lungs, productive cough, and a 3-week duration following a URI. 
J18.9 (Pneumonia, unspecified) must be ruled out, given the moist crackles in the lower lobes and chest pain; a chest X-ray is necessary. R07.1 (Chest pain on breathing) can be used for the pleuritic pain. (4) Other Errors and Remarks The abbreviation “БAC” in the text is incorrectly decoded as “bronchial asthma on the background of bronchial obstruction syndrome.” This is not standard. “БAC” (BAS) typically stands for Amyotrophic Lateral Sclerosis (a neurological disease). The correct abbreviation for asthma is simply “БA” (BA).
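
The contradictions between code and narrative flagged above (for example, J06.9 assigned to a note that describes lower respiratory tract findings) lend themselves to a simple automated screen. The following is a minimal sketch and not part of the validation procedure used in this study. It maps an ICD-10 Chapter X code to its block title, so that the block of the assigned code can be compared with the block of a code suggested by the reviewer. The names `J_BLOCKS`, `j_block`, and `same_block` are hypothetical, and only the first five blocks of Chapter X are listed, following the WHO ICD-10 classification.

```python
import re
from typing import Optional

# First five blocks of ICD-10 Chapter X (diseases of the respiratory system).
# Each entry is (first category number, last category number, block title).
J_BLOCKS = [
    (0, 6, "Acute upper respiratory infections"),
    (9, 18, "Influenza and pneumonia"),
    (20, 22, "Other acute lower respiratory infections"),
    (30, 39, "Other diseases of upper respiratory tract"),
    (40, 47, "Chronic lower respiratory diseases"),
]

# Accepts codes such as "J00" or "J06.9"; anything else is treated as out of scope.
CODE_RE = re.compile(r"^J(\d{2})(?:\.\d{1,2})?$")


def j_block(code: str) -> Optional[str]:
    """Return the block title for a Chapter X code, or None if it is not covered."""
    match = CODE_RE.match(code.strip().upper())
    if match is None:
        return None
    category = int(match.group(1))
    for first, last, title in J_BLOCKS:
        if first <= category <= last:
            return title
    return None


def same_block(assigned: str, suggested: str) -> bool:
    """True only if both codes resolve to the same listed block."""
    a, b = j_block(assigned), j_block(suggested)
    return a is not None and a == b


# Assigned code vs. the code the reviewer considered most likely in the third example.
print(j_block("J06.9"))              # Acute upper respiratory infections
print(j_block("J20.9"))              # Other acute lower respiratory infections
print(same_block("J06.9", "J20.9"))  # False
```

A block mismatch of this kind only signals that a record deserves a closer look; it cannot replace the clinical reasoning given in the expert comments.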

Table A4. RuMedNLI expert validation.

Sentence 1 (en/ru) and Sentence 2 (en/ru)	Gold Label	Comment
Came to ED complaining of vomiting and weakness/Пocтyпил в пpиëмный пoкoй c жaлoбaми нa pвoтy и cлaбocть. Patient has upper GI pain/У пaциeнтa бoль в вepxниx oтдeлax ЖKT	Entailment	Neutral relation
Came to ED complaining of vomiting and weakness/Пocтyпил в пpиëмный пoкoй c жaлoбaми нa pвoтy и cлaбocть. Patient has negative ROS/Пocтyпил в пpиëмный пoкoй c жaлoбaми нa pвoтy и cлaбocть	Contradiction	Contradiction. Incorrect translation. He may deny pain, but ROS (Review of Systems) is not the results of a physical exam. The correct translation is: ‘During the review of systems, the patient reports no complaints.’ If he were simply denying pain, the connection would be neutral. But with the correct translation, the connection is a contradiction
Age [7–24] ARF with arthritis, heart murmur. age 20 Hypertension, 4+ labile with some [Month/Year (2) 21269] of 200/100 for which he has gone to ED for control/Boзpacт [7–24] OПH c apтpитoм, шyмoм в cepдцe. вoзpacт 20 лeт Гипepтoния, 4+ лaбильнaя c эпизoдaми [мecяц/гoд (2) 21269] дo 200/100, пo пoвoдy кoтopыx oн oбpaщaлcя в cтaциoнap для кoнтpoля. Multiple medical conditions/Mнoжecтвeнныe зaбoлeвaния	Entailment	Logical consequence. Incorrect translation. ARF stands for acute rheumatic fever (Ocтpaя peвмaтичecкaя лиxopaдкa, OPЛ), not acute renal failure (OПH). Hypertension, Risk 4 (labile, with hypertensive crises up to 200/100)
Age [7–24] ARF with arthritis, heart murmur. age 20 Hypertension, 4+ labile with some [Month/Year (2) 21269] of 200/100 for which he has gone to ED for control/Boзpacт [7–24] OПH c apтpитoм, шyмoм в cepдцe. вoзpacт 20 лeт Гипepтoния, 4+ лaбильнaя c эпизoдaми [мecяц/гoд (2) 21269] дo 200/100, пo пoвoдy кoтopыx oн oбpaщaлcя в cтaциoнap для кoнтpoля. Cardiac function is normal/Фyнкция cepдцa в нopмe	Contradiction	Contradiction. Incorrect translation. ARF stands for acute rheumatic fever (Ocтpaя peвмaтичecкaя лиxopaдкa, OPЛ), not acute renal failure (OПH). Hypertension, Risk 4 (labile, with hypertensive crises up to 200/100)
Age [7–24] ARF with arthritis, heart murmur. age 20 Hypertension, 4+ labile with some [Month/Year (2) 21269] of 200/100 for which he has gone to ED for control/Boзpacт [7–24] OПH c apтpитoм, шyмoм в cepдцe. вoзpacт 20 лeт Гипepтoния, 4+ лaбильнaя c эпизoдaми [мecяц/гoд (2) 21269] дo 200/100, пo пoвoдy кoтopыx oн oбpaщaлcя в cтaциoнap для кoнтpoля. On antihypertensive medication/Пpинимaeт aнтигипepтeнзивныe пpeпapaты	Neutral	Neutral relation. However, he most likely should be on therapy, since he was admitted to the hospital for monitoring, where antihypertensive drugs were almost certainly prescribed. Therapy could have been prescribed, but he is non-adherent, meaning he does not take the medication. Thus, the relation is indeed neutral, but somewhat closer to a contradiction. Incorrect translation. ARF stands for acute rheumatic fever (Ocтpaя peвмaтичecкaя лиxopaдкa, OPЛ), not acute renal failure (OПH). Hypertension, Risk 4 (labile, with hypertensive crises up to 200/100)
She denied any loss of consciousness or focal pain/Oтpицaлa пoтepю coзнaния или oчaгoвyю бoль. She was awake and alert/Былa бoдpa и внимaтeльнa	Entailment	Neutral relation
She denied any loss of consciousness or focal pain/Oтpицaлa пoтepю coзнaния или oчaгoвyю бoль. She is unconscious/Oнa бeз coзнaния	Contradiction	Contradiction/neutral relation. A debatable connection. The error lies specifically in the choice of the verb tense (“she denied”). If the tense were present, the connection would be a contradiction
83 yo female w/PMHX sig for HTN, CHF w/diastolic dysfunction, right renal artery stenosis p/w increased lethargy and unresponsiveness for 1 day/Жeнщинa 83 гoдa c пpизнaкaми гипepтoнии, ИБC c диacтoличecкoй диcфyнкциeй, cтeнoзoм пpaвoй пoчeчнoй apтepии c пoвышeннoй вялocтью и cлaбoй oтзывчивocтью в тeчeниe 1 дня. The patient developed diastolic dysfunction from long standing hypertension/У пaциeнтa paзвилacь диacтoличecкaя диcфyнкция из-зa дaвнeй гипepтoнии	Neutral	Neutral relation/logical consequence. Hypertension is an important factor in the pathogenesis of diastolic dysfunction. However, diastolic dysfunction has numerous potential causes and pathogenic mechanisms, which is why the connection cannot be considered absolute
States the pain was similar to that she had when she had her [ Location ]us MI’s/Гoвopит, чтo бoль пo xapaктepy былa пoxoжa нa тy, кoтopyю oнa иcпытывaлa, кoгдa y нeë был ИM.	Contradiction	Neutral relation. The ECG result depends on: The presence or absence of a scar from a previous myocardial infarction (with the formation of a pathological Q wave). The current diagnosis (The patient either has an MI now, or she does not. Furthermore, even if she is currently having an MI, there are two possibilities: ST-segment elevation MI and non-ST-segment elevation MI). Therefore, the ECG could turn out to be either normal or pathological
Total cardiopulmonary bypass time was 113 min/Oбщee вpeмя cepдeчнo-лeгoчнoгo шyнтиpoвaния cocтaвилo 113 минyт. Patient has had a CABG/Пaциeнт пepeнëc AKШ	Entailment	Neutral relation. The use of cardiopulmonary bypass can be explained not only by coronary artery bypass grafting (CABG), but also by any other cardiac surgery procedure, for example, valve replacement, among others. Translation note: A more accurate translation for “cardiopulmonary bypass” is “иcкyccтвeннoe кpoвooбpaщeниe”
Total cardiopulmonary bypass time was 113 min/Oбщee вpeмя cepдeчнo-лeгoчнoгo шyнтиpoвaния cocтaвилo 113 минyт. Patient has no CAD/У пaциeнтa нeт ИБC	Contradiction	Neutral relation. Cardiopulmonary bypass is used not only for coronary artery bypass grafting (CABG) for coronary artery disease (CAD), but also for other cardiac surgeries addressing different pathologies, for example, during valve replacement, among others. Translation note: A more accurate translation for “cardiopulmonary bypass” is “иcкyccтвeннoe кpoвooбpaщeниe”
Baby girl [Known patient lastname 39746] was born by a repeat scheduled cesarean section with an Apgar score of 4 at one minute, 5 at five minutes, and 7 at ten minutes/Дeвoчкa [Фaмилия пaциeнтa 39746] poдилacь в peзyльтaтe пoвтopнoгo плaнoвoгo кecapeвa ceчeния c oцeнкoй пo шкaлe Aпгap 4 нa пepвoй минyтe, 5 нa пятoй минyтe и 7 нa дecятoй минyтe. The patient has had concerning Apgar score/У пaциeнтки были нopмaльныe пoкaзaтeли пo шкaлe Aпгap	Entailment	Contradiction. A score of 7 to 10 points on the Apgar scale is considered normal and is assessed at the first and fifth minutes of a newborn’s life
Baby girl [Known patient lastname 39746] was born by a repeat scheduled cesarean section with an Apgar score of 4 at one minute, 5 at five minutes, and 7 at ten minutes/Дeвoчкa [Фaмилия пaциeнтa 39746] poдилacь в peзyльтaтe пoвтopнoгo плaнoвoгo кecapeвa ceчeния c oцeнкoй пo шкaлe Aпгap 4 нa пepвoй минyтe, 5 нa пятoй минyтe и 7 нa дecятoй минyтe. The baby had a reassuring Apgar score/У дeвoчки были oбнaдeживaющиe пoкaзaтeли пo шкaлe Aпгap	Contradiction	Neutral relation. Although the initial scores (4 and 5) are low, the key factor here is the positive trend: the improvement to 7 points by the 10th minute. A physician could interpret this trend as “encouraging” because it shows that the infant is responding to resuscitation efforts and that her condition is improving. However, “encouraging” is a subjective clinical assessment, not an objective fact. From the objective data (4/5/7) alone, one cannot definitively conclude “encouraging,” but neither can one claim that the data contradict it. Therefore, the relation is neutral
She started taking ibuprofen for it at [First Name8 (NamePattern2) ] [Last Name (un) 5416] dose/Для этoгo нaчaлa пpинимaть ибyпpoфeн в дoзe. The patient is not in pain/Пaциeнт нe иcпытывaeт бoли	Contradiction	Neutral relation. The data provided in sentence 1 is insufficient to determine the purpose of ibuprofen use. Ibuprofen can be taken not only for pain relief, but also to reduce fever, decrease inflammation, or for other purposes
53 year-old male with progressive multiple sclerosis and recent episode of urosepsis admitted from PCP’s office with generalized weakness and tachycardia/53-лeтний мyжчинa c пpoгpeccиpyющим pacceянным cклepoзoм и нeдaвним эпизoдoм ypoceпcиca пocтyпил пo нaпpaвлeнию cвoeгo лeчaщeгo вpaчa c oбщeй cлaбocтью и тaxикapдиeй. Patient has a neurological disorder/Пaциeнт cтpaдaeт нeвpoлoгичecким зaбoлeвaниeм	Neutral	Logical consequence. Multiple sclerosis is a neurological disease
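
To make the comparison in Table A4 concrete, the sketch below tallies how often the gold label matches the expert reading for the 15 rows printed above. It is illustrative only: these rows are the examples shown in the table, not the full validated sample, so the counts should not be read as the agreement estimate reported in the study. Two assumptions are made here and are not stated in the table itself: the expert phrase "logical consequence" is mapped to the entailment label, and a comment that offers two readings is treated as a set of acceptable labels. The names `ROWS` and `agreement` are hypothetical.

```python
E, N, C = "entailment", "neutral", "contradiction"

# One tuple per table row, in table order:
# (gold label, set of labels the expert comment accepts).
ROWS = [
    (E, {N}),
    (C, {C}),
    (E, {E}),
    (C, {C}),
    (N, {N}),
    (E, {N}),
    (C, {C, N}),  # expert: contradiction/neutral, debatable
    (N, {N, E}),  # expert: neutral/logical consequence
    (C, {N}),
    (E, {N}),
    (C, {N}),
    (E, {C}),
    (C, {N}),
    (C, {N}),
    (N, {E}),
]


def agreement(rows):
    """Return (strict matches, lenient matches, total rows).

    Strict: the expert accepts exactly one label and it equals the gold label.
    Lenient: the gold label is among the labels the expert accepts.
    """
    strict = sum(1 for gold, expert in rows if expert == {gold})
    lenient = sum(1 for gold, expert in rows if gold in expert)
    return strict, lenient, len(rows)


strict, lenient, total = agreement(ROWS)
print(f"strict: {strict}/{total}, lenient: {lenient}/{total}")
# prints "strict: 4/15, lenient: 6/15"
```

Reporting both counts keeps the debatable rows visible instead of forcing them into a single label.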

Appendix C

In this appendix, we provide a consolidated table with direct links to publicly available Russian medical NLP datasets to facilitate reproducibility and further research.

Table A5. Public Russian medical NLP datasets with verified access URLs. All URLs were accessed on 11 November 2025.

Dataset	URL
RICD (Russian Intensive Care Dataset)	https://fnkcrr-database.ru/en_main.html
RuMedPrimeData	https://zenodo.org/records/5765873
RuMedTop3	https://github.com/sberbank-ai-lab/RuMedBench/tree/master/data/RuMedTop3
RuMedSymptomRec	https://github.com/sberbank-ai-lab/RuMedBench/tree/master/data/RuMedSymptomRec
RuMedDaNet	https://github.com/sberbank-ai-lab/RuMedBench/tree/master/data/RuMedDaNet
RuMedNLI	https://physionet.org/content/rumednli-russian-inference/1.0.0/
RuMedNER	https://github.com/sberbank-ai-lab/RuMedBench/tree/master/data/RuMedNER
MedSyn-Synthetic	https://huggingface.co/datasets/Glebkaa/MedSyn-synthetic
MedSyn-IFT	https://huggingface.co/datasets/Glebkaa/MedSyn-ift
smakov/ru_medsum	https://huggingface.co/datasets/smakov/ru_medsum
Medical forum Q&A	https://huggingface.co/datasets/blinoff/medical_qa_ru_data
RuMedQ	https://github.com/sb-ai-lab/RuMedQ
RuDReC	https://github.com/cimm-kzn/RuDReC
RuADReCT	https://github.com/cimm-kzn/RuDReC
RuCCoN	https://github.com/AIRI-Institute/RuCCoN
Ophthalmology Russian/English Translations	https://www.kaggle.com/datasets/cheshrcat/ru-medical-texts-ophtalmology
MMedBench	https://huggingface.co/datasets/Henrychur/MMedBench
NEREL-BIO	https://github.com/nerel-ds/NEREL-BIO
RuCCoD	https://github.com/auto-icd-coding/ruccod
BioNNE-L Shared Task	https://github.com/nerel-ds/NEREL-BIO/tree/master/BioNNE-L_Shared_Task
Medical_articles (MEDSI)	https://www.kaggle.com/datasets/kwyrob/medsi-articles
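
Because the datasets in Table A5 are hosted on several platforms with different access mechanisms (for instance, PhysioNet resources generally require credentialed access), it can be convenient to group the links by host before scripting any downloads. The following is a minimal sketch using only the Python standard library. It covers a subset of rows copied from the table, and the names `DATASETS` and `group_by_host` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A subset of Table A5, copied verbatim (dataset name -> URL).
DATASETS = {
    "RuMedPrimeData": "https://zenodo.org/records/5765873",
    "RuMedNLI": "https://physionet.org/content/rumednli-russian-inference/1.0.0/",
    "MedSyn-Synthetic": "https://huggingface.co/datasets/Glebkaa/MedSyn-synthetic",
    "MedSyn-IFT": "https://huggingface.co/datasets/Glebkaa/MedSyn-ift",
    "RuCCoN": "https://github.com/AIRI-Institute/RuCCoN",
    "NEREL-BIO": "https://github.com/nerel-ds/NEREL-BIO",
}


def group_by_host(datasets):
    """Group dataset names by the host part of their URL."""
    groups = defaultdict(list)
    for name, url in datasets.items():
        groups[urlparse(url).netloc].append(name)
    return dict(groups)


for host, names in sorted(group_by_host(DATASETS).items()):
    print(f"{host}: {', '.join(names)}")
```

Extending `DATASETS` with the remaining rows of the table requires no change to the function.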

References

Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef] [PubMed]
Tutubalina, E.; Alimova, I.; Miftahutdinov, Z.; Sakhovskiy, A.; Malykh, V.; Nikolenko, S. The Russian Drug Reaction Corpus and neural models for drug reactions and effectiveness detection in user reviews. Bioinformatics 2021, 37, 243–249. [Google Scholar] [CrossRef] [PubMed]
Grechko, A.V.; Yadgarov, M.Y.; Yakovlev, A.A.; Berikashvili, L.B.; Kuzovlev, A.N.; Polyakov, P.A.; Kuznetsov, I.V.; Likhvantsev, V.V. RICD: Russian Intensive Care Dataset. Gen. Reanimatol. 2024, 20, 22–31. [Google Scholar] [CrossRef]
Kulikov, E.; Fedorova, O.; Tolmachev, I.; Ryazantseva, U.; Vrazhnov, D.; Gubanov, A.; Nesterovich, S.; Shmyrina, A. Russian-language repository of the open clinical data «SibMed Data Clinical Repository». Bull. Sib. Med. Bull. Sib. Meditsiny 2023, 22, 182–184. [Google Scholar] [CrossRef]
Abdullah, G.M. Translating Medical Terminology. Br. J. Appl. Linguist. 2025, 5, 60–70. [Google Scholar] [CrossRef]
Kumichev, G.; Blinov, P.; Kuzkina, Y.; Goncharov, V.; Zubkova, G.; Zenovkin, N.; Goncharov, A.; Savchenko, A. Medsyn: Llm-based synthetic medical text generation framework. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2024; pp. 215–230. [Google Scholar]
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 353–355. [Google Scholar]
Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. Adv. Neural Inf. Process. Syst. 2019, 32, 294. [Google Scholar]
Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl. Sci. 2021, 11, 6421. [Google Scholar] [CrossRef]
Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning; PMLR: New York, NY, USA, 2022; pp. 248–260. [Google Scholar]
Romanov, A.; Shivade, C. Lessons from natural language inference in the clinical domain. arXiv 2018, arXiv:1808.06752. [Google Scholar] [CrossRef]
Bedi, S.; Cui, H.; Fuentes, M.; Unell, A.; Wornow, M.; Banda, J.M.; Kotecha, N.; Keyes, T.; Mai, Y.; Oez, M. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. arXiv 2025, arXiv:2505.23802. [Google Scholar]
Zhang, D.; Xue, X.; Gao, P.; Jin, Z.; Hu, M.; Wu, Y.; Ying, X. A survey of datasets in medicine for large language models. Intell. Robot. 2024, 4, 457–478. [Google Scholar] [CrossRef]
International Statistical Classification of Diseases and Related Health Problems 10th Revision. Available online: https://icd.who.int/browse10/2019/en (accessed on 11 November 2025).
Anatomical Therapeutic Chemical (ATC) Classification. Available online: https://www.who.int/tools/atc-ddd-toolkit/atc-classification (accessed on 11 November 2025).
Systematized Nomenclature of Medicine. Available online: https://www.snomed.org/ (accessed on 11 November 2025).
Monogarova, A.G.; Shiryaeva, T.A.; Tikhonova, E.V. The words that make fake stories go viral: A corpus-based approach to analyzing Russian Covid-19 disinformation. Russ. J. Linguist. 2024, 27, 543–569. [Google Scholar] [CrossRef]
National NLP Clinical Challenges (n2c2). Available online: https://n2c2.dbmi.hms.harvard.edu/data-sets (accessed on 11 November 2025).
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar]
Kim, Y.; Wu, J.; Abdulle, Y.; Wu, H. MedExQA: Medical question answering benchmark with multiple explanations. arXiv 2024, arXiv:2406.06331. [Google Scholar]
Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2567–2577. [Google Scholar]
Starovoytova, E.; Kulikov, E.; Fedosenko, S.; Shmyrina, A.; Kirillova, N.; Vinokurova, D.; Balaganskaya, M. RuMedPrimeData; Zenodo: Geneva, Switzerland, 2021. [Google Scholar] [CrossRef]
Blinov, P.; Reshetnikova, A.; Nesterov, A.; Zubkova, G.; Kokh, V. RuMedBench: A Russian medical language understanding benchmark. In Proceedings of the International Conference on Artificial Intelligence in Medicine; Springer: Berlin/Heidelberg, Germany, 2022; pp. 383–392. [Google Scholar]
Daniyar, S. ru_MedSum Dataset. Available online: https://huggingface.co/datasets/smakov/ru_medsum (accessed on 11 November 2025).
Medical Forum Q&A. Available online: https://huggingface.co/datasets/blinoff/medical_qa_ru_data (accessed on 11 November 2025).
RuMedQ. Available online: https://github.com/sberbank-ai-lab/RuMedQ (accessed on 11 November 2025).
FutureBeeAI Healthcare Chat. Available online: https://www.futurebeeai.com/dataset/text-dataset/russian-healthcare-domain-conversation-text-dataset (accessed on 11 November 2025).
Nesterov, A.; Zubkova, G.; Miftahutdinov, Z.; Kokh, V.; Tutubalina, E.; Shelmanov, A.; Alekseev, A.; Avetisian, M.; Chertok, A.; Nikolenko, S. RuCCoN: Clinical concept normalization in Russian. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 239–245. [Google Scholar]
Full-Size Russian Corpus of Internet Drug Reviews. Available online: https://github.com/sag111/RDRS (accessed on 11 November 2025).
Ophthalmology Russian/English Translations. Available online: https://www.kaggle.com/datasets/cheshrcat/ru-medical-texts-ophtalmology (accessed on 11 November 2025).
Blinov, P.; Avetisian, M.; Kokh, V.; Umerenkov, D.; Tuzhilin, A. Predicting clinical diagnosis from patients electronic health records using BERT-based neural networks. In Proceedings of the International Conference on Artificial Intelligence in Medicine; Springer: Berlin/Heidelberg, Germany, 2020; pp. 111–121.
Yalunin, A.; Nesterov, A.; Umerenkov, D. RuBioRoBERTa: A pre-trained biomedical language model for Russian language biomedical text mining. arXiv 2022, arXiv:2204.03951.
Qiu, P.; Wu, C.; Zhang, X.; Lin, W.; Wang, H.; Zhang, Y.; Wang, Y.; Xie, W. Towards building multilingual language model for medicine. Nat. Commun. 2024, 15, 8384.
Loukachevitch, N.; Manandhar, S.; Baral, E.; Rozhkov, I.; Braslavski, P.; Ivanov, V.; Batura, T.; Tutubalina, E. NEREL-BIO: A dataset of biomedical abstracts annotated with nested named entities. Bioinformatics 2023, 39, btad161.
Nesterov, A.; Shakhovskiy, A.; Sviridov, A.; Valiev, A.; Makharev, V.; Anokhin, P.; Zubkova, G.; Tutubalina, E. RuCCoD: Towards automated ICD Coding in Russian. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, November 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 2558–2585. Available online: https://aclanthology.org/2025.emnlp-main.129.pdf (accessed on 30 January 2026).
Sakhovskiy, A.; Loukachevitch, N.; Tutubalina, E. Overview of the BioASQ BioNNE-L task on biomedical nested entity linking in CLEF 2025. arXiv 2025, arXiv:2508.20554. Available online: https://ceur-ws.org/Vol-4038/paper_3.pdf (accessed on 30 January 2026).
Medsi Medical Articles (Russian). Kaggle Datasets, 2024. Available online: https://www.kaggle.com/datasets/kwyrob/medsi-articles (accessed on 20 January 2026).
Pogrebnoi, D.; Funkner, A.; Kovalchuk, S. RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques. In Proceedings of the Computational Science—ICCS 2023: 23rd International Conference, Prague, Czech Republic, 3–5 July 2023; Proceedings, Part III. pp. 213–227.
Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1.
Weiskopf, N.G.; Rusanov, A.; Weng, C. Sick patients have more data: The non-random completeness of electronic health records. In AMIA Annual Symposium Proceedings 2013; PMC: Los Angeles, CA, USA, 2013; Volume 2013, p. 1472. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC3900159/ (accessed on 2 February 2026).
Pivovarov, R. Electronic Health Record Summarization Over Heterogeneous and Irregularly Sampled Clinical Data; Columbia University: New York, NY, USA, 2016; Available online: https://www.proquest.com/openview/7f28369cf1a5ad3aced8a134ca1f38a5/1?pq-origsite=gscholar&cbl=18750 (accessed on 2 February 2026).

Figure 1. Categorization of Russian-language medical data sources. This categorization emphasizes that the effective development of medical NLP models for Russia requires moving beyond the mere translation of clinical texts. It is necessary to account for the entire data ecosystem: formal Clinical Practice Documents, informal Communication data (chats, forums), Regulatory Resources that define clinical practice, and Administrative Documents that reflect the logic of the Compulsory Health Insurance System (CHIS). Color coding represents data sources and is used consistently across categories, reflecting that each activity-based category may include multiple data origins. Only a comprehensive dataset spanning all these categories will overcome the localization problem and make it possible to build AI assistants relevant to the Russian physician.

Figure 2. Chronological diagram of Russian-language medical NLP datasets created over time. Purple denotes generated (LLM-based or template-guided) datasets; blue denotes datasets translated from English with expert post-editing. Lock icons indicate resources without open public access, which can typically be used only under research agreements. The color scheme and border styles are summarized in the figure legend. For the FutureBeeAI dataset, the year corresponds to the company foundation date, as the exact dataset release date is not publicly specified. The figure illustrates the evolution of Russian medical NLP resources, distinguishing original clinical and user-generated corpora from translated and synthetic datasets.

Figure 3. Coverage of MedHELM tasks by publicly available Russian medical NLP datasets. Instruction-tuning dataset MedSyn-IFT is intentionally excluded to emphasize coverage provided by task-specific datasets. The figure highlights the uneven distribution of available resources and the limited number of MedHELM tasks that are currently covered by existing datasets.

Figure 4. Taxonomy of Russian medical NLP data by origin and reliability. The framework distinguishes real, translated, and generated data. Real data are treated as a single category, although their internal trustworthiness may vary (see Figure 1). Both real and expert-translated data represent authentic or professionally translated content rather than post-edited adaptation. In contrast, translated and generated data obtained through automatic methods are further divided by the degree of human involvement—ranging from expert post-editing (semi-automatic adaptation, shown in green) to fully automatic approaches such as machine translation or unrestricted LLM generation (shown in yellow). This distinction reflects the varying reliability and provenance control across data creation modes and highlights two principal pathways of translation and generation: fully automatic and semi-automatic.
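
As a rough illustration of the taxonomy in Figure 4, the sketch below encodes a dataset's origin and degree of human involvement as two enumerations. The class names (`Origin`, `Mode`, `Provenance`) and the validation rules are assumptions read off the caption, not a schema defined by the authors.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    REAL = "real"
    TRANSLATED = "translated"
    GENERATED = "generated"


class Mode(Enum):
    HUMAN = "human"                      # authentic text or professional translation
    SEMI_AUTOMATIC = "semi_automatic"    # automatic output with expert post-editing
    FULLY_AUTOMATIC = "fully_automatic"  # machine translation or unrestricted LLM output


@dataclass(frozen=True)
class Provenance:
    origin: Origin
    mode: Mode

    def __post_init__(self):
        # Assumed constraints inferred from the caption, not rules stated in the source.
        if self.origin is Origin.REAL and self.mode is not Mode.HUMAN:
            raise ValueError("real data form a single, non-automatic category")
        if self.origin is Origin.GENERATED and self.mode is Mode.HUMAN:
            raise ValueError("generated data are semi- or fully automatic")

    @property
    def is_automatic(self) -> bool:
        # True for both automatic pathways (semi-automatic and fully automatic).
        return self.mode is not Mode.HUMAN


expert_translation = Provenance(Origin.TRANSLATED, Mode.HUMAN)
post_edited_generation = Provenance(Origin.GENERATED, Mode.SEMI_AUTOMATIC)
```

Keeping origin and mode as separate fields mirrors the caption's two axes, so a corpus can be filtered by either one independently.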

Table 2. Data quality and annotation limitations identified during expert review of RuMedPrimeData dataset.

Limitation	Description	Error Rate
Insufficient clinical data	Records frequently lack critical data points, including patient age, symptom duration, and results of essential diagnostic tests (e.g., ECG/ECHO for chest pain, FGDS (fibrogastroduodenoscopy) for reflux, imaging for suspected colic, glycemic/thyroid panels). This often leads to diagnoses being made speculatively, a fact explicitly noted by experts in their feedback.	8%
Inaccurate ICD-10 coding	Codes are often assigned from incorrect nosological groups (e.g., acute conditions coded as chronic) or replaced by non-specific “catch-all” categories (e.g., I99 instead of specific IHD codes), misrepresenting the clinical picture.	51%
Unsubstantiated diagnoses	Etiology or disease forms (e.g., bacterial vs. viral, calculous vs. non-calculous) are specified without supporting diagnostic data, resulting in unjustified clinical conclusions.	9%
Ambiguous clinical pictures	Single records conflate multiple problems from different organ systems, requiring case segmentation and multi-code annotation (primary, secondary, complications) for accurate representation.	7%
Insufficient context for code assignment	Key diagnostic modifiers are omitted, such as exertional association of pain (for I20), signs of heart failure, or confirmation of left ventricular hypertrophy (for I11.x), rendering assigned codes incorrect or overly general.	3%
Logical inconsistencies	Contradictions occur between described symptoms and diagnoses, or clinical “red flags” are ignored (e.g., syncope attributed to spinal hernia without cardiovascular evaluation), violating fundamental clinical logic.	2%
Unclear prioritization of conditions	It is often unclear which diagnosis represents the primary reason for encounter or hospitalization versus comorbid background conditions, necessitating explicit prioritization in coding.	5%
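
Table 2 separates codes drawn from the wrong nosological group from codes that are merely too general. A minimal sketch of how that distinction could be checked mechanically follows, assuming each record carries both the dataset's code and an expert-assigned code. The function names and the example code pairs are hypothetical and are not taken from RuMedPrimeData.

```python
def icd10_category(code: str) -> str:
    """Return the three-character category of an ICD-10 code, e.g. 'I11.9' -> 'I11'."""
    return code.strip().upper().split(".")[0][:3]


def compare_codes(assigned: str, expert: str) -> str:
    """Classify agreement between the dataset's code and the expert's code."""
    a, e = assigned.strip().upper(), expert.strip().upper()
    if a == e:
        return "exact"
    if icd10_category(a) == icd10_category(e):
        # Same three-character category, different subcategory or missing detail.
        return "same_category"
    return "different"


# Hypothetical pairs for illustration only.
print(compare_codes("I11.9", "I11.0"))   # same_category
print(compare_codes("I99", "I20.8"))     # different
print(compare_codes("i20.8 ", "I20.8"))  # exact
```

A string comparison of this kind only measures agreement with the expert label; it cannot judge whether a code is clinically justified, which is what the expert review in Table 2 assesses.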

Table 3. Data quality and annotation limitations identified during expert review of MedSyn-Synthetic dataset.

Limitation	Description	Error Rate
Missing critical clinical data	Frequent absence of demographic information (age, sex), disease timeline, epidemiological history, and diagnostic results, precluding reliable ICD-10 coding and clinical verification.	42%
Internal clinical inconsistencies	Contradictions between reported symptoms, physical findings, and diagnoses (e.g., pain and edema described as “no signs of inflammation”).	24%
Inaccurate ICD-10 coding	ICD-10 codes are assigned incorrectly or prematurely without sufficient instrumental confirmation or alignment with clinical presentation.	51%
Lack of clinical specificity	Symptom descriptions remain overly generalized, lacking characterization of pain, localization, duration, or provoking factors.	9%
Inclusion of irrelevant information	Records include administrative details, unrelated medical history, or unsubstantiated lifestyle factors not pertinent to the clinical case.	9%
Insufficient diagnostic support	Diagnoses are established without mandatory investigations (imaging, laboratory tests, ECG/ECHO), while recommending inappropriate examinations.	40%
Clinically implausible scenarios	Documentation includes non-existent diagnoses or medically inconsistent cases (e.g., congenital anomalies described in adult patients).	7%
Generative and translation-related terminology artifacts	Presence of calqued expressions, incorrect or non-standard medical terminology, and untranslated Arabic fragments, indicative of synthetic text generation with automated translation artifacts.	53%
Synthetic text patterns	Dialog-style formatting and unnatural phrasing suggest generative origin rather than authentic clinical documentation.	13%
Format corruption	Records contain empty entries or nonsensical character sequences, rendering the data unusable.	2%
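
The error rates in Table 3 add up to well over 100%. One reading consistent with this is that a single record can carry several limitations at once, although this section does not spell out the tallying procedure. The sketch below first sums the published rates, then shows how per-limitation rates behave under such multi-label annotation. The toy annotations and label names are hypothetical.

```python
from collections import Counter

# Error rates in percent, copied from Table 3 (MedSyn-Synthetic).
TABLE3_RATES = {
    "Missing critical clinical data": 42,
    "Internal clinical inconsistencies": 24,
    "Inaccurate ICD-10 coding": 51,
    "Lack of clinical specificity": 9,
    "Inclusion of irrelevant information": 9,
    "Insufficient diagnostic support": 40,
    "Clinically implausible scenarios": 7,
    "Generative and translation-related terminology artifacts": 53,
    "Synthetic text patterns": 13,
    "Format corruption": 2,
}


def per_label_rates(records):
    """Percentage of records carrying each label; `records` is a list of label sets."""
    counts = Counter(label for labels in records for label in labels)
    return {label: 100 * n / len(records) for label, n in counts.items()}


# Hypothetical annotations for four records (not taken from the dataset).
toy_records = [
    {"coding", "artifacts"},
    {"coding"},
    {"artifacts", "missing_data"},
    set(),  # a record with no flagged limitation
]

print(sum(TABLE3_RATES.values()))    # 250
print(per_label_rates(toy_records))  # rates sum to 125 although one record is clean
```

Under this reading, the per-limitation rates cannot be added together to estimate the share of defective records; that share has to be counted per record.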

Table 4. Data quality and annotation limitations identified during expert review of RuMedNLI dataset.

Limitation	Description	Error Rate
Terminology and abbreviation mistranslations	Fundamental mistranslations of key clinical abbreviations (e.g., “ARF” translated as acute renal failure instead of acute rheumatic fever) and erroneous substitutions of medical terms (e.g., “cardiopulmonary bypass” translated as CABG).	12%
Incomplete or distorted clinical statements due to translation	Omission of crucial units of measurement and clinically relevant details, as well as literal translations producing non-idiomatic Russian expressions (e.g., “amniotic odor” instead of “smell of amniotic fluid”), rendering statements unverifiable or misleading.	5%
Mixed-language and inconsistent terminology usage	Unsystematic mixing of translated, transliterated, and original English terms within single documents, leading to inconsistency in terminology representation.	1%
Logical relation annotation errors	Frequent mismatches between annotated and expert-verified entailment, contradiction, and neutral relations; systematic treatment of clinical associations as logical entailments; failure to account for temporal or conditional clinical relationships; annotation of clinically implausible causal relations due to lack of expert verification.	26%
Interpretative and negation-related errors	Addition of clinical interpretations absent from the source texts and critical misinterpretations arising from incorrect handling of negation particles, leading to false logical implications.	3%
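
The "logical relation annotation errors" row of Table 4 amounts to a comparison between the dataset's NLI label and the expert-verified label for each pair. A minimal sketch of that comparison is given below, assuming the labels are available as (dataset label, expert label) tuples. The function name and the toy pairs are hypothetical, not values from RuMedNLI.

```python
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")


def disagreement(pairs):
    """Return (mismatch rate, confusion counts) for (dataset_label, expert_label) pairs."""
    pairs = list(pairs)
    for dataset_label, expert_label in pairs:
        if dataset_label not in LABELS or expert_label not in LABELS:
            raise ValueError(f"unknown label in pair: {(dataset_label, expert_label)}")
    confusion = Counter(pairs)
    mismatches = sum(n for (d, e), n in confusion.items() if d != e)
    return mismatches / len(pairs), confusion


# Hypothetical label pairs for illustration only.
toy_pairs = [
    ("entailment", "entailment"),
    ("entailment", "neutral"),  # a clinical association treated as entailment
    ("contradiction", "contradiction"),
    ("neutral", "entailment"),
]

rate, confusion = disagreement(toy_pairs)
print(rate)                                  # 0.5
print(confusion[("entailment", "neutral")])  # 1
```

The confusion counts show which direction the disagreements run, for example how often an annotated entailment was judged neutral by the expert.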

Table 5. Summary of expert-identified limitations of Russian medical NLP datasets.

Dataset	Limitation
RuMedPrimeData [22]	While representing authentic clinical text, suffers from incomplete patient information, inconsistent ICD-10 [14] coding and contradictions between findings and diagnoses.
MedSyn-Synthetic [6]	Shows the same structural problems, further amplified by generative artifacts such as inserted dialogue fragments, non-existent symptoms, and untranslated or corrupted text (including Arabic characters). These artifacts affect not only style but also the factual coherence of medical histories and symptom descriptions, resulting in even higher error rates than in RuMedPrimeData [22].
RuMedNLI [13]	Discrepancies partly stem from differing interpretations of NLI tasks between medical experts and dataset authors—particularly in distinguishing probabilistic associations from strict logical entailment—as well as from numerous terminological and translation errors.

Table 6. Coverage gaps between the MedHELM clinical task taxonomy and existing Russian-language medical NLP datasets.

Task category (MedHELM tasks)	Coverage gap
Clinical Guidelines (Apply guidelines and best practices)	Under Russian law, physicians are legally obliged to follow official clinical guidelines and deviation from them can entail professional or even legal liability. However, within AI and NLP research, there are no datasets linking medical records or prescriptions to guideline clauses. Consequently, the critical area of guideline adherence, which directly affects the safety and quality of care, remains entirely unsupported by existing data.
Routing and Contraindications (Match protocols/screen contraindications; Suggest clinical pathways)	In real clinical practice, a physician must not only select treatment but also account for contraindications and plan patient routing across stages of care—from primary consultation to hospitalization and rehabilitation. However, no Russian-language datasets capture protocol adherence or deviation, contraindication detection, or clinical pathway formation. This omission prevents AI systems from addressing key workflow and safety tasks already explored in English-language clinical benchmarks.
Collaborative Decision-Making and Consultations (Make collaborative decisions; Evaluate admissions; Manage post-discharge planning and transitions)	Multidisciplinary collaboration is a fundamental part of modern clinical work, particularly for complex or chronic cases. Nevertheless, Russian corpora lack any records of joint consultations, multidisciplinary discussions, or inter-institutional care coordination. Without such data, models cannot learn patterns of distributed reasoning or transitions between care levels—capabilities crucial for systems supporting real-world healthcare delivery.
Healthcare Administration and Economics (Scheduling, financial and resource management)	Administrative and financial aspects of healthcare—insurance processing, billing workflows, expense estimation, staff scheduling, and institutional performance monitoring—are virtually absent from Russian open datasets. This gap makes it impossible to develop AI tools that could assist in organizational decision-making or integrate with the CHIS.
Medical Research Data (Research support and quality assurance)	The research domain also remains underrepresented. There are no Russian corpora for analyzing clinical trials (Task 44), conducting cohort studies (Task 45), tracking patient recruitment (Task 48), or supporting meta-research activities such as bias assessment or methodological validation (Tasks 43, 46, 47). The lack of such datasets limits the ability of language models to assist in evidence synthesis and quality assurance within biomedical research.
Documentation of Procedures and Care Plans (Recording procedures; Documenting diagnostic reports; Documenting care plans)	Unlike English-language initiatives such as i2b2 or n2c2, Russian open datasets contain no procedural or diagnostic documentation—for example, surgical protocols, imaging reports, or nursing care plans. These document types are essential for continuity of care and for training models that can generate or summarize clinical documentation in realistic hospital settings.
Clinical Reasoning Chains (Generate differential diagnoses; Make collaborative decisions; Generate team assessments)	Perhaps the most critical gap concerns datasets reflecting clinical reasoning itself. None of the existing Russian corpora capture the step-by-step diagnostic or deliberative process of a physician. Available datasets usually record only the final diagnosis or decision, without showing the intermediate hypotheses or justification. Without annotated reasoning chains, it is impossible to train explainable models capable of multi-hypothesis inference, collaborative decision-making, or transparent diagnostic justification.
Russian-Specific Considerations	Russian clinical practice introduces additional complexity. The mandatory nature of clinical guidelines, standardized diagnostic codes and rigid documentation formats increase the potential consistency of data but also constrain model generalization. Translation of English tasks into this context often fails: for instance, questions referencing drugs not registered in Russia or insurance systems without CHIS analogues become meaningless after direct translation. Thus, many imported datasets require deep expert adaptation rather than literal translation.

