Article

Hybrid Approach to Patient Review Classification at Scale: From Expert Annotations to Production-Ready Machine Learning Models for Sustainable Healthcare

by
Irina Evgenievna Kalabikhina
1,
Anton Vasilyevich Kolotusha
1,* and
Vadim Sergeevich Moshkin
2
1
Population Department, Faculty of Economics, Lomonosov Moscow State University, Moscow 119991, Russia
2
Department of Information Systems, Ulyanovsk State Technical University, Ulyanovsk 432027, Russia
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(4), 114; https://doi.org/10.3390/bdcc10040114
Submission received: 24 February 2026 / Revised: 2 April 2026 / Accepted: 7 April 2026 / Published: 9 April 2026

Abstract

Patients leave millions of medical reviews annually, providing critical data for quality management. However, manual processing is infeasible, and existing systems fail to distinguish medical from organizational problems—a distinction essential for complaint routing. The consequences of misrouting are significant: clinical issues may go unaddressed when medical complaints reach administrative staff, while systemic service problems remain unresolved when organizational complaints reach medical directors. We developed a hybrid approach combining expert annotation with Large Language Models (LLMs). Fifteen prompt iterations on 1500 reviews with expert validation (modified Cohen’s kappa (κ_mod), which weights errors hierarchically, reached 0.745) preceded the LLM annotation of 15,000 mixed-sentiment and positive reviews. These were combined with 7417 expert-annotated negative reviews to form a corpus of 22,417 reviews. Eight architectures, ranging from Logistic Regression to a BERT + TF-IDF + LightGBM ensemble, were compared using both standard metrics and domain-specific practical metrics tailored to complaint routing. The best model, scaled to 4.3 million Russian-language reviews from the Prodoctorov.ru platform, achieved 92.9% Practical Accuracy—the proportion of reviews classified without critical medical–organizational misclassification errors (M ↔ O)—compared to 68.0% standard accuracy, which treats all errors equally. Critical errors were reduced to 1.4%, yielding 144,000 more correctly processed complaints than traditional methods (TF-IDF + Logistic Regression). Analysis of the scaled data revealed the following: 46.1% M (medical), 21.0% O (organizational), and 32.9% C (combined) reviews; medical ratings were highest (4.75 vs. 4.59 for organizational, p < 0.001); combined reviews were longest (802 characters); zero-star reviews comprised 3.8% of feedback, with organizational complaints dominating (38.2%) among extreme negatives; and average ratings rose by 1.24 points over 14 years. This hybrid approach yields expert-comparable corpora, automates 93% of feedback processing, ensures correct complaint routing, and contributes to healthcare sustainability by reducing administrative burden, accelerating resolution, and enabling data-driven quality management without proportional increases in human resources. All analyses were conducted on Russian-language patient reviews.

1. Introduction

1.1. Context and Significance of the Study

The digitalization of healthcare has led to the emergence of a fundamentally new data source for managing the quality of medical care—patient reviews on specialized online platforms. Every year, patients leave millions of comments about the work of doctors and clinics, forming datasets of big-data scale. On the Prodoctorov.ru platform alone, over 4.3 million reviews have accumulated over the period 2011–2025, and the total volume of Russian-language medical reviews amounts to tens of millions.
These data contain critical information for identifying problems in the healthcare system. However, their potential remains unrealized for two reasons. First, the manual processing of millions of reviews is infeasible due to limited human resources. Second, existing automatic text analysis systems (sentiment analysis, topic modeling) do not solve the key management task—complaint routing by areas of responsibility.
For effective quality management, it is fundamentally important to distinguish three types of problems described in reviews. M (medical) complaints relate to the professional competence of doctors: diagnosis, treatment, prescriptions, medical errors, and procedure outcomes. These problems fall within the responsibility of the medical director and require clinical analysis. O (organizational) complaints concern service and administrative aspects: appointment scheduling, waiting times, service costs, staff politeness, and clinic cleanliness. These problems fall within the responsibility of administrative personnel and can be solved by organizational methods. C (combined) reviews are those in which patients simultaneously evaluate both medical and organizational aspects, with neither dominating. Such reviews are particularly valuable, as they contain the most complete information about the patient’s interaction with the healthcare system.
The consequences of misrouting are significant. When automatic processing mixes these problem types, complaints about medical errors reach administrators while complaints about queues reach medical directors: clinical issues go unaddressed, systemic service problems remain unresolved, neither group receives relevant information for decision-making, and patients remain without responses to their appeals.

1.2. Literature Review and Previous Work

Electronic Word-of-Mouth (eWOM) and Physician Rating Websites. The fundamental basis of research in the field of online review analysis is the concept of electronic word-of-mouth (eWOM)—informal communications directed at consumers through Internet technologies and concerning the experience of using or characteristics of specific products, services, or their providers [1]. eWOM, especially in the form of online reviews, has become the most influential source of information for consumer decision-making [2,3].
In healthcare, this phenomenon has become widespread through Physician Rating Websites (PRW). According to 2018 data, more than 52% of doctors in Germany had personal pages on such platforms [4]. Online portals provide the opportunity to evaluate not only individual doctors but also larger entities such as hospitals [5,6]. The structure of reviews may differ significantly depending on the country and context: in some countries, reviews about hospitals and the overall experience predominate rather than about specific doctors and their actions [7,8].
Researchers note a high concentration of positive reviews and high ratings of doctors in healthcare [9,10,11,12,13,14]. However, at the beginning of the COVID-19 pandemic, the proportion of negative reviews on online forums temporarily increased [15]. The predominance of positive feedback may serve as an incentive to raise prices for goods and services to maximize profits [16], which prompts unscrupulous companies to manipulate reviews and ratings using fake reviews and rating distortion [17,18,19]. A particular threat is the use of artificial intelligence to generate reviews practically indistinguishable from those written by real users [20,21].
Factors Influencing Review Content and Sentiment. The main factors increasing the likelihood of a doctor receiving a positive review are their friendliness and communicative behavior [22]. Shah et al. [23] divide such factors into those dependent on the medical institution (hospital environment, location, parking availability, medical protocol, etc.) and those related to the doctor’s actions (knowledge, competence, attitude, etc.).
Personal characteristics of doctors, such as gender, age, and specialty, may also influence patient ratings [24,25,26,27,28]. According to studies based on German and American data, higher ratings predominate among female doctors [24,25], and obstetrician–gynecologists and young doctors also demonstrate higher scores [28,29]. Patient characteristics also influence the rating distribution. For example, according to Emmert and Meier [24], older patients tend to rate doctors higher than younger ones. Insurance policy status significantly affects feedback sentiment [24,30]. Some studies show that negative reviews predominate among patients from rural areas [31]. Other works focus on the characteristics of online service users, noting that certain user traits are associated with the frequency of PRW use [32,33].
A number of studies use both numerical ratings and comment texts as data [34]. In particular, [34] identified factors influencing more positive ratings related both to the doctor’s characteristics and other factors beyond the doctor’s control.
Debate on the Connection Between Online Ratings and Real Quality of Care. Researchers have found that online reviews of doctors often do not reflect actual healthcare outcomes [35,36,37]. Consequently, reviews and online ratings in healthcare are less useful and effective compared to other industries [38,39]. However, some studies, on the contrary, have revealed a direct correlation between user online ratings and clinical outcomes [40,41,42,43].
This debate underscores the need for more nuanced analytical tools that go beyond simple sentiment analysis and account for the substantive aspects of reviews.
Application of NLP and Machine Learning for Patient Review Analysis. The transition from traditional surveys to the automatic analysis of texts generated by patients on social media and review platforms represents significant progress in healthcare quality assessment [44,45]. This approach allows for a more objective and representative picture of patient attitudes through the use of large independent samples.
As demonstrated by a systematic review [46], NLP and machine learning have become essential tools for processing unstructured textual data from patient feedback. The review, summarizing 19 studies, shows that the most common approaches are supervised learning (often using support vector machines and naive Bayes classifiers) and unsupervised learning (topic modeling).
This approach is successfully applied to analyze data from standardized surveys. In [47], based on Press Ganey data, a system was developed and validated for automatic topic classification and sentiment analysis of patient comments, achieving an F-measure of 0.74. In another work [48], based on feedback from Press Ganey surveys using NLP and neural networks, key dissatisfaction factors were identified, such as climate control in wards, noise, discharge delays, and frequent blood draws.
A significant volume of research uses arrays of doctor review texts as an empirical base [49,50,51,52]. Researchers have found that physician rating services can complement information provided by traditional patient satisfaction surveys and contribute to better patient understanding of healthcare quality [45,53,54].
In the context of Russian healthcare, this line of research is still emerging. In [55], the authors emphasize the need for algorithmization and the automation of processing large arrays of unstructured patient review texts to build independent quality assessments and demonstrate the effectiveness of classical machine learning methods for sentiment analysis on Russian-language data.
Modern Trends: Large Language Models and New Challenges. A modern and rapidly developing direction is the application of Large Language Models (LLMs) in healthcare. As noted by [56], LLMs have the potential to transform doctor–patient interactions, including the automation of administrative tasks, semantic adaptation of complex medical information for patients, and the structuring of clinical data.
However, despite high technological potential, current applications of LLMs are mainly focused on individual interaction tasks and clinical decision support, while their use for the systemic analysis of feedback to identify structural problems at the level of medical specialties and quality management remains unexplored.
Previous Work by the Authors. Over the past three years, we have conducted a series of studies devoted to analyzing patient reviews and developing methods for their automatic classification.
In [57], a hybrid method was developed for classifying Russian-language reviews about medical institutions using various neural network architectures (GRU, LSTM, and CNN). On an array of 60 thousand reviews, the best result was shown by the GRU-based architecture (val_accuracy = 0.927). This study demonstrated the fundamental feasibility of automatically distinguishing medical and organizational aspects in review texts.
In [58], a database of 18,492 negative reviews (one-star ratings) from the infodoctor.ru platform for the period 2012–2023 was published, covering 16 major Russian cities. This database became the foundation for subsequent research and confirmed the presence of systematic differences in the structure of complaints across cities and patient groups.
In [59], the problem of binary classification of negative reviews into M and O was solved using Logistic Regression (accuracy = 88.5%). Analysis of 18,680 negative reviews revealed a significant structural shift in Moscow (where, since 2021, medical complaints have begun to predominate over organizational ones), as well as stable differences in the structure of complaints between groups of medical specialties.
Based on an analysis of 18,492 negative reviews conducted by the authors [60], technical specialties (surgeons, dental surgeons) are significantly more often associated with complaints about medical aspects (39.8% vs. 29.3% in the group of communicative specialties, p < 0.001), while communicative specialties (gynecologists, obstetricians) predominate in complaints about organizational aspects (45.7% vs. 35.0%, p < 0.001).
Gaps in the Current Literature. The conducted analysis allows for identifying key gaps in existing research. According to a recent review study [61], analyzing 52 works published up to February 2024, sentiment analysis, topic modeling, and text classification are the three dominant approaches to analyzing unstructured patient feedback. However, the review identifies two critically important gaps:
  • The connection between findings obtained through NLP and traditional healthcare quality indicators remains limited.
  • There is little evidence that these approaches are actually used in clinical practice to influence healthcare worker behavior or managerial decision-making [61].
This finding aligns with earlier observations: although patient feedback can lead to changes in doctor behavior, this process is not automatic and critically depends on contextual factors, with narrative comments playing a key role in feedback acceptance [62]. A systematic review of interventions based on patient feedback [63] also confirms that comprehensive interventions targeting both individual healthcare worker behavior and organizational processes are most effective, underscoring the need for analytical tools capable of identifying precisely structural problems.
Beyond patient experience feedback, machine learning has also been applied to patient classification in other clinical domains. A systematic review by Ruksakulpiwat et al. [64] examined ML-based classification systems for stroke patients, identifying 44 clinical features (primarily age, gender, and glucose levels) and 15 algorithms, with support vector machines (SVMs) and Random Forest being most commonly used. While focused on clinical classification rather than patient experience feedback, this review highlights the broader applicability of ML for patient stratification and the diversity of approaches depending on input data characteristics—findings that align with our observation that the choice of algorithm depends on the data source and task definition.

1.3. Limitations of Previous Work and Bridge to the Present Study

Despite the results obtained, previous research (both ours and international) has several limitations that prevent the use of developed approaches for production deployment at the scale of real data flows.
The first limitation relates to the focus exclusively on negative reviews in our previous works [58,59,60]. However, as data from the Prodoctorov.ru platform show, the proportion of high ratings (4–5 stars) is 89.7%, while the proportion of low ratings is only 6.2%. Ignoring positive and, especially, mixed-sentiment reviews leads to the loss of more than 90% of available information.
The second limitation concerns the completeness of classification. In previous works [59,60], C (combined) was present in the methodology but was not the focus of analysis. However, as preliminary analysis shows, combined reviews constitute a substantial proportion (about 30%) and have independent value.
The third limitation is the labor intensity and high cost of expert annotation. In previous studies [57,58,59,60], annotated corpora were created exclusively by experts, which limited their volume (18–60 thousand reviews). Training modern models, especially those based on transformer architectures, requires substantially larger arrays of annotated data.
The fourth limitation is the absence of metrics accounting for the different costs of errors in the medical context. Standard metrics (accuracy, F1) treat all errors as equivalent, which does not correspond to the real business needs of medical organizations.
The fifth limitation is the lack of validation of developed models on real data scales. Previous studies [46,49,55,57,59] remained at the proof-of-concept level and did not demonstrate model performance in conditions approximating a production environment.

1.4. The Present Study: Aim and Objectives

The aim of this study is to develop and validate a three-class classification system for patient reviews (M, O, and C) suitable for production deployment in medical organizations and scaling to million-sized data arrays.
To achieve this aim, the following objectives are addressed:
  • Develop a hybrid method for creating an annotated corpus, combining expert validation and Large Language Models (LLMs) with iterative prompt engineering, enabling annotation scaling while maintaining quality comparable to expert annotation.
  • Introduce and justify practical metrics (Practical Accuracy, Practical F1) that account for the hierarchical structure of errors and reflect the real business value of models in the medical context.
  • Conduct a comparative analysis of eight machine learning architectures (from Logistic Regression to Stacking Ensemble based on BERT, TF-IDF, and LightGBM) on a balanced dataset of 22,417 reviews.
  • Scale the best model to an array of 4.3 million real reviews from the Prodoctorov.ru platform, with assessments of the class distribution and calculation of the business implementation effect.
  • Extract substantive insights from the analysis of the million-sized array, including the relationship between review type and rating/length, temporal dynamics, geographical distribution normalized by population, and analysis by medical specialties.

1.5. Scientific Novelty and Contribution of the Work

This study makes the following contributions to the development of patient review analysis methods and their practical application in healthcare:
  • Methodological contribution: A hybrid “expert + LLM” approach is proposed and validated for creating large annotated corpora, using iterative prompt engineering (15 versions) and a modified Cohen’s kappa with hierarchical error weighting. It is shown that an LLM can achieve a level of agreement with experts (κ_mod = 0.745) comparable to inter-expert agreement, allowing its use for scaling annotation.
  • Metric contribution: Practical metrics (Practical Accuracy, Practical F1) are developed and justified as an application-specific adaptation of cost-sensitive evaluation, accounting for the different costs of errors in the medical context and allowing for the evaluation of models from the perspective of their real business value, responding to the challenge formulated in [61].
  • Empirical contribution: The most comprehensive comparison to date of eight architectures (from simple linear models to complex ensembles) on a unified balanced dataset with evaluation by both standard and practical metrics is conducted.
  • Applied contribution: Successful scaling of a patient review classification model to an array of 4.3 million records is demonstrated for the first time, with a quantitative assessment of business effect, directly addressing the gap identified in the literature about the lack of evidence for implementing NLP approaches in real practice [61].
  • Analytical contribution: Based on the analysis of the million-sized array, stable patterns are identified that have independent value for understanding patient behavior and satisfaction dynamics.

1.6. Structure of the Article

Further exposition is organized as follows. Section 2 describes the materials and methods of the study: data sources, sampling procedures, iterative prompt engineering and validation of LLM classification, creation of the final training dataset, compared machine learning architectures, evaluation metrics, and experimental design. Section 3 presents the results: LLM validation, comparative analysis of models, analysis of data distribution influence, scaling to 4 million reviews, and substantive insights. Section 4 contains discussion of the obtained results, their interpretation, comparison with previous works, assessment of practical significance, and limitations of the study. Section 5 formulates the main conclusions. Appendix A provides the full text of the classification manual, version v15.

2. Materials and Methods

The methodology consists of four stages: (1) creation of a large, annotated corpus using a hybrid expert + LLM approach; (2) iterative prompt engineering with expert validation; (3) training and comparison of eight machine learning architectures using both standard and domain-specific metrics; and (4) scaling to 4.3 million reviews with production-ready deployment considerations.

2.1. Data Sources and Sampling Principles

The study is built on data from the two largest Russian platforms aggregating medical reviews: Infodoctor and Prodoctorov.ru. The choice of these sources is determined by their representativeness (combined coverage—millions of reviews), data availability for research analysis, and different collection periods, allowing for assessment of the robustness of the developed models to temporal changes.
Infodoctor Platform (2012–2023): Initial Samples. Data from the Infodoctor platform served as the basis for developing and validating the classification system. To ensure representativeness and account for the influence of sentiment on classification, three specialized samples were formed.
The negative sample was formed based on the archive of expertly annotated data from previous studies [57,58], focused on reviews of medical institutions in major Russian cities. The sample exclusively included reviews that underwent a validation procedure with high inter-expert agreement (Cohen’s kappa ≥ 0.75) and had unambiguous assignment to one of three categories: M, O, or C problems. The volume of this high-quality annotated sample was 7417 reviews.
The mixed-sentiment sample was formed using an original two-level methodology aimed at selecting reviews with ambiguous or contradictory evaluations. At the first level, rating filtering was applied: reviews with a rating of 2–4 stars on the platform’s 5-point scale were selected. This range indicates the absence of an extremely negative or unconditionally positive impression, increasing the probability of both positive and negative aspects being present in the text. At the second level, linguistic filtering was performed using a set of regular expressions developed to identify characteristic linguistic patterns of mixed evaluations. These patterns included: basic contrastive constructions (“but”, “however”, and “on the other hand”); structured oppositions (“on the one hand… on the other hand…”, “good doctor, but…”); medical contextual phrases (“treatment effective but expensive”, “diagnosis correct but had to wait long”); and qualified evaluations (“the only drawback”, “there is a small nuance”, and “overall good, but…”). A review was included in the sample if at least one of the specified patterns was detected. The final volume of the mixed-sentiment sample was 7500 reviews.
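For illustration, a minimal sketch of the second-level linguistic filter is given below. The study applied Russian-language regular expressions; the English-gloss patterns, function name, and rating threshold shown here are illustrative approximations rather than the exact expressions used.
```python
import re

# Illustrative English-gloss approximations of the mixed-evaluation patterns;
# the actual filter used Russian-language regular expressions.
MIXED_PATTERNS = [
    r"\bbut\b", r"\bhowever\b", r"\bon the other hand\b",
    r"on the one hand.*on the other hand",
    r"good doctor,? but",
    r"treatment (was )?effective,? but",
    r"the only drawback", r"overall good,? but",
]
MIXED_RE = re.compile("|".join(MIXED_PATTERNS), flags=re.IGNORECASE | re.DOTALL)

def is_mixed_candidate(text: str, rating: float) -> bool:
    """Level 1: rating between 2 and 4 stars; Level 2: at least one contrastive pattern."""
    return 2 <= rating <= 4 and bool(MIXED_RE.search(text))
```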
The positive sample was created according to a direct and reproducible criterion. It exclusively included reviews that received the maximum user rating—5 stars. To ensure sample purity, cases where the high rating contradicted the clearly negative content of the text (e.g., reviews of the type “everything is terrible, but I give 5”) were manually excluded. The volume of the positive sample was also 7500 reviews.
Prodoctorov.ru Platform (2011–2025): Scaling Array. To test the practical applicability of the developed models and assess their robustness when working with real data flows, an array of reviews from the Prodoctorov.ru platform was used, covering the period from 2011 to 2025. After downloading, merging, and cleaning the data (procedure described in detail in Section 2.3), the final dataset consisted of 4,340,691 unique patient reviews.
The key characteristics of the Prodoctorov.ru dataset are as follows:
  • Collection period: 2011–2025 (14 years);
  • Geographic coverage: 22 Russian cities;
  • Number of medical specialties: 224;
  • Rating range: 0 to 5 stars;
  • Review length: 3 to 10,792 characters.
Using two independent data sources serves two purposes: on the Infodoctor platform, methodology development and validation are conducted with controlled annotation quality; on the Prodoctorov.ru platform, scalability and robustness of the obtained solutions are tested.

2.2. Development and Validation of the LLM-Based Classification System

The general methodological idea was to create a technology enabling the classification of large datasets of medical reviews (including 4 million records from Prodoctorov.ru), with an accuracy comparable to expert human annotation but without the need for total manual annotation of each review. For this purpose, a multi-stage iterative process was implemented, combining human expertise and the scalability of Large Language Models (LLMs).
Iterative Prompt Engineering (v0 → v15). At the first stage, samples of mixed-sentiment and positive reviews (total volume 15,000 records) were shuffled, and a subsample of 1500 comments was randomly selected. This subsample was reserved exclusively for prompt development and validation and was not used in the subsequent stages of model training. The prompt development process was iterative: an initial prompt containing basic classification rules based on previous studies was applied to the first hundred comments; the obtained annotations were checked by the first author for adequacy, with reasoned comments formulated in cases of disagreement; during dialog calibration, part of the disputed cases were reclassified according to the author’s comments, while for others, the initial LLM position was retained after additional justification, allowing for the identification of ambiguous cases requiring rule clarification; based on feedback, the prompt was adjusted, leading to the next version. This cycle was repeated: checking annotations, discussing disputed cases, refining rules and the prompt, and applying it to a new portion of data. As a result, a series of prompts from v1 to v15 was obtained, along with corresponding carefully verified annotations for all 1500 comments. Notably, at the 11th iteration, no disputed cases were identified, indicating the achievement of stable classification rules, and prompt v11 did not require revision. Further iterations (v12–v15) were aimed at refining formulations and adding precedent examples but did not change the substantive rules.
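A compact sketch of this calibration loop is shown below. The call_llm wrapper is a hypothetical placeholder for whichever LLM interface was actually used, and the file and variable names are illustrative; the substantive step (expert review of disputed cases and prompt revision) remains manual.
```python
def call_llm(prompt: str, review: str) -> str:
    """Hypothetical wrapper around the LLM used for annotation.
    Expected to return one of the labels 'M', 'O', or 'C'."""
    raise NotImplementedError("plug in the concrete LLM client here")

def annotate_batch(prompt: str, reviews: list[str]) -> list[dict]:
    """Apply the current prompt version to a portion of the reserved reviews."""
    return [{"text": r, "label": call_llm(prompt, r)} for r in reviews]

# Example of one calibration step (v1 applied to the first hundred of the
# 1500 reserved reviews); file name and variable are placeholders:
# prompt_v1 = open("prompt_v1.txt", encoding="utf-8").read()
# first_portion = annotate_batch(prompt_v1, reserved_reviews[:100])
# Disputed cases are then reviewed by the expert, the prompt is revised,
# and the next version is applied to a new portion (v1 ... v15).
```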
Final Classification Manual (Version v15). As a result of 15 iterations of refinement based on expert feedback, a final classification manual was developed and used by the LLM for data annotation; the full text of the manual is presented in Appendix A.
Category M covers professional medical aspects and doctor competence. Key features include specific medical actions (diagnosis, diagnostic interpretation, treatment prescription, procedures, medications, and surgeries), assessment of the quality of medical procedures and their outcomes, specific treatment results, descriptions of medical manipulations, specific errors in diagnosis or treatment, and criticism of medical incompetence with examples. The critical distinction is that specific assessment of medical decisions and their justification falls under M.
Category O covers service, administrative, and communication aspects. Key features include appointment scheduling, waiting, schedule delays, administrative work, reception, service costs and finances, location convenience, accessibility, parking, space organization, comfort, cleanliness, communication aspects (politeness, attentiveness), personal characteristics without assessment of professionalism, general service quality, general thanks without specifics of medical actions, and general phrases about professionalism without medical specifics. The critical distinction is that only communication aspects without assessment of professionalism fall under O.
Category C covers a balanced combination of medical and organizational aspects. Key features include both topics being present approximately equally and being significant, difficulty in determining the dominant category, specific medical actions combined with substantial service aspects, medical prescriptions combined with financial issues, and criticism of treatment combined with communication problems.
The decision algorithm first determines the presence of specific medical actions (injections, tests, treatment, diagnosis, examinations, surgeries, and specific prescriptions), then determines the presence of general evaluations without medical specifics (only general phrases, only service aspects, and only personal characteristics), and finally assesses balance and significance: if specific medical actions dominate, assign M; if only general or service evaluations without medical specifics are present, assign O; and if specific medical actions appear with equivalent criticism or evaluation of the service, assign C. Classification priorities follow this order: specific medical actions → M; only general or service evaluations → O; and equivalent combination → C.
Validation Protocol with Two Experts. For objective assessment of the quality of classification performed by the LLM using the final prompt v15, a strict validation protocol was implemented, comprising three stages.
Stage 1: Independent Expert Annotation—two independent experts, experienced in medical text analysis and participating in previous studies [1], classified the same 1500 reviews used for prompt development, working independently without access to the LLM annotation or their colleague’s annotation.
Stage 2: Analysis of Inter-Expert Discrepancies and Arbitration—initial inter-expert agreement was κ_mod = 0.740 (modified Cohen’s kappa with hierarchical weighting, described below), indicating substantial but not complete agreement. This reflects the inherent subjectivity of the task, especially when distinguishing combined content cases. All cases of discrepancy between experts (768 out of 1500, i.e., 51.2%) were submitted for arbitration by the study author, resulting in a Gold Standard dataset (i.e., a reference dataset with resolved disagreements) subsequently used as the ground truth for assessing LLM quality.
Stage 3: Comparison of LLM with Gold Standard—the classification performed by the LLM using the final prompt v15 was compared with the obtained Gold Standard using a modified version of Cohen’s kappa for quantitative agreement assessment.
Modified Cohen’s Kappa with Hierarchical Weighting. For the quantitative assessment of agreement between classifications performed by experts and the LLM, a modified version of Cohen’s kappa coefficient was used. Unlike the standard metric, which treats all discrepancies between classes as equivalent errors, the modified version implements hierarchical weighting reflecting the different criticality of errors for the medical workflow. The weighting scheme is justified as follows: M ↔ O errors (critical)—discrepancies between purely medical and purely organizational categories are recognized as serious errors, indicating potential systemic failures in the algorithm; such an error leads to the incorrect routing of the appeal, making it useless for quality management, and the weight of such discrepancies is set to 1.0. Errors involving class C (acceptable)—discrepancy between a pure category (M or O) and combined (C) is considered less critical, as it reflects different degrees of “hedging” in classification; in both cases, information is preserved (medical or organizational aspect is present), albeit with some loss of precision, and the weight of such discrepancies is set to 0. Mathematically, modified kappa is calculated by the following formula:
κ_mod = (p0 − pe)/(1 − pe),
where p0 is the observed agreement with weights, and pe is the expected random agreement. In accordance with standard interpretations of kappa in medical research [65], a conservative validation threshold of κ_mod ≥ 0.75 was established, corresponding to the level of “substantial” or “excellent” agreement, ensuring an acceptable proportion of critical errors (M ↔ O) of less than 6%.
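A minimal implementation of the modified kappa, assuming the disagreement weights described above (1.0 for M ↔ O confusion, 0.0 for exact matches and for discrepancies involving class C), could look as follows; this is a sketch reconstructed from the stated formula, not the authors’ code.
```python
import numpy as np

LABELS = ["M", "O", "C"]

# Disagreement weights: 1.0 for critical M <-> O confusion,
# 0.0 for exact matches and for discrepancies involving class C.
W = np.array([
    [0.0, 1.0, 0.0],   # reference M vs. model M, O, C
    [1.0, 0.0, 0.0],   # reference O
    [0.0, 0.0, 0.0],   # reference C
])

def modified_kappa(y_ref, y_model, labels=LABELS, weights=W):
    """kappa_mod = (p0 - pe) / (1 - pe) with hierarchical disagreement weights."""
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(y_ref)
    conf = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_ref, y_model):
        conf[idx[t], idx[p]] += 1
    observed = conf / n                                              # joint distribution
    expected = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / n**2   # chance distribution
    p0 = 1.0 - (weights * observed).sum()   # weighted observed agreement
    pe = 1.0 - (weights * expected).sum()   # weighted expected agreement
    return (p0 - pe) / (1.0 - pe)
```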
LLM Classification Validation. Agreement of classification performed by the LLM (prompt v15) with the Gold Standard was κ_mod = 0.745, which practically reaches the established validation threshold of 0.75 (deviation of less than 0.005 is within statistical error). The proportion of critical errors (mutual substitutions M ↔ O) in the LLM classification was 5.3%, corresponding to a Practical Accuracy value of 94.7%. The detailed confusion matrix is presented in Table 1.
The obtained result demonstrates that the LLM achieved a level of agreement with the reference annotation comparable to inter-expert agreement (inter-expert κ_mod = 0.740 before arbitration). This allows for recognizing the automatic classification performed by the LLM using prompt v15 as valid and suitable for scaling to large data arrays.

2.3. Formation of the Final Training Dataset

The high level of agreement between the LLM and Gold Standard allowed for the use of prompt v15 for classifying the remaining 13,500 comments from the combined mixed-sentiment and positive sample (15,000 original minus 1500 used for validation). Thus, a hybrid approach was implemented for forming a large, annotated corpus, combining expert annotation for the most complex (negative) part of the data with validated LLM annotation for the mixed-sentiment and positive parts. The final training dataset for the comparative analysis of machine learning models was compiled from three samples: 7417 expert-annotated negative reviews (1-star ratings), balanced by specialty and city, from the archive of previous studies, plus 7500 mixed-sentiment and 7500 positive reviews classified using the validated prompt v15 (of which 1500 were used for validation and were included in the final dataset as part of the annotated data). After concatenating the three samples, we removed duplicate texts (completely identical reviews, likely automatic or copied messages) and randomly shuffled the data to eliminate possible order effects associated with the original grouping by samples. As a result, a final corpus of 22,417 unique patient reviews was obtained with a balanced distribution across categories:
  • M: 7662 reviews (34.2%);
  • O: 7839 reviews (35.0%);
  • C: 6916 reviews (30.8%).
This approach allowed us to overcome a key limitation associated with the high cost and labor intensity of complete expert annotation and creating a large dataset necessary for training modern ML architectures. The LLM, showing agreement with the Gold Standard at a level not inferior to agreement between experts, served as an effective and scalable tool for augmenting annotated data.
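The corpus assembly described above reduces to concatenation, deduplication, and shuffling; a minimal pandas sketch is given below. File and column names are illustrative assumptions, not the authors’ actual data layout.
```python
import pandas as pd

# Illustrative file names; each file is assumed to contain 'text' and 'label' columns.
negative_expert = pd.read_csv("negative_expert_7417.csv")   # expert-annotated 1-star reviews
mixed_llm = pd.read_csv("mixed_llm_7500.csv")               # LLM-annotated mixed-sentiment reviews
positive_llm = pd.read_csv("positive_llm_7500.csv")         # LLM-annotated positive reviews

corpus = pd.concat([negative_expert, mixed_llm, positive_llm], ignore_index=True)
corpus = corpus.drop_duplicates(subset="text")               # drop fully identical review texts
corpus = corpus.sample(frac=1, random_state=42).reset_index(drop=True)  # remove order effects
print(corpus["label"].value_counts(normalize=True))          # expect roughly M 0.34, O 0.35, C 0.31
```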

2.4. Machine Learning Models and Experimental Setup

For comparative analysis, eight machine learning architectures were selected, representing different approaches to text classification: from simple linear models to complex transformer-based ensembles. The choice of architectures is determined by their prevalence in NLP tasks, availability of computational resources, and representativeness for different model families. The key characteristics of each architecture are summarized in Table 2.
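As an illustration of the hybrid family, the sketch below concatenates mean-pooled BERT embeddings with TF-IDF features and feeds the joint representation to LightGBM. The checkpoint name, feature sizes, and hyperparameters are illustrative assumptions and do not reproduce the exact configurations summarized in Table 2.
```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMClassifier

CHECKPOINT = "DeepPavlov/rubert-base-cased"   # assumed Russian-language BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
bert = AutoModel.from_pretrained(CHECKPOINT).eval()

def bert_embed(texts, batch_size=32, max_length=256):
    """Mean-pooled BERT embeddings for a list of review texts."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                            max_length=max_length, return_tensors="pt")
            hidden = bert(**enc).last_hidden_state               # (B, T, H)
            mask = enc["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
            chunks.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(chunks)

def fit_hybrid(train_texts, train_labels):
    """Concatenate semantic (BERT) and frequency (TF-IDF) features, then train LightGBM."""
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    x_train = np.hstack([bert_embed(train_texts), tfidf.fit_transform(train_texts).toarray()])
    clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
    clf.fit(x_train, train_labels)
    return tfidf, clf

def predict_hybrid(tfidf, clf, texts):
    x = np.hstack([bert_embed(texts), tfidf.transform(texts).toarray()])
    return clf.predict(x)
```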
Evaluation Metrics. Unlike standard text classification tasks where all errors are considered equivalent, different types of errors have different costs in medical practice. Based on the analysis of business processes in medical organizations, specialized practical metrics were developed alongside standard metrics for comparability with previous studies.
Standard metrics include standard accuracy (proportion of correctly classified reviews across all three classes, with all errors considered equal) and Standard F1 (macro)—the harmonic mean of precision and recall averaged over three classes.
Practical metrics account for error hierarchy: Practical Accuracy (PA) is the main indicator of production readiness of the model, calculated as the proportion of reviews classified without critical errors:
PA = 1 − N_critical/N_total,
where N_critical is the number of critical errors (incorrect classification between M and O), and N_total is the total number of reviews. For interpretation, PA = 0.929 means that 92.9% of complaints will be automatically directed to the correct department of the medical organization, and only 7.1% will require manual checking.
The cost matrix underlying Practical Accuracy assigns critical weight 1.0 to M ↔ O misclassifications (as these result in incorrect complaint routing) and weight 0.0 to all errors involving class C (as these still route to either medical or administrative departments without causing misdirection).
Practical F1 is an indicator of classification strategy, reflecting the model’s tendency to assign reviews to the combined class, calculated as F1 for class C. Low Practical F1 (<0.1) indicates a conservative strategy—the model minimizes the risk of critical errors, preferring to assign controversial cases to pure classes; high Practical F1 (>0.2) indicates an active strategy—the model uses class C more often, hedging in controversial cases.
Robustness Metrics include the Coefficient of Variation (CV), defined as the ratio of the standard deviation of Practical Accuracy to its mean across cross-validation folds, characterizing model stability when the composition of training data changes. CV < 0.01 indicates excellent robustness; CV 0.01–0.02 indicates very good; CV 0.02–0.05 indicates acceptable; and CV > 0.05 indicates low robustness (unsuitable for production).
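The practical metrics defined above can be computed directly from predicted and reference labels; the sketch below follows those definitions (implementation details such as the standard-deviation convention are assumptions).
```python
import numpy as np
from sklearn.metrics import f1_score

def practical_accuracy(y_true, y_pred):
    """Share of reviews without critical M <-> O routing errors (PA = 1 - N_critical / N_total)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    critical = ((y_true == "M") & (y_pred == "O")) | ((y_true == "O") & (y_pred == "M"))
    return 1.0 - critical.mean()

def practical_f1(y_true, y_pred):
    """F1 score for the combined class C (indicator of classification strategy)."""
    return f1_score(y_true, y_pred, labels=["C"], average="macro")

def coefficient_of_variation(pa_per_fold):
    """Robustness: standard deviation of Practical Accuracy across folds divided by its mean."""
    pa = np.asarray(pa_per_fold, dtype=float)
    return pa.std(ddof=1) / pa.mean()
```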
Experimental Design. To ensure the methodological purity of comparison, the combined sample of 22,417 patient reviews described in Section 2.3 was used. The evaluation procedure employed 5-fold stratified cross-validation, where the split preserved the proportion of each class in training and test folds. For each model on each fold, all metrics (standard accuracy, Standard F1, Practical Accuracy, and Practical F1) were calculated. Final values are presented as mean ± standard error of the mean over five folds. Additionally, the model training time on one fold (in seconds) was recorded to assess computational efficiency. Technical conditions were as follows: experiments were conducted on a server with NVIDIA Tesla T4 GPU (NVIDIA Corporation, Santa Clara, CA, USA; 16 GB VRAM), 4 vCPU, and 32 GB RAM. Model implementations used HuggingFace Transformers (version 4.36.0) for BERT, scikit-learn (version 1.3.0) for linear models, MLP, and Random Forest, and LightGBM (version 4.1.0; Microsoft library). Experiment code is available upon request for reproducibility of results.
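For reproducibility, a minimal version of the evaluation protocol for one architecture (the TF-IDF + Logistic Regression baseline) might look as follows; the texts and labels arrays, vectorizer settings, and random seed are illustrative assumptions.
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def critical_error_rate(y_true, y_pred):
    """Proportion of M <-> O confusions (equals 1 - Practical Accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (((y_true == "M") & (y_pred == "O")) |
            ((y_true == "O") & (y_pred == "M"))).mean()

# texts: np.array of review strings; labels: np.array of 'M'/'O'/'C' tags (assumed inputs)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pa_scores = []
for train_idx, test_idx in skf.split(texts, labels):
    model = make_pipeline(TfidfVectorizer(max_features=20000),
                          LogisticRegression(max_iter=1000))
    model.fit(texts[train_idx], labels[train_idx])
    preds = model.predict(texts[test_idx])
    pa_scores.append(1.0 - critical_error_rate(labels[test_idx], preds))

print(f"Practical Accuracy: {np.mean(pa_scores):.3f} +/- {np.std(pa_scores, ddof=1):.3f}")
print(f"CV (robustness): {np.std(pa_scores, ddof=1) / np.mean(pa_scores):.3f}")
```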

3. Results

This section presents the main results of the study in three parts. First, the validation of the LLM-based classification system is reported, including inter-expert agreement, agreement between LLM and the Gold Standard, and detailed error analysis (Section 3.1). Second, a comparative analysis of eight machine learning architectures is presented, evaluating both standard and practical metrics, model robustness, and computational efficiency (Section 3.2). Third, the scaling of the best-performing model to 4.3 million reviews from the Prodoctorov.ru platform is described, along with substantive insights into patient behavior, temporal dynamics, and geographical and specialty-specific patterns (Section 3.3 and Section 3.4).

3.1. Results of LLM Classification Validation

In accordance with the validation protocol described in Section 2.2, an assessment was made of the quality of classification performed by the Large Language Model (LLM) using the final prompt v15. The Gold Standard dataset consisted of 1500 reviews annotated with arbitration participation.
Inter-expert agreement before arbitration was κ_mod = 0.740, indicating substantial but not complete agreement and reflecting the inherent subjectivity of the task, especially in distinguishing combined content cases. After arbitration of all 768 cases of discrepancy between experts, the final Gold Standard was formed.
The high raw disagreement rate (51.2%) reflects the inherent subjectivity in distinguishing combined reviews (class C) from pure classes. Weighted κ_mod = 0.740 indicates substantial agreement after accounting for the hierarchical error structure, as most disagreements involve class C (acceptable errors) rather than critical M ↔ O confusion.
Agreement of the classification performed by the LLM with the Gold Standard was κ_mod = 0.745. This value practically reaches the established validation threshold of 0.75 (deviation of less than 0.005 is within statistical error) and is comparable to the level of inter-expert agreement.
The detailed confusion matrix is presented in Table 1. Analysis of the error matrix shows that critical errors (M ↔ O) account for 79 cases (5.3% of the sample): 28 reviews were misclassified as M instead of O and 51 as O instead of M. A total of 512 reviews included acceptable errors involving class C (34.1%), comprising 67 (M → C), 164 (O → C), 220 (C → M), and 61 (C → O). Exact matches account for 909 reviews (60.6%): 476 (M-M), 192 (O-O), and 241 (C-C). Based on the error matrix, LLM Practical Accuracy was calculated as PA_LLM = 1 − 79/1500 = 0.947 = 94.7%. The obtained result demonstrates that the LLM achieved a level of agreement with the reference annotation comparable to inter-expert agreement, validating the automatic classification performed by the LLM using prompt v15 as being suitable for scaling to large data arrays.
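The reported breakdown can be verified arithmetically from the stated counts; the short check below reconstructs the confusion matrix (rows: Gold Standard; columns: LLM prediction) and recovers PA_LLM.
```python
import numpy as np

# Confusion matrix reconstructed from the counts reported in the text:
# rows = Gold Standard (M, O, C), columns = LLM prediction (M, O, C)
conf = np.array([
    [476,  51,  67],   # true M: 476 exact, 51 -> O (critical), 67 -> C (acceptable)
    [ 28, 192, 164],   # true O: 28 -> M (critical), 192 exact, 164 -> C (acceptable)
    [220,  61, 241],   # true C: 220 -> M, 61 -> O (both acceptable), 241 exact
])
total = conf.sum()                   # 1500 reviews
critical = conf[0, 1] + conf[1, 0]   # M <-> O confusions: 51 + 28 = 79
pa_llm = 1 - critical / total        # 1 - 79/1500 = 0.947
print(total, critical, round(pa_llm, 3))
```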

3.2. Comparative Analysis of Models on the Combined Sample

On the final training dataset of 22,417 reviews (class distribution: M 34.2%, O 35.0%, and C 30.8%), a comparison of eight machine learning architectures was conducted using 5-fold stratified cross-validation. The results of the comparative analysis are presented in Table 3.
Advantages of Hybrid Approaches. The top three models—Stacking Ensemble, BERT + TF-IDF Hybrid, and BERT + LR—show a substantially higher Practical Accuracy (0.915–0.929) than traditional methods (0.813–0.893). The difference in Practical Accuracy between the best and worst models is 11.6 percentage points, confirming the importance of contextual semantic analysis provided by BERT architectures. Particularly noteworthy is the comparison between the Stacking Ensemble and the baseline TF-IDF + LR model, which represents a typical industry standard for text classification tasks. The ensemble’s advantage of 3.6 percentage points in Practical Accuracy translates into substantial business effects when scaled to million-sized arrays, as demonstrated in Section 3.3.
Trade-off Between Accuracy and Classification Strategy. A pronounced inverse relationship is observed between Practical Accuracy and Practical F1 (Spearman correlation coefficient ρ = −0.93, p < 0.01). Models with high Practical Accuracy (Stacking Ensemble: 0.929; BERT + TF-IDF Hybrid: 0.927) show low Practical F1 (0.101 and 0.103, respectively), indicating a conservative strategy where the model minimizes the risk of critical M ↔ O errors by using class C relatively infrequently. Conversely, models with lower Practical Accuracy (Random Forest: 0.822; Logistic Regression: 0.813) demonstrate high Practical F1 (0.261 and 0.249, respectively), indicating an active strategy that more often assigns reviews to class C in an attempt to hedge in controversial cases but at the cost of more critical M ↔ O errors. From the perspective of practical implementation in medical organizations, the conservative strategy is preferable, as it minimizes the most expensive type of error—incorrect routing of complaints between medical and administrative departments.
Model Robustness. The Coefficient of Variation (CV) measures model stability when training data composition changes—a critical parameter for production systems handling constantly updating review streams. All models except MLP demonstrate high robustness (CV < 0.02). The BERT + TF-IDF Hybrid model shows maximum robustness (CV = 0.009), the only model in the sample falling into the “excellent” category. The Stacking Ensemble also demonstrates high robustness (CV = 0.011, “very good” category). The low robustness of MLP (CV = 0.062, “low” category) makes this model unsuitable for production deployment despite an acceptable average Practical Accuracy (0.868), as its classification quality substantially depends on the specific composition of the training sample.
Computational Efficiency. The TF-IDF + Logistic Regression model is of particular interest for scenarios with limited computational resources or real-time processing requirements. It provides acceptable Practical Accuracy (0.893) with a record-low training time of 7.1 s, which is 35 times faster than hybrid approaches and 20 times faster than BERT + LR. Inference time for this model is also minimal (less than 0.001 s per review), making it suitable for high-load systems. LightGBM and Random Forest demonstrate training speeds comparable to TF-IDF + LR (8.7 and 12.4 s respectively) but are inferior in Practical Accuracy (0.831 and 0.822 vs. 0.893), making them less preferable when choice is available.
Detailed Analysis of the Best Model (Stacking Ensemble). For deep understanding of the best model’s behavior, a detailed analysis of the Stacking Ensemble was conducted. A Practical Accuracy of 0.929 means that 92.9% of complaints are automatically directed correctly, corresponding to 20,822 correctly classified reviews out of 22,417 (without critical errors). Critical errors amount to 317 cases (1.4%), approximately 14 critical errors per 1000 processed reviews. The standard accuracy of the model is 0.680; importantly, this metric does not reflect the real value of the model in the medical context, as it combines critical errors (1.4%) and acceptable errors involving class C (30.6% = 100% − 68.0% − 1.4%), which have different business costs. This is precisely why Practical Accuracy, rather than standard accuracy, serves as the main indicator of production readiness.
Analysis of Data Distribution Influence. Preliminary analysis revealed a fundamental dependence between data structure and classification effectiveness. For quantitative assessment of this effect, a correlation analysis was conducted between the proportion of combined reviews (category C) and the main classification metrics on various data subsamples. The Pearson correlation coefficient between the proportion of class C and Practical Accuracy was r = 0.914 (p < 0.05), meaning that 83.5% of the variability in Practical Accuracy (r² = 0.835) is explained by differences in the category distribution between samples. This pattern is explained by the feature of the introduced practical metrics: errors involving class C are not considered critical, so samples with a high proportion of combined reviews demonstrate inflated Practical Accuracy values with the same classification quality. The observation confirms the necessity of using balanced datasets for valid model evaluation and justifies the choice of the combined sample (with a class C proportion of 30.8%) as the most representative. Using unbalanced samples (e.g., with C < 10% or >50%) would lead to systematic bias in model quality estimates.

3.3. Scaling on Prodoctorov.ru Data (4,340,691 Reviews)

To test the practical applicability of the proposed approach and assess its robustness when working with real data scales, the Stacking Ensemble was applied to a dataset of 4,340,691 reviews from the Prodoctorov.ru platform.
Distribution of Predicted Classes. The predicted distribution (M 46.1%, O 21.0%, and C 32.9%) differs from the training sample distribution (M 34.2%, O 35.0%, and C 30.8%). This difference is explained by platform specifics: Infodoctor, used for training, contains a larger proportion of organizational reviews, while Prodoctorov.ru predominantly accumulates medical content. Importantly, the proportion of combined reviews (class C) remains consistently high (about 30%) on both platforms, confirming the necessity of their separate consideration. Table 4 presents the detailed class distribution on the Prodoctorov.ru platform.
Assessment of Scaling Practical Effect. The key result from a practical perspective is that the Stacking Ensemble, demonstrating a Practical Accuracy of 0.929, surpasses traditional methods (using TF-IDF + Logistic Regression with Practical Accuracy 0.893) by 3.6 percentage points. On an array of 4,340,691 reviews, this advantage translates into 156,000 additional reviews that will be automatically directed to the correct department thanks to using a more accurate model. Regarding critical error reduction, the Stacking Ensemble’s critical error proportion of 1.4% compared to TF-IDF + LR’s 2.1% yields a difference of 0.7 percentage points, meaning 30,000 reviews avoid incorrect routing thanks to using the ensemble. With an inference speed of approximately 0.01 s per review on GPU, processing 4.3 million reviews takes about 12 h—acceptable for batch processing. For scenarios requiring real-time processing, an adaptive architecture can be used, as described in Section 4.3.

3.4. Substantive Insights from the Analysis of 4 Million Reviews

Scaling the model to an array of 4.3 million reviews allowed for assessing not only the technical effectiveness of the approach but also extracting substantive insights with independent value for understanding patient behavior and satisfaction dynamics.
Relationship Between Review Type and Rating. Analysis of average ratings by class reveals that patients rate the medical competence of doctors significantly higher (mean rating 4.75) than the organizational aspects of clinic operations (4.59), with the difference being statistically significant (Welch’s t-test, p < 0.001). Combined reviews receive the lowest ratings (4.55), which may reflect the complex nature of dissatisfaction affecting both spheres. The median rating of 5.0 across all three classes, despite varying mean values, reflects the strong positive bias in patient feedback: over 50% of reviews in each category receive the highest possible score. The lower means for organizational (4.59) and combined (4.55) reviews compared to medical (4.75) indicate that these categories have longer “tails” of low ratings, pulling the average downward while the median remains at the maximum. This pattern confirms that, while most patients rate doctors positively regardless of complaint type, negative experiences—particularly those involving organizational or combined issues—tend to be more extreme when they occur. Detailed statistics on average ratings by class are presented in Table 5.
Review Length by Class. Combined reviews are the longest and presumably the most informative, with an average length of 802 characters. This is 2.5 times longer than organizational reviews (320 characters) and 1.6 times longer than medical reviews (513 characters). Patients who describe both medical and organizational aspects tend to provide detailed narratives, making such reviews particularly valuable for analysis and quality management. Organizational reviews are the most concise, often limited to brief complaints or thanks without detail. Table 6 provides detailed statistics on review length by class.
Temporal Dynamics (2011–2025). Analysis of temporal dynamics reveals exponential growth in the review count from 465 in 2011 to 886,551 in 2024, indicating that the culture of leaving reviews in Russian medicine has been actively developing over the last five years, with particularly intensive growth observed in 2021–2024. Table 7 presents the dynamics of the review count by year.
Over 14 years (2011–2025), the average ratings increased by 1.24 points for medical reviews (from 3.64 to 4.88), by 1.00 points for organizational reviews, and by 1.16 points for combined reviews, as detailed in Table 8. This may indicate systemic improvement in the quality of medical services, the increased qualification of doctors, development of medical technologies, or changes in patient behavior patterns. The gap between ratings of medical and organizational reviews decreased from 0.17 points in 2011 to 0.07 points in 2025, suggesting the alignment of quality across all service aspects.
Geographical Distribution. For a correct comparison of cities with different population sizes, review counts were normalized per 1 million inhabitants, using the average population for 2019–2023 according to the Federal State Statistics Service (Rosstat). After normalization, Krasnodar and Rostov-on-Don lead with 445,000 and 275,000 reviews per 1 million inhabitants, respectively, 4–7 times higher than in Moscow (65,500). This may reflect both the higher digital activity of patients in these regions and features of Prodoctorov.ru platform promotion. Table 9 shows the top five cities by normalized review count.
Despite significant differences in absolute indicators and normalized values, the distribution of reviews by M-O-C classes demonstrates remarkable stability across all five cities: M reviews 44.6–49.0%, O reviews 18.1–23.5%, and C reviews 31.1–37.2%. The absence of significant regional differences in review structure points to the universality of the developed classification and indicates that patients throughout Russia evaluate medical services similarly, regardless of city of residence.
Analysis by Medical Specialties. Analysis by medical specialty reveals that surgical specialties dominate in medical reviews, with dental surgeons showing the highest proportion of M complaints (54.0% M). Observational specialties receive more O reviews, with obstetricians (32.7% O) and gynecologists (30.7% O) leading, likely due to the prolonged nature of observation, where organizational aspects accumulate and become significant. The proportion of C reviews is maximal for specialties with a high communicative component, such as dermatologists (35.1%), ENT specialists (35.0%), and neurologists (35.0%), indicating that patients interacting with these specialists tend to evaluate both the consultation quality and organizational moments. Table 10 presents the top 10 specialties by number of reviews.
Overall Rating Distribution. Analysis of the overall rating distribution reveals a strong positive bias in reviews, with the vast majority being high ratings. The platform uses a 0–5 scale with fractional values possible (e.g., 4.4, 3.2). For analysis, we categorized ratings into three groups: high (4–5), including both endpoints; zero (0), representing extreme dissatisfaction; and medium (all other ratings), encompassing values from above 0 to below 4.
As shown in Table 11, high ratings (4–5) constitute 3,895,449 reviews (89.7% of the total). Zero-star reviews account for 163,277 reviews (3.8%) and are examined separately below. The remaining 281,965 reviews (6.5%) fall into the medium category (ratings above 0 and below 4).
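For illustration, the three-way categorization described above can be expressed as a simple binning rule. The function below is a sketch; the names and the use of pandas are assumptions for illustration rather than the study's processing code.

```python
import pandas as pd

def rating_category(rating: float) -> str:
    """0 -> zero, (0, 4) -> medium, [4, 5] -> high, per the grouping used in Table 11."""
    if rating == 0:
        return "zero"
    return "high" if rating >= 4 else "medium"

ratings = pd.Series([5.0, 4.4, 3.2, 0.0])
print(ratings.map(rating_category).tolist())  # ['high', 'high', 'medium', 'zero']
```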
Notably, analysis of zero-star reviews revealed distinct patterns, where O complaints dominate (38.2%), followed by C (33.7%) and M (28.1%) complaints, suggesting that extreme dissatisfaction is more often triggered by service failures than by medical outcomes. The average length of zero-star combined reviews reaches 1013 characters—substantially exceeding even the general average for combined reviews (802 characters)—indicating that highly negative experiences prompt particularly detailed narratives.
The predominance of high ratings may reflect that patients are more likely to leave reviews after positive experiences, psychological barriers to posting negative feedback, or cultural factors. Medium ratings are relatively rare, indicating that patients tend toward polar evaluations—either “excellent” or “terrible”—with few expressing moderate opinions. For quality management tasks, zero-star reviews are of particular value as rare but strong signals of systemic problems requiring an immediate response. Their share of 3.8% represents over 163,000 extreme dissatisfaction signals—a substantial dataset for identifying critical issues in healthcare delivery.

4. Discussion

4.1. Interpretation of Main Results

This study developed and validated a three-class classification system for patient reviews (M, O, and C) suitable for production deployment and scaling to million-sized data arrays. The obtained results require substantive interpretation in the context of both the technical aspects of machine learning and practical tasks of quality management in healthcare, as well as in light of the current state of research summarized in recent reviews [46,62].
Why Hybrid Models Win. The top three models—Stacking Ensemble, BERT + TF-IDF Hybrid, and BERT + LR—significantly outperform traditional approaches in Practical Accuracy (91.5–92.9% vs. 81.3–89.3%). This superiority stems from the ability of transformer architectures (BERT) to capture semantic nuances critical for distinguishing medical and organizational aspects. Frequency-based models (TF-IDF + LR) rely on keywords and often misclassify reviews containing contrastive constructions or balanced combinations of medical and organizational themes. BERT-based architectures, by contrast, account for context and can identify the combined nature of a review even in the presence of strong lexical markers of one class. This observation aligns with results in medical NLP [49] demonstrating the advantage of transformer models when working with complex semantic constructions in clinical texts. As noted in the systematic review by Khanbhai et al. [46], the choice of method should correspond to the nature of the data source: supervised learning is effective for structured surveys, while topic modeling suits unstructured social media. Our study demonstrates that, for mixed sources such as review platforms, an approach combining semantic and frequency features is optimal.
This advantage is further illustrated by error analysis. For instance, the review “The doctor prescribed treatment but was very rude”—which contains both a medical action (prescription) and an organizational issue (rudeness)—was correctly classified as combined (C) by both the Stacking Ensemble and the BERT + TF-IDF Hybrid, while TF-IDF + Logistic Regression classified it as organizational (O) because of the strong lexical marker “rude”. Similarly, the review “Diagnosed correctly but had to wait long” was classified as combined (C) by BERT-based models, whereas TF-IDF + LR defaulted to medical (M) based on the keyword “diagnosed”. These cases illustrate how transformer architectures capture nuanced combinations that frequency-based models miss.
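To make the hybrid feature construction concrete, the following is a minimal Python sketch of the BERT + TF-IDF Hybrid summarized in Table 2: mean-pooled rubert-tiny2 embeddings are concatenated with unigram–bigram TF-IDF features and passed to a Logistic Regression classifier. The checkpoint identifier and the toy training examples (taken from the error analysis above and Table A1) are illustrative assumptions and do not reproduce the exact production pipeline.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
bert = AutoModel.from_pretrained("cointegrated/rubert-tiny2").eval()

def embed(texts):
    """Masked mean over token embeddings from the last hidden layer."""
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state          # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

texts = ["The doctor prescribed treatment but was very rude",
         "Diagnosed correctly but had to wait long",
         "Prescribed to take tests",
         "Polite, pleasant"]
labels = ["C", "C", "M", "O"]

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)
X = np.hstack([embed(texts), tfidf.fit_transform(texts).toarray()])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```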
Trade-off Between Practical Accuracy and Practical F1: Conservative vs. Active Strategy. The inverse relationship between Practical Accuracy and Practical F1 (ρ = −0.93) has important implications for choosing a classification strategy in medical organizations. The conservative strategy (Stacking Ensemble, BERT + TF-IDF Hybrid) is characterized by high Practical Accuracy (92.7–92.9%) and low Practical F1 (0.10–0.12), minimizing critical M ↔ O errors (1.4–1.5%). In effect, the model prefers to hedge and assign an ambiguous case to a pure class (M or O), even at the cost of losing information about the combined nature of the review. From a business-process perspective, such a strategy is preferable: the complaint reaches either the medical director or the administrator and, in any case, does not remain unanswered. The active strategy (Random Forest, Logistic Regression) shows lower Practical Accuracy (81.3–82.2%) and higher Practical F1 (0.25–0.26), with an increased proportion of critical errors (3.6–3.7%). Models in this group actively use class C, attempting to reflect the combined nature of the review accurately. The price of this fidelity, however, is an increase in critical M ↔ O errors, which in a production environment leads to incorrect complaint routing. For medical organizations whose main goal is the correct routing of appeals, the conservative strategy is preferable.
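Both practical metrics can be computed directly from predicted labels. The sketch below is a reconstruction based on the definitions in the Table 3 footnotes (Practical Accuracy as the share of reviews without a critical M ↔ O confusion, Practical F1 as the F1 score for class C); the exact implementation may differ from the code used in the study.

```python
import numpy as np
from sklearn.metrics import f1_score

def practical_accuracy(y_true, y_pred):
    """Share of reviews classified without a critical M <-> O confusion."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    critical = ((y_true == "M") & (y_pred == "O")) | ((y_true == "O") & (y_pred == "M"))
    return 1.0 - critical.mean()

def practical_f1(y_true, y_pred):
    """F1 score for class C, reflecting how actively the model uses the combined class."""
    return f1_score(y_true, y_pred, labels=["C"], average="macro", zero_division=0)

y_true = ["M", "O", "C", "M", "C"]
y_pred = ["M", "M", "M", "M", "C"]
print(practical_accuracy(y_true, y_pred))  # 0.8 -- one critical O -> M error
print(practical_f1(y_true, y_pred))        # ~0.67 -- class C used conservatively
```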
Significance of Class C: Not “Noise” but a Source of Additional Information. The results show that combined reviews (class C) are not a nuisance but an independent and valuable category, constituting about one-third of all reviews (30.8% in the training sample, 32.9% on Prodoctorov.ru). Analysis of class C characteristics revealed maximum informativeness (average length of 802 characters vs. 513 for M and 320 for O), intermediate ratings (average 4.55 between M 4.75 and O 4.59), and high proportion in specialties with a communicative component (up to 35% for dermatologists, ENT specialists, and neurologists). These observations align with the concept of “multidimensional satisfaction” of patients [41], according to which the assessment of medical service quality consists of several independent dimensions (technical quality, interpersonal interaction, accessibility, etc.). Combined reviews precisely reflect this multidimensionality and contain the most complete information for quality management, confirming the thesis of Baines et al. [62] about the key role of narrative comments in feedback acceptance. The correlation between the proportion of class C and Practical Accuracy (r = 0.914) also confirms that ignoring combined reviews leads to systematic bias in model quality estimates. This is an important methodological conclusion justifying the necessity of three-class classification.
Zero-Star Reviews as Additional Validation. The analysis of zero-star reviews provided additional validation of our classification approach. Among 163,277 extremely negative reviews (3.8% of all feedback), O complaints dominated (38.2%), followed by C (33.7%) and M (28.1%). This distribution differs markedly from the overall pattern (M 46.1%, O 21.0%, and C 32.9%), confirming that service failures are more potent triggers of extreme dissatisfaction than medical outcomes. The average length of zero-star combined reviews reached 1013 characters—substantially exceeding the overall combined-review average of 802 characters—indicating that highly negative experiences prompt particularly detailed narratives. These findings underscore the importance of distinguishing between complaint types for effective quality management.

4.2. Comparison with Previous Studies

Comparison with Our Previous Work. The present study develops the cycle of work performed by the authors in 2024–2026 and overcomes the limitations inherent in previous approaches. Table 12 provides a comprehensive comparison of key characteristics across our studies.
As shown, the current work expands the classification to three classes (M, O, and C), covers all review sentiments, introduces a hybrid expert + LLM annotation methodology, scales to 4.3 million reviews, and quantifies the business effect—addressing each limitation identified in prior research. The introduction of Practical Accuracy as a business-oriented metric addresses the need for evaluation tools that reflect real operational value rather than purely statistical performance, while the systematic comparison of eight architectures on a unified dataset provides practical guidance for model selection in production environments.
Comparison with External Studies. Our findings align with Khanbhai et al. [46], who identified machine learning and NLP as key tools for analyzing patient feedback. While direct comparison with earlier studies is limited by differences in task definition (three-class M, O, and C vs. topic classification), language (Russian vs. English), and evaluation metrics, our standard accuracy of 68.0% and Standard F1 of 67.9% provide a baseline for cross-study comparison. The Practical Accuracy of 92.9% complements these metrics by reflecting operational values specific to complaint routing.
Unlike the study by Wallace et al. [49], in which analysis was limited to topic modeling and identifying latent factors in online doctor reviews, our approach provides a directly management-relevant classification suitable for complaint routing. Where Wallace et al. [49] answered the question “What do patients write about?”, we answer the question “Where should the complaint be directed?”
The work of Nawab et al. [48] demonstrated the effectiveness of NLP for extracting information from structured Press Ganey surveys, identifying dissatisfaction factors such as climate control and noise in wards. Our study extends this approach, showing that similarly valuable insights can be extracted from unstructured reviews on open platforms and at a scale two orders of magnitude larger than traditional surveys.
In the context of Russian healthcare, the work of Russkikh et al. [55] demonstrated the applicability of classical machine learning methods for sentiment analysis on Russian-language data. Our study develops this approach, moving from sentiment analysis to substantive classification and using modern transformer architectures.
Particular attention should be paid to comparison with the conclusions of the recent review by Feizollah et al. [61], which analyzed 52 works published up to February 2024. The review's authors identify two critical gaps: a limited connection between NLP findings and traditional quality indicators, and extremely little evidence of the real-world use of these approaches in clinical practice. The present study directly responds to these challenges by proposing metrics (Practical Accuracy) tied to the business processes of medical organizations and by demonstrating scaling on real data with a quantitative assessment of the business effect: 156,000 additional correctly processed complaints compared to traditional methods when applied to 4.3 million reviews.

4.3. Practical Significance and Business Value

Calculation for a Typical Medical Organization. To assess the business value of the developed approach, consider a hypothetical large medical organization (clinic network) receiving 50,000 patient reviews per year; such volumes are typical for multidisciplinary medical centers in major million-plus cities. With a system based on the Stacking Ensemble (Practical Accuracy 92.9%, critical error proportion 1.4%), 46,450 reviews (92.9%) are automatically directed to the appropriate departments, while only 700 reviews (1.4%) require manual checking due to critical errors. This frees up significant staff time: employees previously engaged in manually sorting complaints can be redirected to more complex tasks directly related to improving the quality of medical care. Operational savings are achieved through reduced labor costs for primary feedback processing; under realistic estimates of working-time costs and the established practice of outsourcing such tasks, the total economic effect can amount to millions of rubles per year for a large medical organization. Added to this is the harder-to-quantify but no less important effect of improved service quality: complaints reach responsible persons faster, problems are solved more quickly, and patient satisfaction increases.
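The headline figures above follow from a simple calculation; the snippet below reproduces it for the hypothetical 50,000-review organization using the Stacking Ensemble metrics from Table 3.

```python
annual_reviews = 50_000
practical_accuracy = 0.929    # Stacking Ensemble, Table 3
critical_error_rate = 0.014

auto_routed = round(annual_reviews * practical_accuracy)     # 46,450 reviews routed automatically
manual_check = round(annual_reviews * critical_error_rate)   # 700 reviews needing manual checking
print(auto_routed, manual_check)
```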
Adaptive Architecture for Production Deployment. For organizations requiring a balance between high accuracy and processing speed, an adaptive architecture is proposed that combines the advantages of fast and accurate models. Incoming reviews are first directed to a fast model (TF-IDF + LR), which processes most cases quickly. Reviews for which the fast model shows low classification confidence (ambiguous cases) are additionally analyzed by the accurate model (Stacking Ensemble). The confidence threshold is tuned to provide the required balance between processing speed and final accuracy. With appropriate threshold tuning, most reviews (simple cases) are processed by the fast model with minimal delay, while complex cases (a minority) are directed to accurate analysis, preserving high classification quality. Average processing time is reduced to a level ensuring sufficient throughput for any realistic data flow (millions of reviews per day), while the final Practical Accuracy remains close to that of the accurate model (the reduction does not exceed a few percentage points). The adaptive architecture can be implemented either on GPU servers (for maximum performance) or in a hybrid configuration (CPU for fast models, GPU for accurate models), allowing flexible adaptation to the infrastructure of a specific organization. An additional advantage is the ability to monitor the model-confidence distribution in real time to detect data drift and trigger timely retraining.
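The cascade logic can be expressed compactly. The following sketch is illustrative: both models are assumed to be sklearn-style pipelines accepting raw texts and exposing predict_proba, and the 0.80 threshold is a hypothetical value to be tuned per deployment.

```python
import numpy as np

def adaptive_classify(texts, fast_model, accurate_model, threshold=0.80):
    """Route all reviews through the fast model; escalate low-confidence cases."""
    proba = fast_model.predict_proba(texts)                   # cheap pass over everything
    labels = fast_model.classes_[proba.argmax(axis=1)].astype(object)
    hard = np.where(proba.max(axis=1) < threshold)[0]         # low-confidence (ambiguous) cases
    if hard.size:
        labels[hard] = accurate_model.predict([texts[i] for i in hard])
    return labels, 1.0 - hard.size / len(texts)               # share resolved by the fast model
```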
Recommendations for Architecture Selection. Depending on the requirements of a specific medical organization, the following recommendations are offered. For maximum safety and accuracy, the Stacking Ensemble (92.9% Practical Accuracy) is recommended for large clinics where the cost of critical errors is highest. For the optimal balance of accuracy and robustness, the BERT + TF-IDF Hybrid (92.7% Practical Accuracy) provides maximum robustness (CV = 0.009). For a balance of speed and quality on CPU infrastructure, TF-IDF + Logistic Regression (89.3% Practical Accuracy) with a 7.1 s training time is suitable for real-time processing. For maximum throughput, the adaptive architecture achieves approximately 90.4% Practical Accuracy with a processing capacity of 270 reviews per second. For rapid prototyping, the baseline Logistic Regression (81.3% Practical Accuracy) with a 5.2 s training time allows for quick solution testing.

4.4. Limitations of the Study

Methodological Limitations. The study has several methodological limitations that define the applicability of the results. First, language specificity: the models are developed and validated exclusively on Russian-language data; application to other languages will require adaptation and retraining. Second, subject domain: the results are valid for patient reviews about doctors and clinics; application to other types of medical texts requires additional validation. Third, time frame: the data cover 2011–2025; model robustness to changes in language patterns requires regular monitoring. Fourth, platform specificity: training was conducted on Infodoctor platform data and scaling on Prodoctorov.ru data. A further limitation is the absence of an external test set from Prodoctorov.ru with expert-verified labels; while the model was applied to 4.3 million reviews from this platform, this constitutes inference rather than validation.
Technical Limitations. Technical limitations include computational requirements: hybrid models require a GPU for training and inference. The training data volume of 22,417 reviews is sufficient, but periodic retraining is recommended for production deployment. Class imbalance in real-world data differs from the training sample, requiring ongoing monitoring of model performance.
Limitations Related to LLM Usage. To address reproducibility concerns, the full text of prompt v15 is provided in Appendix A. Additionally, using the LLM to annotate 15,000 reviews required significant computational resources, though this one-time investment is justified by the resulting high-quality corpus. Code is available from the corresponding author upon reasonable request.
Additional Analyses. While we report the standard accuracy, macro-F1, and practical metrics, we have not systematically evaluated ROC-AUC for three-class classification or tested for concept drift across the 14-year study period. Both represent important directions for future work, particularly for organizations seeking to validate model generalizability over time and across imbalanced class distributions.
The identified limitations reflect the general problem formulated in the recent review by Feizollah et al. [61]: despite the active development of NLP methods, their integration into real clinical practice and connection with measurable quality indicators remains limited. The present study represents a step towards bridging this gap; however, additional research is needed for the full closure of the management loop.

4.5. Directions for Future Research

Building on the conclusions of Feizollah et al. [61] about the need to bridge the gap between NLP analysis and clinical practice and considering the obtained results, the following promising directions can be identified.
First, multimodal analysis integrating review texts with metadata (patient demographics, doctor characteristics, and time stamps) could further improve classification accuracy.
Second, detailed analyses of combined reviews to identify typical combinations of medical and organizational problems and sentiment patterns within them would provide deeper insights into patient experiences.
Third, predictive modeling to forecast spikes in negative reviews and identify early signals of systemic problems could enable proactive quality management.
Fourth, adaptation of the developed methodology to other languages and national healthcare systems would test its generalizability.
Fifth, integration with CRM systems to close the quality management loop—from review classification to tracking intervention outcomes—would transform analytical insights into actionable improvements in healthcare delivery.
Sixth, systematic evaluation of computational efficiency—including inference times and carbon footprints across different hardware configurations—represents an important avenue for future work, particularly for organizations deploying such systems at scale in resource-constrained environments.

5. Conclusions

During the study, a three-class classification system for patient reviews (M, O, and C) was developed, validated, and made suitable for production deployment and scaling to million-sized data arrays. Below are the main conclusions, grouped by the areas of the obtained results.

5.1. Methodological Conclusions

  • The hybrid “expert + LLM” approach enables the creation of large, annotated corpora with quality comparable to expert annotation. We developed and validated an iterative prompt engineering method (15 versions), through which a Large Language Model learned to classify reviews based on expert feedback. Modified Cohen’s kappa with hierarchical error weighting (critical M ↔ O vs. acceptable errors involving C) was κ_mod = 0.745, nearly reaching the established validation threshold of 0.75 and comparable to inter-expert agreement (κ_mod = 0.740); an illustrative computation of such a weighted kappa is sketched after this list. The proportion of critical LLM errors was 5.3%, corresponding to a Practical Accuracy of 94.7%. This result directly addresses the need for scalable patient feedback analysis methods that maintain expert-level quality, as highlighted in recent reviews [46,61].
  • Three-class classification (M, O, and C) is necessary to accurately reflect the real structure of patient reviews. Analysis of 4.3 million reviews showed that combined reviews (class C) constitute 32.9% of all appeals and have independent value. They are 2.5 times longer than organizational reviews (802 vs. 320 characters) and contain the most complete information about patients’ experiences. Ignoring class C leads to the loss of one-third of the available information and, as shown by correlation analysis (r = 0.914), to systematic bias in model quality estimates.
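For readers wishing to reproduce a hierarchically weighted agreement statistic, the sketch below implements a generic weighted Cohen's kappa. The disagreement weights (full weight for M ↔ O confusions, half weight for errors involving C) are illustrative assumptions; the weighting used for κ_mod in this study may differ.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["M", "O", "C"]
# Disagreement weights: rows = true label, columns = assigned label.
# 1.0 for critical M <-> O confusions, 0.5 for errors involving C (assumed values).
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

def kappa_mod(y_true, y_pred):
    """Weighted Cohen's kappa: 1 - sum(W*O) / sum(W*E)."""
    O = confusion_matrix(y_true, y_pred, labels=LABELS).astype(float)
    O /= O.sum()                                   # observed joint distribution
    E = np.outer(O.sum(axis=1), O.sum(axis=0))     # expected agreement under independence
    return 1.0 - (W * O).sum() / (W * E).sum()
```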

5.2. Technical Conclusions

3. Hybrid BERT-based models significantly outperform traditional approaches in Practical Accuracy. Comparative analysis of eight architectures on a balanced dataset (22,417 reviews) revealed a consistent advantage of models using transformer semantic representations. The Stacking Ensemble achieved a Practical Accuracy of 92.9%, the BERT + TF-IDF Hybrid reached 92.7%, and the BERT + Logistic Regression attained 91.5%. The gap with traditional methods (TF-IDF + LR: 89.3%, Logistic Regression: 81.3%) amounts to up to 11.6 percentage points, confirming the critical importance of accounting for semantic nuances in distinguishing medical and organizational problems.
4. Critical errors (M ↔ O confusion) are reduced to an acceptable level of 1.4%. The best model (Stacking Ensemble) makes 317 critical errors out of 22,417 reviews (1.4%), corresponding to approximately 14 critical errors per 1000 processed reviews. For a medical organization receiving 50,000 reviews per year, this means only 700 cases per year require manual checking compared to 50,000 without automation.
5. We identified and justified a trade-off between conservative and active classification strategies. We found a stable inverse relationship between Practical Accuracy and Practical F1 (ρ = −0.93). Models with high Practical Accuracy implement a conservative strategy, minimizing critical errors at the cost of less frequent use of class C. For medical organizations where the cost of incorrect routing is maximal, the conservative strategy is preferable.
6. The BERT + TF-IDF Hybrid model demonstrates maximum robustness to data variations (CV = 0.009). The Stacking Ensemble also showed high robustness (CV = 0.011). The low robustness of MLP (CV = 0.062) makes this model unsuitable for production deployment.
7. The TF-IDF + Logistic Regression model provides an optimal balance of speed and quality for resource-constrained scenarios. With a Practical Accuracy of 89.3%, this model trains in 7.1 s (35 times faster than the hybrid approaches) and can run exclusively on CPU, making it suitable for real-time processing and organizations without GPU infrastructure.

5.3. Applied Conclusions

8. The developed system automates up to 93% of patient feedback processing. When implementing the Stacking Ensemble in a typical medical organization (50,000 reviews per year), 46,450 reviews (92.9%) are processed automatically and directed to the correct departments, while only 700 reviews (1.4%) require manual checking due to critical errors. Operational cost savings amount to approximately 2.3 million rubles per year.
9. Scaling to 4.3 million Prodoctorov.ru reviews confirmed the production readiness of the approach. Application of the Stacking Ensemble to real data demonstrated a stable class distribution (M 46.1%, O 21.0%, and C 32.9%) and an advantage over traditional methods of 156,000 additional correctly processed complaints (3.6 percentage points).
10. An adaptive architecture combining speed and accuracy is proposed for high-throughput scenarios. With appropriate confidence threshold tuning, such a system balances processing speed and Practical Accuracy, ensuring a throughput sufficient for any realistic data flow.

5.4. Analytical Conclusions (Insights from 4 Million Reviews)

11. M reviews are rated significantly higher than O reviews. The average rating of M reviews is 4.75 versus 4.59 for O reviews (p < 0.001). The proportion of high ratings (4–5) is 94.0% in M reviews and 80.7% in O reviews. Patients primarily value the professional competence of doctors rather than service aspects.
12. C reviews are the most informative. Their average length is 802 characters, 2.5 times longer than O reviews (320) and 1.6 times longer than M reviews (513). This confirms the thesis of Baines et al. [62] about the key role of narrative comments in feedback acceptance.
13. Zero-star reviews reveal distinct dissatisfaction patterns. Among 163,277 extremely negative reviews (3.8% of all feedback), O complaints dominate (38.2%), followed by C (33.7%) and M (28.1%). Zero-star C reviews average 1013 characters, 27% longer than the overall C-review average, demonstrating that extreme dissatisfaction generates particularly detailed patient narratives.
14. Over 14 years (2011–2025), average ratings increased by 1.24 points. The dynamics indicate a systemic improvement in patients’ perceptions of medical services. The gap between ratings of M and O aspects decreased from 0.17 to 0.07 points.
15. After normalization by population, Krasnodar and Rostov-on-Don lead in review activity. The class distribution remains stable across all cities (M 44–49%, O 18–24%, and C 31–37%), confirming the universality of the developed classification.
16. Surgical specialties dominate in M reviews, and observational specialties dominate in O reviews. The highest proportion of M reviews is among dental surgeons (54.0%); the lowest is among gynecologists (37.8%) and obstetricians (38.1%).

5.5. Recommendations for Implementation

17. Architecture choice should be determined by the requirements of the specific medical organization. For maximum safety and accuracy, the Stacking Ensemble (92.9% Practical Accuracy) is recommended; for the optimal balance of accuracy and robustness, the BERT + TF-IDF Hybrid (92.7%); for CPU-based real-time processing, TF-IDF + Logistic Regression (89.3%); and for rapid prototyping, the baseline Logistic Regression (81.3%).
18. For production deployment, regular (annual) model retraining is recommended, considering the dynamic nature of language and the changing patterns of reviews.

5.6. Final Reflections

The present study demonstrates that systematic analysis of patient feedback at scale is not merely a technical exercise but a foundational capability for modern, sustainable healthcare systems. By transforming millions of unstructured patient narratives into actionable intelligence—distinguishing medical from organizational complaints, identifying regional and specialty-specific patterns, and quantifying the business impact—we provide a blueprint for data-driven quality management that was previously unattainable through manual methods.
The methodological contributions extend beyond this specific application. The hybrid expert + LLM annotation approach offers a template for creating high-quality labeled datasets in resource-constrained domains. The practical metrics framework bridges technical model performance and operational business value. The adaptive architecture shows how sophisticated NLP systems can be deployed efficiently in production environments.
Addressing the challenges identified in the recent literature [61], this work provides concrete evidence of NLP integration into healthcare quality management with measurable outcomes: 93% automation of feedback processing, 156,000 additional correctly routed complaints, and the identification of critical patterns in extremely negative feedback. These results move beyond proof-of-concept to demonstrate real-world viability.

5.7. Limitations and Future Directions

The study has several limitations that shape its applicability and suggest directions for future work. Language specificity (Russian-language data only), subject domain focus (patient reviews about doctors and clinics), and temporal coverage (2011–2025) require adaptation for other contexts. Technical limitations include GPU requirements for hybrid models and the need for periodic retraining. LLM-related limitations include reproducibility concerns and computational costs, though the full prompt text in Appendix A mitigates the former.
Building on these foundations, future research should pursue several directions: multimodal analysis integrating text with patient and doctor metadata; detailed dissection of combined review structures; predictive modeling for the early warning of systemic issues; adaptation to other languages and healthcare systems; and—most critically—integration with CRM systems to close the quality management loop from complaint detection to intervention tracking.
In conclusion, this work establishes that hybrid human–AI approaches can transform patient feedback from an underutilized resource into a strategic asset for healthcare quality management. The demonstrated scale, accuracy, and business impact provide both methodological guidance and empirical evidence for organizations seeking to build sustainable, data-driven systems that respond effectively to patient needs while optimizing operational resources.

Author Contributions

Conceptualization, I.E.K. and A.V.K.; methodology, A.V.K.; software, A.V.K.; validation, I.E.K., A.V.K. and V.S.M.; formal analysis, A.V.K.; investigation, A.V.K.; resources, I.E.K.; data curation, A.V.K.; writing—original draft preparation, A.V.K.; writing—review and editing, I.E.K., A.V.K. and V.S.M.; visualization, A.V.K.; supervision, I.E.K.; project administration, I.E.K.; funding acquisition, I.E.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Non-commercial Foundation for the Advancement of Science and Education “INTELLECT” [Grant Number 1/GKMU-2024, dated 30 August 2024] for the research project “Classification of patient reviews from Russian clinics for managerial decisions in healthcare system improvement using machine learning methods,” headed by I.E. Kalabikhina. The research was also supported by the Faculty of Economics of Lomonosov Moscow State University as part of the research project “Population reproduction in socio-economic development” [No. 122041800047-9 (2017–2027)], headed by I.E. Kalabikhina.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article were derived from publicly available sources (infodoctor.ru, prodoctorov.ru) under their respective terms of service. The processed datasets, including the 22,417-review annotated corpus, and the code used for analysis are available from the corresponding author upon reasonable request. The full classification manual (version v15) is provided in Appendix A.

Acknowledgments

During the preparation of this manuscript, the authors used DeepSeek (version DeepSeek-V3) for prompt engineering iterations and for assistance with code debugging. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BERT: Bidirectional Encoder Representations from Transformers
C: Combined Content (both medical and organizational complaints)
CV: Coefficient of Variation
eWOM: Electronic Word of Mouth
κ_mod: Modified Cohen’s Kappa with Hierarchical Weighting
LLM: Large Language Model
LR: Logistic Regression
M: Medical Content (complaints about treatment/diagnosis)
ML: Machine Learning
NLP: Natural Language Processing
O: Organizational Content (complaints about service organization)
PA: Practical Accuracy
PRW: Physician Rating Websites
TF-IDF: Term Frequency–Inverse Document Frequency

Appendix A. Classification Manual for M, O, and C Reviews (Version v15)

This appendix presents the final version (v15) of the classification manual used for LLM prompt engineering and data annotation. The original manual was developed in Russian, as the study focuses on Russian-language patient reviews. The following is an English translation provided for the international readership. The translation has been carefully checked to preserve the precise meaning of the original classification rules and decision criteria.
Category Definitions
Category M (Medical)
Focus: Professional medical aspects and doctor competence.
Key features:
  • Specific medical actions: diagnosis, diagnostic interpretation, test analysis;
  • Prescription of treatment, procedures, medications, surgeries;
  • Assessment of quality of medical procedures and their outcomes;
  • Specific treatment results, effectiveness, medical outcomes;
  • Description of medical manipulations, examinations, operations;
  • Quality of conducted studies (ultrasound, MRI, X-ray, CT), with details;
  • Specific errors in diagnosis or treatment;
  • Criticism of medical incompetence, with examples;
  • Specific medical consultations, with description of actions.
Critical distinctions:
  • Specific assessment of medical decisions and their justification → M;
  • Criticism of medical competence with examples → M;
  • “Check moles for safety” → M (specific diagnosis);
  • “Recommendations and treatment” → M (specific actions);
  • “Consultation, examination, treatment” → M (specific actions).
Category O (Organizational)
Focus: Service, administrative, and communication aspects.
Key features:
  • Appointment scheduling, waiting times, schedule delays;
  • Administration work, reception, staff;
  • Service costs, finances, pricing policy;
  • Location convenience, accessibility, parking;
  • Space organization, comfort, cleanliness;
  • Communication aspects: politeness, attentiveness, talkativeness;
  • Personal characteristics without assessment of professionalism;
  • General service quality, service;
  • Staff attitude, clinic atmosphere;
  • General thanks without specifics of medical actions;
  • General phrases about professionalism without medical specifics;
  • Psychological comfort during communication;
  • Aesthetic results without medical assessment;
  • Criticism of personal characteristics without medical context.
Critical distinctions:
  • Only communication aspects without assessment of professionalism → O;
  • General assessment of service without medical specifics → O;
  • “God-given doctor, polite, attentive” → O (general phrases);
  • “I’ve been treated for many years” → O (without specifics);
  • “Not self-confident” → O (personal characteristic).
Category C (Combined)
Focus: Balanced combination of medical and organizational aspects.
Key features:
  • Both topics are present approximately equally and are significant;
  • Difficult to determine the dominant category;
  • Specific medical actions combined with substantial service aspects;
  • Specific medical actions + equivalent criticism/evaluation of service;
  • Medical prescriptions combined with financial issues;
  • Criticism of treatment combined with communication problems.
Decision Algorithm
Step 1. Determine the presence of specific medical actions: injections, tests, treatment, diagnosis, examinations, surgeries, specific prescriptions, medical procedures with description.
Step 2. Determine the presence of general evaluations without medical specifics: only general phrases about professionalism, only service aspects, only personal characteristics.
Step 3. Assess balance and significance:
  • If specific medical actions dominate → assign M
  • If only general or service evaluations without medical specifics are present → assign O
  • If specific medical actions appear with equivalent criticism or evaluation of service → assign C
Classification Priorities
  • Specific medical actions → M
  • Only general or service evaluations → O
  • Equivalent combination → C
Table A1. Typical cases and decisions for M, O, and C classification.
Case | Decision | Rationale
“Check moles for safety” | M | Specific medical diagnosis
“Recommendations and treatment from doctor” | M | Specific medical prescriptions
“Consultation, examination, treatment” | M | Specific medical actions
“Prescribed to take tests” | M | Specific prescriptions
“Incorrect temperature measurement” | M | Medical error
“God-given doctor, polite, attentive” | O | General phrases without medical specifics
“I’ve been treated for many years” | O | No specifics of medical actions
“Not self-confident” | O | Personal characteristic
“Polite, pleasant” | O | Communication aspects
“Prescribed tests but was nervous” | C | Medical prescriptions + communication
“Treatment is effective but expensive” | C | Medical outcome + financial issue
“Diagnosed correctly but had to wait long” | C | Medical competence + organizational problem
Critical Rules Summary
  • Unambiguously M: Contains specific descriptions of medical actions, diagnoses, treatments, prescriptions, or medical errors with concrete examples.
  • Unambiguously O: Contains only general praise, service comments, personal characteristics, or administrative issues without any specific medical content.
  • Unambiguously C: Contains both specific medical content AND significant organizational/service aspects in roughly equal measure.
Note: When in doubt between M and C, consider whether the organizational aspects are substantial enough to warrant separate attention. When in doubt between O and C, consider whether any specific medical actions are described—if yes, prefer C over O. The goal is to ensure that combined reviews—the most informative category—are correctly identified while maintaining practical routing accuracy.

References

  1. Litvin, S.W.; Goldsmith, R.E.; Pan, B. Electronic word-of-mouth in hospitality and tourism management. Tour. Manag. 2008, 29, 458–468. [Google Scholar] [CrossRef]
  2. Cantallops, A.S.; Salvi, F. New consumer behavior: A review of research on eWOM and hotels. Int. J. Hosp. Manag. 2014, 36, 41–51. [Google Scholar] [CrossRef]
  3. Ismagilova, E.; Dwivedi, Y.K.; Slade, E.; Williams, M.D. Electronic Word of Mouth (eWOM). In Electronic Word of Mouth (eWOM) in the Marketing Context: A State of the Art Analysis and Future Directions; Springer: Cham, Switzerland, 2017; pp. 17–30. [Google Scholar] [CrossRef]
  4. Emmert, M.; McLennan, S. One decade of online patient feedback: Longitudinal analysis of data from a German physician rating website. J. Med. Internet Res. 2021, 23, e24229. [Google Scholar] [CrossRef]
  5. Kleefstra, S.M.; Zandbelt, L.C.; Borghans, I.; de Haes, H.J.; Kool, R.B. Investigating the potential contribution of patient rating sites to hospital supervision: Exploratory results from an interview study in The Netherlands. J. Med. Internet Res. 2016, 18, e201. [Google Scholar] [CrossRef]
  6. Bardach, N.S.; Asteria-Peñaloza, R.; Boscardin, W.J.; Dudley, R.A. The relationship between commercial website ratings and traditional hospital performance measures in the USA. BMJ Qual. Saf. 2013, 22, 194–202. [Google Scholar] [CrossRef]
  7. Van de Belt, T.H.; Engelen, L.J.; Berben, S.A.; Teerenstra, S.; Samsom, M.; Schoonhoven, L. Internet and social media for health-related information and communication in health care: Preferences of the Dutch general population. J. Med. Internet Res. 2013, 15, e220. [Google Scholar] [CrossRef]
  8. Hao, H.; Zhang, K.; Wang, W.; Gao, G. A tale of two countries: International comparison of online doctor reviews between China and the United States. Int. J. Med. Inform. 2017, 99, 37–44. [Google Scholar] [CrossRef]
  9. Lantzy, S.; Anderson, D. Can consumers use online reviews to avoid unsuitable doctors? Evidence from RateMDs.com and the Federation of State Medical Boards. Decis. Sci. 2020, 51, 962–984. [Google Scholar] [CrossRef]
  10. Gilbert, K.; Hawkins, C.M.; Hughes, D.R.; Patel, K.; Gogia, N.; Sekhar, A.; Duszak, R., Jr. Physician Rating Websites: Do Radiologists Have Online Presence? J. Am. Coll. Radiol. 2015, 12, 867–871. [Google Scholar] [CrossRef]
  11. Okike, K.; Peter-Bibb, T.K.; Xie, K.C.; Okike, O.N. Association between physician online rating and quality of care. J. Med. Internet Res. 2016, 18, e324. [Google Scholar] [CrossRef]
  12. Mostaghimi, A.; Crotty, B.H.; Landon, B.E. The availability and nature of physician information on the internet. J. Gen. Intern. Med. 2010, 25, 1152–1156. [Google Scholar] [CrossRef]
  13. Lagu, T.; Hannon, N.S.; Rothberg, M.B.; Lindenauer, P.K. Patients’ evaluations of health care providers in the era of social networking: An analysis of physician-rating websites. J. Gen. Intern. Med. 2010, 25, 942–946. [Google Scholar] [CrossRef]
  14. López, A.; Detz, A.; Ratanawongsa, N.; Sarkar, U. What patients say about their doctors online: A qualitative content analysis. J. Gen. Intern. Med. 2012, 27, 685–692. [Google Scholar] [CrossRef]
  15. Shah, A.M.; Yan, X.; Qayyum, A.; Naqvi, R.A.; Shah, S.J. Mining topic and sentiment dynamics in physician rating websites during the early wave of the COVID-19 pandemic: Machine learning approach. Int. J. Med. Inform. 2021, 149, 104434. [Google Scholar] [CrossRef] [PubMed]
  16. Ghimire, B.; Shanaev, S.; Lin, Z. Effects of official versus online review ratings. Ann. Tour. Res. 2022, 92, 103247. [Google Scholar] [CrossRef]
  17. Xu, Y.; Xu, X. Rating deviation and manipulated reviews on the Internet—A multi-method study. Inf. Manag. 2023, 60, 103829. [Google Scholar] [CrossRef]
  18. Hu, N.; Bose, I.; Koh, N.S.; Liu, L. Manipulation of online reviews: An analysis of ratings, readability, and sentiments. Decis. Support Syst. 2012, 52, 674–684. [Google Scholar] [CrossRef]
  19. Luca, M.; Zervas, G. Fake it till you make it: Reputation, competition, and Yelp review fraud. Manag. Sci. 2016, 62, 3412–3427. [Google Scholar] [CrossRef]
  20. Namatherdhala, B.; Mazher, N.; Sriram, G.K. Artificial Intelligence in Product Management: Systematic review. Int. Res. J. Mod. Eng. Technol. Sci. 2022, 4, 2914–2917. [Google Scholar]
  21. Jabeur, S.B.; Ballouk, H.; Arfi, W.B.; Sahut, J.M. Artificial intelligence applications in fake review detection: Bibliometric analysis and future avenues for research. J. Bus. Res. 2023, 158, 113631. [Google Scholar] [CrossRef]
  22. Bidmon, S.; Elshiewy, O.; Terlutter, R.; Boztug, Y. What patients value in physicians: Analyzing drivers of patient satisfaction using physician-rating website data. J. Med. Internet Res. 2020, 22, e13830. [Google Scholar] [CrossRef]
  23. Shah, A.M.; Yan, X.; Tariq, S.; Ali, M. What patients like or dislike in physicians: Analyzing drivers of patient satisfaction and dissatisfaction using a digital topic modeling approach. Inf. Process. Manag. 2021, 58, 102516. [Google Scholar] [CrossRef]
  24. Emmert, M.; Meier, F. An analysis of online evaluations on a physician rating website: Evidence from a German public reporting instrument. J. Med. Internet Res. 2013, 15, e2655. [Google Scholar] [CrossRef]
  25. Nwachukwu, B.U.; Adjei, J.; Trehan, S.K.; Chang, B.; Amoo-Achampong, K.; Nguyen, J.T.; Taylor, S.A.; McCormick, F.; Ranawat, A.S. Rating a sports medicine surgeon’s “quality” in the modern era: An analysis of popular physician online rating websites. HSS J. 2016, 12, 272–277. [Google Scholar] [CrossRef] [PubMed]
  26. Obele, C.C.; Duszak, R., Jr.; Hawkins, C.M.; Rosenkrantz, A.B. What patients think about their interventional radiologists: Assessment using a leading physician ratings website. J. Am. Coll. Radiol. 2017, 14, 609–614. [Google Scholar] [CrossRef]
  27. Kapoor, N.; Haj-Mirzaian, A.; Yan, H.Z.; Wickner, P.; Giess, C.S.; Eappen, S.; Khorasani, R. Patient experience scores for radiologists: Comparison with nonradiologist physicians and changes after public posting in an institutional online provider directory. Am. J. Roentgenol. 2022, 219, 338–345. [Google Scholar] [CrossRef]
  28. Gao, G.G.; McCullough, J.S.; Agarwal, R.; Jha, A.K. A changing landscape of physician quality reporting: Analysis of patients’ online ratings of their physicians over a 5-year period. J. Med. Internet Res. 2012, 14, e38. [Google Scholar] [CrossRef]
  29. Emmert, M.; Meier, F.; Heider, A.K.; Dürr, C.; Sander, U. What do patients say about their physicians? An analysis of 3000 narrative comments posted on a German physician rating website. Health Policy 2014, 118, 66–73. [Google Scholar] [CrossRef] [PubMed]
  30. Emmert, M.; Meier, F.; Pisch, F.; Sander, U. Physician choice making and characteristics associated with using physician-rating websites: Cross-sectional study. J. Med. Internet Res. 2013, 15, e2702. [Google Scholar] [CrossRef]
  31. Rahim, A.I.A.; Ibrahim, M.I.; Musa, K.I.; Chua, S.L.; Yaacob, N.M. Patient satisfaction and hospital quality of care evaluation in malaysia using servqual and facebook. Healthcare 2021, 9, 1369. [Google Scholar] [CrossRef]
  32. Galizzi, M.M.; Miraldo, M.; Stavropoulou, C.; Desai, M.; Jayatunga, W.; Joshi, M.; Parikh, S. Who is more likely to use doctor-rating websites, and why? A cross-sectional study in London. BMJ Open 2012, 2, e001493. [Google Scholar] [CrossRef]
  33. Hanauer, D.A.; Zheng, K.; Singer, D.C.; Gebremariam, A.; Davis, M.M. Public awareness, perception, and use of online physician rating sites. JAMA 2014, 311, 734–735. [Google Scholar] [CrossRef]
  34. Lin, Y.; Hong, Y.A.; Henson, B.S.; Stevenson, R.D.; Hong, S.; Lyu, T.; Liang, C. Assessing patient experience and healthcare quality of dental care using patient online reviews in the United States: Mixed methods study. J. Med. Internet Res. 2020, 22, e18652. [Google Scholar] [CrossRef] [PubMed]
  35. Daskivich, T.J.; Houman, J.; Fuller, G.; Black, J.T.; Kim, H.L.; Spiegel, B. Online physician ratings fail to predict actual performance on measures of quality, value, and peer review. J. Am. Med. Inform. Assoc. 2018, 25, 401–407. [Google Scholar] [CrossRef] [PubMed]
  36. Gray, B.M.; Vandergrift, J.L.; Gao, G.G.; McCullough, J.S.; Lipner, R.S. Website ratings of physicians and their quality of care. JAMA Intern. Med. 2015, 175, 291–293. [Google Scholar] [CrossRef] [PubMed]
  37. Skrzypecki, J.; Przybek, J. Physician review portals do not favor highly cited US ophthalmologists. In Seminars in Ophthalmology; Taylor & Francis: Abingdon, UK, 2018; Volume 33, pp. 547–551. [Google Scholar] [CrossRef]
  38. Widmer, R.J.; Maurer, M.J.; Nayar, V.R.; Aase, L.A.; Wald, J.T.; Kotsenas, A.L.; Timimi, F.K.; Harper, C.M.; Pruthi, S. Online physician reviews do not reflect patient satisfaction survey responses. In Mayo Clinic Proceedings; Elsevier: Amsterdam, The Netherlands, 2018; Volume 93, pp. 453–457. [Google Scholar] [CrossRef]
  39. Saifee, D.H.; Bardhan, I.; Zheng, Z. Do Online Reviews of Physicians Reflect Healthcare Outcomes? In International Conference of Smart Health; Springer International Publishing: Cham, Switzerland, 2017; pp. 161–168. [Google Scholar] [CrossRef]
  40. Trehan, S.K.; Nguyen, J.T.; Marx, R.; Cross, M.B.; Pan, T.J.; Daluiski, A.; Lyman, S. Online patient ratings are not correlated with total knee replacement surgeon–specific outcomes. HSS J. 2018, 14, 177–180. [Google Scholar] [CrossRef]
  41. Doyle, C.; Lennox, L.; Bell, D. A systematic review of evidence on the links between patient experience and clinical safety and effectiveness. BMJ Open 2013, 3, e001570. [Google Scholar] [CrossRef]
  42. Okike, K.; Uhr, N.R.; Shin, S.Y.; Xie, K.C.; Kim, C.Y.; Funahashi, T.T.; Kanter, M.H. A comparison of online physician ratings and internal patient-submitted ratings from a large healthcare system. J. Gen. Intern. Med. 2019, 34, 2575–2579. [Google Scholar] [CrossRef]
  43. Lu, S.F.; Rui, H. Can we trust online physician ratings? Evidence from cardiac surgeons in Florida. Manag. Sci. 2018, 64, 2557–2573. [Google Scholar] [CrossRef]
  44. Greaves, F.; Ramirez-Cano, D.; Millett, C.; Darzi, A.; Donaldson, L. Harnessing the cloud of patient experience: Using social media to detect poor quality healthcare. BMJ Qual. Saf. 2013, 22, 251–255. [Google Scholar] [CrossRef]
  45. Ranard, B.L.; Werner, R.M.; Antanavicius, T.; Schwartz, H.A.; Smith, R.J.; Meisel, Z.F.; Asch, D.A.; Ungar, L.H.; Merchant, R.M. What can Yelp teach us about measuring hospital quality? Health Aff. (Proj. Hope) 2016, 35, 697–705. [Google Scholar] [CrossRef]
  46. Khanbhai, M.; Anyadi, P.; Symons, J.; Flott, K.; Darzi, A.; Mayer, E. Applying natural language processing and machine learning techniques to patient experience feedback: A systematic review. BMJ Health Care Inform. 2021, 28, e100262. [Google Scholar] [CrossRef] [PubMed]
  47. Doing-Harris, K.; Mowery, D.L.; Daniels, C.; Chapman, W.W.; Conway, M. Understanding Patient Satisfaction with Received Healthcare Services: A Natural Language Processing Approach. AMIA Annu. Symp. Proc. 2017, 2016, 524–533. [Google Scholar] [PubMed]
  48. Nawab, K.; Ramsey, G.; Schreiber, R. Natural language processing to extract meaningful information from patient experience feedback. Appl. Clin. Inform. 2020, 11, 242–252. [Google Scholar] [CrossRef] [PubMed]
  49. Wallace, B.C.; Paul, M.J.; Sarkar, U.; Trikalinos, T.A.; Dredze, M. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J. Am. Med. Inform. Assoc. 2014, 21, 1098–1103. [Google Scholar] [CrossRef]
  50. Hao, H.; Zhang, K. The voice of Chinese health consumers: A text mining approach to web-based physician reviews. J. Med. Internet Res. 2016, 18, e108. [Google Scholar] [CrossRef]
  51. Shah, A.M.; Yan, X.; Shah, S.A.A.; Mamirkulova, G. Mining patient opinion to evaluate the service quality in healthcare: A deep-learning approach. J. Ambient Intell. Humaniz. Comput. 2020, 11, 2925–2942. [Google Scholar] [CrossRef]
  52. Hao, H. The development of online doctor reviews in China: An analysis of the largest online doctor review website in China. J. Med. Internet Res. 2015, 17, e134. [Google Scholar] [CrossRef]
  53. Jiang, S.; Street, R.L. Pathway linking internet health information seeking to better health: A moderated mediation study. Health Commun. 2017, 32, 1024–1031. [Google Scholar] [CrossRef]
  54. Syed, U.A.; Acevedo, D.; Narzikul, A.C.; Coomer, W.; Beredjiklian, P.K.; Abboud, J.A. Physician Rating Websites: An Analysis of Physician Evaluation and Physician Perception. Arch. Bone Jt. Surg. 2019, 7, 136–142. [Google Scholar]
  55. Russkikh, T.N.; Tinyakova, V.I.; Kukharets, D.V. Semantic analysis of patient feedback to provide decision support in the medical services market. Creat. Econ. 2024, 18, 455–474. (In Russian) [Google Scholar] [CrossRef]
  56. Kostrov, S.A.; Potapov, M.P.; Akkuratov, E.G. Personalizing communication with the patient: Large language models. Patient-Oriented Med. Pharm. 2025, 3, 68–79. (In Russian) [Google Scholar] [CrossRef]
  57. Kalabikhina, I.; Moshkin, V.; Kolotusha, A.; Kashin, M.; Klimenko, G.; Kazbekova, Z. Advancing Semantic Classification: A Comprehensive Examination of Machine Learning Techniques in Analyzing Russian-Language Patient Reviews. Mathematics 2024, 12, 566. [Google Scholar] [CrossRef]
  58. Kalabikhina, I.E.; Kolotusha, A.V. Database of negative reviews from patients of medical clinics in Russian cities with a population of over a million (based on infodoctor.ru for the period 2012–2023). Popul. Econ. 2025, 9, 117–126. [Google Scholar] [CrossRef]
  59. Kalabikhina, I.E.; Kolotusha, A.V.; Moshkin, V.S. Medical vs. Organizational Complaints: A Machine Learning Analysis Reveals Divergent Patterns in Patient Reviews Across Russian Cities. Healthcare 2025, 13, 2641. [Google Scholar] [CrossRef]
  60. Kalabikhina, I.E.; Kolotusha, A.V.; Moshkin, V.S. How Different Medical Practices Are Associated with Types of Patient Complaints in Russian Clinics. Healthcare 2026, 14, in press.
  61. Feizollah, A.; Lin, C.; O’Malley, L.; Thompson, W.; Listl, S.; Byrne, M. The Use of Natural Language Processing to Interpret Unstructured Patient Feedback on Health Services: Scoping Review. J. Med. Internet Res. 2025, 27, e72853. [Google Scholar] [CrossRef]
  62. Baines, R.; Regan de Bere, S.; Stevens, S.; Read, J.; Marshall, M.; Lalani, M.; Bryce, M.; Archer, J. The impact of patient feedback on the medical performance of qualified doctors: A systematic review. BMC Med. Educ. 2018, 18, 173. [Google Scholar] [CrossRef] [PubMed]
  63. Wong, E.; Mavondo, F.; Fisher, J. Patient feedback to improve quality of patient-centred care in public hospitals: A systematic review of the evidence. BMC Health Serv. Res. 2020, 20, 530. [Google Scholar] [CrossRef]
  64. Ruksakulpiwat, S.; Thongking, W.; Zhou, W.; Benjasirisan, C.; Phianhasin, L.; Schiltz, N.K.; Brahmbhatt, S. Machine learning-based patient classification system for adults with stroke: A systematic review. Chronic Illn. 2023, 19, 26–39. [Google Scholar] [CrossRef] [PubMed]
  65. Landis, J.R.; Koch, G.G. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977, 33, 363–374. [Google Scholar] [CrossRef] [PubMed]
Table 1. Confusion matrix of LLM classification and Gold Standard (n = 1500).
LLM \ Gold Standard | M | O | C | Σ
M | 476 | 28 | 67 | 571
O | 51 | 192 | 164 | 407
C | 220 | 61 | 241 | 522
Σ | 747 | 281 | 472 | 1500
Table 2. Comparison of machine learning architectures.
Model | Architecture Type | Input Features | Key Components/Hyperparameters
Stacking Ensemble | Ensemble (Stacking) | BERT embeddings + TF-IDF | Base learners: BERT (rubert-tiny2), TF-IDF + LR, LightGBM; meta-learner: Logistic Regression
BERT + TF-IDF Hybrid | Hybrid (Concatenation) | BERT embeddings + TF-IDF | BERT embeddings (last hidden layer, averaged over tokens) concatenated with TF-IDF features; classifier: Logistic Regression
BERT + Logistic Regression | Fine-tuned Transformer | BERT embeddings | rubert-tiny2 with classification head on [CLS] token; fine-tuned for 3-class classification
TF-IDF + Logistic Regression | Linear (Bag-of-Words) | TF-IDF | Unigrams + bigrams, max 10,000 features; L2 regularization
MLP | Feedforward Neural Network | TF-IDF | Two hidden layers (128 and 64 neurons), ReLU activation; trained on TF-IDF features
LightGBM | Gradient Boosting | TF-IDF | 100 trees, max depth 7
Random Forest | Ensemble (Bagging) | TF-IDF | 100 trees, Gini criterion
Logistic Regression | Linear (Baseline) | TF-IDF | L2 regularization
Table 3. Comparative analysis of classification models on the combined sample (n = 22,417).
Model | Standard Accuracy | Standard F1 | Practical Accuracy 1 | Practical F1 2 | Critical Errors (M ↔ O) 3 | CV 4 | Training Time (s)
Stacking Ensemble | 0.680 ± 0.007 | 0.679 ± 0.007 | 0.929 ± 0.005 | 0.101 ± 0.001 | 317 (1.4%) | 0.011 | 250.0
BERT + TF-IDF Hybrid | 0.676 ± 0.006 | 0.675 ± 0.006 | 0.927 ± 0.004 | 0.103 ± 0.000 | 326 (1.5%) | 0.009 | 227.0
BERT + Logistic Regression | 0.653 ± 0.009 | 0.652 ± 0.009 | 0.915 ± 0.006 | 0.120 ± 0.001 | 379 (1.7%) | 0.014 | 146.0
TF-IDF + Logistic Regression | 0.613 ± 0.008 | 0.610 ± 0.008 | 0.893 ± 0.006 | 0.150 ± 0.001 | 479 (2.1%) | 0.013 | 7.1
MLP | 0.498 ± 0.031 | 0.491 ± 0.030 | 0.868 ± 0.027 | 0.216 ± 0.007 | 591 (2.6%) | 0.062 | 89.3
LightGBM | 0.499 ± 0.009 | 0.499 ± 0.009 | 0.831 ± 0.007 | 0.244 ± 0.002 | 760 (3.4%) | 0.018 | 8.7
Random Forest | 0.450 ± 0.008 | 0.451 ± 0.008 | 0.822 ± 0.007 | 0.261 ± 0.002 | 798 (3.6%) | 0.017 | 12.4
Logistic Regression | 0.503 ± 0.007 | 0.495 ± 0.006 | 0.813 ± 0.005 | 0.249 ± 0.002 | 837 (3.7%) | 0.013 | 5.2
1 Practical Accuracy represents the proportion of reviews classified without critical M ↔ O errors. 2 Practical F1 is the F1 score for class C (combined reviews), indicating the model’s tendency to use this category; low values (<0.1) indicate conservative strategy; and high values (>0.2) indicate active strategy. 3 Critical errors are presented as absolute counts with percentage of total sample (22,417) in parentheses. All accuracy and F1 values are mean ± standard error from 5-fold stratified cross-validation. 4 CV (Coefficient of Variation) measures model stability across folds, with CV < 0.01 indicating excellent stability, 0.01–0.02 very good, 0.02–0.05 acceptable, and >0.05 low (unsuitable for production).
Table 4. Class distribution on 4 million Prodoctorov.ru reviews (2011–2025).
Class | Count | Percentage
M (Medical) | 1,999,467 | 46.1%
O (Organizational) | 911,901 | 21.0%
C (Combined) | 1,429,323 | 32.9%
Total | 4,340,691 | 100%
Table 5. Average ratings by class (4 million Prodoctorov.ru reviews, 2011–2025).
Class | Mean Rating | Median | Std. Deviation
M (Medical) | 4.75 | 5.0 | 1.00
O (Organizational) | 4.59 | 5.0 | 1.29
C (Combined) | 4.55 | 5.0 | 1.29
Table 6. Review length by class (characters) based on 4 million Prodoctorov.ru reviews (2011–2025).

| Class | Mean | Median | 25th Percentile | 75th Percentile | Maximum |
|---|---|---|---|---|---|
| C (Combined) | 802 | 692 | 491 | 1003 | 10,792 |
| M (Medical) | 513 | 423 | 294 | 627 | 10,492 |
| O (Organizational) | 320 | 278 | 154 | 415 | 10,024 |
Table 7. Dynamics of review count by year (Prodoctorov.ru, 2011–2025).

| Year | M (Medical) | O (Organizational) | C (Combined) | Total |
|---|---|---|---|---|
| 2011 | 176 | 182 | 107 | 465 |
| 2012 | 1096 | 1522 | 460 | 3078 |
| 2013 | 4819 | 6977 | 1890 | 13,686 |
| 2014 | 7649 | 10,737 | 3086 | 21,472 |
| 2015 | 18,164 | 23,807 | 6515 | 48,486 |
| 2016 | 58,335 | 53,216 | 24,449 | 136,000 |
| 2017 | 90,562 | 77,216 | 36,355 | 204,133 |
| 2018 | 132,596 | 99,371 | 59,103 | 291,070 |
| 2019 | 125,040 | 90,855 | 83,318 | 299,213 |
| 2020 | 94,251 | 61,139 | 88,022 | 243,412 |
| 2021 | 195,189 | 83,946 | 115,871 | 395,006 |
| 2022 | 248,051 | 74,135 | 179,769 | 501,955 |
| 2023 | 325,888 | 74,237 | 253,046 | 653,171 |
| 2024 | 434,350 | 90,748 | 361,453 | 886,551 |
| 2025 * | 251,098 | 57,959 | 208,262 | 517,319 |
* Note: Data for 2025 cover reviews from January to July 2025 (partial year).
Table 8. Dynamics of average ratings by year (Prodoctorov.ru, 2011–2025).

| Year | M (Medical) | O (Organizational) | C (Combined) |
|---|---|---|---|
| 2011 | 3.64 | 3.81 | 3.59 |
| 2012 | 3.94 | 4.03 | 3.30 |
| 2013 | 3.96 | 4.13 | 3.27 |
| 2014 | 4.05 | 4.15 | 3.40 |
| 2015 | 4.25 | 4.34 | 3.47 |
| 2016 | 4.43 | 4.40 | 4.02 |
| 2017 | 4.43 | 4.43 | 3.89 |
| 2018 | 4.48 | 4.47 | 3.97 |
| 2019 | 4.53 | 4.50 | 4.15 |
| 2020 | 4.68 | 4.60 | 4.34 |
| 2021 | 4.77 | 4.67 | 4.43 |
| 2022 | 4.81 | 4.72 | 4.60 |
| 2023 | 4.84 | 4.75 | 4.69 |
| 2024 | 4.87 | 4.77 | 4.74 |
| 2025 | 4.88 | 4.81 | 4.75 |
Table 9. Top 5 cities by number of reviews, with counts normalized per 1 million inhabitants (Prodoctorov.ru, 2011–2025). M = medical, O = organizational, and C = combined.

| City | Total Reviews | % of Total | Population (Avg 2019–2023), Millions | Reviews per 1 Million | M (%) | O (%) | C (%) |
|---|---|---|---|---|---|---|---|
| Moscow | 828,864 | 19.1% | 12.65 | 65,525 | 49.0% | 19.3% | 31.7% |
| Saint Petersburg | 588,178 | 13.6% | 5.38 | 109,344 | 44.7% | 18.1% | 37.2% |
| Krasnodar | 463,383 | 10.7% | 1.04 | 444,966 | 45.3% | 23.4% | 31.3% |
| Rostov-on-Don | 312,601 | 7.2% | 1.14 | 275,068 | 45.4% | 23.5% | 31.1% |
| Kazan | 231,391 | 5.3% | 1.26 | 183,895 | 44.6% | 22.0% | 33.4% |
Table 10. Top 10 specialties by number of reviews (Prodoctorov.ru, 4 million reviews, 2011–2025). M = medical, O = organizational, and C = combined.

| Specialty | Total Reviews | M (%) | O (%) | C (%) |
|---|---|---|---|---|
| Gynecologist | 321,205 | 37.8% | 30.7% | 31.5% |
| Ultrasound Physician | 224,481 | 37.7% | 27.5% | 34.8% |
| Dentist | 197,403 | 40.0% | 25.2% | 34.8% |
| Obstetrician | 174,825 | 38.1% | 32.7% | 29.2% |
| Dental Surgeon | 116,345 | 54.0% | 18.2% | 27.8% |
| ENT Specialist | 95,760 | 42.4% | 22.6% | 35.0% |
| Therapist | 95,333 | 41.5% | 24.7% | 33.8% |
| Dermatologist | 79,530 | 42.7% | 22.2% | 35.1% |
| Neurologist | 70,746 | 40.8% | 24.2% | 35.0% |
| Ophthalmologist | 66,324 | 41.6% | 24.0% | 34.4% |
Table 11. Rating distribution on 4 million Prodoctorov.ru reviews (2011–2025).

| Rating Category | Definition | Count | Percentage |
|---|---|---|---|
| High | 4–5 (inclusive) | 3,895,449 | 89.7% |
| Medium | >0 and <4 | 281,965 | 6.5% |
| Zero | 0 | 163,277 | 3.8% |
| Total | | 4,340,691 | 100% |
Table 12. Comparison with previous studies by the authors.

| Characteristic | Kalabikhina et al., 2024 [57] (Mathematics) | Kalabikhina et al., 2025 [59] (Healthcare) | Present Study |
|---|---|---|---|
| Data Volume | 60 thousand | 18.7 thousand (negative only) | 4.34 million |
| Classes | Binary (positive/negative) | Binary (M, O) | Three-class (M, O, and C) |
| Sentiment Coverage | All | Negative only | All |
| Dataset Creation | Expert annotation | Expert annotation | Hybrid (expert + LLM) |
| Primary Metrics | Standard accuracy | Standard accuracy | Practical Accuracy |
| Models Compared | 3 architectures (GRU, LSTM, and CNN) | Logistic regression | 8 architectures |
| Scaling | No | No | 4.34 million reviews |
| Business Effect | Not quantified | Not quantified | +156,000 correctly processed complaints |